{"id":1741,"date":"2026-02-15T06:51:09","date_gmt":"2026-02-15T06:51:09","guid":{"rendered":"https:\/\/sreschool.com\/blog\/partial-outage\/"},"modified":"2026-02-15T06:51:09","modified_gmt":"2026-02-15T06:51:09","slug":"partial-outage","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/partial-outage\/","title":{"rendered":"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition<\/h2>\n\n\n\n<p>A partial outage occurs when a subset of a service, region, or user group experiences degraded or unavailable functionality while other parts remain healthy. Analogy: a partial blackout where some city blocks lose power while others stay lit. Formally: a scoped availability failure affecting a non-global portion of the service surface area.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is Partial outage?<\/h2>\n\n\n\n<p>A partial outage is a scoped availability or degradation incident that does not fully take down a product or service globally. It affects a subset of users, features, regions, or infrastructure components. 
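The distinction can be made concrete with telemetry: slice request outcomes per region (or per feature, shard, or customer tier) and compare slices against a threshold. A minimal sketch, where the 5% threshold and the data shape are illustrative assumptions rather than any standard:

```python
# Classify an incident as healthy, partial, or global by slicing
# request outcomes per region. Threshold and data shape are assumed.
from collections import defaultdict

ERROR_RATE_THRESHOLD = 0.05  # a slice above 5% errors counts as unhealthy

def classify_outage(requests):
    """requests: iterable of (region, succeeded) tuples."""
    totals = defaultdict(lambda: [0, 0])  # region -> [errors, total]
    for region, succeeded in requests:
        totals[region][1] += 1
        if not succeeded:
            totals[region][0] += 1
    unhealthy = sorted(r for r, (errors, total) in totals.items()
                       if total and errors / total > ERROR_RATE_THRESHOLD)
    if not unhealthy:
        return "healthy"
    if len(unhealthy) == len(totals):
        return "global outage"
    return "partial outage: " + ", ".join(unhealthy)

sample = ([("us-east", True)] * 95 + [("us-east", False)] * 5
          + [("eu-west", True)] * 60 + [("eu-west", False)] * 40)
print(classify_outage(sample))  # -> partial outage: eu-west
```

The same check works for any slicing dimension; the point is that the aggregate error rate here (22.5%) obscures that eu-west alone is failing 40% of its requests while us-east sits at its 5% target.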
It is NOT a full outage, a planned maintenance event (unless unplanned), nor purely a performance slowdown that impacts all traffic equally.<\/p>\n\n\n\n<p>Key properties and constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Scope-limited: constrained to specific components, regions, or request types.<\/li>\n<li>Partial user impact: some users unaffected; others have degraded or no service.<\/li>\n<li>Heterogeneous symptoms: errors, timeouts, increased latency, incorrect responses.<\/li>\n<li>Transient or persistent: can be temporary (minutes) or persistent until remediated.<\/li>\n<li>Operationally ambiguous: hard to detect with coarse global metrics.<\/li>\n<li>Requires targeted mitigation strategies: routing, feature flags, retries, regional failover.<\/li>\n<\/ul>\n\n\n\n<p>Where it fits in modern cloud\/SRE workflows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incident classification and priority: often urgent due to user segmentation.<\/li>\n<li>SLO-level impact: consumes error budget for affected SLIs but not global SLIs.<\/li>\n<li>Runbooks: needs scoped runbooks and playbooks for isolation and rollback.<\/li>\n<li>Observability: demands high cardinality telemetry and localized alerts.<\/li>\n<li>Automation: benefits from intelligent routing, canary rollbacks, and auto-heal.<\/li>\n<\/ul>\n\n\n\n<p>Diagram description (text-only visualization):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Users from multiple regions -&gt; Edge\/load balancer -&gt; Traffic split by region and feature flag -&gt; Microservices cluster A and B in region X and Y -&gt; Dependencies include DB shard 1 and 2, third-party API -&gt; Partial outage manifests as errors from cluster B and DB shard 2, while cluster A and shard 1 remain healthy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Partial outage in one sentence<\/h3>\n\n\n\n<p>A partial outage is a constrained failure where only a portion of the service surface\u2014users, regions, features, or 
infrastructure\u2014fails or degrades while others continue to operate.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Partial outage vs related terms<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from Partial outage<\/th>\n<th>Common confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>Full outage<\/td>\n<td>Global service is unavailable<\/td>\n<td>People call any outage a full outage<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>Degradation<\/td>\n<td>Performance drop can be global or partial<\/td>\n<td>Degradation can be partial or total<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>Incident<\/td>\n<td>Any operational problem<\/td>\n<td>Not all incidents are outages<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Partial deployment failure<\/td>\n<td>Only new release causes issues for subset<\/td>\n<td>Incorrectly blamed as a full outage<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Regional outage<\/td>\n<td>Affects a geographic region only<\/td>\n<td>Partial outage may be multi-region subset<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>Feature flag failure<\/td>\n<td>Feature-specific user impact<\/td>\n<td>Can be mistaken for general outage<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Network partition<\/td>\n<td>Connectivity split between components<\/td>\n<td>Network partition can cause partial outage<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>Capacity exhaustion<\/td>\n<td>Resource limits hit in subset<\/td>\n<td>Often causes partial service unavailability<\/td>\n<\/tr>\n<tr>\n<td>T9<\/td>\n<td>Latency spike<\/td>\n<td>Short-lived increase in response time<\/td>\n<td>Latency may not cause request failures<\/td>\n<\/tr>\n<tr>\n<td>T10<\/td>\n<td>Dependency outage<\/td>\n<td>Third party fails, affecting subset<\/td>\n<td>Can cascade into partial outage<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does Partial outage matter?<\/h2>\n\n\n\n<p>Business impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Revenue: even partial outages can block high-value customers or regions, causing measurable revenue loss.<\/li>\n<li>Trust: repeated partial outages erode customer confidence and increase churn.<\/li>\n<li>Compliance and contracts: SLAs tied to availability for subset services can trigger credits or legal exposure.<\/li>\n<li>Opportunity cost: manual mitigation consumes senior engineers and delays feature delivery.<\/li>\n<\/ul>\n\n\n\n<p>Engineering impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incident burden: fragmented incidents increase mean time to repair (MTTR) without proper isolation.<\/li>\n<li>Velocity trade-offs: teams may pause deployments or add guardrails that slow release cadence.<\/li>\n<li>Technical debt exposure: hidden single points of failure become visible.<\/li>\n<li>Increased complexity: handling multiple partial outages across microservices calls for better automation and testing.<\/li>\n<\/ul>\n\n\n\n<p>SRE framing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs\/SLOs: partial outage requires scoped SLIs (per-region or per-feature) rather than only global SLIs.<\/li>\n<li>Error budgets: partial outages consume error budget for specific slices; the global budget may remain unused.<\/li>\n<li>Toil: manual routing changes or customer communications increase toil.<\/li>\n<li>On-call: needs targeted routing of incidents to owners who understand the affected slice.<\/li>\n<\/ul>\n\n\n\n<p>What breaks in production \u2014 realistic examples:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Database reachable for 80% of users, but one shard returns timeouts, causing errors for the remaining 20%.<\/li>\n<li>A new feature rolled out via phased deployment has a bug that crashes only mobile 
clients.<\/li>\n<li>CDN edge POP in a region misroutes TLS handshakes, causing regional failures for corporate customers.<\/li>\n<li>A third-party payment gateway rate-limits specific merchant IDs, leading to payment errors for a subset of transactions.<\/li>\n<li>Autoscaling misconfiguration causes backend pool depletion under specific request patterns.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Where is Partial outage used?<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How Partial outage appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge and CDN<\/td>\n<td>Some POPs fail or misroute<\/td>\n<td>Edge errors and regional RUM<\/td>\n<td>CDN logs, CDN config<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Network<\/td>\n<td>Packet loss in specific AZ<\/td>\n<td>Packet loss counters and traceroutes<\/td>\n<td>Network monitors, traceroute<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Service mesh<\/td>\n<td>One subset of pods drops requests<\/td>\n<td>Per-pod request success<\/td>\n<td>Mesh telemetry, mesh dashboards<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Application<\/td>\n<td>Feature endpoint errors for subset<\/td>\n<td>Error rate by user segment<\/td>\n<td>APM, user traces<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Data layer<\/td>\n<td>Shard or replica lag<\/td>\n<td>Replica lag metrics and errors<\/td>\n<td>DB monitoring, DB alerts<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>Serverless<\/td>\n<td>Cold-start spikes for a specific region<\/td>\n<td>Invocation errors and latency<\/td>\n<td>Serverless dashboards, logs<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>Kubernetes<\/td>\n<td>Node pool or daemonset issue<\/td>\n<td>Pod crashloop, node Ready<\/td>\n<td>K8s monitoring, kubectl<\/td>\n<\/tr>\n<tr>\n<td>L8<\/td>\n<td>CI\/CD<\/td>\n<td>Canary fails for a subset of users<\/td>\n<td>Deployment failure 
metrics<\/td>\n<td>CI job logs, rollout tool<\/td>\n<\/tr>\n<tr>\n<td>L9<\/td>\n<td>Security<\/td>\n<td>WAF rule blocks specific clients<\/td>\n<td>Block count and false positives<\/td>\n<td>WAF logs, SIEM<\/td>\n<\/tr>\n<tr>\n<td>L10<\/td>\n<td>Third-party API<\/td>\n<td>Vendor returns 403 for certain IDs<\/td>\n<td>Vendor error code distribution<\/td>\n<td>API gateway logs<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use Partial outage?<\/h2>\n\n\n\n<p>When it\u2019s necessary:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need to limit collateral damage during failures: route traffic away from unhealthy regions or features.<\/li>\n<li>You want to preserve availability for unaffected users while isolating a problematic slice.<\/li>\n<li>SLAs\/SLOs are defined per customer, region, or feature and you need targeted incident handling.<\/li>\n<\/ul>\n\n\n\n<p>When it\u2019s optional:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You can tolerate brief global degradation for a simpler remediation when impact is minimal.<\/li>\n<li>The affected slice represents negligible traffic or low-value users.<\/li>\n<\/ul>\n\n\n\n<p>When NOT to use \/ overuse it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Don\u2019t over-segment SLIs\/SLOs to the point of creating operational noise and indistinguishable alerts.<\/li>\n<li>Avoid using partial outages as a persistent workaround\u2014fix the root cause.<\/li>\n<li>Avoid wide-ranging feature flags for every small behavior; complexity increases risk.<\/li>\n<\/ul>\n\n\n\n<p>Decision checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If high-value users are affected AND global traffic is healthy -&gt; prioritize scoped remediation.<\/li>\n<li>If errors affect a majority of customers -&gt; treat as full 
outage and invoke broader playbook.<\/li>\n<li>If third-party dependency affects subset -&gt; consider retry\/backoff and degrade gracefully.<\/li>\n<li>If deployment caused issue in canary stage -&gt; rollback canary or disable feature flag.<\/li>\n<\/ul>\n\n\n\n<p>Maturity ladder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Beginner: basic regional metrics and manual routing.<\/li>\n<li>Intermediate: scoped SLIs, feature flags, automated traffic shifting.<\/li>\n<li>Advanced: automated guarded rollouts, AI-assisted anomaly detection, dynamic failover with no human intervention.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does Partial outage work?<\/h2>\n\n\n\n<p>Components and workflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ingress: load balancer, CDN, API gateway routes traffic by region, client type, or feature.<\/li>\n<li>Routing rules: service discovery and routing tables determine target clusters.<\/li>\n<li>Service instances: pods, VMs, or serverless functions serving traffic.<\/li>\n<li>Data stores: sharded or partitioned storage with per-shard health.<\/li>\n<li>Observability plane: high-cardinality traces, logs, metrics, RUM.<\/li>\n<li>Control plane: CI\/CD, feature flagging, orchestration for rollback and traffic control.<\/li>\n<li>Automation: policies for circuit breaking, rate limiting, and auto-remediation.<\/li>\n<\/ul>\n\n\n\n<p>Data flow and lifecycle:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Request enters via edge; routing evaluates rules.<\/li>\n<li>Request lands on an instance; instance consults local dependencies.<\/li>\n<li>Failure occurs in a subset (shard, region, feature).<\/li>\n<li>Observability emits high-cardinality telemetry scoped to the affected slice.<\/li>\n<li>Alerting triggers scoped incident responses and automated mitigations.<\/li>\n<li>Traffic reroutes or feature is disabled; validation checks restore service.<\/li>\n<\/ol>\n\n\n\n<p>Edge cases and 
failure modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cross-dependency cascades: one shard failure causes a fan-out of retries and overload.<\/li>\n<li>Split-brain routing: the control plane thinks traffic is safe to route while the data plane fails.<\/li>\n<li>Monitoring blind spots: a lack of per-slice SLIs leads to unnoticed partial outages.<\/li>\n<li>Automation misfires: automated rollback targets the wrong revision under noisy signals.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for Partial outage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Canary and Feature-flagged Rollouts: use flags and canaries to limit blast radius; best for new features.<\/li>\n<li>Regional Failover with Active-Standby: route traffic to a standby region when the primary region shows partial failures; best for regional disasters.<\/li>\n<li>Shard-aware Circuit Breakers: per-shard circuit breakers prevent cascades and contain failures to the affected shards; best for databases and cache layers.<\/li>\n<li>Service Mesh Traffic Shaping: leverage the mesh to route away from unhealthy pods or versions; best for microservices with high-cardinality routing.<\/li>\n<li>Edge-level Request Filtering: apply WAF or edge rules to block malformed traffic that causes subset failures; best for security-triggered incidents.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Shard failure<\/td>\n<td>Errors for subset keys<\/td>\n<td>Hardware or DB index issue<\/td>\n<td>Isolate shard and fail over<\/td>\n<td>Replica lag and error spike<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Canary regression<\/td>\n<td>New version fails for subset<\/td>\n<td>Bug in feature code path<\/td>\n<td>Roll back canary or disable flag<\/td>\n<td>Canary error rate high<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>Edge POP failure<\/td>\n<td>Regional TLS errors<\/td>\n<td>CDN POP config error<\/td>\n<td>Reroute to healthy POPs<\/td>\n<td>Edge 5xxs and RUM drops<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Mesh sidecar crash<\/td>\n<td>Pod subset fails requests<\/td>\n<td>Sidecar misconfiguration<\/td>\n<td>Restart sidecars or roll back<\/td>\n<td>Pod restarts and traces<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>Rate-limited vendor<\/td>\n<td>4xx from dependency for some IDs<\/td>\n<td>Vendor throttling per merchant<\/td>\n<td>Throttle back or switch vendor<\/td>\n<td>Upstream error codes<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Autoscaler misconfig<\/td>\n<td>Pod starvation under pattern<\/td>\n<td>Wrong metrics for scale<\/td>\n<td>Adjust autoscaling policy<\/td>\n<td>CPU queue length and OOMs<\/td>\n<\/tr>\n<tr>\n<td>F7<\/td>\n<td>Security rule false positive<\/td>\n<td>Legitimate traffic blocked<\/td>\n<td>Overaggressive WAF rule<\/td>\n<td>Patch or scope rule<\/td>\n<td>Block counts and client IDs<\/td>\n<\/tr>\n<tr>\n<td>F8<\/td>\n<td>Network micro-partition<\/td>\n<td>Inter-AZ timeouts<\/td>\n<td>Routing table or SDN bug<\/td>\n<td>Reconfigure routes or fail over<\/td>\n<td>Packet loss and TCP retries<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for Partial outage<\/h2>\n\n\n\n<p>Glossary of key terms. 
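The shard-aware circuit breaking mentioned above (pattern list and mitigation F1) can be sketched briefly. This is a minimal illustration, not a production implementation; the failure threshold, cooldown, and shard names are hypothetical:

```python
# Sketch of a shard-scoped circuit breaker: trip per shard so one bad
# shard cannot block requests for keys living on healthy shards.
import time

class ShardCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._failures = {}   # shard_id -> consecutive failure count
        self._opened_at = {}  # shard_id -> time the breaker opened

    def allow(self, shard_id):
        """Return True if a request to this shard may proceed."""
        opened = self._opened_at.get(shard_id)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown_s:
            # Half-open: allow a probe request through after cooldown.
            del self._opened_at[shard_id]
            return True
        return False

    def record(self, shard_id, success):
        """Record the outcome of a request against this shard."""
        if success:
            self._failures[shard_id] = 0
            return
        self._failures[shard_id] = self._failures.get(shard_id, 0) + 1
        if self._failures[shard_id] >= self.failure_threshold:
            self._opened_at[shard_id] = time.monotonic()

cb = ShardCircuitBreaker(failure_threshold=3)
for _ in range(3):
    cb.record("shard-2", success=False)  # shard-2 keeps timing out
print(cb.allow("shard-2"), cb.allow("shard-1"))  # -> False True
```

A production version would also bound the half-open probe rate and export the per-shard breaker state as a metric, so that breaker trips themselves become an observability signal.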
Each term line: term \u2014 definition \u2014 why it matters \u2014 common pitfall<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Availability \u2014 Measure of uptime for a service \u2014 Core of outage analysis \u2014 Confusing availability with latency<\/li>\n<li>Partial outage \u2014 Scoped availability failure \u2014 Primary topic \u2014 Misclassified as full outage<\/li>\n<li>SLI \u2014 Service Level Indicator metric \u2014 Basis for SLOs \u2014 Choosing wrong metric<\/li>\n<li>SLO \u2014 Service Level Objective target \u2014 Guides reliability effort \u2014 Setting unrealistic targets<\/li>\n<li>Error budget \u2014 Allowable errors before action \u2014 Enables paced engineering \u2014 Misusing to ignore issues<\/li>\n<li>Canary deployment \u2014 Small scale release for testing \u2014 Limits blast radius \u2014 Not representative sample<\/li>\n<li>Feature flag \u2014 Toggle to change behavior at runtime \u2014 Quick mitigation tool \u2014 Flag sprawl and complexity<\/li>\n<li>Circuit breaker \u2014 Prevents cascading failures \u2014 Protects dependencies \u2014 Incorrect thresholds cause blocking<\/li>\n<li>Rate limiting \u2014 Controls request rates \u2014 Prevents overload \u2014 Overly strict limits affect users<\/li>\n<li>Sharding \u2014 Data partitioning by key \u2014 Limits impact to a shard \u2014 Uneven shard distribution<\/li>\n<li>Replica lag \u2014 Delay in replicas catching up \u2014 Risk to consistency \u2014 Blind spots in monitoring<\/li>\n<li>Regional failover \u2014 Redirect traffic between regions \u2014 Resilience for outages \u2014 Data sovereignty issues<\/li>\n<li>Active-active \u2014 Multiple regions serve traffic simultaneously \u2014 Improves availability \u2014 Consistency challenges<\/li>\n<li>Active-passive \u2014 One region serves traffic, others standby \u2014 Simpler consistency \u2014 Longer failover time<\/li>\n<li>Observability \u2014 Telemetry that reveals system state \u2014 Essential to detect partial outages \u2014 
High-cardinality costs<\/li>\n<li>High cardinality \u2014 Many dimensions in metrics\/traces \u2014 Enables slicing by user\/region \u2014 Storage and cost implications<\/li>\n<li>RUM \u2014 Real User Monitoring \u2014 Client side performance insights \u2014 Privacy and sampling constraints<\/li>\n<li>APM \u2014 Application Performance Monitoring \u2014 Deep tracing of requests \u2014 Instrumentation overhead<\/li>\n<li>Log aggregation \u2014 Centralized logs for analysis \u2014 Debugging incidents \u2014 Log blowup and retention costs<\/li>\n<li>Metrics \u2014 Numeric system measures \u2014 Trend detection \u2014 Metric resolution vs alert noise<\/li>\n<li>Tracing \u2014 Distributed request flow tracking \u2014 Root cause analysis \u2014 Sampling may drop critical traces<\/li>\n<li>Error budget policy \u2014 Rules for handling error budgets \u2014 Operational discipline \u2014 Ignoring enforcement<\/li>\n<li>Incident response \u2014 Process to manage incidents \u2014 Lowers MTTR \u2014 Poor role definition causes confusion<\/li>\n<li>Runbook \u2014 Step-by-step remediation guidance \u2014 Guides responders \u2014 Stale runbooks are harmful<\/li>\n<li>Playbook \u2014 Higher-level incident actions \u2014 Situational flexibility \u2014 Overly generic playbooks fail to help<\/li>\n<li>Chaos engineering \u2014 Fault injection testing \u2014 Validates resilience \u2014 Unsafe experiments in prod<\/li>\n<li>Auto-heal \u2014 Automated corrective actions \u2014 Rapid recovery \u2014 Bad automation can worsen outage<\/li>\n<li>Service mesh \u2014 Layer for service-to-service routing \u2014 Fine-grained control \u2014 Complexity and sidecar overhead<\/li>\n<li>Edge POP \u2014 CDN point of presence \u2014 Affects regional users \u2014 POP misconfig causes broad impact<\/li>\n<li>SDN \u2014 Software-defined networking \u2014 Dynamic routing control \u2014 Misconfig risks partitioning<\/li>\n<li>Throttling \u2014 Intentional slowdown for fairness \u2014 Protects system \u2014 
Poorly tuned throttles block critical traffic<\/li>\n<li>Graceful degradation \u2014 Reduced functionality mode \u2014 Keeps system partially usable \u2014 Hard to design fallback UX<\/li>\n<li>Compensation logic \u2014 Business-level undo measures \u2014 Maintains invariants \u2014 Complex to implement<\/li>\n<li>Blue-green deploy \u2014 Deployment pattern with two environments \u2014 Fast rollback \u2014 Costly in infra duplication<\/li>\n<li>Rollback \u2014 Reverting to known good state \u2014 Quick mitigation \u2014 Data migrations complicate rollback<\/li>\n<li>Postmortem \u2014 Incident analysis document \u2014 Drives long-term improvement \u2014 Blameful culture prevents truth<\/li>\n<li>MTTR \u2014 Mean time to repair \u2014 Operational health indicator \u2014 Focusing only on MTTR misses prevention<\/li>\n<li>SLA \u2014 Service Level Agreement, a contractual commitment \u2014 Customer expectations \u2014 Legal and financial impact<\/li>\n<li>Synthetic monitoring \u2014 Simulated user checks \u2014 Early detection \u2014 Failing to align with real user paths<\/li>\n<li>Health check \u2014 Endpoint for readiness or liveness \u2014 Orchestrator uses it \u2014 Fragile or too lax checks<\/li>\n<li>Blast radius \u2014 Magnitude of impact from change \u2014 Drives design decisions \u2014 Poorly quantified<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure Partial outage (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>Per-region availability<\/td>\n<td>Region-specific uptime<\/td>\n<td>Successful requests divided by total per region<\/td>\n<td>99.9% per critical region<\/td>\n<td>Aggregation hides region 
failures<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>Feature-flag success rate<\/td>\n<td>Feature-specific health<\/td>\n<td>Success rate for flagged users<\/td>\n<td>99.5% for new flags<\/td>\n<td>Low sample size for canaries<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Shard error rate<\/td>\n<td>Errors scoped to data shard<\/td>\n<td>Errors per shard key group<\/td>\n<td>99.9% success per shard<\/td>\n<td>Hot keys skew results<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>Per-customer latency<\/td>\n<td>Latency for high-value customers<\/td>\n<td>95th percentile per customer<\/td>\n<td>200ms p95 for premium<\/td>\n<td>Cardinality explosion<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>Dependency error proportion<\/td>\n<td>Fraction of errors due to upstreams<\/td>\n<td>Count of upstream errors \/ total<\/td>\n<td>&lt;5% of total errors<\/td>\n<td>Mapping errors to vendor sometimes hard<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>Edge POP error rate<\/td>\n<td>POP-specific request failures<\/td>\n<td>Edge 5xx per POP<\/td>\n<td>99.7% per POP<\/td>\n<td>POP naming and discovery complexity<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>Canary error ratio<\/td>\n<td>New version vs baseline errors<\/td>\n<td>Error ratio new version divided by baseline<\/td>\n<td>&lt;1.5x baseline<\/td>\n<td>Baseline drift during peak traffic<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>Pod restart rate<\/td>\n<td>Stability of subset pods<\/td>\n<td>Restarts per pod per hour<\/td>\n<td>&lt;0.01 restarts\/hr<\/td>\n<td>Some restarts are normal due to updates<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>Replica lag ms<\/td>\n<td>Data consistency exposure<\/td>\n<td>Seconds lag on replicas<\/td>\n<td>&lt;200ms for critical data<\/td>\n<td>Asymmetric replication patterns<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Circuit breaker trips<\/td>\n<td>Dependency health signal<\/td>\n<td>Count of CB opens per time<\/td>\n<td>Minimal allowed per policy<\/td>\n<td>Excessive CBs mask real 
issues<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Best tools to measure Partial outage<\/h3>\n\n\n\n<p>Six common tool categories: observability platforms, APM, RUM, CDN\/edge monitoring, service mesh telemetry, and DB monitoring.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Observability platform (example: modern metrics\/tracing platform)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: Metrics, traces, logs, and high-cardinality slices.<\/li>\n<li>Best-fit environment: Cloud-native microservices and multi-region deployments.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument services with metrics and distributed tracing.<\/li>\n<li>Tag spans and metrics with region, shard, and feature flag.<\/li>\n<li>Configure high-cardinality indexing and sampling rules.<\/li>\n<li>Build dashboards per slice.<\/li>\n<li>Integrate with alerting and incident systems.<\/li>\n<li>Strengths:<\/li>\n<li>Centralized, correlated telemetry.<\/li>\n<li>Powerful slicing by dimensions.<\/li>\n<li>Limitations:<\/li>\n<li>Storage cost at high cardinality.<\/li>\n<li>Query performance vs volume trade-offs.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 APM<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: End-to-end traces and transaction errors for affected paths.<\/li>\n<li>Best-fit environment: Complex distributed transactions and backend services.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument key transactions.<\/li>\n<li>Enable distributed context propagation.<\/li>\n<li>Tag by customer ID and feature flag.<\/li>\n<li>Strengths:<\/li>\n<li>Fast root-cause pinpointing.<\/li>\n<li>Transaction-level visibility.<\/li>\n<li>Limitations:<\/li>\n<li>Sampling may drop low-volume failures.<\/li>\n<li>Agent overhead on 
hosts.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 RUM \/ Client telemetry<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: Client-side errors, performance, and regional user impact.<\/li>\n<li>Best-fit environment: Web and mobile frontends.<\/li>\n<li>Setup outline:<\/li>\n<li>Integrate lightweight SDK in client.<\/li>\n<li>Capture errors, timings, geo, and user ID if allowed.<\/li>\n<li>Configure sampling and privacy handling.<\/li>\n<li>Strengths:<\/li>\n<li>Real user experience visibility.<\/li>\n<li>Detects client-specific partial outages.<\/li>\n<li>Limitations:<\/li>\n<li>Data privacy and sampling.<\/li>\n<li>Ad blockers can reduce signal.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 CDN \/ Edge monitoring<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: POP-specific errors and routing issues.<\/li>\n<li>Best-fit environment: High traffic CDNs and global edge services.<\/li>\n<li>Setup outline:<\/li>\n<li>Enable per-POP logging and synthetic checks.<\/li>\n<li>Track TLS negotiation and origin health per POP.<\/li>\n<li>Strengths:<\/li>\n<li>Early detection of POP-specific failures.<\/li>\n<li>Controls at edge for mitigation.<\/li>\n<li>Limitations:<\/li>\n<li>Limited trace propagation past edge.<\/li>\n<li>Depends on CDN feature set.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Service mesh telemetry<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: Per-pod and per-route metrics and circuit breaker state.<\/li>\n<li>Best-fit environment: Kubernetes microservices.<\/li>\n<li>Setup outline:<\/li>\n<li>Install mesh sidecars.<\/li>\n<li>Enable telemetry and mTLS if needed.<\/li>\n<li>Create routing rules for canary isolation.<\/li>\n<li>Strengths:<\/li>\n<li>Fine-grained control and telemetry.<\/li>\n<li>Dynamic traffic management.<\/li>\n<li>Limitations:<\/li>\n<li>Sidecar overhead and operational 
complexity.<\/li>\n<li>Mesh upgrades can cause perturbations.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Database monitoring<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Partial outage: Replica lag, errors, slow queries per shard.<\/li>\n<li>Best-fit environment: Sharded, replicated data stores.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument replication metrics.<\/li>\n<li>Track per-shard query latency and error rates.<\/li>\n<li>Strengths:<\/li>\n<li>Direct insight into data layer health.<\/li>\n<li>Can trigger targeted failover.<\/li>\n<li>Limitations:<\/li>\n<li>Some DBs lack per-shard granularity.<\/li>\n<li>Monitoring agents may add load.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for Partial outage<\/h3>\n\n\n\n<p>Executive dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Overall global availability: top-line percentage and trend.<\/li>\n<li>Regions with notable deviation: per-region availability sparkline.<\/li>\n<li>Business impact: percentage of revenue affected.<\/li>\n<li>Error budget consumption: scoped and global.<\/li>\n<li>High-level remediation status: open incidents and mitigations.<\/li>\n<\/ul>\n\n\n\n<p>On-call dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Active alerts and affected slice details.<\/li>\n<li>Per-region and per-feature SLI panels.<\/li>\n<li>Recent deploys and canary status.<\/li>\n<li>Dependency error counts and circuit breaker state.<\/li>\n<li>Runbook link and recent incidents.<\/li>\n<\/ul>\n\n\n\n<p>Debug dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Raw traces for affected requests.<\/li>\n<li>Per-pod logs with tail capability.<\/li>\n<li>DB shard metrics and replica lag.<\/li>\n<li>Network path metrics and edge POP logs.<\/li>\n<li>Feature flag membership and rollout percentages.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page vs ticket: page when 
high-severity partial outage affects multiple high-value customers or core payments; create ticket for minor feature regressions affecting low-value slice.<\/li>\n<li>Burn-rate guidance: alert on sustained burn above planned error budget pace; for partial SLIs, consider proportional burn-rate thresholds.<\/li>\n<li>Noise reduction tactics:<\/li>\n<li>Dedupe alerts by grouping identical error signatures.<\/li>\n<li>Use alert suppression windows during planned rollouts.<\/li>\n<li>Aggregate low-volume errors into periodic tickets rather than paging.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites\n&#8211; Ownership defined for services and dependencies.\n&#8211; Instrumentation libraries and telemetry export configured.\n&#8211; Feature flagging system available.\n&#8211; CI\/CD with canary capability.\n&#8211; Incident communication channels and runbook templates.<\/p>\n\n\n\n<p>2) Instrumentation plan\n&#8211; Tag all telemetry with region, AZ, customer ID, feature flag, and deployment version.\n&#8211; Add health checks per shard and per feature.\n&#8211; Ensure traces propagate context across services.<\/p>\n\n\n\n<p>3) Data collection\n&#8211; Ingest metrics at high cardinality only where necessary.\n&#8211; Sample traces but keep full traces for high-value paths.\n&#8211; Use RUM to capture client-side errors.<\/p>\n\n\n\n<p>4) SLO design\n&#8211; Define SLIs per region, feature, or customer tier.\n&#8211; Set SLOs based on business impact and historical data.\n&#8211; Create error budget policies for scoped SLOs.<\/p>\n\n\n\n<p>5) Dashboards\n&#8211; Build executive, on-call, and debug dashboards as above.\n&#8211; Include links to runbooks and recent deploy metadata.<\/p>\n\n\n\n<p>6) Alerts &amp; routing\n&#8211; Create scoped alerts for per-region or per-feature SLO breaches.\n&#8211; Route alerts to owning teams and escalation 
paths.\n&#8211; Integrate with incident management and paging tools.<\/p>\n\n\n\n<p>7) Runbooks &amp; automation\n&#8211; Create runbooks for common partial outage patterns: shard failover, feature rollback, POP reroute.\n&#8211; Automate mitigation steps where safe: disable feature flag, shift traffic, scale nodes.<\/p>\n\n\n\n<p>8) Validation (load\/chaos\/game days)\n&#8211; Run targeted chaos experiments on shards, POPs, and node pools.\n&#8211; Validate runbook steps and automated responses in pre-production.\n&#8211; Conduct game days with on-call teams.<\/p>\n\n\n\n<p>9) Continuous improvement\n&#8211; Postmortems with action items and SLO adjustments.\n&#8211; Track runbook effectiveness and update after incidents.\n&#8211; Invest in automation to reduce manual steps.<\/p>\n\n\n\n<p>Checklists<\/p>\n\n\n\n<p>Pre-production checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Telemetry tags implemented and validated.<\/li>\n<li>Canary and rollback paths tested.<\/li>\n<li>Synthetic checks for critical slices in place.<\/li>\n<li>Runbook for partial outage scenarios available.<\/li>\n<\/ul>\n\n\n\n<p>Production readiness checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Scoped SLIs and SLOs defined and monitored.<\/li>\n<li>Alerts routed and tested to on-call.<\/li>\n<li>Feature flags can be toggled safely in prod.<\/li>\n<li>Failover and reroute automation configured.<\/li>\n<\/ul>\n\n\n\n<p>Incident checklist specific to Partial outage:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Identify affected slice: region, shard, feature, customer.<\/li>\n<li>Correlate deploys and recent config changes.<\/li>\n<li>Execute runbook: toggle feature flag or reroute.<\/li>\n<li>Notify stakeholders and open incident.<\/li>\n<li>Validate mitigation via metrics and RUM.<\/li>\n<li>Postmortem and action items.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of Partial outage<\/h2>\n\n\n\n<p>1) Global SaaS with multi-tenant DB shards\n&#8211; Context: Multi-tenant database with per-tenant shard mapping.\n&#8211; Problem: One shard experiences IO errors.\n&#8211; Why Partial outage helps: Isolate tenant impact and failover shard.\n&#8211; What to measure: Per-shard error rate and replica lag.\n&#8211; Typical tools: DB monitoring, feature flag, tenant router.<\/p>\n\n\n\n<p>2) Phased feature rollout for mobile clients\n&#8211; Context: Major UI feature rolled to 10% users.\n&#8211; Problem: Mobile clients crash for a subset of OS versions.\n&#8211; Why Partial outage helps: Limit blast radius and revert for affected users.\n&#8211; What to measure: Crash rate by OS and feature flag segment.\n&#8211; Typical tools: Crash reporting, feature flags, APM.<\/p>\n\n\n\n<p>3) CDN edge misconfiguration\n&#8211; Context: CDN POP misrouting requests to wrong origin.\n&#8211; Problem: Regional users receive errors.\n&#8211; Why Partial outage helps: Detect by per-POP telemetry and reroute.\n&#8211; What to measure: Edge errors per POP, TLS handshake failures.\n&#8211; Typical tools: CDN logs, synthetic monitoring, edge control plane.<\/p>\n\n\n\n<p>4) Vendor API rate limits for payment gateway\n&#8211; Context: Payment vendor throttles merchant accounts.\n&#8211; Problem: Payments fail for certain merchants.\n&#8211; Why Partial outage helps: Detect and route to backup vendor for those IDs.\n&#8211; What to measure: Upstream 4xx per merchant and success ratio.\n&#8211; Typical tools: API gateway, vendor monitoring, circuit breaker.<\/p>\n\n\n\n<p>5) Kubernetes node pool AMI bug\n&#8211; Context: New AMI causes kubelet crash on certain instance types.\n&#8211; Problem: Node pool loses a subset of pods.\n&#8211; Why Partial outage helps: Evacuate nodes and shift traffic to healthy node pools.\n&#8211; What to measure: Node Ready status, pod restarts.\n&#8211; Typical tools: K8s monitoring, deployment automation.<\/p>\n\n\n\n<p>6) 
Serverless cold-start region issue\n&#8211; Context: One cloud region exhibits high cold-start latency.\n&#8211; Problem: Lambda functions slow for specific region.\n&#8211; Why Partial outage helps: Route traffic to warm regional replicas.\n&#8211; What to measure: Invocation latency and error rates per region.\n&#8211; Typical tools: Serverless metrics, edge routing.<\/p>\n\n\n\n<p>7) CI\/CD rollout causing regression\n&#8211; Context: Gradual deployment to 20% of users.\n&#8211; Problem: New code causes downstream errors under specific query pattern.\n&#8211; Why Partial outage helps: Stop rollout and revert for affected bucket.\n&#8211; What to measure: Canary vs baseline error rate.\n&#8211; Typical tools: CI\/CD canary, APM.<\/p>\n\n\n\n<p>8) WAF rule misfire blocking corporate clients\n&#8211; Context: New WAF rule intended to stop bots.\n&#8211; Problem: Rule blocks customers from specific IP ranges.\n&#8211; Why Partial outage helps: Disable or scope rule to reduce impact.\n&#8211; What to measure: WAF block counts per IP range and client signature.\n&#8211; Typical tools: WAF dashboard, SIEM.<\/p>\n\n\n\n<p>9) Multi-region cache inconsistency\n&#8211; Context: CDN or cache invalidation only partially propagated.\n&#8211; Problem: Some regions see stale or inconsistent responses.\n&#8211; Why Partial outage helps: Fallback to origin for affected regions.\n&#8211; What to measure: Cache hit ratio by region and error rates.\n&#8211; Typical tools: Cache metrics, synthetic checks.<\/p>\n\n\n\n<p>10) Internal API breaking subset of services\n&#8211; Context: Internal API contract changed without versioning.\n&#8211; Problem: Only services using new path fail.\n&#8211; Why Partial outage helps: Re-introduce old contract or route affected services to fallback.\n&#8211; What to measure: 4xx\/5xx by client service and API version used.\n&#8211; Typical tools: API gateway, service mesh traces.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes node pool AMI regression (Kubernetes)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> A new node image is deployed to a Kubernetes node pool in one region.<br\/>\n<strong>Goal:<\/strong> Restore service for workloads affected by node crashes without impacting unaffected regions.<br\/>\n<strong>Why Partial outage matters here:<\/strong> Only workloads in that node pool are impacted; preserving other regions maintains availability.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Ingress -&gt; regional clusters -&gt; node pools with nodes running problematic AMI -&gt; pods scheduled to those nodes -&gt; monitoring collects pod restarts and node readiness.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Alert triggers on increased pod restarts for the node pool.  <\/li>\n<li>On-call inspects node readiness and AMI version tag.  <\/li>\n<li>Drain affected nodes using cordon and drain.  <\/li>\n<li>Scale up a healthy node pool or spin up instances with the previous AMI.  <\/li>\n<li>Roll back the node image in the infrastructure pipeline.  
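The node-selection logic behind steps 3 and 4 can be sketched as follows; the inventory, node names, and AMI IDs are illustrative, not from a real cluster:

```python
# Select only the nodes running the suspect image for cordon/drain,
# leaving healthy nodes untouched. All names and IDs here are made up.
nodes = [
    {"name": "node-a1", "ami": "ami-good-123"},
    {"name": "node-b1", "ami": "ami-bad-456"},
    {"name": "node-b2", "ami": "ami-bad-456"},
]

def nodes_to_drain(inventory, bad_ami):
    """Return the names of nodes scoped to the faulty image only."""
    return [n["name"] for n in inventory if n["ami"] == bad_ami]

# A real implementation would pass each name to `kubectl cordon` and
# `kubectl drain` rather than printing it.
for name in nodes_to_drain(nodes, "ami-bad-456"):
    print("would cordon and drain", name)
```

Scoping the drain list to the bad image keeps the blast radius of the mitigation itself as small as the outage.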
<\/li>\n<li>Validate via per-cluster SLI and pod restart metrics.<br\/>\n<strong>What to measure:<\/strong> Node Ready percentage, pod restart rate, per-cluster availability.<br\/>\n<strong>Tools to use and why:<\/strong> Kubernetes monitoring for node\/pod metrics, infra automation for AMI rollbacks, cluster autoscaler.<br\/>\n<strong>Common pitfalls:<\/strong> Assuming node issue is due to app container rather than AMI; forgetting to re-enable autoscaler.<br\/>\n<strong>Validation:<\/strong> Run synthetic traffic to cluster region and verify error rates normalized.<br\/>\n<strong>Outcome:<\/strong> Affected pod scheduling moves to healthy nodes; incident resolved with minimal customer impact.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless cold-start regional degradation (serverless\/managed-PaaS)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> A managed serverless function experiences elevated cold start latency in region B during peak hours.<br\/>\n<strong>Goal:<\/strong> Reduce latency for affected users and maintain throughput while root cause is identified.<br\/>\n<strong>Why Partial outage matters here:<\/strong> Only region B users see degraded performance; global service remains available.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Edge routing -&gt; region-aware routing rules -&gt; serverless functions in multiple regions -&gt; telemetry: invocation latency, error counts.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Detect spike in invocation latency for region B via RUM and serverless metrics.  <\/li>\n<li>Shift traffic from region B to region A for new sessions using edge routing rules.  <\/li>\n<li>Spin up warmers or provision concurrency in region B for critical functions.  <\/li>\n<li>Investigate provider logs and resource limits for region B.  <\/li>\n<li>Apply long-term fix or request provider support.  
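The traffic-shift decision in step 2 can be sketched as a pure weight calculation; the region names and the 200 ms latency threshold are assumptions for illustration:

```python
# Shift routing weight away from regions breaching a latency SLO,
# renormalizing across the healthy ones. Values are illustrative.
def reweight(p95_latency_ms, slo_ms=200):
    """Zero the weight of regions over the SLO and renormalize the rest."""
    healthy = [r for r, lat in p95_latency_ms.items() if lat <= slo_ms]
    if not healthy:
        # Every region degraded: keep even weights rather than black-holing traffic.
        return {r: 1 / len(p95_latency_ms) for r in p95_latency_ms}
    return {r: (1 / len(healthy) if r in healthy else 0.0) for r in p95_latency_ms}

weights = reweight({"region-a": 120, "region-b": 900})
print(weights)  # all new-session traffic goes to region-a
```

The all-degraded fallback matters: a naive version that always zeroes breaching regions can black-hole 100% of traffic during a global slowdown.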
<\/li>\n<li>Ramp traffic back gradually once validated.<br\/>\n<strong>What to measure:<\/strong> p95 invocation latency per region, error rate, warm concurrency count.<br\/>\n<strong>Tools to use and why:<\/strong> Edge routing controls, serverless provider metrics, RUM for real user effect.<br\/>\n<strong>Common pitfalls:<\/strong> Moving traffic without considering data locality leading to consistency problems.<br\/>\n<strong>Validation:<\/strong> Compare error and latency metrics after partial reroute.<br\/>\n<strong>Outcome:<\/strong> Region B load reduced; user experience maintained while root cause addressed.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Payment vendor throttling affecting subset merchants (incident-response\/postmortem)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> A vendor starts returning 429s for certain merchant IDs during a peak sale.<br\/>\n<strong>Goal:<\/strong> Keep merchant transactions flowing for critical accounts and remediate vendor impact.<br\/>\n<strong>Why Partial outage matters here:<\/strong> Only specific merchants affected; broad outage avoided.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Payment request -&gt; routing with merchant ID -&gt; payment gateway -&gt; vendor API; telemetry includes upstream response codes per merchant.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Alert on increased 4xx errors for payment transactions; identify merchant IDs.  <\/li>\n<li>Engage incident commander and route critical merchant traffic to an alternative vendor or retry policy.  <\/li>\n<li>Apply graceful degradation for noncritical merchants with retry\/backoff.  <\/li>\n<li>Open vendor support ticket and share error traces.  
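The per-merchant fallback decision from step 2 can be sketched as a small stateful router; the merchant IDs, the 20% throttle threshold, and the vendor names are illustrative assumptions:

```python
# Route a merchant to the backup vendor once its recent 429 ratio
# crosses a threshold, keeping unaffected merchants on the primary.
from collections import deque

class MerchantRouter:
    def __init__(self, window=50, threshold=0.2):
        self.window = window
        self.threshold = threshold
        self.history = {}  # merchant_id -> deque of bools (True = throttled)

    def record(self, merchant_id, was_throttled):
        q = self.history.setdefault(merchant_id, deque(maxlen=self.window))
        q.append(was_throttled)

    def vendor_for(self, merchant_id):
        q = self.history.get(merchant_id)
        if not q:
            return "primary"
        return "backup" if sum(q) / len(q) > self.threshold else "primary"

router = MerchantRouter()
for _ in range(10):
    router.record("merchant-42", True)   # vendor returned 429
router.record("merchant-7", False)       # healthy merchant
print(router.vendor_for("merchant-42"), router.vendor_for("merchant-7"))
```

Because the decision is keyed by merchant, only the throttled slice moves to the backup vendor.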
<\/li>\n<li>Postmortem: analyze why merchant-specific throttling happened and add vendor isolation patterns.<br\/>\n<strong>What to measure:<\/strong> Payment success rate per merchant, vendor error codes, retry success.<br\/>\n<strong>Tools to use and why:<\/strong> API gateway with per-merchant metrics, payment routing rules, incident tracking.<br\/>\n<strong>Common pitfalls:<\/strong> Global retry loops causing vendor overload; failing to prioritize VIP merchants.<br\/>\n<strong>Validation:<\/strong> Confirm backup vendor handles critical transactions and metrics return to baseline.<br\/>\n<strong>Outcome:<\/strong> Critical merchants processed; vendor mitigated; long-term vendor routing added.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost-driven cache eviction causing users to see stale content (cost\/performance trade-off)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> To reduce costs, cache TTLs were shortened for a region, but some high-value traffic experienced cache misses and higher latency.<br\/>\n<strong>Goal:<\/strong> Balance cost and performance for premium customers by selectively increasing TTLs.<br\/>\n<strong>Why Partial outage matters here:<\/strong> Only certain user segments experienced degraded performance due to cache policy change.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Edge caching -&gt; cache rules by path and user tier -&gt; origin servers -&gt; metrics show cache hit ratio per user tier.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Detect increased origin latency correlated with specific user tier.  <\/li>\n<li>Update CDN rules to extend TTL for premium user paths using header-based rules.  <\/li>\n<li>Monitor hit ratio and origin load.  <\/li>\n<li>Implement cost allocation tracking for cache rules to show ROI.  
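The header-based TTL selection from step 2 can be sketched as a simple lookup; the tier names and TTL values are illustrative assumptions, not real CDN rule syntax:

```python
# Choose a cache TTL (seconds) from a user-tier request header.
# Tier names and TTL values are made up for illustration.
TTL_BY_TIER = {"premium": 3600, "standard": 300}

def cache_ttl(headers, default=300):
    """Premium paths get a longer TTL; unknown tiers fall back to default."""
    tier = headers.get("x-user-tier", "standard")
    return TTL_BY_TIER.get(tier, default)

print(cache_ttl({"x-user-tier": "premium"}))  # longer TTL for premium traffic
```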
<\/li>\n<li>Consider a tiered caching strategy or regional cache sizing adjustments.<br\/>\n<strong>What to measure:<\/strong> Cache hit ratio per user tier, origin latency, cost per GB transferred.<br\/>\n<strong>Tools to use and why:<\/strong> CDN config and analytics, RUM, cost monitoring.<br\/>\n<strong>Common pitfalls:<\/strong> Applying global TTL changes; forgetting to test edge invalidation behaviors.<br\/>\n<strong>Validation:<\/strong> Premium user latency returns to SLA while the cost delta is analyzed.<br\/>\n<strong>Outcome:<\/strong> Premium users restored to expected performance with controlled cost increase.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<p>Each entry below follows the pattern symptom -&gt; root cause -&gt; fix; observability pitfalls are called out as well.<\/p>\n\n\n\n<p>1) Symptom: Global alert but only one region has errors -&gt; Root cause: Aggregated metric hides slice -&gt; Fix: Implement per-region SLIs and dashboards.<br\/>\n2) Symptom: Repeated partial outages after deploys -&gt; Root cause: Missing canary testing -&gt; Fix: Enforce canary gating and automated rollback.<br\/>\n3) Symptom: On-call blames CDN for all issues -&gt; Root cause: Lack of end-to-end traces -&gt; Fix: Correlate edge logs with backend traces.<br\/>\n4) Symptom: High latency only for premium customers -&gt; Root cause: Tiered routing misconfiguration -&gt; Fix: Validate routing rules and telemetry by customer tier.<br\/>\n5) Symptom: Few sampled traces for failing requests -&gt; Root cause: Tracing sampling drops rare errors -&gt; Fix: Implement error-based trace retention.<br\/>\n6) Symptom: Alerts flood during partial outage -&gt; Root cause: Per-instance alerts not aggregated -&gt; Fix: Group alerts and use dedupe rules.<br\/>\n7) Symptom: Runbook steps fail or are outdated -&gt; Root cause: Stale documentation -&gt; Fix: Review and test runbooks periodically.<br\/>\n8) 
Symptom: Automatic rollback re-applies bad config -&gt; Root cause: CI\/CD misconfiguration -&gt; Fix: Add artifact immutability and rollback checks.<br\/>\n9) Symptom: False positives in WAF cause blocks -&gt; Root cause: Overly broad rules -&gt; Fix: Scope WAF rules and add exclusions.<br\/>\n10) Symptom: Dependency errors not traced to vendor -&gt; Root cause: Missing upstream tagging -&gt; Fix: Tag upstream calls and log vendor IDs.<br\/>\n11) Symptom: Partial outage persists unnoticed -&gt; Root cause: No RUM or client telemetry -&gt; Fix: Add RUM and align synthetic checks to real paths.<br\/>\n12) Symptom: Performance degrades under specific key patterns -&gt; Root cause: Hot key on shard -&gt; Fix: Repartition or introduce caching for hot key.<br\/>\n13) Symptom: Circuit breakers trip too often -&gt; Root cause: Tight thresholds or noisy metrics -&gt; Fix: Tune CB thresholds and hysteresis.<br\/>\n14) Symptom: Pager fatigue from low-impact slices -&gt; Root cause: Overly aggressive paging rules -&gt; Fix: Set tiered paging and ticketing for low-impact slices.<br\/>\n15) Symptom: Metrics explode in cardinality -&gt; Root cause: Tagging everything without plan -&gt; Fix: Define cardinality policy and apply rollup metrics.<br\/>\n16) Symptom: Postmortem lacks action items -&gt; Root cause: Blameful culture or poor facilitation -&gt; Fix: Use blameless postmortems with clear owners.<br\/>\n17) Symptom: Automation escalates outages -&gt; Root cause: Unsafe auto-heal scripts -&gt; Fix: Add safety checks and canary for automation.<br\/>\n18) Symptom: Partial outage due to data migration -&gt; Root cause: Migration not backward compatible -&gt; Fix: Use backward-compatible migrations and feature gates.<br\/>\n19) Symptom: Observability gaps for low-volume customers -&gt; Root cause: Sampling discards minority traffic -&gt; Fix: Implement retention for flagged customer traces.<br\/>\n20) Symptom: Cost spike after per-slice telemetry -&gt; Root cause: Uncontrolled 
high-cardinality metrics -&gt; Fix: Apply cardinality limits and targeted indexing.<\/p>\n\n\n\n<p>Observability-specific pitfalls (all covered in the list above):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Dropped traces due to sampling.<\/li>\n<li>Cardinality explosion from unbounded tags.<\/li>\n<li>Aggregated metrics hiding slices.<\/li>\n<li>Lack of client-side telemetry causing blind spots.<\/li>\n<li>Metrics retention limits causing loss of historical slice context.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<p>Ownership and on-call:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Define clear ownership for services and dependent components.<\/li>\n<li>Map ownership to customer tiers and regions.<\/li>\n<li>On-call rotations should include subject matter experts for regional and feature slices.<\/li>\n<\/ul>\n\n\n\n<p>Runbooks vs playbooks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runbooks: prescriptive steps for frequent, deterministic mitigations.<\/li>\n<li>Playbooks: higher-level strategies for ambiguous or complex incidents.<\/li>\n<li>Keep both versioned and easily accessible.<\/li>\n<\/ul>\n\n\n\n<p>Safe deployments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Canary releases with automated verification gates.<\/li>\n<li>Blue-green for stateful changes where rollback is expensive.<\/li>\n<li>Feature flags for business logic and dark launches.<\/li>\n<\/ul>\n\n\n\n<p>Toil reduction and automation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate common mitigation tasks: flag toggles, traffic shifts, shard failover.<\/li>\n<li>Record automation outcomes and ensure revert options.<\/li>\n<li>Measure toil reduction and adjust accordingly.<\/li>\n<\/ul>\n\n\n\n<p>Security basics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure observability pipelines encrypt and redact PII.<\/li>\n<li>Limit feature flag control to authorized 
engineers.<\/li>\n<li>Harden edge controls and guardrails to prevent accidental global blocks.<\/li>\n<\/ul>\n\n\n\n<p>Weekly\/monthly routines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly: review per-slice SLI trends and failed alerts.<\/li>\n<li>Monthly: validate runbooks with tabletop exercises.<\/li>\n<li>Quarterly: game day focusing on partial outage scenarios and automation stress tests.<\/li>\n<\/ul>\n\n\n\n<p>Postmortem review items specific to Partial outage:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Affected slice identification latency.<\/li>\n<li>Accuracy of SLI segmentation.<\/li>\n<li>Effectiveness of runbook and automation.<\/li>\n<li>Action items to reduce similar future partial outages.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for Partial outage<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Observability platform<\/td>\n<td>Centralize metrics, logs, and traces<\/td>\n<td>APM, CDN, mesh, DB<\/td>\n<td>Use for high-cardinality slices<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>APM<\/td>\n<td>End-to-end traces and trace sampling<\/td>\n<td>App frameworks, DB<\/td>\n<td>Helps with transaction tracing<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>RUM<\/td>\n<td>Client-side performance telemetry<\/td>\n<td>CDN, frontend<\/td>\n<td>Detects client-specific partial outages<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>CDN \/ Edge<\/td>\n<td>Global routing and POP controls<\/td>\n<td>Edge logs, origin<\/td>\n<td>Useful for per-POP mitigation<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>Service mesh<\/td>\n<td>Per-route traffic control<\/td>\n<td>K8s, metrics, APM<\/td>\n<td>Enables fine-grained routing<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>Feature flagging<\/td>\n<td>Toggle features for 
segments<\/td>\n<td>CI\/CD, APM, logs<\/td>\n<td>Rapid mitigation for feature regressions<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>CI\/CD<\/td>\n<td>Canary and rollback automation<\/td>\n<td>VCS, observability<\/td>\n<td>Gate deployments by canary SLIs<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>DB monitoring<\/td>\n<td>Shard and replica observability<\/td>\n<td>Orchestration, APM<\/td>\n<td>Key for data-layer partial outages<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>Incident mgmt<\/td>\n<td>Pager routing and timeline<\/td>\n<td>ChatOps, observability<\/td>\n<td>Ties alerts to responders<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>WAF \/ Sec<\/td>\n<td>Block malicious traffic at edge<\/td>\n<td>CDN, SIEM<\/td>\n<td>Can cause partial outages if misconfigured<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What exactly counts as a partial outage?<\/h3>\n\n\n\n<p>A partial outage is a scoped failure affecting a subset of the service surface such as a region, feature, shard, or customer segment, while other parts remain functional.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How is it different from degradation?<\/h3>\n\n\n\n<p>Degradation often refers to performance loss across a broader surface. A partial outage implies availability or correctness loss in a subset, though degradation can be partial too.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How should SLIs be structured for partial outages?<\/h3>\n\n\n\n<p>Use scoped SLIs by region, customer tier, or feature flag. 
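<\/p>\n\n\n\n<p>As a minimal sketch (the request counts, slice keys, and 99.5% target are illustrative), a scoped availability check that flags only the breaching slices might look like:<\/p>\n\n\n\n

```python
# Compute an availability SLI per (region, feature) slice and flag
# slices that breach the SLO. Counts and the target are illustrative.
requests = {
    ("us-east", "checkout"): {"total": 100000, "errors": 100},
    ("eu-west", "checkout"): {"total": 5000, "errors": 150},
}
SLO = 0.995

def breached_slices(counts, slo):
    """Return (slice, SLI) pairs for slices below the availability target."""
    out = []
    for slice_key, c in counts.items():
        sli = 1 - c["errors"] / c["total"]
        if sli < slo:
            out.append((slice_key, round(sli, 4)))
    return out

print(breached_slices(requests, SLO))  # only eu-west/checkout is flagged
```

\n\n\n\n<p>Note that a global SLI over the pooled counts would stay above target here, which is exactly why the scoped view matters.<\/p>\n\n\n\n<p>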
Complement global SLIs with targeted ones for critical slices.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When should I page engineers for a partial outage?<\/h3>\n\n\n\n<p>Page when the partial outage impacts high-value customers, core revenue paths, or critical infrastructure components. Otherwise use a ticketing escalation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to avoid too many alerts from partial slices?<\/h3>\n\n\n\n<p>Group alerts, reduce cardinality in alert rules, and use aggregation thresholds. Prioritize alerts by business impact.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does feature flagging cause complexity?<\/h3>\n\n\n\n<p>Yes; feature flags are powerful mitigation tools but cause complexity if overused. Track flags and have flag governance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can automation make partial outages worse?<\/h3>\n\n\n\n<p>Yes; unsafe automation or faulty auto-heal scripts can worsen situations. Implement safety checks and canary for automation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I measure customer impact during a partial outage?<\/h3>\n\n\n\n<p>Use per-customer SLIs, RUM, and revenue attribution to estimate affected revenue and user sessions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What telemetry is most important?<\/h3>\n\n\n\n<p>High-cardinality metrics for region, customer ID, and feature; traces for failing transactions; logs for contextual debugging.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to test partial outage scenarios?<\/h3>\n\n\n\n<p>Run targeted chaos experiments, game days, and rehearsed runbook drills in staging and safe production windows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How many SLOs should we have?<\/h3>\n\n\n\n<p>Varies \/ depends. 
Start with global and a few critical per-slice SLOs for regions, features, and premium customers, then expand based on risk and capacity.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Who owns partial outage mitigation?<\/h3>\n\n\n\n<p>The team owning the affected slice owns mitigation, supported by platform and infra teams for underlying failure domains.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Are partial outages covered by SLAs?<\/h3>\n\n\n\n<p>They can be if SLAs are scoped; often SLAs are global, so check the specific contract wording.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to prevent shard hot keys causing partial outage?<\/h3>\n\n\n\n<p>Monitor key distributions, add cache layers, and redesign partitions to spread load or add dedicated hot-key handling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is it expensive to monitor at high cardinality?<\/h3>\n\n\n\n<p>Yes; high-cardinality telemetry increases cost. Use targeted indexing, sampling, and rollups to control costs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to perform postmortem for partial outage?<\/h3>\n\n\n\n<p>Document the timeline, affected slice, detection latency, mitigations applied, runbook effectiveness, and actionable remediation items.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can partial outages be fully automated away?<\/h3>\n\n\n\n<p>Varies \/ depends. Some mitigations are automatable, but others require human judgment, especially for complex multi-dependency failures.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Partial outages are a common and impactful class of incidents in cloud-native systems. 
They demand scoped SLIs, high-cardinality telemetry, targeted runbooks, and automation where safe. Prioritize identifying affected slices quickly and isolating the failure to preserve availability for unaffected users.<\/p>\n\n\n\n<p>Next 7 days plan (5 bullets):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Inventory current SLIs and tag telemetry by region, shard, and feature.<\/li>\n<li>Day 2: Implement or validate feature flagging and canary controls for critical services.<\/li>\n<li>Day 3: Build per-slice dashboards for executive and on-call views.<\/li>\n<li>Day 4: Create runbooks for top 3 partial outage failure modes and test them.<\/li>\n<li>Day 5\u20137: Run a targeted game day focusing on shard and region failure scenarios and update postmortem actions.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Partial outage Keyword Cluster (SEO)<\/h2>\n\n\n\n<p>Primary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Partial outage<\/li>\n<li>Partial service outage<\/li>\n<li>Scoped outage<\/li>\n<li>Regional outage<\/li>\n<li>Shard outage<\/li>\n<li>Partial degradation<\/li>\n<li>Partial availability failure<\/li>\n<li>Partial downtime<\/li>\n<\/ul>\n\n\n\n<p>Secondary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Partial outage detection<\/li>\n<li>Partial outage mitigation<\/li>\n<li>Partial outage monitoring<\/li>\n<li>Per-region SLI<\/li>\n<li>Per-feature SLO<\/li>\n<li>High-cardinality telemetry<\/li>\n<li>Feature flag rollback<\/li>\n<li>Canary deployment failures<\/li>\n<\/ul>\n\n\n\n<p>Long-tail questions<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is a partial outage in cloud computing<\/li>\n<li>How to detect partial outage in Kubernetes<\/li>\n<li>How to measure partial outage for SaaS platforms<\/li>\n<li>How to create runbooks for partial outage scenarios<\/li>\n<li>Partial outage vs full outage difference explained<\/li>\n<li>How to route traffic during a 
partial outage<\/li>\n<li>How to implement per-customer SLIs for partial outages<\/li>\n<li>How to automate partial outage mitigation with feature flags<\/li>\n<li>Best practices for partial outage incident response<\/li>\n<li>How to use RUM to identify partial outages<\/li>\n<li>How to reduce blast radius of deployments causing partial outages<\/li>\n<li>How to test partial outage scenarios in production safely<\/li>\n<li>How to set SLOs for regional partial outages<\/li>\n<li>How to handle vendor-induced partial outages<\/li>\n<li>How to tune circuit breakers to prevent partial outages<\/li>\n<li>How to debug shard failures causing partial outage<\/li>\n<li>How to detect edge POP partial outages quickly<\/li>\n<li>How to design dashboards for partial outage detection<\/li>\n<li>How to prioritize paging for partial outages<\/li>\n<li>How to avoid alert fatigue from partial slice alerts<\/li>\n<\/ul>\n\n\n\n<p>Related terminology<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLI and SLO<\/li>\n<li>Error budget<\/li>\n<li>Canary rollout<\/li>\n<li>Feature flagging<\/li>\n<li>Service mesh<\/li>\n<li>Replica lag<\/li>\n<li>Circuit breaker<\/li>\n<li>RUM and APM<\/li>\n<li>Observability pipeline<\/li>\n<li>Synthetic monitoring<\/li>\n<li>Chaos engineering<\/li>\n<li>Auto-heal automation<\/li>\n<li>WAF rule tuning<\/li>\n<li>CDN POP monitoring<\/li>\n<li>Shard-aware architecture<\/li>\n<li>High cardinality metrics<\/li>\n<li>Per-tenant monitoring<\/li>\n<li>Blue-green deployment<\/li>\n<li>Rollback strategy<\/li>\n<li>Postmortem<\/li>\n<li>Incident command system<\/li>\n<li>On-call rotation<\/li>\n<li>Runbook<\/li>\n<li>Playbook<\/li>\n<li>Blast radius<\/li>\n<li>Hot key mitigation<\/li>\n<li>Tiered caching<\/li>\n<li>Edge routing<\/li>\n<li>Multi-region failover<\/li>\n<li>Vendor fallback routing<\/li>\n<li>Scoped alerts<\/li>\n<li>Dedupe alerting<\/li>\n<li>Game day testing<\/li>\n<li>Tracing sampling policies<\/li>\n<li>Metrics retention policy<\/li>\n<li>Cost-aware 
telemetry<\/li>\n<li>Security redaction<\/li>\n<li>Data partitioning<\/li>\n<li>Backward-compatible migration<\/li>\n<li>Graceful degradation<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8212;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[149],"tags":[],"class_list":["post-1741","post","type-post","status-publish","format-standard","hentry","category-terminology"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sreschool.com\/blog\/partial-outage\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School\" \/>\n<meta property=\"og:description\" content=\"---\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sreschool.com\/blog\/partial-outage\/\" \/>\n<meta property=\"og:site_name\" content=\"SRE School\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-15T06:51:09+00:00\" \/>\n<meta name=\"author\" content=\"Rajesh Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Rajesh Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"32 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/sreschool.com\/blog\/partial-outage\/\",\"url\":\"https:\/\/sreschool.com\/blog\/partial-outage\/\",\"name\":\"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School\",\"isPartOf\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#website\"},\"datePublished\":\"2026-02-15T06:51:09+00:00\",\"author\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\"},\"breadcrumb\":{\"@id\":\"https:\/\/sreschool.com\/blog\/partial-outage\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/sreschool.com\/blog\/partial-outage\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/sreschool.com\/blog\/partial-outage\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/sreschool.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/sreschool.com\/blog\/#website\",\"url\":\"https:\/\/sreschool.com\/blog\/\",\"name\":\"SRESchool\",\"description\":\"Master SRE. Build Resilient Systems. 
Lead the Future of Reliability\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/sreschool.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\",\"name\":\"Rajesh Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"caption\":\"Rajesh Kumar\"},\"sameAs\":[\"http:\/\/sreschool.com\/blog\"],\"url\":\"https:\/\/sreschool.com\/blog\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sreschool.com\/blog\/partial-outage\/","og_locale":"en_US","og_type":"article","og_title":"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","og_description":"---","og_url":"https:\/\/sreschool.com\/blog\/partial-outage\/","og_site_name":"SRE School","article_published_time":"2026-02-15T06:51:09+00:00","author":"Rajesh Kumar","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Rajesh Kumar","Est. 
reading time":"32 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sreschool.com\/blog\/partial-outage\/","url":"https:\/\/sreschool.com\/blog\/partial-outage\/","name":"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","isPartOf":{"@id":"https:\/\/sreschool.com\/blog\/#website"},"datePublished":"2026-02-15T06:51:09+00:00","author":{"@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201"},"breadcrumb":{"@id":"https:\/\/sreschool.com\/blog\/partial-outage\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sreschool.com\/blog\/partial-outage\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/sreschool.com\/blog\/partial-outage\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sreschool.com\/blog\/"},{"@type":"ListItem","position":2,"name":"What is Partial outage? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"}]},{"@type":"WebSite","@id":"https:\/\/sreschool.com\/blog\/#website","url":"https:\/\/sreschool.com\/blog\/","name":"SRESchool","description":"Master SRE. Build Resilient Systems. 
Lead the Future of Reliability","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sreschool.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Person","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201","name":"Rajesh Kumar","image":{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","caption":"Rajesh Kumar"},"sameAs":["http:\/\/sreschool.com\/blog"],"url":"https:\/\/sreschool.com\/blog\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/1741","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/comments?post=1741"}],"version-history":[{"count":0,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/1741\/revisions"}],"wp:attachment":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/media?parent=1741"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/categories?post=1741"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/tags?post=1741"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}