{"id":2030,"date":"2026-02-15T12:41:33","date_gmt":"2026-02-15T12:41:33","guid":{"rendered":"https:\/\/sreschool.com\/blog\/multi-az\/"},"modified":"2026-02-15T12:41:33","modified_gmt":"2026-02-15T12:41:33","slug":"multi-az","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/multi-az\/","title":{"rendered":"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition (30\u201360 words)<\/h2>\n\n\n\n<p>Multi AZ is the architectural practice of deploying services and data redundantly across multiple isolated availability zones to reduce outage blast radius and maintain service continuity. Analogy: like having several independent backup generators at different buildings. Formal line: Multi AZ provides zone-level physical and network isolation with automated routing and failover controls.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is Multi AZ?<\/h2>\n\n\n\n<p>Multi AZ (Multiple Availability Zones) is a cloud-architecture strategy that places compute, storage, and networking resources across separately powered and networked datacenter zones within a region to improve resilience and fault tolerance.<\/p>\n\n\n\n<p>What it is NOT<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Not a full disaster recovery solution across regions.<\/li>\n<li>Not guaranteed zero downtime; it reduces but does not eliminate risk.<\/li>\n<li>Not a substitute for application-level resilience and design.<\/li>\n<\/ul>\n\n\n\n<p>Key properties and constraints<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Zone isolation: hardware, power, and local network faults are isolated to a zone.<\/li>\n<li>Low-latency sync: designed for synchronous or asynchronous replication within region latency bounds.<\/li>\n<li>Automatic failover: often paired with providers&#8217; automations for 
health checks and routing.<\/li>\n<li>Cost and complexity: adds replication, cross-zone data transfer, and operational overhead.<\/li>\n<li>Consistency tradeoffs: synchronous replication can add latency; asynchronous replication risks data loss.<\/li>\n<\/ul>\n\n\n\n<p>Where it fits in modern cloud\/SRE workflows<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Foundation for availability SLOs and error budgets.<\/li>\n<li>Baseline for platform reliability in IaaS\/PaaS and managed databases.<\/li>\n<li>Integrated with CI\/CD to test deployments across zones.<\/li>\n<li>Used with observability, chaos engineering, and runbook automation.<\/li>\n<\/ul>\n\n\n\n<p>Diagram description (text-only)<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Picture a region with three boxes labeled AZ-A, AZ-B, AZ-C.<\/li>\n<li>Each AZ has its own compute fleet, local storage caches, and network stack.<\/li>\n<li>A load balancer sits in front, health-checking instances in all AZs and routing traffic.<\/li>\n<li>Data storage replicates across AZs with a primary writer and replicas.<\/li>\n<li>Control plane coordinates failover and config sync.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Multi AZ in one sentence<\/h3>\n\n\n\n<p>Multi AZ replicates critical components across multiple independent datacenter zones within a cloud region to maintain service availability during zone failures and localized incidents.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Multi AZ vs related terms<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from Multi AZ<\/th>\n<th>Common confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>Multi Region<\/td>\n<td>Cross-region replication and failover rather than intra-region zones<\/td>\n<td>Mistaken for a drop-in, higher-availability substitute<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>High Availability<\/td>\n<td>HA is a goal; Multi AZ is one implementation approach<\/td>\n<td>HA can be 
achieved without Multi AZ<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>Disaster Recovery<\/td>\n<td>DR includes RTO\/RPO planning and runbooks beyond zones<\/td>\n<td>DR often implies cross-region plans<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Multi-Subnet<\/td>\n<td>Network segmentation inside same AZ not separate zones<\/td>\n<td>Assumed equal to AZ isolation<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Active-Active<\/td>\n<td>All zones accept writes vs typical active-passive setups<\/td>\n<td>Many Multi AZ setups are active-passive<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>Active-Passive<\/td>\n<td>Primary in one AZ with failover to others<\/td>\n<td>Some assume passive is immediate zero-loss<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Edge Replication<\/td>\n<td>Geographically distributed at edge rather than zones<\/td>\n<td>Equated with Multi AZ for performance<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>Zone-Aware Scheduling<\/td>\n<td>Scheduler places pods on different zones not replication<\/td>\n<td>Thought to fully replace Multi AZ replication<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if any cell says \u201cSee details below\u201d)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does Multi AZ matter?<\/h2>\n\n\n\n<p>Business impact<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Revenue continuity: Reduces customer-facing downtime which directly affects sales and renewals.<\/li>\n<li>Trust and brand protection: Frequent or prolonged outages harm reputation and customer trust.<\/li>\n<li>Risk reduction: Limits blast radius to a zone instead of entire region or service.<\/li>\n<\/ul>\n\n\n\n<p>Engineering impact<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incident reduction: Lowers frequency of outages tied to single-zone failures.<\/li>\n<li>Velocity tradeoff: Requires more upfront work to design for cross-zone 
consistency and testing.<\/li>\n<li>Complexity: Enlarges the CI\/CD test matrix and the set of runbooks to maintain.<\/li>\n<\/ul>\n\n\n\n<p>SRE framing<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs\/SLOs: Multi AZ supports tighter availability and latency SLOs because a single-zone failure no longer takes the service down.<\/li>\n<li>Error budgets: Reduces burn for zone failures but requires monitoring for cross-zone degradations.<\/li>\n<li>Toil reduction: Automating failover and recovery removes repetitive manual steps.<\/li>\n<li>On-call: Introduces new failure types to train on but reduces single-point failures.<\/li>\n<\/ul>\n\n\n\n<p>What breaks in production (realistic examples)<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Load balancer misconfiguration that sends traffic to only one AZ.<\/li>\n<li>Synchronous replication latency spikes that lead to write timeouts.<\/li>\n<li>An incorrect cross-zone network ACL update that blocks replication.<\/li>\n<li>Autoscaling that places all new instances in one AZ because quota is exhausted in the others.<\/li>\n<li>Rolling updates that drain instances in every AZ at the same time.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Where is Multi AZ used?
<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How Multi AZ appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge network<\/td>\n<td>Load balancers in each AZ with anycast routing<\/td>\n<td>Request latency per AZ<\/td>\n<td>Cloud LB, DNS, CDN<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Compute<\/td>\n<td>VMs or nodes scheduled across AZs<\/td>\n<td>Instance health and AZ distribution<\/td>\n<td>Cloud compute, Kubernetes<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Storage<\/td>\n<td>Block and object storage replicated across zones<\/td>\n<td>Replication lag and bandwidth<\/td>\n<td>Managed storage services<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Databases<\/td>\n<td>Primary and standby across AZs<\/td>\n<td>Commit latency and replica delay<\/td>\n<td>Managed DB, operator<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Kubernetes<\/td>\n<td>Zone-aware scheduling and topology spread<\/td>\n<td>Pod distribution and node health<\/td>\n<td>K8s scheduler, CNI<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>Serverless<\/td>\n<td>Platform spreads functions across AZs<\/td>\n<td>Invocation errors by AZ<\/td>\n<td>Serverless platform<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>CI\/CD<\/td>\n<td>Deployment targets include zone policies<\/td>\n<td>Deployment success by AZ<\/td>\n<td>CI\/CD pipelines<\/td>\n<\/tr>\n<tr>\n<td>L8<\/td>\n<td>Observability<\/td>\n<td>Metrics and logs aggregated across AZs<\/td>\n<td>Missing telemetry per AZ<\/td>\n<td>Metrics, logs, tracing<\/td>\n<\/tr>\n<tr>\n<td>L9<\/td>\n<td>Security<\/td>\n<td>IDS and firewall rules replicated per AZ<\/td>\n<td>Event correlation by AZ<\/td>\n<td>WAF, IAM, security tooling<\/td>\n<\/tr>\n<tr>\n<td>L10<\/td>\n<td>DR &amp; Backup<\/td>\n<td>Snapshot and replication across AZs<\/td>\n<td>Backup success and restore time<\/td>\n<td>Backup tools, snapshot 
service<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if needed)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use Multi AZ?<\/h2>\n\n\n\n<p>When it\u2019s necessary<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Customer-facing systems with strict availability SLAs.<\/li>\n<li>Stateful services where zone failure would cause significant data loss.<\/li>\n<li>Financial, healthcare, or regulated applications with compliance needs.<\/li>\n<\/ul>\n\n\n\n<p>When it\u2019s optional<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Non-critical batch workloads or dev\/test environments.<\/li>\n<li>Internal developer tools where brief downtime is tolerable.<\/li>\n<\/ul>\n\n\n\n<p>When NOT to use \/ overuse it<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Small projects where cost outweighs availability needs.<\/li>\n<li>When your application can&#8217;t support replication semantics needed (e.g., tight single-writer requirements without redesign).<\/li>\n<li>Where latency budget is extremely tight and synchronous replication increases tail latency excessively.<\/li>\n<\/ul>\n\n\n\n<p>Decision checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If customer impact on outage &gt; revenue threshold AND SLA requires &gt;99.95% -&gt; use Multi AZ.<\/li>\n<li>If data loss tolerance &lt;= acceptable RPO AND latency budget allows -&gt; consider synchronous Multi AZ.<\/li>\n<li>If cost-sensitive and recovery window acceptable -&gt; consider single AZ + cross-region DR.<\/li>\n<\/ul>\n\n\n\n<p>Maturity ladder<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Beginner: Spread stateless services across 2 AZs with LB; use managed DB with Multi AZ option.<\/li>\n<li>Intermediate: Zone-aware k8s clusters, cross-zone replicas, automated failover with tested runbooks.<\/li>\n<li>Advanced: Active-active multi-region designs 
with automated traffic steering, chaos testing, and policy-as-code.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does Multi AZ work?<\/h2>\n\n\n\n<p>Components and workflow<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Health checks run per AZ at load balancer and service level.<\/li>\n<li>Control plane maintains desired instance counts per AZ via scheduler or autoscaler.<\/li>\n<li>Data replicated between primary and replicas using sync\/async mechanisms.<\/li>\n<li>Failover triggers promoted standby or traffic routed away from unhealthy AZ.<\/li>\n<\/ul>\n\n\n\n<p>Data flow and lifecycle<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Writes from clients go through LB to primary writer in one AZ (or multiple in active-active).<\/li>\n<li>Replication streams send data to replicas in other AZs.<\/li>\n<li>Reads served from local replicas or via routed requests.<\/li>\n<li>On failure, monitoring detects loss of primary and triggers failover\/promotion.<\/li>\n<\/ul>\n\n\n\n<p>Edge cases and failure modes<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Split-brain on network partitions causing two primaries.<\/li>\n<li>DNS caching preventing fast client failover.<\/li>\n<li>Capacity skew where autoscaling lags in one AZ.<\/li>\n<li>Replication backlog causing data divergence during failover.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for Multi AZ<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Active-Passive managed database: Use for strong consistency with quick automated promotion.<\/li>\n<li>Active-Active read replicas: Use for read-scalable workloads where eventual consistency is acceptable.<\/li>\n<li>Zone-aware Kubernetes cluster: Scheduler ensures pods spread across AZs; use for containerized apps.<\/li>\n<li>Multi-AZ object storage: Replicate objects across AZs for durability.<\/li>\n<li>Edge-located LB + regional processing: LB terminates at edge AZs and forwards to 
Multi AZ backends.<\/li>\n<li>Global load balancer + Multi AZ regional backends: For failover between regions while preserving Multi AZ within each region.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Zone outage<\/td>\n<td>No traffic served by the AZ<\/td>\n<td>Power or network loss<\/td>\n<td>Re-route traffic and scale the remaining AZs<\/td>\n<td>AZ request drop<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Replication lag<\/td>\n<td>Increased write latency<\/td>\n<td>Network saturation<\/td>\n<td>Throttle writes until replicas catch up<\/td>\n<td>Replica delay metric<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>Split brain<\/td>\n<td>Conflicting writes<\/td>\n<td>Partitioned control plane<\/td>\n<td>Quorum-based arbitration<\/td>\n<td>Conflicting commit logs<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Single AZ scheduling<\/td>\n<td>Uneven capacity<\/td>\n<td>Scheduler misconfiguration<\/td>\n<td>Rebalance nodes and quotas<\/td>\n<td>Pod per AZ skew<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>DNS caching<\/td>\n<td>Clients hit dead AZ<\/td>\n<td>TTL too long<\/td>\n<td>Lower TTL and use TTL-aware routing<\/td>\n<td>Failed endpoint count<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Config drift<\/td>\n<td>Different versions per AZ<\/td>\n<td>Deployment race<\/td>\n<td>Enforce canary and rollout checks<\/td>\n<td>Version by AZ<\/td>\n<\/tr>\n<tr>\n<td>F7<\/td>\n<td>Storage corruption<\/td>\n<td>Read errors<\/td>\n<td>Disk or software bug<\/td>\n<td>Promote clean replica<\/td>\n<td>CRC or integrity alerts<\/td>\n<\/tr>\n<tr>\n<td>F8<\/td>\n<td>Security policy gap<\/td>\n<td>Blocked replication<\/td>\n<td>ACL\/Firewall change<\/td>\n<td>Test rules and rollback<\/td>\n<td>Replication error 
logs<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for Multi AZ<\/h2>\n\n\n\n<p>Each entry gives a short definition, why it matters, and a common pitfall.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Availability Zone \u2014 Isolated datacenter within a region \u2014 Critical for reducing blast radius \u2014 Pitfall: Assuming zones are fully independent.<\/li>\n<li>Region \u2014 Geographic grouping of AZs \u2014 Enables broader failure containment \u2014 Pitfall: Higher latency cross-region.<\/li>\n<li>Failover \u2014 Switching to standby resources \u2014 Maintains availability during failure \u2014 Pitfall: Untested runbooks.<\/li>\n<li>Failback \u2014 Restoring primary after outage \u2014 Restores preferred topology \u2014 Pitfall: Data divergence during failback.<\/li>\n<li>Active-Active \u2014 All zones serve traffic and accept writes \u2014 Higher availability and throughput \u2014 Pitfall: Consistency complexity.<\/li>\n<li>Active-Passive \u2014 Standbys ready to be promoted \u2014 Simpler consistency \u2014 Pitfall: Longer failover time.<\/li>\n<li>Replication lag \u2014 Delay between primary and replica \u2014 Affects RPO \u2014 Pitfall: Hidden tail latency.<\/li>\n<li>Synchronous replication \u2014 Writes wait for replicas \u2014 Strong consistency \u2014 Pitfall: Higher write latency.<\/li>\n<li>Asynchronous replication \u2014 Writes don&#8217;t wait \u2014 Lower latency \u2014 Pitfall: Potential data loss.<\/li>\n<li>Quorum \u2014 Majority agreement for state changes \u2014 Avoids split-brain \u2014 Pitfall: Even node counts add no fault tolerance and invite ties.<\/li>\n<li>Load balancer \u2014 Distributes traffic across AZs \u2014 Ensures health-based routing \u2014 Pitfall: Single misconfig can route to bad AZ.<\/li>\n<li>Health check 
\u2014 Probe that determines instance health \u2014 Drives automated routing \u2014 Pitfall: Overly strict checks cause false failovers.<\/li>\n<li>DNS failover \u2014 DNS-based routing changes on failure \u2014 Useful for cross-region \u2014 Pitfall: TTL caching delays.<\/li>\n<li>Anycast \u2014 Same IP announced from multiple locations \u2014 Fast routing \u2014 Pitfall: Complexity in stateful services.<\/li>\n<li>Network partition \u2014 Broken connectivity between zones \u2014 Causes inconsistent views \u2014 Pitfall: Recovery complexity.<\/li>\n<li>Split brain \u2014 Two primaries due to partition \u2014 Leads to data conflicts \u2014 Pitfall: Hard to reconcile.<\/li>\n<li>Topology spread \u2014 Scheduling constraint to distribute pods \u2014 Improves availability \u2014 Pitfall: Can limit bin-packing.<\/li>\n<li>Anti-affinity \u2014 Prevent same-host placement \u2014 Reduces correlated failures \u2014 Pitfall: May reduce density.<\/li>\n<li>Cross-zone traffic \u2014 Data transfer across AZs \u2014 Required for replication \u2014 Pitfall: Cost and bandwidth limits.<\/li>\n<li>Egress charges \u2014 Cross-AZ transfer fees \u2014 Affects cost model \u2014 Pitfall: Unexpected billing.<\/li>\n<li>Consistency model \u2014 Guarantees about data visibility \u2014 Informs design \u2014 Pitfall: Choosing wrong model for workload.<\/li>\n<li>RTO \u2014 Recovery Time Objective \u2014 Max acceptable downtime \u2014 Pitfall: Unmet without tested automation.<\/li>\n<li>RPO \u2014 Recovery Point Objective \u2014 Max acceptable data loss \u2014 Pitfall: Misaligned with replication policy.<\/li>\n<li>Drift \u2014 Unintended divergence between AZs \u2014 Causes inconsistent behavior \u2014 Pitfall: Hard to detect without telemetry.<\/li>\n<li>Chaos engineering \u2014 Controlled fault injection \u2014 Validates resilience \u2014 Pitfall: Run without guardrails.<\/li>\n<li>Observability \u2014 Metrics, logs, traces across AZs \u2014 Required for diagnosis \u2014 Pitfall: 
Aggregation gaps by AZ.<\/li>\n<li>Runbook \u2014 Prescribed steps for incidents \u2014 Speeds recovery \u2014 Pitfall: Stale or untested content.<\/li>\n<li>Playbook \u2014 Decision-oriented incident guide \u2014 Helps on-call triage \u2014 Pitfall: Overly generic.<\/li>\n<li>Canary deployment \u2014 Gradual rollout across zones \u2014 Limits blast radius \u2014 Pitfall: Canary not representative.<\/li>\n<li>Blue-green deployment \u2014 Swap traffic between environments \u2014 Simple rollback \u2014 Pitfall: Double capacity cost.<\/li>\n<li>StatefulSet \u2014 Kubernetes object for stateful apps \u2014 Controls pod identity across AZs \u2014 Pitfall: Volume attachment constraints.<\/li>\n<li>Multi-AZ snapshot \u2014 Point-in-time backups across zones \u2014 Enables restores \u2014 Pitfall: Snapshot consistency on writes.<\/li>\n<li>Topology-aware routing \u2014 Routing decisions based on AZ locality and health \u2014 Reduces latency \u2014 Pitfall: Complexity in multi-tenant setups.<\/li>\n<li>Service mesh \u2014 Layer for cross-AZ traffic control \u2014 Adds observability and resilience \u2014 Pitfall: Increased operational surface.<\/li>\n<li>Auto scaling groups \u2014 Ensure capacity across AZs \u2014 Mitigates overload \u2014 Pitfall: Scaling cooldowns causing gaps.<\/li>\n<li>Leader election \u2014 Choose primary among nodes \u2014 Prevents conflict \u2014 Pitfall: Misconfigured timeouts cause churn.<\/li>\n<li>Consensus protocol \u2014 Mechanism to agree on state \u2014 Critical for safe failover \u2014 Pitfall: Misunderstanding quorums.<\/li>\n<li>Immutable infrastructure \u2014 Replace rather than patch \u2014 Reduces drift \u2014 Pitfall: Needs robust CI\/CD.<\/li>\n<li>Topology spread constraints \u2014 K8s primitive for AZ distribution \u2014 Ensures spread \u2014 Pitfall: Resource fragmentation.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure Multi AZ (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure 
class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>AZ availability<\/td>\n<td>Uptime of each AZ endpoint<\/td>\n<td>Percent healthy per AZ from LB probes<\/td>\n<td>99.95% per AZ<\/td>\n<td>Probe config impacts result<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>Cross-zone latency<\/td>\n<td>Network delay between AZs<\/td>\n<td>P95 latency between AZ endpoints<\/td>\n<td>&lt;20ms within region<\/td>\n<td>Varies by provider<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Replication lag<\/td>\n<td>Delay for data sync<\/td>\n<td>Seconds between commit and replica apply<\/td>\n<td>&lt;1s for sync DB<\/td>\n<td>Burst traffic increases lag<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>Failover time<\/td>\n<td>Time to restore service after AZ failure<\/td>\n<td>Time from failure detection to traffic reroute<\/td>\n<td>&lt;30s for critical apps<\/td>\n<td>DNS TTL prolongs failover<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>Error rate per AZ<\/td>\n<td>5xx errors originating in each AZ<\/td>\n<td>Error count over requests<\/td>\n<td>&lt;0.1%<\/td>\n<td>Aggregation masks AZ spikes<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>Request distribution<\/td>\n<td>Load balance across AZs<\/td>\n<td>Percent requests per AZ<\/td>\n<td>Even within 10%<\/td>\n<td>Autoscaler can skew<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>Replica health<\/td>\n<td>Ready and synced replicas<\/td>\n<td>Replica state and sync metrics<\/td>\n<td>100% ready<\/td>\n<td>Silent corruption possible<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>Capacity headroom<\/td>\n<td>Spare capacity per AZ<\/td>\n<td>Reserved vs used compute<\/td>\n<td>20% headroom<\/td>\n<td>Cost vs resilience tradeoff<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>DNS failover latency<\/td>\n<td>Time clients take to switch<\/td>\n<td>Median client DNS resolution 
time<\/td>\n<td>&lt;60s<\/td>\n<td>Client-side cache varies<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Recovery RPO<\/td>\n<td>Data loss window after failover<\/td>\n<td>Data missing duration in seconds<\/td>\n<td>Aligned with SLO<\/td>\n<td>Hard to measure precisely<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if needed)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Best tools to measure Multi AZ<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Prometheus<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Multi AZ: Metrics for LB, instances, replication lag, and custom exporters.<\/li>\n<li>Best-fit environment: Kubernetes, VMs, hybrid.<\/li>\n<li>Setup outline:<\/li>\n<li>Deploy exporters per AZ for local metrics.<\/li>\n<li>Use federation or remote write to central store.<\/li>\n<li>Configure alerting rules per AZ.<\/li>\n<li>Strengths:<\/li>\n<li>Flexible query language.<\/li>\n<li>Wide ecosystem of exporters.<\/li>\n<li>Limitations:<\/li>\n<li>Storage scaling needs planning.<\/li>\n<li>Long-term retention requires external storage.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Grafana<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Multi AZ: Visual dashboards aggregating AZ metrics and traces.<\/li>\n<li>Best-fit environment: Any environment with metrics backend.<\/li>\n<li>Setup outline:<\/li>\n<li>Create AZ-specific panels.<\/li>\n<li>Use templating to compare AZs.<\/li>\n<li>Embed error budget panels.<\/li>\n<li>Strengths:<\/li>\n<li>Powerful visualization and annotation.<\/li>\n<li>Limitations:<\/li>\n<li>Not a metrics store itself.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 OpenTelemetry<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Multi AZ: Distributed traces and context propagation across AZs.<\/li>\n<li>Best-fit environment: Microservices 
and k8s.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument services with the OpenTelemetry SDKs.<\/li>\n<li>Tag traces with AZ metadata.<\/li>\n<li>Export to a tracing backend over OTLP.<\/li>\n<li>Strengths:<\/li>\n<li>End-to-end request visibility.<\/li>\n<li>Limitations:<\/li>\n<li>Sampling may miss rare AZ issues.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Chaos Engineering Platform<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Multi AZ: Resilience to AZ failures and recovery workflows.<\/li>\n<li>Best-fit environment: Pre-prod and staging.<\/li>\n<li>Setup outline:<\/li>\n<li>Define experiments scoped to a single AZ.<\/li>\n<li>Automate failover and rollback tests.<\/li>\n<li>Integrate with CI pipelines.<\/li>\n<li>Strengths:<\/li>\n<li>Validates runbooks under controlled conditions.<\/li>\n<li>Limitations:<\/li>\n<li>Needs safety gating.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Cloud Provider Monitoring (native)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Multi AZ: Provider-level AZ health, network metrics, and service events.<\/li>\n<li>Best-fit environment: Native cloud services.<\/li>\n<li>Setup outline:<\/li>\n<li>Enable provider health events.<\/li>\n<li>Wire provider metrics into central dashboard.<\/li>\n<li>Set provider-health alerts.<\/li>\n<li>Strengths:<\/li>\n<li>Provider context and notifications.<\/li>\n<li>Limitations:<\/li>\n<li>Vendor lock-in for visibility depth.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for Multi AZ<\/h3>\n\n\n\n<p>Executive dashboard<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels:<\/li>\n<li>Region and AZ availability summary.<\/li>\n<li>Error budget remaining for top services.<\/li>\n<li>Business impact indicators (transactions per minute).<\/li>\n<li>Why: Gives leadership visibility into risk 
posture.<\/li>\n<\/ul>\n\n\n\n<p>On-call dashboard<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels:<\/li>\n<li>AZ error rate and request distribution.<\/li>\n<li>Failover progress and replication lag.<\/li>\n<li>Recent deployment status by AZ.<\/li>\n<li>Why: Rapid triage and decision-making.<\/li>\n<\/ul>\n\n\n\n<p>Debug dashboard<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels:<\/li>\n<li>Traces filtered by AZ and endpoint.<\/li>\n<li>Replica health and commit logs.<\/li>\n<li>Network path latency matrix.<\/li>\n<li>Why: Deep diagnostics to resolve root cause.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page (page immediately) vs ticket:<\/li>\n<li>Page for multi-AZ outage signals: total region failure, replication lag exceeding SLO, split brain detection.<\/li>\n<li>Ticket for degraded but noncritical issues: one AZ increased errors but within error budget.<\/li>\n<li>Burn-rate guidance:<\/li>\n<li>If error budget burn rate &gt;2x baseline, escalate to incident review.<\/li>\n<li>Use burn-rate windows (1h, 6h, 24h) for trend detection.<\/li>\n<li>Noise reduction tactics:<\/li>\n<li>Deduplicate alerts by grouping per service and AZ.<\/li>\n<li>Suppression during known maintenance windows.<\/li>\n<li>Use correlation rules to combine related alerts into one incident.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites\n&#8211; Account quotas and AZ capacity verified.\n&#8211; IAM roles and cross-AZ network connectivity configured.\n&#8211; Observability and CI\/CD pipelines ready.<\/p>\n\n\n\n<p>2) Instrumentation plan\n&#8211; Define SLIs and tag metrics with AZ metadata.\n&#8211; Instrument health, latency, and replication metrics.\n&#8211; Ensure traces include AZ label.<\/p>\n\n\n\n<p>3) Data collection\n&#8211; Centralize metrics, logs, and traces.\n&#8211; Ensure each AZ exports telemetry to 
central system with source AZ.<\/p>\n\n\n\n<p>4) SLO design\n&#8211; Define AZ-aware availability and latency SLOs.\n&#8211; Define error budgets and burn-rate thresholds.<\/p>\n\n\n\n<p>5) Dashboards\n&#8211; Build executive, on-call, and debug dashboards as above.<\/p>\n\n\n\n<p>6) Alerts &amp; routing\n&#8211; Create alert rules for failovers, replication lag, and AZ skew.\n&#8211; Configure escalation policies and runbook links.<\/p>\n\n\n\n<p>7) Runbooks &amp; automation\n&#8211; Author playbooks for common Multi AZ incidents.\n&#8211; Automate failover steps where safe.<\/p>\n\n\n\n<p>8) Validation (load\/chaos\/game days)\n&#8211; Run scheduled chaos tests that simulate AZ loss.\n&#8211; Use game days for on-call practice.<\/p>\n\n\n\n<p>9) Continuous improvement\n&#8211; Run a postmortem after each incident and iterate on runbooks.\n&#8211; Use metrics to quantify improved resilience.<\/p>\n\n\n\n<p>Checklists<\/p>\n\n\n\n<p>Pre-production checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multi-AZ test coverage in CI.<\/li>\n<li>Load balancer health checks configured.<\/li>\n<li>Replication and snapshot restore tested.<\/li>\n<\/ul>\n\n\n\n<p>Production readiness checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Adequate capacity headroom per AZ.<\/li>\n<li>Monitoring and alerts operational.<\/li>\n<li>Runbooks published and verified.<\/li>\n<\/ul>\n\n\n\n<p>Incident checklist specific to Multi AZ<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Identify scope: single AZ or region.<\/li>\n<li>Verify telemetry and replication state.<\/li>\n<li>Redirect traffic and promote replica if needed.<\/li>\n<li>Communicate status and timeline to stakeholders.<\/li>\n<li>Run postmortem and update runbooks.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of Multi AZ<\/h2>\n\n\n\n<p>1) Customer-facing API\n&#8211; Context: External API serving users 
globally.\n&#8211; Problem: Single AZ outage takes API offline.\n&#8211; Why Multi AZ helps: Reduces downtime and preserves user sessions.\n&#8211; What to measure: Error rate per AZ, failover time.\n&#8211; Typical tools: LB, K8s, managed DB.<\/p>\n\n\n\n<p>2) Managed relational database\n&#8211; Context: Transactional database for payments.\n&#8211; Problem: Data loss risk during AZ failure.\n&#8211; Why: Replication across AZs reduces RPO.\n&#8211; What to measure: Replication lag, commit success.\n&#8211; Typical tools: Managed DB Multi AZ.<\/p>\n\n\n\n<p>3) Stateful Kubernetes service\n&#8211; Context: Statefulset with persistent volumes.\n&#8211; Problem: Volume attachment constraints break pods in failed AZ.\n&#8211; Why: Multi AZ scheduling and replicated volumes improve resilience.\n&#8211; What to measure: Pod distribution, PVC attachment failures.\n&#8211; Typical tools: K8s, CSI drivers, topology-aware storage.<\/p>\n\n\n\n<p>4) Real-time analytics\n&#8211; Context: Stream processing with low latency reads.\n&#8211; Problem: Zone outage creates processing backlog.\n&#8211; Why: Multi AZ replicates brokers and consumers.\n&#8211; What to measure: Consumer lag, throughput per AZ.\n&#8211; Typical tools: Stream platform with cross-AZ replication.<\/p>\n\n\n\n<p>5) Serverless webhooks\n&#8211; Context: Event-driven functions for webhooks.\n&#8211; Problem: Provider AZ outage causes missed events.\n&#8211; Why: Platform spreads invocations preventing single-point outage.\n&#8211; What to measure: Invocation failures by AZ.\n&#8211; Typical tools: Serverless platform with Multi AZ.<\/p>\n\n\n\n<p>6) Compliance backups\n&#8211; Context: Regulatory requirement for redundancy.\n&#8211; Problem: Single AZ backups insufficient.\n&#8211; Why: Snapshots replicated across AZs meet requirements.\n&#8211; What to measure: Backup success and restore time.\n&#8211; Typical tools: Backup orchestration and provider snapshot.<\/p>\n\n\n\n<p>7) Edge termination with regional 
backends\n&#8211; Context: Edge LB terminates TLS in each AZ.\n&#8211; Problem: Terminating TLS in a single AZ causes latency spikes.\n&#8211; Why Multi AZ helps: Local termination reduces cross-AZ hops.\n&#8211; What to measure: Edge latency and backend errors.\n&#8211; Typical tools: Edge LB, CDN, regional services.<\/p>\n\n\n\n<p>8) CI\/CD runners\n&#8211; Context: Build fleet for deployments.\n&#8211; Problem: An AZ outage halts pipelines.\n&#8211; Why Multi AZ helps: Spreading runners across AZs keeps pipelines running.\n&#8211; What to measure: Build success rate by AZ.\n&#8211; Typical tools: CI system with AZ-aware runners.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes Multi AZ failover<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Production K8s cluster hosting web services across three AZs.<br\/>\n<strong>Goal:<\/strong> Survive a single-AZ outage without dropping requests.<br\/>\n<strong>Why Multi AZ matters here:<\/strong> Node and AZ failures do happen; Multi AZ reduces their user impact.<br\/>\n<strong>Architecture \/ workflow:<\/strong> K8s cluster with topology spread constraints, multi-AZ storage class, LB health checks.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Configure topologySpreadConstraints for critical deployments.<\/li>\n<li>Use a storage class that supports multi-AZ volumes or replicate state externally.<\/li>\n<li>Configure LB health checks and keep session-stickiness TTLs minimal.<\/li>\n<li>Test by cordoning and draining nodes in one AZ.\n<strong>What to measure:<\/strong> Pod distribution, request errors by AZ, failover time.<br\/>\n<strong>Tools to use and why:<\/strong> Kubernetes scheduler, Prometheus, Grafana, CSI multi-AZ storage.<br\/>\n<strong>Common pitfalls:<\/strong> Stateful volumes that are not multi-attach; scheduler misconfiguration.<br\/>\n<strong>Validation:<\/strong> Chaos experiment:
simulate an AZ failure and confirm the error rate stays within the SLO.<br\/>\n<strong>Outcome:<\/strong> Service continues with minimal request loss and automated pod rescheduling.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless ingestion across AZs<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Event ingestion pipeline using managed serverless functions and a managed DB.<br\/>\n<strong>Goal:<\/strong> Ensure events are accepted and persisted even when one AZ fails.<br\/>\n<strong>Why Multi AZ matters here:<\/strong> The serverless platform spreads compute across AZs; the DB needs Multi AZ for durability.<br\/>\n<strong>Architecture \/ workflow:<\/strong> API gateway routes to functions in any AZ; functions write to the Multi AZ DB with retries.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Enable the provider&#8217;s Multi AZ option for the database.<\/li>\n<li>Implement retries and idempotency in functions.<\/li>\n<li>Monitor DB replication lag and function error rates.\n<strong>What to measure:<\/strong> Invocation success by AZ, DB commit latency.<br\/>\n<strong>Tools to use and why:<\/strong> Provider serverless, managed DB Multi AZ, observability backend.<br\/>\n<strong>Common pitfalls:<\/strong> Cold starts and provider throttling during failover.<br\/>\n<strong>Validation:<\/strong> Inject a DB failover event and verify that ingestion continues.<br\/>\n<strong>Outcome:<\/strong> Events are accepted across AZs with minimal loss.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Incident response and postmortem for AZ outage<\/h3>\n\n\n\n<p><strong>Context:<\/strong> One AZ experienced a network partition for 20 minutes, causing partial downtime.<br\/>\n<strong>Goal:<\/strong> Produce a postmortem that prevents recurrence and improves runbooks.<br\/>\n<strong>Why Multi AZ matters here:<\/strong> The root cause was tied to cross-AZ routing and inefficient failover automation.<br\/>\n<strong>Architecture \/ workflow:<\/strong> LB, managed DB with
async replica, k8s cluster.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Triage using AZ metrics and logs.<\/li>\n<li>Execute runbook to promote replica and update routing.<\/li>\n<li>Communicate rapidly to stakeholders.<\/li>\n<li>Post-incident, update runbook with missing steps.\n<strong>What to measure:<\/strong> Time to detection, failover time, error budget burn.<br\/>\n<strong>Tools to use and why:<\/strong> Observability stack, incident management, runbook automation.<br\/>\n<strong>Common pitfalls:<\/strong> Incomplete telemetry for decision making.<br\/>\n<strong>Validation:<\/strong> Run tabletop exercises simulating same failure.<br\/>\n<strong>Outcome:<\/strong> Runbook improvements reduced future failover time.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost vs performance trade-off<\/h3>\n\n\n\n<p><strong>Context:<\/strong> E-commerce platform considering synchronous Multi AZ writes.<br\/>\n<strong>Goal:<\/strong> Decide between synchronous replication for zero RPO and async for lower latency.<br\/>\n<strong>Why Multi AZ matters here:<\/strong> Trade-offs impact conversion rates and customer experience.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Checkout flow writes sensitive payment records.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Measure current write latency contribution.<\/li>\n<li>Estimate increased latency with synchronous replication.<\/li>\n<li>Prototype synchronous and measure conversion impact.<\/li>\n<li>If too slow, use async with strong reconciliation and compensating transactions.\n<strong>What to measure:<\/strong> P95 write latency, checkout conversion, replication lag.<br\/>\n<strong>Tools to use and why:<\/strong> Load testing, A\/B testing, observability.<br\/>\n<strong>Common pitfalls:<\/strong> Ignoring user-perceived latency vs internal 
metrics.<br\/>\n<strong>Validation:<\/strong> Run a controlled A\/B experiment with live traffic to measure the conversion delta.<br\/>\n<strong>Outcome:<\/strong> Async replication was chosen, with compensating logic and stricter monitoring.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<p>Each mistake is listed as symptom -&gt; root cause -&gt; fix.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Symptom: All traffic hits a single AZ -&gt; Root cause: LB misconfiguration -&gt; Fix: Verify LB cross-AZ routing and health checks.<\/li>\n<li>Symptom: Replica lag spikes under load -&gt; Root cause: Bandwidth saturation -&gt; Fix: Increase throughput capacity or use async with compensation.<\/li>\n<li>Symptom: Split brain occurred -&gt; Root cause: No quorum or leader-election misconfiguration -&gt; Fix: Implement quorum-based consensus and fencing.<\/li>\n<li>Symptom: DNS changes not respected -&gt; Root cause: High TTL caching -&gt; Fix: Lower TTL and use client-aware retry logic.<\/li>\n<li>Symptom: Deployment broke service in all AZs -&gt; Root cause: Simultaneous draining across AZs -&gt; Fix: Enforce rolling updates with per-AZ concurrency limits.<\/li>\n<li>Symptom: Persistent data corruption -&gt; Root cause: Silent replication bug -&gt; Fix: Run consistency checks and promote clean replicas.<\/li>\n<li>Symptom: Observability gaps by AZ -&gt; Root cause: Missing AZ tags in telemetry -&gt; Fix: Tag all metrics and logs with AZ metadata.<\/li>\n<li>Symptom: Alerts fire repeatedly -&gt; Root cause: No dedupe or grouping -&gt; Fix: Add alert grouping and deduplication rules.<\/li>\n<li>Symptom: Excessive cross-AZ costs -&gt; Root cause: Chatty replication or misrouted traffic -&gt; Fix: Optimize replication and reduce cross-AZ egress.<\/li>\n<li>Symptom: Autoscaler launches in same AZ -&gt; Root cause: Quota or scheduler bugs -&gt; Fix: Check quotas and configure
zone balancing.<\/li>\n<li>Symptom: Stateful pods reschedule slowly -&gt; Root cause: Volume attachment delays -&gt; Fix: Use multi-AZ storage or redesign stateful handling.<\/li>\n<li>Symptom: Unclear postmortem -&gt; Root cause: Missing timelines and telemetry -&gt; Fix: Capture events with timestamps and enrich logs.<\/li>\n<li>Symptom: Unexpected failback issues -&gt; Root cause: Data drift during failover -&gt; Fix: Reconcile data before failback and test the runbook.<\/li>\n<li>Symptom: Test passes in staging but fails in prod -&gt; Root cause: Incomplete staging parity -&gt; Fix: Increase staging parity and run chaos tests in a prod-like environment.<\/li>\n<li>Symptom: On-call overload during AZ issues -&gt; Root cause: Poor automation -&gt; Fix: Automate common recovery actions.<\/li>\n<li>Symptom: Slow replication during peak -&gt; Root cause: Underprovisioned IO -&gt; Fix: Increase IO settings or shard writes.<\/li>\n<li>Symptom: Vault or secrets unavailable in one AZ -&gt; Root cause: Regional misconfiguration -&gt; Fix: Replicate secret stores across AZs.<\/li>\n<li>Symptom: Traces don&#8217;t show AZ context -&gt; Root cause: No AZ labels in tracing -&gt; Fix: Add AZ tags to trace spans.<\/li>\n<li>Symptom: Canary tests not catching AZ-specific bug -&gt; Root cause: Canary not executed across all AZs -&gt; Fix: Run canaries in each AZ.<\/li>\n<li>Symptom: Security rules block cross-AZ replication -&gt; Root cause: ACL changes -&gt; Fix: Use immutable security policy templates and test changes.<\/li>\n<\/ol>\n\n\n\n<p>Observability pitfalls covered above: missing AZ tags, incomplete telemetry, traces without AZ context, and alerts that flood. Also watch for dashboards that aggregate across AZs and mask per-AZ differences.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<p>Ownership and on-call<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Platform team owns Multi AZ platform and runbooks.<\/li>\n<li>Service teams own
application-level resilience and SLIs.<\/li>\n<li>On-call rotations include platform and service on-call for cross-AZ incidents.<\/li>\n<\/ul>\n\n\n\n<p>Runbooks vs playbooks<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runbooks: Step-by-step commands for operational tasks.<\/li>\n<li>Playbooks: Decision trees for triage and escalation.<\/li>\n<\/ul>\n\n\n\n<p>Safe deployments<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Run canaries per AZ and limit how many AZs are drained concurrently.<\/li>\n<li>Trigger automatic rollback on SLO violations.<\/li>\n<\/ul>\n\n\n\n<p>Toil reduction and automation<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate failover promotion, capacity rebalancing, and remediation.<\/li>\n<li>Use policy-as-code to prevent drift.<\/li>\n<\/ul>\n\n\n\n<p>Security basics<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Replicate IAM policies and security configurations across AZs.<\/li>\n<li>Ensure key management supports multi-AZ access.<\/li>\n<\/ul>\n\n\n\n<p>Weekly\/monthly\/quarterly routines<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly: Verify backup jobs and restore tests.<\/li>\n<li>Monthly: Run a chaos test or tabletop exercise for a single-AZ failure.<\/li>\n<li>Quarterly: Review capacity headroom and runbook updates.<\/li>\n<\/ul>\n\n\n\n<p>What to review in postmortems<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Timeline with AZ-specific telemetry.<\/li>\n<li>Root cause and whether Multi AZ mitigations worked.<\/li>\n<li>Action items: automation, instrumentation, and runbook changes.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for Multi AZ<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Metrics store<\/td>\n<td>Collects and queries metrics<\/td>\n<td>Prometheus, Grafana<\/td>\n<td>Centralized metrics
per AZ<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>Tracing<\/td>\n<td>Distributed tracing across AZs<\/td>\n<td>OpenTelemetry, tracing backend<\/td>\n<td>Tag traces with AZ<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>Logging<\/td>\n<td>Aggregates logs from AZs<\/td>\n<td>Log pipeline<\/td>\n<td>Ensure AZ label on logs<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>Load balancing<\/td>\n<td>Routes traffic by health<\/td>\n<td>LB, DNS, anycast<\/td>\n<td>Multi-AZ routing policies<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>Storage<\/td>\n<td>Replicated storage across AZs<\/td>\n<td>Provider storage, CSI<\/td>\n<td>Check consistency guarantees<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>Database<\/td>\n<td>Managed multi-AZ DB services<\/td>\n<td>DB engines and operators<\/td>\n<td>Understand failover semantics<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>CI\/CD<\/td>\n<td>Deploy with AZ constraints<\/td>\n<td>Pipeline, k8s<\/td>\n<td>Canary and per-AZ rollout<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>Chaos platform<\/td>\n<td>Run resilience experiments<\/td>\n<td>CI and observability<\/td>\n<td>Gate experiments with safety checks<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>Incident mgmt<\/td>\n<td>Coordinate response and comms<\/td>\n<td>Pager, ticketing<\/td>\n<td>Link runbooks and telemetry<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>Policy-as-code<\/td>\n<td>Enforce zoning policies<\/td>\n<td>IAM, infra tooling<\/td>\n<td>Prevent config drift<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is the difference between Multi AZ and Multi Region?<\/h3>\n\n\n\n<p>Multi AZ spreads resources across isolated datacenters within a single region; Multi Region spans multiple geographic regions and provides higher resilience against regional failures but
with higher latency and complexity.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does Multi AZ guarantee zero downtime?<\/h3>\n\n\n\n<p>No. Multi AZ reduces the likelihood and impact of zone failures but does not guarantee zero downtime; control-plane failures, software bugs, or simultaneous faults can still cause outages.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How many AZs should I use?<\/h3>\n\n\n\n<p>Typically at least two for redundancy, and three so that quorum-based systems keep a majority when one AZ fails; the exact number depends on the provider and on cost.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is synchronous replication required for Multi AZ?<\/h3>\n\n\n\n<p>No. It depends on RPO requirements. Synchronous replication provides stronger guarantees but increases latency; asynchronous replication reduces latency but risks data loss.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I test Multi AZ failover?<\/h3>\n\n\n\n<p>Use automated chaos tests, simulated AZ drains, and game days to validate failover and runbooks under controlled conditions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What are the costs associated with Multi AZ?<\/h3>\n\n\n\n<p>Costs include cross-AZ data transfer, duplicated resources, and additional operational overhead. Weigh these against the cost of an outage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can serverless benefit from Multi AZ?<\/h3>\n\n\n\n<p>Yes. Managed serverless platforms often spread functions across AZs, but dependent services like databases must be Multi AZ too.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I avoid split-brain?<\/h3>\n\n\n\n<p>Use quorum-based leader election, consensus protocols, and fencing mechanisms to prevent simultaneous primaries.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Should I measure per-AZ SLIs?<\/h3>\n\n\n\n<p>Yes.
Per-AZ SLIs help detect skew and keep aggregate numbers from masking localized problems.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What telemetry is most important for Multi AZ?<\/h3>\n\n\n\n<p>Health checks, replication lag, request distribution, per-AZ error rates, and cross-zone latency are critical.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How does DNS affect failover?<\/h3>\n\n\n\n<p>DNS caching and TTLs can delay client re-routing; use low TTLs and regional routing where possible.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can Multi AZ replace backups?<\/h3>\n\n\n\n<p>No. Multi AZ provides availability within a region, but only backups protect against corruption, operator error, and ransomware.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How does Multi AZ impact CI\/CD?<\/h3>\n\n\n\n<p>CI\/CD must be AZ-aware: perform canary rollouts and never drain capacity in all AZs at the same time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What security considerations are unique to Multi AZ?<\/h3>\n\n\n\n<p>Replicate security configuration, ensure key access across AZs, and test cross-AZ incident response to lock down exposures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I handle stateful workloads?<\/h3>\n\n\n\n<p>Use storage with multi-AZ replication or design the application to keep its state in external replicated services.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Will Multi AZ fix provider outages?<\/h3>\n\n\n\n<p>Not always.
If the provider has a regional control-plane issue, Multi AZ may still be affected; Multi Region is needed for regional failures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is a typical failover time?<\/h3>\n\n\n\n<p>It varies by implementation. Windows under 30 seconds are possible for critical systems with proper automation, but DNS caching and client behavior can extend them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I balance cost and availability?<\/h3>\n\n\n\n<p>Define SLOs and error budgets, then choose the level of Multi AZ redundancy that meets business tolerance without unnecessary duplication.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Are there managed services that provide Multi AZ automatically?<\/h3>\n\n\n\n<p>Yes, many managed databases and storage services offer Multi AZ options; behavior and guarantees vary by provider.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Multi AZ is a foundational resilience pattern for modern cloud-native systems that reduces the blast radius of zone failures while introducing trade-offs in cost and complexity.
It should be combined with strong observability, automated runbooks, and regular validation to meet SLOs.<\/p>\n\n\n\n<p>Next 7 days plan<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Inventory critical services and annotate AZ deployment footprint.<\/li>\n<li>Day 2: Tag metrics, logs, and traces with AZ metadata.<\/li>\n<li>Day 3: Define or refine SLIs\/SLOs for AZ availability and replication.<\/li>\n<li>Day 4: Implement per-AZ dashboards and key alerts.<\/li>\n<li>Day 5: Run a controlled failover or chaos test in staging.<\/li>\n<li>Day 6: Update runbooks and automation based on test findings.<\/li>\n<li>Day 7: Schedule a production game day and on-call readiness review.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Multi AZ Keyword Cluster (SEO)<\/h2>\n\n\n\n<p>Primary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multi AZ<\/li>\n<li>Multi Availability Zone<\/li>\n<li>Multi AZ architecture<\/li>\n<li>Multi AZ deployment<\/li>\n<li>Multi AZ best practices<\/li>\n<\/ul>\n\n\n\n<p>Secondary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AZ redundancy<\/li>\n<li>Availability zone replication<\/li>\n<li>cross-AZ replication<\/li>\n<li>zone failure mitigation<\/li>\n<li>AZ failover<\/li>\n<\/ul>\n\n\n\n<p>Long-tail questions<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is Multi AZ in cloud architecture<\/li>\n<li>How does Multi AZ work for databases<\/li>\n<li>Multi AZ vs Multi Region differences<\/li>\n<li>When to use Multi AZ for Kubernetes<\/li>\n<li>How to measure Multi AZ availability<\/li>\n<li>How to test Multi AZ failover<\/li>\n<li>Multi AZ cost considerations for startups<\/li>\n<li>Best practices for Multi AZ deployments<\/li>\n<li>How to monitor replication lag across AZs<\/li>\n<li>How to design Multi AZ storage for stateful apps<\/li>\n<\/ul>\n\n\n\n<p>Related terminology<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>availability zone<\/li>\n<li>region 
redundancy<\/li>\n<li>failover automation<\/li>\n<li>replication lag<\/li>\n<li>quorum election<\/li>\n<li>synchronous replication<\/li>\n<li>asynchronous replication<\/li>\n<li>topology spread constraints<\/li>\n<li>anti-affinity scheduling<\/li>\n<li>load balancer health checks<\/li>\n<li>DNS TTL and failover<\/li>\n<li>cross-AZ data transfer<\/li>\n<li>replication backlog<\/li>\n<li>error budget burn rate<\/li>\n<li>chaos engineering<\/li>\n<li>runbook automation<\/li>\n<li>canary per-AZ<\/li>\n<li>blue-green deployment<\/li>\n<li>active-active topology<\/li>\n<li>active-passive topology<\/li>\n<li>consistency model<\/li>\n<li>RTO RPO<\/li>\n<li>topology-aware routing<\/li>\n<li>service mesh for AZ routing<\/li>\n<li>multi-AZ CSI drivers<\/li>\n<li>cloud provider health events<\/li>\n<li>AZ-aware observability<\/li>\n<li>global load balancer with regional backends<\/li>\n<li>immutable infrastructure practices<\/li>\n<li>backup and snapshot replication<\/li>\n<li>policy-as-code for zoning<\/li>\n<li>incident response for AZ events<\/li>\n<li>postmortem AZ timeline<\/li>\n<li>capacity headroom per AZ<\/li>\n<li>DB promotion and failback<\/li>\n<li>vault replication across AZs<\/li>\n<li>secrets access multi-AZ<\/li>\n<li>tracing AZ labels<\/li>\n<li>metrics federation per AZ<\/li>\n<li>automated runbook testing<\/li>\n<li>staging parity for AZs<\/li>\n<li>traffic steering for AZ health<\/li>\n<li>throttling for cross-AZ bandwidth<\/li>\n<li>client-side retry design<\/li>\n<li>idempotent writes for failover<\/li>\n<li>reconciliation after failback<\/li>\n<li>topology constraints in schedulers<\/li>\n<li>AZ-specific service quotas<\/li>\n<li>operational maturity ladder for Multi 
AZ<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8212;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[149],"tags":[],"class_list":["post-2030","post","type-post","status-publish","format-standard","hentry","category-terminology"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sreschool.com\/blog\/multi-az\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School\" \/>\n<meta property=\"og:description\" content=\"---\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sreschool.com\/blog\/multi-az\/\" \/>\n<meta property=\"og:site_name\" content=\"SRE School\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-15T12:41:33+00:00\" \/>\n<meta name=\"author\" content=\"Rajesh Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Rajesh Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"27 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/sreschool.com\/blog\/multi-az\/\",\"url\":\"https:\/\/sreschool.com\/blog\/multi-az\/\",\"name\":\"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School\",\"isPartOf\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#website\"},\"datePublished\":\"2026-02-15T12:41:33+00:00\",\"author\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\"},\"breadcrumb\":{\"@id\":\"https:\/\/sreschool.com\/blog\/multi-az\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/sreschool.com\/blog\/multi-az\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/sreschool.com\/blog\/multi-az\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/sreschool.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/sreschool.com\/blog\/#website\",\"url\":\"https:\/\/sreschool.com\/blog\/\",\"name\":\"SRESchool\",\"description\":\"Master SRE. Build Resilient Systems. 
Lead the Future of Reliability\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/sreschool.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\",\"name\":\"Rajesh Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"caption\":\"Rajesh Kumar\"},\"sameAs\":[\"http:\/\/sreschool.com\/blog\"],\"url\":\"https:\/\/sreschool.com\/blog\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sreschool.com\/blog\/multi-az\/","og_locale":"en_US","og_type":"article","og_title":"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","og_description":"---","og_url":"https:\/\/sreschool.com\/blog\/multi-az\/","og_site_name":"SRE School","article_published_time":"2026-02-15T12:41:33+00:00","author":"Rajesh Kumar","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Rajesh Kumar","Est. 
reading time":"27 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sreschool.com\/blog\/multi-az\/","url":"https:\/\/sreschool.com\/blog\/multi-az\/","name":"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide) - SRE School","isPartOf":{"@id":"https:\/\/sreschool.com\/blog\/#website"},"datePublished":"2026-02-15T12:41:33+00:00","author":{"@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201"},"breadcrumb":{"@id":"https:\/\/sreschool.com\/blog\/multi-az\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sreschool.com\/blog\/multi-az\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/sreschool.com\/blog\/multi-az\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sreschool.com\/blog\/"},{"@type":"ListItem","position":2,"name":"What is Multi AZ? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"}]},{"@type":"WebSite","@id":"https:\/\/sreschool.com\/blog\/#website","url":"https:\/\/sreschool.com\/blog\/","name":"SRESchool","description":"Master SRE. Build Resilient Systems. 
Lead the Future of Reliability","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sreschool.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Person","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201","name":"Rajesh Kumar","image":{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","caption":"Rajesh Kumar"},"sameAs":["http:\/\/sreschool.com\/blog"],"url":"https:\/\/sreschool.com\/blog\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/2030","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/comments?post=2030"}],"version-history":[{"count":0,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/2030\/revisions"}],"wp:attachment":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/media?parent=2030"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/categories?post=2030"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/tags?post=2030"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}