{"id":1010,"date":"2025-09-02T08:40:20","date_gmt":"2025-09-02T08:40:20","guid":{"rendered":"https:\/\/sreschool.com\/blog\/?p=1010"},"modified":"2025-09-02T08:40:21","modified_gmt":"2025-09-02T08:40:21","slug":"what-is-redundancy","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/","title":{"rendered":"What is Redundancy?"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1001\" height=\"471\" src=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png\" alt=\"\" class=\"wp-image-1012\" srcset=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png 1001w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1-300x141.png 300w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1-768x361.png 768w\" sizes=\"auto, (max-width: 1001px) 100vw, 1001px\" \/><\/figure>\n\n\n\n<p><strong>Redundancy<\/strong> is the deliberate duplication of critical components or paths so that a failure doesn\u2019t violate your <strong>SLOs<\/strong>. 
Put simply: remove single points of failure (SPOFs) and make sure something else can take over fast enough that users don\u2019t notice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where you add redundancy (failure domains)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Process \/ pod:<\/strong> multiple workers for the same service.<\/li>\n\n\n\n<li><strong>Host \/ node:<\/strong> more than one VM\/node per service tier.<\/li>\n\n\n\n<li><strong>Availability Zone (AZ):<\/strong> replicas spread across \u22652 AZs.<\/li>\n\n\n\n<li><strong>Region:<\/strong> active-active or active-passive between regions.<\/li>\n\n\n\n<li><strong>Vendor:<\/strong> multi-provider or alternate managed service (only when justified).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Common patterns<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>N+1 \/ N+M:<\/strong> have at least one (or M) spare capacity unit beyond steady-state needs.<\/li>\n\n\n\n<li><strong>2N (\u201cmirrored\u201d):<\/strong> two full-capacity stacks; either can serve 100%.<\/li>\n\n\n\n<li><strong>Active-active:<\/strong> all sites handle traffic; failover is mostly automatic and fast.<\/li>\n\n\n\n<li><strong>Active-passive:<\/strong> a hot\/warm standby takes over on failure (some failover time).<\/li>\n\n\n\n<li><strong>Quorum-based replication:<\/strong> e.g., 3 or 5 nodes (Raft\/Paxos) so a majority can proceed.<\/li>\n\n\n\n<li><strong>Erasure coding \/ parity:<\/strong> data survives disk\/node loss without full duplication.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">How redundancy improves reliability<\/h2>\n\n\n\n<p>If one replica has availability <em>A<\/em>, two independent replicas behind a good load balancer have availability \u2248 <strong>1 \u2013 (1\u2013A)\u00b2<\/strong> (and so on), assuming <strong>independent<\/strong> failures. 
Correlation kills this benefit\u2014so separate replicas across failure domains (different AZs\/regions, power, network, versions).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Design principles<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Eliminate SPOFs:<\/strong> control planes, queues, caches, secrets stores, DNS, load balancers, and CI\/CD paths all need redundancy or fast recovery.<\/li>\n\n\n\n<li><strong>Isolate failure domains:<\/strong> spread replicas across AZs; don\u2019t co-locate primaries and standbys.<\/li>\n\n\n\n<li><strong>Diversity beats duplication:<\/strong> different versions, hardware, or providers reduce correlated risk.<\/li>\n\n\n\n<li><strong>Automate failover:<\/strong> health checks, timeouts, circuit breakers, and quick DNS\/LB re-routing.<\/li>\n\n\n\n<li><strong>Right-size capacity:<\/strong> spare headroom for failover (e.g., N+1) and pre-scale if needed.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Trade-offs &amp; pitfalls<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cost vs reliability:<\/strong> more replicas, more money. 
Tie decisions to SLO\/error-budget math.<\/li>\n\n\n\n<li><strong>Complexity:<\/strong> multi-region state is hard (consistency, latency, split-brain).<\/li>\n\n\n\n<li><strong>Hidden coupling:<\/strong> two \u201credundant\u201d services sharing one database = still a SPOF.<\/li>\n\n\n\n<li><strong>False redundancy:<\/strong> two pods on one node or one AZ adds little resilience.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">What to monitor to prove redundancy works<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Per-AZ\/region health<\/strong> and synthetic checks (not just aggregate).<\/li>\n\n\n\n<li><strong>Failover time<\/strong> (MTTR) and success rate of automated promotions.<\/li>\n\n\n\n<li><strong>Quorum \/ ISR health<\/strong> (for Kafka\/etcd\/Consul), replication lag, and RPO\/RTO.<\/li>\n\n\n\n<li><strong>Capacity headroom<\/strong> after a node\/AZ loss (can you still meet SLO?).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Test it (don\u2019t just hope)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Game days \/ chaos experiments:<\/strong> kill a node, drain an AZ, sever a NAT gateway, block a dependency; verify traffic stays healthy and alerts are actionable.<\/li>\n\n\n\n<li><strong>Runbooks &amp; drills:<\/strong> promote replicas, restore from backups, and rehearse DNS\/LB failover.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Concrete examples (EKS\/AWS flavored)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Stateless services:<\/strong> <code>replicas: 3+<\/code>, <strong>PodDisruptionBudget<\/strong>, <strong>Pod Topology Spread<\/strong> across 3 AZs, <strong>HPA<\/strong> with spare headroom; ALB\/NLB across subnets in all AZs.<\/li>\n\n\n\n<li><strong>Stateful stores:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>RDS\/Aurora Multi-AZ<\/strong>, cross-region replica for DR; test failovers.<\/li>\n\n\n\n<li><strong>Kafka<\/strong> (or MSK\/Confluent): replication factor \u22653, 
<code>min.insync.replicas=2<\/code>, rack-aware across AZs.<\/li>\n\n\n\n<li><strong>Redis\/ElastiCache:<\/strong> cluster mode enabled with multi-AZ, automatic failover.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Storage &amp; DNS:<\/strong> S3 with versioning + (if needed) cross-region replication; Route 53 health-check + failover\/latency records.<\/li>\n\n\n\n<li><strong>Control plane dependencies:<\/strong> multiple NAT gateways (per AZ), duplicate VPC endpoints for critical services, redundant CI runners, dual logging\/metrics paths when feasible.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Quick checklist<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do we meet capacity with <strong>one node\/AZ down<\/strong>?<\/li>\n\n\n\n<li>Are <strong>replicas spread across AZs<\/strong> and enforced by policy?<\/li>\n\n\n\n<li>Is failover <strong>automatic<\/strong>, <strong>observed<\/strong>, and <strong>rehearsed<\/strong>?<\/li>\n\n\n\n<li>Are <strong>dependencies<\/strong> (DB, cache, queue, DNS, secrets) redundant too?<\/li>\n\n\n\n<li>Do monitors alert on <strong>loss of redundancy<\/strong> (e.g., quorum at risk), not just total outage?<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">4 Pillars of High Availability<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"987\" src=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1024x987.png\" alt=\"\" class=\"wp-image-1011\" srcset=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1024x987.png 1024w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-300x289.png 300w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-768x740.png 768w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Redundancy is the deliberate 
duplication of critical components or paths so that a failure doesn\u2019t violate your SLOs. Put simply: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1010","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>What is Redundancy? - SRE School<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What is Redundancy? - SRE School\" \/>\n<meta property=\"og:description\" content=\"Redundancy is the deliberate duplication of critical components or paths so that a failure doesn\u2019t violate your SLOs. 
Put simply: [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/\" \/>\n<meta property=\"og:site_name\" content=\"SRE School\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-02T08:40:20+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-02T08:40:21+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1001\" \/>\n\t<meta property=\"og:image:height\" content=\"471\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Rajesh Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Rajesh Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/\",\"url\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/\",\"name\":\"What is Redundancy? 
- SRE School\",\"isPartOf\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png\",\"datePublished\":\"2025-09-02T08:40:20+00:00\",\"dateModified\":\"2025-09-02T08:40:21+00:00\",\"author\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\"},\"breadcrumb\":{\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage\",\"url\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png\",\"contentUrl\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png\",\"width\":1001,\"height\":471},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/sreschool.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What is Redundancy?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/sreschool.com\/blog\/#website\",\"url\":\"https:\/\/sreschool.com\/blog\/\",\"name\":\"SRESchool\",\"description\":\"Master SRE. Build Resilient Systems. 
Lead the Future of Reliability\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/sreschool.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\",\"name\":\"Rajesh Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"caption\":\"Rajesh Kumar\"},\"sameAs\":[\"http:\/\/sreschool.com\/blog\"],\"url\":\"https:\/\/sreschool.com\/blog\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What is Redundancy? - SRE School","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/","og_locale":"en_US","og_type":"article","og_title":"What is Redundancy? - SRE School","og_description":"Redundancy is the deliberate duplication of critical components or paths so that a failure doesn\u2019t violate your SLOs. 
Put simply: [&hellip;]","og_url":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/","og_site_name":"SRE School","article_published_time":"2025-09-02T08:40:20+00:00","article_modified_time":"2025-09-02T08:40:21+00:00","og_image":[{"width":1001,"height":471,"url":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png","type":"image\/png"}],"author":"Rajesh Kumar","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Rajesh Kumar","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/","url":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/","name":"What is Redundancy? - SRE School","isPartOf":{"@id":"https:\/\/sreschool.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage"},"image":{"@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage"},"thumbnailUrl":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png","datePublished":"2025-09-02T08:40:20+00:00","dateModified":"2025-09-02T08:40:21+00:00","author":{"@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201"},"breadcrumb":{"@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sreschool.com\/blog\/what-is-redundancy\/"]}]},{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#primaryimage","url":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png","contentUrl":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/09\/image-1.png","width":1001,"height":471},{"@type":"BreadcrumbList","@id":"https:\/\/sreschool.com\/blog\/what-is-redundancy\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sreschool.com\/blog\/"},{"@type":"List
Item","position":2,"name":"What is Redundancy?"}]},{"@type":"WebSite","@id":"https:\/\/sreschool.com\/blog\/#website","url":"https:\/\/sreschool.com\/blog\/","name":"SRESchool","description":"Master SRE. Build Resilient Systems. Lead the Future of Reliability","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sreschool.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Person","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201","name":"Rajesh Kumar","image":{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","caption":"Rajesh 
Kumar"},"sameAs":["http:\/\/sreschool.com\/blog"],"url":"https:\/\/sreschool.com\/blog\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/1010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/comments?post=1010"}],"version-history":[{"count":1,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/1010\/revisions"}],"predecessor-version":[{"id":1013,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/1010\/revisions\/1013"}],"wp:attachment":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/media?parent=1010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/categories?post=1010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/tags?post=1010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}