What is Redundancy?

Redundancy is the deliberate duplication of critical components or paths so that a failure doesn’t violate your SLOs. Put simply: remove single points of failure (SPOFs) and make sure something else can take over fast enough that users don’t notice.

Where you add redundancy (failure domains)

  • Process / pod: multiple workers for the same service.
  • Host / node: more than one VM/node per service tier.
  • Availability Zone (AZ): replicas spread across ≥2 AZs.
  • Region: active-active or active-passive between regions.
  • Vendor: multi-provider or alternate managed service (only when justified).

Common patterns

  • N+1 / N+M: provision one (or M) spare capacity units beyond the N that steady-state load requires.
  • 2N (“mirrored”): two full-capacity stacks; either can serve 100%.
  • Active-active: all sites handle traffic; failover is mostly automatic and fast.
  • Active-passive: a hot/warm standby takes over on failure (some failover time).
  • Quorum-based replication: e.g., 3 or 5 nodes (Raft/Paxos) so a majority can proceed.
  • Erasure coding / parity: data survives disk/node loss without full duplication.
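
To make the quorum pattern above concrete, here is a minimal sketch of the majority arithmetic. The function names `quorum_size` and `tolerated_failures` are illustrative, not taken from any Raft or Paxos library.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of nodes that can fail while a majority still remains."""
    return n - quorum_size(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

A 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is why odd sizes such as 3 or 5 are the usual choice.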

How redundancy improves reliability

If one replica has availability A, then n independent replicas behind a good load balancer have availability ≈ 1 – (1–A)^n; for two replicas that is 1 – (1–A)². The formula holds only when failures are independent. Correlated failures erase the benefit, so separate replicas across failure domains (different AZs/regions, power, network, versions).
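
As a rough illustration of that formula, the sketch below computes the combined availability of identical replicas and the downtime it implies over a 30-day window. The helper names are hypothetical, and the result is an upper bound because it assumes fully independent failures and a load balancer that never fails.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n independent replicas, each with availability a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if n < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at once.
    return 1.0 - (1.0 - a) ** n


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected unavailable minutes over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    for n in (1, 2, 3):
        combined = parallel_availability(0.99, n)
        print(f"replicas={n} availability={combined:.6f} "
              f"downtime_min_per_30d={downtime_minutes(combined):.2f}")
```

With A = 99%, a second independent replica takes the theoretical figure to about 99.99%. Real systems land below that because some failures (bad deploys, shared dependencies) hit every replica together.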

Design principles

  1. Eliminate SPOFs: control planes, queues, caches, secrets stores, DNS, load balancers, and CI/CD paths all need redundancy or fast recovery.
  2. Isolate failure domains: spread replicas across AZs; don’t co-locate primaries and standbys.
  3. Diversity beats duplication: different versions, hardware, or providers reduce correlated risk.
  4. Automate failover: health checks, timeouts, circuit breakers, and quick DNS/LB re-routing.
  5. Right-size capacity: spare headroom for failover (e.g., N+1) and pre-scale if needed.
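
One way to picture the "automate failover" principle is a small circuit breaker in front of a primary, with a standby as the fallback. This is a simplified sketch, not a production implementation: `CircuitBreaker`, `call_with_failover`, and their parameters are names invented here, and real code would catch narrower exception types and add timeouts.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a trial
    call again once `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the primary may be attempted now."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_failover(primary: Callable[[], T], standby: Callable[[], T],
                       breaker: CircuitBreaker) -> T:
    """Use the primary unless its breaker is open; otherwise use the standby."""
    if breaker.allow():
        try:
            result = primary()
            breaker.record_success()
            return result
        except Exception:  # sketch only: narrow this in real code
            breaker.record_failure()
    return standby()
```

The breaker keeps a failing primary from being hammered on every request, and the injectable `clock` makes the cool-down easy to exercise in tests.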

Trade-offs & pitfalls

  • Cost vs reliability: more replicas, more money. Tie decisions to SLO/error-budget math.
  • Complexity: multi-region state is hard (consistency, latency, split-brain).
  • Hidden coupling: two “redundant” services sharing one database = still a SPOF.
  • False redundancy: two pods on one node, or all replicas in one AZ, add little resilience.
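
False redundancy can be caught with a simple placement check. The sketch below assumes you can export a mapping of replica name to zone (the mapping and the function name `survives_zone_loss` are hypothetical) and asks whether enough replicas remain after the worst single-zone loss.

```python
from collections import Counter
from typing import Mapping


def survives_zone_loss(replica_zones: Mapping[str, str], min_healthy: int) -> bool:
    """True if at least `min_healthy` replicas remain after losing any one zone."""
    counts = Counter(replica_zones.values())
    total = sum(counts.values())
    # Worst case: the zone holding the most replicas is the one that fails.
    worst_loss = max(counts.values(), default=0)
    return total - worst_loss >= min_healthy


if __name__ == "__main__":
    stacked = {"api-0": "az-a", "api-1": "az-a"}
    spread = {"api-0": "az-a", "api-1": "az-b", "api-2": "az-c"}
    print(survives_zone_loss(stacked, min_healthy=1))
    print(survives_zone_loss(spread, min_healthy=2))
```

The same idea works for nodes, racks, or regions: swap the zone label for whichever failure domain you want to test.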

What to monitor to prove redundancy works

  • Per-AZ/region health and synthetic checks (not just aggregate).
  • Failover time (it counts toward MTTR and RTO) and success rate of automated promotions.
  • Quorum / ISR health (for Kafka/etcd/Consul), replication lag, and RPO/RTO.
  • Capacity headroom after a node/AZ loss (can you still meet SLO?).
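
The headroom question in the last bullet comes down to simple arithmetic. This sketch assumes equally sized units (nodes or AZs) and load that redistributes evenly across the survivors; both function names are illustrative.

```python
def utilization_after_loss(current: float, units: int, lost: int = 1) -> float:
    """Per-unit utilization after `lost` of `units` equal units fail."""
    if not 0 <= lost < units:
        raise ValueError("lost must be >= 0 and smaller than units")
    return current * units / (units - lost)


def max_safe_utilization(target: float, units: int, lost: int = 1) -> float:
    """Highest steady-state utilization that stays at or below `target`
    on the surviving units after `lost` units fail."""
    if not 0 <= lost < units:
        raise ValueError("lost must be >= 0 and smaller than units")
    return target * (units - lost) / units


if __name__ == "__main__":
    print(f"{utilization_after_loss(0.60, units=3):.2f}")
    print(f"{max_safe_utilization(0.80, units=3):.2f}")
```

Three zones running at 60% each rise to about 90% when one is lost, so an alert on steady-state utilization needs a threshold well below the raw saturation point.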

Test it (don’t just hope)

  • Game days / chaos experiments: kill a node, drain an AZ, sever a NAT gateway, block a dependency; verify traffic stays healthy and alerts are actionable.
  • Runbooks & drills: promote replicas, restore from backups, and rehearse DNS/LB failover.

Concrete examples (EKS/AWS flavored)

  • Stateless services: replicas: 3+, PodDisruptionBudget, Pod Topology Spread across 3 AZs, HPA with spare headroom; ALB/NLB across subnets in all AZs.
  • Stateful stores:
    • RDS/Aurora Multi-AZ, cross-region replica for DR; test failovers.
    • Kafka (or MSK/Confluent): replication factor ≥3, min.insync.replicas=2 with producers using acks=all, rack-aware across AZs.
    • Redis/ElastiCache: cluster mode enabled with multi-AZ, automatic failover.
  • Storage & DNS: S3 with versioning + (if needed) cross-region replication; Route 53 health-check + failover/latency records.
  • Shared infrastructure dependencies: one NAT gateway per AZ, VPC endpoints for critical services attached to subnets in every AZ, redundant CI runners, dual logging/metrics paths when feasible.
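
For the Kafka settings in the list above, a quick sanity check is how many replica-holding brokers a partition can lose before producers using acks=all start getting rejected. The helper below is a hypothetical sketch of that arithmetic, not part of any Kafka client API.

```python
def kafka_write_tolerance(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas of a partition that can be offline while acks=all writes
    are still accepted."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    print(kafka_write_tolerance(3, 2))
    print(kafka_write_tolerance(3, 3))
```

With a replication factor of 3 and min.insync.replicas=2, one replica can be down and writes continue. Setting min.insync.replicas equal to the replication factor leaves no tolerance, so any single broker outage blocks acks=all producers for the affected partitions.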

Quick checklist

  • Do we meet capacity with one node/AZ down?
  • Are replicas spread across AZs and enforced by policy?
  • Is failover automatic, observed, and rehearsed?
  • Are dependencies (DB, cache, queue, DNS, secrets) redundant too?
  • Do monitors alert on loss of redundancy (e.g., quorum at risk), not just total outage?
