
Redundancy is the deliberate duplication of critical components or paths so that a failure doesn’t violate your SLOs. Put simply: remove single points of failure (SPOFs) and make sure something else can take over fast enough that users don’t notice.
Where you add redundancy (failure domains)
- Process / pod: multiple workers for the same service.
- Host / node: more than one VM/node per service tier.
- Availability Zone (AZ): replicas spread across ≥2 AZs.
- Region: active-active or active-passive between regions.
- Vendor: multi-provider or alternate managed service (only when justified).
Common patterns
- N+1 / N+M: keep at least one (or M) units of spare capacity beyond steady-state needs.
- 2N (“mirrored”): two full-capacity stacks; either can serve 100%.
- Active-active: all sites handle traffic; failover is mostly automatic and fast.
- Active-passive: a hot/warm standby takes over on failure (some failover time).
- Quorum-based replication: e.g., 3 or 5 nodes (Raft/Paxos) so a majority can proceed (see the sketch after this list).
- Erasure coding / parity: data survives disk/node loss without full duplication.
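To make the quorum pattern concrete, here is a minimal Python sketch of how majority quorums translate into tolerated failures; the function names are illustrative and not tied to any particular consensus library.

```python
# Minimal sketch: failure tolerance of a majority-quorum cluster (Raft/Paxos style).

def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority can still elect a leader and commit writes."""
    return cluster_size - quorum_size(cluster_size)

for n in (3, 5, 7):
    print(f"{n} nodes: quorum = {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes -> tolerates 1, 5 -> 2, 7 -> 3; even cluster sizes add cost without adding tolerance.
```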
How redundancy improves reliability
If one replica has availability A, two independent replicas behind a good load balancer have availability ≈ 1 – (1–A)² (and so on), assuming independent failures. Correlation kills this benefit—so separate replicas across failure domains (different AZs/regions, power, network, versions).
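A quick numeric sketch of that formula, assuming fully independent replicas (which real systems rarely achieve):

```python
# Combined availability of n independent replicas, each with availability a.
def combined_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

a = 0.999  # one replica at 99.9% ("three nines")
print(f"{combined_availability(a, 2):.6%}")  # ~99.9999% -> two replicas give roughly "six nines"
print(f"{combined_availability(a, 3):.7%}")  # higher still, but only if failures are uncorrelated
```

In practice, correlated failures (shared AZ, shared deploy, shared dependency) put a much lower ceiling on the real number, which is why the spreading advice above matters more than the exponent.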
Design principles
- Eliminate SPOFs: control planes, queues, caches, secrets stores, DNS, load balancers, and CI/CD paths all need redundancy or fast recovery.
- Isolate failure domains: spread replicas across AZs; don’t co-locate primaries and standbys.
- Diversity beats duplication: different versions, hardware, or providers reduce correlated risk.
- Automate failover: health checks, timeouts, circuit breakers, and quick DNS/LB re-routing (see the sketch after this list).
- Right-size capacity: spare headroom for failover (e.g., N+1) and pre-scale if needed.
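As a sketch of the "automate failover" principle, here is a minimal circuit breaker in Python; the thresholds and the primary/fallback callables are illustrative assumptions, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, route around the primary."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None = closed (traffic flows to the primary)

    def call(self, primary, fallback):
        # While open, skip the primary until the reset timeout elapses, then half-open.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None
            self.failures = 0
        try:
            result = primary()
            self.failures = 0  # any success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open: send traffic to the fallback
            return fallback()

# Usage (hypothetical callables): breaker.call(query_primary_db, query_read_replica)
```

In production you would normally lean on load-balancer health checks or an existing resilience library rather than hand-rolling this, but the closed/open/half-open states are the same.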
Trade-offs & pitfalls
- Cost vs reliability: more replicas, more money. Tie decisions to SLO/error-budget math.
- Complexity: multi-region state is hard (consistency, latency, split-brain).
- Hidden coupling: two “redundant” services sharing one database = still a SPOF.
- False redundancy: two pods on one node or one AZ adds little resilience.
What to monitor to prove redundancy works
- Per-AZ/region health and synthetic checks (not just aggregate).
- Failover time (MTTR) and success rate of automated promotions.
- Quorum / ISR health (for Kafka/etcd/Consul), replication lag, and RPO/RTO.
- Capacity headroom after a node/AZ loss (can you still meet SLO?).
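The last point is simple arithmetic you can automate; a sketch, assuming you know per-AZ serving capacity and peak demand (the numbers below are made up):

```python
# Can the remaining AZs absorb peak load if the largest AZ is lost?
def survives_az_loss(capacity_per_az: dict, peak_load: float) -> bool:
    remaining = sum(capacity_per_az.values()) - max(capacity_per_az.values())
    return remaining >= peak_load

capacity = {"us-east-1a": 400, "us-east-1b": 400, "us-east-1c": 400}  # requests/sec, illustrative
print(survives_az_loss(capacity, peak_load=700))  # True: 800 rps of capacity remain
print(survives_az_loss(capacity, peak_load=900))  # False: alert on loss of headroom, not just on outage
```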
Test it (don’t just hope)
- Game days / chaos experiments: kill a node, drain an AZ, sever a NAT gateway, block a dependency; verify traffic stays healthy and alerts are actionable (see the probe sketch below).
- Runbooks & drills: promote replicas, restore from backups, and rehearse DNS/LB failover.
Concrete examples (EKS/AWS flavored)
- Stateless services: `replicas: 3+`, PodDisruptionBudget, Pod Topology Spread across 3 AZs, HPA with spare headroom; ALB/NLB across subnets in all AZs (see the spread check after this list).
- Stateful stores:
  - RDS/Aurora Multi-AZ, cross-region replica for DR; test failovers.
  - Kafka (or MSK/Confluent): replication factor ≥3, `min.insync.replicas=2`, rack-aware across AZs.
  - Redis/ElastiCache: cluster mode enabled with multi-AZ, automatic failover.
- Storage & DNS: S3 with versioning + (if needed) cross-region replication; Route 53 health-check + failover/latency records.
- Control plane dependencies: multiple NAT gateways (per AZ), duplicate VPC endpoints for critical services, redundant CI runners, dual logging/metrics paths when feasible.
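A small sketch of the AZ-spread check referenced above: flag workloads whose replicas all landed in one zone. The pod-to-zone mapping is illustrative and would normally be derived from `kubectl get pods -o wide` plus node zone labels.

```python
from collections import Counter

# Flag "false redundancy": replicas that all landed in a single AZ.
pod_zones = {
    "checkout-7d4f9-abc12": "us-east-1a",
    "checkout-7d4f9-def34": "us-east-1a",
    "checkout-7d4f9-ghi56": "us-east-1b",
}

zones = Counter(pod_zones.values())
if len(zones) < 2:
    print("WARNING: all replicas in one AZ -- no AZ-level redundancy")
else:
    print(f"replicas span {len(zones)} AZs: {dict(zones)}")
```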
Quick checklist
- Do we meet capacity with one node/AZ down?
- Are replicas spread across AZs and enforced by policy?
- Is failover automatic, observed, and rehearsed?
- Are dependencies (DB, cache, queue, DNS, secrets) redundant too?
- Do monitors alert on loss of redundancy (e.g., quorum at risk), not just total outage?
4 Pillars of High Availability
