

Fault tolerance is a system’s ability to keep meeting its SLOs despite expected failures—machines dying, networks flaking, processes crashing, disks filling—without human intervention. It’s the practical outcome of designing for failure: the system either continues normally or degrades gracefully when parts break.

How it differs from related terms
- Redundancy: the means (extra components/paths).
- High availability: the result (little downtime).
- Resilience: broader ability to absorb, adapt, and recover (includes human/ops).
What “good” fault tolerance looks like
- No single points of failure across failure domains (process → node → AZ → region).
- Automatic detection and recovery (health checks, failover, restarts).
- Predictable degradation (shed load, read-only mode) instead of hard outages.
- Sufficient spare capacity (N+1/N+M) to absorb losses.

Core techniques
- Redundancy & isolation: multi-replica services, spread across AZs; bulkheads to stop blast radius.
- Automated failover: leader election, health-checked load balancing, fast DNS/LB re-routing.
- Idempotency + retries with backoff/jitter; timeouts & circuit breakers to avoid cascading failure (a minimal sketch follows this list).
- Quorum/replication: e.g., Raft/Paxos, Kafka RF≥3 with `min.insync.replicas=2`.
- Data durability: snapshots, multi-AZ/region replicas, erasure coding.
- Graceful degradation: feature flags to disable non-critical work, serve cached results, partial results.
- Self-healing: auto-restart/replace (Kubernetes controllers, ASGs).
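
A minimal Python sketch of the retry/backoff/circuit-breaker item above; the function name, thresholds, and delays are illustrative assumptions, not any specific library's API:

```python
import random
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; half-opens after `reset_after` seconds."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cooldown has passed.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


def call_with_retries(func, breaker, attempts=3, base_delay=0.1, max_delay=2.0, timeout=1.0):
    """Call `func(timeout=...)` with bounded retries, exponential backoff, and full jitter."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open; failing fast")
        try:
            result = func(timeout=timeout)  # func must be idempotent for retries to be safe
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

Bounded attempts, capped and jittered sleeps, and the fast-fail path when the breaker is open are what keep retries from amplifying an incident (see the pitfalls below).
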
What to measure
- SLIs: availability, latency, error rate (per AZ/region—not just global).
- MTTR & failover time (how fast a healthy replica takes over).
- Redundancy health: quorum size, ISR status, replication lag.
- Headroom after failure: can you meet SLO with one node/AZ down?
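
A back-of-the-envelope check for that last question; the replica counts, per-replica throughput, and peak load are assumed numbers for illustration:

```python
# Can we still serve peak traffic after losing the largest failure domain (one AZ)?
replicas_per_az = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 4}  # assumed topology
capacity_per_replica_rps = 250  # measured per-replica throughput at SLO latency (assumed)
peak_load_rps = 2200            # observed peak traffic (assumed)

total_replicas = sum(replicas_per_az.values())
normal_capacity = total_replicas * capacity_per_replica_rps
after_az_loss = (total_replicas - max(replicas_per_az.values())) * capacity_per_replica_rps

print(f"normal: {normal_capacity} rps, after losing the biggest AZ: {after_az_loss} rps")
print("meets SLO with one AZ down" if after_az_loss >= peak_load_rps else "NOT enough headroom")
```

With these numbers, twelve replicas look generous, yet losing one AZ drops capacity below peak, which is exactly the "insufficient capacity" pitfall listed below.
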
How to verify (continuously)
- Game days / chaos tests: kill nodes, cut an AZ, block a dependency; confirm the service stays within SLO and alerts are actionable (a drill sketch follows this list).
- Runbooks & drills: rehearse promotions, restores, and traffic shifts.
- Alert on loss of tolerance: e.g., quorum at risk, only 1 AZ serving.
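
A sketch of such a drill; the namespace, label selector, health URL, and the 300 ms latency SLO are hypothetical, and it assumes `kubectl` access to a non-production cluster:

```python
import json
import random
import subprocess
import time
import urllib.request

NAMESPACE = "checkout"                      # hypothetical namespace
SELECTOR = "app=checkout-api"               # hypothetical label selector
HEALTH_URL = "https://checkout.example.internal/healthz"  # hypothetical endpoint

# Pick a random pod behind the service and delete it to simulate a crash.
pods = json.loads(subprocess.check_output(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json"]))
victim = random.choice(pods["items"])["metadata"]["name"]
subprocess.check_call(["kubectl", "delete", "pod", victim, "-n", NAMESPACE, "--wait=false"])
print(f"killed {victim}; probing the service for 60s")

# Confirm the service keeps answering within the SLO while a replacement pod starts.
bad_probes = 0
for _ in range(60):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=1.0) as resp:
            ok = resp.status == 200 and (time.monotonic() - start) < 0.3  # latency SLO (assumed)
    except Exception:
        ok = False
    bad_probes += 0 if ok else 1
    time.sleep(1)

print(f"{bad_probes}/60 bad probes; fail the drill if this breaches the error-rate SLO")
```
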
Common pitfalls
- Hidden SPOFs: shared DB, cache, NAT, or CI/CD path behind “redundant” apps.
- Correlated failures: all replicas in one AZ/version; dependency coupling.
- Insufficient capacity: N replicas but no spare to handle failover load.
- Unbounded retries: amplify an incident; always pair with timeouts/circuit breakers.

Concrete patterns (AWS/EKS flavored)
- Stateless services: 3+ replicas, PDBs, Pod Topology Spread across 3 AZs; ALB/NLB across subnets in all AZs; HPA with spare headroom (a spread-check sketch follows this list).
- Stateful stores: RDS/Aurora Multi-AZ + tested failover; Kafka/MSK RF≥3 with rack-aware placement; Redis/ElastiCache with multi-AZ and auto-failover.
- Global stance: active-active or active-passive across regions for tier-1 APIs; Route 53 health-check failover; S3 versioning + cross-region replication where needed.
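
A small sketch of how you might continuously verify the "spread across 3 AZs" stance; it assumes the `kubernetes` Python client and a valid kubeconfig, and the namespace and label selector are hypothetical:

```python
from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Count running replicas per zone using the node's well-known topology label.
pods = v1.list_namespaced_pod("checkout", label_selector="app=checkout-api")  # hypothetical names
zones = Counter()
for pod in pods.items:
    if pod.spec.node_name:  # pending pods have no node yet
        node = v1.read_node(pod.spec.node_name)
        zones[node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")] += 1

print(dict(zones))
if len(zones) < 3:
    print("WARNING: replicas are not spread across 3 AZs; a single-AZ outage may breach SLO")
```
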
Rule of thumb: design for one unit down (node/AZ) without breaching SLOs, test it regularly, and alert when you lose that safety margin. If you share your current topology/SLOs, I can map each tier to specific configs (k8s YAML + AWS settings) to reach concrete fault-tolerance targets.

Fault tolerance vs. high availability
