What is Fault Tolerance?

Fault tolerance is a system’s ability to keep meeting its SLOs despite expected failures—machines dying, networks flaking, processes crashing, disks filling—without human intervention. It’s the practical outcome of designing for failure: the system either continues normally or degrades gracefully when parts break.

How it differs from related terms

  • Redundancy: the means (extra components/paths).
  • High availability: the result (little downtime).
  • Resilience: broader ability to absorb, adapt, and recover (includes human/ops).

What “good” fault tolerance looks like

  • No single points of failure across failure domains (process → node → AZ → region).
  • Automatic detection and recovery (health checks, failover, restarts).
  • Predictable degradation (shed load, read-only mode) instead of hard outages.
  • Sufficient spare capacity (N+1/N+M) to absorb losses.
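
To make the N+1 headroom point concrete, here is a minimal sketch in Python with invented numbers; the function name and traffic figures are illustrative assumptions, not measurements from any real system:

```python
def survives_loss(total_replicas: int, per_replica_rps: float,
                  peak_rps: float, domains: int, margin: float = 0.2) -> bool:
    """N+1-style check: after losing one failure domain (node/AZ),
    can the surviving replicas still serve peak load plus a margin?"""
    # Worst case with even spread: one domain takes ceil(total/domains)
    # replicas with it.
    lost = -(-total_replicas // domains)          # ceiling division
    surviving_capacity = (total_replicas - lost) * per_replica_rps
    return surviving_capacity >= peak_rps * (1 + margin)

# Illustrative: 6 replicas across 3 AZs, 400 RPS each, 1200 RPS peak.
# Losing an AZ leaves 4 * 400 = 1600 RPS vs. 1440 RPS required -> True.
print(survives_loss(total_replicas=6, per_replica_rps=400,
                    peak_rps=1200, domains=3))
```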

Core techniques

  • Redundancy & isolation: multi-replica services spread across AZs; bulkheads to contain the blast radius.
  • Automated failover: leader election, health-checked load balancing, fast DNS/LB re-routing.
  • Idempotency + retries with backoff/jitter; timeouts & circuit breakers to avoid cascading failure (see the sketch after this list).
  • Quorum/replication: e.g., Raft/Paxos, Kafka RF≥3 with min.insync.replicas=2.
  • Data durability: snapshots, multi-AZ/region replicas, erasure coding.
  • Graceful degradation: feature flags to disable non-critical work, serve cached results, partial results.
  • Self-healing: auto-restart/replace (Kubernetes controllers, ASGs).
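
The retries/circuit-breaker bullet above is the one most often implemented directly in application code. Below is a minimal sketch in Python using only the standard library; the class, thresholds, and timings are illustrative assumptions rather than any particular library's API, and the wrapped call must be idempotent and carry its own per-request timeout:

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected fast."""


class CircuitBreaker:
    """Tiny circuit breaker: open after N consecutive failures,
    reject calls for `cooldown` seconds, then let one probe through."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open, failing fast")
            # Half-open: allow one probe; counters reset on success below.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def retry_with_backoff(fn, attempts: int = 4, base: float = 0.2, cap: float = 5.0):
    """Retry an idempotent call with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                      # don't hammer an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Wrapping the dependency call in the breaker and the breaker in the retry loop rides out transient blips while refusing to pile load onto a downstream that is already failing.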

What to measure

  • SLIs: availability, latency, error rate (per AZ/region, not just global; see the example after this list).
  • MTTR & failover time (how fast a healthy replica takes over).
  • Redundancy health: quorum size, ISR status, replication lag.
  • Headroom after failure: can you meet SLO with one node/AZ down?
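
To see why the "per AZ/region, not just global" qualifier matters, a tiny Python example with made-up request counts: the blended availability can still look acceptable while one zone is effectively down.

```python
# Requests and errors per AZ over some window (numbers are invented).
counts = {
    "us-east-1a": {"total": 500_000, "errors": 500},
    "us-east-1b": {"total": 500_000, "errors": 500},
    "us-east-1c": {"total": 10_000,  "errors": 4_000},  # sick AZ, little traffic
}

global_total = sum(c["total"] for c in counts.values())
global_errors = sum(c["errors"] for c in counts.values())
print(f"global availability: {1 - global_errors / global_total:.4f}")  # ~0.9950

for az, c in counts.items():
    # Per-AZ view: 1a and 1b sit at 0.9990 while 1c is at 0.6000.
    print(f"{az}: {1 - c['errors'] / c['total']:.4f}")
```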

How to verify (continuously)

  • Game days / chaos tests: kill nodes, cut an AZ, block a dependency; confirm the service stays within SLO and alerts are actionable (a minimal sketch follows this list).
  • Runbooks & drills: rehearse promotions, restores, and traffic shifts.
  • Alert on loss of tolerance: e.g., quorum at risk, only 1 AZ serving.
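
A first game day can start very small. The sketch below (Python standard library plus kubectl on the PATH) deletes one random pod behind a service and then polls a health endpoint; the namespace, label selector, and URL are placeholders, and purpose-built chaos tooling is the better choice once drills go beyond this:

```python
import random
import subprocess
import time
import urllib.request

NAMESPACE = "prod"                      # placeholder values
LABEL = "app=checkout"
HEALTH_URL = "https://checkout.example.com/healthz"

# Pick one pod at random behind the deployment and delete it.
pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL,
     "-o", "jsonpath={.items[*].metadata.name}"],
    capture_output=True, text=True, check=True,
).stdout.split()
victim = random.choice(pods)
subprocess.run(["kubectl", "delete", "pod", victim, "-n", NAMESPACE], check=True)
print(f"deleted {victim}")

# Poll the health endpoint for a minute; any failed probe is a finding.
failures = 0
for _ in range(30):
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    failures += 0 if ok else 1
    time.sleep(2)
print(f"{failures} failed probes out of 30")
```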

Common pitfalls

  • Hidden SPOFs: shared DB, cache, NAT, or CI/CD path behind “redundant” apps.
  • Correlated failures: all replicas in one AZ/version; dependency coupling.
  • Insufficient capacity: N replicas but no spare to handle failover load.
  • Unbounded retries: amplify an incident; always pair with timeouts/circuit breakers.

Concrete patterns (AWS/EKS flavored)

  • Stateless services: 3+ replicas, PDBs, Pod Topology Spread across 3 AZs; ALB/NLB across subnets in all AZs; HPA with spare headroom (a spread-check sketch follows this list).
  • Stateful stores: RDS/Aurora Multi-AZ + tested failover; Kafka/MSK RF≥3 with rack-aware placement; Redis/ElastiCache with multi-AZ and auto-failover.
  • Global stance: active-active or active-passive across regions for tier-1 APIs; Route 53 health-check failover; S3 versioning + cross-region replication where needed.
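
One way to continuously verify the "spread across 3 AZs" part of the stateless pattern is to count pods per zone. Here is a sketch using the official kubernetes Python client, assuming nodes carry the standard topology.kubernetes.io/zone label; the namespace and selector are placeholders:

```python
from collections import Counter

from kubernetes import client, config   # pip install kubernetes

config.load_kube_config()                # or load_incluster_config() in-cluster

NAMESPACE = "prod"                       # placeholder values
LABEL = "app=checkout"

v1 = client.CoreV1Api()
zones: Counter = Counter()
node_zone = {}                           # cache node -> zone lookups

for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL).items:
    node = pod.spec.node_name
    if not node:                         # pod not scheduled yet
        continue
    if node not in node_zone:
        labels = v1.read_node(node).metadata.labels or {}
        node_zone[node] = labels.get("topology.kubernetes.io/zone", "unknown")
    zones[node_zone[node]] += 1

print(dict(zones))
if len(zones) < 2:
    print("WARNING: replicas are not spread across at least two AZs")
```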

Rule of thumb: design for one unit down (node/AZ) without breaching SLOs, test it regularly, and alert when you lose that safety margin. From there, map each tier of your topology to concrete settings (k8s manifests plus AWS configuration) that hit explicit fault-tolerance targets.

Fault Tolerance vs. High Availability