What is fault tolerance?

Fault tolerance is a system’s ability to keep meeting its service level objectives (SLOs) despite expected failures (machines dying, networks flaking, processes crashing, disks filling) without human intervention. It is the practical outcome of designing for failure: when parts break, the system either continues normally or degrades gracefully.

How it differs from related terms

  • Redundancy: the means (extra components/paths).
  • High availability: the result (little downtime).
  • Resilience: broader ability to absorb, adapt, and recover (includes human/ops).

What “good” fault tolerance looks like

  • No single points of failure across failure domains (process → node → AZ → region).
  • Automatic detection and recovery (health checks, failover, restarts).
  • Predictable degradation (shed load, read-only mode) instead of hard outages.
  • Sufficient spare capacity (N+1/N+M) to absorb losses.
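The "predictable degradation" point can be made concrete with a small load shedder. This is a minimal sketch, not a production admission controller: the class name `LoadShedder` and the parameters `max_in_flight` and `critical_reserve` are hypothetical names chosen for illustration. It rejects work immediately once a concurrency limit is reached and holds back a few slots for critical requests.

```python
import threading


class LoadShedder:
    """Admit requests up to a concurrency limit; reject the rest immediately.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, max_in_flight, critical_reserve=0):
        self.max_in_flight = max_in_flight
        # Slots held back so critical requests still get in under overload.
        self.critical_reserve = critical_reserve
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical=False):
        """Return True if the request may proceed, False if it should be shed."""
        if critical:
            limit = self.max_in_flight
        else:
            limit = self.max_in_flight - self.critical_reserve
        with self._lock:
            if self.in_flight >= limit:
                return False  # shed: answer quickly with an error or a cached result
            self.in_flight += 1
            return True

    def release(self):
        """Call once for every request that was admitted and has finished."""
        with self._lock:
            self.in_flight = max(0, self.in_flight - 1)
```

A request handler would call `try_acquire()` on entry, return a fast rejection (or a cached or partial response) when it gets `False`, and call `release()` in a `finally` block otherwise. Rejecting early keeps latency bounded for the requests that are admitted, which is usually preferable to letting every request slow down together.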

Core techniques

  • Redundancy & isolation: multi-replica services spread across AZs; bulkheads to limit blast radius.
  • Automated failover: leader election, health-checked load balancing, fast DNS/LB re-routing.
  • Idempotency + retries with backoff/jitter; timeouts & circuit breakers to avoid cascading failure.
  • Quorum/replication: e.g., Raft/Paxos for consensus; Kafka with replication factor (RF) ≥ 3, min.insync.replicas=2, and producers using acks=all.
  • Data durability: snapshots, multi-AZ/region replicas, erasure coding.
  • Graceful degradation: feature flags to disable non-critical work, serve cached results, partial results.
  • Self-healing: auto-restart/replace (Kubernetes controllers, ASGs).
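The "retries with backoff/jitter" technique from the list above can be sketched as follows. This is an illustrative sketch under stated assumptions: `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names, the defaults are arbitrary, and the jitter scheme shown is "full jitter" (a random delay between zero and a capped exponential ceiling), which is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry (max_attempts - 1 delays in total)."""
    for attempt in range(max_attempts - 1):
        # Exponential ceiling, capped, then scaled by a random factor from rng().
        yield min(cap, base * 2 ** attempt) * rng()


def call_with_retries(op, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts is exhausted.

    op must be idempotent: it may run more than once per logical request.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return op()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the failure
            sleep(delay)
```

The attempt cap is what keeps retries bounded, and the jitter spreads retries from many clients over time so they do not arrive in synchronized waves. The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. `op` is expected to enforce its own timeout.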

What to measure

  • SLIs: availability, latency, error rate (per AZ/region—not just global).
  • MTTR & failover time (how fast a healthy replica takes over).
  • Redundancy health: quorum size, Kafka in-sync replica (ISR) status, replication lag.
  • Headroom after failure: can you meet SLO with one node/AZ down?
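The headroom question can be checked with simple arithmetic. The sketch below assumes load spreads evenly across the surviving units and that each unit has a known safe capacity; the function names and the example figures are hypothetical.

```python
import math


def survives_unit_loss(units, capacity_per_unit, peak_load, units_lost=1):
    """True if the remaining units can carry peak_load after losing units_lost.

    Assumes load is spread evenly across the units that remain.
    """
    remaining = units - units_lost
    return remaining > 0 and remaining * capacity_per_unit >= peak_load


def units_needed(capacity_per_unit, peak_load, spares=1):
    """Units required to serve peak_load plus `spares` extra (N+1 when spares=1)."""
    return math.ceil(peak_load / capacity_per_unit) + spares
```

For example, three AZs that can each serve 400 requests per second do not survive the loss of one AZ at a 900 rps peak, because the two that remain offer only 800 rps. `units_needed(400, 900)` returns 4: three units to carry the peak plus one spare.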

How to verify (continuously)

  • Game days / chaos tests: kill nodes, cut an AZ, block a dependency; confirm service stays within SLO and alerts are actionable.
  • Runbooks & drills: rehearse promotions, restores, and traffic shifts.
  • Alert on loss of tolerance: e.g., quorum at risk, only 1 AZ serving.
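An alert on "quorum at risk" can be derived from the majority rule used by consensus protocols such as Raft and Paxos: a cluster of n voters needs n // 2 + 1 healthy members to make progress. The sketch below is illustrative only; the function names and the status labels ("ok", "at_risk", "lost") are assumptions, not part of any particular monitoring tool.

```python
def majority(cluster_size):
    """Smallest number of voters that forms a majority quorum."""
    return cluster_size // 2 + 1


def failures_tolerated(cluster_size):
    """How many voters can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def quorum_status(cluster_size, healthy):
    """Classify redundancy health for alerting.

    "ok"      : at least one more failure can be absorbed
    "at_risk" : quorum holds, but one more failure loses it
    "lost"    : fewer healthy voters than a majority
    """
    quorum = majority(cluster_size)
    if healthy < quorum:
        return "lost"
    if healthy == quorum:
        return "at_risk"
    return "ok"
```

With three voters, `quorum_status(3, 2)` returns "at_risk": the cluster still works, but the safety margin is gone. That is the moment to page someone, not when the third member also fails.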

Common pitfalls

  • Hidden single points of failure (SPOFs): shared DB, cache, NAT, or CI/CD path behind “redundant” apps.
  • Correlated failures: all replicas in one AZ/version; dependency coupling.
  • Insufficient capacity: N replicas but no spare to handle failover load.
  • Unbounded retries: they amplify an incident; cap the number of attempts and always pair retries with timeouts and circuit breakers.
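To illustrate the circuit breaker mentioned in the last pitfall, here is a minimal single-threaded sketch. `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical names and values; real libraries differ in details such as how they count failures and how many trial calls they allow while half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cool-down.

    Illustrative and not thread-safe; names and defaults are assumptions.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let this call through as a trial.
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already struggling, which gives that dependency room to recover and stops retries from piling on.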

Concrete patterns (AWS/EKS flavored)

  • Stateless services: 3+ replicas, PDBs, Pod Topology Spread across 3 AZs; ALB/NLB across subnets in all AZs; HPA with spare headroom.
  • Stateful stores: RDS/Aurora Multi-AZ + tested failover; Kafka/MSK RF≥3 with rack-aware placement; Redis/ElastiCache with multi-AZ and auto-failover.
  • Global stance: active-active or active-passive across regions for tier-1 APIs; Route 53 health-check failover; S3 versioning + cross-region replication where needed.

Rule of thumb: design for one unit down (node/AZ) without breaching SLOs, test it regularly, and alert when you lose that safety margin.
