{"id":1674,"date":"2026-02-15T05:30:49","date_gmt":"2026-02-15T05:30:49","guid":{"rendered":"https:\/\/sreschool.com\/blog\/major-incident\/"},"modified":"2026-02-15T05:30:49","modified_gmt":"2026-02-15T05:30:49","slug":"major-incident","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/major-incident\/","title":{"rendered":"What is Major incident? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition<\/h2>\n\n\n\n<p>A major incident is a service outage or degradation causing substantial business impact that requires coordinated cross-team response. Analogy: a multi-car pileup on a highway that blocks traffic and needs traffic control, tow trucks, and medical teams. Formal: an incident declared when impact thresholds and escalation policies meet pre-defined major incident criteria.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is Major incident?<\/h2>\n\n\n\n<p>A major incident is an escalation tier for incidents that exceed routine on-call handling. It is NOT a routine bug, minor outage, or scheduled maintenance. It requires cross-discipline coordination, executive visibility, and often temporary mitigation work instead of immediate root cause fixes.<\/p>\n\n\n\n<p>Key properties and constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>High-impact: affects large user segments, revenue, or critical business processes.<\/li>\n<li>Fast escalation: declaration triggers specific communication and resource allocation.<\/li>\n<li>Time-bounded goal: focus on restoring service, minimizing harm, and preserving evidence.<\/li>\n<li>Governance: follows playbooks, runbooks, and accountability assignments.<\/li>\n<li>Post-incident: triggers detailed postmortem and remediations with timelines.<\/li>\n<\/ul>\n\n\n\n<p>Where it fits in modern cloud\/SRE workflows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detection via SLIs and alerting rules.<\/li>\n<li>Triage by on-call or triage team.<\/li>\n<li>Major incident declared when impact thresholds met.<\/li>\n<li>War-room style coordination with incident commander, communications lead, and engineering leads.<\/li>\n<li>Temporary mitigations, rollback, or failover applied.<\/li>\n<li>Transition to remediation and postmortem with corrective actions and SLO impact accounting.<\/li>\n<\/ul>\n\n\n\n<p>Diagram description (text-only):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Monitoring layer detects anomaly -&gt; Alert router evaluates severity -&gt; If severity &gt;= threshold, trigger major incident -&gt; Notify incident manager, paging, and incident workspace -&gt; Triage and implement mitigation (rollback\/failover\/scaling) -&gt; Monitor restoration -&gt; Postmortem and remediation tasks assigned.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major incident in one sentence<\/h3>\n\n\n\n<p>A major incident is a high-severity, cross-functional outage requiring immediate, coordinated response to restore service and mitigate business impact.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Major incident vs related terms<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from Major incident<\/th>\n<th>Common confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>Outage<\/td>\n<td>Narrower scope and lower impact than a major 
incident<\/td>\n<td>People call any outage major<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>Incident<\/td>\n<td>Generic term that may be low or high severity<\/td>\n<td>Not all incidents are major<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>P0<\/td>\n<td>Priority label often maps to major incident but varies<\/td>\n<td>P0 versus major sometimes conflated<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Incident report<\/td>\n<td>Post-event documentation not the live response<\/td>\n<td>Confused with live incident command<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Outage window<\/td>\n<td>Scheduled downtime not an incident<\/td>\n<td>People equate downtime with incident<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>Degradation<\/td>\n<td>Partial functionality loss, may or may not be major<\/td>\n<td>Degradation vs outage confusion<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Major outage<\/td>\n<td>Synonym in some organizations<\/td>\n<td>Terminology varies by org<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>Disaster recovery<\/td>\n<td>Broader strategy for catastrophic events<\/td>\n<td>DR is not the same as day-to-day incidents<\/td>\n<\/tr>\n<tr>\n<td>T9<\/td>\n<td>Security incident<\/td>\n<td>Involves breach and special handling<\/td>\n<td>Security incidents may be declared separately<\/td>\n<\/tr>\n<tr>\n<td>T10<\/td>\n<td>Crisis<\/td>\n<td>Business-level emergency beyond tech scope<\/td>\n<td>Crisis includes PR and legal considerations<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does Major incident matter?<\/h2>\n\n\n\n<p>Business impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Revenue loss: degraded checkout or API throughput reduces transactions.<\/li>\n<li>Brand trust: repeated major incidents reduce user retention and partner confidence.<\/li>\n<li>Regulatory risk: outages can trigger compliance reporting and fines.<\/li>\n<li>Opportunity cost: executives divert time to crisis management.<\/li>\n<\/ul>\n\n\n\n<p>Engineering impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Velocity slows as engineers shift to firefighting.<\/li>\n<li>Technical debt grows if quick fixes are not remediated.<\/li>\n<li>On-call fatigue increases, hurting retention.<\/li>\n<li>Improved practices emerge when incidents are analyzed and corrected.<\/li>\n<\/ul>\n\n\n\n<p>SRE framing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs and SLOs detect trends and set thresholds for major incident declaration.<\/li>\n<li>Error budgets guide trade-offs between feature delivery and reliability.<\/li>\n<li>Toil reduction is a primary SRE goal to avoid frequent major incidents.<\/li>\n<li>On-call rotations must reflect realistic major incident workload.<\/li>\n<\/ul>\n\n\n\n<p>Realistic \u201cwhat breaks in production\u201d examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Global auth service failing under a schema change, blocking login globally.<\/li>\n<li>Managed database provider experiencing failover loop causing request errors.<\/li>\n<li>Kubernetes control plane API throttle due to misconfigured autoscaler.<\/li>\n<li>Edge CDN misconfiguration causing cache poisoning and serving stale content.<\/li>\n<li>Payment gateway regional outage resulting in failed transactions.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Where 
is Major incident used?<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How Major incident appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge\/Network<\/td>\n<td>Global connectivity loss or DDoS<\/td>\n<td>High error rate, RTT spikes<\/td>\n<td>WAF\/CDN logs, network monitors<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Service\/API<\/td>\n<td>API 5xx surge or latency spike<\/td>\n<td>5xx rate, p95 latency<\/td>\n<td>APM, service metrics<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Application<\/td>\n<td>Authentication or core flows broken<\/td>\n<td>Error traces, user complaints<\/td>\n<td>Tracing, logs<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Data\/DB<\/td>\n<td>DB failover or corruption<\/td>\n<td>Replica lag, transaction errors<\/td>\n<td>DB monitoring, backups<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Platform\/K8s<\/td>\n<td>Control plane issues or node drain<\/td>\n<td>Pod failures, API errors<\/td>\n<td>K8s metrics, control plane logs<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>Serverless\/PaaS<\/td>\n<td>Throttling or cold-start storms<\/td>\n<td>Invocation errors, throttles<\/td>\n<td>Platform metrics, invocation logs<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>CI\/CD<\/td>\n<td>Bad deploy causing mass failures<\/td>\n<td>Deploy rollbacks, new errors<\/td>\n<td>CI logs, deployment traces<\/td>\n<\/tr>\n<tr>\n<td>L8<\/td>\n<td>Security<\/td>\n<td>Compromise detected with impact<\/td>\n<td>Alert count, unusual activity<\/td>\n<td>SIEM, IDS<\/td>\n<\/tr>\n<tr>\n<td>L9<\/td>\n<td>Observability<\/td>\n<td>Telemetry gaps during outage<\/td>\n<td>Missing metrics, delayed logs<\/td>\n<td>Monitoring tools<\/td>\n<\/tr>\n<tr>\n<td>L10<\/td>\n<td>Billing\/Cost<\/td>\n<td>Unexpected cost spike causing limits<\/td>\n<td>Budget alert, quota reached<\/td>\n<td>Cloud billing alerts<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use Major incident?<\/h2>\n\n\n\n<p>When it\u2019s necessary:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Service affecting large user base or revenue.<\/li>\n<li>Critical functionality broken for high-value flows.<\/li>\n<li>Multi-system outages or cross-region failures.<\/li>\n<li>Regulatory or security impact requiring expedited action.<\/li>\n<\/ul>\n\n\n\n<p>When it\u2019s optional:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Localized region outage affecting subset of users.<\/li>\n<li>Single microservice degraded but can be mitigated by retries.<\/li>\n<li>Non-critical feature failures with low user impact.<\/li>\n<\/ul>\n\n\n\n<p>When NOT to use \/ overuse it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using major incident for every high-severity pager creates fatigue.<\/li>\n<li>Avoid declaring for routine maintenance or expected degradations.<\/li>\n<li>Do not declare when automated failover already resolved issue without human coordination.<\/li>\n<\/ul>\n\n\n\n<p>Decision checklist (see the sketch after this list):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If 5xx rate &gt; X and affected users &gt; Y -&gt; declare major incident.<\/li>\n<li>If SLO burn-rate &gt; Z over 15m and no automated mitigation -&gt; declare.<\/li>\n<li>If security breach with data exfiltration -&gt; declare security incident (use specialized workflow).<\/li>\n<li>If localized and mitigated by a single owner within T minutes -&gt; no major.<\/li>\n<\/ul>\n\n\n\n
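<p>A minimal sketch of how a team might codify this checklist as an automated pre-check, assuming illustrative values in place of your policy\u2019s X, Y, and Z (the constant and function names below are placeholders, not a standard API):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dataclasses import dataclass\n\n@dataclass\nclass IncidentSignal:\n    error_rate_5xx: float   # fraction of requests returning 5xx\n    affected_users: int     # unique users seeing failures\n    slo_burn_rate: float    # error-budget consumption multiplier over 15m\n    auto_mitigated: bool    # automated failover already resolved it\n\n# Illustrative thresholds; substitute your policy's X, Y, and Z.\nMAX_5XX_RATE = 0.05         # X\nMIN_AFFECTED_USERS = 1000   # Y\nMAX_BURN_RATE = 2.0         # Z\n\ndef should_declare_major(s: IncidentSignal) -&gt; bool:\n    if s.auto_mitigated:\n        return False  # resolved without human coordination: no major\n    if s.error_rate_5xx &gt; MAX_5XX_RATE and s.affected_users &gt; MIN_AFFECTED_USERS:\n        return True\n    return s.slo_burn_rate &gt; MAX_BURN_RATE<\/code><\/pre>\n\n\n\n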
<p>Maturity ladder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Beginner: Manual declaration, email\/Slack pages, basic runbooks.<\/li>\n<li>Intermediate: Automated detection, incident commander rotation, war-room templates.<\/li>\n<li>Advanced: Automated mitigation playbooks, multi-cloud failovers, AI-assisted triage, integrated postmortem pipelines.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does Major incident work?<\/h2>\n\n\n\n<p>Components and workflow:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Detection: SLIs trigger alerts; anomaly detection flags unusual patterns.<\/li>\n<li>Triage: On-call triages and determines severity.<\/li>\n<li>Declaration: Incident commander declared; communication channels opened.<\/li>\n<li>Coordination: Roles assigned (IC, communications, tech leads, scribe).<\/li>\n<li>Mitigation: Execute runbooks, mitigations, rollbacks, or failovers.<\/li>\n<li>Restoration: Monitor restoration and confirm impact reduced.<\/li>\n<li>Recovery: Stabilize systems and transition to remediation.<\/li>\n<li>Postmortem: Document RCA, actions, and timelines.<\/li>\n<\/ol>\n\n\n\n<p>Data flow and lifecycle:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Metrics and logs -&gt; alerting engine -&gt; incident system -&gt; human action -&gt; mitigation -&gt; telemetry shows improvement -&gt; incident closed -&gt; postmortem logged.<\/li>\n<\/ul>\n\n\n\n<p>Edge cases and failure modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alerting system down: fallback escalation via phone\/SMS.<\/li>\n<li>Communication channel overloaded: pre-configured backup channels.<\/li>\n<li>Multiple concurrent majors: designate escalation tier and prioritize by business impact.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for Major incident<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Centralized incident command: Single IC with global view; use when multiple teams involved.<\/li>\n<li>Federated incident hubs: Team-level ICs coordinated through a central coordinator; use in large orgs.<\/li>\n<li>Automated rollback\/failover: Automated playbooks for well-defined failures.<\/li>\n<li>Circuit breaker and feature flag fallback: Use when recent deploys introduce risk (see the sketch after this list).<\/li>\n<li>Multi-region failover: For cloud-native apps with active-passive regions.<\/li>\n<li>Canary isolation: Isolate failing service via routing rules and progressive traffic shifting.<\/li>\n<\/ul>\n\n\n\n
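<p>As a concrete illustration of the circuit-breaker-with-fallback pattern, here is a minimal sketch; the class, thresholds, and fallback hook are assumptions for illustration, not a specific library\u2019s API:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import time\n\nclass CircuitBreaker:\n    \"\"\"Fail fast to a fallback once a dependency keeps failing.\"\"\"\n    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):\n        self.max_failures = max_failures\n        self.reset_after_s = reset_after_s\n        self.failures = 0\n        self.opened_at = 0.0\n\n    def call(self, fn, fallback):\n        # While open and still cooling down, skip the dependency entirely.\n        if self.failures &gt;= self.max_failures:\n            if time.monotonic() - self.opened_at &lt; self.reset_after_s:\n                return fallback()  # degraded response, e.g. behind a feature flag\n        try:\n            result = fn()\n        except Exception:\n            self.failures += 1\n            if self.failures &gt;= self.max_failures:\n                self.opened_at = time.monotonic()\n            return fallback()\n        self.failures = 0  # a healthy call closes the circuit again\n        return result<\/code><\/pre>\n\n\n\n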
<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Alert storm<\/td>\n<td>Many alerts in short time<\/td>\n<td>Upstream service spike or flapping<\/td>\n<td>Throttle alerts, aggregate<\/td>\n<td>High alert rate<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Missing telemetry<\/td>\n<td>Dashboards blank<\/td>\n<td>Logging pipeline failed<\/td>\n<td>Switch to backup pipeline<\/td>\n<td>Missing metrics<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>Incorrect escalation<\/td>\n<td>Wrong on-call paged<\/td>\n<td>Misconfigured routing<\/td>\n<td>Update escalation policy<\/td>\n<td>Pager logs show misroute<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Runbook not found<\/td>\n<td>Teams confused<\/td>\n<td>Poor documentation<\/td>\n<td>Create and publish runbook<\/td>\n<td>Search failures<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>Communication overload<\/td>\n<td>Channel clogged<\/td>\n<td>No structured updates<\/td>\n<td>Use status updates cadence<\/td>\n<td>Message rate spike<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Automated rollback fails<\/td>\n<td>New errors after rollback<\/td>\n<td>Incomplete rollback steps<\/td>\n<td>Manual rollback path<\/td>\n<td>Deploy trace shows failure<\/td>\n<\/tr>\n<tr>\n<td>F7<\/td>\n<td>Cross-region sync failure<\/td>\n<td>Data inconsistency<\/td>\n<td>Replication lag or network<\/td>\n<td>Promote backups, re-sync<\/td>\n<td>Replication lag metric<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for Major incident<\/h2>\n\n\n\n<p>Each entry follows the pattern: Term \u2014 definition \u2014 why it matters \u2014 common pitfall.<\/p>\n\n\n\n<p>Service Level Indicator (SLI) \u2014 A quantitative measure of service performance, e.g., success rate \u2014 Measures user-facing behavior \u2014 Pitfall: choosing irrelevant SLI\nService Level Objective (SLO) \u2014 Target for an SLI over time \u2014 Sets reliability goals \u2014 Pitfall: unrealistic targets\nError budget \u2014 Allowable SLO breach over time \u2014 Enables trade-offs for changes \u2014 Pitfall: not tracking consumption\nIncident commander (IC) \u2014 Single person coordinating response \u2014 Provides authority and decisions \u2014 Pitfall: ambiguous IC handoff\nWar room \u2014 Communication channel for incident coordination \u2014 Centralizes info flow \u2014 Pitfall: unstructured chat noise\nRunbook \u2014 Step-by-step remediation guide \u2014 Speeds mitigation \u2014 Pitfall: stale runbooks\nPlaybook \u2014 Higher-level response plan for classes of incidents \u2014 Aligns teams \u2014 Pitfall: too generic\nPagerDuty rotation \u2014 On-call schedule system \u2014 Ensures 24&#215;7 coverage \u2014 Pitfall: over-alerting operators\nPager fatigue \u2014 Burnout from repetitive pages \u2014 Causes retention issues \u2014 Pitfall: not addressing noisy alerts\nPostmortem \u2014 Detailed incident analysis document \u2014 Drives learning \u2014 Pitfall: blamelessness missing\nRoot cause analysis (RCA) \u2014 Investigation into underlying cause \u2014 Prevents recurrence \u2014 Pitfall: premature RCA\nMitigation \u2014 Temporary actions to reduce impact \u2014 Restores user service fast \u2014 Pitfall: leaving mitigation permanent\nRemediation \u2014 Permanent fix addressing root cause \u2014 Eliminates recurrence \u2014 Pitfall: delayed remediation\nSLA (Service Level Agreement) \u2014 Contractual reliability promise \u2014 Affects penalties and trust \u2014 Pitfall: misaligned SLA and SLO\nObservation window \u2014 Time period for evaluating SLOs \u2014 Defines measurement span \u2014 Pitfall: wrong window masking trends\nAlert burn rate \u2014 Rate of SLO consumption \u2014 Helps pace responses \u2014 Pitfall: miscalculation leads to wrong escalation\nAnomaly detection \u2014 Automated detection of abnormal behavior \u2014 Faster detection than static thresholds \u2014 Pitfall: false positives\nSynthetic monitoring \u2014 Simulated user checks \u2014 Detects endpoint regressions \u2014 Pitfall: false negatives 
vs real user flows\nReal-user monitoring (RUM) \u2014 Collects client-side metrics \u2014 Measures actual user impact \u2014 Pitfall: sampling bias\nTracing \u2014 Distributed tracing across services \u2014 Pinpoints latency sources \u2014 Pitfall: incomplete traces\nLogs \u2014 Event records from systems \u2014 Essential for forensic analysis \u2014 Pitfall: not centralized\nMetrics \u2014 Quantitative counters and gauges \u2014 Primary input for alarms \u2014 Pitfall: cardinality issues\nDashboards \u2014 Visual representations of telemetry \u2014 Rapid situational awareness \u2014 Pitfall: cluttered dashboards\nEscalation policy \u2014 Rules mapping alerts to responders \u2014 Ensures appropriate response \u2014 Pitfall: outdated contacts\nIncident lifecycle \u2014 Stages from detection to postmortem \u2014 Framework for workflows \u2014 Pitfall: skipping steps\nService map \u2014 Dependency graph of services \u2014 Shows blast radius \u2014 Pitfall: not maintained\nBlast radius \u2014 Scope of impact from an event \u2014 Prioritizes response \u2014 Pitfall: underestimated dependencies\nFailover \u2014 Switching to backup system \u2014 Reduces downtime \u2014 Pitfall: failover not tested\nRollback \u2014 Reverting to previous state or version \u2014 Rapid mitigation for bad deploys \u2014 Pitfall: data schema incompatibility\nFeature flag \u2014 Toggle to control features at runtime \u2014 Enables surgical mitigation \u2014 Pitfall: flag entanglement\nCanary deploy \u2014 Gradual rollout to subset of users \u2014 Limits blast radius \u2014 Pitfall: canary not representative\nChaos engineering \u2014 Controlled failure injection \u2014 Validates resilience \u2014 Pitfall: inadequate safety guards\nAutomation playbook \u2014 Scripted remediation tasks \u2014 Reduces toil and error \u2014 Pitfall: over-reliance without human oversight\nIncident budget \u2014 Time allocated for major incidents in SRE plan \u2014 Resource planning \u2014 Pitfall: misalignment with actual load\nOn-call runbook \u2014 Quick actions for on-call responders \u2014 Increases speed \u2014 Pitfall: too verbose\nScribe \u2014 Incident note taker \u2014 Keeps timeline record \u2014 Pitfall: missing timestamps\nCommunications lead \u2014 Manages external and internal messaging \u2014 Maintains trust \u2014 Pitfall: inconsistent messaging\nBackfill \u2014 Restoring data after outage \u2014 Ensures correctness \u2014 Pitfall: silent data loss\nObservability debt \u2014 Missing telemetry or poor instrumentation \u2014 Hinders diagnosis \u2014 Pitfall: deferred instrumentation\nPost-incident action (PIA) \u2014 Tasks from postmortem \u2014 Drives remediation \u2014 Pitfall: action items not tracked to completion\nBlameless culture \u2014 Focus on system fixes not people \u2014 Encourages openness \u2014 Pitfall: lack of accountability\nMajor incident playbook \u2014 Organization-specific document \u2014 Standardizes response \u2014 Pitfall: not practiced<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure Major incident (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>Success rate<\/td>\n<td>Fraction of successful requests<\/td>\n<td>Successful responses \/ total<\/td>\n<td>99.9% for critical APIs<\/td>\n<td>Depends on traffic 
patterns<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>Request latency p95<\/td>\n<td>Tail latency user experiences<\/td>\n<td>Measure p95 from trace or metric<\/td>\n<td>&lt;300ms for web UI<\/td>\n<td>Outliers can hide distribution<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Error rate by code<\/td>\n<td>Root cause by error class<\/td>\n<td>5xx count \/ total requests<\/td>\n<td>&lt;0.1% for core flows<\/td>\n<td>Aggregation may hide service-level issues<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>Availability<\/td>\n<td>Uptime per SLO window<\/td>\n<td>Time service responds correctly \/ window<\/td>\n<td>99.95% regionally<\/td>\n<td>Maintenance windows affect calc<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>SLO burn rate<\/td>\n<td>Speed of error budget consumption<\/td>\n<td>Error rate vs SLO over time<\/td>\n<td>Alert at 2x burn rate<\/td>\n<td>Short windows noisy<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>MTTR<\/td>\n<td>Mean time to restore service<\/td>\n<td>Time from detection to fix<\/td>\n<td>&lt;60m for P0s (varies)<\/td>\n<td>Depends on mitigation types<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>MTTA<\/td>\n<td>Mean time to acknowledge alerts<\/td>\n<td>Time from alert to human ack<\/td>\n<td>&lt;5m for majors<\/td>\n<td>High false positives inflate<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>Incident frequency<\/td>\n<td>How often majors occur<\/td>\n<td>Count per quarter<\/td>\n<td>&lt;1 per quarter per service<\/td>\n<td>Colocation of services skews counts<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>User impact count<\/td>\n<td>Number of affected users\/events<\/td>\n<td>Unique users with failures<\/td>\n<td>Minimal acceptable per product<\/td>\n<td>Privacy and sampling issues<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Cost impact<\/td>\n<td>Cloud cost increase from incident<\/td>\n<td>Delta in billing for incident window<\/td>\n<td>Varies by business<\/td>\n<td>Hard to compute in real time<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>M1: Compute on a per-endpoint basis and aggregate to service level. Use weighted averages for traffic splits.<\/li>\n<li>M5: Burn rate is best measured over multiple windows (15m, 1h, 24h); see the sketch below.<\/li>\n<li>M6: MTTR should separate detection-to-mitigation and mitigation-to-remediation.<\/li>\n<\/ul>\n\n\n\n
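<p>A small sketch of the multi-window burn-rate arithmetic behind M5, using the 5x paging threshold from the alerting guidance later in this article (the function names and window pairing are illustrative):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Burn rate = observed error rate divided by the rate the SLO allows.\n# A burn rate of 1.0 spends the error budget exactly on schedule.\nSLO_TARGET = 0.999  # 99.9% success objective\n\ndef burn_rate(errors: int, total: int, slo_target: float = SLO_TARGET) -&gt; float:\n    if total == 0:\n        return 0.0\n    allowed_error_rate = 1.0 - slo_target  # 0.001 for a 99.9% SLO\n    return (errors \/ total) \/ allowed_error_rate\n\ndef should_page(windows: dict) -&gt; bool:\n    # Pair a short and a long window so one brief spike alone does not page.\n    fast_errors, fast_total = windows[\"15m\"]\n    slow_errors, slow_total = windows[\"1h\"]\n    return (burn_rate(fast_errors, fast_total) &gt; 5.0\n            and burn_rate(slow_errors, slow_total) &gt; 5.0)<\/code><\/pre>\n\n\n\n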
<h3 class=\"wp-block-heading\">Best tools to measure Major incident<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Observability Platform (APM\/metrics\/tracing)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Major incident: Traces, latencies, error rates, distributed context<\/li>\n<li>Best-fit environment: Microservices in cloud or hybrid apps<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument services for traces and metrics<\/li>\n<li>Configure sampling and trace headers<\/li>\n<li>Correlate traces with logs and metrics<\/li>\n<li>Strengths:<\/li>\n<li>Fast root cause paths<\/li>\n<li>Correlated distributed traces<\/li>\n<li>Limitations:<\/li>\n<li>High cardinality costs<\/li>\n<li>Sampling gaps may miss low-frequency errors<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Logging Pipeline<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Major incident: Application and infrastructure events<\/li>\n<li>Best-fit environment: All environments needing forensic detail<\/li>\n<li>Setup outline:<\/li>\n<li>Centralize logs<\/li>\n<li>Enrich logs with trace IDs<\/li>\n<li>Ensure retention and indexing strategy<\/li>\n<li>Strengths:<\/li>\n<li>Forensic troubleshooting<\/li>\n<li>Flexible queries<\/li>\n<li>Limitations:<\/li>\n<li>Cost and volume management<\/li>\n<li>Not real-time if ingestion lags<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Synthetic Monitoring<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Major incident: End-to-end user flows and availability (a minimal scripted check is sketched after this tools list)<\/li>\n<li>Best-fit environment: Public-facing APIs and web UIs<\/li>\n<li>Setup outline:<\/li>\n<li>Create synthetic journeys<\/li>\n<li>Schedule checks globally<\/li>\n<li>Alert on failures and latency thresholds<\/li>\n<li>Strengths:<\/li>\n<li>Early detection of regressions<\/li>\n<li>SLA proof points<\/li>\n<li>Limitations:<\/li>\n<li>May not reflect real-user conditions<\/li>\n<li>Maintenance overhead for scripts<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Incident Management Platform<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Major incident: Pages, actions, timelines, ownership<\/li>\n<li>Best-fit environment: Organizations with distributed teams<\/li>\n<li>Setup outline:<\/li>\n<li>Integrate alerting and on-call schedules<\/li>\n<li>Define escalation policies<\/li>\n<li>Create incident templates<\/li>\n<li>Strengths:<\/li>\n<li>Coordination and audit trail<\/li>\n<li>Role-based routing<\/li>\n<li>Limitations:<\/li>\n<li>Process overhead if misused<\/li>\n<li>May centralize decision-making too much<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Chaos Engineering Tools<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Major incident: System resilience and failover behavior<\/li>\n<li>Best-fit environment: Mature teams with staging and safeguards<\/li>\n<li>Setup outline:<\/li>\n<li>Define hypotheses<\/li>\n<li>Run safe experiments<\/li>\n<li>Measure outcomes against SLIs<\/li>\n<li>Strengths:<\/li>\n<li>Proactive discovery of weak points<\/li>\n<li>Improves runbooks<\/li>\n<li>Limitations:<\/li>\n<li>Risk if experiments not scoped<\/li>\n<li>Requires investment in automation<\/li>\n<\/ul>\n\n\n\n
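<p>To make the synthetic monitoring category concrete, here is a minimal scripted probe; the endpoint URL and latency budget are placeholders, and real products layer global scheduling and alert routing on top:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import time\nimport urllib.request\n\nCHECK_URL = \"https:\/\/example.com\/health\"  # placeholder endpoint\nLATENCY_BUDGET_S = 0.3                     # budget per check\n\ndef run_synthetic_check(url: str = CHECK_URL) -&gt; dict:\n    \"\"\"One probe: availability plus latency against a budget.\"\"\"\n    start = time.monotonic()\n    try:\n        with urllib.request.urlopen(url, timeout=5) as resp:\n            ok = 200 &lt;= resp.status &lt; 300\n    except OSError:  # URLError (DNS, refused, timeout) subclasses OSError\n        ok = False\n    elapsed = time.monotonic() - start\n    return {\"ok\": ok, \"latency_s\": elapsed, \"slow\": elapsed &gt; LATENCY_BUDGET_S}<\/code><\/pre>\n\n\n\n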
<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for Major incident<\/h3>\n\n\n\n<p>Executive dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uptime by critical service: shows current availability vs SLO.<\/li>\n<li>Business metrics: checkout rate, payments processed, revenue delta.<\/li>\n<li>Active major incidents count and status.<\/li>\n<li>SLO burn rate summary.\nWhy: Executives need impact, scope, and trending.<\/li>\n<\/ul>\n\n\n\n<p>On-call dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Real-time errors by service and region.<\/li>\n<li>Top alerts with severity and owner.<\/li>\n<li>Recent deploys and change history.<\/li>\n<li>Active mitigation steps and runbook links.\nWhy: On-call needs immediate actionable info and context.<\/li>\n<\/ul>\n\n\n\n<p>Debug dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Traces for failure paths and top slow traces.<\/li>\n<li>Error logs grouped by root cause.<\/li>\n<li>Infrastructure metrics (CPU, memory, network).<\/li>\n<li>Deployment timelines and traffic routing.\nWhy: Engineers debug and verify fixes.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page for urgent (service down, data loss) incidents.<\/li>\n<li>Ticket for degradations that need attention but not immediate coordination.<\/li>\n<li>Use burn-rate alerts for SLO consumption: page when burn rate &gt; 5x sustained over 15 minutes.<\/li>\n<li>Noise reduction: dedupe repeated alerts, group by root cause, silence known noisy windows, use alert suppression for automated mitigation.<\/li>\n<li>Use correlation rules to avoid paging for transient upstream blips.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites\n&#8211; Clear SLOs for critical services.\n&#8211; On-call rotation and escalation policies.\n&#8211; Instrumentation for metrics, logs, and traces.\n&#8211; Incident management platform and communication channels.\n&#8211; Runbooks and playbooks for known failure modes.<\/p>\n\n\n\n<p>2) Instrumentation plan\n&#8211; Identify critical user journeys and services.\n&#8211; Define SLIs and required telemetry for each SLI.\n&#8211; Add trace IDs to logs and propagate context across services.\n&#8211; Implement synthetic checks for key flows.<\/p>\n\n\n\n<p>3) Data collection\n&#8211; Centralize metrics, logs, and traces.\n&#8211; Ensure redundancy in telemetry pipelines.\n&#8211; Set retention policies for postmortem evidence.\n&#8211; Configure alerting rules and anomaly detectors.<\/p>\n\n\n\n<p>4) SLO design\n&#8211; Choose meaningful SLIs aligned to user experience.\n&#8211; Set SLO windows (30d, 90d) and targets based on risk tolerance.\n&#8211; Define error budgets and escalation thresholds.<\/p>\n\n\n\n<p>5) Dashboards\n&#8211; Build three tiers: executive, on-call, debug.\n&#8211; Use service maps and dependency views.\n&#8211; Link dashboards to runbooks and incident pages.<\/p>\n\n\n\n<p>6) Alerts &amp; routing\n&#8211; Define alert priorities and which alerts page versus open tickets.\n&#8211; Configure escalation policies and phone\/SMS fallbacks.\n&#8211; Implement dedupe and grouping rules.<\/p>\n\n\n\n<p>7) Runbooks &amp; automation\n&#8211; Create runbooks for top failure scenarios.\n&#8211; Automate safe mitigations (traffic shift, rollback), as sketched below.\n&#8211; Keep runbooks versioned and test them in staging.<\/p>\n\n\n\n
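<p>A sketch of the guarded automation in step 7, where pre-vetted mitigations run unattended and anything else stops for a human; the action names and the approval rule are assumptions for illustration:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import logging\n\nlog = logging.getLogger(\"runbook\")\n\n# Mitigations considered safe to run without a human in the loop.\nSAFE_ACTIONS = {\"shift_traffic\", \"rollback_release\"}\n\ndef run_mitigation(action: str, execute) -&gt; bool:\n    \"\"\"Run a runbook action, stopping for approval when it is not pre-vetted.\"\"\"\n    if action not in SAFE_ACTIONS:\n        log.warning(\"%s needs human approval; paging on-call\", action)\n        return False\n    log.info(\"executing automated mitigation: %s\", action)\n    execute()  # e.g. call the traffic manager or the CI\/CD rollback job\n    return True<\/code><\/pre>\n\n\n\n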
<p>8) Validation (load\/chaos\/game days)\n&#8211; Regularly run game days simulating major incidents.\n&#8211; Test failover, rollback, and communication procedures.\n&#8211; Measure MTTR and refine processes.<\/p>\n\n\n\n<p>9) Continuous improvement\n&#8211; Conduct blameless postmortems with actionable PIAs.\n&#8211; Track completion of action items.\n&#8211; Review SLOs and adjust alerts to reduce noise.<\/p>\n\n\n\n<p>Checklists:<\/p>\n\n\n\n<p>Pre-production checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLOs defined for critical flows.<\/li>\n<li>Synthetic checks in place.<\/li>\n<li>Rollback and deployment automation validated.<\/li>\n<li>Observability for services enabled.<\/li>\n<li>Runbooks written and accessible.<\/li>\n<\/ul>\n\n\n\n<p>Production readiness checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>On-call rotation assigned.<\/li>\n<li>Escalation policies tested.<\/li>\n<li>Incident tooling integrated with chat and paging.<\/li>\n<li>Communication templates prepared.<\/li>\n<li>Backups and failover tested.<\/li>\n<\/ul>\n\n\n\n<p>Incident checklist specific to Major incident<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Declare incident and open incident page.<\/li>\n<li>Assign IC, communications lead, scribe.<\/li>\n<li>Post initial status update within 10 minutes.<\/li>\n<li>Execute mitigation runbook and record actions.<\/li>\n<li>Monitor telemetry and update stakeholders regularly.<\/li>\n<li>Capture timeline and evidence for postmortem.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of Major incident<\/h2>\n\n\n\n<p>1) Global login failure\n&#8211; Context: Auth service returns 500s.\n&#8211; Problem: Users cannot access accounts.\n&#8211; Why Major incident helps: Coordinates multiple teams (auth, DB, infra).\n&#8211; What to measure: Success rate for \/login, latency, DB errors.\n&#8211; Typical tools: Tracing, DB monitors, incident platform.<\/p>\n\n\n\n<p>2) Payment processing outage\n&#8211; Context: Payment gateway region degraded.\n&#8211; Problem: Transactions failing, revenue loss.\n&#8211; Why Major incident helps: Rapid failover and business comms.\n&#8211; What to measure: Transaction success, queue lengths.\n&#8211; Typical tools: Payment monitor, dashboard, billing alerts.<\/p>\n\n\n\n<p>3) K8s control plane API high latency\n&#8211; Context: API throttle causing pod scheduling failures.\n&#8211; Problem: New pods failing and deployments stuck.\n&#8211; Why Major incident helps: Orchestrates platform and app teams.\n&#8211; What to measure: API latency, pod restart rate.\n&#8211; Typical tools: K8s metrics, control plane logs.<\/p>\n\n\n\n<p>4) Database corruption discovered\n&#8211; Context: Erroneous write pattern corrupted a table.\n&#8211; Problem: Data integrity and customer trust at risk.\n&#8211; Why Major incident helps: Coordinates recovery, legal, and comms.\n&#8211; What to measure: Corrupt rows, replication status.\n&#8211; Typical tools: DB backup systems, logs.<\/p>\n\n\n\n<p>5) CDN misconfiguration serving stale content\n&#8211; Context: Cache invalidation failed globally.\n&#8211; Problem: Users see outdated data and errors.\n&#8211; Why Major incident helps: Coordinates CDN provider, cache purges.\n&#8211; What to measure: Cache hit\/miss, response headers.\n&#8211; Typical tools: CDN logs, synthetic checks.<\/p>\n\n\n\n<p>6) Security breach detection\n&#8211; Context: Suspicious data exfiltration patterns.\n&#8211; Problem: Potential data leak requiring legal response.\n&#8211; Why Major 
incident helps: Engage security, legal, comms.\n&#8211; What to measure: Anomalous access counts, IP patterns.\n&#8211; Typical tools: SIEM, auth logs.<\/p>\n\n\n\n<p>7) Billing spikes causing quotas\n&#8211; Context: Unexpected autoscaling increased costs and triggered quotas.\n&#8211; Problem: Services throttled by provider limits.\n&#8211; Why Major incident helps: Mitigate cost and maintain service.\n&#8211; What to measure: Cost per hour, quota use.\n&#8211; Typical tools: Cloud billing, autoscaler metrics.<\/p>\n\n\n\n<p>8) Third-party API outage\n&#8211; Context: External provider returns errors.\n&#8211; Problem: Dependent features fail.\n&#8211; Why Major incident helps: Decide degrade vs wait and route traffic.\n&#8211; What to measure: External API success rate, fallback efficacy.\n&#8211; Typical tools: Synthetic monitors, circuit breaker metrics.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes API throttling and app failures<\/h3>\n\n\n\n<p><strong>Context:<\/strong> K8s control plane experiencing high request latencies causing scheduling failures.\n<strong>Goal:<\/strong> Restore scheduling and prevent cascading pod restarts.\n<strong>Why Major incident matters here:<\/strong> Affects many services and can create a cluster-wide blackout.\n<strong>Architecture \/ workflow:<\/strong> K8s API -&gt; controllers -&gt; kubelets -&gt; workloads.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detect via control plane API latency alert.<\/li>\n<li>Declare major incident; assign IC and platform lead.<\/li>\n<li>Reduce create\/delete activity by pausing CI\/CD pipelines.<\/li>\n<li>Scale control plane or move workloads to standby cluster.<\/li>\n<li>Roll back recent control-plane-affecting changes.<\/li>\n<li>Monitor pod scheduling and API latency.\n<strong>What to measure:<\/strong> API p95, pod pending count, controller errors.\n<strong>Tools to use and why:<\/strong> K8s metrics, control plane logs, incident platform.\n<strong>Common pitfalls:<\/strong> Not pausing CI leads to continued pressure.\n<strong>Validation:<\/strong> Run create\/delete load tests in staging after fix.\n<strong>Outcome:<\/strong> Cluster stabilized and scheduling restored; postmortem action to rate-limit controllers.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless cold-start storm causing function timeouts<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Sudden traffic spike to serverless function causing high cold starts and timeouts.\n<strong>Goal:<\/strong> Reduce errors and stabilize latency.\n<strong>Why Major incident matters here:<\/strong> Many upstream services rely on low-latency functions.\n<strong>Architecture \/ workflow:<\/strong> API Gateway -&gt; Serverless functions -&gt; Downstream DB.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detect increased 5xx and latency via SLI alerts.<\/li>\n<li>Declare incident and engage platform and dev teams.<\/li>\n<li>Enable provisioned concurrency or scale warmers.<\/li>\n<li>Apply rate limiting at edge and degrade noncritical features (see the sketch after this scenario).<\/li>\n<li>Monitor invocation success and latency.\n<strong>What to measure:<\/strong> Invocation error rate, cold start duration, provisioned concurrency fill.\n<strong>Tools to use and why:<\/strong> Platform metrics, APM, synthetic checks.\n<strong>Common pitfalls:<\/strong> Provisioned concurrency adds cost and provisioning delay.\n<strong>Validation:<\/strong> Load test with warmers and synthetic checks.\n<strong>Outcome:<\/strong> Errors reduced; plan to add adaptive concurrency and circuit breakers.<\/li>\n<\/ul>\n\n\n\n
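<p>The edge rate-limiting step in this scenario can be sketched as a token bucket; the rate and burst numbers are illustrative, and a production limiter would normally live at the gateway rather than in application code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import time\n\nclass TokenBucket:\n    \"\"\"Admit a request only when a token is available; shed the rest.\"\"\"\n    def __init__(self, rate_per_s: float = 100.0, burst: int = 200):\n        self.rate = rate_per_s\n        self.capacity = float(burst)\n        self.tokens = float(burst)\n        self.last = time.monotonic()\n\n    def allow(self) -&gt; bool:\n        now = time.monotonic()\n        # Refill in proportion to elapsed time, capped at burst capacity.\n        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)\n        self.last = now\n        if self.tokens &gt;= 1.0:\n            self.tokens -= 1.0\n            return True\n        return False  # shed this request or serve the degraded path<\/code><\/pre>\n\n\n\n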
<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Postmortem-driven remediation after major incident<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Service outage due to bad schema migration.\n<strong>Goal:<\/strong> Document root cause and implement remediation.\n<strong>Why Major incident matters here:<\/strong> Data loss risk and repeated rollbacks.\n<strong>Architecture \/ workflow:<\/strong> Application -&gt; DB migration -&gt; downstream services.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>During incident, roll back application and restore DB from backups.<\/li>\n<li>After stabilization, declare postmortem and collect timeline.<\/li>\n<li>Analyze migration process, permissions, and testing gaps.<\/li>\n<li>Implement gatekeeping for migrations and add pre-deploy checks.\n<strong>What to measure:<\/strong> Number of failed migrations, time to rollback.\n<strong>Tools to use and why:<\/strong> DB logs, CI\/CD logs, version control.\n<strong>Common pitfalls:<\/strong> Skipping postmortem or not tracking action completion.\n<strong>Validation:<\/strong> Run game day for migration path.\n<strong>Outcome:<\/strong> Migration gating implemented, reduced risk.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost\/performance trade-off under heavy load<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Scaling out to handle traffic increases costs beyond budget and triggers provider quota.\n<strong>Goal:<\/strong> Balance cost with acceptable performance while restoring service.\n<strong>Why Major incident matters here:<\/strong> Financial overruns and service degradation risk.\n<strong>Architecture \/ workflow:<\/strong> Auto-scaling group -&gt; instances -&gt; load balancer.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detect cost and quota alerts; declare incident.<\/li>\n<li>Apply throttles to non-critical traffic and use degraded modes.<\/li>\n<li>Shift traffic to cheaper compute paths or reserved capacity.<\/li>\n<li>Iterate on scaling policies and implement load shedding strategies.\n<strong>What to measure:<\/strong> Cost per request, response time, queue lengths.\n<strong>Tools to use and why:<\/strong> Cloud billing, autoscaler metrics, traffic management tools.\n<strong>Common pitfalls:<\/strong> Sudden throttles harm user experience.\n<strong>Validation:<\/strong> Simulate traffic spikes and cost impact in staging.\n<strong>Outcome:<\/strong> Temporary cost controls, longer-term autoscaling policy changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #5 \u2014 Third-party API outage with internal fallback<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Payment provider API returns 503s.\n<strong>Goal:<\/strong> Keep payment flow working using fallback options.\n<strong>Why Major incident matters here:<\/strong> Direct revenue impact and refunds risk.\n<strong>Architecture \/ workflow:<\/strong> Checkout -&gt; Payment provider -&gt; Confirmation.\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alert triggers major incident.<\/li>\n<li>Route traffic to backup provider or offline payment queuing.<\/li>\n<li>Notify stakeholders 
and initiate customer messaging.<\/li>\n<li>Monitor success rate and process queued payments.\n<strong>What to measure:<\/strong> Payment success, queue size, retry success.\n<strong>Tools to use and why:<\/strong> Payment gateway metrics, queue monitors.\n<strong>Common pitfalls:<\/strong> Data duplication or double charges.\n<strong>Validation:<\/strong> Test fallback flow with synthetic transactions.\n<strong>Outcome:<\/strong> Payments processed via fallback; contract review with primary provider.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<p>Each item follows the pattern: Symptom -&gt; Root cause -&gt; Fix; items 16\u201320 are observability pitfalls.<\/p>\n\n\n\n<p>1) Symptom: Pages keep firing for same error -&gt; Root cause: alerts not deduped -&gt; Fix: group by root cause and add suppression\n2) Symptom: On-call overwhelmed -&gt; Root cause: too many major incident declarations -&gt; Fix: stricter declaration criteria and training\n3) Symptom: Postmortems missing -&gt; Root cause: No ownership after incident -&gt; Fix: require postmortem within SLA and track PIAs\n4) Symptom: Runbooks outdated -&gt; Root cause: No versioning or reviews -&gt; Fix: schedule runbook reviews and test runbooks\n5) Symptom: Dashboards blank during outage -&gt; Root cause: telemetry pipeline outage -&gt; Fix: redundant telemetry paths and synthetic monitors\n6) Symptom: Wrong person paged -&gt; Root cause: stale escalation policy -&gt; Fix: update on-call schedules and verify contacts\n7) Symptom: Rollback fails -&gt; Root cause: DB schema incompatible -&gt; Fix: add backward-compatible migrations and preflight checks\n8) Symptom: High MTTR -&gt; Root cause: missing instrumentation -&gt; Fix: add traces and logs for critical paths\n9) Symptom: Executive surprises -&gt; Root cause: no exec notification procedure -&gt; Fix: predefine communication templates and cadence\n10) Symptom: Noise from transient errors -&gt; Root cause: low-threshold alerts -&gt; Fix: add aggregation and anomaly detection\n11) Symptom: Security incident handled like regular outage -&gt; Root cause: lack of security-specific playbook -&gt; Fix: create security incident process\n12) Symptom: Data loss discovered late -&gt; Root cause: insufficient backups and verification -&gt; Fix: backup policies and restore drills\n13) Symptom: Cost surge during mitigation -&gt; Root cause: uncontrolled autoscale or fallback -&gt; Fix: cost-aware mitigation strategies\n14) Symptom: Multiple concurrent majors -&gt; Root cause: lack of prioritization -&gt; Fix: central coordination and business-impact scoring\n15) Symptom: Blame culture after incident -&gt; Root cause: poor postmortem facilitation -&gt; Fix: enforce blameless language and focus on systems\n16) Observability pitfall: Missing correlation IDs -&gt; Root cause: not propagating trace IDs -&gt; Fix: enforce propagation and enrich logs (sketched below)\n17) Observability pitfall: High-cardinality metrics exploding cost -&gt; Root cause: unbounded labels -&gt; Fix: sanitize labels and sample\n18) Observability pitfall: Logs not centralized -&gt; Root cause: local logging only -&gt; Fix: centralized logging with retention policy\n19) Observability pitfall: Metrics delayed due to batching -&gt; Root cause: long ingestion windows -&gt; Fix: reduce batch windows for critical metrics\n20) Observability pitfall: Tracing disabled in production -&gt; Root cause: performance fears -&gt; Fix: sampling and low-overhead tracing\n21) Symptom: Communication chaos -&gt; Root cause: no communications lead -&gt; Fix: assign communications role on declaration\n22) Symptom: Incident page lacks timeline -&gt; Root cause: no scribe -&gt; Fix: designate scribe and require timestamped entries\n23) Symptom: Remediations never completed -&gt; Root cause: no tracking or accountability -&gt; Fix: assign owners and track to completion<\/p>\n\n\n\n
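<p>Pitfall 16, missing correlation IDs, is the most mechanical to fix: accept an inbound trace ID at the edge (or mint one) and stamp it on every log line. A minimal sketch follows; the header name and filter wiring are illustrative assumptions:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import contextvars\nimport logging\nimport uuid\n\n# Request-scoped trace ID: set once at the edge, readable everywhere.\ntrace_id_var = contextvars.ContextVar(\"trace_id\", default=\"-\")\n\nclass TraceIdFilter(logging.Filter):\n    \"\"\"Enrich every record passing through the handler with the trace ID.\"\"\"\n    def filter(self, record):\n        record.trace_id = trace_id_var.get()\n        return True\n\ndef handle_request(headers: dict) -&gt; None:\n    # Propagate the inbound ID when present; otherwise mint one here.\n    trace_id_var.set(headers.get(\"x-trace-id\", uuid.uuid4().hex))\n    logging.getLogger(\"app\").info(\"handling request\")\n\nhandler = logging.StreamHandler()\nhandler.addFilter(TraceIdFilter())\nhandler.setFormatter(logging.Formatter(\"%(asctime)s trace=%(trace_id)s %(message)s\"))\nlogging.getLogger(\"app\").addHandler(handler)\nlogging.getLogger(\"app\").setLevel(logging.INFO)<\/code><\/pre>\n\n\n\n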
<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<p>Ownership and on-call:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Define clear ownership for services and SLOs.<\/li>\n<li>Rotate IC and on-call to avoid burnout.<\/li>\n<li>Provide financial and career recognition for on-call work.<\/li>\n<\/ul>\n\n\n\n<p>Runbooks vs playbooks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runbooks: step-by-step fixes for specific failures.<\/li>\n<li>Playbooks: decision trees and coordination templates for classes of incidents.<\/li>\n<li>Keep runbooks tight, testable, and accessible.<\/li>\n<\/ul>\n\n\n\n<p>Safe deployments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use canary and progressive rollouts.<\/li>\n<li>Feature flags to disable problematic features quickly.<\/li>\n<li>Automated rollback on key error thresholds.<\/li>\n<\/ul>\n\n\n\n<p>Toil reduction and automation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate detection-to-mitigation paths where safe.<\/li>\n<li>Reduce manual repetitive tasks such as log collection and paging.<\/li>\n<li>Audit automated actions and provide clear human override.<\/li>\n<\/ul>\n\n\n\n<p>Security basics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Separate incident flows for security incidents.<\/li>\n<li>Lock down access during incidents and rotate credentials if compromised.<\/li>\n<li>Maintain audit trails and preserve forensic data.<\/li>\n<\/ul>\n\n\n\n<p>Weekly\/monthly\/quarterly routines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly: review high-severity alerts and action items.<\/li>\n<li>Monthly: runbook drills and observability audits.<\/li>\n<li>Quarterly: SLO and incident frequency review with execs.<\/li>\n<\/ul>\n\n\n\n<p>What to review in postmortems:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Timeline and detection-to-mitigation times.<\/li>\n<li>Root cause and contributing factors.<\/li>\n<li>Action items with owners and deadlines.<\/li>\n<li>SLO impact and whether SLAs were breached.<\/li>\n<li>Lessons and process updates for future prevention.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for Major incident<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Metrics\/Monitoring<\/td>\n<td>Collects and visualizes metrics<\/td>\n<td>Alerting, dashboards, tracing<\/td>\n<td>Core for detection<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>Tracing\/APM<\/td>\n<td>Distributed traces and span context<\/td>\n<td>Logs, metrics, issue trackers<\/td>\n<td>Critical for root cause<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>Logging<\/td>\n<td>Centralized logs and search<\/td>\n<td>Tracing, alerts, incident pages<\/td>\n<td>Forensics and audits<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>Incident Mgmt<\/td>\n<td>Pages, tracks incidents, roles<\/td>\n<td>Chat, monitoring, 
ticketing<\/td>\n<td>Orchestrates response<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>ChatOps<\/td>\n<td>Real-time coordination and automation<\/td>\n<td>Incident Mgmt, CI\/CD, tools<\/td>\n<td>Executes runbook commands<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>CI\/CD<\/td>\n<td>Deploys and can roll back releases<\/td>\n<td>Feature flags, monitoring<\/td>\n<td>Can trigger incidents<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>Feature Flags<\/td>\n<td>Control feature exposure at runtime<\/td>\n<td>CI\/CD, monitoring<\/td>\n<td>Used for mitigation<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>Chaos Tools<\/td>\n<td>Inject faults to validate resilience<\/td>\n<td>Scheduling, monitoring<\/td>\n<td>For proactive testing<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>SIEM<\/td>\n<td>Security analysis and alerts<\/td>\n<td>Logs, identity systems<\/td>\n<td>For security incidents<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>Cost\/Billing<\/td>\n<td>Tracks cloud spend and quotas<\/td>\n<td>Monitoring, alerts<\/td>\n<td>For cost-related incidents<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What exactly qualifies as a major incident?<\/h3>\n\n\n\n<p>A major incident is declared when impact thresholds in your incident policy are met, such as significant user impact, revenue loss, or regulatory exposure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Who should declare a major incident?<\/h3>\n\n\n\n<p>Typically the on-call engineer or triage lead can declare, but organizations may require a platform or product lead to confirm based on policy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How does a major incident differ from a P0?<\/h3>\n\n\n\n<p>P0 is a priority label; many organizations map P0 to major incidents but definitions vary by team.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How long should a major incident stay declared?<\/h3>\n\n\n\n<p>Until the service is restored to acceptable impact levels and temporary mitigations stabilize the system; often until the first post-incident handoff.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Are major incidents always public?<\/h3>\n\n\n\n<p>Not always. Public communication depends on impact, legal, and PR considerations; security incidents often have separate disclosure rules.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do SLOs factor into declaring majors?<\/h3>\n\n\n\n<p>SLO breaches or rapid SLO burn rates are common triggers for escalation to major incident status.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do you avoid alert fatigue while still detecting majors?<\/h3>\n\n\n\n<p>Use aggregation, smarter thresholds, anomaly detection, and dedupe\/grouping to reduce noise without losing signal.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Should runbooks be automated?<\/h3>\n\n\n\n<p>Automate safe, repeatable steps. 
Manual oversight is necessary for high-risk actions like DB restores.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do you measure success after a major incident?<\/h3>\n\n\n\n<p>Measure MTTR, recurrence, time to complete PIAs, and impact on SLO\/error budget.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Who owns postmortems?<\/h3>\n\n\n\n<p>The owning team for the affected service should lead the postmortem with cross-team contributions and an executive reviewer.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How often should you do game days?<\/h3>\n\n\n\n<p>Quarterly for critical systems; more frequently for high-change environments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What role does chaos engineering play?<\/h3>\n\n\n\n<p>It proactively reveals brittle behavior and validates mitigations before production incidents occur.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to handle multiple concurrent major incidents?<\/h3>\n\n\n\n<p>Prioritize by business impact, assign separate ICs, and maintain a central coordinator for cross-incident dependencies.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What are reasonable starting targets for SLOs?<\/h3>\n\n\n\n<p>Start with conservative targets for critical flows (e.g., 99.9% success) and adjust based on capability and business needs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When should executives be notified?<\/h3>\n\n\n\n<p>Notify executives for high-revenue impact, long or public-facing outages, or regulatory\/security incidents.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to track remediation completion?<\/h3>\n\n\n\n<p>Use tracked PIAs in your task system with owners, due dates, and executive visibility.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can AI help with major incident response?<\/h3>\n\n\n\n<p>Yes; AI can assist triage, correlate signals, and summarize timelines but requires careful guardrails and human validation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to ensure legal\/compliance needs are met during incidents?<\/h3>\n\n\n\n<p>Involve legal and compliance early for incidents impacting data or regulated services and preserve forensic evidence.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Major incidents are high-impact events that require disciplined detection, rapid coordination, and rigorous follow-through. Modern cloud-native environments demand observable systems, automated mitigations, and well-practiced playbooks. 
Investing in instrumentation, SLOs, role clarity, and postmortem culture reduces frequency and impact over time.<\/p>\n\n\n\n<p>Next 7 days plan (5 bullets):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Audit critical SLIs and ensure telemetry exists for top 3 user journeys.<\/li>\n<li>Day 2: Verify escalation policies and on-call contacts; run a paging drill.<\/li>\n<li>Day 3: Review and update top 5 runbooks; add missing runbooks for critical flows.<\/li>\n<li>Day 4: Build or refine on-call and executive dashboards for SLO burn rates.<\/li>\n<li>Day 5\u20137: Run a focused game day simulating one major incident and complete a mini postmortem.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Major incident Keyword Cluster (SEO)<\/h2>\n\n\n\n<p>Primary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>major incident<\/li>\n<li>major incident management<\/li>\n<li>major incident response<\/li>\n<li>major incident playbook<\/li>\n<li>major incident definition<\/li>\n<\/ul>\n\n\n\n<p>Secondary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>incident command system<\/li>\n<li>SRE major incident<\/li>\n<li>major outage handling<\/li>\n<li>incident commander role<\/li>\n<li>major incident runbook<\/li>\n<\/ul>\n\n\n\n<p>Long-tail questions<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>what is a major incident in it operations<\/li>\n<li>how to declare a major incident<\/li>\n<li>major incident vs outage vs incident<\/li>\n<li>how to measure major incident impact<\/li>\n<li>best practices for major incident response<\/li>\n<li>how to write a major incident runbook<\/li>\n<li>how to recover from a major outage<\/li>\n<li>major incident communication templates<\/li>\n<li>when to notify executives during major incident<\/li>\n<li>how to measure SLO during major incident<\/li>\n<li>how to use feature flags to mitigate incidents<\/li>\n<li>automating major incident mitigation with playbooks<\/li>\n<li>running game days for major incidents<\/li>\n<li>handling security incidents during major outage<\/li>\n<li>how to prioritize multiple major incidents<\/li>\n<li>roles in major incident response team<\/li>\n<li>telemetry required for major incident detection<\/li>\n<li>major incident postmortem template<\/li>\n<li>how to track remediation after a major incident<\/li>\n<li>major incident escalation checklist<\/li>\n<\/ul>\n\n\n\n<p>Related terminology<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLO monitoring<\/li>\n<li>SLI definitions<\/li>\n<li>error budget burn<\/li>\n<li>postmortem analysis<\/li>\n<li>incident timeline<\/li>\n<li>war room coordination<\/li>\n<li>incident management platform<\/li>\n<li>logging and tracing<\/li>\n<li>synthetic monitoring<\/li>\n<li>chaos engineering<\/li>\n<li>canary deploy rollback<\/li>\n<li>feature flag mitigation<\/li>\n<li>automated rollback<\/li>\n<li>burn-rate alerts<\/li>\n<li>observability debt<\/li>\n<li>incident severity levels<\/li>\n<li>incident frequency metrics<\/li>\n<li>MTTR measurement<\/li>\n<li>MTTA metrics<\/li>\n<li>incident commander checklist<\/li>\n<li>communications lead duties<\/li>\n<li>runbook testing<\/li>\n<li>incident game day<\/li>\n<li>blameless postmortem<\/li>\n<li>security incident response<\/li>\n<li>legal and compliance in incidents<\/li>\n<li>cloud failover strategies<\/li>\n<li>multi-region failover<\/li>\n<li>backup and restore drills<\/li>\n<li>cost-aware incident mitigation<\/li>\n<li>CI\/CD rollback procedures<\/li>\n<li>Kubernetes incident 
<h3 class=\"wp-block-heading\">When should executives be notified?<\/h3>\n\n\n\n<p>Notify executives for high-revenue impact, prolonged or public-facing outages, and regulatory or security incidents.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do you track remediation completion?<\/h3>\n\n\n\n<p>Track PIAs in your task system with named owners, due dates, and executive visibility, as sketched below.<\/p>\n\n\n\n
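<p>A minimal sketch of what \u201ctracked\u201d means in practice: each post-incident action carries an owner, a due date, and a status, so overdue remediation surfaces instead of being forgotten. The structure and field names are illustrative; most teams would model this in their issue tracker rather than in code.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dataclasses import dataclass\nfrom datetime import date\n\n@dataclass\nclass ActionItem:\n    # One post-incident action (PIA) with clear accountability.\n    title: str\n    owner: str\n    due: date\n    done: bool = False\n\nitems = [\n    ActionItem('Add burn-rate alert for checkout SLO', 'alice', date(2026, 3, 1)),\n    ActionItem('Automate database failover runbook', 'bob', date(2026, 2, 20), done=True),\n]\n\ndef overdue(items, today):\n    # Anything past due and not done should surface in the weekly review.\n    return [i for i in items if not i.done and i.due &lt; today]\n\nfor item in overdue(items, date(2026, 3, 5)):\n    print(f'OVERDUE: {item.title} (owner {item.owner}, due {item.due})')\n<\/code><\/pre>\n\n\n\n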
<h3 class=\"wp-block-heading\">Can AI help with major incident response?<\/h3>\n\n\n\n<p>Yes. AI can assist with triage, correlate signals across telemetry, and summarize timelines, but it requires careful guardrails and human validation before any action is taken.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do you ensure legal and compliance needs are met during incidents?<\/h3>\n\n\n\n<p>Involve legal and compliance teams early for incidents that touch data or regulated services, and preserve forensic evidence.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Major incidents are high-impact events that require disciplined detection, rapid coordination, and rigorous follow-through. Modern cloud-native environments demand observable systems, automated mitigations, and well-practiced playbooks. Investing in instrumentation, SLOs, role clarity, and postmortem culture reduces both the frequency and the impact of major incidents over time.<\/p>\n\n\n\n<p>Plan for the next 7 days:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Audit critical SLIs and ensure telemetry exists for the top 3 user journeys.<\/li>\n<li>Day 2: Verify escalation policies and on-call contacts; run a paging drill.<\/li>\n<li>Day 3: Review and update the top 5 runbooks; add missing runbooks for critical flows.<\/li>\n<li>Day 4: Build or refine on-call and executive dashboards for SLO burn rates (see the burn-rate sketch above).<\/li>\n<li>Day 5\u20137: Run a focused game day simulating one major incident and complete a mini postmortem.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Major incident Keyword Cluster (SEO)<\/h2>\n\n\n\n<p>Primary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>major incident<\/li>\n<li>major incident management<\/li>\n<li>major incident response<\/li>\n<li>major incident playbook<\/li>\n<li>major incident definition<\/li>\n<\/ul>\n\n\n\n<p>Secondary keywords<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>incident command system<\/li>\n<li>SRE major incident<\/li>\n<li>major outage handling<\/li>\n<li>incident commander role<\/li>\n<li>major incident runbook<\/li>\n<\/ul>\n\n\n\n<p>Long-tail questions<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>what is a major incident in IT operations<\/li>\n<li>how to declare a major incident<\/li>\n<li>major incident vs outage vs incident<\/li>\n<li>how to measure major incident impact<\/li>\n<li>best practices for major incident response<\/li>\n<li>how to write a major incident runbook<\/li>\n<li>how to recover from a major outage<\/li>\n<li>major incident communication templates<\/li>\n<li>when to notify executives during a major incident<\/li>\n<li>how to measure SLO during a major incident<\/li>\n<li>how to use feature flags to mitigate incidents<\/li>\n<li>automating major incident mitigation with playbooks<\/li>\n<li>running game days for major incidents<\/li>\n<li>handling security incidents during a major outage<\/li>\n<li>how to prioritize multiple major incidents<\/li>\n<li>roles in a major incident response team<\/li>\n<li>telemetry required for major incident detection<\/li>\n<li>major incident postmortem template<\/li>\n<li>how to track remediation after a major incident<\/li>\n<li>major incident escalation checklist<\/li>\n<\/ul>\n\n\n\n<p>Related terminology<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLO monitoring<\/li>\n<li>SLI definitions<\/li>\n<li>error budget burn<\/li>\n<li>postmortem analysis<\/li>\n<li>incident timeline<\/li>\n<li>war room coordination<\/li>\n<li>incident management platform<\/li>\n<li>logging and tracing<\/li>\n<li>synthetic monitoring<\/li>\n<li>chaos engineering<\/li>\n<li>canary deploy rollback<\/li>\n<li>feature flag mitigation<\/li>\n<li>automated rollback<\/li>\n<li>burn-rate alerts<\/li>\n<li>observability debt<\/li>\n<li>incident severity levels<\/li>\n<li>incident frequency metrics<\/li>\n<li>MTTR measurement<\/li>\n<li>MTTA metrics<\/li>\n<li>incident commander checklist<\/li>\n<li>communications lead duties<\/li>\n<li>runbook testing<\/li>\n<li>incident game day<\/li>\n<li>blameless postmortem<\/li>\n<li>security incident response<\/li>\n<li>legal and compliance in incidents<\/li>\n<li>cloud failover strategies<\/li>\n<li>multi-region failover<\/li>\n<li>backup and restore drills<\/li>\n<li>cost-aware incident mitigation<\/li>\n<li>CI\/CD rollback procedures<\/li>\n<li>Kubernetes incident response<\/li>\n<li>serverless incident mitigation<\/li>\n<li>provider outage handling<\/li>\n<li>third-party dependency fallback<\/li>\n<li>billing and quota incident handling<\/li>\n<li>incident lifecycle stages<\/li>\n<li>incident action item tracking<\/li>\n<li>on-call fatigue mitigation<\/li>\n<li>tooling for incident response<\/li>\n<li>incident response automation<\/li>\n<\/ul>