Quick Definition
SEV3 is an incident severity classification indicating moderate impact to user experience or internal processes without critical business-wide outage. Analogy: SEV3 is like a traffic slowdown on a highway lane — inconvenient but not a complete closure. Formal: SEV3 denotes degraded service with measurable user/system impact requiring remediation within a defined SLA window.
What is SEV3?
What it is / what it is NOT
- SEV3 is a defined incident severity level commonly used in SRE and incident response to classify events that materially affect users or internal workflows but do not constitute a platform-wide outage.
- It is NOT a critical outage (SEV1) nor a purely informational alert (SEV5 or lower in many orgs). It also is not a permanent label; incidents may be escalated or de-escalated.
- SEV3 often implies single-region degradations, feature-specific failures, intermittent errors, degraded performance, or partial data inconsistency affecting a subset of users.
Key properties and constraints
- Moderate user impact with a known workaround or partial mitigation.
- Remediation is expected within hours rather than minutes.
- Requires SRE/engineering involvement but often not full-blown incident commander activation.
- Tracked against SLIs/SLOs and consumes part of the error budget.
- Triggered by alerts tuned to reduce noise; typically aggregated symptoms rather than single noisy alarms.
Where it fits in modern cloud/SRE workflows
- In routing and prioritization of incidents during on-call shifts.
- As a classification in ticketing and postmortems to determine remediation and RCA depth.
- As input into capacity planning, release gating, and change windows.
- Useful for automated incident triage and AI-assisted incident summarization.
A text-only “diagram description” readers can visualize
- User requests hit CDN/edge -> edge routes to service mesh -> service A calls service B and database -> a subset of requests to service B see 5–15% errors -> monitoring triggers aggregated error rate threshold -> on-call engineer receives SEV3 page -> mitigation applied (traffic split or feature flag) -> triage creates SEV3 ticket -> SRE schedules fix and tracks SLO impact.
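The “aggregated error rate threshold” step in the flow above can be sketched as a sliding-window check. This is a minimal illustration; the window size and 5% threshold are placeholder values, not figures from any standard.

```python
from collections import deque


class ErrorRateWindow:
    """Sliding-window error-rate check, like the aggregated threshold
    in the flow above. Window size and threshold are illustrative."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.samples = deque(maxlen=window)  # True = request succeeded
        self.threshold = threshold

    def observe(self, ok: bool) -> bool:
        """Record one request outcome; return True when the windowed
        error rate breaches the threshold (i.e., time to open a SEV3)."""
        self.samples.append(ok)
        errors = self.samples.count(False)
        return errors / len(self.samples) > self.threshold
```

Aggregating over a window like this is what keeps a single failed request from paging anyone, while a sustained 5–15% error rate does.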
SEV3 in one sentence
SEV3 is a moderate-severity incident classification indicating degraded functionality or performance affecting a subset of users or services that requires prioritized remediation within hours but not immediate full-incident escalation.
SEV3 vs related terms
| ID | Term | How it differs from SEV3 | Common confusion |
|---|---|---|---|
| T1 | SEV1 | Critical outage affecting most users | Confused with SEV3 when impact is delayed |
| T2 | SEV2 | High-impact but localized outage | People mix SEV2 and SEV3 by symptom severity |
| T3 | SEV4 | Low-impact or informational alert | SEV4 sometimes misclassified as SEV3 |
| T4 | Incident | General event requiring response work | Not every incident is a SEV3; severity varies |
| T5 | Alert | Monitoring signal | Alerts do not always indicate SEV3 |
| T6 | Outage | Service unavailable | Outage implies broader impact than SEV3 |
| T7 | Degradation | Performance loss | Degradation may be SEV3 or SEV2 depending on scope |
| T8 | P0 | Priority label in ticketing | Priority mapping varies across orgs |
| T9 | RCA | A postmortem analysis process, not a severity level | RCA depth scales with severity, not the label |
Why does SEV3 matter?
Business impact (revenue, trust, risk)
- Revenue: Persistent SEV3 events can erode conversion rates and incremental revenue if not addressed quickly.
- Trust: Repeated moderate degradations reduce user trust and increase churn risk.
- Risk: SEV3 incidents consume engineering time and can mask more serious underlying issues; they affect SLA commitments and partner agreements.
Engineering impact (incident reduction, velocity)
- Time spent triaging SEV3s reduces development velocity and diverts teams from feature work.
- Proper classification enables focused remediation without unnecessary full-incident mobilization.
- Reduces toil when automation and runbooks exist to handle common SEV3 patterns.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SEV3 events should map to specific SLIs that feed SLOs; exceedance informs error budget burn.
- Error budgets for SEV3-class incidents often drive rate-limiters on releases.
- On-call teams use SEV3 to prioritize paging rules, escalation paths, and shift handovers.
- Runbooks reduce toil by codifying mitigations for known SEV3s.
Realistic “what breaks in production” examples
- A payment gateway returns 10% 502 errors for a subset of geographies due to a backend API degradation.
- Search results latency increases 2–3x during peak hours for 20% of queries due to an inefficient query path.
- A feature flag rollout exposes a bug causing missing metadata in user profiles for new signups.
- Background batch jobs for analytics slow down, causing delayed reports but not transactional failures.
- An auto-scaling misconfiguration leaves one availability zone under-provisioned, degrading throughput.
Where is SEV3 used?
| ID | Layer/Area | How SEV3 appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge | Increased edge errors or partial cache miss | 5xx rate, cache miss rate | CDN logs and metrics |
| L2 | Network | Intermittent packet loss or elevated latency | p50/p95 latency, retransmits | NPM, cloud VPC metrics |
| L3 | Service | Partial 4xx/5xx in microservice | error rate, request latency | APM, service mesh |
| L4 | Application | Feature-specific failures | user error rate, feature flag metrics | App logs, feature flag platform |
| L5 | Data | Stale reads or partial replication lag | QPS, replication lag | DB metrics, monitoring |
| L6 | Infra IaaS | VM-level performance spike | CPU, IO wait, host errors | Cloud provider monitoring |
| L7 | Platform PaaS | Runtime degradation in managed services | instance health, queue depth | Managed service dashboards |
| L8 | Kubernetes | Pod restarts or degraded readiness | pod restarts, liveness probes | K8s metrics and events |
| L9 | Serverless | Increased cold starts or throttles | invocation errors, throttled count | Serverless dashboards, logs |
| L10 | CI/CD | Failing or flaky pipelines causing rollout delays | pipeline success rate | CI/CD runs and logs |
| L11 | Observability | Missing or delayed telemetry | metric gaps, log gaps | Observability stack |
| L12 | Security | Partial policy enforcement failures | auth failures, access errors | IAM logs, WAF metrics |
When should you use SEV3?
When it’s necessary
- A subset of users experiences degraded functionality with no immediate complete workaround.
- Performance degradation affecting key user flows but not causing total outage.
- Non-critical data inconsistency that impacts analytics or reporting but needs a fix.
When it’s optional
- Minor feature regressions with low user impact and available workarounds.
- Single-event alerts that are unlikely to recur and do not affect SLIs.
When NOT to use / overuse it
- Don’t mark every alert SEV3; overuse dilutes urgency and on-call focus.
- Not for routine maintenance or planned degradations with adequate notice.
- Not for transient one-off spikes that self-resolve quickly unless they recur.
Decision checklist
- If error rate > X% for > Y minutes affecting critical flows -> SEV3.
- If latency doubled for major user cohort and no direct workaround -> SEV3.
- If transaction failures affect all users -> escalate to SEV2 or SEV1.
- If alert is informational or single-sample anomaly -> do not page.
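The checklist above can be expressed as a small triage helper. All concrete values here (5% errors, 10 minutes, 2x latency) are illustrative stand-ins for the org-specific X% and Y-minute thresholds, and the return labels are hypothetical.

```python
def classify_severity(error_rate: float, minutes: int, latency_ratio: float,
                      all_users_failing: bool, critical_flow: bool) -> str:
    """Toy triage helper mirroring the decision checklist.

    The 5% / 10-minute / 2x values are illustrative placeholders for
    the X% and Y-minute thresholds each org defines for itself.
    """
    if all_users_failing:
        # Transaction failures affecting all users: escalate past SEV3.
        return "SEV1/SEV2"
    if critical_flow and error_rate > 0.05 and minutes >= 10:
        return "SEV3"
    if critical_flow and latency_ratio >= 2.0:
        # Latency doubled for a major cohort with no direct workaround.
        return "SEV3"
    # Informational or single-sample anomaly: do not page.
    return "ticket-only"
```

For example, `classify_severity(0.08, 15, 1.0, False, True)` returns `"SEV3"`: an 8% error rate sustained for 15 minutes on a critical flow.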
Maturity ladder: Beginner -> Intermediate -> Advanced
- Beginner: Manual classification, basic runbooks, Slack paging.
- Intermediate: Automated triage rules, SLI mapping, scheduled runbooks.
- Advanced: AI-assisted triage, automated mitigations, dynamic SLO adjustments, chaos-tested runbooks.
How does SEV3 work?
Components and workflow
- Detection: Monitoring or user reports indicate a symptom mapped to an SLI threshold.
- Triage: On-call or automation assesses scope and impact; determines SEV3 classification.
- Containment: Apply short-term mitigations (feature flag rollback, traffic reroute).
- Remediation: Code fix, configuration change, scaling action.
- Recovery verification: SLI measurements confirm service returned to SLO.
- Post-incident: Create ticket, run RCA, update runbooks and automation.
Data flow and lifecycle
- Telemetry pipeline emits metrics/logs/traces -> alerting rules evaluate -> triage annotates alert -> incident created and tagged SEV3 -> work proceeds in incident ticket -> telemetry shows recovery -> SLO update and postmortem.
Edge cases and failure modes
- False positives due to noisy alerts.
- Escalation loops when SEV3 masks a hidden SEV1 cause.
- Automation failing to apply mitigation, causing further disruption.
- Observability blind spots that prevent accurate scope determination.
Typical architecture patterns for SEV3
- Pattern: Canary feature flag + gradual rollout
  - When to use: New features, mitigations available, reduces blast radius.
- Pattern: Circuit breaker + fallback path
  - When to use: External dependencies with variable latency or errors.
- Pattern: Read replica routing for heavy reads
  - When to use: Data tier read latency causing partial degradation.
- Pattern: Autoscaling with buffer and warm pools
  - When to use: Intermittent load spikes causing slowdowns.
- Pattern: Traffic mirroring for testing fixes
  - When to use: Validate fixes on a copy of production traffic without impacting users.
- Pattern: Alert aggregation and dedupe pipeline
  - When to use: Reduce noisy correlated alerts into a single SEV3 incident.
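The alert aggregation and dedupe pattern might look like this minimal sketch. The alert fields (`service`, `signature`) are hypothetical, not the schema of any specific alerting tool.

```python
from collections import defaultdict


def aggregate_alerts(alerts):
    """Group raw alerts by (service, signature) so that correlated
    noise collapses into one candidate SEV3 incident per group.

    Field names are illustrative; real pipelines key on whatever
    labels their alerting system emits."""
    grouped = defaultdict(list)
    for alert in alerts:
        key = (alert["service"], alert["signature"])
        grouped[key].append(alert)
    # Emit one incident candidate per group, carrying the duplicate count.
    return [{"service": svc, "signature": sig, "count": len(items)}
            for (svc, sig), items in grouped.items()]
```

Three identical `5xx-spike` alerts from the same service thus become one incident with `count: 3` rather than three separate pages.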
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Noisy alerting | Frequent pages for similar issue | Low thresholds or metric flapping | Tune thresholds and use aggregation | Alert flood, many instances |
| F2 | Blind spots | Unable to scope impact | Missing instrumentation | Add SLIs and traces | Missing metrics, sparse traces |
| F3 | Escalation gap | SEV3 hides SEV1 root | Poor triage rules | Escalation playbook and diagnostics | Rapid SLI deterioration |
| F4 | Automation failure | Mitigation not applied | Broken automation scripts | Fail-safe manual steps | Automation error logs |
| F5 | Resource starvation | Slow responses during peak | Misconfigured autoscaling | Adjust autoscaling and warm pools | High CPU, queue depth |
| F6 | Dependency flakiness | Intermittent 502/503 | Downstream instability | Circuit breaker and retries | Spiky error rates |
| F7 | Rollout regression | New deploy causes partial failures | Bad release or flag | Rollback or disable flag | Spike in error rate post-deploy |
Key Concepts, Keywords & Terminology for SEV3
- SEV1 — Highest severity incident classification meaning full outage — prioritizes immediate action — misuse inflates urgency
- SEV2 — High-impact but not total outage — often requires rapid mitigation — mislabeling causes confusion
- SEV3 — Moderate-impact incident as defined in this guide — balances remediation speed and effort — overuse reduces signal
- SLI — Service Level Indicator; measurable signal of user experience — maps incidents to user impact — poorly chosen SLIs mislead
- SLO — Service Level Objective; target for SLIs — guides error budget and priorities — unrealistic SLOs cause churn
- SLA — Service Level Agreement; contractual uptime obligation — carries financial/legal risk — conflating SLA and SLO is common
- Error budget — Allowable SLO violation window — enables controlled risk-taking — ignored budgets lead to outages
- On-call — Rotating duty to respond to incidents — critical for remediation — poor rotations cause burnout
- Incident commander — Role to coordinate response — clarifies responsibilities — missing role causes chaos
- Triage — Rapid assessment of scope and impact — determines severity — slow triage prolongs incidents
- Runbook — Prescribed steps to mitigate known issues — reduces toil — outdated runbooks mislead responders
- Playbook — Broader set of response strategies including decisions — aids complex incidents — too generic reduces applicability
- Observability — Ability to understand system behavior from telemetry — essential for diagnosis — partial observability creates blind spots
- Telemetry — Metrics, logs, traces used to monitor systems — feeds alerts and dashboards — excess telemetry cost can be high
- APM — Application Performance Monitoring; traces and performance metrics — helps diagnose latency causes — overhead if poorly configured
- Alert fatigue — Excessive alerts leading to ignored pages — reduces responsiveness — needs dedupe and prioritization
- Correlation — Linking events across systems — key to scope incidents — missing correlation leads to duplicated effort
- Aggregation — Combining noisy signals into meaningful alerts — reduces noise — over-aggregation masks problems
- Root Cause Analysis (RCA) — Postmortem finding root cause — prevents repeat incidents — blames individuals if poorly run
- Postmortem — Documentation of incident and remediation — drives learning — shallow postmortems repeat mistakes
- Canary deploy — Gradual rollout to subset of users — limits blast radius — improper canary size skews results
- Feature flag — Toggle to enable/disable features at runtime — aids quick remediation — flag debt causes complexity
- Circuit breaker — Pattern to stop calls to failing dependencies — prevents cascading failures — aggressive breakers block healthy traffic
- Retry policy — Retry failed requests with backoff — improves resiliency — improper retries cause load amplification
- Backpressure — Mechanism to slow producers when consumers are saturated — maintains stability — incorrect backpressure leads to dropped requests
- Capacity planning — Predicting resource needs — avoids resource starvation — over-provisioning wastes cost
- Autoscaling — Dynamic scaling based on load — handles variable traffic — misconfigured policies cause oscillations
- Throttling — Limiting requests to protect systems — prevents collapse — throttling critical flows hurts UX
- Rate limiting — Policy to restrict request rates — defends against spikes — unfair limits affect legitimate users
- Observability pipeline — Ingest and storage for telemetry — enables analysis — pipeline delays slow detection
- Sampling — Reducing trace volume by sampling — controls cost — low sampling misses rare issues
- Distributed tracing — Traces through service calls — shows request path — missing trace context breaks traceability
- Latency SLO — Objective for request response time — ties to UX — focusing only on p95 may miss long tails
- Availability SLO — Objective for service uptime — tracks user-facing reliability — multiple definitions confuse teams
- Mean Time To Detect (MTTD) — Time to notice incidents — shorter means faster response — long MTTD increases damage
- Mean Time To Repair (MTTR) — Time to restore service — direct measure of operability — ignored MTTR hides process issues
- Blast radius — Scope of impact from a change — smaller is safer — unmeasured radius surprises teams
- Chaos engineering — Deliberate fault injection to test resilience — uncovers gaps — poorly controlled experiments risk production
- Synthetic monitoring — Periodic checks simulating user flows — detects regressions — synthetic tests may miss real user distribution
- Real user monitoring (RUM) — Captures real client-side metrics — reflects actual user impact — privacy considerations apply
- Pager — Notification that requires immediate attention — connects people to incidents — paging unnecessary for low-severity alerts
- Escalation policy — Rules to escalate incidents — ensures resolution — rigid policies can cause premature escalation
- Incident review — Regular review of incident trends — drives systemic fixes — low participation reduces value
How to Measure SEV3 (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Error rate (user-facing) | Fraction of failed user requests | failed requests / total per minute | <1% for critical flows | Short windows noisy |
| M2 | Latency p95 | Tail latency impacting UX | measure request durations and compute p95 | p95 < 500ms | p95 hides p99 issues |
| M3 | Success rate by region | Localized degradation | segment success rate by region | >99% per region | Small regions noisy |
| M4 | Feature flag failure rate | Feature-specific errors | errors tied to flag context | <0.5% | Missing flag context in logs |
| M5 | Queue depth | Backlog indicating processing lag | queue length per worker | below threshold for 99% time | Sudden spikes can be transient |
| M6 | Replication lag | Data freshness impact | measured seconds lag | <5s for critical data | Varied by DB topology |
| M7 | Pod restart rate | App instability in K8s | restarts per pod per hour | <0.1/hr | Crash loops produce noise |
| M8 | Cold start rate | Serverless startup impact | fraction of cold invocations | <5% | Depends on invocation patterns |
| M9 | Synthetic success | End-to-end check health | scheduled probes pass ratio | 100% ideally | Synthetics miss user diversity |
| M10 | MTTD | Detection velocity | time from incident to alert | <5m for critical flows | Detection depends on instrumentation |
| M11 | MTTR | Remediation velocity | time from page to recovery | <2h for SEV3 typical | Depends on runbooks and automation |
| M12 | Error budget burn | SLO consumption rate | measure SLI vs SLO | Keep burn under 20% per deploy | Sudden spikes can deplete budgets |
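Two of the metrics above, latency percentiles (M2) and error budget burn (M12), can be computed as in this sketch. Note that production systems usually derive percentiles from histogram buckets rather than sorting raw samples; the nearest-rank method here is for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95.

    Illustrative only: real monitoring stacks compute this from
    pre-aggregated histogram buckets, not raw request samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_budget_burn(failed, total, slo=0.999):
    """Fraction of the window's error budget consumed.

    With a 99.9% SLO, the budget is 0.1% of requests; burn > 1.0
    means the window has overspent its budget."""
    allowed = (1 - slo) * total
    return failed / allowed if allowed else float("inf")
```

For example, 2 failures out of 10,000 requests against a 99.9% SLO burns 20% of that window's budget.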
Best tools to measure SEV3
Tool — Prometheus + Thanos
- What it measures for SEV3: Time-series metrics aggregation for SLIs and alerting
- Best-fit environment: Kubernetes and cloud-native stacks
- Setup outline:
- Instrument apps with metrics libraries
- Configure scraping targets and rules
- Define recording rules and alerting thresholds
- Integrate with long-term storage like Thanos
- Strengths:
- Flexible query language and ecosystem
- Works well in Kubernetes
- Limitations:
- Requires operational effort at scale
- Long-term retention needs extra components
Tool — Datadog
- What it measures for SEV3: Metrics, traces, logs and synthetics consolidated
- Best-fit environment: Multi-cloud teams and managed stacks
- Setup outline:
- Install agents and instrument SDKs
- Configure APM and synthetics
- Create SLOs and dashboards
- Strengths:
- Integrated UI and quick setup
- Strong alerting and dashboards
- Limitations:
- Cost at scale
- Vendor lock-in considerations
Tool — Grafana + Loki + Tempo
- What it measures for SEV3: Dashboards, logs and traces routing
- Best-fit environment: Open-source or self-managed observability
- Setup outline:
- Configure Prometheus metrics source
- Route logs to Loki and traces to Tempo
- Build dashboards and alerting
- Strengths:
- Highly customizable and cost effective
- Open ecosystem
- Limitations:
- Requires integration effort
- Operational overhead for scale
Tool — New Relic
- What it measures for SEV3: APM and real user monitoring
- Best-fit environment: Web apps and distributed services
- Setup outline:
- Install language agents
- Enable browser RUM and mobile monitoring
- Set up alerting and SLOs
- Strengths:
- Good for deep application insights
- Ease of use
- Limitations:
- Pricing model can be complex
- Data retention trade-offs
Tool — Cloud Provider Native Monitoring (CloudWatch/GCP Stackdriver/Azure Monitor)
- What it measures for SEV3: Infra, managed service metrics, logs
- Best-fit environment: Teams heavily using a single cloud
- Setup outline:
- Enable service metrics and logs
- Create dashboards and alarms
- Integrate with incident routing
- Strengths:
- Native integration with managed services
- No additional agents for many services
- Limitations:
- Cross-cloud correlation is harder
- Differences across clouds complicate portability
Recommended dashboards & alerts for SEV3
Executive dashboard
- Panels: Overall SLO burn rate, top impacted services, business KPIs (transactions/min), number of SEV3 incidents this week.
- Why: Provides leaders quick view of reliability trends and impact on business metrics.
On-call dashboard
- Panels: Current active SEV3 incidents, per-service error rates, recent deploys, runbook links.
- Why: Enables quick triage and access to remediation steps.
Debug dashboard
- Panels: Request rate, error rate, latency percentiles, downstream dependency health, traces for recent errors, logs filtered by trace IDs.
- Why: Provides deep diagnostics for engineers doing root cause analysis.
Alerting guidance
- What should page vs ticket: Page for SEV3 when user-impacting SLI thresholds crossed and no automatic mitigation; create ticket for low-impact alerts or when a runbook handles it automatically.
- Burn-rate guidance: Use error budget burn rates to trigger deployment freezes when burn exceeds predetermined thresholds (e.g., >50% burn in 24h).
- Noise reduction tactics: Use dedupe, grouping by service or signature, suppression windows for noisy maintenance, use composite alerts to reduce duplicate pages.
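The burn-rate freeze guidance above can be sketched as follows. This assumes a 30-day SLO window, a 99.9% SLO, and the >50%-of-budget-in-24h rule from the text; all three numbers are illustrative and should be replaced with each team's own values.

```python
def burn_rate(bad_fraction: float, slo: float = 0.999) -> float:
    """Burn rate = observed bad fraction / allowed bad fraction.
    A rate of 1.0 exhausts the budget exactly over the SLO period."""
    return bad_fraction / (1 - slo)


def should_freeze_deploys(bad_24h: float, slo: float = 0.999,
                          budget_fraction_limit: float = 0.5) -> bool:
    """Apply the illustrative '>50% of budget burned in 24h' rule,
    assuming a 30-day SLO window."""
    window_fraction = 24 / (30 * 24)  # 24h as a share of 30 days
    budget_spent = burn_rate(bad_24h, slo) * window_fraction
    return budget_spent > budget_fraction_limit
```

Under these assumptions, a sustained 2% bad-request fraction over 24 hours burns roughly two thirds of a 30-day budget and would trigger a freeze, while 0.5% would not.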
Implementation Guide (Step-by-step)
1) Prerequisites
- Ownership matrix and escalation policy defined.
- Baseline observability: key metrics, traces, logs in place.
- CI/CD with versioning and rollback ability.
- Access control for runbook execution and rollback.
2) Instrumentation plan
- Identify critical user journeys and map SLIs.
- Instrument request counts, latencies, error reasons, and tracing.
- Add context metadata: region, feature flag, user cohort.
3) Data collection
- Set up metrics pipeline with retention aligned to postmortem needs.
- Configure log aggregation and indexing.
- Ensure traces propagate context across services.
4) SLO design
- Define SLOs per critical flow and per region/service.
- Set alert thresholds tied to SLO breaches and error budget burn.
- Communicate SLOs to stakeholders.
5) Dashboards
- Build executive, on-call, and debug dashboards.
- Include deploy history and recent config changes.
- Link runbooks and contact info on dashboards.
6) Alerts & routing
- Implement routing rules to page appropriate teams.
- Use escalation policies and on-call rotations.
- Configure suppression windows for planned maintenance.
7) Runbooks & automation
- Create stepwise runbooks for common SEV3 scenarios.
- Automate safe mitigations where possible (feature flag toggle).
- Version runbooks and test them in rehearsals.
8) Validation (load/chaos/game days)
- Run load tests simulating SEV3-class degradations.
- Inject faults in chaos experiments to validate mitigations.
- Conduct game days to exercise on-call processes.
9) Continuous improvement
- Track incident metrics (MTTD, MTTR, recurrence).
- Update SLOs and runbooks based on learnings.
- Prioritize engineering work to reduce SEV3 frequency.
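The “automate safe mitigations (feature flag toggle)” step could look like this sketch. `flag_store` stands in for a real feature-flag service, and all names and thresholds here are hypothetical.

```python
def auto_mitigate(flag_store: dict, flag: str, error_rate: float,
                  threshold: float = 0.05) -> bool:
    """Illustrative safe automated mitigation: disable a feature flag
    when its associated error rate breaches the threshold.

    flag_store is a stand-in for a real feature-flag platform; the
    5% threshold is a placeholder, not a recommended value."""
    if error_rate > threshold and flag_store.get(flag, False):
        flag_store[flag] = False  # turn the feature off
        return True               # mitigation applied; annotate the incident
    return False                  # nothing to do (already off or healthy)
```

Because the toggle is idempotent and easily reversed, it is the kind of mitigation that is safe to automate; riskier actions should keep a manual fallback step.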
Checklists
Pre-production checklist
- Critical SLIs instrumented and validated.
- SLOs defined and communicated.
- Runbooks written for likely SEV3s.
- Synthetic checks in place for main flows.
- CI/CD rollback tested.
Production readiness checklist
- Alerting rules reviewed and deduped.
- On-call rotations and escalation configured.
- Dashboards accessible and linked to runbooks.
- Feature flags available for rapid rollback.
- Chaos experiments planned for resilience validation.
Incident checklist specific to SEV3
- Confirm SEV3 classification and scope.
- Notify stakeholders and create ticket.
- Apply mitigation per runbook or feature flag.
- Measure SLI recovery and document actions.
- Schedule RCA and update runbooks.
Use Cases of SEV3
1) Payment gateway intermittent errors
- Context: Payments from certain region failing at 10% rate.
- Problem: Revenue leakage and failed checkouts.
- Why SEV3 helps: Prioritizes mitigation without full outage escalation.
- What to measure: Payment success rate, latency, gateway error codes.
- Typical tools: APM, payment gateway logs, synthetic checks.
2) Search latency spike for subset queries
- Context: Complex queries causing p95 spikes.
- Problem: Bad UX for search-heavy users.
- Why SEV3 helps: Enables focused fix on query paths or caching.
- What to measure: p95/p99 latency, cache hit rates.
- Typical tools: Tracing, metrics, analytics.
3) Feature flag rollout bug
- Context: New feature causes missing metadata for new users.
- Problem: Incomplete user profiles and downstream errors.
- Why SEV3 helps: Rollback using flag mitigates impact quickly.
- What to measure: Errors tied to flag, user profile completeness.
- Typical tools: Feature flag platform, logs.
4) K8s pod restarts affecting background jobs
- Context: Cron jobs restart creating processing backlog.
- Problem: Delayed processing but core app unaffected.
- Why SEV3 helps: Allocation of infra fixes without full incident mobilization.
- What to measure: Pod restarts, job queue depth, catch-up time.
- Typical tools: K8s metrics, job monitoring.
5) Data replication lag
- Context: Replica lag causing stale reads in analytics.
- Problem: Reports and dashboards inaccurate.
- Why SEV3 helps: Prioritize DB config fix and throttling.
- What to measure: Replication lag seconds, affected queries.
- Typical tools: DB monitoring, query logs.
6) CDN cache miss storm
- Context: High cache churn causing origin load.
- Problem: Elevated latency and origin costs.
- Why SEV3 helps: Optimize caching rules or purge strategy.
- What to measure: Cache hit ratio, origin latency.
- Typical tools: CDN metrics, logs.
7) CI/CD pipeline flakiness delaying deployments
- Context: Intermittent test failures blocking feature rollouts.
- Problem: Reduced velocity and release delays.
- Why SEV3 helps: Triage and fix flaky tests or isolate pipeline.
- What to measure: Pipeline success rate and flakiness rate.
- Typical tools: CI/CD logs, test isolation tools.
8) Authentication provider throttling
- Context: Third-party auth service limiting requests occasionally.
- Problem: Login failures for a user subset.
- Why SEV3 helps: Implement retries and backoff or fallback method.
- What to measure: Auth error rates, retry success.
- Typical tools: IAM logs, APM.
9) Serverless cold start latency increase
- Context: Cold starts spike causing user-facing latency.
- Problem: Poor user experience in certain operations.
- Why SEV3 helps: Prioritize warm-up strategies or provisioning.
- What to measure: Cold start fraction, invocation latency.
- Typical tools: Serverless provider metrics.
10) Observability pipeline lag
- Context: Delayed metrics leading to late detection.
- Problem: Incidents detected too late.
- Why SEV3 helps: Classify as moderate incident and remediate ingestion pipeline.
- What to measure: Ingestion latency, metric gaps.
- Typical tools: Observability stack logs and metrics.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Partial pod readiness causing degraded API
Context: A microservice in Kubernetes experiences increased p95 latency and occasional 500s caused by one node’s tainted GPU drivers.
Goal: Restore normal latency and eliminate errors for 95% of requests.
Why SEV3 matters here: Only a subset of pods on a node affected; not a whole-cluster outage.
Architecture / workflow: Client -> API service (K8s) -> downstream DB; pod readiness probes failing on one node.
Step-by-step implementation:
- Detect spike via p95 alert.
- Triage to node-level using pod metrics and node events.
- Evacuate affected pods by cordoning node and draining.
- Roll out patched node image or restart kubelet drivers.
- Re-schedule pods and monitor SLI recovery.
What to measure: pod restarts, node conditions, p95 latency, error rate.
Tools to use and why: Prometheus for metrics, kubectl for remediation, APM for traces.
Common pitfalls: Not correlating node events with errors; draining causes momentary increased load.
Validation: Verify p95 and error rate returned to SLO and no recurrence for next 24h.
Outcome: Targeted remediation reduced blast radius and preserved production stability.
Scenario #2 — Serverless/managed-PaaS: Throttling in managed database causing failed writes
Context: A managed NoSQL provider throttles writes during peak, leading to 503s for certain write-heavy endpoints.
Goal: Reduce user-visible write failures and mitigate data loss risk.
Why SEV3 matters here: Affects write-heavy workflows for a subset of users; not a full product outage.
Architecture / workflow: Client -> API -> serverless function -> managed DB; throttling emerges under load.
Step-by-step implementation:
- Alert on increased 5xx write errors.
- Apply exponential backoff and queueing in serverless function.
- Temporarily route heavy flows to alternate write path or buffer in durable queue.
- Work with provider to increase capacity or optimize indexes.
- Monitor for reduction in write error rate.
What to measure: write error rate, throttle count, queue depth.
Tools to use and why: Cloud provider metrics, logs, serverless tracing.
Common pitfalls: Buffered writes causing delayed data visibility; queue overflow.
Validation: Successful write rate and acceptable queue drain time.
Outcome: Mitigation reduced immediate user impact while provider-side scaling completed.
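The exponential backoff applied in this scenario can be sketched with “full jitter”, where each retry waits a random delay up to an exponentially growing ceiling. The base delay and cap below are illustrative values, not provider recommendations.

```python
import random


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0):
    """Exponential backoff with full jitter, as used to absorb
    managed-DB throttling in the scenario above.

    base and cap (seconds) are illustrative; jitter spreads retries
    so throttled clients do not all retry in lockstep."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

A caller would sleep for each delay between retries; the cap keeps the worst-case wait bounded even after many consecutive throttles.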
Scenario #3 — Incident-response/postmortem: Recurring SEV3 due to flaky circuit breaker
Context: Intermittent downstream failures trip circuit breaker causing partial functionality loss.
Goal: Reduce recurrence and improve resilience.
Why SEV3 matters here: Repeated moderate incidents erode reliability and increase toil.
Architecture / workflow: API -> internal service -> external dependency; circuit breaker misconfiguration opens prematurely.
Step-by-step implementation:
- Triage incident and classify as SEV3.
- Reconfigure circuit breaker thresholds for better hysteresis.
- Add better fallback behavior and caching where possible.
- Document change and create runbook for similar incidents.
- Conduct RCA to identify root cause of downstream flakiness.
What to measure: circuit open rate, fallback invocation rate, user error rate.
Tools to use and why: APM, tracing, circuit breaker metrics.
Common pitfalls: Tuning that hides real issues; masking rather than fixing dependency.
Validation: Reduced circuit openings and fewer SEV3 repeats over 30 days.
Outcome: Lower incident frequency and clearer mitigation paths.
Scenario #4 — Cost/performance trade-off: Reducing cost causes increased p99 latency for analytics
Context: Cost cutbacks reduce the analytics cluster size, increasing p99 latency and delaying reports.
Goal: Balance cost savings with acceptable SLOs for analytics workloads.
Why SEV3 matters here: Degraded analytics affects business decisions but not transactional flows.
Architecture / workflow: ETL -> analytics cluster -> dashboards; reduced compute causes delays.
Step-by-step implementation:
- Identify SLI impacts and map to business value.
- Implement dynamic scaling for peak windows instead of constant high capacity.
- Introduce backpressure and prioritize critical jobs.
- Schedule non-critical jobs off-peak.
- Monitor SLO and cost metrics to find the optimal point.
What to measure: job completion time, p99 latency, cost per run.
Tools to use and why: Cluster monitoring, job schedulers, cost analytics.
Common pitfalls: Over-optimization causing missed SLAs for critical reports.
Validation: Cost is lower while SLOs are met for critical jobs.
Outcome: Sustainable cost/performance balance with acceptable reliability.
Common Mistakes, Anti-patterns, and Troubleshooting
1) Symptom: Repeated SEV3 pages each week -> Root cause: Overly broad alerting -> Fix: Tune thresholds and aggregate alerts
2) Symptom: Incomplete postmortems -> Root cause: No ownership or template -> Fix: Enforce postmortem templates and action items
3) Symptom: Runbooks outdated -> Root cause: No version control -> Fix: Store runbooks in repo and review periodically
4) Symptom: High MTTR -> Root cause: Lack of automation for mitigation -> Fix: Automate common rollback and recovery steps
5) Symptom: Observability gaps during incidents -> Root cause: Missing logs/traces for flows -> Fix: Instrument critical paths and add trace context
6) Symptom: Alert fatigue -> Root cause: Too many low-signal alerts -> Fix: Implement dedupe and composite alerts
7) Symptom: SEV3 masks underlying SEV1 -> Root cause: Poor triage rules -> Fix: Improve escalation decision trees
8) Symptom: Deployment causes SEV3 regressions -> Root cause: Poor testing/canary -> Fix: Use canary deploys and progressive rollouts
9) Symptom: No clear owner for SEV3 -> Root cause: Undefined ownership matrix -> Fix: Define ownership by service and shift
10) Symptom: Alerts during maintenance -> Root cause: No suppression rules -> Fix: Suppress alerts for scheduled changes
11) Symptom: Too many false positives -> Root cause: Single-sample alerts -> Fix: Use sliding windows and composite logic
12) Symptom: Runbook execution errors -> Root cause: Untrusted automation -> Fix: Add validation and manual fallback steps
13) Symptom: Observability data overload -> Root cause: Excessive cardinality in metrics -> Fix: Reduce cardinality and use labels wisely
14) Symptom: SEV3 recurring for same root cause -> Root cause: No corrective action taken -> Fix: Track action items and ensure closure in sprints
15) Symptom: Cost spike after mitigation -> Root cause: Scale-up mitigations not reverted -> Fix: Automate rollback of temporary scaling
16) Symptom: On-call burnout -> Root cause: High SEV3 frequency and poor rotations -> Fix: Hire, reduce toil, rotate fairly
17) Symptom: Slow detection of SEV3s -> Root cause: Insufficient synthetic checks -> Fix: Add targeted synthetics and RUM
18) Symptom: Debug info unavailable -> Root cause: Redaction or log sampling too aggressive -> Fix: Balance privacy with debug needs, enrich traces
19) Symptom: Inconsistent severity mapping -> Root cause: No incident taxonomy -> Fix: Define and train teams on severity definitions
20) Symptom: Too many stakeholders alerted -> Root cause: Broad notification lists -> Fix: Reduce to minimal necessary teams and use escalation
21) Symptom: Observability pipeline lag -> Root cause: Backpressure or misconfig -> Fix: Scale ingestion and monitor pipeline health
22) Symptom: Alerts tied to single host -> Root cause: Lack of aggregation -> Fix: Use service-level aggregation and dedupe
23) Symptom: Flaky tests cause deploy blocks -> Root cause: Poor test isolation -> Fix: Quarantine flaky tests and stabilize pipeline
24) Symptom: Security events treated as SEV3 -> Root cause: Improper classification -> Fix: Separate security incident process and integrate with ops
Observability-specific pitfalls deserve particular attention:
- Observability pitfall 1: Missing trace context -> Root cause: Not propagating trace headers -> Fix: Ensure middleware propagates trace IDs
- Observability pitfall 2: High-cardinality metrics -> Root cause: Using user IDs as labels -> Fix: Remove PII and high-cardinality labels
- Observability pitfall 3: Log sampling hides errors -> Root cause: Aggressive sampling configs -> Fix: Preserve error logs with higher sampling
- Observability pitfall 4: Metric gaps during deployment -> Root cause: Metric exporter restarts -> Fix: Buffer metrics and use durable export
- Observability pitfall 5: Synthetics not reflecting users -> Root cause: Limited probe coverage -> Fix: Expand probes to cover major user scenarios
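Pitfall 1 (missing trace context) can be addressed with a small propagation helper at the service boundary. The `X-Trace-Id` header name below is an assumption for illustration; many real systems use the W3C `traceparent` header instead.

```python
import uuid

# Assumed header name for this sketch; W3C Trace Context uses 'traceparent'.
TRACE_HEADER = "X-Trace-Id"


def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise mint a new one.
    The same ID should be attached to log lines and outbound calls so
    traces stay stitched together across services."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outbound_headers(trace_id, extra=None):
    """Build headers for a downstream call, propagating the trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers
```

The key design point is that propagation must happen in shared middleware, not per-endpoint: one handler that forgets to forward the header breaks the trace for every service downstream of it.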
Best Practices & Operating Model
Ownership and on-call
- Assign clear service ownership with primary and secondary on-call.
- Define escalation paths and role responsibilities (IC, comms, RCA owner).
- Keep rotations reasonable and provide handover notes.
Runbooks vs playbooks
- Runbooks: Step-by-step mitigations for known issues; runnable by on-call without deep context.
- Playbooks: Decision trees for complex incidents requiring judgement; include escalation points.
- Keep runbooks versioned and tested via game days.
Safe deployments (canary/rollback)
- Use canaries with automated health checks and automatic rollback on threshold breaches.
- Use feature flags for rapid and safe rollbacks.
- Record deploy metadata in dashboards for correlation.
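The canary health-check idea above can be sketched as a promote/rollback decision. The absolute and relative thresholds here are illustrative policy choices, not prescribed values; real systems would also compare latency and check statistical significance.

```python
def canary_verdict(baseline_error_rate, canary_error_rate,
                   abs_limit=0.05, rel_limit=2.0):
    """Decide whether a canary should be promoted or rolled back.
    Roll back if the canary's error rate exceeds an absolute ceiling,
    or is more than rel_limit times the baseline rate (both thresholds
    are assumed example policies)."""
    if canary_error_rate > abs_limit:
        return "rollback"
    if baseline_error_rate > 0 and canary_error_rate / baseline_error_rate > rel_limit:
        return "rollback"
    return "promote"
```

Using both an absolute and a relative check matters: a relative check alone pages on noise when the baseline is near zero, and an absolute check alone misses a canary that triples an already-nontrivial error rate.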
Toil reduction and automation
- Automate common mitigations and verification steps.
- Use templates for incident tickets and postmortems to reduce administrative work.
- Invest in self-healing where safe; ensure manual overrides exist.
Security basics
- Ensure runbook access is controlled and audited.
- Do not expose sensitive keys in logs.
- Include security checks in deployment pipelines to avoid introducing vulnerabilities during fixes.
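One way to keep keys out of logs is regex-based redaction at the logging boundary. The patterns below are illustrative sketches and would need tuning to the secret formats your systems actually emit; redaction is a backstop, not a substitute for not logging secrets in the first place.

```python
import re

# Illustrative patterns only; extend to match your real secret formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key\s*[=:]\s*)\S+"),
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
]


def redact(line):
    """Replace secret values with a placeholder, keeping the key name
    so logs remain useful for debugging."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(r"\1[REDACTED]", line)
    return line
```

Applying this in a single logging filter (rather than at each call site) keeps coverage consistent, which echoes observability pitfall 3: redaction should remove secret values, not drop whole error lines.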
Weekly/monthly routines
- Weekly: Review recent SEV3s and action item progress.
- Monthly: Review SLO burn rates and adjust alerts and runbooks.
- Quarterly: Run game days and chaos experiments to test mitigations.
What to review in postmortems related to SEV3
- Correctness of severity classification.
- Time to detect and remediate (MTTD/MTTR).
- Whether runbooks were used and effective.
- Action items and ownership for preventing recurrence.
- Any SLO or alert tuning required.
Tooling & Integration Map for SEV3
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics | Store and query time-series metrics | APM, dashboards, alerting | Prometheus or managed metric services |
| I2 | Tracing | Track distributed request flows | APM, logs, dashboards | Correlate with traces for debug |
| I3 | Logging | Aggregate logs for forensics | Tracing, alerts | Index error logs with trace IDs |
| I4 | Alerting | Evaluate rules and notify on-call | Pager, ticketing | Supports escalation paths |
| I5 | Incident Mgmt | Create and track incident lifecycle | Alerts, runbooks, comms | Playback and RCA storage |
| I6 | Runbook | Document mitigation steps | Dashboards, alerts | Version-controlled runbooks |
| I7 | Feature Flags | Toggle features safely | CI/CD, dashboards | Quick mitigation control |
| I8 | CI/CD | Build and rollout automation | Deploy dashboards, observability | Enables canary and rollbacks |
| I9 | Chaos | Fault injection for resilience | Observability, incident drills | Controlled experiments |
| I10 | Synthetic | Simulate user flows periodically | Dashboards, alerting | Detect regressions early |
| I11 | Cost Mgmt | Monitor cost vs performance | Dashboards, infra | Inform trade-offs |
| I12 | Security | IAM and WAF monitoring | Alerts and logs | Separate incident channels for security |
Frequently Asked Questions (FAQs)
What is the standard timeframe to resolve a SEV3?
Typically within a few hours; exact timeframe varies by organization and SLOs.
Who should be paged for SEV3 incidents?
Primary on-call for the affected service and a secondary on-call; avoid paging broad lists.
Can SEV3 be automated?
Partial automation is recommended for detection and containment; complete automation depends on risk tolerance.
Does SEV3 always require an RCA?
Yes, at minimum a lightweight post-incident review; depth varies by impact and recurrence.
How does SEV3 affect error budgets?
SEV3 incidents consume error budget relative to the SLI impact; track burn to adjust releases.
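As a rough sketch of how a single incident consumes error budget: with an availability SLO, the budget over a window is the allowed fraction of "bad" time, and an incident's consumption can be weighted by how many requests it actually affected. The availability framing below is an assumption; adapt it to your actual SLI.

```python
def error_budget_consumed(slo_target, window_minutes,
                          incident_minutes, error_fraction):
    """Fraction of the window's error budget consumed by one incident.
    slo_target: e.g. 0.999 availability over the window.
    error_fraction: share of requests failing during the incident,
    used to weight partial outages (a SEV3 rarely fails 100%)."""
    budget = (1 - slo_target) * window_minutes       # allowed 'bad minutes'
    bad_minutes = incident_minutes * error_fraction  # impact-weighted downtime
    return bad_minutes / budget
```

For example, a 99.9% SLO over 30 days allows about 43.2 bad minutes; a 90-minute SEV3 failing 10% of requests consumes roughly 21% of that budget, which is why even moderate incidents must be tracked against burn.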
How to avoid alert fatigue with SEV3 alerts?
Aggregate signals, use composite alerts, and tune thresholds to reduce false positives.
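The sliding-window idea can be sketched as a tiny evaluator that fires only on sustained breaches, not on a single bad sample. Window size and threshold below are illustrative; production alerting systems express the same idea with rule durations (e.g. "above threshold for 5 minutes").

```python
from collections import deque


class SlidingWindowAlert:
    """Fire only when the error rate stays above the threshold for a
    full evaluation window, suppressing one-sample spikes that cause
    false positives and alert fatigue."""

    def __init__(self, threshold, window_size):
        self.threshold = threshold
        self.samples = deque(maxlen=window_size)

    def observe(self, error_rate):
        """Record one sample; return True if the alert should fire."""
        self.samples.append(error_rate)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)
```

A single good sample resets the condition, so transient blips never page while a sustained breach still fires within one window length.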
Should customers be notified for SEV3?
If customer-facing functionality is materially impacted, notify affected customers with context and ETA.
How to decide SEV2 vs SEV3?
Assess scope, user impact, and availability of workarounds; SEV2 is more severe or has greater scope.
Are SEV3s included in monthly reliability reports?
Yes; include SEV3 counts, trends, and action item progress in reliability dashboards.
How to test SEV3 runbooks?
Use game days and simulated incidents; rehearse runbooks with on-call personnel.
What KPIs track SEV3 health?
MTTD, MTTR, SEV3 frequency, SLO burn rate, and recurring issue rate.
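MTTD and MTTR for a single incident can be derived directly from its timestamps. This minimal sketch measures MTTR from incident start to resolution; some teams measure it from detection instead, so pick one convention and apply it consistently.

```python
from datetime import datetime


def incident_kpis(started, detected, resolved, fmt="%Y-%m-%dT%H:%M"):
    """Compute MTTD and MTTR (in minutes) for one incident from its
    start, detection, and resolution timestamps (ISO-like strings)."""
    t0 = datetime.strptime(started, fmt)
    t1 = datetime.strptime(detected, fmt)
    t2 = datetime.strptime(resolved, fmt)
    return {
        "mttd_min": (t1 - t0).total_seconds() / 60,  # time to detect
        "mttr_min": (t2 - t0).total_seconds() / 60,  # time to resolve
    }
```

Averaging these per-incident values over a month gives the MTTD/MTTR trend lines for the reliability reports mentioned above.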
How granular should SLIs be for SEV3?
SLIs should be specific to user journeys and segmented by region/feature for accurate scope.
Is SEV3 the same across companies?
No; severity taxonomy and thresholds vary by organization and business-criticality.
When should SEV3 be escalated to SEV2 or SEV1?
If impact widens, SLIs show continued deterioration, or critical business functions are affected.
How to integrate SEV3 into CI/CD?
Fail fast on canary SLI breaches, block rollouts if error budget burn crosses thresholds.
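The error-budget gate can be sketched as a burn-rate comparison: block deploys when the budget is being spent faster than it accrues over the window. The 1.5x block ratio below is an assumed example policy, not a standard.

```python
def deploy_gate(budget_consumed_fraction, days_elapsed,
                days_in_window=30, block_ratio=1.5):
    """Block rollouts when the error budget is burning faster than it
    accrues. A burn ratio of 1.0 means 'on pace to exactly exhaust the
    budget by the end of the window'; block_ratio is an assumed policy
    threshold for when to freeze deploys."""
    expected = days_elapsed / days_in_window  # budget 'earned' so far
    if expected == 0:
        return "allow"  # window just started; nothing to compare against
    burn_ratio = budget_consumed_fraction / expected
    return "block" if burn_ratio > block_ratio else "allow"
```

Wiring this check into the CI/CD pipeline alongside canary SLI checks gives two independent brakes: canaries catch regressions in the new release, while the budget gate throttles all releases when reliability is already strained.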
Should security incidents be labeled SEV3?
Security incidents have their own classification; integrate but follow security response processes.
How to measure the cost of SEV3 incidents?
Track engineer hours, mitigation infrastructure cost, and business metric impact during incident.
How often should SEV3 runbooks be reviewed?
At least quarterly, or after every occurrence to ensure relevance.
Conclusion
SEV3 represents a useful middle ground in incident taxonomy — high enough to warrant prioritized action but not so high as to trigger full incident mobilization. In modern cloud-native environments, thoughtful instrumentation, clear runbooks, targeted automation, and SLO-driven alerting are the pillars of managing SEV3 effectively. Treat SEV3 as both an operational signal and a learning opportunity: reduce recurrence through RCA and automation, and protect team focus by avoiding over-classification.
Next 7 days plan
- Day 1: Inventory current SEV3 incidents and map to SLIs and runbooks.
- Day 2: Tune alert thresholds and aggregate noisy alerts.
- Day 3: Create or update runbooks for top three recurring SEV3 patterns.
- Day 4: Implement or test one automated mitigation (feature flag rollback).
- Day 5: Run a short game day to rehearse SEV3 response.
- Day 6: Review SLOs and error budgets; adjust deploy policies.
- Day 7: Schedule postmortem reviews and assign corrective work to sprints.
Appendix — SEV3 Keyword Cluster (SEO)
- Primary keywords
- SEV3 incident
- SEV3 severity
- SEV3 definition
- SEV3 SRE
- SEV3 monitoring
- SEV3 runbook
- SEV3 metrics
- SEV3 SLO
- Secondary keywords
- incident severity level 3
- moderate outage classification
- SRE severity taxonomy
- SEV3 examples
- SEV3 best practices
- SEV3 alerting
- SEV3 triage
- SEV3 mitigation
- SEV3 on-call
- SEV3 postmortem
- Long-tail questions
- What is a SEV3 incident in SRE?
- How to measure SEV3 impact with SLIs?
- When to classify an incident as SEV3?
- How to write a SEV3 runbook?
- How does SEV3 affect error budgets?
- What tools help detect SEV3 incidents?
- How to automate SEV3 mitigations?
- What is the difference between SEV2 and SEV3?
- How to reduce SEV3 frequency?
- How to triage SEV3 incidents effectively?
- What dashboards are needed for SEV3?
- How to set SLOs related to SEV3?
- How to measure MTTR for SEV3?
- How to avoid alert fatigue with SEV3?
- What are typical SEV3 failure modes?
- Related terminology
- SLO
- SLI
- error budget
- MTTR
- MTTD
- runbook
- playbook
- on-call rotation
- observability
- synthetic monitoring
- real user monitoring
- circuit breaker
- feature flag
- canary deployment
- autoscaling
- chaos engineering
- tracing
- APM
- log aggregation
- alert dedupe
- composite alert
- incident commander
- RCA
- postmortem
- telemetry pipeline
- Kubernetes monitoring
- serverless metrics
- managed PaaS monitoring
- CI/CD pipeline
- rollback strategy
- capacity planning
- cost-performance trade-off
- throttling metrics
- replication lag
- cold starts
- queue depth
- pod restarts
- region-specific errors
- feature flagging platforms
- incident management systems