Quick Definition
SEV2 is a mid-to-high priority incident classification indicating partial service degradation or significant impact to a subset of users or business functions. Analogy: a major traffic jam blocking key lanes but not the entire highway. Formal: an incident causing degraded functionality with measurable business impact that requires a coordinated engineering response.
What is SEV2?
SEV2 is an incident severity level used by SRE and operations teams to prioritize response, allocate on-call resources, and drive remediation. It is neither a full site-wide outage (SEV1) nor a low-priority ticket (SEV3/SEV4). SEV2 typically requires immediate attention, cross-team coordination, and mitigation to restore acceptable service levels within hours rather than minutes or days.
Key properties and constraints:
- Targets: subset of customers, specific features, or non-critical regions.
- Impact: measurable revenue or user-affecting degradation but not total outage.
- Response window: immediate wake-up for primary on-call with escalation to subject matter experts.
- Communication: public status updates often required; no mandatory full executive war room.
- Automation: playbooks often include mitigation scripts, throttles, and circuit breakers.
Where it fits in modern cloud/SRE workflows:
- Triggered by alerts crossing SLO thresholds or APM anomalies.
- Handled via incident commander + domain leads with follow-up postmortem.
- Integrated with CI/CD rollbacks, traffic shaping, feature flags, and autoscaling.
- Often monitored using distributed tracing, synthetic tests, and error budgets.
Diagram description (text-only):
- User requests hit edge layer -> load balancer -> API layer -> service mesh -> backend services and databases. SEV2 typically originates in one service or region, causing elevated error rates or latency that cascade to related services. Mitigation flows: detect via observability -> page on-call -> run mitigations (traffic reroute, rollback, config change) -> monitor SLO recovery -> postmortem.
SEV2 in one sentence
SEV2 is a coordinated incident classification for significant partial service degradation that requires rapid engineering response and cross-team coordination to restore service levels without a full outage declaration.
SEV2 vs related terms
| ID | Term | How it differs from SEV2 | Common confusion |
|---|---|---|---|
| T1 | SEV1 | Full or near-full outage with executive impact | Both demand fast escalation, blurring the line |
| T2 | SEV3 | Lower-priority impact or less urgent | SEV3 sometimes escalates to SEV2 later |
| T3 | P1 | Prioritization system may match SEV2 but varies by org | P1 label mapping differs across companies |
| T4 | Alert | Raw signal that may or may not be SEV2-worthy | Alerts are not incidents by themselves |
| T5 | Incident | Container for SEV2 but can be other severities | Incident is generic, severity is specific |
| T6 | Outage | SEV2 is partial outage, not complete outage | Partial vs total outage distinction |
| T7 | Page | Notification mechanism, not severity | Paging does not equal SEV2 |
| T8 | SLA violation | SEV2 may or may not trigger SLA breach | SLA depends on contract terms |
Row Details
- T3: P1 meaning varies by organization; could map to SEV1 or SEV2 depending on business rules.
Why does SEV2 matter?
Business impact:
- Revenue: Partial degradation can reduce conversions, subscriptions, or transactional throughput, causing measurable revenue loss if unmitigated.
- Trust: Repeated SEV2 incidents erode customer confidence more than isolated minor incidents.
- Compliance & contracts: Some SEV2 incidents can trigger contractual SLAs or financial penalties depending on service terms.
Engineering impact:
- Incident reduction: Proper SEV2 handling reduces escalation frequency and recurring issues by enabling faster diagnosis and targeted fixes.
- Velocity: Clear runbooks and automation for SEV2 prevent developers from manual firefighting and maintain feature delivery cadence.
- Toil reduction: Automating common SEV2 mitigations (circuit breakers, throttles, rollbacks) reduces repetitive manual steps.
SRE framing:
- SLIs/SLOs: SEV2 often corresponds to SLIs approaching SLO breach thresholds as the error budget nears exhaustion.
- Error budgets: Use SEV2 frequency as a signal to throttle feature rollouts or pause risky deployments.
- On-call: SEV2 should trigger on-call escalation patterns and a defined incident commander role to coordinate.
Realistic “what breaks in production” examples:
- API returns 50% errors for a major endpoint in one region after a config change.
- Payment processor timeout causing a backlog of transactions and customer-facing errors.
- Search subsystem latency spikes to several seconds causing checkout abandonment.
- Authentication service intermittent failures affecting new user signup.
- Background job queue backlog causing data freshness issues for dashboards.
Where is SEV2 used?
| ID | Layer/Area | How SEV2 appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge/Network | Increased 5xx from ingress in a subset region | 5xx rate, TCP resets, latency p95 | Load balancer logs, CDN metrics, synthetic |
| L2 | Service/API | Elevated error rates on critical endpoints | Error rate, latency, traces | APM, tracing, service mesh |
| L3 | Application | Feature-specific failures for subset of users | Exceptions, logs, user complaints | Logging platforms, feature flag systems |
| L4 | Data/DB | Slow queries or partial data loss | Query latency, replication lag | DB monitoring, slow query logs |
| L5 | Kubernetes | Pod restarts or evicted nodes causing degraded service | Pod restarts, CPU throttling, events | K8s metrics, kube-state, Prometheus |
| L6 | Serverless/PaaS | Function cold start or throttling causing partial failures | Invocation errors, throttles, duration | Cloud provider metrics, traces |
| L7 | CI/CD | Bad deploy causing regression to subset users | Deploy success, canary metrics | CI logs, deployment orchestration |
| L8 | Observability/Security | Alerting gaps or security blocks causing impact | Missing telemetry, blocked endpoints | Observability stack, WAF, SIEM |
Row Details
- L1: Edge issues often show region-specific user complaints and CDN origin health checks.
- L5: Kubernetes pod restarts can originate from resource limits, image pull failures, or liveness probe misconfiguration.
- L6: Serverless throttling often comes from concurrency limits or cold-start latencies.
When should you use SEV2?
When it’s necessary:
- Significant subset of users experience degraded core functionality.
- Business metrics are negatively trending and impact is measurable.
- Error budget consumption is high and near SLO breach.
When it’s optional:
- Non-critical feature degradation affecting small portion of traffic with acceptable fallback.
- Internal tooling issues where workarounds exist and no customer impact.
When NOT to use / overuse it:
- Single-user edge-case bugs that do not affect others.
- Known maintenance windows and planned degradations.
- False-positive alerts lacking corroborating telemetry.
Decision checklist:
- If user-facing error rate > X% for a critical endpoint AND revenue impact observed -> declare SEV2.
- If internal-only errors AND no service degradation -> use ticketing/SEV3.
- If full service outage across regions OR executive impact -> escalate to SEV1.
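As a sketch, the checklist above can be encoded as a triage helper. The function name, parameters, and the 5% threshold (standing in for the unspecified X%) are illustrative assumptions, not a standard rubric:

```python
def classify_severity(error_rate: float, revenue_impact: bool,
                      customer_facing: bool, all_regions_down: bool) -> str:
    """Toy triage helper mirroring the decision checklist.

    The 5% error-rate threshold is an illustrative stand-in for the
    org-specific X% in the checklist above.
    """
    if all_regions_down:
        return "SEV1"  # full outage or executive impact -> escalate
    if customer_facing and error_rate > 0.05 and revenue_impact:
        return "SEV2"  # degraded critical endpoint with measurable impact
    return "SEV3"      # internal-only or low impact -> ticket, not page

sev = classify_severity(error_rate=0.10, revenue_impact=True,
                        customer_facing=True, all_regions_down=False)
```

In practice the thresholds live in a severity rubric document, and the on-call confirms the classification with a human judgment call rather than trusting a single metric.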
Maturity ladder:
- Beginner: Manual detection, simple on-call rotation, basic runbooks.
- Intermediate: Automated detection, canary rollback, feature flags, structured postmortems.
- Advanced: Automated mitigations, dynamic SLO-driven rollout control, AI-assisted triage and root cause suggestions.
How does SEV2 work?
Components and workflow:
- Detection: Alert engine triggers from metrics, logs, traces, or synthetic tests.
- Triage: On-call evaluates impact using dashboards and decides SEV2 classification.
- Response: Incident commander assigned, mitigations executed (traffic shift, rollback, throttle).
- Coordination: Cross-team communication, status updates, and escalation to subject matter experts.
- Resolution: Fix applied and monitored until SLOs return to acceptable range.
- Postmortem: RCA, remediation plan, and follow-ups scheduled.
Data flow and lifecycle:
- Observability collects telemetry -> alerting rules inspect SLIs -> incidents created -> human or automation executes runbook -> mitigation applied -> telemetry shows recovery -> incident closed -> postmortem artifacts stored.
Edge cases and failure modes:
- Telemetry blackout -> make decisions from user reports and external synthetic checks.
- Automation misfire -> have manual kill-switch and rollback paths.
- Mixed signals across regions -> isolate region and route traffic accordingly.
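A minimal sketch of tracking this lifecycle with timestamps, from which time-to-mitigate and time-to-restore fall out directly; the event names (`paged`, `mitigated`, `restored`) are illustrative:

```python
from datetime import datetime, timedelta

def incident_durations(events: dict) -> dict:
    """Derive time-to-mitigate and time-to-restore from lifecycle
    timestamps (paged -> mitigated -> restored)."""
    return {
        "time_to_mitigate": events["mitigated"] - events["paged"],
        "time_to_restore": events["restored"] - events["paged"],
    }

t0 = datetime(2024, 1, 1, 12, 0)
timeline = {
    "paged": t0,                              # on-call paged
    "mitigated": t0 + timedelta(minutes=25),  # traffic rerouted
    "restored": t0 + timedelta(hours=2),      # SLOs back in range
}
durations = incident_durations(timeline)
```

Recording these transitions as the incident unfolds (rather than reconstructing them afterward) keeps the postmortem timeline accurate.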
Typical architecture patterns for SEV2
- Canary + Progressive Rollback: Use canary metrics to detect regression then roll back canary or pause rollout.
- Circuit Breaker with Fallback: Protect downstream services and provide degraded but functional responses.
- Traffic Shifting by Region: Reroute traffic away from unhealthy region to healthy ones using global load balancer.
- Feature Flag Isolation: Turn off problematic features for affected cohorts quickly.
- Autoscaling + Throttling: Combine autoscaling to handle load with throttles to protect critical paths.
- Observability-Driven Ramp: Use SLO-driven deployment pipelines that halt on SEV2-like thresholds.
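The circuit-breaker pattern above can be sketched in a few lines; this toy version counts consecutive failures and serves a fallback once tripped (production implementations add half-open probes and time-based reset):

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips open after `threshold`
    consecutive failures and serves a fallback instead."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.threshold:
            return fallback()          # open: fail fast, degraded response
        try:
            result = fn()
            self.failures = 0          # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            return fallback()

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise RuntimeError("downstream timeout")

for _ in range(3):
    resp = breaker.call(flaky, lambda: "cached-degraded-response")
```

Once open, the breaker stops hammering the unhealthy dependency, which is what prevents the cascading failure described in F5 below.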
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Telemetry gap | Missing metrics/logs | Agent failure or ingestion outage | Fallback to synthetic and logs | Drop in metric volume |
| F2 | Alert storm | Many similar alerts | Cascading failures or noisy rules | Grouping and suppress duplicates | High alert count |
| F3 | Automation rollback fail | Failed automated rollback | Bad rollback script or missing permissions | Manual rollback and permissions fix | Failed job logs |
| F4 | Misrouted traffic | Users hit wrong region | DNS or load balancer config error | Reconfigure LB, revert recent changes | Region error spike |
| F5 | Dependency degradation | Downstream errors increase | Third-party or shared service issue | Circuit breaker and degrade features | Increased downstream latencies |
| F6 | Resource exhaustion | High OOM or CPU leading to restarts | Memory leak or bad config | Scale or restart with fix, patch code | Pod restarts, OOM logs |
Row Details
- F1: Telemetry gaps require alternate data sources such as client-side logs or third-party synthetic monitoring.
- F3: Automation rollback failure often happens when scripts assume idempotency or lack sufficient RBAC.
- F5: Circuit breakers prevent cascading failures by tripping after error thresholds and opening fallback paths.
Key Concepts, Keywords & Terminology for SEV2
Glossary (40+ terms):
- Incident commander — Person coordinating SEV2 response — Ensures unified action — Pitfall: ambiguous authority.
- On-call rotation — Schedule for responders — Ensures 24/7 coverage — Pitfall: burnout without limits.
- Runbook — Step-by-step play for mitigations — Speeds response — Pitfall: stale instructions.
- Playbook — Strategy for recurring incidents — Standardizes responses — Pitfall: over-generalization.
- SLI — Service Level Indicator — Measures key user-facing behavior — Pitfall: choosing irrelevant SLI.
- SLO — Service Level Objective — Target for SLI — Guides reliability investments — Pitfall: unrealistic targets.
- Error budget — Allowable failure window — Enables risk during releases — Pitfall: ignored budget breaches.
- Observability — Ability to understand system state — Critical for triage — Pitfall: telemetry gaps.
- Tracing — Distributed request tracking — Helps root cause — Pitfall: sampling hides errors.
- Metrics — Numeric system measurements — Useful for thresholds — Pitfall: high-cardinality overload.
- Logs — Event records — Useful for root cause — Pitfall: unstructured noisy logs.
- Synthetic testing — Proactive checks emulating user paths — Detects regressions — Pitfall: not representative.
- Feature flag — Toggle to enable/disable features — Rapid mitigation tool — Pitfall: flag debt.
- Circuit breaker — Fails fast to protect systems — Prevents cascading failures — Pitfall: too-aggressive tripping.
- Canary deployment — Small percentage rollout — Limits blast radius — Pitfall: insufficient traffic.
- Blue-green deploy — Full environment swap — Fast rollback — Pitfall: cost overhead.
- Autoscaling — Adjust resources to load — Mitigates overloads — Pitfall: scaling latency.
- Throttling — Limit request rate — Preserves stability — Pitfall: poor UX.
- Backpressure — Signals to slow producers — Controls queue growth — Pitfall: not propagated.
- Quorum — Required nodes for consensus — Important for DB availability — Pitfall: split-brain.
- Replication lag — Delay between DB replicas — Causes stale reads — Pitfall: hidden by caches.
- Latency p50/p95/p99 — Percentile latency measures — Shows user experience — Pitfall: focusing only on p50.
- Availability — Uptime metric — Business-facing reliability measure — Pitfall: ignores partial degradations.
- Degraded mode — Reduced functionality state — Keeps core services running — Pitfall: missing user communication.
- Rollback — Revert to previous stable release — Fast remediation — Pitfall: data migrations complicate rollback.
- Hotfix — Quick patch to production — Short-term fix — Pitfall: introduces technical debt.
- Postmortem — Analysis after incident — Captures RCA and action items — Pitfall: lack of follow-through.
- RCA — Root cause analysis — Identifies underlying causes — Pitfall: blames symptoms.
- Paging platform — Notification system for paging on-call — Triggers response — Pitfall: misconfigured escalation policies.
- Incident timeline — Chronological events — Useful in postmortem — Pitfall: incomplete logs.
- Blast radius — Scope of impact — Guides mitigation strategies — Pitfall: unknown dependencies increase radius.
- Dependency graph — Map of service interactions — Aids impact analysis — Pitfall: outdated diagrams.
- Synthetics vs real user metrics — Simulated vs actual behavior — Complements observability — Pitfall: relying only on one.
- Alert deduplication — Reduces noise by grouping alerts — Improves signal-to-noise — Pitfall: over-aggregation hides issues.
- Burn rate — Speed of error budget consumption — Indicates pacing of incidents — Pitfall: misinterpreted thresholds.
- Immutable infrastructure — Deployable artifacts are never modified in place — Reduces config drift — Pitfall: operational overhead.
- Blue/Green database migration — Strategies for data updates — Reduces migration risk — Pitfall: complex coordination.
- Runbook automation — Scripts for standard steps — Speeds response — Pitfall: automation bugs.
- Observability pipeline — Ingestion and storage of telemetry — Foundation for detection — Pitfall: single point of failure.
- Feature cohort — Subset of users for experiments or mitigations — Controls exposure — Pitfall: nondeterministic segmentation.
- Incident SLA — Contractual response obligations — Business requirement — Pitfall: confusing internal SLOs with external SLAs.
- Synthetic health checks — Regular automated checks — Early warning system — Pitfall: poor coverage.
How to Measure SEV2 (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Error rate | Fraction of failed requests | Failed requests divided by total | <1% for critical endpoints | Aggregation can hide spikes |
| M2 | Latency p95 | User-tail latency at 95th percentile | Measure request durations per endpoint | <500ms for APIs | High-cardinality affects compute |
| M3 | Availability | Uptime of service or endpoint | Successful requests over total | 99.9% for critical paths | Partial outages may not show |
| M4 | Throughput | Requests per second | Count requests per unit time | Baseline plus 20% buffer | Bursts may exceed capacity |
| M5 | Time to mitigate | Time from page to mitigation | Timestamp logs from incident create to mitigation | <30-60 minutes | Depends on mitigation type |
| M6 | Time to restore | Time from page to full recovery | Incident timestamps to SLO recovery | <4 hours typical for SEV2 | Varied by org policy |
| M7 | Error budget burn rate | Speed of SLO consumption | Error budget used per time window | Alert when burn >2x | Requires accurate SLO definition |
| M8 | User impact rate | % of users affected | Affected user count divided by total | <5% before SEV2 | Hard to segment users |
| M9 | Deploy failure rate | Fraction of faulty deployments | Failed deploys over total deploys | <0.5% | Canary coverage needed |
| M10 | Synthetic success rate | Health of emulated flows | Successes over checks | >99% | Synthetics may not represent real traffic |
Row Details
- M5: Time to mitigate measures a quick temporary fix, not full resolution; useful for prioritizing automations.
- M7: Burn rate needs consistent error budget window; short windows can be noisy.
- M8: Calculating affected users may require tracing or correlation IDs and can be imprecise.
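Burn rate (M7) reduces to a simple ratio: observed error fraction over the error budget the SLO allows. A sketch with illustrative numbers:

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error fraction divided by the error budget implied by
    the SLO (99.9% availability -> 0.1% budget)."""
    budget = 1.0 - slo
    return (failed / total) / budget

# 30 failures out of 10,000 requests against a 99.9% SLO:
rate = burn_rate(failed=30, total=10_000, slo=0.999)  # ~3x budget
should_page = rate > 2.0  # matches the ">2x sustained" paging guidance
```

A burn rate of 1x means the error budget would be exactly spent by the end of the SLO window; 3x means it runs out in a third of the window, which is why sustained rates above 2x typically page.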
Best tools to measure SEV2
Tool — Prometheus
- What it measures for SEV2: Metrics, alerting, and basic SLOs.
- Best-fit environment: Kubernetes and cloud-native stacks.
- Setup outline:
- Instrument services with metrics libraries.
- Deploy Prometheus scrape config.
- Configure alertmanager for escalation.
- Define recording rules and SLO dashboards.
- Strengths:
- Flexible querying and reliable ecosystem.
- Strong Kubernetes integration.
- Limitations:
- Scalability needs remote storage for large volumes.
- No built-in tracing.
Tool — Grafana
- What it measures for SEV2: Dashboards aggregating metrics, traces, logs.
- Best-fit environment: Multi-source observability visualization.
- Setup outline:
- Connect data sources.
- Build executive, on-call, and debug dashboards.
- Configure alerting and notification channels.
- Strengths:
- Powerful visualization and templating.
- Supports many data sources.
- Limitations:
- Alerting complexity at scale.
- Dashboards require maintenance.
Tool — OpenTelemetry + Collector
- What it measures for SEV2: Tracing and unified telemetry.
- Best-fit environment: Distributed microservices.
- Setup outline:
- Instrument SDK in services.
- Deploy collector for batching/export.
- Route to APM, storage, and analytics.
- Strengths:
- Vendor-neutral and extensible.
- Correlates traces, metrics, and logs.
- Limitations:
- Sampling decisions affect completeness.
- Collector configuration complexity.
Tool — Pager/Incident platform (Varies)
- What it measures for SEV2: Incident lifecycle, notifications, escalation.
- Best-fit environment: Any organization with on-call.
- Setup outline:
- Configure on-call schedules.
- Integrate alerting sources.
- Define escalation policies.
- Strengths:
- Standardized incident flow.
- Audit trails and reporting.
- Limitations:
- Integration maintenance.
- Cost at scale.
Tool — APM (Varies)
- What it measures for SEV2: Traces, service maps, slow transactions.
- Best-fit environment: Service-oriented applications.
- Setup outline:
- Add APM agents to services.
- Configure sampling and dashboards.
- Use service maps to identify dependencies.
- Strengths:
- Quick root cause insights.
- Transaction-level visibility.
- Limitations:
- Can be expensive for high-volume tracing.
- Closed-source vendors may limit customization.
Recommended dashboards & alerts for SEV2
Executive dashboard:
- Panels: Global availability, error budget burn rate, revenue impact estimate, open SEV2 count.
- Why: Provide concise business and reliability snapshot for executives.
On-call dashboard:
- Panels: Per-endpoint SLI charts (error rate, latency), recent deploys, active incidents, top errors with traces, synthetic checks.
- Why: Fast triage with actionable views and ownership.
Debug dashboard:
- Panels: Traces for recent failures, span duration breakdown, logs correlated by trace ID, resource utilization per pod, dependency error matrix.
- Why: Deep dive to identify root cause quickly.
Alerting guidance:
- What should page vs ticket:
- Page for SEV1 and SEV2 when immediate human action is required.
- Ticket for SEV3/SEV4 or monitoring-only issues.
- Burn-rate guidance:
- Page when burn rate >2x sustained over 30 minutes for critical SLOs.
- Noise reduction tactics:
- Deduplicate similar alerts by grouping keys, suppress during known maintenance windows, use dynamic thresholds and rolling windows.
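Deduplication by grouping keys can be sketched as follows; the key fields (`service`, `region`) are illustrative, and real alert managers add time windows and inhibition rules on top:

```python
from collections import defaultdict

def group_alerts(alerts, keys=("service", "region")):
    """Collapse an alert storm by grouping on stable keys, so one
    incident is created per root-cause group instead of per alert."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[k] for k in keys)].append(alert)
    return groups

storm = [
    {"service": "api", "region": "us-east", "msg": "5xx spike"},
    {"service": "api", "region": "us-east", "msg": "latency p95 high"},
    {"service": "search", "region": "eu-west", "msg": "timeout"},
]
grouped = group_alerts(storm)
# three raw alerts collapse into two candidate incidents
```

Choosing grouping keys that map to a likely root cause (service + region, or deployment ID) is what keeps the grouping from hiding distinct issues.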
Implementation Guide (Step-by-step)
1) Prerequisites:
- Defined SLIs and SLOs for critical user journeys.
- On-call rotation and escalation policies.
- Instrumentation libraries and observability ingestion pipeline.
- Feature flag and deployment rollback mechanisms.
2) Instrumentation plan:
- Identify critical endpoints and transactions.
- Add metrics (latency, error), structured logs, and traces.
- Ensure correlation IDs propagate end-to-end.
3) Data collection:
- Configure metrics scraping and log ingestion.
- Deploy OpenTelemetry collector or vendor equivalents.
- Set up synthetic checks for critical flows.
4) SLO design:
- Map SLIs to SLOs with realistic targets and error budgets.
- Define alert thresholds tied to SLO burn and absolute errors.
5) Dashboards:
- Build executive, on-call, and debug dashboards with templated views per service.
- Include recent deploy annotations and an incident timeline panel.
6) Alerts & routing:
- Configure alertmanager/escalations to page for SEV2-worthy conditions.
- Use grouping and annotations to include playbook links.
7) Runbooks & automation:
- Author concise runbooks for common SEV2 scenarios.
- Implement scripted automations for safe rollbacks, traffic shifts, and feature flag toggles.
8) Validation (load/chaos/game days):
- Run chaos exercises targeting a single service and region to validate mitigations.
- Perform game days with cross-team roles to practice SEV2 response.
9) Continuous improvement:
- Schedule postmortem reviews, track action item completion, and refine SLOs.
- Regularly test runbook accuracy and automation reliability.
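The runbook-automation step, together with the manual kill-switch called for in the edge cases earlier, can be sketched like this; the `MITIGATION_KILL_SWITCH` variable and both function names are hypothetical:

```python
import os

def run_mitigation(action, dry_run: bool = True):
    """Wrapper for scripted mitigations: honors a kill-switch and a
    dry-run mode so automation misfires stay recoverable.

    MITIGATION_KILL_SWITCH is an illustrative env var name, not a
    standard convention.
    """
    if os.environ.get("MITIGATION_KILL_SWITCH") == "1":
        return "skipped: kill switch engaged"
    if dry_run:
        return f"dry-run: would execute {action.__name__}"
    return action()

def rollback_deploy():
    # placeholder for a real rollback (CI/CD API call, helm rollback, ...)
    return "rolled back to previous release"

preview = run_mitigation(rollback_deploy)                 # dry-run first
applied = run_mitigation(rollback_deploy, dry_run=False)  # real execution
```

Defaulting to dry-run and requiring an explicit flag for the real action is a cheap way to keep automated mitigations from becoming their own incident.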
Checklists:
Pre-production checklist:
- SLIs defined for new features.
- Tracing and metrics instrumentation present.
- Synthetic tests cover critical flows.
- Feature flag available for rollback.
Production readiness checklist:
- Monitoring targets set and alerts configured.
- On-call and escalation defined.
- Runbook exists and is accessible.
- Backout plan validated.
Incident checklist specific to SEV2:
- Verify impact and affected cohorts.
- Assign incident commander and document timeline.
- Execute mitigation per runbook.
- Communicate status to stakeholders every 30–60 minutes.
- Monitor SLOs until recovery and start postmortem.
Use Cases of SEV2
- E-commerce checkout latency – Context: Checkout service latency spikes. – Problem: Reduced conversions and abandoned carts. – Why SEV2 helps: Rapid mitigation reduces revenue loss. – What to measure: Checkout API error rate and p95 latency. – Typical tools: APM, synthetic tests, feature flags.
- Regional CDN origin failure – Context: One region’s CDN origin degraded. – Problem: Users in region see 5xx errors. – Why SEV2 helps: Traffic reroute and origin failover minimize impact. – What to measure: CDN 5xx rate and origin health. – Typical tools: CDN analytics, global LB, monitoring.
- Payment gateway timeouts – Context: Third-party payment provider intermittent failures. – Problem: Transactions failing, refunds risk. – Why SEV2 helps: Toggle alternate provider or degrade non-essential payment methods. – What to measure: Payment success rate and queue length. – Typical tools: Payment gateway dashboards, feature flag.
- Authentication intermittent failures – Context: Auth service rate-limited due to misconfiguration. – Problem: New user signups and logins fail. – Why SEV2 helps: Short-term mitigation with degraded sign-in flows. – What to measure: Auth error rate and latency. – Typical tools: Tracing, logs, feature flags.
- Search indexing lag – Context: Indexing pipeline backlog causes stale search. – Problem: Users see outdated results. – Why SEV2 helps: Prioritize indexing jobs and reduce load on search. – What to measure: Index freshness and queue depth. – Typical tools: Queue monitoring, job scheduler dashboards.
- Kubernetes node pool degradation – Context: Node upgrades causing eviction and pod restarts. – Problem: Increased restarts for a subset of services. – Why SEV2 helps: Drain node, roll back upgrade, scale up. – What to measure: Pod restarts and eviction rates. – Typical tools: K8s metrics, cluster autoscaler.
- Analytics pipeline failure – Context: Batch job fails causing dashboard staleness. – Problem: Business decisions impacted by stale data. – Why SEV2 helps: Prioritize fix to restore data freshness. – What to measure: Job success rate and data latency. – Typical tools: Job scheduler, monitoring.
- API rate limiting misconfiguration – Context: Misapplied rate limits block legitimate clients. – Problem: Partial customer outage for high-traffic clients. – Why SEV2 helps: Rapid config change or client-specific exceptions. – What to measure: 429 rate and client error counts. – Typical tools: API gateway logs, rate-limit metrics.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes partial node failure
Context: One availability zone shows degraded pod performance.
Goal: Restore normal performance for affected services within hours.
Why SEV2 matters here: A subset of users in the zone are impacted; revenue at risk.
Architecture / workflow: K8s cluster with multi-AZ node pools, service mesh, Prometheus/Grafana.
Step-by-step implementation:
- Detect via pod restarts and latency p95 spike.
- Page on-call and assign incident commander.
- Execute runbook: cordon affected nodes, drain, and shift traffic by adjusting service weights.
- If deploy caused issue, pause deployment and rollback.
- Monitor SLOs for recovery and collect traces.
What to measure: Pod restart rate, p95 latency, AZ-specific error rate.
Tools to use and why: Prometheus for metrics, Grafana dashboards, kubectl for actions.
Common pitfalls: Forgetting to update autoscaler limits or missing node taints.
Validation: Run synthetic tests from the affected AZ after mitigation.
Outcome: Traffic rebalanced and p95 latency returned under threshold; postmortem scheduled.
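The traffic-shift step (moving weight away from the drained zone) amounts to renormalizing the remaining zone weights. A sketch with illustrative zone names:

```python
def rebalance_weights(weights: dict, drained: str) -> dict:
    """Shift traffic away from a drained zone, redistributing its share
    proportionally across the remaining healthy zones."""
    healthy = {z: w for z, w in weights.items() if z != drained}
    total = sum(healthy.values())
    return {z: w / total for z, w in healthy.items()}

weights = {"us-east-1a": 0.34, "us-east-1b": 0.33, "us-east-1c": 0.33}
new_weights = rebalance_weights(weights, drained="us-east-1a")
# 1b and 1c each absorb half of 1a's traffic share
```

In practice these weights are applied through the service mesh or global load balancer config, and capacity in the healthy zones must be checked before the shift, not after.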
Scenario #2 — Serverless cold-start spike for image processing
Context: New campaign increases traffic for serverless function handling image upload.
Goal: Reduce errors and latency for upload processing.
Why SEV2 matters here: High-profile customers experience degraded upload times.
Architecture / workflow: Managed serverless functions invoking storage and downstream workflows.
Step-by-step implementation:
- Detect via invocation error rate and duration spike.
- Page on-call and reroute non-critical uploads to batch queue using feature flag.
- Increase provisioned concurrency or move heavy processing to worker pool.
- Monitor error rate and throttles until stable.
What to measure: Invocation error rate, cold-start duration, provisioned concurrency metrics.
Tools to use and why: Cloud provider metrics, feature flags, queue system.
Common pitfalls: Provisioned concurrency cost and overshoot.
Validation: Load test with production traffic patterns.
Outcome: Errors reduced; feature-flagged path preserved and capacity plan updated.
Scenario #3 — Incident-response and postmortem for payment gateway outage
Context: Payment provider intermittently times out for certain transactions.
Goal: Restore payments and prevent recurrence.
Why SEV2 matters here: Payments directly affect revenue and customer experience.
Architecture / workflow: Application routes to multiple payment providers via gateway.
Step-by-step implementation:
- Detect via elevated payment error rate and customer complaints.
- Page finance and eng on-call, enable fallback provider with feature flag.
- Throttle retry loops to avoid duplicate charges.
- Triage and identify provider-side latency; coordinate with vendor.
- After resolution, collect logs and traces for RCA.
What to measure: Payment success rate, retry counts, customer complaints.
Tools to use and why: Payment gateway dashboard, APM, feature flag.
Common pitfalls: Duplicate charges due to improper retry logic.
Validation: Reprocess queued transactions in staging before production.
Outcome: Fallback reduced impact; postmortem identified retry logic change and vendor contractual change.
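The duplicate-charge pitfall is usually prevented with idempotency keys: a retried request replays the stored result instead of charging again. A toy sketch (the in-memory dict stands in for provider-side deduplication; class and field names are illustrative):

```python
import uuid

class PaymentClient:
    """Charge wrapper that attaches an idempotency key so retried
    requests are deduplicated instead of double-billing."""

    def __init__(self):
        self.processed = {}  # stand-in for provider-side dedup store

    def charge(self, amount_cents: int, idempotency_key: str) -> str:
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]  # replay, not re-charge
        txn_id = f"txn-{len(self.processed) + 1}"
        self.processed[idempotency_key] = txn_id
        return txn_id

client = PaymentClient()
key = str(uuid.uuid4())           # one key per logical payment attempt
first = client.charge(1999, key)
retry = client.charge(1999, key)  # timeout retry returns the same txn
```

The key must be generated once per logical payment and reused across retries; generating a fresh key per retry silently defeats the protection.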
Scenario #4 — Cost vs performance trade-off for high-throughput API
Context: High incoming traffic causing autoscaling and cost spikes.
Goal: Maintain SLOs at acceptable cost level.
Why SEV2 matters here: Sudden traffic pattern causes partial degradation; need to balance cost.
Architecture / workflow: Microservices with autoscaling and managed DB.
Step-by-step implementation:
- Detect via burn rate and increased latency during peak.
- Page ops to enable adaptive throttling for non-essential endpoints and enable caching for hot keys.
- Adjust autoscaler cooldowns and instance types temporarily.
- Monitor cost and performance metrics; revert tactical changes after peak.
What to measure: Cost per request, latency p95, cache hit rate.
Tools to use and why: Cloud cost analytics, APM, caching layer.
Common pitfalls: Throttling essential clients or misconfigured cache invalidation.
Validation: Load testing and cost projection.
Outcome: Reduced latency and bounded cost; long-term autoscaling tuning planned.
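Adaptive throttling of non-essential endpoints is often implemented as a token bucket: shed load when tokens run out instead of scaling up. A minimal sketch (real versions refill on a clock, not an explicit `tick`):

```python
class TokenBucket:
    """Throttle for non-essential endpoints: each request consumes a
    token; an empty bucket means shed load rather than scale up."""

    def __init__(self, capacity: int, refill: int):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill

    def allow(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # reject: return 429 / degraded response

    def tick(self):
        # called periodically; real implementations refill based on
        # elapsed wall-clock time
        self.tokens = min(self.capacity, self.tokens + self.refill)

bucket = TokenBucket(capacity=3, refill=1)
results = [bucket.allow() for _ in range(5)]  # burst exhausts the bucket
bucket.tick()
after_refill = bucket.allow()
```

Applying the bucket only to non-essential endpoints preserves the critical path (checkout, payments) while bounding cost during the peak.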
Common Mistakes, Anti-patterns, and Troubleshooting
List of mistakes (Symptom -> Root cause -> Fix):
- Symptom: Repeated SEV2 on same endpoint -> Root cause: Temporary fix not addressing root cause -> Fix: Conduct RCA and deploy permanent fix.
- Symptom: Alert storm during incident -> Root cause: No alert grouping -> Fix: Configure deduplication and grouping by root cause keys.
- Symptom: Runbook outdated -> Root cause: Runbook not versioned -> Fix: Integrate runbook updates into PR and CI pipeline.
- Symptom: Telemetry missing -> Root cause: SDK misconfiguration or agent down -> Fix: Add health checks for telemetry pipeline.
- Symptom: False positives from synthetics -> Root cause: Non-representative tests -> Fix: Update synthetic scenarios to mirror real traffic.
- Symptom: On-call burnout -> Root cause: Excessive SEV2 paging -> Fix: Rotate on-call, increase automation, and limit pager hours.
- Symptom: Slow rollback -> Root cause: Complex DB migrations -> Fix: Design backwards-compatible migrations and short-lived feature flags.
- Symptom: Cost spike after mitigation -> Root cause: Overprovisioned autoscaling -> Fix: Use fine-grained autoscaling policies and temporary measures.
- Symptom: Duplicate incident pages -> Root cause: Multiple systems alerting independently -> Fix: Centralize alert routing and create single incident from groups.
- Symptom: Poor cross-team coordination -> Root cause: Undefined incident roles -> Fix: Define incident commander, scribe, and domain leads in playbook.
- Symptom: Missed SLO breach -> Root cause: SLOs not monitored in real-time -> Fix: Create live SLO dashboards and burn rate alerts.
- Symptom: Automation misfire -> Root cause: Untested scripts in production -> Fix: Test automations in staging and add kill-switch.
- Symptom: Feature flag drift -> Root cause: Flags left enabled indefinitely -> Fix: Implement flag lifecycle and cleanup process.
- Symptom: High-cardinality metrics causing costs -> Root cause: Excessive labels or tracing sampling -> Fix: Reduce label cardinality and tune sampling.
- Symptom: Misrouted traffic in LB -> Root cause: Incorrect config or rollout error -> Fix: Implement infra-as-code PR review and canary test LB changes.
- Symptom: Hard-to-parse logs -> Root cause: Unstructured logging -> Fix: Adopt structured logging and standard fields.
- Symptom: Postmortems without action -> Root cause: Lack of accountability -> Fix: Assign owners and track completion in backlog.
- Symptom: Observability pipeline backpressure -> Root cause: Burst of telemetry causing ingestion throttling -> Fix: Implement buffering and backpressure handling.
- Symptom: Over-aggregation hides root cause -> Root cause: Overly broad dashboard views -> Fix: Provide drill-down panels with service filters.
- Symptom: Ignored error budget -> Root cause: Teams unaware of burn rate -> Fix: Integrate burn rate into weekly reviews.
- Symptom: Security blocks causing partial outage -> Root cause: Overzealous WAF or IAM change -> Fix: Test security rules before enforcement.
- Symptom: Missing correlation IDs -> Root cause: Instrumentation incomplete -> Fix: Enforce propagation via middleware and tests.
- Symptom: Slow incident communication -> Root cause: No status update cadence -> Fix: Set clear status update intervals and templates.
- Symptom: Miscalculated impacted users -> Root cause: No user segmentation telemetry -> Fix: Add user context to traces and metrics.
- Symptom: Overuse of SEV2 -> Root cause: Loose severity rubric -> Fix: Formalize severity definitions and training.
Observability pitfalls covered above: missing telemetry, synthetic false positives, high-cardinality metrics, unstructured logs, and pipeline backpressure.
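The deduplication-and-grouping fix for alert storms can be sketched in a few lines. This is a minimal illustration, not a vendor API; the `root_cause_key` field is an assumed alert schema for the example.

```python
from collections import defaultdict

def group_alerts(alerts, key="root_cause_key"):
    """Collapse a burst of alerts into one incident candidate per key.

    `alerts` is a list of dicts; the grouping key name is an assumption,
    not a real vendor schema. Returns {key_value: [alerts...]}.
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.get(key, "unknown")].append(alert)
    return dict(groups)

# An alert storm from three services, two sharing one root cause.
storm = [
    {"service": "api", "root_cause_key": "db-primary-latency"},
    {"service": "checkout", "root_cause_key": "db-primary-latency"},
    {"service": "search", "root_cause_key": "index-rebuild"},
]
incidents = group_alerts(storm)
# Three raw alerts collapse into two incident candidates.
```

In production the grouping key usually comes from alert labels (service, region, failure signature) rather than a single precomputed field.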
Best Practices & Operating Model
Ownership and on-call:
- Assign clear ownership per service; define primary and secondary on-call.
- Rotate on-call frequently and provide burnout mitigation like on-call compensation and enforced rest.
Runbooks vs playbooks:
- Runbooks: step-by-step instructions for common SEV2 actions.
- Playbooks: broader strategies that apply across multiple scenarios.
- Update runbooks as code and review quarterly.
Safe deployments:
- Canary and progressive rollouts with automatic halt on SLO triggers.
- Quick rollback paths and database migration strategies that support reversibility.
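The canary-with-automatic-halt idea above can be sketched as a simple control loop. The step sizes and 1% error-rate threshold are illustrative assumptions, not a specific CD platform's API; in practice `error_rate_fn` would query your metrics backend.

```python
def progressive_rollout(steps, error_rate_fn, max_error_rate=0.01):
    """Advance traffic through canary steps, halting with a rollback
    signal if the observed error rate breaches the SLO trigger.

    `error_rate_fn(pct)` returns the canary's error rate while serving
    that percentage of traffic (a stand-in for a metrics query).
    """
    for pct in steps:
        if error_rate_fn(pct) > max_error_rate:
            return {"halted_at": pct, "action": "rollback"}
    return {"halted_at": None, "action": "promote"}

# Simulated canary that degrades once it takes 50% of traffic.
result = progressive_rollout(
    steps=[1, 5, 25, 50, 100],
    error_rate_fn=lambda pct: 0.002 if pct < 50 else 0.05,
)
# Halts at the 50% step and recommends rollback.
```

Real pipelines also bake in soak time per step and multiple SLO signals (latency, saturation), not just error rate.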
Toil reduction and automation:
- Automate repetitive SEV2 tasks such as traffic shift and flag toggles.
- Include manual overrides and pre-production testing in every automation rollout.
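A mitigation automation guarded by a kill-switch and manual override might look like this sketch. The dict-based flag store stands in for a real feature-flag service, and the `automation_enabled` flag name is hypothetical.

```python
class MitigationAutomation:
    """Runs an automated mitigation only while its kill-switch flag is on.

    `flags` stands in for a feature-flag client; `mitigate` is the
    automated action (e.g. a traffic shift). Both are illustrative.
    """
    def __init__(self, flags, mitigate):
        self.flags = flags
        self.mitigate = mitigate
        self.runs = 0

    def maybe_run(self):
        # The kill-switch check happens before every execution, so an
        # operator can stop a misfiring automation mid-incident.
        if not self.flags.get("automation_enabled", False):
            return "skipped: kill-switch off"
        self.mitigate()
        self.runs += 1
        return "mitigation applied"

flags = {"automation_enabled": True}
auto = MitigationAutomation(flags, mitigate=lambda: None)
first = auto.maybe_run()             # runs normally
flags["automation_enabled"] = False  # operator flips the kill-switch
second = auto.maybe_run()            # automation now stands down
```

The key design point is that the override lives outside the automation itself, so disabling it never requires a deploy.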
Security basics:
- Ensure incident tools and runbooks are access-controlled.
- Test security rules in staging and include security team in incident communications when relevant.
Weekly/monthly routines:
- Weekly: Review open SEV2 incidents, SLO burn rate, and runbook updates.
- Monthly: Game day and chaos exercises, postmortem review board, and SLO target review.
What to review in postmortems related to SEV2:
- Timeline accuracy and telemetry sufficiency.
- Root cause and contributing factors.
- Action items with owners and deadlines.
- Lessons for SLOs, runbooks, and automation improvements.
Tooling & Integration Map for SEV2
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics storage | Stores and queries time series | K8s, APM, alerting | Often Prometheus or remote storage |
| I2 | Visualization | Dashboards and panels | Metrics, logs, traces | Grafana or vendor dashboards |
| I3 | Tracing | Distributed traces and spans | Instrumentation, APM | OpenTelemetry compatible |
| I4 | Logging | Centralized log store | Apps, agents, alerting | Structured logs aid triage |
| I5 | Alerting | Routes alerts to on-call | Pager platforms, email | Escalation and grouping features |
| I6 | Incident platform | Manages incident lifecycle | Alerting, chat, ticketing | Stores timelines and postmortems |
| I7 | Feature flags | Control features at runtime | CI/CD, apps, campaigns | Useful for rapid mitigation |
| I8 | CI/CD | Deployment orchestration | Source control, infra | Canary and rollback pipelines |
| I9 | Load balancer | Traffic routing and shifts | DNS, CDN, service mesh | Global routing for region failover |
| I10 | Chaos tooling | Simulate failures | K8s, services, infra | Useful for validation in game days |
Row Details
- I1: Metrics storage must scale; remote write or long-term storage recommended for large environments.
- I6: Incident platforms act as single source of truth for postmortems and timelines.
Frequently Asked Questions (FAQs)
What exactly qualifies as SEV2?
SEV2 is a significant partial degradation impacting a subset of users or business capabilities and requiring immediate engineering coordination.
How fast should SEV2 be acknowledged?
The primary on-call should acknowledge immediately, within the defined SLA, typically 5–15 minutes depending on rotation policy.
Is SEV2 always paged?
Not always; if an automated mitigation resolves the issue, a page may be unnecessary. Most organizations page for SEV2 whenever manual action is required.
Does SEV2 always require a postmortem?
Yes, SEV2 incidents should have a postmortem to prevent recurrence and track action items.
How does SEV2 differ from SEV1 in terms of communication?
SEV1 usually mandates executive-level updates and wider public status; SEV2 requires regular stakeholder updates but not necessarily executive war room.
Can SEV2 be handled solely by automation?
Sometimes for well-understood failure modes. However, human oversight is recommended to validate mitigations.
How should SLOs factor into SEV2 response?
SLO breaches or high burn rates should influence escalation and deployment pauses to limit risk.
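Burn rate, mentioned above, is commonly computed as the observed error rate divided by the error rate the SLO budget allows. A minimal sketch, assuming a 99.9% availability SLO:

```python
def burn_rate(error_rate, slo_target=0.999):
    """Ratio of observed error rate to the budgeted error rate.

    A burn rate of 1.0 exhausts the error budget exactly over the SLO
    window; values above 1.0 burn the budget faster than allowed.
    """
    budgeted_error_rate = 1.0 - slo_target  # 0.001 for a 99.9% SLO
    return error_rate / budgeted_error_rate

# A 1.4% error rate against a 99.9% SLO burns budget about 14x too
# fast, typically enough to page and pause risky deploys.
rate = burn_rate(0.014)
```

Multiwindow variants (e.g. comparing a 5-minute and a 1-hour window) are widely used to page on fast burns while suppressing brief blips.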
How many SEV2 incidents are acceptable per month?
It varies; track against error budgets and organizational targets rather than a universal number.
What tools are must-haves for SEV2 readiness?
Metrics, tracing, logging, incident platform, feature flags, and CI/CD with rollback capabilities.
How to avoid noisy SEV2 pages?
Tune alert thresholds, group related alerts, and use anomaly detection or adaptive thresholds to reduce false positives.
Who should be on the SEV2 communication channel?
Incident commander, primary and secondary on-call, domain owners, and relevant SMEs; include a scribe for timeline.
Can SEV2 impact internal SLAs?
Yes, internal SLA or operational targets can be affected and should be tracked similar to customer-facing SLOs.
How to prioritize SEV2 vs new feature work?
Use error budgets and customer impact to prioritize; SEV2 remediation typically supersedes new feature development.
When should a SEV2 escalate to SEV1?
If impact expands to full service outage, cross-region failures, or significant executive and regulatory implications.
What is the ideal length of a runbook for SEV2?
Concise and actionable; typically one to three pages with exact commands and rollback steps.
How often should runbooks be tested?
At least quarterly during game days or when significant system changes happen.
How to handle SEV2 in multi-tenant systems?
Isolate affected tenants via feature flags or routing rules and communicate with impacted clients.
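The tenant-isolation approach above can be sketched as a routing lookup that diverts affected tenants to a degraded-but-safe path. Tenant IDs and pool names here are hypothetical placeholders, not a real routing API.

```python
def route_request(tenant_id, quarantined,
                  healthy_pool="primary", fallback_pool="isolated"):
    """Route quarantined tenants to an isolated pool during a SEV2.

    `quarantined` is the set of tenant IDs currently affected; all
    other tenants continue to use the primary pool unchanged.
    """
    return fallback_pool if tenant_id in quarantined else healthy_pool

quarantined = {"tenant-42"}
affected = route_request("tenant-42", quarantined)   # sent to isolated pool
unaffected = route_request("tenant-7", quarantined)  # stays on primary
```

In practice the quarantine set would live in a feature-flag service or routing config so it can be updated without a deploy, and removal of tenants from the set is part of incident close-out.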
Should SEV2 incidents be public on status pages?
If customers are affected externally, provide status updates; internal-only incidents may not need public pages.
Conclusion
SEV2 is a critical operational construct for handling significant partial degradations in cloud-native systems. Proper instrumentation, SLO-driven policies, runbooks, and reliable automation reduce time-to-mitigate and recurrence risk. Cross-team coordination and clear ownership turn SEV2 incidents into opportunities for reliability improvement.
Next 7 days plan:
- Day 1: Audit critical SLIs and ensure they are instrumented.
- Day 2: Validate runbooks for top 3 SEV2 scenarios.
- Day 3: Implement or verify feature flag capability for emergency toggles.
- Day 4: Configure SLO burn rate alerts and test paging flow.
- Day 5: Run a micro game day simulating a regional partial outage.
- Day 6: Review alert routing, deduplication, and grouping to reduce pager noise.
- Day 7: Review recent SEV2 postmortems and assign owners to open action items.
Appendix — SEV2 Keyword Cluster (SEO)
- Primary keywords
- SEV2 incident
- SEV2 meaning
- SEV2 severity
- SEV2 SRE
- SEV2 incident response
- Secondary keywords
- SEV2 vs SEV1
- SEV2 runbook
- SEV2 mitigation
- SEV2 postmortem
- SEV2 on-call
- Long-tail questions
- What is SEV2 in site reliability engineering
- How to handle a SEV2 incident
- SEV2 vs SEV3 differences
- SEV2 runbook template
- SEV2 incident lifecycle best practices
- Related terminology
- incident commander
- SLI SLO
- error budget
- feature flag rollback
- canary deployment
- circuit breaker
- observability pipeline
- synthetic testing
- distributed tracing
- Prometheus Grafana
- OpenTelemetry
- incident management
- postmortem action items
- burn rate alerting
- traffic shifting
- autoscaling policies
- runbook automation
- pager escalation
- chaos engineering
- partial outage
- regional degradation
- payment gateway failure
- serverless throttling
- Kubernetes pod restarts
- synthetic health checks
- dependency graph
- monitoring best practices
- incident lifecycle management
- feature cohort control
- blue green deployment
- rollback strategy
- telemetry gaps
- log correlation
- trace sampling
- high cardinality metrics
- alert deduplication
- incident SLA
- cost vs performance tradeoff
- observability dashboards
- executive incident dashboard
- on-call dashboard
- debug dashboard
- mitigation runbook
- incident automation
- service degradation
- degradation threshold
- incident communication template
- status page updates
- vendor fallback
- retry logic issues
- backpressure handling
- queue backlog monitoring
- replication lag monitoring
- security incident overlap
- incident tooling map
- SLA breach response
- incident response training
- game day exercises