Quick Definition (30–60 words)
Duration RED treats request or operation duration as a primary service-level indicator, emphasizing latency percentiles and tail behavior. Analogy: it measures actual highway travel time, not just posted speed limits. Formally: Duration RED = SLIs derived from duration percentiles across user-facing transactions.
What is Duration RED?
Duration RED is a focused extension of the RED observability pattern that highlights duration (latency) as the core signal for customer experience. It is not simply average response time; it prioritizes distribution and tail behavior for user-facing work. Duration RED complements error and saturation signals by revealing when operations are slow enough to cause timeouts, retries, or poor UX.
What it is NOT:
- Not merely mean or median duration.
- Not a replacement for error-rate monitoring.
- Not an infrastructure-only metric; it requires application-level instrumentation to be meaningful.
Key properties and constraints:
- Emphasizes percentile-based SLIs (p50, p95, p99, p999).
- Requires consistent, well-scoped tagging (service, route, environment) to attribute latency without exploding cardinality.
- Sensitive to sampling, clock skew, and aggregation windows.
- Needs correlation with errors, retries, and throughput to diagnose impact.
Where it fits in modern cloud/SRE workflows:
- Primary SLI for user-facing APIs, RPCs, and UI transactions.
- Used in SLOs and error budgets tied to customer experience.
- Drives incident prioritization and auto-scaling decisions.
- Integrated with CI/CD, chaos experiments, and performance budgets.
Diagram description (text-only):
- Client issues request -> Ingress/load balancer (measures start) -> Edge proxy (adds latency) -> Service A (handles business logic) -> Downstream calls to Service B and DB -> Service A response -> Observability pipeline aggregates duration spans -> Alerting evaluates percentiles against SLO -> On-call receives page or ticket.
Duration RED in one sentence
Duration RED focuses on latency percentiles of user-facing requests as primary SLIs to protect customer experience and guide SRE operations.
Duration RED vs related terms (TABLE REQUIRED)
| ID | Term | How it differs from Duration RED | Common confusion |
|---|---|---|---|
| T1 | RED (classic) | Duration RED focuses on duration specifically | People think RED only uses counters |
| T2 | Apdex | Apdex is threshold-based satisfaction score | Apdex hides tail behavior |
| T3 | P95 latency | Single percentile view of duration | P95 is easier but may miss tails |
| T4 | Mean latency | Arithmetic mean may hide skew | Mean often underestimates tail pain |
| T5 | SLA | SLA is contractual and legal | SLA may not map to technical SLO |
| T6 | SLO | SLO is target; Duration RED is SLI input | SLO is policy not measurement |
| T7 | Error budget | Error budget is allowance; uses Duration RED | Budgets usually tied to errors not latency |
| T8 | Quantile estimation | Statistical method, not an SLI itself | Confused with exact percentiles |
| T9 | End-to-end tracing | Traces provide context for duration | Tracing alone is not aggregated SLI |
| T10 | Throughput | Throughput is request rate, not duration | High throughput can affect duration |
Row Details (only if any cell says “See details below”)
Not required.
Why does Duration RED matter?
Business impact:
- Revenue: Slow experiences reduce conversions and retention.
- Trust: Users expect consistent response times; variability erodes confidence.
- Risk: Latency spikes can trigger cascading retries and increased costs.
Engineering impact:
- Incident reduction: Early detection of duration inflation reduces severity.
- Velocity: Clear SLOs for duration reduce firefighting and improve deployments.
- Architecture decisions: Informs caching, decompositions, and database tuning.
SRE framing:
- SLIs/SLOs: Duration percentiles become primary SLIs for user actions.
- Error budget: Budget burn can be caused by tail latency rather than errors.
- Toil/on-call: Better instrumentation reduces manual investigation time.
What breaks in production (realistic examples):
- Payment API p99 spikes due to sync DB index contention causing checkout failures.
- UI load becomes sluggish when a third-party CDN has degraded performance.
- Kubernetes rolling restarts stall because probe durations exceed configured thresholds under load, so new pods never become ready.
- Serverless function cold starts increase p95 beyond SLO after a deployment with larger container image.
- Distributed transaction increases tail latency after a library upgrade that changed timeouts.
Where is Duration RED used? (TABLE REQUIRED)
| ID | Layer/Area | How Duration RED appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / CDN | Request-to-first-byte and full response time | TTFB p95 p99 and status codes | Edge logs and synthetic checks |
| L2 | Ingress / API gateway | Request duration and upstream time | Route p95 p99 and upstream latency | API gateway metrics and tracing |
| L3 | Service (app) | Handler durations and downstream waits | Span durations and histograms | APM and tracing SDKs |
| L4 | Datastore | Query execution and replication lag | Query duration percentiles and locks | DB metrics and slow logs |
| L5 | Messaging / Queue | Time in queue and processing time | Queue wait and handler duration | Broker metrics and consumer traces |
| L6 | Serverless / FaaS | Cold start and execution time | Invocation duration histogram | Cloud provider function metrics |
| L7 | Kubernetes infra | Pod startup and liveness probe durations | Container start and readiness times | K8s metrics and events |
| L8 | CI/CD | Build and deploy durations | Job runtime histograms | CI metrics and pipelines |
| L9 | Observability pipeline | Ingest and query latency | Ingest lag and query time | Monitoring backend metrics |
| L10 | Security tooling | Scan durations and blocking times | Scan job duration percentiles | Security scanners and plugin metrics |
Row Details (only if needed)
Not required.
When should you use Duration RED?
When necessary:
- Customer-facing APIs or UI where response time affects experience.
- Systems with SLAs or performance-sensitive flows like payments or search.
- Services with high variability or complex downstream dependencies.
When optional:
- Internal batch jobs where throughput matters more than latency.
- Background tasks where latency doesn’t affect user experience.
When NOT to use / overuse it:
- For purely asynchronous pipelines where latency is not user-visible.
- As a sole SLI for services dominated by availability or correctness issues.
- Over-instrumenting low-value internal endpoints creates noise.
Decision checklist:
- If request results are user-visible AND latency affects UX -> use Duration RED.
- If operation is async AND not customer-visible -> prefer throughput or success-rate SLI.
- If SLOs already exist but incidents are due to errors not latency -> prioritize error SLI.
Maturity ladder:
- Beginner: Instrument p95 and p99 histograms for critical endpoints.
- Intermediate: Add labels for key dimensions and implement SLOs with alerting.
- Advanced: Use adaptive SLOs, per-user-cohort objectives, and automated remediation.
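The beginner step (histograms for critical endpoints) is normally done with a metrics client such as prometheus_client; this stdlib-only Python sketch illustrates the underlying bucket mechanics. Bucket bounds and class names are illustrative, not a library API:

```python
import bisect

class DurationHistogram:
    """Minimal bucketed latency histogram, stdlib only.
    (Prometheus exposes cumulative bucket counts; this sketch keeps
    per-bucket counts for clarity.)"""

    def __init__(self, buckets_ms=(50, 100, 200, 500, 1000, 2500, 5000)):
        self.upper_bounds = list(buckets_ms)               # finite bounds, ms
        self.counts = [0] * (len(self.upper_bounds) + 1)   # last slot = +Inf
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, duration_ms: float) -> None:
        # First bucket whose upper bound is >= the observation.
        idx = bisect.bisect_left(self.upper_bounds, duration_ms)
        self.counts[idx] += 1
        self.total += 1
        self.sum_ms += duration_ms

h = DurationHistogram()
for d in (42, 180, 95, 1200, 60):
    h.observe(d)
```

A real client would also attach labels (service, route, environment) and export the buckets for scraping.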
How does Duration RED work?
Components and workflow:
- Instrumentation: Application records start and end times for transactions and spans.
- Aggregation: Metrics backend ingests histograms or quantile summaries.
- Evaluation: Compute SLIs (p95/p99) and compare with SLO targets.
- Alerting: Generate alerts based on burn rate or absolute threshold breaches.
- Response: On-call follows runbook for latency incidents and triggers mitigations.
- Remediation: Autoscaling, circuit breakers, caching, or rollbacks.
- Postmortem: Analyze traces and metrics, update SLOs and automation.
Data flow and lifecycle:
- Request enters -> instrumentation creates spans -> spans emit durations -> metrics collector converts spans to histograms -> durable store holds time series -> query computes percentiles -> alerting evaluates conditions -> feedback to incident workflow.
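The "query computes percentiles" stage can be sketched as bucket-based quantile estimation with linear interpolation inside the target bucket, similar in spirit to PromQL's histogram_quantile. Function name and bucket layout here are illustrative:

```python
def quantile_from_buckets(q, bounds, counts):
    """Estimate a quantile from per-bucket counts, interpolating linearly
    within the bucket where the target rank falls.
    bounds: ascending finite upper bounds; counts: one count per bound,
    plus a final overflow count for observations above the last bound."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    rank = q * total
    cum = 0.0
    lower = 0.0
    for bound, count in zip(bounds, counts):
        if cum + count >= rank:
            # Position of the rank inside this bucket.
            fraction = (rank - cum) / count if count else 0.0
            return lower + (bound - lower) * fraction
        cum += count
        lower = bound
    return bounds[-1]  # rank fell in the overflow bucket; clamp

bounds = [50, 100, 200, 500, 1000]
counts = [10, 40, 30, 15, 4, 1]  # last entry = observations above 1000ms
p95 = quantile_from_buckets(0.95, bounds, counts)
```

Note the accuracy limit this exposes: the estimate can never be more precise than the bucket boundaries, which is why coarse buckets mask detail.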
Edge cases and failure modes:
- Sampling discards tail spans and masks real latency.
- Clock skew across hosts distorts durations.
- Aggregation windows hide transient spikes.
- Low-volume endpoints produce noisy percentile estimates.
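To avoid the clock-skew failure mode above, individual durations should come from a monotonic clock rather than wall time, since wall clocks can jump under NTP adjustments. A minimal sketch:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(record):
    """Measure a duration with time.perf_counter(), a monotonic
    high-resolution clock; time.time() can jump backwards when the
    system clock is adjusted, producing negative or absurd durations."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record(time.perf_counter() - start)

durations = []
with timed(durations.append):
    sum(range(100_000))  # stand-in for the instrumented operation
```

Cross-host durations (e.g. queue wait measured from producer timestamp to consumer dequeue) still depend on clock sync, which is why they need separate scrutiny.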
Typical architecture patterns for Duration RED
- Client-observed SLI pattern: Measure round-trip time at client SDKs. Use when client-side network impact matters.
- Server-side histogram + tracing: Service emits high-resolution histograms and traces. Use for backend services with many dependencies.
- Distributed tracing-first: Use traces to attribute duration across call graph; compute service-level SLIs from tracing spans. Use for microservices with complex topology.
- Synthetic + real user monitoring (RUM): Combine synthetic checks with RUM for frontend and third-party visibility.
- Per-endpoint SLOs with traffic shaping: Apply SLOs per critical endpoint and throttle or route noncritical traffic during degradation.
Failure modes & mitigation (TABLE REQUIRED)
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Sampling bias | Undetected tail latency | Head sampling drops rare slow traces | Tail-based sampling that keeps errors and slow traces | Trace sample rate drop |
| F2 | Clock skew | Negative or absurd durations | Unsynced host clocks | Use monotonic timers or sync time | Host clock drift metric |
| F3 | Aggregation lag | Delayed alerts | Monitoring pipeline backpressure | Scale ingest or lower resolution | Ingest lag metric |
| F4 | Metric cardinality | High cost and slow queries | Too many labels | Reduce labels and use rollups | Cardinality metric |
| F5 | Misattributed latency | Blame wrong service | Missing context or traces | Add context propagation | High downstream p99 |
| F6 | Percentile noise | Flapping percentiles | Low traffic for endpoint | Smoothing windows or a lower percentile target | Low sample count metric |
Row Details (only if needed)
Not required.
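The F1 mitigation (retain error and tail traces) amounts to a sampling decision made once the duration is known, i.e. at span or trace completion. Thresholds and names here are illustrative, not a vendor default:

```python
import random

def keep_trace(duration_ms, is_error, p99_estimate_ms, base_rate=0.01):
    """Tail-biased sampling decision: always keep errors and slow
    traces, sample the fast, healthy majority at a low base rate.
    The 1% base rate is an illustrative placeholder."""
    if is_error:
        return True                      # errors are always retained
    if duration_ms >= p99_estimate_ms:
        return True                      # tail traces are always retained
    return random.random() < base_rate   # probabilistic for everything else

keep_trace(3200, False, p99_estimate_ms=2000)  # slow: kept
keep_trace(40, True, p99_estimate_ms=2000)     # error: kept
```

Real tail-based samplers buffer spans until the trace completes before deciding, which has memory-cost implications in the collector.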
Key Concepts, Keywords & Terminology for Duration RED
This glossary gives concise definitions and common pitfalls. Each entry is Term — definition — why it matters — common pitfall.
- Duration — time between request start and completion — Primary SLI basis — Confused with CPU time.
- Latency distribution — spread of durations across requests — Shows tail behavior — Ignoring tails.
- Percentile (p95, p99) — value below which X% of samples fall — Captures UX impact — Using only p95 hides p999.
- Tail latency — extreme high percentiles — Often causes user-visible failure — Hard to estimate at low volume.
- Histogram — bucketed distribution — Efficient for aggregation — Coarse buckets mask detail.
- Summaries / sketches — approximate quantiles — Low memory cost — Complexity in interpretation.
- Quantile estimation — algorithmic percentile calculation — Balances accuracy and cost — Implementation differences.
- SLI — service-level indicator — Measure of system behavior — Wrongly chosen SLI misguides ops.
- SLO — service-level objective — Target for SLIs — Too strict SLO causes alert fatigue.
- SLA — service-level agreement — Contractual obligation — Legal implication often omitted.
- Error budget — allowable SLO violations — Drives release decisions — Undervaluing latency burn.
- RED method — Rate, Errors, Duration — Observability pattern — Often misused as only counters.
- RUM — Real user monitoring — Client-side duration capture — Privacy and sampling concerns.
- Synthetic monitoring — scripted checks — Detect regressions proactively — May miss real user paths.
- Tracing — distributed context for requests — Helps attribution — Sampling limits visibility.
- Span — tracing unit of work — Identifies component durations — Incomplete spans mislead.
- Client-observed SLI — measured by client SDK — Includes network and render time — Harder to control.
- Server-observed SLI — measured by server — Excludes client view — Misses client-side issues.
- Cold start — serverless startup latency — Affects p95/p99 — Overprovisioning increases cost.
- Probe latency — readiness/liveness probe durations — Affects orchestration — Probe misconfig breaks scaling.
- Autoscaling — adjust capacity based on metrics — Uses duration to scale for responsiveness — Reactive scaling can be late.
- Circuit breaker — stop calling slow dependencies — Prevents cascading latency — Misconfiguration leads to availability loss.
- Retry storm — repeated retries increasing load — Exacerbates latency — Retry budget missing.
- Backpressure — flow control when downstream is slow — Prevents queue growth — Hard to implement across systems.
- Token bucket — rate-limiting algorithm — Limits concurrent load — Overthrottling hurts UX.
- P95 flapping — percentile oscillation — Causes noisy alerts — Use smoothing and burn-rate checks.
- Observability pipeline — ingestion, storage, visualization — Central to duration analysis — Single point of failure if not scaled.
- Cardinality — number of unique label combinations — Affects cost — High cardinality increases backend stress.
- Aggregation window — time range for percentile calculation — Longer windows stabilize but delay response — Too short causes noise.
- Sample rate — fraction of traces collected — Balances cost/visibility — Too low hides tails.
- Monotonic clock — non-decreasing timer — Accurate durations despite system time changes — Not always used by SDKs.
- Probe jitter — randomized offsets between health probes — Prevents synchronized thundering herds — Forgotten in default configs.
- Service mesh — infrastructure layer proxying service-to-service traffic — Adds a network hop that affects p95 — Needs transparent instrumentation.
- Sidecar proxy — local network proxy for service mesh — Captures durations — Adds overhead.
- QoS — quality of service classes — Prioritize latency-sensitive flows — Complexity in enforcement.
- Smoothing window — moving average for percentile signals — Reduces noise — Masks short incidents.
- Load spike — sudden increase in traffic — Causes tail latency — Autoscaling lag can worsen impact.
- Capacity planning — reserve headroom for latency spikes — Prevents budget burn — Overprovisioning cost tradeoff.
- Chaos engineering — inject faults to surface latency issues — Improves resilience — Requires careful scoping.
How to Measure Duration RED (Metrics, SLIs, SLOs) (TABLE REQUIRED)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | p95 request duration | Typical slow-but-common user impact | Histogram quantile per route | 200ms for critical APIs | p95 misses rare tails |
| M2 | p99 request duration | Tail user impact | Histogram quantile per route | 1s for interactive flows | Requires high sample counts |
| M3 | p999 duration | Extreme tail risk | Sketches or streaming quantiles | 3s for critical flows | Very noisy at low volume |
| M4 | Error rate within high-duration requests | Correlates latency with failures | Count errors where duration > threshold | <1% of slow requests | Need correlation labels |
| M5 | Queue wait time | Backpressure and scheduling delays | Histogram on dequeue time | 50ms for critical queues | Ignored in single-service views |
| M6 | Cold start rate | Frequency of high latency due to cold starts | Percentage of invocations with startup >X | <1% | Requires function-level instrumentation |
| M7 | Client-observed RTT | End-user experienced duration | Frontend SDK or RUM | 300ms | Network and client render add variance |
| M8 | Backend processing time | Internal compute latency | Service spans excluding network | 100ms | Missing downstream time |
| M9 | Ingest lag | Observability pipeline delay | Time from event to availability | <30s | High pipeline load increases lag |
| M10 | Percentile sample count | Confidence in percentile | Count samples per window | >10k samples | Low-volume endpoints need smoothing |
Row Details (only if needed)
Not required.
Best tools to measure Duration RED
Choose tooling based on environment and scale. Below are recommended tools and structured details.
Tool — OpenTelemetry
- What it measures for Duration RED: Traces and span durations; histogram metrics.
- Best-fit environment: Microservices, multi-cloud, hybrid.
- Setup outline:
- Instrument services with OpenTelemetry SDKs.
- Configure exporters to tracing/metrics backend.
- Ensure high-resolution histograms enabled.
- Set sampling policy for error and tail traces.
- Strengths:
- Vendor-neutral standard.
- Rich context propagation.
- Limitations:
- Requires backend for storage and visualization.
- Sampling strategy complexity.
Tool — Prometheus + Histogram/Exemplar
- What it measures for Duration RED: Aggregated histograms and exemplars linked to traces.
- Best-fit environment: Kubernetes and self-managed stacks.
- Setup outline:
- Export histograms from app metrics.
- Use exemplars to connect histogram buckets to traces.
- Use recording rules for percentiles.
- Tune scrape intervals and retention.
- Strengths:
- Open-source and widely adopted.
- Strong alerting integration.
- Limitations:
- Percentile calculation over sliding windows requires care.
- High cardinality costs.
Tool — Managed APM (vendor)
- What it measures for Duration RED: End-to-end traces, service maps, histograms.
- Best-fit environment: Teams needing turnkey tracing and dashboards.
- Setup outline:
- Deploy vendor agents or SDKs.
- Tag key dimensions and enable distributed tracing.
- Configure dashboards and SLOs in vendor console.
- Strengths:
- Quick time-to-value.
- Integrated analytics.
- Limitations:
- Cost and vendor lock-in considerations.
Tool — Real User Monitoring (RUM) SDKs
- What it measures for Duration RED: Client-observed round trips and page load durations.
- Best-fit environment: Frontend web and mobile apps.
- Setup outline:
- Add RUM SDK to frontend.
- Capture page load, navigation timing, and XHR durations.
- Sample and redact sensitive data.
- Strengths:
- Measures real-user experience.
- Limitations:
- Privacy and sampling constraints.
Tool — Synthetic monitoring / Synthetics
- What it measures for Duration RED: End-to-end scripted transaction durations from multiple locations.
- Best-fit environment: Global services and external dependencies.
- Setup outline:
- Define critical journeys as scripts.
- Run at regular intervals from key locations.
- Alert on threshold or SLO violations.
- Strengths:
- Predictable and repeatable checks.
- Limitations:
- May not reflect real traffic patterns.
Recommended dashboards & alerts for Duration RED
Executive dashboard:
- High-level SLO adherence: p95/p99 vs target across business-critical services.
- Trend of error budget burn.
- Top 5 services by p99 increase and business impact rationale.
On-call dashboard:
- Live percentiles per route and recent heatmap.
- Top slow traces and recent deploys.
- Alerts with burn-rate and threshold state.
Debug dashboard:
- Per-service span waterfall for recent slow requests.
- Downstream call durations and queue times.
- Host/instance metrics and probe timings.
Alerting guidance:
- Page vs ticket: Page for SLO burn-rate breaches (high burn or sustained p99 breach). Ticket for isolated, non-business-critical p95 violations.
- Burn-rate guidance: Page when burn rate exceeds 4x over a sliding window and remaining error budget is low. Ticket for transient, single-window spikes.
- Noise reduction: Use grouping by service and route; dedupe similar alerts; suppress during planned maintenance and releases.
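The burn-rate guidance above can be sketched numerically. The 4x page threshold and the multiwindow check mirror the values in this section; they are illustrative starting points, not universal defaults:

```python
def burn_rate(bad_fraction_observed, slo_target=0.99):
    """Burn rate = observed bad fraction / allowed bad fraction.
    A sustained burn rate of 1.0 spends the budget exactly over the
    SLO window; 4.0 spends it four times as fast."""
    allowed = 1.0 - slo_target
    return bad_fraction_observed / allowed

def route_alert(short_rate, long_rate, page_threshold=4.0):
    """Multiwindow check: page only when both a short and a long window
    burn fast, which demotes transient single-window spikes to tickets."""
    if short_rate >= page_threshold and long_rate >= page_threshold:
        return "page"
    if short_rate >= 1.0:
        return "ticket"
    return "none"

# 5% of requests breached the latency threshold against a 99% SLO:
rate = burn_rate(0.05)  # ~5x burn
```

Here "bad" means a request whose duration breached the SLO threshold, so latency alone can burn the budget even with zero errors.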
Implementation Guide (Step-by-step)
1) Prerequisites: – Service inventory and critical endpoint list. – Observability pipeline capacity planning. – Standardized instrumentation libraries.
2) Instrumentation plan: – Identify entry points and boundaries where start/end times are captured. – Implement histograms and traces with consistent labels. – Ensure monotonic timers are used where possible.
3) Data collection: – Configure exporters to metrics and tracing backends. – Ensure exemplars link metrics to traces when possible. – Set retention and resolution policies.
4) SLO design: – Define SLO per customer-impacting endpoint. – Choose percentile and window suitable for traffic. – Define error budget policy and burn actions.
5) Dashboards: – Build executive, on-call, and debug dashboards. – Include distribution heatmaps and top slow traces.
6) Alerts & routing: – Implement burn-rate and absolute threshold alerts. – Route to correct teams by service ownership and escalation.
7) Runbooks & automation: – Document mitigation steps (scale up, rollback, circuit break). – Automate common remediations where safe.
8) Validation (load/chaos/game days): – Perform load tests to validate SLOs. – Run chaos experiments to ensure fallbacks operate.
9) Continuous improvement: – Weekly review of SLO posture. – Postmortems for every SLO breach.
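Step 4 (SLO design) often starts from a threshold-based "good event" SLI. A minimal sketch of computing compliance and budget consumption from raw durations, with illustrative numbers:

```python
def slo_report(durations_ms, threshold_ms, slo_target=0.99):
    """Given raw request durations, report the good-event ratio and the
    fraction of error budget consumed. A 'good event' is a request
    completing within the latency threshold."""
    total = len(durations_ms)
    good = sum(1 for d in durations_ms if d <= threshold_ms)
    good_ratio = good / total
    budget = 1.0 - slo_target        # allowed bad fraction
    bad_ratio = 1.0 - good_ratio
    return {
        "good_ratio": good_ratio,
        "budget_consumed": bad_ratio / budget,  # 1.0 = fully spent
    }

# 1 in 5 requests breaches a 500ms threshold against a 99% target:
report = slo_report([120, 90, 450, 3000, 100] * 20, threshold_ms=500)
```

A budget_consumed well above 1.0, as in this example, is the signal that should trigger the burn actions defined in the error budget policy.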
Pre-production checklist:
- Instrumentation in place for all critical endpoints.
- Test traces exhibit full call graph.
- Synthetic checks validated.
- Dashboards populated with realistic data.
Production readiness checklist:
- Alerts and on-call routing tested.
- Automation for mitigation validated.
- Error budget policy documented.
- Runbooks linked to alert pages.
Incident checklist specific to Duration RED:
- Verify SLO breach and burn rate.
- Identify top slow endpoints and recent deploys.
- Check autoscaler and probe metrics.
- Apply mitigation (traffic shaping, cache warming).
- Capture traces and create postmortem if needed.
Use Cases of Duration RED
1) Checkout API – Context: E-commerce payment flow. – Problem: Occasional p99 spikes lead to abandoned carts. – Why Duration RED helps: Focuses on tail that causes checkout timeouts. – What to measure: p95/p99 per payment method, DB query durations. – Typical tools: APM, RUM, DB slow logs.
2) Search endpoint – Context: Fast, interactive results required. – Problem: Increased query time when cluster shards rebalanced. – Why Duration RED helps: SLO-driven scaling and query optimization. – What to measure: p95/p99 query times, queue wait. – Typical tools: Tracing, DB metrics, synthetic checks.
3) Third-party auth – Context: External identity provider used on login. – Problem: External provider latency increases login failure rates. – Why Duration RED helps: Detects dependency slowness, informs fallback. – What to measure: Upstream latency and retry counts. – Typical tools: Tracing, synthetic monitoring.
4) Mobile app onboarding – Context: Initial app load and API handshake. – Problem: Cold starts and network variability cause timeouts. – Why Duration RED helps: Prioritize cold start reduction and caching. – What to measure: Client-observed RTT, cold start rate. – Typical tools: RUM, function metrics.
5) Serverless webhook handler – Context: Event-driven webhooks processed on FaaS. – Problem: Cold starts inflate p95 for burst traffic. – Why Duration RED helps: Drives warm-pool sizing and concurrency. – What to measure: Invocation duration histogram, cold start percentage. – Typical tools: Cloud function metrics.
6) Streaming ingestion – Context: High-throughput event pipeline. – Problem: Backpressure causes long queue wait times and timeouts. – Why Duration RED helps: Surface queue wait and consumer latency. – What to measure: Time-in-queue percentiles, consumer processing time. – Typical tools: Broker metrics, tracing.
7) Kubernetes probe tuning – Context: Liveness/readiness probes causing restarts. – Problem: Probe durations exceed thresholds under load. – Why Duration RED helps: Ensures probes reflect realistic expectations. – What to measure: Probe execution time and failure counts. – Typical tools: K8s metrics and logs.
8) API gateway rollouts – Context: New gateway introduces additional latency. – Problem: Route-level p99 increases post-upgrade. – Why Duration RED helps: Observability for canary validation. – What to measure: Upstream and downstream duration differences. – Typical tools: Gateway metrics, traces.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes microservice p99 spike
Context: A shopping-cart microservice running on Kubernetes shows p99 spikes that surface as timeouts during a promotional event. Goal: Reduce p99 from 2.5s to 800ms. Why Duration RED matters here: Tail latency causes timeouts and dropped carts. Architecture / workflow: Ingress -> API gateway -> cart service -> DB -> cache. Step-by-step implementation:
- Instrument cart service spans and histograms.
- Enable exemplars to correlate slow buckets to traces.
- Deploy synthetic load matching promo traffic.
- Tune DB queries and add cache for hot items.
- Adjust HPA and probe thresholds. What to measure: p95/p99, DB query p99, cache hit rate. Tools to use and why: Prometheus, OpenTelemetry, and an APM for histograms and tracing. Common pitfalls: High-cardinality tags causing slow queries. Validation: Run a load test and verify p99 stays under the SLO for a 30-minute window. Outcome: p99 reduced, error budget stable during promotions.
Scenario #2 — Serverless cold start reduction
Context: Serverless image processing API with occasional cold starts. Goal: Reduce cold-start-driven p95 from 1.8s to 350ms. Why Duration RED matters here: Client-perceived slowness leads to retries. Architecture / workflow: API Gateway -> Lambda-like functions -> storage. Step-by-step implementation:
- Measure cold starts per invocation.
- Configure provisioned concurrency or warm-up invocations.
- Optimize function package size and dependencies.
- Add retries with exponential backoff. What to measure: Cold start rate, p95, function init time. Tools to use and why: Cloud function metrics, RUM for client impact. Common pitfalls: Overprovisioning increases cost without policy. Validation: Synthetic bursts with and without warm pools. Outcome: Cold start rate falls, p95 meets SLO with acceptable cost.
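The "retries with exponential backoff" step is typically implemented with jitter, so that synchronized clients do not turn retries into a retry storm. A minimal "full jitter" sketch (base and cap values are illustrative):

```python
import random

def backoff_delays(base_s=0.1, cap_s=5.0, attempts=5):
    """'Full jitter' exponential backoff: each delay is drawn uniformly
    from [0, min(cap, base * 2^attempt)]. The randomness de-synchronizes
    clients so retries spread out instead of arriving in waves."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

delays = backoff_delays()
```

Pairing this with a retry budget (stop retrying once a per-window quota is spent) keeps retries from amplifying an existing latency incident.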
Scenario #3 — Incident response and postmortem for latency outage
Context: Production incident where p99 across many services rose concurrently. Goal: Triage and restore performance; identify root cause. Why Duration RED matters here: SLO breaches triggered paging and revenue risk. Architecture / workflow: Multi-service transactions failing due to a shared dependency. Step-by-step implementation:
- Page on-call based on burn rate alert.
- Use on-call dashboard to find top slow endpoints and recent deploys.
- Correlate traces showing shared dependency as bottleneck.
- Apply mitigation: circuit breaker or rollback dependency change.
- Collect artifacts and run postmortem. What to measure: SLO burn rate, root dependency p99. Tools to use and why: Tracing, APM, incident management. Common pitfalls: Lack of exemplars to correlate metrics to traces. Validation: Postmortem with action items and SLO updates. Outcome: Root cause fixed; runbook updated with mitigation steps.
Scenario #4 — Cost vs performance trade-off
Context: A streaming service needs to balance cache sizing vs reduced tail latency. Goal: Determine cost-effective cache size to meet p95 target. Why Duration RED matters here: Latency improvements cost money; need SLO-driven decision. Architecture / workflow: API -> service -> cache -> DB Step-by-step implementation:
- Measure miss-related p99 and overall p95.
- Model cost per cache tier and expected latency reduction.
- Run A/B with different cache sizes and track SLO compliance.
- Choose configuration optimizing cost per SLO improvement. What to measure: Cache hit rate, p95, cost per hour. Tools to use and why: Metrics backend and cost analytics. Common pitfalls: Ignoring cold cache warm-up effects. Validation: Cost/performance dashboard and review after 2 weeks. Outcome: Selected cache tier delivers SLO compliance at acceptable cost.
Common Mistakes, Anti-patterns, and Troubleshooting
List of practical mistakes with symptom -> root cause -> fix.
- Symptom: p95 okay but users complain. Root cause: p99 spikes. Fix: Monitor higher percentiles and adjust SLO.
- Symptom: Percentiles flapping. Root cause: low sample counts. Fix: Smoothing window and combine routes.
- Symptom: Alerts firing constantly after deploy. Root cause: overly tight SLO. Fix: Tune SLO and use deploy suppression rules.
- Symptom: Traces missing for slow requests. Root cause: sampling discarded tails. Fix: Adaptive sampling to retain slow and error traces.
- Symptom: Duration decreases but error rate increases. Root cause: Retries and early failures. Fix: Correlate error SLIs and duration.
- Symptom: High observability cost. Root cause: high cardinality metrics. Fix: Reduce labels and use rollups.
- Symptom: Metrics show low latency but users slow. Root cause: client-side rendering. Fix: Add RUM.
- Symptom: Alerts delayed. Root cause: ingest lag. Fix: Scale pipeline or lower retention resolution.
- Symptom: Probe churn and restarts. Root cause: strict probe timeouts. Fix: Tune probe durations based on p95.
- Symptom: Autoscaler not reacting. Root cause: using CPU rather than request latency. Fix: Use custom metrics like p95 or queue length.
- Symptom: Long investigation times. Root cause: missing trace context. Fix: Add consistent trace IDs and exemplars.
- Symptom: Misattributed latency to database. Root cause: absent network timing. Fix: Instrument network and downstream spans.
- Symptom: Increased costs during mitigation. Root cause: auto-scale aggressive without bounds. Fix: Add cost-aware autoscaling policies.
- Symptom: False positives during canary. Root cause: lack of canary-aware alerting. Fix: Suppress or route canary alerts.
- Symptom: Data skew across regions. Root cause: asynchronous replication lag. Fix: Measure per-region SLIs.
- Symptom: Spikes during backup windows. Root cause: maintenance tasks consuming resources. Fix: Schedule and throttle background jobs.
- Symptom: Aggregated percentile hides problem. Root cause: mixing critical and noncritical routes. Fix: Per-endpoint SLIs.
- Symptom: Alerts burst during replay. Root cause: traffic replaying causes queues. Fix: Rate-limit replay and simulate offline.
- Symptom: Noisy dashboards. Root cause: too many similar panels. Fix: Simplify and focus on key SLIs.
- Symptom: Misleading histogram buckets. Root cause: coarse buckets. Fix: Increase resolution or use sketches.
- Observability pitfall: Over-sampling client data creating privacy issues -> Fix: Redact and sample appropriately.
- Observability pitfall: Not linking exemplars to traces -> Fix: Enable exemplars in metrics pipeline.
- Observability pitfall: Using mean latency in dashboards -> Fix: Switch to percentiles and distributions.
- Observability pitfall: Forgetting monotonic timers -> Fix: Use monotonic timers in code.
- Observability pitfall: Missing dependency context -> Fix: Enforce context propagation in SDKs.
Best Practices & Operating Model
Ownership and on-call:
- Assign SLI/SLO ownership to service teams.
- On-call rotates per service owner; SLO breaches escalate to SLO owner.
Runbooks vs playbooks:
- Runbooks: step-by-step for common latency incidents.
- Playbooks: higher-level strategies for complex incidents and mitigation.
Safe deployments:
- Canary releases with Duration RED checks.
- Automatic rollback when burn rate or p99 exceed thresholds.
Toil reduction and automation:
- Automate scaling and cache population.
- Auto-annotate deploys and correlate with SLI changes.
Security basics:
- Avoid collecting PII in traces.
- Apply data redaction and sample before exporting.
Weekly/monthly routines:
- Weekly: review SLO burn and recent alerts.
- Monthly: capacity planning and dependency latency review.
Postmortem reviews:
- Always analyze SLO breach impact.
- Update runbooks and add automated tests for regression.
Tooling & Integration Map for Duration RED (TABLE REQUIRED)
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Tracing | Captures distributed spans and durations | Metrics backends and APM | Use exemplars to link to histograms |
| I2 | Metrics backend | Stores histograms and percentiles | Tracing and alerting systems | Tune retention vs resolution |
| I3 | APM | Correlates traces and service maps | Logs, traces, metrics | Good for rapid root cause analysis |
| I4 | RUM | Captures client-observed durations | Frontend, analytics | Privacy and sampling required |
| I5 | Synthetic monitoring | Scripted checks of journeys | Status pages and SLOs | Useful for external dependency checks |
| I6 | CI/CD | Measures deploy time and rollout metrics | Observability, alerts | Tie SLOs to deploy gates |
| I7 | Autoscaler | Scales based on duration or custom metric | Cloud provider APIs, k8s HPA | Consider cooldowns and safety limits |
| I8 | Service mesh | Adds telemetry and routing | Tracing and metrics | Introduces network overhead |
| I9 | DB performance tools | Captures query durations and locks | App tracing and APM | Use for DB tuning and indices |
| I10 | Incident mgmt | Pages and documents incidents | Monitoring and runbooks | Automate alert enrichment |
Frequently Asked Questions (FAQs)
What percentile should I choose for Duration RED?
Start with p95 for common impact and add p99 and p999 for tail. Use business context to pick thresholds.
How long should SLO windows be?
Typical windows are 30 days; shorter windows like 7 days help detect regressions faster. Choose based on traffic and business needs.
How many labels should I add to duration metrics?
Keep labels minimal for cardinality control; include service, endpoint, and environment as core labels.
Can I use mean latency as an SLI?
No. Mean hides tail and is poor for UX-sensitive SLOs.
How do I correlate slow requests to deploys?
Use deploy metadata annotations in metrics and traces and join by time window.
Should I measure client or server duration?
Both. Client gives true UX measure; server gives root-cause context.
How to avoid noisy alerts for small endpoints?
Aggregate related endpoints into a single series and use smoothing windows or minimum sample thresholds.
How to handle low-traffic endpoints for percentiles?
Use longer windows, smoothing, or lower percentile targets to avoid noise.
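A minimum-sample guard like the one suggested above can be sketched as follows (nearest-rank percentile; `min_samples` is an assumed threshold to tune per endpoint):

```python
import math

def guarded_percentile(samples, pct, min_samples=100):
    """Return the pct-th percentile of samples, or None when there
    are too few samples for the estimate to be meaningful.

    Uses the nearest-rank method: the value at ceil(pct/100 * n),
    1-based, in the sorted sample list.
    """
    if len(samples) < min_samples:
        return None  # suppress noisy alerts on low-traffic endpoints
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# 200 latency samples give a stable p95; 3 samples return None.
```

Returning None lets the alerting layer distinguish "no data" from "healthy", rather than paging on a percentile computed from a handful of requests.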
Are sketches better than histograms?
Sketches save memory and estimate quantiles; choose based on backend support and accuracy needs.
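For context on the accuracy question, here is how a quantile is typically estimated from cumulative histogram buckets, with linear interpolation inside the matching bucket. This mirrors the general approach of tools like Prometheus's histogram_quantile, but it is a simplified sketch, not that exact algorithm:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by
    bound, where the last entry holds the total observation count.
    The estimate assumes observations are uniformly distributed
    within each bucket, which is why coarse buckets mislead.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            in_bucket = count - prev_count
            frac = (target - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# With buckets [(0.1s, 50), (0.5s, 90), (1.0s, 100)], p95 lands
# halfway through the last bucket: 0.75s.
```

The uniformity assumption is exactly where coarse buckets hurt: the true p95 could be anywhere between 0.5s and 1.0s, which is the motivation for finer buckets or sketches.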
What is an exemplar and why use it?
Exemplars link histogram observations to trace IDs so you can jump from a slow bucket straight to a representative slow trace.
How do I ensure accurate duration measurement across languages?
Use standardized SDKs and monotonic timers; validate with integration tests.
How to handle third-party dependency latency?
Monitor upstream latency and implement fallbacks, timeouts, and circuit breakers.
How often should we review SLOs?
At least monthly and after any production incident or major release.
What is burn-rate alerting?
Alerting based on how fast the SLO error budget is being consumed. Page when the burn rate is high and the remaining budget is low.
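Burn rate reduces to a short formula: the observed bad-event fraction divided by the error-budget fraction. A minimal sketch assuming a 99% SLO:

```python
def burn_rate(bad_events, total_events, slo_target=0.99):
    """SLO burn rate: observed bad fraction / allowed bad fraction.

    A value of 1.0 means the error budget is being consumed at
    exactly the pace that would exhaust it at the end of the SLO
    window; higher values exhaust it proportionally faster.
    """
    allowed = 1.0 - slo_target           # e.g. 1% budget for a 99% SLO
    observed = bad_events / total_events
    return observed / allowed

# 5% of requests breached the latency threshold under a 99% SLO:
# burn rate ~5x, so the budget would be gone in ~1/5 of the window.
rate = burn_rate(bad_events=50, total_events=1000)
```

Multi-window burn-rate alerting evaluates this at both a short window (fast detection) and a long window (noise suppression) before paging.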
How to keep observability costs under control?
Limit cardinality, sample traces, rollup metrics, and use retention tiers.
How to measure duration in serverless functions?
Use platform-provided invocation duration metrics and instrument cold start time.
How to tune autoscaling for latency?
Use p95 or queue-length as scaling signals instead of CPU alone and tune cooldowns.
How to prevent retries from worsening latency?
Implement retry budgets, exponential backoff, and rate limiting.
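The first two mitigations can be sketched together. `RetryBudget` and `backoff_delay` are hypothetical names, and the 10% budget ratio is an assumed default:

```python
import random

class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of
    total requests, so retries cannot amplify an outage into a storm.
    """
    def __init__(self, ratio=0.1):
        self.ratio = ratio       # assumed default: retries <= 10% of traffic
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        return self.retries < self.requests * self.ratio

    def record_retry(self):
        self.retries += 1

def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full-jitter exponential backoff delay in seconds: a random
    wait in [0, min(cap, base * 2^attempt)] spreads retries out."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Callers check `can_retry()` before each retry and sleep for `backoff_delay(attempt)`; rate limiting at the server side completes the picture.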
Conclusion
Duration RED centralizes latency percentiles as critical SLIs to preserve customer experience, guide SRE actions, and influence architecture. It requires careful instrumentation, testable SLOs, and collaboration across teams.
Next 7 days plan:
- Day 1: Inventory critical endpoints and define initial p95/p99 targets.
- Day 2: Instrument two most critical services with histograms and traces.
- Day 3: Create executive and on-call dashboards with p95/p99 panels.
- Day 4: Implement burn-rate alerting with basic runbook.
- Day 5: Run a small-scale load test and validate percentiles.
- Day 6: Tune alerts to reduce noise and add exemplar linking.
- Day 7: Schedule a post-implementation review and define next SLO maturity steps.
Appendix — Duration RED Keyword Cluster (SEO)
- Primary keywords
- Duration RED
- Duration RED SLI
- Duration RED SLO
- latency RED
- request duration monitoring
- duration-based SLI
- Secondary keywords
- tail latency monitoring
- p99 latency SLO
- duration percentiles
- real user monitoring duration
- synthetic duration checks
- histogram latency
- Long-tail questions
- what is duration red and how to measure it
- how to set p95 and p99 SLOs for APIs
- how to instrument duration metrics in microservices
- best practices for measuring tail latency in Kubernetes
- how to correlate traces with duration histograms
- how to reduce serverless cold start latency p95
- what is exemplar in observability for duration
- how to prevent retry storms increasing latency
- how to tune autoscaler for request latency
- how to design runbooks for latency incidents
- how to calculate burn rate for duration SLOs
- what percentile should I use for user-facing APIs
- how to implement client-observed SLIs
- what are common pitfalls measuring duration red
- how to measure duration across cloud services
- Related terminology
- latency distribution
- histogram buckets
- quantile estimation
- trace exemplars
- monotonic timers
- sampling policy
- observability pipeline
- cardinality management
- burn-rate alerting
- error budget
- service-level indicator
- service-level objective
- distributed tracing
- real user monitoring
- synthetic monitoring
- canary deployments
- circuit breaker
- backpressure
- queuing delay
- cold start
- p95 p99 p999
- response time percentiles
- probe latency
- request queue time
- autoscaling latency metric
- APM for latency
- k8s probe tuning
- latency runbook
- latency postmortem
- latency heatmap
- latency dashboard
- latency SLI computation
- latency aggregation window
- service mesh latency
- exemplars to trace mapping
- RUM duration
- backend processing time
- startup time histogram
- slow query log
- queue wait histogram
- deployment latency regression
- latency cost tradeoff
- latency mitigation strategies
- latency observability best practices