What Are the Four Golden Signals? Meaning, Architecture, Examples, Use Cases, and How to Measure Them (2026 Guide)

Terminology

Quick Definition

The four golden signals are the four core metrics SREs monitor to assess service health: latency, traffic, errors, and saturation. Analogy: they are the vital signs on a patient monitor for distributed systems. Formally: they are a prioritized SLI set that guides alerting and SLOs for production services.


What are the Four golden signals?

The Four golden signals are a minimal, high-leverage observability model focusing on latency, traffic, errors, and saturation as the primary indicators of system health. They are not a complete observability solution; they are the first-order signals that should be visible and alertable across services.

  • What it is:
  • A prioritized SLI set for service-level monitoring.
  • A short list designed for rapid incident detection and triage.
  • Guidance for alert thresholds and error-budget decisions.

  • What it is NOT:

  • Not a replacement for deep tracing, logs, or custom business metrics.
  • Not sufficient alone for complex performance debugging or security investigations.
  • Not a universal set for every telemetry need; extend as needed.

  • Key properties and constraints:

  • Low cardinality and high coverage.
  • Actionable and measurable with meaningful SLOs.
  • Should be available at multiple granularity levels (service, endpoint, host, pod).
  • Must be collected with consistent measurement windows and percentiles for latency.
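
The last constraint above — consistent windows and percentiles — is why histogram buckets matter: latency percentiles are usually estimated from cumulative bucket counts, not raw values. A minimal sketch of Prometheus-style linear interpolation (the bucket bounds and counts below are invented illustration data):

```python
def estimate_quantile(q, buckets):
    """Estimate a latency quantile from cumulative histogram buckets,
    interpolating linearly within the bucket that crosses the target rank
    (the same idea Prometheus' histogram_quantile uses)."""
    # buckets: sorted list of (upper_bound_seconds, cumulative_count)
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if count == prev_count:  # empty bucket: no interpolation possible
                return bound
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts for buckets with upper bounds 0.1s, 0.25s, 0.5s, 1s.
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
p95 = estimate_quantile(0.95, buckets)  # ~0.25s: the tail the mean would hide
```

Note that the answer is only as good as the bucket layout — which is exactly why inconsistent bucketization across services distorts percentiles.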

  • Where it fits in modern cloud/SRE workflows:

  • First-line monitoring for alerting and on-call.
  • Input to SLOs and error budgets for release control.
  • Triage starting point before diving into traces, logs, and profiling.
  • Integrated into CI/CD gating, chaos experiments, and automated remediation.

  • Diagram description (text-only)

  • Client traffic hits edge load balancer then service cluster.
  • Metrics exporters run on nodes and sidecars collecting latency, traffic, errors, saturation.
  • Metrics flow to a metrics collector and TSDB.
  • Alerting evaluates SLOs and triggers on-call routing.
  • Traces and logs are linked to metric alerts for deeper investigation.

Four golden signals in one sentence

Latency, traffic, errors, and saturation are the four prioritized SLIs that give SREs the clearest early warning of service health and recovery needs.

Four golden signals vs related terms

| ID | Term | How it differs from the four golden signals | Common confusion |
|----|------|---------------------------------------------|------------------|
| T1 | Metrics | Aggregated numeric data; broader than the four signals | Metrics include many counters and gauges |
| T2 | Tracing | Traces show distributed request paths not summarized by signals | See details below: T2 |
| T3 | Logs | Raw event records; high cardinality and context-rich | Logs are not summarized like the signals |
| T4 | SLIs | Service Level Indicators are metrics representing user experience | SLIs can be the four signals but also others |
| T5 | SLOs | Objectives set on SLIs; policy, not raw telemetry | SLOs use signals to set targets |
| T6 | Alerts | Notifications derived from SLIs/SLOs | Alerts are actions, not metrics |
| T7 | Dashboards | Visualizations of metrics, including the four signals | Dashboards are views, not definitions |
| T8 | Business KPIs | High-level business outcomes; may map to signals indirectly | Business KPIs extend beyond infrastructure |

Row Details

  • T2: Tracing expands per-request timing across services and helps pinpoint which service or span causes high latency; use tracing after the four signals surface an incident.

Why do the Four golden signals matter?

Four golden signals matter because they provide a compact, actionable view into system health that can be tied directly to user experience and business impact.

  • Business impact:
  • Faster detection reduces customer-facing downtime and revenue loss.
  • Clear SLIs enable realistic SLOs that balance feature velocity with reliability.
  • Accurate telemetry reduces trust erosion by providing transparent metrics during incidents.

  • Engineering impact:

  • Reduces MTTD (Mean Time To Detect) by focusing alerts on the most relevant indicators.
  • Decreases toil by standardizing instrumentation across services.
  • Enables safe deployment via error-budget-aware release policies.

  • SRE framing:

  • SLIs: latency and errors become primary SLIs for user-facing endpoints.
  • SLOs: set objectives on those SLIs, e.g., 99.9% p95 latency under threshold.
  • Error budgets: drive prioritization of reliability work vs feature work.
  • On-call: provides first-line signals that escalate to runbooks and playbooks.

  • Realistic “what breaks in production” examples:
  1. Traffic spike causes increased queueing and p95 latency, leading to timeouts.
  2. Memory leak increases saturation on pods, causing OOM kills and higher error rates.
  3. Dependency regression causes elevated 5xx errors from a downstream API.
  4. Network partition between clusters increases latency and drops throughput.
  5. Load balancer misconfiguration routes too much traffic to a slow backend, increasing p99 latency.


Where are the Four golden signals used?

| ID | Layer/Area | How the four golden signals appear | Typical telemetry | Common tools |
|----|-----------|------------------------------------|-------------------|--------------|
| L1 | Edge and network | Latency and errors at ingress and CDN | Request latency, 5xx rate, throughput | See details below: L1 |
| L2 | Service / application | Per-endpoint latency and error rate | p50/p95/p99, error counts, requests/sec | Prometheus, Grafana, OpenTelemetry |
| L3 | Infrastructure nodes | Saturation metrics on hosts and VMs | CPU, memory, disk, socket usage | Node exporter, cloud metrics |
| L4 | Kubernetes | Pod saturation and request latency per service | Pod CPU, memory, pod restarts, network I/O | kube-state-metrics, kubelet metrics |
| L5 | Serverless / PaaS | Invocation latency and error ratio | Cold-start latency, invocations/sec, errors | Provider metrics and custom traces |
| L6 | CI/CD | Deploy-time errors and rollout latency | Deploy success rate, time to deploy | CI telemetry and deployment hooks |
| L7 | Security / Observability | Anomalous spikes in errors or traffic patterns | Error spikes, unusual traffic | SIEM metrics and observability platform |

Row Details

  • L1: Edge use shows overall service health before internal routing; tools include cloud LB and CDN metrics and synthetic checks.

When should you use the Four golden signals?

  • When necessary:
  • New production services need baseline observability.
  • You need a lightweight SLI set for SLOs and on-call alerts.
  • Rapid triage is required for incidents across many services.

  • When it’s optional:

  • Internal tooling or low-risk batch jobs where business impact is small.
  • Highly experimental feature branches in dev environments.

  • When NOT to use / overuse it:

  • Not a replacement for domain-specific business metrics (e.g., checkout success rate).
  • Do not rely solely on four signals for security incident detection.
  • Over-alerting on raw signals without contextual thresholds causes noise.

  • Decision checklist:

  • If service is user-facing and revenue impactful -> implement four signals.
  • If service latency and error rates affect SLOs -> implement and alert.
  • If internal batch job with no SLAs -> optional minimal monitoring.
  • If complex microservice with heavy interdependencies -> extend beyond four signals.

  • Maturity ladder:

  • Beginner: Basic collection of the four metrics, single percentile for latency, simple alert.
  • Intermediate: Per-endpoint metrics, p95/p99, SLOs and error budgets, dashboards.
  • Advanced: Automated remediation, adaptive alerting, anomaly detection, integration with CI gating and capacity autoscaling based on saturation signals.

How do the Four golden signals work?

Step-by-step explanation of components and workflow:

  1. Instrumentation: Application and platform export counters and histograms for latency, request counts, error counts, and resource gauges.
  2. Collection: Agents/sidecars collect metrics and forward them to a centralized TSDB or observability backend.
  3. Aggregation: Metrics are aggregated into meaningful SLIs: percentiles for latency, rate for traffic, ratio for errors, usage for saturation.
  4. Evaluation: Alerting rules and SLO evaluators compare SLIs against thresholds and error budgets.
  5. Notification: Alerts trigger on-call routing, automated runbooks, or SLO-based actions like blocking rollouts.
  6. Investigation: Triage uses traces and logs, linked from metric alerts, to identify the root cause.
  7. Remediation and postmortem: Apply fixes, document the incident, and adjust SLOs or thresholds as needed.
  • Data flow and lifecycle:
  • Instrumentation -> Local aggregation -> Export -> TSDB/metrics store -> SLO evaluation and dashboards -> Alerts -> Traces/logs for deep investigation -> Postmortem updates.
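
The aggregation stage of this lifecycle can be sketched in a few lines: given raw monotonic counter samples, traffic is the per-second rate of increase, and the error SLI is the ratio of the two rates. A dependency-free illustration (the sample values are invented):

```python
def counter_rate(samples):
    """Per-second rate of increase of a monotonic counter between the
    oldest and newest sample in the window."""
    # samples: list of (timestamp_seconds, counter_value), oldest first
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two counter series scraped every 15s over a 1-minute window.
requests = [(0, 1000), (15, 1300), (30, 1600), (45, 1900), (60, 2200)]
errors   = [(0, 10),   (15, 13),   (30, 22),   (45, 31),   (60, 40)]

traffic_rps = counter_rate(requests)            # 20.0 requests/sec
error_ratio = counter_rate(errors) / traffic_rps  # 0.025, i.e. 2.5% errors
```

Working from counters rather than gauges is what makes these SLIs robust to missed scrapes: a gap changes resolution, not correctness.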

  • Edge cases and failure modes:

  • Metrics collector outage can blind SREs; ensure high-availability and self-monitoring.
  • Aggregation errors (incorrect histogram buckets) distort percentiles.
  • Sampling introduces bias for high-throughput services.
  • Cardinality explosion from tag misuse causes storage and query issues.

Typical architecture patterns for Four golden signals

  1. Sidecar exporter pattern – Use sidecars to collect app metrics and forward to central collector. – Use when direct instrumentation is constrained or in service mesh environments.
  2. Agent node exporter pattern – Run node-level agents for infra metrics and lightweight app metrics. – Use for host-level saturation visibility.
  3. Push gateway for ephemeral workloads – Use push gateway for batch jobs or short-lived functions to report metrics. – Use with caution; ensure job completion marks metrics lifecycle.
  4. Mesh-native telemetry – Use service meshes to derive latency, traffic, and error metrics at the proxy layer. – Use when mesh provides consistent telemetry and minimal app change.
  5. Cloud-provider-managed metrics – Use provider-managed telemetry for serverless and managed services. – Use when control plane integrates with provider SLIs and alarms.
  6. Hybrid observability pipeline – Combine metrics, traces, and logs in an integrated platform with correlation IDs. – Use for advanced troubleshooting and linking metric alerts to traces.

Failure modes & mitigation

| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|----|--------------|---------|--------------|------------|----------------------|
| F1 | Metrics pipeline down | Missing or stale metrics | Collector or TSDB outage | Use HA collectors and self-monitoring | Missing datapoints |
| F2 | High p95 latency | Slow user responses | Resource saturation or slow downstream | Scale or fix downstream; use retries | Latency percentiles up |
| F3 | Error rate spike | Increased 5xx or client errors | Deployment bug or config change | Roll back and run canary analysis | Error ratio spike |
| F4 | Saturation unnoticed | Queue growth and failures | Missing saturation metrics | Add resource gauges and autoscaling | CPU/memory utilization |
| F5 | Cardinality explosion | Slow queries and high costs | Unbounded tag values | Enforce tag cardinality limits | High TSDB write and query latency |
| F6 | Wrong percentile configs | Misleading latency SLI | Using mean instead of p95/p99 | Standardize percentile collection | Percentile mismatches |
| F7 | Alert storm | Pager fatigue | Low thresholds or noisy environment | Add aggregation and dedupe | High alert volume |
| F8 | Sampling bias | Trace not showing root cause | Excessive sampling | Adjust sampling or use tail-based sampling | Missing trace links |

Row Details

  • F5: Cardinality explosion is often caused by using unique request IDs or user IDs as metric labels; replace with coarse buckets or hashed low-cardinality labels.
  • F6: Percentile mismatch occurs when histograms are not exported consistently; ensure uniform bucketization and client library support.
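
One hedged sketch of the F5 mitigation: collapse raw values into bounded label sets before they reach the metrics client. The function and label names below are illustrative, not from any specific library:

```python
def safe_labels(endpoint: str, status_code: int) -> dict:
    """Map raw request attributes to a bounded set of metric labels,
    preventing cardinality explosion in the TSDB."""
    # Collapse raw status codes (potentially dozens of values) into five classes.
    status_class = f"{status_code // 100}xx"
    # Never label by user ID or request ID; pass the route template rather than
    # the concrete path so /users/123 and /users/456 share one series.
    return {"endpoint": endpoint, "status_class": status_class}

safe_labels("/users/{id}", 503)  # {'endpoint': '/users/{id}', 'status_class': '5xx'}
```

The key property is that the number of possible label combinations is fixed at instrumentation time, regardless of traffic.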

Key Concepts, Keywords & Terminology for Four golden signals

A glossary of 40+ terms follows. Each entry: term — definition — why it matters — common pitfall.

  • Alert — Notification triggered by a rule — Immediate response mechanism — Pitfall: noisy or unclear alerts.
  • Aggregation window — Time window for metric aggregation — Affects percentiles and rates — Pitfall: too long hides spikes.
  • API latency — Time for API responses — Direct user experience measure — Pitfall: measuring client vs server latency mismatch.
  • Autoscaling — Adjusting capacity automatically — Addresses saturation — Pitfall: reaction lag to sudden spikes.
  • Baseline — Normal metric level — Needed for anomaly detection — Pitfall: drift without rebaseline.
  • Canary — Small release to validate changes — Reduces blast radius — Pitfall: insufficient traffic to the canary.
  • Cardinality — Number of distinct metric label combinations — Impacts storage and queries — Pitfall: unbounded tags causing explosion.
  • CI/CD — Continuous integration and deployment — Integration point for observability gating — Pitfall: no SLO checks in pipeline.
  • Counter — Monotonic metric type — Good for traffic and errors — Pitfall: misuse as gauge.
  • Datapoint gap — Missing metric samples — Indicates pipeline problems — Pitfall: assuming zero is real.
  • Dashboard — Visual representation of metrics — Supports situational awareness — Pitfall: overcrowded dashboards.
  • Data retention — Duration metrics are kept — Influences historical analysis — Pitfall: short retention limits postmortem.
  • Error budget — Allowable SLO violation budget — Balances reliability vs feature work — Pitfall: misuse as blame metric.
  • Error rate — Ratio of failed requests — Core SLI — Pitfall: not distinguishing transient vs sustained errors.
  • Exporter — Component that exposes metrics — Bridge between app and collector — Pitfall: version mismatch.
  • Histogram — Distribution of observations — Enables percentile calculations — Pitfall: wrong buckets distort percentiles.
  • Instrumentation — Adding telemetry to code — Enables SLIs — Pitfall: inconsistent instrumentation across services.
  • Latency — Time to serve a request — Primary user-facing metric — Pitfall: mean hides tail latency.
  • Linearly scaling costs — Cost increasing with usage — Observability costs can balloon — Pitfall: unbounded retention and cardinality.
  • Logs — Event records from systems — Provide context to metrics — Pitfall: unstructured logs make parsing hard.
  • Metric label — Key-value pair describing metric context — Enables filtering — Pitfall: high-cardinality labels.
  • Metrics pipeline — Collection to storage path — Central for reliable telemetry — Pitfall: single point of failure.
  • Mean vs percentile — Different statistical measures — Percentiles reflect tail behavior — Pitfall: using mean for latency SLIs.
  • Monitoring — Continuous observation of systems — Incident precursor — Pitfall: monitoring without actionable alerts.
  • Node exporter — Host-level telemetry agent — Shows saturation — Pitfall: not container-aware.
  • Observability — Ability to infer system state from telemetry — Critical for debugging — Pitfall: focusing on logs only.
  • On-call — Rotational duty to respond to incidents — Endpoint for alerts — Pitfall: unclear runbooks increase stress.
  • P50/P95/P99 — Latency percentiles — Represent typical and tail latencies — Pitfall: measuring inconsistent percentiles.
  • Rate — Requests per unit time — Reflects load — Pitfall: sudden spikes cause downstream chaos.
  • Reactive remediation — Manual incident fixes — Short-term fix — Pitfall: slow and error-prone.
  • Request tracing — Per-request distributed trace — Pinpoints latency sources — Pitfall: over-sampling causes overhead.
  • Resource saturation — Exhaustion of CPU, memory, IO — Causes failures — Pitfall: ignoring ephemeral bursts.
  • Sampling — Reducing telemetry volume — Controls cost — Pitfall: losing critical rare traces.
  • SLI — Service Level Indicator — Quantitative measure of service health — Pitfall: poorly chosen SLIs.
  • SLO — Service Level Objective — Target threshold for an SLI — Pitfall: unrealistic SLOs.
  • Synthetic checks — Automated external tests — Simulate user behavior — Pitfall: synthetic tests diverge from real traffic.
  • Throughput — Volume of requests processed — Balances with latency and errors — Pitfall: optimizing throughput can hide latency regressions.
  • TL;DR — Summary term — Useful for quick communication — Pitfall: over-simplifying root causes.
  • Traces — Timed spans representing operations — Link metric anomalies to code — Pitfall: incomplete context propagation.
  • TSDB — Time-series database — Stores aggregated metrics — Pitfall: write or query performance issues.
  • VM allocation — Memory allocation patterns on virtual machines — Impacts saturation — Pitfall: not tracked by default metrics.
  • Vector aggregation — PromQL or similar expressions — Enables flexible queries — Pitfall: expensive queries on high cardinality.
  • Workload isolation — Separating critical from noncritical workloads — Lowers blast radius — Pitfall: cross-talk via shared infra.

How to Measure the Four golden signals (Metrics, SLIs, SLOs)

| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|----|-----------|-------------------|----------------|-----------------|---------|
| M1 | Latency p95 | Tail user latency experience | Histogram percentiles over 5m | p95 < 500ms for user API | Mean hides tail |
| M2 | Latency p99 | Worst-case user latency | Histogram percentiles over 5m | p99 < 2s for critical endpoints | Costly to compute |
| M3 | Traffic RPS | Load and capacity requirements | Sum requests per sec per service | Use baseline plus 2x buffer | Spiky traffic requires smoothing |
| M4 | Error rate | Fraction of failed requests | Failed requests / total per 5m | <0.1% for critical flows | Distinguish 4xx vs 5xx |
| M5 | Saturation (CPU) | Resource pressure indicator | CPU usage per pod or VM | Keep 20–30% headroom | Burst CPU not reflected |
| M6 | Saturation (memory) | Memory saturation and leaks | Resident memory usage per container | Keep 20% headroom | OOMs appear late |
| M7 | Queue length | Backpressure indicator | Pending requests or queue depth | Near zero under normal load | Not always instrumented |
| M8 | Pod restart rate | Stability of containers | Restarts per minute per pod | Zero or near zero | Crash-loop visibility needed |
| M9 | Timeouts | Client-side timeout count | Count timeout responses | Monitor increasing trend | Client vs server timeout mismatch |
| M10 | Request success latency | SLI combining success and latency | Percent of requests under latency threshold | 99% under threshold | Avoid counting errors as successes |
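
Row M10's combined SLI can be sketched as a single pass over request records. This is a toy illustration — production systems derive it from counters, not per-request lists — and the threshold value is an assumption:

```python
def request_success_latency_sli(requests, threshold_s=0.5):
    """Fraction of requests that both succeeded and finished under the
    latency threshold — the combined 'good request' SLI."""
    # requests: list of (latency_seconds, succeeded) tuples
    good = sum(1 for latency, ok in requests if ok and latency < threshold_s)
    return good / len(requests)

# Four requests: two good, one too slow, one failed.
reqs = [(0.12, True), (0.30, True), (0.70, True), (0.20, False)]
sli = request_success_latency_sli(reqs)  # 2 of 4 are "good" -> 0.5
```

The gotcha in the table shows up directly here: a failed-but-fast request must count against the SLI, which the `ok and latency < threshold_s` conjunction enforces.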


Best tools to measure Four golden signals

List of recommended tools and structured info follows.

Tool — Prometheus

  • What it measures for Four golden signals: Latency histograms, counters for traffic and errors, node metrics for saturation.
  • Best-fit environment: Kubernetes and cloud-native stacks.
  • Setup outline:
  • Instrument services with client libraries.
  • Run node exporters and kube-state-metrics.
  • Use remote_write for long-term storage.
  • Configure recording rules for p95/p99.
  • Integrate alertmanager for alerts.
  • Strengths:
  • Flexible query language and ecosystem integrations.
  • Wide community support and exporters.
  • Limitations:
  • Scalability and long-term storage management require extra components.
  • Cost and complexity at high cardinality.

Tool — OpenTelemetry + Collector

  • What it measures for Four golden signals: Traces, metrics, and resource metrics unified for correlation.
  • Best-fit environment: Hybrid cloud and microservices with tracing needs.
  • Setup outline:
  • Instrument apps with OTEL SDKs.
  • Deploy Collector as agent or gateway.
  • Configure exporters to chosen backend.
  • Use resource attributes for low-cardinality labels.
  • Strengths:
  • Vendor-neutral and unifies traces/metrics/logs.
  • Flexible processing pipelines.
  • Limitations:
  • Evolving standards and some gaps in metric conventions.

Tool — Grafana (with Loki/Tempo)

  • What it measures for Four golden signals: Visualization of metrics and linking traces/logs for debugging.
  • Best-fit environment: Teams needing integrated dashboards.
  • Setup outline:
  • Connect to TSDB and trace store.
  • Create dashboards for the four signals.
  • Enable alerts and annotations.
  • Strengths:
  • Rich dashboards and alerting capabilities.
  • Integrates logs and traces.
  • Limitations:
  • Requires backends for storage; UI alone is not sufficient.

Tool — Cloud provider monitoring (managed)

  • What it measures for Four golden signals: Provider-specific metrics for managed services and infra.
  • Best-fit environment: Serverless and managed PaaS.
  • Setup outline:
  • Enable provider monitoring and export to central store.
  • Configure alerting and SLOs in provider console.
  • Strengths:
  • Low instrumentation effort for managed services.
  • Often includes integrated dashboards and anomaly detection.
  • Limitations:
  • Varying coverage and vendor lock-in concerns.
  • Metric granularity and retention may be limited.

Tool — Datadog / New Relic / Splunk Observability

  • What it measures for Four golden signals: Unified metrics, traces, and logs with AI-assisted insights.
  • Best-fit environment: Enterprises wanting managed observability.
  • Setup outline:
  • Install agents and instrumentation.
  • Use auto-instrumentation where available.
  • Configure dashboards and monitors.
  • Strengths:
  • Managed scaling and AI-driven anomaly detection.
  • End-to-end correlation across telemetry.
  • Limitations:
  • Cost and potential data lock-in.
  • Custom rules may be needed for specific SLOs.

Recommended dashboards & alerts for Four golden signals

  • Executive dashboard:
  • Panels: Overall SLA compliance, global error budget burn rate, p95 latency for major services, top 5 services by errors.
  • Why: Provides leadership a quick health summary and business impact view.

  • On-call dashboard:

  • Panels: Live p95/p99 latency, requests/sec, error rate by endpoint, pod/host saturation, active alerts, recent deployments.
  • Why: Provides quick triage info for responders; highlights affected endpoints and infra.

  • Debug dashboard:

  • Panels: Per-endpoint latency heatmap, trace samples linked to errors, request logs, queue depths, dependency call latency.
  • Why: Enables deep-dive troubleshooting and root cause analysis.

Alerting guidance:

  • What should page vs ticket:
  • Page on sustained SLO breaches or rapid burn rates that threaten user experience.
  • Ticket minor degradations and non-urgent trends.
  • Burn-rate guidance:
  • If burn rate > 2x for >15 minutes, escalate to on-call and consider rolling back releases.
  • If burn rate > 5x, initiate incident response and mitigation.
  • Noise reduction tactics:
  • Deduplicate alerts that share causal labels.
  • Group related alerts by service and endpoint.
  • Use suppression during expected maintenance windows.
  • Add refractory windows for frequent flapping alerts.
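
The burn-rate guidance above can be sketched as a small calculation, assuming burn rate is defined in the usual way: observed error ratio divided by the SLO's error budget (the sample numbers are invented):

```python
def burn_rate(observed_error_ratio, slo_target):
    """How fast the error budget is being consumed relative to plan.
    A burn rate of 1.0 means the budget lasts exactly the SLO window."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_ratio / budget

# 99.9% SLO; currently seeing 0.4% errors over the last 15 minutes.
br = burn_rate(0.004, 0.999)  # 4.0x burn

if br > 5:
    action = "initiate incident response"
elif br > 2:
    action = "escalate to on-call; consider rollback"
else:
    action = "observe"
```

In practice the ratio is evaluated over multiple windows (e.g. a short and a long one together) to avoid paging on brief blips, but the thresholds map directly onto the 2x/5x guidance above.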

Implementation Guide (Step-by-step)

This implementation guide outlines a practical path from zero to mature four-signal observability.

1) Prerequisites

  • Service owner and on-call roster defined.
  • CI/CD with the ability to gate deployments.
  • Observability backend selected and accessible.
  • Basic tracing and logging setup available.

2) Instrumentation plan

  • Identify critical endpoints and business flows.
  • Add counters for total requests and error counts.
  • Add histograms for request durations with consistent buckets.
  • Expose resource metrics (CPU, memory, queue length).
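
A minimal, dependency-free sketch of this instrumentation plan, using plain dicts as stand-ins for what would typically be prometheus_client Counters and Histograms in practice (all names here are illustrative):

```python
import time
from collections import defaultdict

# Stand-ins for a real metrics client's Counter / Counter / Histogram.
REQUESTS = defaultdict(int)   # total requests per endpoint
ERRORS = defaultdict(int)     # failed requests per endpoint
LATENCY = defaultdict(list)   # duration observations per endpoint

def instrumented(endpoint):
    """Decorator adding the request-side golden signals to a handler:
    traffic (request count), errors (exception count), latency (duration)."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.monotonic()
            REQUESTS[endpoint] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                ERRORS[endpoint] += 1
                raise
            finally:
                LATENCY[endpoint].append(time.monotonic() - start)
        return inner
    return wrap

@instrumented("/checkout")
def handle_checkout():
    return "ok"

handle_checkout()
```

Saturation, the fourth signal, comes from resource gauges (CPU, memory, queue depth) rather than per-request hooks, which is why it is listed separately in the plan above.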

3) Data collection

  • Deploy exporters or sidecars and configure scraping/export.
  • Ensure a consistent label schema across services.
  • Configure retention policies and remote storage for long-term SLOs.

4) SLO design

  • Choose SLIs from the four signals for critical endpoints.
  • Set SLO targets using historical data or business risk tolerance.
  • Define error budget policies and automated reactions.
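
Setting targets from historical data, as step 4 suggests, can be sketched with a simple heuristic: pick a target slightly below the worst recent day so the objective is ambitious but attainable. The margin here is an arbitrary illustration, not a recommendation:

```python
def achievable_slo(daily_success_ratios, margin=0.0005):
    """Suggest an SLO target just below the worst recent day, so the
    objective is historically attainable (illustrative heuristic)."""
    worst = min(daily_success_ratios)
    return max(0.0, worst - margin)

# Daily success ratios for the service over the last four days.
history = [0.9995, 0.9991, 0.9998, 0.9993]
target = achievable_slo(history)  # ~0.9986 -> start with roughly a 99.86% SLO
```

The point of anchoring on history is to avoid the "unrealistic SLOs" pitfall from the glossary: a target the service has never met generates a permanently burning error budget and constant noise.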

5) Dashboards

  • Build executive, on-call, and debug dashboards.
  • Include deployment and change annotations.
  • Surface links to runbooks and trace explorers.

6) Alerts & routing

  • Create alert policies tied to SLOs (not raw metrics).
  • Set paging thresholds for sustained breaches.
  • Configure dedupe, suppression, and escalation policies.

7) Runbooks & automation

  • Write runbooks for common issues (high latency, high error rate, saturation).
  • Implement automated remediation where safe (scale up, circuit-breaker).
  • Maintain playbooks for rollback and feature flagging.

8) Validation (load/chaos/game days)

  • Run load tests to validate latency and saturation signals.
  • Run chaos experiments to validate error detection and recovery.
  • Conduct game days simulating SLO breaches and runbook execution.

9) Continuous improvement

  • Review postmortems and adjust SLOs.
  • Reduce alert noise and refine dashboards.
  • Automate repetitive runbook steps.

Checklists:

  • Pre-production checklist
  • Instrumentation present for latency, traffic, errors, saturation.
  • Exporters and collectors configured.
  • Dashboards show baseline metrics.
  • Smoke alerts configured for missing metrics.

  • Production readiness checklist

  • SLOs and error budgets defined.
  • On-call notified for alerts.
  • Runbooks linked to alerts.
  • Autoscaling and capacity plans validated.

  • Incident checklist specific to Four golden signals

  • Confirm which of the four signals triggered.
  • Determine affected endpoints and scope.
  • Check recent deployments and config changes.
  • Correlate with traces/logs and apply mitigation.
  • Record timeline and update SLO/error budget burn calculations.

Use Cases of Four golden signals

Eight realistic use cases with context, problem, and metrics:

  1. User-facing API performance regression
  • Context: Public REST API for mobile app.
  • Problem: Sudden p95 latency regression after deploy.
  • Why: Four signals detect latency and increased errors quickly.
  • What to measure: p95/p99 latency, error rate, CPU saturation.
  • Typical tools: Prometheus, Grafana, tracing.

  2. Autoscaling validation
  • Context: Microservice with autoscale on CPU.
  • Problem: Autoscaler scales too late, causing latency spikes.
  • Why: Saturation metrics reveal resource strain early.
  • What to measure: CPU, queue length, latency p95.
  • Typical tools: Metrics exporter, cluster autoscaler telemetry.

  3. Dependency outage
  • Context: Service depends on third-party API.
  • Problem: Downstream 5xx causes upstream errors.
  • Why: Error rate spike points to broken dependency.
  • What to measure: Error rates per external call, latency, retries.
  • Typical tools: Tracing, error counting, alerting.

  4. Cost vs performance tuning
  • Context: Budget constraints require right-sizing.
  • Problem: Over-provisioning wastes cost; under-provisioning increases latency.
  • Why: Saturation and latency guide right-sizing trades.
  • What to measure: CPU utilization, p95 latency, throughput.
  • Typical tools: Cloud metrics, cost telemetry.

  5. Canary release monitoring
  • Context: Deploy new version to subset of users.
  • Problem: New version causes regressions.
  • Why: Four signals enable quick detection during canary.
  • What to measure: Canary error rate, latency, traffic.
  • Typical tools: Feature flags, canary tooling, metrics.

  6. Serverless cold-start troubleshooting
  • Context: Serverless platform with variable cold starts.
  • Problem: High latency outliers due to cold starts.
  • Why: Latency and traffic patterns reveal cold start behavior.
  • What to measure: Invocation latency distribution, concurrency.
  • Typical tools: Provider metrics, tracing.

  7. Database connection pool issues
  • Context: Service uses DB connection pool.
  • Problem: Pool exhaustion causing queued requests and timeouts.
  • Why: Queue length and saturation highlight resource constraint.
  • What to measure: Connection pool usage, queue length, latency.
  • Typical tools: App metrics, DB monitoring.

  8. Security anomaly detection
  • Context: Unexpected traffic spikes may signal DDoS.
  • Problem: Large increase in requests causing capacity issues.
  • Why: Traffic metric spike warns of abnormal behavior.
  • What to measure: Requests/sec, 4xx/5xx distribution, origin IP patterns.
  • Typical tools: Edge metrics, SIEM correlation.


Scenario Examples (Realistic, End-to-End)

Scenario #1 — Kubernetes service degraded after rollout

Context: A stateless microservice on Kubernetes rolled out a new version.
Goal: Detect and mitigate regressions quickly using the four signals.
Why the four golden signals matter here: They provide immediate feedback on latency, errors, and pod saturation post-deploy.
Architecture / workflow: Ingress -> Service -> Pods (sidecar exporters) -> Prometheus -> Alertmanager -> On-call.
Step-by-step implementation:

  • Instrument endpoints with histograms and counters.
  • Configure Pod resource requests and limits.
  • Set SLO on p95 latency and error rate.
  • Create alerts for p95 breach and increasing error rate.
  • Use canary deployment to roll out to 5% of traffic first.

What to measure: p95/p99 latency, error rate per endpoint, pod CPU and memory, pod restart count.
Tools to use and why: Prometheus for metrics, Grafana for dashboards, kube-state-metrics for pod health.
Common pitfalls: Not tagging canary traffic; insufficient canary traffic for detection.
Validation: Run a synthetic load against the canary and compare signals to baseline.
Outcome: The canary detects a latency spike; the rollout is paused, and investigation reveals a database connection leak that is fixed before full rollout.
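
The canary-versus-baseline comparison in this scenario can be sketched as a simple gate; the tolerance values below are illustrative assumptions, not recommended settings:

```python
def canary_regressed(baseline_p95, canary_p95, baseline_err, canary_err,
                     latency_tolerance=1.2, error_tolerance=1.5):
    """Flag a canary whose p95 latency or error ratio is materially worse
    than baseline; thresholds are illustrative, not prescriptive."""
    if canary_p95 > baseline_p95 * latency_tolerance:
        return True  # latency signal regressed
    if canary_err > max(baseline_err * error_tolerance, 0.001):
        return True  # error signal regressed (floor avoids noise near zero)
    return False

# Canary p95 of 450ms vs baseline 300ms: a 1.5x regression -> pause rollout.
regressed = canary_regressed(0.300, 0.450, 0.001, 0.001)
```

The absolute floor on the error comparison matters for low-traffic canaries, where tiny baselines make ratio checks alone far too sensitive.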

Scenario #2 — Serverless function cold-start and scaling

Context: An HTTP-triggered function on a managed serverless platform.
Goal: Reduce cold-start impact and ensure SLO compliance.
Why the four golden signals matter here: Latency distribution and traffic patterns expose cold starts and concurrency limits.
Architecture / workflow: Client -> Cloud Function -> Managed metrics -> Observability backend.
Step-by-step implementation:

  • Record invocation latency histogram and cold-start tag.
  • Monitor functions/sec and concurrency metrics.
  • Set alert on p99 latency or cold-start rate.
  • Implement warmers or adjust provisioned concurrency.

What to measure: p95/p99 latency, cold-start count, concurrent executions.
Tools to use and why: Provider metrics and traces to correlate cold starts.
Common pitfalls: Conflating network latency with cold-start latency.
Validation: Simulate burst traffic and measure tail latency before and after provisioning.
Outcome: Provisioned concurrency reduced p99 latency and improved SLO compliance.

Scenario #3 — Incident response and postmortem using four signals

Context: Unexpected production outage causing user errors.
Goal: Rapid detection, mitigation, and root-cause analysis.
Why the four golden signals matter here: They drive initial paging and scope for responders.
Architecture / workflow: Users -> API -> Service -> Metrics pipeline -> Alerting -> Incident commander.
Step-by-step implementation:

  • On alert, gather p95/p99 latency, error rates, and saturation metrics.
  • Check recent deployments and config changes.
  • Correlate with traces and error logs to identify failing dependency.
  • Mitigate by circuit-breaker and rollback.
  • Prepare postmortem focusing on why alerts didn’t prevent escalation.

What to measure: Time to detect, time to mitigate, error budget burn.
Tools to use and why: Prometheus, tracing, logs, incident management tool.
Common pitfalls: Missing trace context due to sampling; unclear runbooks.
Validation: Run retrospective analysis and update runbook and SLO thresholds.
Outcome: Faster detection and improved runbook; automated mitigation added.

Scenario #4 — Cost/performance trade-off in autoscaling

Context: Cloud service with a cost-sensitive workload.
Goal: Balance cost and performance by tuning autoscaling based on the four signals.
Why the four golden signals matter here: Saturation and latency inform optimal scaling thresholds.
Architecture / workflow: Load balancer -> Autoscaled instances -> Monitoring -> Cost metrics.
Step-by-step implementation:

  • Track p95 latency and CPU utilization together.
  • Implement autoscaler policies that use custom metrics like queue length.
  • Run load tests to find breakpoints where latency degrades.
  • Adjust scaling thresholds and cool-down periods to reduce cost without harming SLOs.

What to measure: p95 latency, CPU utilization, requests/sec, cost per hour. Tools to use and why: Cloud metrics, custom metrics for queue depth, cost telemetry. Common pitfalls: Autoscaler spikes causing oscillations; misaligned cool-down windows. Validation: A/B test scaling policies and observe error-budget impact. Outcome: Reduced cost by 20% while maintaining SLOs.
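A minimal sketch of the scaling decision described above; the thresholds and the hypothetical `desired_replicas` helper are illustrative, not a real autoscaler API.

```python
# Sketch: a scale-up/down decision combining p95 latency, CPU utilization,
# and queue depth. All thresholds are illustrative assumptions.

def desired_replicas(current, p95_ms, cpu_pct, queue_depth,
                     slo_ms=300, cpu_target=70, max_queue=100):
    """Return the next replica count: scale up if ANY signal is saturated,
    scale down only when ALL signals have comfortable headroom."""
    if p95_ms > slo_ms or cpu_pct > cpu_target or queue_depth > max_queue:
        return current + 1
    if p95_ms < slo_ms * 0.5 and cpu_pct < cpu_target * 0.5 and queue_depth == 0:
        return max(1, current - 1)
    return current  # in the comfort band: hold steady (avoids oscillation)

print(desired_replicas(3, p95_ms=450, cpu_pct=60, queue_depth=10))  # scale up
```

Requiring all signals to show headroom before scaling down is one simple way to damp the oscillations listed under common pitfalls.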

Common Mistakes, Anti-patterns, and Troubleshooting

Common mistakes and anti-patterns, each expressed as symptom -> root cause -> fix. Observability-specific pitfalls are flagged explicitly.

  1. Symptom: Frequent false alerts -> Root cause: Aggressive thresholds and noisy metric variance -> Fix: Shift to SLO-based alerts and increase aggregation window.
  2. Symptom: Missing latency spikes -> Root cause: Using mean instead of percentiles -> Fix: Use p95/p99 histograms.
  3. Symptom: High TSDB cost -> Root cause: Unbounded label cardinality -> Fix: Enforce label hygiene and drop high-cardinality labels.
  4. Symptom: No tracing for errors -> Root cause: Sampling too aggressive -> Fix: Use tail-based sampling for error traces.
  5. Symptom: Metrics gap during outage -> Root cause: Single collector failure -> Fix: HA collectors and monitor collector health.
  6. Symptom: Slow queries on dashboards -> Root cause: Expensive queries over high-cardinality data -> Fix: Use recorded rules and reduce cardinality.
  7. Symptom: On-call burnout -> Root cause: Alert storm and unclear runbooks -> Fix: Triage alerts, improve runbooks, and implement dedupe.
  8. Symptom: Late detection of saturation -> Root cause: Not monitoring queue depth or internal resource pools -> Fix: Add queue and pool metrics.
  9. Symptom: Inaccurate SLOs -> Root cause: Insufficient historical data -> Fix: Collect baseline data and iterate SLOs.
  10. Symptom: Inconsistent metrics across services -> Root cause: Different bucketization and units -> Fix: Standardize metric libraries and buckets.
  11. Symptom: High tail latency in production but not in load tests -> Root cause: Synthetic tests not matching production traffic patterns -> Fix: Use realistic traffic replays.
  12. Symptom: Error budget consumed rapidly after deploys -> Root cause: No canary or insufficient canary traffic -> Fix: Implement canary analysis and rollback automation.
  13. Symptom: Dashboards show different values than alerts -> Root cause: Different aggregation windows or queries -> Fix: Align queries and use recording rules.
  14. Symptom: Missing host metrics in containers -> Root cause: Not deploying node exporters or permissions blocked -> Fix: Deploy node exporters and check RBAC.
  15. Observability pitfall: Too many non-actionable metrics -> Root cause: Instrumenting everything regardless of use-case -> Fix: Prioritize four signals and business SLIs.
  16. Observability pitfall: Logs not correlated with traces -> Root cause: Missing correlation IDs -> Fix: Add trace IDs to logs and propagate context.
  17. Observability pitfall: Over-reliance on vendor AI -> Root cause: Blind trust in automated root-cause suggestions -> Fix: Treat AI suggestions as assistive, not authoritative.
  18. Symptom: Rapid cost increases -> Root cause: High retention and ingestion during incident -> Fix: Limit retention for noncritical metrics and use downsampling.
  19. Symptom: Difficulty scaling observability -> Root cause: Centralized single-point collector -> Fix: Decentralize collectors and use remote_write.
  20. Symptom: Deployments cause intermittent errors -> Root cause: Database migrations blocking queries -> Fix: Run migrations with backward-compatible patterns.
  21. Symptom: Alerts trigger but no one paged -> Root cause: Misconfigured routing and silences -> Fix: Review alertmanager routing and escalation policies.
  22. Symptom: Service degrades silently -> Root cause: No synthetic checks or external user perspective -> Fix: Add synthetic transactions and RUM where applicable.
  23. Symptom: Latency regressions unnoticed for hours -> Root cause: High alert thresholds and long aggregation -> Fix: Add progressive alerting like staged thresholds.
  24. Symptom: Unauthorized data exposure -> Root cause: Including PII in metrics -> Fix: Scrub sensitive fields and avoid PII in labels.
  25. Symptom: Long postmortem cycles -> Root cause: Missing timeline from metrics and deploy annotations -> Fix: Timestamp deploys and annotate dashboards.
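Mistake 2 above (mean instead of percentiles) is easy to demonstrate with a toy sample: a couple of 5-second outliers barely move the mean but dominate p99. The sample values are illustrative.

```python
# Sketch: why mean latency hides tail problems. 98 fast requests plus two
# 5-second outliers: the mean still looks acceptable, p99 does not.

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [50] * 98 + [5000] * 2      # milliseconds
mean = sum(latencies) / len(latencies)  # 149 ms -- looks fine vs a 300 ms SLO
p99 = percentile(latencies, 99)         # 5000 ms -- clearly broken
print(mean, p99)
```

This is why the fix for that mistake is p95/p99 histograms rather than tighter thresholds on the mean.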

Best Practices & Operating Model

  • Ownership and on-call:
  • Service teams own SLOs and associated runbooks.
  • Dedicated SREs assist with SLO design and incident response.
  • Rotate on-call with documented escalation paths.

  • Runbooks vs playbooks:

  • Runbooks: Step-by-step instructions for operational fixes.
  • Playbooks: Higher-level decision trees for complex incidents.
  • Keep both versioned and review after incidents.

  • Safe deployments:

  • Use canary releases, automated rollbacks, and feature flags.
  • Gate deployments by SLO checks and canary health.
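The deployment gate described above can be sketched as a simple check; the budget and canary thresholds are illustrative, and in practice the inputs would come from your monitoring backend rather than being hard-coded.

```python
# Sketch: gate a rollout on remaining error budget and canary health.
# Thresholds and input values are illustrative assumptions.

def deploy_allowed(budget_remaining_fraction, canary_error_rate,
                   min_budget=0.2, max_canary_errors=0.01):
    """Block the rollout if the error budget is nearly exhausted or the
    canary's error rate exceeds the tolerated baseline."""
    if budget_remaining_fraction < min_budget:
        return False, "error budget exhausted: freeze releases"
    if canary_error_rate > max_canary_errors:
        return False, "canary unhealthy: roll back"
    return True, "proceed with rollout"

ok, reason = deploy_allowed(0.6, 0.002)
print(ok, reason)
```

Wiring a check like this into the pipeline is what the CI/CD row in the tooling map refers to as blocking deploys when budgets are exhausted.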

  • Toil reduction and automation:

  • Automate common remediation (scale up, circuit-break, restart).
  • Use automation for alert dedupe and grouping.
  • Track toil reduction metrics as part of SRE work.

  • Security basics:

  • Avoid PII in metrics and logs.
  • Limit access to observability data stores.
  • Monitor for anomalous telemetry that could indicate security incidents.

Weekly/monthly routines:

  • Weekly: Review recent alerts, top contributors to error budgets, and any rollbacks.
  • Monthly: Review SLOs, adjust targets based on business changes, and inspect telemetry costs.
  • Quarterly: Run game days and capacity planning exercises.

What to review in postmortems related to Four golden signals:

  • Which of the four signals triggered and when.
  • How quickly SLOs and error budgets detected the issue.
  • Gaps in instrumentation or missing metrics.
  • Changes to thresholds, runbooks, or automation after the incident.

Tooling & Integration Map for Four golden signals (TABLE REQUIRED)

| ID  | Category         | What it does                               | Key integrations                       | Notes                                    |
|-----|------------------|--------------------------------------------|----------------------------------------|------------------------------------------|
| I1  | Metrics store    | Stores time-series metrics and histograms  | Prometheus remote_write, OpenTelemetry | See details below: I1                    |
| I2  | Dashboards       | Visualizes metrics and alerts              | Grafana, dashboard plugins             | Integrates traces and logs               |
| I3  | Tracing          | Collects distributed traces                | OpenTelemetry, Jaeger, Tempo           | Links to metrics via trace IDs           |
| I4  | Logging          | Aggregates logs for context                | Loki, Elasticsearch                    | Correlates with traces and metrics       |
| I5  | Alerting         | Routes alerts and escalations              | Alertmanager, Opsgenie                 | Supports dedupe and grouping             |
| I6  | APM              | Application performance monitoring         | Datadog, New Relic                     | Offers transaction-level insights        |
| I7  | Cloud monitoring | Managed infra and service metrics          | Provider monitoring services           | Good for serverless and managed systems  |
| I8  | CI/CD            | Integrates SLO checks in pipelines         | Pipeline plugins                       | Blocks deploys when budgets exhausted    |
| I9  | Chaos tools      | Induces failures and validates SLOs        | Chaos frameworks                       | Use in game days and testing             |
| I10 | Cost analytics   | Tracks observability and infra cost        | Cost platforms                         | Ties cost to metrics and scaling         |

Row Details

  • I1: Metrics store note: Choose scalable backend for high cardinality and use downsampling for long-term retention.

Frequently Asked Questions (FAQs)

What exactly are the four golden signals?

They are latency, traffic, errors, and saturation—four key SLIs to observe service health.

Are the four signals enough for every system?

No; they are a baseline. Add business and domain-specific SLIs as needed.

How do I pick percentiles for latency?

Start with p95 for common tail behavior and p99 for critical endpoints; adjust by risk.

How often should I evaluate SLOs?

Monthly to quarterly, or after significant product or traffic changes.

Should alerts be based on raw metrics or SLOs?

Prefer SLO-based alerts to align with business impact; use raw metric alerts for emergent failures.
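A minimal sketch of what an SLO-based alert condition looks like, assuming a 99.9% target and the widely used 14.4x multiwindow fast-burn threshold; the window error rates are illustrative.

```python
# Sketch: SLO burn-rate alerting. Burn rate = observed error rate divided
# by the rate that would exactly exhaust the budget over the SLO window.
# The 14.4x threshold follows the common multiwindow fast-burn pattern;
# all numbers here are illustrative assumptions.

def burn_rate(error_rate, slo_target):
    """How fast the error budget is being consumed (1.0 = exactly on pace)."""
    budget = 1 - slo_target
    return error_rate / budget

def should_page(short_window_rate, long_window_rate, slo_target=0.999,
                threshold=14.4):
    """Page only when BOTH windows agree the budget is burning fast,
    which filters out short transient spikes."""
    return (burn_rate(short_window_rate, slo_target) >= threshold and
            burn_rate(long_window_rate, slo_target) >= threshold)

print(should_page(0.02, 0.018))  # sustained 20x/18x burn: page
print(should_page(0.02, 0.001))  # short spike only: do not page
```

The two-window requirement is what makes these alerts track business impact instead of momentary noise.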

How do I avoid metric cardinality problems?

Limit labels, avoid user identifiers, and roll up high-cardinality fields.
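One way to sketch this label-hygiene advice; the allow-list and the status-code roll-up are illustrative choices, not a standard API.

```python
# Sketch: label hygiene -- drop high-cardinality labels (like user IDs)
# and roll status codes up to a class before recording metrics.
# The allowed label names are illustrative assumptions.

ALLOWED_LABELS = {"service", "endpoint", "method", "status_class"}

def scrub_labels(labels):
    """Keep only low-cardinality labels; roll raw status codes up to 4xx/5xx."""
    clean = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    if "status" in labels:  # e.g. 404 -> "4xx"
        clean["status_class"] = f"{str(labels['status'])[0]}xx"
    return clean

raw = {"service": "checkout", "endpoint": "/pay", "user_id": "u-9812",
       "status": 404}
print(scrub_labels(raw))  # user_id dropped, status rolled up
```

Applying a filter like this at instrumentation time is cheaper than paying for the cardinality in the TSDB later.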

How do saturation signals differ from latency?

Saturation measures resource usage; latency measures user experience. Saturation often causes latency increases.

Can service meshes replace instrumentation?

Service meshes can provide telemetry but instrumenting app-level metrics remains valuable.

How to measure saturation for serverless?

Use provider concurrency and memory usage metrics plus function latency.

What is an error budget?

Allowed percentage of SLO violations during a period; it guides release and mitigation actions.

How to correlate metrics with traces and logs?

Propagate trace IDs into logs and link metric alerts to sample traces.
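A minimal sketch of carrying a trace ID into structured log lines so an alert's sample traces can be joined to logs; the field names are illustrative conventions.

```python
# Sketch: propagate a trace/correlation ID into every log line for a
# request so logs, traces, and metric alerts can be joined.
# The ID format and field names are illustrative assumptions.
import json
import uuid

def make_trace_id():
    return uuid.uuid4().hex

def log(message, trace_id, **fields):
    """Emit a structured (JSON) log line carrying the trace ID."""
    record = {"msg": message, "trace_id": trace_id, **fields}
    return json.dumps(record, sort_keys=True)

tid = make_trace_id()
line = log("payment failed", tid, status=502, service="checkout")
print(line)  # every log line for this request shares the same trace_id
```

In real systems the ID usually arrives via context propagation (for example a W3C trace context header) rather than being generated locally.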

Is synthetic monitoring part of four signals?

Synthetic checks complement the four signals by providing external user perspective.

How to handle noisy alerts?

Implement grouping, dedupe, and adjust thresholds to focus on sustained issues.

What percentiles are most useful in production?

p95 and p99 are the most informative for user-facing endpoints.

How do I test my observability stack?

Run load tests, chaos experiments, and game days verifying detection and remediation.

How should teams own SLOs?

Service teams own SLO definition and on-call; SREs advise and enforce best practices.

Should metrics be pushed or pulled?

Pull (scraping) for stable services; push for ephemeral or short-lived workloads.

How to protect observability data from exposure?

Mask sensitive fields and secure access to observability systems.


Conclusion

The Four golden signals provide a focused, high-return foundation for observability in modern cloud-native systems. They accelerate detection, guide SLO-driven operations, and enable effective incident response when combined with traces, logs, and automation.

Next 7 days plan:

  • Day 1: Inventory services and identify critical endpoints for SLOs.
  • Day 2: Instrument requests with counters and histograms for latency and errors.
  • Day 3: Deploy exporters and collectors; validate metrics in dashboards.
  • Day 4: Define SLOs and error-budget policies for top services.
  • Day 5: Configure SLO-based alerts and integrate with on-call routing.
  • Day 6: Run synthetic checks and a short load test to validate signals.
  • Day 7: Hold a review and create missing runbooks for top alert scenarios.

Appendix — Four golden signals Keyword Cluster (SEO)

  • Primary keywords
  • Four golden signals
  • Four golden metrics
  • SRE golden signals
  • Latency traffic errors saturation
  • Four signals monitoring

  • Secondary keywords

  • Observability best practices
  • SLO SLI metrics
  • Cloud-native monitoring
  • Kubernetes observability
  • Serverless monitoring

  • Long-tail questions

  • What are the four golden signals in observability
  • How to measure latency p95 and p99 for SLOs
  • How to set SLOs using four golden signals
  • How to instrument services for the four golden signals
  • What is the difference between errors and saturation in monitoring
  • How do four golden signals fit into incident response
  • How to avoid cardinality explosion in metrics
  • How to use histograms for latency SLIs
  • How to correlate traces with four golden signals alerts
  • When to use percentiles vs mean for latency
  • How to implement canary rollouts with SLO checks
  • How to handle high tail latency in Kubernetes
  • How to monitor serverless cold-start impact
  • How to design dashboards for four golden signals
  • How to create runbooks for latency and error incidents

  • Related terminology

  • SLI
  • SLO
  • Error budget
  • Percentile latency
  • Histogram buckets
  • TSDB
  • Prometheus
  • OpenTelemetry
  • Grafana
  • Alertmanager
  • Sidecar exporter
  • Node exporter
  • Kube-state-metrics
  • Remote write
  • Sampling
  • Tail-based sampling
  • Synthetic monitoring
  • RUM
  • Autoscaling
  • Circuit breaker
  • Canary deployment
  • Feature flagging
  • Chaos engineering
  • Game day
  • Incident commander
  • Postmortem
  • Runbook
  • Playbook
  • Observability pipeline
  • Metric cardinality
  • Downsampling
  • Retention policy
  • Correlation ID
  • Service mesh telemetry
  • Managed monitoring
  • Cost of observability
  • Alert dedupe
  • Grouping and suppression
  • Burn rate
  • Deployment annotation