🧽 Part 1: Introduction & Fundamentals
1. What are SLIs?
Service Level Indicators (SLIs) are precise, quantitative measures that capture specific aspects of system behavior and performance, such as latency, availability, throughput, or error rate. These indicators form the foundation of service-level management in both Site Reliability Engineering (SRE) and DevOps practices.
Why SLIs Matter
SLIs provide the data necessary to:
- Assess whether services are operating reliably
- Set realistic goals for performance and availability
- Detect anomalies and performance regressions
- Trigger alerting and remediation workflows
SLIs help bridge the gap between technical performance and user satisfaction, ultimately enabling engineers to balance feature velocity and system stability.
SLIs, SLOs, and SLAs
- SLIs (Indicators): Raw performance measurements (e.g., 99th percentile latency).
- SLOs (Objectives): The target or goal for a given SLI (e.g., 99.9% availability).
- SLAs (Agreements): Legal/business-level contracts that formalize SLOs with penalties for breaches.
2. SLIs vs SLOs vs SLAs
Understanding the relationship and differences:
| Term | Definition | Example |
|---|---|---|
| SLI | A measurement | HTTP 500 error rate |
| SLO | A target for an SLI | HTTP 500 error rate < 0.1% over 30 days |
| SLA | Legal agreement | 99.9% uptime guaranteed per quarter |
Visual diagrams can illustrate how SLIs roll up into SLOs and then into SLAs. Use real-world examples from cloud service providers like Google Cloud or AWS for better clarity.
3. The Golden Signals
The four golden signals help you detect the most common sources of performance issues:
- Latency: How long requests take
- Traffic: Number of requests or data processed
- Errors: Failure rate
- Saturation: System resource usage (CPU, memory, IOPS)
These signals directly translate into SLIs. For instance, latency SLIs might include P90/P95/P99 request times.
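As a quick illustration, latency-percentile SLIs can be computed directly from raw request samples. This sketch uses the nearest-rank method over hypothetical latency values (the sample data is invented for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 480]

for p in (90, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Note how the tail percentiles diverge sharply from the median here; that gap is exactly why SLIs favor P95/P99 over averages.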
🛠️ Part 2: Designing SLIs
4. How to Design Good SLIs
A good SLI should be:
- Relevant: Tied directly to user experience
- Precise: Easy to measure and monitor
- Actionable: Should lead to clear operational responses
- Sustainable: Not too costly or difficult to measure
5. Choosing the Right Metrics
Use both quantitative (e.g., latency in ms) and qualitative (e.g., user feedback scores) metrics.
- Leading indicators: Predict potential future issues (e.g., CPU nearing saturation)
- Lagging indicators: Reflect past performance (e.g., downtime last month)
6. Types of SLIs
| Type | Description | Example |
|---|---|---|
| Availability | % of successful requests | 99.95% uptime |
| Latency | Time taken to respond | 95th percentile < 300ms |
| Throughput | Volume of requests served | 10K req/sec |
| Error Rate | Failed vs. total requests | 0.1% errors |
| Quality | Correctness of data/response | 99.99% data accuracy |
7. Instrumenting Applications
Expose SLIs using open-source tooling:
- Prometheus: Use client libraries to export metrics
- OpenTelemetry: For distributed tracing and standardized instrumentation
- Custom metrics endpoints: /metrics route in apps
Example code snippets for exporting latency and error rates via the Prometheus client library.
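A minimal sketch of such a snippet, assuming the Python prometheus_client library; the handle_request function, the simulated 1% error rate, and port 8000 are illustrative choices, not part of any real app:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Histogram for request latency; buckets chosen for a sub-second service.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 2.5))

# Counter labeled by status code so an error-rate SLI can be derived.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])

def handle_request():
    """Simulated handler: records one request's status and latency."""
    start = time.monotonic()
    status = "500" if random.random() < 0.01 else "200"  # simulated outcome
    REQUESTS.labels(status=status).inc()
    REQUEST_LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on :8000 in a background thread
    for _ in range(100):     # simulate some traffic, then exit
        handle_request()
```

These two metric names line up with the PromQL examples later in this guide, so the same queries can be pointed at this toy exporter.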
📊 Part 3: Collecting & Analyzing SLIs
8. Monitoring Tools
| Tool | Purpose |
|---|---|
| Prometheus | Metric collection and storage |
| Grafana | Dashboard visualization |
| Datadog | Full-stack monitoring & APM |
| New Relic | Real-time observability |
| GCP Monitoring | Cloud-native SLI tracking |
9. Querying Metrics
- PromQL: learn the query basics, such as rate(), avg_over_time(), and histogram_quantile()
- Common queries:
  - HTTP 500 rate: `rate(http_requests_total{status="500"}[5m])`
  - P95 latency: `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`
10. Alerting Based on SLIs
- Set alert thresholds aligned with SLOs
- Prevent alert fatigue by:
- Grouping alerts
- Using multi-window, multi-burn-rate strategies
- Example:
- Alert if 5xx errors > 0.5% for 10 minutes
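The multi-window, multi-burn-rate idea can be sketched in plain Python. The 99.9% SLO and the 14.4 threshold are illustrative values (a burn rate of 14.4 consumes a 30-day error budget in roughly two days), not universal constants:

```python
def burn_rate(error_fraction, slo):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo
    return error_fraction / budget

def should_alert(err_long, err_short, slo=0.999, threshold=14.4):
    """Fire only when both windows exceed the burn-rate threshold: the long
    window shows sustained impact, the short window confirms the problem is
    still happening. Requiring both reduces alert fatigue."""
    return (burn_rate(err_long, slo) > threshold and
            burn_rate(err_short, slo) > threshold)

# 2% errors over both the last hour and the last 5 minutes: burn rate 20 > 14.4
print(should_alert(err_long=0.02, err_short=0.02))    # True
# Incident already recovered: the short window is clean, so no page
print(should_alert(err_long=0.02, err_short=0.0005))  # False
```

In production the two error fractions would come from PromQL queries over different range windows rather than hard-coded numbers.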
🌐 Part 4: Real-World Examples
11. Web Applications
SLIs:
- HTTP error rate
- Median and P95 response times
- Availability over time
12. APIs & Microservices
SLIs:
- RPC call error rate
- Success rate
- P95 latency for endpoints
13. Databases
SLIs:
- Query success/failure ratio
- Read/write latency
- Replica lag
14. CDNs & Edge Services
SLIs:
- Cache hit/miss ratio
- Time to first byte
- Regional availability
15. CI/CD Pipelines
SLIs:
- Build success rate
- Deployment latency
- Time to recovery after failure
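These pipeline SLIs can be derived from run history. A sketch over hypothetical build records (the timestamps and outcomes below are invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical pipeline runs: (finished_at, succeeded)
runs = [
    (datetime(2024, 1, 1, 9, 0), True),
    (datetime(2024, 1, 1, 10, 0), False),  # failure...
    (datetime(2024, 1, 1, 10, 45), True),  # ...recovered 45 minutes later
    (datetime(2024, 1, 1, 12, 0), False),
    (datetime(2024, 1, 1, 12, 30), True),  # recovered 30 minutes later
]

# Build success rate: fraction of runs that succeeded.
success_rate = sum(ok for _, ok in runs) / len(runs)

# Time to recovery: gap between a first failure and the next success.
recoveries = []
failed_at = None
for finished, ok in runs:
    if not ok and failed_at is None:
        failed_at = finished
    elif ok and failed_at is not None:
        recoveries.append(finished - failed_at)
        failed_at = None

mean_recovery = sum(recoveries, timedelta()) / len(recoveries)
print(f"build success rate: {success_rate:.0%}")  # 60%
print(f"mean time to recovery: {mean_recovery}")  # 0:37:30
```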
🧪 Part 5: Implementing & Evolving SLIs
16. Setting Baselines
- Use historical data to define thresholds
- Analyze long-term trends
17. From SLIs to SLOs
- Map SLIs to user expectations
- Define realistic but ambitious targets
- Use Error Budgets to enforce SLOs
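The error budget falls out of the SLO arithmetic directly; a small sketch of the allowed-downtime calculation for a few common availability targets:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes in the window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min / 30 days")
```

A 99.9% SLO, for example, leaves about 43 minutes of downtime per 30-day window; once that budget is spent, feature work yields to reliability work.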
18. Review and Iteration
- Conduct regular reviews
- Adjust SLIs/SLOs as services and customer expectations evolve
- Include feedback from dev, ops, and product teams
🧑‍💻 Part 6: Hands-on Projects
19. SLI Lab Setup
- Tools: Prometheus, Grafana, Node Exporter
- Set up basic metrics collection and dashboards
20. Monitor a Web App (Flask/Node.js)
- Expose metrics: latency, errors, uptime
- Create Grafana dashboard
21. Define SLIs & Setup Alerts
- Choose key metrics
- Write PromQL-based alerts
22. Visualize SLIs on Dashboard
- Group by service/module
- Annotate with deployments and incidents
23. Mini Project: OpenTelemetry + Grafana
- Trace distributed services
- Derive SLIs from trace data
📚 Appendices
- Glossary of terms
- Templates for SLI/SLO documentation
- Real SLI dashboards (anonymized)
- SLI-focused interview questions
- Sample GitHub repos
- Recommended readings (e.g., Google SRE Workbook)
🎯 Target Audience
- SRE beginners
- DevOps engineers
- System admins transitioning to SRE
- Computer science students focusing on system reliability