{"id":75,"date":"2025-06-10T07:49:17","date_gmt":"2025-06-10T07:49:17","guid":{"rendered":"https:\/\/sreschool.com\/blog\/?p=75"},"modified":"2025-06-10T07:49:31","modified_gmt":"2025-06-10T07:49:31","slug":"service-level-indicators-sli-a-complete-guide","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/","title":{"rendered":"Service Level Indicators (SLI) &#8211; A Complete Guide"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1536\" src=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png\" alt=\"\" class=\"wp-image-76\" srcset=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png 1024w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI-200x300.png 200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">\ud83e\udded Complete Guide &amp; Tutorial to Service Level Indicators (SLI)<\/h1>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcd8 <strong>1. 
Introduction to SLI<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is an SLI?<\/h3>\n\n\n\n<p>A <strong>Service-Level Indicator (SLI)<\/strong> is a <strong>quantitative measure<\/strong> of some aspect of service performance (such as uptime, latency, error rate, or throughput) that reflects user experience (<a href=\"https:\/\/medium.com\/site-reliability-engineering-leadership\/sli-deep-dive-cae92bd90a79?utm_source=chatgpt.com\">medium.com<\/a>, <a href=\"https:\/\/sre.google\/sre-book\/service-level-objectives\/?utm_source=chatgpt.com\">sre.google<\/a>).<\/p>\n\n\n\n<p>SLIs are not arbitrary metrics; they are carefully chosen to represent <strong>what matters most to users<\/strong> and are usually framed as a ratio (e.g., <code>successful_requests \/ total_requests<\/code>) or a percentile (e.g., <code>p99 latency<\/code>).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why SLIs matter<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They help engineers <strong>translate system behaviors into user impact<\/strong>.<\/li>\n\n\n\n<li>They provide early <strong>warning signals<\/strong> before user frustration mounts.<\/li>\n\n\n\n<li>They form the foundation of reliability guarantees when expressed as <strong>SLOs<\/strong> (Service-Level Objectives) and contractual <strong>SLAs<\/strong> (Service-Level Agreements) (<a href=\"https:\/\/www.netapp.com\/learn\/cvo-blg-sre-slos-defining-slas-slis-and-slos-in-sre\/?utm_source=chatgpt.com\">netapp.com<\/a>, <a href=\"https:\/\/newrelic.com\/blog\/best-practices\/what-are-slos-slis-slas?utm_source=chatgpt.com\">newrelic.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">SLI, SLO, and SLA: How They Relate<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Term<\/th><th>Definition<\/th><\/tr><\/thead><tbody><tr><td><strong>SLI<\/strong><\/td><td>A measured value indicating service level (e.g., \u201c99.5% of video requests completed in &lt;2s\u201d) (<a 
href=\"https:\/\/newrelic.com\/blog\/best-practices\/what-are-slos-slis-slas?utm_source=chatgpt.com\">newrelic.com<\/a>)<\/td><\/tr><tr><td><strong>SLO<\/strong><\/td><td>A target for the SLI (e.g., \u201c99% of requests &lt;300ms\u201d)<\/td><\/tr><tr><td><strong>SLA<\/strong><\/td><td>A contract based on SLOs with penalties for missed targets<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Example<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SLI:<\/strong> Latency<\/li>\n\n\n\n<li><strong>SLO:<\/strong> \u201c95% of HTTP responses in &lt;300ms\u201d<\/li>\n\n\n\n<li><strong>SLA:<\/strong> \u201c$100 service credit per % below 95% per month\u201d<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd17 <strong>2. SLI vs SLO vs SLA<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Definitions &amp; Differences<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SLI<\/strong> is the data point.<\/li>\n\n\n\n<li><strong>SLO<\/strong> is the goal.<\/li>\n\n\n\n<li><strong>SLA<\/strong> is the contract enforcing consequences for missing the goal (<a href=\"https:\/\/newrelic.com\/blog\/best-practices\/what-are-slos-slis-slas?utm_source=chatgpt.com\">newrelic.com<\/a>, <a href=\"https:\/\/www.reddit.com\/r\/sre\/comments\/12n864k\/slo_sla_sli_simply_explained\/?utm_source=chatgpt.com\">reddit.com<\/a>, <a href=\"https:\/\/www.linkedin.com\/pulse\/implementing-sli-slo-like-sre-practical-guide-beginners-harsur-btraf?utm_source=chatgpt.com\">linkedin.com<\/a>, <a href=\"https:\/\/www.capgemini.com\/insights\/expert-perspectives\/site-reliability-engineering-demystifying-slis-slos-and-error-budgets\/?utm_source=chatgpt.com\">capgemini.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Visual Diagram<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;User Experience] \u2192 MEASURE via SLI \u2192 TARGET with SLO \u2192 CONTRACT via 
SLA\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Example mapping<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SLI:<\/strong> 99.9% HTTP 200<\/li>\n\n\n\n<li><strong>SLO:<\/strong> \u226599.5% of requests must succeed per week<\/li>\n\n\n\n<li><strong>SLA:<\/strong> Credit issued if service dips below target.<\/li>\n<\/ul>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cNever set an SLA stricter than your SLO,\u201d per Google\u2019s SRE practices (<a href=\"https:\/\/www.reddit.com\/r\/sre\/comments\/18cu66f\/what_are_some_of_the_best_practices_for_measuring\/?utm_source=chatgpt.com\">reddit.com<\/a>, <a href=\"https:\/\/newrelic.com\/blog\/best-practices\/what-are-slos-slis-slas?utm_source=chatgpt.com\">newrelic.com<\/a>, <a href=\"https:\/\/www.linkedin.com\/pulse\/implementing-sli-slo-like-sre-practical-guide-beginners-harsur-btraf?utm_source=chatgpt.com\">linkedin.com<\/a>).<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcca <strong>3. 
Types of SLIs (Key Categories)<\/strong><\/h2>\n\n\n\n<p>SLIs generally fall into six categories:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">a) <strong>Availability \/ Uptime<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Percentage of successful requests (HTTP 2xx\/3xx) (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Service_level_indicator?utm_source=chatgpt.com\">en.wikipedia.org<\/a>, <a href=\"https:\/\/sre.google\/sre-book\/service-level-objectives\/?utm_source=chatgpt.com\">sre.google<\/a>).<\/li>\n\n\n\n<li>Typically cumulative over time (e.g., daily uptime).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">b) <strong>Latency<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>P50, P95, P99 response times.<\/li>\n\n\n\n<li>Measured at client or server; server-side often used as proxy (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Service_level_indicator?utm_source=chatgpt.com\">en.wikipedia.org<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">c) <strong>Throughput<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Requests or transactions per second (RPS\/TPS).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">d) <strong>Error Rate<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Percentage of HTTP 5xx failures per total requests.<\/li>\n\n\n\n<li>Critical for user trust.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">e) <strong>Saturation<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Resource usage % (CPU, RAM, disk I\/O).<\/li>\n\n\n\n<li>Indicates approaching capacity limits (<a href=\"https:\/\/sre.google\/sre-book\/service-level-objectives\/?utm_source=chatgpt.com\">sre.google<\/a>, <a href=\"https:\/\/www.linkedin.com\/pulse\/implementing-sli-slo-like-sre-practical-guide-beginners-harsur-btraf?utm_source=chatgpt.com\">linkedin.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">f) <strong>Durability \/ Correctness (for data services)<\/strong><\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Data loss rate or replication lag.<\/li>\n\n\n\n<li>Critical for storage and data pipelines (<a href=\"https:\/\/www.reddit.com\/r\/sre\/comments\/12n864k\/slo_sla_sli_simply_explained\/?utm_source=chatgpt.com\">reddit.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\u2699\ufe0f <strong>4. How to Define a Good SLI<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">SMART Criteria<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Specific:<\/strong> Tied to a measurable user outcome.<\/li>\n\n\n\n<li><strong>Measurable:<\/strong> Backed by metrics you can collect reliably.<\/li>\n\n\n\n<li><strong>Achievable:<\/strong> Based on historical data.<\/li>\n\n\n\n<li><strong>Relevant:<\/strong> Reflects user experience.<\/li>\n\n\n\n<li><strong>Time-bound:<\/strong> Defined over a time window (<a href=\"https:\/\/www.netapp.com\/learn\/cvo-blg-sre-slos-defining-slas-slis-and-slos-in-sre\/?utm_source=chatgpt.com\">netapp.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Granularity<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Choose between endpoint-level and service-level SLIs.<\/li>\n\n\n\n<li>Endpoint-level SLIs are more precise but harder to manage; service-level SLIs are easier to manage but can hide a single failing endpoint.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Focus on User Experience<\/h3>\n\n\n\n<p>Excellent SLIs correlate directly with user satisfaction; for example, the app crash rate is a better SLI than raw CPU usage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Anti-patterns<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tracking generic metrics while labeling them as SLIs.<\/li>\n\n\n\n<li>Using averages (which mask outliers) instead of percentiles.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\uddea <strong>5. 
Collecting and Measuring SLIs<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Telemetry Sources<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prometheus + Grafana<\/strong><\/li>\n\n\n\n<li><strong>New Relic<\/strong>, <strong>Datadog<\/strong><\/li>\n\n\n\n<li><strong>Google Cloud Monitoring<\/strong>, <strong>AWS CloudWatch<\/strong> (<a href=\"https:\/\/oladosu777.medium.com\/site-reliability-engineering-sli-implementation-example-bb8dcd5429ba?utm_source=chatgpt.com\">oladosu777.medium.com<\/a>, <a href=\"https:\/\/www.netapp.com\/learn\/cvo-blg-sre-slos-defining-slas-slis-and-slos-in-sre\/?utm_source=chatgpt.com\">netapp.com<\/a>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Logs vs Metrics vs Traces<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logs<\/strong>: rich in detail, but SLIs parsed from them break when the log format changes.<\/li>\n\n\n\n<li><strong>Metrics<\/strong>: pre-aggregated values; the usual source for SLIs.<\/li>\n\n\n\n<li><strong>Traces<\/strong>: detailed timing and dependency data.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Instrumentation Libraries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OpenTelemetry<\/strong>, <strong>Micrometer<\/strong>, vendor SDKs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example: Prometheus Query<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>sum(rate(http_requests_total{job=\"frontend\",status=~\"2..\"}&#91;5m]))\n\/\nsum(rate(http_requests_total{job=\"frontend\"}&#91;5m]))\n<\/code><\/pre>\n\n\n\n<p>This calculates the fraction of <code>frontend<\/code> requests in the last 5 minutes that returned a 2xx status, i.e. the success rate (availability). Only 2xx responses count as successes here; widen the <code>status<\/code> matcher if 3xx responses should count too.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd04 <strong>6. Integrating SLIs with SLOs and Error Budgets<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is an Error Budget?<\/h3>\n\n\n\n<p>An <strong>Error Budget<\/strong> is the acceptable amount of unreliability over a given time period. It&#8217;s calculated as:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Error Budget = 100% &#8211; SLO Target<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>For example, if your SLO is 99.9% availability over 30 days, your error budget is 0.1% of the 43,200 minutes in that window, or 43.2 minutes of allowed downtime.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why Error Budgets Matter<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Allow <strong>engineering trade-offs<\/strong>: is there enough budget left to release today even though latency is elevated?<\/li>\n\n\n\n<li>Create a <strong>collaboration point<\/strong> between product and ops.<\/li>\n\n\n\n<li>Align reliability with business impact.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Burn Rate<\/h3>\n\n\n\n<p>Burn rate is how fast you are consuming your error budget: the observed error ratio divided by the error budget. A burn rate of 1 uses up the budget exactly at the end of the SLO window; a burn rate of 2 uses it up in half that time.<\/p>\n\n\n\n<p><strong>Fast burn<\/strong> = something is wrong and needs immediate action, such as a rollback<br><strong>Slow burn<\/strong> = a gradual deterioration; fix it in the next sprint<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Prometheus Burn Rate Query:<\/strong><\/p>\n<\/blockquote>\n\n\n\n<pre class=\"wp-block-code\"><code>(sum(rate(errors&#91;5m])) \/ sum(rate(total_requests&#91;5m]))) \/ (1 - 0.999)\n<\/code><\/pre>\n\n\n\n<p>The first ratio is the error rate over the last 5 minutes; dividing it by the error budget (<code>1 - 0.999<\/code> for a 99.9% SLO) turns it into a burn rate. <code>errors<\/code> and <code>total_requests<\/code> stand in for your own counter names.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">SLI-SLO-Error Budget Flow<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Define <strong>SLI<\/strong>: e.g., % successful API calls<\/li>\n\n\n\n<li>Set 
<strong>SLO<\/strong>: 99.9% success over 30 days<\/li>\n\n\n\n<li>Monitor and visualize usage of error budget in real time<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udee0\ufe0f <strong>7. Tooling &amp; Dashboards<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Open Source SLO\/SLI Tools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sloth<\/strong> (open-source project maintained by slok):\n<ul class=\"wp-block-list\">\n<li>YAML-based SLO-to-Prometheus generator<\/li>\n\n\n\n<li>Generates Prometheus recording rules and burn-rate alerts automatically<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>SLO Generator<\/strong> (by Google Cloud):\n<ul class=\"wp-block-list\">\n<li>Declarative YAML for multi-source SLOs<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Nobl9<\/strong> (commercial, listed here for comparison):\n<ul class=\"wp-block-list\">\n<li>SaaS platform focused on SLO adoption<\/li>\n\n\n\n<li>Integrates with Prometheus, Datadog, CloudWatch, etc.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Visualization with Grafana<\/h3>\n\n\n\n<p>Create dashboards to track:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Error budget remaining<\/li>\n\n\n\n<li>Burn rate alerts<\/li>\n\n\n\n<li>Top-level SLI widgets (latency, errors, saturation)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Alerting<\/h3>\n\n\n\n<p>Tie alerts directly to SLO violations, not raw metrics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alert: \u201cError budget consumed &gt;80% in last 6 hours\u201d<\/li>\n\n\n\n<li>Alert: \u201cBurn rate above 5x in last 15 minutes\u201d<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcc1 <strong>8. 
Real-World SLI Examples<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">a) Web Application<\/h3>\n\n\n\n<p><strong>SLI:<\/strong> 99.9% of HTTP requests must return status 200\u2013399 within 300ms<br><strong>Metrics Used:<\/strong> <code>http_requests_total<\/code>, <code>response_duration_seconds<\/code><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">b) Database<\/h3>\n\n\n\n<p><strong>SLI:<\/strong> 99.95% of read queries must succeed within 100ms<br><strong>Tools:<\/strong> PostgreSQL exporter for Prometheus<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">c) Messaging System<\/h3>\n\n\n\n<p><strong>SLI:<\/strong> Message delivery success rate \u226599.99%<br><strong>SLO:<\/strong> Max 1 lost message per 10,000 sent<br><strong>Source:<\/strong> Kafka or RabbitMQ exporters<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">d) Mobile App<\/h3>\n\n\n\n<p><strong>SLI:<\/strong> &lt;2% of sessions should crash<br><strong>Measured By:<\/strong> Crashlytics \/ Firebase SDK<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">e) API Gateway<\/h3>\n\n\n\n<p><strong>SLI:<\/strong> \u22640.1% of requests return 5xx errors<br><strong>Metrics:<\/strong> Status codes from gateway logs or observability agents<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd10 <strong>9. 
SLIs for Security and Compliance<\/strong><\/h2>\n\n\n\n<p>While security is harder to quantify, here are usable SLIs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Auth endpoint uptime<\/strong>: % of successful login requests<\/li>\n\n\n\n<li><strong>Token refresh rate<\/strong>: % of failed token renewals<\/li>\n\n\n\n<li><strong>Audit log delivery delay<\/strong>: \u22645s end-to-end delay for log entries<\/li>\n\n\n\n<li><strong>S3 bucket access latency<\/strong>: 99% under 300ms<\/li>\n<\/ul>\n\n\n\n<p>These are especially useful for SOC2, ISO27001, or FedRAMP compliance tracking.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcda <strong>10. Case Studies<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Google<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses SLOs and error budgets <strong>as release gates<\/strong>.<\/li>\n\n\n\n<li>\u201cIf you&#8217;re burning budget, features don\u2019t go live.\u201d<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Netflix<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Measures <strong>QoE SLIs<\/strong>: start time, rebuffer rate, resolution drop<\/li>\n\n\n\n<li>SLIs directly drive streaming algorithm tuning<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Shopify<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLOs used to decide when <strong>to halt feature rollout<\/strong><\/li>\n\n\n\n<li>Team SLOs tied to incident postmortems and bonus metrics<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Incident Tie-In<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Missed SLO \u2192 automatic postmortem trigger<\/li>\n\n\n\n<li>SLI dashboards help visualize outage impact<\/li>\n\n\n\n<li>Example: Elevated latency \u2192 missed 99% threshold \u2192 rollback triggered<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\udde9 <strong>11. 
Common Challenges &amp; Mistakes<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">a) Too Many SLIs<\/h3>\n\n\n\n<p>Trying to track 20+ indicators leads to confusion. Stick to 2\u20134 per service.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">b) Wrong Metrics<\/h3>\n\n\n\n<p>Not every metric is an SLI. CPU usage is useful for diagnosis, but it is not an SLI unless it is tied to something users experience.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">c) Ignoring Customer Perspective<\/h3>\n\n\n\n<p>An internal 5xx error (for example, one that is retried successfully) may never be visible to users, so it should not drive the SLI. Measure <strong>what users see<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">d) Vanity Metrics<\/h3>\n\n\n\n<p>Raw counts without context (e.g., total API hits) say nothing about reliability; express SLIs as a ratio of good events to total events.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udca1 <strong>12. Best Practices<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>User-focused first<\/strong>: Measure impact on the end user.<\/li>\n\n\n\n<li><strong>Start small<\/strong>: Define one SLI + SLO per service and iterate.<\/li>\n\n\n\n<li><strong>Automate alerts<\/strong>: Integrate with Prometheus + Alertmanager.<\/li>\n\n\n\n<li><strong>Review monthly<\/strong>: Revisit SLIs and SLOs every month, and again after incidents or architecture changes.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\uddea <strong>13. Hands-on Labs \/ Exercises<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario<\/h3>\n\n\n\n<p>You run an e-commerce site. 
Define these:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>SLI<\/strong>: Proportion of product searches that respond in &lt;300ms<\/li>\n\n\n\n<li><strong>SLO<\/strong>: 95% of searches meet that threshold over 7 days<\/li>\n\n\n\n<li><strong>Prometheus Query<\/strong>:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>sum(rate(response_duration_seconds_bucket{handler=\"\/search\",le=\"0.3\"}&#91;5m]))\n\/\nsum(rate(response_duration_seconds_count{handler=\"\/search\"}&#91;5m]))\n<\/code><\/pre>\n\n\n\n<p>This query assumes a Prometheus histogram named <code>response_duration_seconds<\/code> with a <code>handler<\/code> label and a bucket boundary at 0.3 seconds; substitute the names your service actually exports.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Grafana Dashboard<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create panels for:\n<ul class=\"wp-block-list\">\n<li><code>SLI %<\/code><\/li>\n\n\n\n<li>5-minute burn rate<\/li>\n\n\n\n<li>Error budget usage<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Alert<\/h3>\n\n\n\n<p>\u201cIf burn rate &gt;2x for 15 minutes, page the on-call team.\u201d<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcdc <strong>14. Templates and Reference Material<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">SLI YAML Template<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>service: frontend\nsli:\n  name: request_success_rate\n  description: Successful HTTP 2xx responses\n  query: &gt;\n    sum(rate(http_requests_total{status=~\"2..\"}&#91;5m])) \/\n    sum(rate(http_requests_total&#91;5m]))\n  objective: 99.9\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">SLO Document Sample<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>Service: Payments\nSLO: 99.95% of transactions succeed in under 500ms\nError Budget: 0.05%\nMonitoring Source: Prometheus\nReview Interval: Monthly\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd1a <strong>15. 
Conclusion &amp; What\u2019s Next<\/strong><\/h2>\n\n\n\n<p>SLIs are the <strong>foundation of reliability engineering<\/strong>, helping teams move from reactive monitoring to proactive service-level accountability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Key Takeaways:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs quantify what <strong>users care about<\/strong>.<\/li>\n\n\n\n<li>SLOs set clear <strong>targets<\/strong>.<\/li>\n\n\n\n<li>Error budgets allow for <strong>safe experimentation<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Continue Learning:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/sre.google\/books\/\">Google SRE Book<\/a><\/li>\n\n\n\n<li>CNCF <a href=\"https:\/\/github.com\/cncf\/tag-observability\">SLO WG<\/a><\/li>\n\n\n\n<li>OpenTelemetry for distributed tracing &amp; instrumentation<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here\u2019s the beginning of your in-depth, 12\u201315 page tutorial on SLI (Service Level Indicators)\u2014crafted to flow logically, cover all key [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-75","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Service Level Indicators (SLI) - A Complete Guide - SRE School<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta 
property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Service Level Indicators (SLI) - A Complete Guide - SRE School\" \/>\n<meta property=\"og:description\" content=\"Here\u2019s the beginning of your in-depth, 12\u201315 page tutorial on SLI (Service Level Indicators)\u2014crafted to flow logically, cover all key [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"SRE School\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-10T07:49:17+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-10T07:49:31+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png\" \/>\n<meta name=\"author\" content=\"Rajesh Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Rajesh Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/\",\"url\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/\",\"name\":\"Service Level Indicators (SLI) - A Complete Guide - SRE School\",\"isPartOf\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage\"},\"thumbnailUrl\":\"http:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png\",\"datePublished\":\"2025-06-10T07:49:17+00:00\",\"dateModified\":\"2025-06-10T07:49:31+00:00\",\"author\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\"},\"breadcrumb\":{\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage\",\"url\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png\",\"contentUrl\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png\",\"width\":1024,\"height\":1536},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/sreschool.com\/blog\/\"},{\"@type\":\"ListIt
em\",\"position\":2,\"name\":\"Service Level Indicators (SLI) &#8211; A Complete Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/sreschool.com\/blog\/#website\",\"url\":\"https:\/\/sreschool.com\/blog\/\",\"name\":\"SRESchool\",\"description\":\"Master SRE. Build Resilient Systems. Lead the Future of Reliability\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/sreschool.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201\",\"name\":\"Rajesh Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g\",\"caption\":\"Rajesh Kumar\"},\"sameAs\":[\"http:\/\/sreschool.com\/blog\"],\"url\":\"https:\/\/sreschool.com\/blog\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Service Level Indicators (SLI) - A Complete Guide - SRE School","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/","og_locale":"en_US","og_type":"article","og_title":"Service Level Indicators (SLI) - A Complete Guide - SRE School","og_description":"Here\u2019s the beginning of your in-depth, 12\u201315 page tutorial on SLI (Service Level Indicators)\u2014crafted to flow logically, cover all key [&hellip;]","og_url":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/","og_site_name":"SRE School","article_published_time":"2025-06-10T07:49:17+00:00","article_modified_time":"2025-06-10T07:49:31+00:00","og_image":[{"url":"http:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png","type":"","width":"","height":""}],"author":"Rajesh Kumar","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Rajesh Kumar","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/","url":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/","name":"Service Level Indicators (SLI) - A Complete Guide - SRE School","isPartOf":{"@id":"https:\/\/sreschool.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage"},"image":{"@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage"},"thumbnailUrl":"http:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png","datePublished":"2025-06-10T07:49:17+00:00","dateModified":"2025-06-10T07:49:31+00:00","author":{"@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201"},"breadcrumb":{"@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/"]}]},{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#primaryimage","url":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png","contentUrl":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/06\/SLI.png","width":1024,"height":1536},{"@type":"BreadcrumbList","@id":"https:\/\/sreschool.com\/blog\/service-level-indicators-sli-a-complete-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sreschool.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Service Level Indicators (SLI) &#8211; A Complete Guide"}]},{"@type":"WebSite","@id":"https:\/\/sreschool.com\/blog\/#website","url":"https:\/\/sreschool.com\/blog\/","name":"SRESchool","description":"Master SRE. 
Build Resilient Systems. Lead the Future of Reliability","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sreschool.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Person","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/0ffe446f77bb2589992dbe3a7f417201","name":"Rajesh Kumar","image":{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f901a4f2929fa034a291a8363d589791d5a3c1f6a051c22e744acb8bfc8e022a?s=96&d=mm&r=g","caption":"Rajesh Kumar"},"sameAs":["http:\/\/sreschool.com\/blog"],"url":"https:\/\/sreschool.com\/blog\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/75","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/comments?post=75"}],"version-history":[{"count":2,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/75\/revisions"}],"predecessor-version":[{"id":78,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/75\/revisions\/78"}],"wp:attachment":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/media?parent=75"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/categories?post=75"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/tags?pos
t=75"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}