
What is Toil in SRE?
Toil is a term coined by Google SREs to describe a specific class of operational work: work that is manual, repetitive, and automatable, and that grows in proportion to the service, so the team cannot keep up without adding headcount.
Toil is the work you do that doesn’t scale and adds no enduring value.
Official Definition (from the Google SRE Book)
"Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
Reference: Google SRE Book, "Eliminating Toil"
Key Characteristics of Toil

Characteristic | Description |
---|---|
Manual | Performed by humans instead of automation |
Repetitive | Occurs often and follows a predictable pattern |
Automatable | A script or tool could handle it |
Tactical | Immediate fix, not a long-term solution |
Reactive | Triggered by alerts/incidents |
Non-durable | Leaves no lasting improvement to the system |
Examples of Toil in SRE
Example | Why it's Toil |
---|---|
Manually restarting failed pods | Repetitive, automatable |
Fixing disk full alerts every week | Predictable, non-durable |
Reconfiguring firewall rules by hand | Can be automated via Terraform or scripts |
Responding to noisy alerts | Reactive, not value-adding |
Generating weekly reports manually | Easily scriptable |
Approving routine access requests | Better handled with IAM automation or workflows |
Not All Operational Work is Toil
Work Type | Toil? | Why |
---|---|---|
Designing alerting strategy | No | Strategic, not repetitive |
Writing Terraform modules | No | Automates infrastructure |
Debugging complex outages | No | Requires deep thinking, high-value |
Reviewing architectural changes | No | Adds long-term value |
Why Toil is Bad
- Burns out engineers
- Distracts from innovation
- Delays projects
- Doesn't scale
- Leads to boredom and attrition
Google recommends that SREs spend less than 50% of their time on Toil, and ideally much less.
How to Identify Toil in Your Environment
Signal | Interpretation |
---|---|
You're doing it more than twice a month | Candidate for automation |
Your incident postmortems are repetitive | Root cause not fixed |
You can document it easily | You can probably automate it |
Onboarding involves lots of manual steps | It’s Toil |
SOPs require step-by-step human execution | Replace with scripts/pipelines |
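The first signal in the table can be checked mechanically instead of by memory. The following is a minimal sketch, assuming you can export a log of (task name, date) pairs from your ticketing or paging tool. The `frequent_tasks` function, the log format, and the sample entries are all hypothetical.

```python
from collections import Counter
from datetime import date


def frequent_tasks(events, threshold=2):
    """Return tasks performed more than `threshold` times in any calendar month.

    `events` is an iterable of (task_name, date) pairs (hypothetical format).
    """
    # Count occurrences per (task, year, month) bucket
    per_month = Counter((task, day.year, day.month) for task, day in events)
    return sorted({task for (task, _, _), n in per_month.items() if n > threshold})


# Hypothetical sample log
log = [
    ("restart pods", date(2024, 3, 1)),
    ("restart pods", date(2024, 3, 9)),
    ("restart pods", date(2024, 3, 20)),
    ("rotate certs", date(2024, 3, 5)),
]
print(frequent_tasks(log))
```

With this sample, only "restart pods" crosses the twice-a-month line and becomes an automation candidate.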
How to Reduce or Eliminate Toil: A Practical Tutorial
1. Catalog Your Repetitive Tasks
Create a list of all recurring activities:
- Weekly report generation
- Alert acknowledgements
- Manual deployment steps
- Restarting services
2. Score Each Task on a "Toil Scale"
Create a scoring matrix, awarding one point for each of Manual, Repetitive, and Automatable:
Task | Manual | Repetitive | Automatable | Urgency | Toil Score |
---|---|---|---|---|---|
Restarting pods | Yes | Yes | Yes | High | 3 |
Debugging memory leak | No | No | No | High | 0 |
Weekly CPU report | Yes | Yes | Yes | Low | 3 |
Focus first on the tasks with the highest Toil Score.
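The matrix can also be kept as data rather than in a spreadsheet. This is a sketch only, assuming the Toil Score is one point for each of Manual, Repetitive, and Automatable (which matches the 3 and 0 in the table) and that Urgency is recorded but not scored. The `Task` class is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    urgency: str  # recorded for context; not part of the score

    @property
    def toil_score(self) -> int:
        # One point per toil characteristic that applies
        return int(self.manual) + int(self.repetitive) + int(self.automatable)


tasks = [
    Task("Restarting pods", True, True, True, "High"),
    Task("Debugging memory leak", False, False, False, "High"),
    Task("Weekly CPU report", True, True, True, "Low"),
]

# Highest score first; sorted() is stable, so ties keep their input order
for task in sorted(tasks, key=lambda t: t.toil_score, reverse=True):
    print(f"{task.name}: {task.toil_score}")
```

If urgency should break ties, it could be added as a secondary sort key; the table does not say how the two interact, so that choice is left open here.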
3. Automate Using Tools
Tool | Use Case |
---|---|
Bash/Python | Small scripts to start |
Ansible | Automate infra/config |
Terraform | Infra provisioning |
Jenkins/GitHub Actions | CI/CD workflows |
Prometheus + Alertmanager | Alerting automation |
PagerDuty/Opsgenie | Auto-remediation hooks |
Runbooks + Automation | Click-to-run fixes |
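4. Introduce Auto-Remediation
Example: Auto-restart unhealthy containers in Kubernetes
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resilient-app
spec:
  containers:
    - name: app
      image: my-app
      livenessProbe:
        # Kubernetes polls this HTTP endpoint to decide whether the container is alive
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3  # wait 3s after start before the first check
        periodSeconds: 5        # then check every 5s
```
With a liveness probe, the kubelet restarts the container whenever the health check fails, which removes the need for manual restarts.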
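The probe only works if the application actually serves the `/healthz` path on port 8080. The source does not show the application, so the following is a hypothetical sketch of such an endpoint using only the Python standard library. The `health_response`, `HealthHandler`, and `serve` names are assumptions, and a real service would replace the `healthy` flag with its own checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path, healthy=True):
    """Map a request path to (status_code, body) for the probe."""
    if path != "/healthz":
        return 404, b"not found"
    # Kubernetes treats any status from 200 to 399 as a passing probe
    return (200, b"ok") if healthy else (503, b"unhealthy")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Port 8080 matches the probe configuration in the example
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the listener. When the endpoint returns 503 or stops answering, the probe fails and the container is restarted.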
5. Alert Intelligently
Bad:
- "Disk usage > 80%"
Good:
- "Disk usage projected to reach 100% in 3 hours"
Even better:
- Automatically clean up temp files or notify owner before alerting.
Use suppressions, alert deduplication, and runbook links to avoid Toil from noisy alerts.
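A projection-based alert like the "Good" example can be approximated with a linear extrapolation from two usage samples. This is a simplified sketch, not a production rule: the function names, the two-sample approach, and the 100 GB figures are assumptions for illustration.

```python
def hours_until_full(used_then, used_now, hours_between, capacity):
    """Linearly extrapolate when the disk fills, from two usage samples.

    Returns None if usage is flat or shrinking (no projected fill time).
    """
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity - used_now) / growth_per_hour


def should_alert(used_then, used_now, hours_between, capacity, horizon_hours=3):
    """Alert only when the projected fill time falls inside the horizon."""
    eta = hours_until_full(used_then, used_now, hours_between, capacity)
    return eta is not None and eta <= horizon_hours


# Hypothetical 100 GB disk: 80 GB used an hour ago, 85 GB now
print(hours_until_full(80, 85, 1, 100))  # 3.0
print(should_alert(80, 85, 1, 100))      # True
```

A disk sitting at 90% but growing slowly would not page anyone under this rule, which is the point of the "Good" example. Prometheus offers the same idea natively through its `predict_linear` function.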
6. Measure Toil Reduction
Track metrics like:
Metric | Example Tool |
---|---|
Time spent per incident | Jira, PagerDuty logs |
Number of repeated manual tasks | Custom spreadsheet |
Alert volume per on-call | Prometheus |
Automation coverage | CI/CD stats |
Create a Toil Dashboard to monitor improvements.
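The numbers behind such a dashboard are simple ratios. A minimal sketch, assuming you track toil hours and total working hours per engineer per period; the function names and the sample figures are hypothetical, and the 0.5 ceiling is the 50% guideline mentioned earlier.

```python
def toil_fraction(toil_hours, total_hours):
    """Share of working time spent on toil, between 0 and 1."""
    return toil_hours / total_hours


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


def within_budget(toil_hours, total_hours, ceiling=0.5):
    """True if toil stays below the ceiling (50% by default)."""
    return toil_fraction(toil_hours, total_hours) < ceiling


# Hypothetical 40-hour week: 24 toil hours before automation, 12 after
print(within_budget(24, 40))      # False
print(within_budget(12, 40))      # True
print(percent_reduction(24, 12))  # 50.0
```

Plotting `toil_fraction` per sprint or per on-call rotation gives the trend line the dashboard needs.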
Real-World Example: Toil Reduction Project
Problem:
Every on-call SRE had to:
- Manually restart a failing job
- Archive logs to S3
- Email stakeholders
Solution:
- Created a Jenkins pipeline triggered via API
- Used `kubectl` to auto-restart the job
- Uploaded logs via `aws s3 cp`
- Sent Slack messages using webhooks
Result: 80% reduction in manual toil during on-call.
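The source does not show the pipeline itself, so the following is only a hypothetical sketch of a script that such a pipeline step might call. It assumes the job is re-run by force-replacing it from its manifest with `kubectl replace --force -f`, and that the Slack webhook accepts a JSON body with a `text` field. All function names, paths, and the message text are placeholders.

```python
import json
import subprocess
import urllib.request


def restart_job_cmd(manifest_path):
    # Assumption: the job is re-run by deleting and recreating it from its manifest
    return ["kubectl", "replace", "--force", "-f", manifest_path]


def archive_logs_cmd(log_path, bucket, key):
    # Copies one local log file to the given bucket and key
    return ["aws", "s3", "cp", log_path, f"s3://{bucket}/{key}"]


def slack_request(webhook_url, text):
    # Builds (but does not send) the POST request for the webhook
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remediate(manifest_path, log_path, bucket, key, webhook_url):
    """Run the three steps in order; check=True stops at the first failure."""
    subprocess.run(restart_job_cmd(manifest_path), check=True)
    subprocess.run(archive_logs_cmd(log_path, bucket, key), check=True)
    request = slack_request(webhook_url, "Job restarted and logs archived")
    with urllib.request.urlopen(request, timeout=10):
        pass
```

Keeping command construction separate from execution makes each step easy to test without touching the cluster, the bucket, or Slack.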
Summary: SRE Toil Elimination Mindset
Step | Description |
---|---|
Identify | Use logs and retros |
Quantify | Track frequency and time |
Prioritize | Automate high-toil tasks first |
Automate | Use the right tool for the task |
Monitor | Show visible reduction |
Final Thoughts
Toil is the silent killer of productivity and innovation in SRE. Reducing toil:
- Boosts engineer happiness
- Improves system reliability
- Frees time for strategic projects
“If you do something more than twice, automate it.”