What is Toil?

Posted on May 18, 2025 | by Rajesh Kumar

🚨 What is Toil in SRE?

Toil, as Google's SRE teams use the term, is a specific class of operational work: manual, repetitive, automatable, and growing linearly with the service. As the service grows, the work grows with it, and the only way to keep up without automation is to add people.

💡 Toil is the work you do that doesn’t scale and adds no enduring value.

📘 Official Definition (from Google SRE Book)

“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Reference: Google SRE Book – Eliminating Toil

🧩 Key Characteristics of Toil

Characteristic	Description
Manual	Performed by humans instead of automation
Repetitive	Occurs often and follows a predictable pattern
Automatable	A script or tool could handle it
Tactical	Immediate fix, not a long-term solution
Reactive	Triggered by alerts/incidents
Non-durable	Leaves no lasting improvement to the system
Scales with the service	Work grows linearly with service size, traffic, or user count

✅ Examples of Toil in SRE

Example	Why it’s Toil
Manually restarting failed pods	Repetitive, automatable
Fixing disk full alerts every week	Predictable, non-durable
Reconfiguring firewall rules by hand	Can be automated via Terraform or scripts
Responding to noisy alerts	Reactive, not value-adding
Generating weekly reports manually	Easily scriptable
Approving routine access requests	Better handled with IAM automation or workflows

🚫 Not All Operational Work is Toil

Work Type	Toil?	Why
Designing alerting strategy	❌	Strategic, not repetitive
Writing Terraform modules	❌	Automates infrastructure
Debugging complex outages	❌	Requires deep thinking, high-value
Reviewing architectural changes	❌	Adds long-term value

🧠 Why Toil is Bad

- Burns out engineers
- Distracts from innovation
- Delays projects
- Doesn’t scale
- Leads to boredom and attrition

Google's SRE organization aims to keep toil below 50% of each SRE's time, and ideally much lower, so that at least half of the time goes to engineering project work.

🔧 How to Identify Toil in Your Environment

Signal	Interpretation
You’re doing it more than twice a month	Candidate for automation
Your incident postmortems are repetitive	Root cause not fixed
You can document it easily	You can probably automate it
Onboarding involves lots of manual steps	It’s Toil
SOPs require step-by-step human execution	Replace with scripts/pipelines
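The first signal in the table can be checked mechanically. Here is a minimal sketch, assuming you keep a simple log of manual tasks; the `task_log` entries and the `automation_candidates` helper are made up for illustration and are not from any particular tool.

```python
from collections import Counter
from datetime import date

# Hypothetical log of manual work: (day performed, task name).
task_log = [
    (date(2025, 5, 2), "restart pods"),
    (date(2025, 5, 9), "restart pods"),
    (date(2025, 5, 16), "restart pods"),
    (date(2025, 5, 12), "weekly CPU report"),
    (date(2025, 5, 19), "weekly CPU report"),
    (date(2025, 5, 20), "debug memory leak"),
]


def automation_candidates(log, year, month, threshold=2):
    """Return tasks performed more than `threshold` times in the given month."""
    counts = Counter(
        name for day, name in log if day.year == year and day.month == month
    )
    return sorted(name for name, n in counts.items() if n > threshold)


candidates = automation_candidates(task_log, 2025, 5)
print(candidates)
```

The threshold of two per month mirrors the table above; adjust it to whatever your team treats as "too often".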

🛠️ How to Reduce or Eliminate Toil: A Practical Tutorial

🔹 1. Catalog Your Repetitive Tasks

Create a list of all recurring activities:

- Weekly report generation
- Alert acknowledgements
- Manual deployment steps
- Restarting services

🔹 2. Score Each Task on “Toil Scale”

Create a scoring matrix:

Task	Manual	Repetitive	Automatable	Urgency	Toil Score
Restarting pods	✅	✅	✅	High	3
Debugging memory leak	❌	❌	❌	High	0
Weekly CPU report	✅	✅	✅	Low	3

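Focus first on the tasks with the highest Toil Score. The score counts the ✅ marks (0 to 3); Urgency is recorded alongside it but is not part of the score.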
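The matrix above is easy to keep in code instead of a spreadsheet. This is a rough sketch, not a standard formula: the `Task` class is hypothetical, the score is simply the count of ✅ columns, and using Urgency to break ties is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    urgency: str  # recorded in the matrix, not counted in the score

    @property
    def toil_score(self) -> int:
        # One point per checked column, so the score ranges from 0 to 3.
        return int(self.manual) + int(self.repetitive) + int(self.automatable)


tasks = [
    Task("Restarting pods", True, True, True, "High"),
    Task("Debugging memory leak", False, False, False, "High"),
    Task("Weekly CPU report", True, True, True, "Low"),
]

# Assumption: among equal scores, handle high-urgency tasks first.
URGENCY_RANK = {"High": 0, "Low": 1}

ranked = sorted(tasks, key=lambda t: (-t.toil_score, URGENCY_RANK[t.urgency]))
for task in ranked:
    print(f"{task.name}: {task.toil_score}")
```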

🔹 3. Automate Using Tools

Tool	Use Case
Bash/Python	Small scripts to start
Ansible	Automate infra/config
Terraform	Infra provisioning
Jenkins/GitHub Actions	CI/CD workflows
Prometheus + Alertmanager	Alerting automation
PagerDuty/Opsgenie	Auto-remediation hooks
Runbooks + Automation	Click-to-run fixes

🔹 4. Introduce Auto-Remediation

Example: Automatically restart unhealthy containers in Kubernetes

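apiVersion: v1
kind: Pod
metadata:
  name: resilient-app
spec:
  # restartPolicy defaults to Always, so a container that fails its probe is restarted
  containers:
  - name: app
    image: my-app
    livenessProbe:
      # The kubelet sends GET /healthz to port 8080 inside the container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3   # wait 3s after the container starts before the first probe
      periodSeconds: 5         # then probe every 5s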

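With a liveness probe in place, the kubelet restarts the container automatically when the health check keeps failing, so nobody has to restart it by hand. The restart only treats the symptom: a container that restarts often still needs a root-cause fix.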
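To make the restart decision concrete, here is a simplified model of what a liveness probe does with a stream of health-check results. It is a sketch, not the kubelet's actual code: `count_restarts` is a hypothetical helper, and it ignores `initialDelaySeconds`, timeouts, and restart back-off. The default of three consecutive failures matches Kubernetes' default `failureThreshold`.

```python
def count_restarts(probe_results, failure_threshold=3):
    """Count restarts for a sequence of probe outcomes (True means healthy).

    A restart happens after `failure_threshold` consecutive failures;
    any success in between resets the failure counter.
    """
    restarts = 0
    consecutive_failures = 0
    for healthy in probe_results:
        if healthy:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the fresh container starts with a clean count
    return restarts


# Two failures followed by a success do not trigger a restart; three in a row do.
flaky = [True, False, False, True, False, False, False, True]
print(count_restarts(flaky))
```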

🔹 5. Alert Intelligently

Bad:

“Disk usage > 80%”

Good:

“Disk usage projected to reach 100% in 3 hours”

Even better:

Automatically clean up temp files, or notify the owner, before paging anyone.

Use alert suppression, deduplication, and runbook links to cut the toil that noisy alerts create.
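The "projected to reach 100%" style of alert boils down to fitting a line through recent usage samples and extrapolating. Below is a minimal sketch of that idea in plain Python; the sample data, `hours_until_full`, and `should_page` are illustrative names I made up. In Prometheus you would normally express the same thing with its `predict_linear` function rather than hand-rolled code.

```python
def hours_until_full(samples, capacity_pct=100.0):
    """Estimate hours until the disk is full, counted from the last sample.

    `samples` is a list of (hour, used_pct) pairs. Uses a least-squares line.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity_pct - intercept) / slope
    return full_at - samples[-1][0]


def should_page(samples, horizon_hours=3.0):
    """Page only if the disk is projected to fill within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours


growing = [(0, 70.0), (1, 75.0), (2, 80.0), (3, 85.0)]  # +5% per hour
steady = [(0, 85.0), (1, 85.0), (2, 85.0)]              # above 80% but flat

print(should_page(growing), should_page(steady))
```

Note that the steady disk sits above 80% and would fire the "bad" alert forever, while the projection-based check stays quiet because nothing is about to break.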

🔹 6. Measure Toil Reduction

Track metrics like:

Metric	Example Tool
Time spent per incident	Jira, PagerDuty logs
Number of repeated manual tasks	Custom spreadsheet
Alert volume per on-call	Prometheus
Automation coverage	CI/CD stats

Create a Toil Dashboard to monitor improvements.
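One number worth putting on that dashboard is the share of each engineer's time that goes to toil, compared against the 50% ceiling mentioned earlier. A small sketch, assuming you can export time entries tagged by category; the `entries` data, the category labels, and `toil_share` are hypothetical.

```python
from collections import defaultdict

# Hypothetical time log for one week: (engineer, category, hours).
entries = [
    ("asha", "toil", 14.0),
    ("asha", "engineering", 26.0),
    ("ben", "toil", 24.0),
    ("ben", "engineering", 16.0),
]


def toil_share(entries):
    """Return {engineer: fraction of logged hours spent on toil}."""
    totals = defaultdict(lambda: {"toil": 0.0, "all": 0.0})
    for engineer, category, hours in entries:
        totals[engineer]["all"] += hours
        if category == "toil":
            totals[engineer]["toil"] += hours
    return {e: t["toil"] / t["all"] for e, t in totals.items() if t["all"] > 0}


shares = toil_share(entries)
# Anyone at or above 50% is over the toil budget.
over_budget = sorted(e for e, share in shares.items() if share >= 0.5)

for engineer, share in sorted(shares.items()):
    print(f"{engineer}: {share:.0%} toil")
print("Over budget:", over_budget)
```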

💼 Real-World Example: Toil Reduction Project

Problem:

Every on-call SRE had to:

- Manually restart a failing job
- Archive logs to S3
- Email stakeholders

Solution:

- Created a Jenkins pipeline triggered via API
- Used kubectl to auto-restart the job
- Uploaded logs via aws s3 cp
- Sent Slack messages using Webhooks

Result: 80% reduction in manual toil during on-call.
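The steps such a pipeline runs could look roughly like the sketch below. This is not the actual pipeline from the project: every name, path, bucket, and URL is a placeholder, and I am assuming the job is a Kubernetes Job created from a CronJob, so "restart" means deleting the failed Job and creating a new one. It defaults to a dry run that only prints the commands.

```python
import json
import subprocess
import urllib.request


def build_steps(job, cronjob, namespace, log_path, bucket):
    """Return the CLI commands to run, as argv lists."""
    return [
        # Assumption: remove the failed Job, then recreate it from its CronJob.
        ["kubectl", "delete", "job", job, "-n", namespace, "--ignore-not-found"],
        ["kubectl", "create", "job", job, f"--from=cronjob/{cronjob}", "-n", namespace],
        # Archive the log file to S3.
        ["aws", "s3", "cp", log_path, f"s3://{bucket}/{job}/"],
    ]


def slack_request(webhook_url, text):
    """Build a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remediate(job, cronjob, namespace, log_path, bucket, webhook_url, dry_run=True):
    """Run the restart and archive steps, then notify Slack."""
    steps = build_steps(job, cronjob, namespace, log_path, bucket)
    for argv in steps:
        if dry_run:
            print("DRY RUN:", " ".join(argv))
        else:
            subprocess.run(argv, check=True)  # stop on the first failing step
    request = slack_request(webhook_url, f"Restarted {job}; logs archived to s3://{bucket}/{job}/")
    if not dry_run:
        urllib.request.urlopen(request, timeout=10)
    return steps


# Placeholder values for illustration only.
steps = remediate(
    job="nightly-export-manual",
    cronjob="nightly-export",
    namespace="batch",
    log_path="/var/log/nightly-export.log",
    bucket="example-log-archive",
    webhook_url="https://hooks.slack.com/services/PLACEHOLDER",
)
```

Keeping command construction separate from execution makes the script easy to review and test before it is wired to a Jenkins job or an alert hook.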

📊 Summary: SRE Toil Elimination Mindset

Step	Description
Identify	Review logs and retrospectives for recurring manual work
Quantify	Track frequency and time
Prioritize	Automate high-toil tasks first
Automate	Use the right tool for the task
Monitor	Track toil metrics and show the reduction over time

🔚 Final Thoughts

Toil is the silent killer of productivity and innovation in SRE. Reducing toil:

- Boosts engineer happiness
- Improves system reliability
- Frees time for strategic projects

“If you do something more than twice, automate it.”