{"id":769,"date":"2025-08-29T07:51:04","date_gmt":"2025-08-29T07:51:04","guid":{"rendered":"https:\/\/sreschool.com\/blog\/?p=769"},"modified":"2025-08-30T08:47:50","modified_gmt":"2025-08-30T08:47:50","slug":"chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering","status":"publish","type":"post","link":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/","title":{"rendered":"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction &amp; Overview<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is Chaos Monkey?<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"436\" src=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg\" alt=\"\" class=\"wp-image-976\" style=\"width:840px;height:auto\" srcset=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg 800w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed-300x164.jpg 300w, https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed-768x419.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>Chaos Monkey is an open-source tool developed by Netflix to test the resilience of IT infrastructure by randomly terminating instances in a production environment. It is a cornerstone of chaos engineering, a discipline that involves intentionally injecting failures to uncover system weaknesses and improve reliability. 
By simulating real-world disruptions, Chaos Monkey helps ensure systems can withstand unexpected failures without impacting end-users.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">History or Background<\/h3>\n\n\n\n<p>Chaos Monkey was born in 2011 as Netflix transitioned from on-premises data centers to Amazon Web Services (AWS). The move to the cloud introduced new challenges, such as unpredictable instance failures, prompting Netflix to create a tool that would proactively test system resilience. Released as open-source in 2012, Chaos Monkey became the foundation of Netflix\u2019s Simian Army, a suite of tools designed to enhance system reliability. It has since inspired the broader adoption of chaos engineering across industries, with companies like Amazon and Google implementing similar practices.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why is it Relevant in Site Reliability Engineering?<\/h3>\n\n\n\n<p>Site Reliability Engineering (SRE) focuses on ensuring systems are reliable, scalable, and efficient. Chaos Monkey aligns with SRE principles by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Proactively identifying weaknesses<\/strong>: It exposes vulnerabilities before they cause outages.<\/li>\n\n\n\n<li><strong>Promoting automation<\/strong>: Encourages automated recovery mechanisms to handle failures.<\/li>\n\n\n\n<li><strong>Enhancing resilience<\/strong>: Ensures systems can maintain functionality despite disruptions.<\/li>\n\n\n\n<li><strong>Reducing downtime<\/strong>: Helps SRE teams meet Service Level Objectives (SLOs) by validating failover mechanisms.<\/li>\n<\/ul>\n\n\n\n<p>Chaos Monkey is particularly valuable in distributed systems, such as microservices architectures, where dependencies and failure points are complex. 
It fosters a culture of resilience, a key tenet of SRE.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Core Concepts &amp; Terminology<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Key Terms and Definitions<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Term<\/strong><\/th><th><strong>Definition<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Chaos Engineering<\/td><td>The practice of intentionally injecting failures to test system resilience.<\/td><\/tr><tr><td>Chaos Monkey<\/td><td>A tool that randomly terminates instances or services in production.<\/td><\/tr><tr><td>Simian Army<\/td><td>A collection of Netflix tools for testing system reliability, including Chaos Monkey.<\/td><\/tr><tr><td>Steady State Hypothesis<\/td><td>A measurable state of normal system operation used to evaluate chaos experiments.<\/td><\/tr><tr><td>Blast Radius<\/td><td>The scope of impact from a chaos experiment, ideally limited to minimize harm.<\/td><\/tr><tr><td>Fault Injection<\/td><td>Introducing controlled failures (e.g., instance termination, latency) to test systems.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">How It Fits into the Site Reliability Engineering Lifecycle<\/h3>\n\n\n\n<p>Chaos Monkey integrates into the SRE lifecycle at several stages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Design &amp; Development<\/strong>: Encourages engineers to build redundancy and fault tolerance into systems.<\/li>\n\n\n\n<li><strong>Testing<\/strong>: Validates system behavior under failure conditions, complementing traditional testing.<\/li>\n\n\n\n<li><strong>Monitoring &amp; Observability<\/strong>: Requires robust monitoring to track system behavior during experiments.<\/li>\n\n\n\n<li><strong>Incident Response<\/strong>: Improves response strategies by exposing gaps in recovery mechanisms.<\/li>\n\n\n\n<li><strong>Postmortems<\/strong>: Provides data for analyzing failures and refining system 
architecture.<\/li>\n<\/ul>\n\n\n\n<p>By embedding Chaos Monkey into SRE practices, teams can proactively address vulnerabilities, aligning with the SRE goal of maintaining high availability.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Architecture &amp; How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Components &amp; Internal Workflow<\/h3>\n\n\n\n<p>Chaos Monkey operates as a lightweight service that interacts with cloud infrastructure to terminate instances randomly. Its key components include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Configuration Layer<\/strong>: Defines rules for instance termination, such as target groups, schedules, and frequency.<\/li>\n\n\n\n<li><strong>Scheduler<\/strong>: Determines when and which instances to terminate based on a configurable mean time between failures.<\/li>\n\n\n\n<li><strong>Cloud Integration<\/strong>: Interfaces with cloud providers (e.g., AWS EC2) via APIs to terminate instances.<\/li>\n\n\n\n<li><strong>Monitoring Hooks<\/strong>: Integrates with monitoring tools to ensure terminations occur only when the system is stable.<\/li>\n<\/ul>\n\n\n\n<p><strong>Workflow<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Chaos Monkey queries the cloud provider for a list of running instances in a target group (e.g., an Auto Scaling Group in AWS).<\/li>\n\n\n\n<li>It applies filters based on configuration (e.g., excluding critical instances or checking for ongoing outages).<\/li>\n\n\n\n<li>A random instance is selected and terminated using the cloud provider\u2019s API.<\/li>\n\n\n\n<li>The system\u2019s response is monitored to ensure recovery mechanisms (e.g., auto-scaling, load balancing) kick in.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture Diagram<\/h3>\n\n\n\n<p>Below is a textual representation of Chaos Monkey\u2019s architecture:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#091;Chaos Monkey Service]\n       |\n       | 
(Configuration: Target groups, schedules, mean time between failures)\n       v\n&#091;Scheduler] ----&gt; &#091;Cloud API (e.g., AWS EC2)]\n       |              |\n       |              v\n       |        &#091;Instance Termination]\n       v\n&#091;Monitoring System] &lt;---- &#091;Observability: Metrics, Logs]\n       |\n       v\n&#091;Auto-Scaling\/Load Balancer] ----&gt; &#091;System Recovery]\n<\/code><\/pre>\n\n\n\n<p><strong>Description<\/strong>: The Chaos Monkey service runs on a host or container, configured to target specific instance groups. The scheduler triggers termination events, which are executed via cloud APIs. Monitoring systems track the impact, and recovery mechanisms (e.g., auto-scaling) restore system stability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Integration Points with CI\/CD or Cloud Tools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CI\/CD Pipelines<\/strong>: Chaos Monkey can be integrated into CI\/CD workflows to run experiments post-deployment, ensuring new code is resilient. 
Tools like Jenkins or GitLab can trigger Chaos Monkey via scripts.<\/li>\n\n\n\n<li><strong>Cloud Platforms<\/strong>: Chaos Monkey 2.0 terminates instances through Spinnaker, so it works with the cloud backends Spinnaker supports (such as AWS, Azure, and GCP), leveraging auto-scaling groups or equivalent constructs.<\/li>\n\n\n\n<li><strong>Monitoring Tools<\/strong>: Pair it with Prometheus, Datadog, or AWS CloudWatch to monitor system health during experiments.<\/li>\n\n\n\n<li><strong>Spinnaker<\/strong>: Chaos Monkey 2.0 relies on Spinnaker for deployment orchestration, enabling cross-cloud compatibility.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Installation &amp; Getting Started<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Basic Setup or Prerequisites<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloud Environment<\/strong>: An AWS, Azure, or GCP account with running instances.<\/li>\n\n\n\n<li><strong>Spinnaker<\/strong>: Required for Chaos Monkey 2.0 (not needed for the original version).<\/li>\n\n\n\n<li><strong>MySQL 5.X<\/strong>: For storing configuration and state.<\/li>\n\n\n\n<li><strong>Go<\/strong>: Chaos Monkey is written in Go, so the Go toolchain is needed to build it and to write custom extensions.<\/li>\n\n\n\n<li><strong>Permissions<\/strong>: API credentials with permissions to terminate instances.<\/li>\n\n\n\n<li><strong>Monitoring<\/strong>: A monitoring system to track system health (e.g., Prometheus, CloudWatch).<\/li>\n\n\n\n<li><strong>OS<\/strong>: A Linux-based host or container for running Chaos Monkey.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hands-On: Step-by-Step Beginner-Friendly Setup Guide<\/h3>\n\n\n\n<p>This guide assumes an AWS environment with an Auto Scaling Group (ASG).<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Install Dependencies<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Install Go: <code>sudo apt-get install golang<\/code> (Ubuntu) or equivalent.<\/li>\n\n\n\n<li>Set up MySQL: <code>sudo apt-get install mysql-server<\/code> and create a database for Chaos Monkey.<\/li>\n\n\n\n<li>Install 
Spinnaker (required for Chaos Monkey 2.0): Follow Spinnaker\u2019s official setup guide.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Clone Chaos Monkey Repository<\/strong>:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/Netflix\/chaosmonkey.git\ncd chaosmonkey<\/code><\/pre>\n\n\n\n<p>3. <strong>Configure Chaos Monkey<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create a configuration file (<code>chaosmonkey.toml<\/code>):<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>&#091;chaosmonkey]\nenabled = true\nschedule_enabled = true\nmean_time_between_terminations_in_days = 1\nmin_time_between_terminations_in_days = 0\n&#091;database]\nhost = \"localhost\"\nname = \"chaosmonkey\"\nuser = \"root\"\npassword = \"your_password\"<\/code><\/pre>\n\n\n\n<p>Specify target ASGs and termination schedules.<\/p>\n\n\n\n<p>4. <strong>Build and Run Chaos Monkey<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>go build\n.\/chaosmonkey -config chaosmonkey.toml<\/code><\/pre>\n\n\n\n<p>5. <strong>Integrate with Monitoring<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Configure Prometheus or CloudWatch to monitor instance terminations and system recovery.<\/li>\n\n\n\n<li>Example Prometheus metric query:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>rate(chaosmonkey_terminations_total&#091;5m])<\/code><\/pre>\n\n\n\n<p>6. <strong>Test the Setup<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Verify Chaos Monkey terminates instances in the target ASG during the scheduled window.<\/li>\n\n\n\n<li>Check logs for termination events: <code>tail -f chaosmonkey.log<\/code>.<\/li>\n<\/ul>\n\n\n\n<p>7. 
<strong>Automate with CI\/CD<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Add a pipeline stage in Jenkins to trigger Chaos Monkey post-deployment:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>.\/chaosmonkey -config chaosmonkey.toml --dry-run<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Real-World Use Cases<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 1: Testing Microservices Resilience<\/h3>\n\n\n\n<p>An e-commerce platform uses Chaos Monkey to test a microservices-based checkout system. By randomly terminating instances of the payment service, the team validates that the system reroutes traffic to healthy instances, ensuring uninterrupted checkouts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 2: Validating Auto-Scaling<\/h3>\n\n\n\n<p>A streaming service runs Chaos Monkey on its content delivery nodes. When instances are terminated, the auto-scaling group spins up new instances, and the load balancer redistributes traffic, confirming the system\u2019s ability to handle sudden failures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 3: Disaster Recovery Testing<\/h3>\n\n\n\n<p>A financial institution uses Chaos Monkey to simulate server failures in its transaction processing system. The experiment reveals that failover to a secondary region takes longer than expected, prompting improvements in replication latency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 4: Industry-Specific Example (Healthcare)<\/h3>\n\n\n\n<p>A healthcare provider uses Chaos Monkey to test a patient records system. 
By terminating database instances, the team ensures that read replicas handle queries without downtime, critical for maintaining access to patient data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Benefits &amp; Limitations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Key Advantages<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Proactive Failure Detection<\/strong>: Identifies weaknesses before they cause outages.<\/li>\n\n\n\n<li><strong>Improved Resilience<\/strong>: Encourages redundancy and automated recovery.<\/li>\n\n\n\n<li><strong>Cultural Shift<\/strong>: Fosters a mindset of embracing failure as a learning opportunity.<\/li>\n\n\n\n<li><strong>Open-Source<\/strong>: Freely available and customizable for various environments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common Challenges or Limitations<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Challenge<\/strong><\/th><th><strong>Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Limited Scope<\/td><td>Only terminates instances, not simulating other failures like network latency.<\/td><\/tr><tr><td>Spinnaker Dependency<\/td><td>Chaos Monkey 2.0 requires Spinnaker, adding complexity.<\/td><\/tr><tr><td>Risk of Disruption<\/td><td>Random terminations can cause outages if systems lack redundancy.<\/td><\/tr><tr><td>Custom Code Requirement<\/td><td>Advanced configurations require writing Go code.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Recommendations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Security Tips<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Limit Blast Radius<\/strong>: Target non-critical instances initially to minimize impact.<\/li>\n\n\n\n<li><strong>Role-Based Access<\/strong>: Restrict Chaos Monkey\u2019s API permissions to specific resources.<\/li>\n\n\n\n<li><strong>Audit Logs<\/strong>: Enable logging to track termination events for 
compliance.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance &amp; Maintenance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Schedule Wisely<\/strong>: Run experiments during low-traffic periods to reduce user impact.<\/li>\n\n\n\n<li><strong>Monitor Closely<\/strong>: Use tools like Prometheus to track system health in real-time.<\/li>\n\n\n\n<li><strong>Automate Rollbacks<\/strong>: Implement abort conditions to stop experiments if metrics degrade.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance Alignment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Align with standards like SOC 2 by documenting experiment outcomes and ensuring no patient or customer data is compromised.<\/li>\n\n\n\n<li>Use Chaos Monkey in pre-production environments for regulated industries to avoid compliance risks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Automation Ideas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Integrate with CI\/CD pipelines to run Chaos Monkey post-deployment.<\/li>\n\n\n\n<li>Use Infrastructure-as-Code (e.g., Terraform) to define Chaos Monkey configurations.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Comparison with Alternatives<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Tool<\/strong><\/th><th><strong>Features<\/strong><\/th><th><strong>Pros<\/strong><\/th><th><strong>Cons<\/strong><\/th><th><strong>Best Use Case<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Chaos Monkey<\/td><td>Random instance termination<\/td><td>Simple, open-source, cloud-native<\/td><td>Limited to instance termination<\/td><td>Basic resilience testing<\/td><\/tr><tr><td>Gremlin<\/td><td>Multiple failure types (latency, CPU, etc.)<\/td><td>Comprehensive, user-friendly UI<\/td><td>Paid service, complex setup<\/td><td>Advanced chaos experiments<\/td><\/tr><tr><td>LitmusChaos<\/td><td>Kubernetes-native, extensive fault library<\/td><td>Open-source, CI\/CD 
integration<\/td><td>Steep learning curve<\/td><td>Kubernetes environments<\/td><\/tr><tr><td>Chaos Mesh<\/td><td>Kubernetes chaos orchestration<\/td><td>Advanced workflows, open-source<\/td><td>Kubernetes-specific<\/td><td>Cloud-native Kubernetes testing<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">When to Choose Chaos Monkey<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Use Chaos Monkey<\/strong>: For simple, instance-level failure testing in cloud environments, especially AWS, or when starting with chaos engineering.<\/li>\n\n\n\n<li><strong>Choose Alternatives<\/strong>: For Kubernetes-specific testing (LitmusChaos, Chaos Mesh) or broader failure scenarios (Gremlin).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Chaos Monkey is a foundational tool in chaos engineering, enabling SRE teams to build resilient systems by simulating random instance failures. Its simplicity and open-source nature make it accessible, though its scope is limited compared to modern alternatives. 
As distributed systems grow in complexity, Chaos Monkey remains a valuable starting point for fostering a culture of reliability.<\/p>\n\n\n\n<p><strong>Future Trends<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Integration with AI-driven observability for smarter experiment design.<\/li>\n\n\n\n<li>Expansion to serverless and edge computing environments.<\/li>\n\n\n\n<li>Greater emphasis on automated, continuous chaos testing.<\/li>\n<\/ul>\n\n\n\n<p><strong>Next Steps<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with small experiments in a staging environment.<\/li>\n\n\n\n<li>Explore the Simian Army for additional chaos tools.<\/li>\n\n\n\n<li>Join chaos engineering communities (e.g., Chaos Community Slack).<\/li>\n<\/ul>\n\n\n\n<p><strong>Resources<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Official Documentation: <a href=\"https:\/\/netflix.github.io\/chaosmonkey\/\">Chaos Monkey GitHub<\/a><\/li>\n\n\n\n<li>Community: Gremlin Chaos Engineering Community<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction &amp; Overview What is Chaos Monkey? 
Chaos Monkey is an open-source tool developed by Netflix to test the resilience [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-769","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School\" \/>\n<meta property=\"og:description\" content=\"Introduction &amp; Overview What is Chaos Monkey? 
Chaos Monkey is an open-source tool developed by Netflix to test the resilience [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/\" \/>\n<meta property=\"og:site_name\" content=\"SRE School\" \/>\n<meta property=\"article:published_time\" content=\"2025-08-29T07:51:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-08-30T08:47:50+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"800\" \/>\n\t<meta property=\"og:image:height\" content=\"436\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"priteshgeek\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"priteshgeek\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/\",\"url\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/\",\"name\":\"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School\",\"isPartOf\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg\",\"datePublished\":\"2025-08-29T07:51:04+00:00\",\"dateModified\":\"2025-08-30T08:47:50+00:00\",\"author\":{\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/6a53e3870889dd6a65b2e04b7bc3d7db\"},\"breadcrumb\":{\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage\",\"url\":\"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg\",\"contentUrl\":\"https:\/\/sreschool.com\/blog\/wp-c
ontent\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg\",\"width\":800,\"height\":436},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/sreschool.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/sreschool.com\/blog\/#website\",\"url\":\"https:\/\/sreschool.com\/blog\/\",\"name\":\"SRESchool\",\"description\":\"Master SRE. Build Resilient Systems. Lead the Future of Reliability\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/sreschool.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/6a53e3870889dd6a65b2e04b7bc3d7db\",\"name\":\"priteshgeek\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/231a0e8b7a02636f2fbacf8dcf4494cb1cc0d49ecc9a8165fbaeaeeaf102641a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/231a0e8b7a02636f2fbacf8dcf4494cb1cc0d49ecc9a8165fbaeaeeaf102641a?s=96&d=mm&r=g\",\"caption\":\"priteshgeek\"},\"url\":\"https:\/\/sreschool.com\/blog\/author\/priteshgeek\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/","og_locale":"en_US","og_type":"article","og_title":"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School","og_description":"Introduction &amp; Overview What is Chaos Monkey? Chaos Monkey is an open-source tool developed by Netflix to test the resilience [&hellip;]","og_url":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/","og_site_name":"SRE School","article_published_time":"2025-08-29T07:51:04+00:00","article_modified_time":"2025-08-30T08:47:50+00:00","og_image":[{"width":800,"height":436,"url":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg","type":"image\/jpeg"}],"author":"priteshgeek","twitter_card":"summary_large_image","twitter_misc":{"Written by":"priteshgeek","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/","url":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/","name":"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering - SRE School","isPartOf":{"@id":"https:\/\/sreschool.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage"},"image":{"@id":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage"},"thumbnailUrl":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg","datePublished":"2025-08-29T07:51:04+00:00","dateModified":"2025-08-30T08:47:50+00:00","author":{"@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/6a53e3870889dd6a65b2e04b7bc3d7db"},"breadcrumb":{"@id":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/"]}]},{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/chaos-monkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#primaryimage","url":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg","contentUrl":"https:\/\/sreschool.com\/blog\/wp-content\/uploads\/2025\/08\/f3911ae1-55c9-4cd2-8688-d1045e78d440_who-is-chaos-engineering-for_compressed.jpg","width":800,"height":436},{"@type":"BreadcrumbList","@id":"https:\/\/sreschool.com\/blog\/chaos-m
onkey-a-comprehensive-tutorial-for-site-reliability-engineering\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sreschool.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Chaos Monkey: A Comprehensive Tutorial for Site Reliability Engineering"}]},{"@type":"WebSite","@id":"https:\/\/sreschool.com\/blog\/#website","url":"https:\/\/sreschool.com\/blog\/","name":"SRESchool","description":"Master SRE. Build Resilient Systems. Lead the Future of Reliability","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sreschool.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Person","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/6a53e3870889dd6a65b2e04b7bc3d7db","name":"priteshgeek","image":{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/sreschool.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/231a0e8b7a02636f2fbacf8dcf4494cb1cc0d49ecc9a8165fbaeaeeaf102641a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/231a0e8b7a02636f2fbacf8dcf4494cb1cc0d49ecc9a8165fbaeaeeaf102641a?s=96&d=mm&r=g","caption":"priteshgeek"},"url":"https:\/\/sreschool.com\/blog\/author\/priteshgeek\/"}]}},"_links":{"self":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/769","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/comments?post=769"}],"version-history":[{"count":2,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/769\/revisions"}],"predecessor-version":
[{"id":978,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/posts\/769\/revisions\/978"}],"wp:attachment":[{"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/media?parent=769"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/categories?post=769"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sreschool.com\/blog\/wp-json\/wp\/v2\/tags?post=769"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}