Site Reliability Professional Training Course

Advanced SRE practices for working reliability engineers.

Duration

5 Days

Level

Intermediate

Format

Instructor-Led

Certification

SRESchool

About this course

This 5-day intermediate training course is designed for engineers who are already working in SRE or reliability roles. You will go deeper into advanced SLO management, chaos engineering, distributed observability, and team-level SRE practices. The course prepares you for the Certified Site Reliability Professional (CSRP) certification.

Prerequisites

  • Completion of the SRE Training Course or equivalent experience
  • Minimum 1 year of hands-on SRE or DevOps experience
  • Working knowledge of SLOs, incident management, and monitoring
  • Familiarity with cloud platforms (AWS, GCP, or Azure)

Who should attend

  • Site Reliability Engineers with 1–3 years of experience
  • Platform Engineers deepening reliability skills
  • Senior DevOps Engineers working in production reliability
  • Cloud Engineers managing production systems at scale

What you will learn

Design advanced SLO frameworks for complex multi-service systems
Implement distributed tracing with OpenTelemetry
Lead chaos engineering experiments to proactively test reliability
Design capacity planning strategies for production workloads
Lead incident response as an incident commander
Build and maintain observability platforms at team scale

5-Day Course Agenda

Day 1: Advanced SLO Design and Error Budget Management

  • Advanced SLI selection for complex systems
  • Multi-service SLO dependencies
  • Error budget burn rate alerts
  • SLO for event-driven and async systems
  • Error budget policy enforcement
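
To give a flavor of the burn rate topic, here is a minimal Python sketch, not course material. It assumes the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the two-window paging rule are hypothetical illustrations. The 14.4 default is a commonly cited threshold for a 1-hour window against a 30-day SLO; it is an assumption here, not a value stated by this course.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float) -> float:
    """Fraction of the whole period's budget spent at this burn rate over the window."""
    return rate * window_hours / period_hours


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    Requiring the short window as well avoids paging on an incident
    that has already stopped burning budget.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)
```

For example, with a 99.9% SLO, a sustained 2% error ratio in both windows gives a burn rate of about 20 and would page, while a burn rate of 14.4 held for one hour spends roughly 2% of a 30-day budget.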

Day 2: Advanced Observability and Distributed Tracing

  • OpenTelemetry architecture and implementation
  • Distributed tracing in microservices
  • Service mesh observability
  • Advanced alerting and anomaly detection
  • Observability-driven development

Day 3: Chaos Engineering and Resilience Testing

  • Principles of chaos engineering
  • Game days and failure injection exercises
  • Chaos testing tools: Chaos Monkey, Gremlin, LitmusChaos
  • Steady-state hypothesis design
  • Building a resilience testing program

Day 4: Reliability Architecture Patterns

  • Circuit breakers and bulkheads
  • Graceful degradation patterns
  • Retry and timeout strategies
  • Caching strategies for reliability
  • Database reliability patterns
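
As a rough illustration of the circuit breaker pattern, here is a minimal Python sketch under stated assumptions. It is not the course's lab code. The class name, parameters, and the three-state model (closed, open, half-open) follow the commonly described pattern; the injectable `clock` argument is a hypothetical convenience for testing.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` where `fetch_profile` is a placeholder, and treat `CircuitOpenError` as a signal to fall back to a degraded response. This sketch is not thread-safe; a production version would need locking or a maintained library.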

Day 5: Capacity Planning and Team SRE Practices

  • Demand forecasting and capacity planning
  • Load testing methodologies
  • On-call program design for teams
  • Runbook and playbook management
  • CSRP certification exam preparation

Hands-on Labs

  • Implementing OpenTelemetry in a sample microservices application
  • Building burn rate alerts with Prometheus
  • Designing and running a chaos engineering game day
  • Implementing circuit breakers in a service
  • Capacity planning exercise with production data

Tools Covered

  • OpenTelemetry
  • Jaeger
  • Prometheus
  • Grafana
  • LitmusChaos
  • Gremlin (overview)
  • Istio/Envoy
  • Kubernetes
  • Terraform

Career outcomes

  • Senior Site Reliability Engineer
  • Staff SRE
  • Platform Reliability Engineer
  • Senior Production Engineer
  • Cloud Reliability Engineer

Course FAQs

The Site Reliability Professional Training Course is a 5-day instructor-led training program.

The course is available in instructor-led, virtual, and on-site (corporate) formats.

The prerequisites are completion of the SRE Training Course or equivalent experience; a minimum of 1 year of hands-on SRE or DevOps experience; working knowledge of SLOs, incident management, and monitoring; and familiarity with cloud platforms (AWS, GCP, or Azure).

This course is the recommended preparation for the Certified Site Reliability Professional (CSRP) certification.

Ready to enroll in the Site Reliability Professional Training Course?

Contact us to join an upcoming batch or request a private session for your team.