Tutorial: On-call Rotation in DevSecOps

1. Introduction & Overview

What is On-call Rotation?

On-call rotation is a structured schedule that distributes the responsibility of responding to alerts or incidents among team members, especially during off-hours. It ensures continuous monitoring and rapid incident response in systems that require 24/7 availability.

History or Background

  • Origin: Initially practiced in telecom and IT operations teams where uptime was mission-critical.
  • Evolution: With the rise of DevOps and now DevSecOps, on-call rotation is no longer exclusive to operations—it includes developers and security engineers too.
  • Modernization: Now integrated with alerting and paging tools such as PagerDuty and Opsgenie, fed by monitoring systems like Prometheus and delivering notifications through channels such as Slack.

Why is it Relevant in DevSecOps?

  • Always-on Security: Threats can occur anytime. An on-call schedule ensures a team is always ready to respond.
  • Incident Response Integration: It’s a key part of Incident Management—an essential DevSecOps pillar.
  • Compliance & SLAs: Many regulatory frameworks (e.g., HIPAA, SOC 2) require defined incident response procedures.
  • Cultural Shift: Embeds ownership and accountability into security and operations.

2. Core Concepts & Terminology

Key Terms and Definitions

| Term | Definition |
| --- | --- |
| On-call Rotation | A schedule that assigns engineers responsibility to monitor and respond to alerts. |
| Primary/Secondary | Primary is first to respond; secondary is the backup if the primary is unavailable. |
| Escalation Policy | Rules that determine how alerts are passed on if not acknowledged. |
| Alert Fatigue | Desensitization caused by frequent, repetitive, or false alarms. |
| Runbook | A set of documented steps to resolve known issues or respond to specific incidents. |
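
To make the primary/secondary idea concrete, here is a minimal sketch of how a weekly rotation could be resolved for a given moment. The roster, the start date, and the function name `on_call_at` are hypothetical; real tools such as PagerDuty or Opsgenie handle this for you, along with overrides and time zones.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical roster and rotation start; replace with your own team data.
ENGINEERS = ["alice", "bob", "carol"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday
SHIFT_LENGTH = timedelta(weeks=1)


def on_call_at(moment: datetime) -> tuple[str, str]:
    """Return (primary, secondary) for the shift that contains `moment`."""
    if moment < ROTATION_START:
        raise ValueError("moment is before the rotation started")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shift_index = (moment - ROTATION_START) // SHIFT_LENGTH
    primary = ENGINEERS[shift_index % len(ENGINEERS)]
    # The secondary is next in line, so they become primary in the following shift.
    secondary = ENGINEERS[(shift_index + 1) % len(ENGINEERS)]
    return primary, secondary


if __name__ == "__main__":
    print(on_call_at(datetime(2024, 1, 10, tzinfo=timezone.utc)))
```

Making the secondary the next primary is one possible design choice: it gives each engineer a lighter week to get familiar with current issues before taking the pager.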

How It Fits into the DevSecOps Lifecycle

On-call rotation supports the “Respond” and “Recover” phases:

  • Plan → Build → Test → Release → Deploy → Monitor → Respond → Recover
  • Integrated with:
    • SIEM tools for security alerts.
    • Monitoring platforms for availability metrics.
    • CI/CD pipelines for post-deployment monitoring.

3. Architecture & How It Works

Components

  • Alert Sources: CloudWatch, Prometheus, Datadog, etc.
  • Routing Engine: Defines rules for who gets notified (e.g., Opsgenie, PagerDuty).
  • Notification Channels: Slack, Email, SMS, Phone.
  • Escalation Policies: Chains of contact and wait times.
  • Calendar/Rotation Schedules: Define availability.
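
The routing engine is the least visible of these components, so a small sketch may help. It assumes alerts carry key/value labels and that the first matching rule wins; the rule table, receiver names, and the `route` function are hypothetical and do not mirror any specific product's API.

```python
# Hypothetical routing rules: (labels that must match, receiver to notify).
ROUTES = [
    ({"team": "security"}, "security-oncall"),
    ({"severity": "critical"}, "platform-oncall"),
]
DEFAULT_RECEIVER = "platform-low-priority"


def route(labels: dict[str, str]) -> str:
    """Return the receiver of the first rule whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # Nothing matched: fall back to a catch-all so no alert is silently dropped.
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    print(route({"alertname": "PortScan", "team": "security", "severity": "critical"}))
```

Because rules are checked in order, the security rule takes precedence over the generic critical rule in this sketch.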

Internal Workflow

  1. Alert triggered: a monitoring or logging tool fires an alert.
  2. Routing: the routing engine matches the alert to the current on-call schedule.
  3. Notification: the primary on-call responder is notified via their preferred channel.
  4. Escalation: if the alert is not acknowledged within the configured wait time, it escalates to the next responder.
  5. Incident response: the responder works the incident using a runbook or documented process.
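
The notify-then-escalate part of this workflow can be sketched as a small simulation. Everything here is illustrative: `EscalationStep`, `escalate`, and the idea of passing in known acknowledgement delays are assumptions made so the logic can run without a real paging system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EscalationStep:
    responder: str
    wait_minutes: int  # how long to wait for an acknowledgement before moving on


def escalate(
    policy: list[EscalationStep],
    ack_delays: dict[str, int],
    notify: Callable[[str], None],
) -> tuple[Optional[str], int]:
    """Walk the policy in order and return (acknowledging responder or None, minutes elapsed).

    `ack_delays` maps a responder to the minutes they take to acknowledge;
    responders missing from the mapping never acknowledge.
    """
    elapsed = 0
    for step in policy:
        notify(step.responder)
        delay = ack_delays.get(step.responder)
        if delay is not None and delay <= step.wait_minutes:
            return step.responder, elapsed + delay
        # No acknowledgement in time: the full wait passes before the next step.
        elapsed += step.wait_minutes
    return None, elapsed


if __name__ == "__main__":
    policy = [EscalationStep("primary", 5), EscalationStep("secondary", 10)]
    print(escalate(policy, {"secondary": 3}, notify=print))
```

A result of `None` means the whole chain was exhausted, which is worth alerting on in its own right.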

Architecture Diagram (Textual Description)

[Monitoring Tools] → [Alert Routing (e.g., PagerDuty)] → [On-Call Schedule] 
   → [Primary Responder] → [Secondary Responder] → [Incident Management Tool]

Integration Points with CI/CD or Cloud Tools

  • Prometheus + Alertmanager for Kubernetes-based deployments.
  • PagerDuty incidents triggered by post-deployment failures in the CI/CD pipeline.
  • AWS CloudWatch Alarms integrated with Opsgenie.
  • Slack + JIRA for automated ticketing on alert acknowledgement.
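
As an illustration of the CI integration, the sketch below builds a trigger event in the shape expected by the PagerDuty Events API v2 (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`). The helper name `build_trigger_event` and the example values are hypothetical, and the code only builds the JSON body; actually sending it is left to your pipeline.

```python
import json
from typing import Optional

# Events API v2 endpoint; a CI job would POST the JSON body here.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(
    routing_key: str,
    summary: str,
    source: str,
    severity: str = "critical",
    dedup_key: Optional[str] = None,
) -> dict:
    """Build (but do not send) a 'trigger' event for a failed deployment."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Reusing the same dedup_key groups repeated failures into one incident.
        event["dedup_key"] = dedup_key
    return event


if __name__ == "__main__":
    body = build_trigger_event("YOUR_ROUTING_KEY", "Deploy of api failed", "ci-pipeline")
    print(json.dumps(body, indent=2))
```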

4. Installation & Getting Started

Basic Setup or Prerequisites

  • Cloud Monitoring Service (e.g., AWS CloudWatch, Prometheus).
  • Alert Management Tool (e.g., PagerDuty, Opsgenie).
  • Slack or Email for notifications.
  • CI/CD tool (GitHub Actions, GitLab CI, Jenkins).
  • Permissions to configure alert routing and integrations.

Hands-on: Step-by-step Setup with PagerDuty + Prometheus

Step 1: Configure Prometheus Alertmanager

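# alertmanager.yml
route:
  # Default receiver for every alert that no more specific route matches
  receiver: 'pagerduty'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      # Integration key copied from the PagerDuty service (see Step 2)
      - service_key: 'YOUR_SERVICE_KEY'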

Step 2: Add PagerDuty Integration

  1. Log in to PagerDuty.
  2. Create a Service → Integration Type: Prometheus.
  3. Copy Service Integration Key.
  4. Paste it into alertmanager.yml.

Step 3: Set Up On-call Rotation

  • Navigate to Schedules.
  • Add team members.
  • Define rotation frequency (weekly, daily).
  • Configure escalation policy.

Step 4: Test an Alert

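# Post a test alert to Alertmanager's v2 API (the v1 endpoint was removed in newer releases)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
  "labels": {
    "alertname": "HighCPUUsage"
  },
  "annotations": {
    "summary": "Instance i-12345678 CPU usage > 90%"
  }
}]'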
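
If you prefer to script the test, the same alert body can be built in Python. This is a sketch under assumptions: the helper name `build_test_alert` is hypothetical, and the optional `startsAt`/`endsAt` timestamps are added so the test alert expires by itself after a few minutes instead of lingering.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname: str, summary: str, now: datetime, ttl_minutes: int = 5) -> list[dict]:
    """Return a one-element alert list that expires `ttl_minutes` after `now`."""
    return [
        {
            "labels": {"alertname": alertname},
            "annotations": {"summary": summary},
            # Timezone-aware datetimes serialize to RFC 3339 timestamps.
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        }
    ]


if __name__ == "__main__":
    body = build_test_alert(
        "HighCPUUsage",
        "Instance i-12345678 CPU usage > 90%",
        datetime.now(timezone.utc),
    )
    print(json.dumps(body, indent=2))  # pipe this into curl with -d @-
```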

5. Real-World Use Cases

1. Security Incident Response

  • Scenario: An IDS/IPS detects port scanning against cloud VMs.
  • On-call Rotation: Notifies security engineer on call to validate and mitigate.

2. Post-deployment Failures

  • Scenario: New deployment causes a memory leak.
  • On-call Rotation: The DevSecOps engineer on call triages the alert and reverts the deployment.

3. DDoS Attack

  • Scenario: Traffic anomaly originating from a single region.
  • On-call Rotation: Network security specialist investigates, blocks IPs via firewall.

4. Ransomware Alert in Finance Sector

  • Industry-specific: Financial services firms often work to strict response-time targets for malware alerts, for example a 15-minute window set by regulation or internal policy.
  • On-call Rotation: Helps meet the SLA through pre-defined rotations and escalation.
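
A response-time target like this is easy to check after the fact. The sketch below assumes you can export the trigger and acknowledgement timestamps of an alert; the 15-minute default and the function name `ack_within_sla` are illustrative, not taken from any regulation or tool.

```python
from datetime import datetime, timedelta

# Illustrative target; use whatever your policy or regulator actually requires.
ACK_SLA = timedelta(minutes=15)


def ack_within_sla(triggered_at: datetime, acknowledged_at: datetime, sla: timedelta = ACK_SLA) -> bool:
    """Return True if the alert was acknowledged within the SLA window (inclusive)."""
    if acknowledged_at < triggered_at:
        raise ValueError("acknowledgement cannot precede the trigger")
    return acknowledged_at - triggered_at <= sla


if __name__ == "__main__":
    fired = datetime(2024, 3, 1, 2, 0)
    print(ack_within_sla(fired, datetime(2024, 3, 1, 2, 12)))
```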

6. Benefits & Limitations

Key Advantages

  • ✅ Ensures 24/7 availability.
  • ✅ Reduces Mean Time To Resolution (MTTR).
  • ✅ Encourages shared responsibility among teams.
  • ✅ Supports compliance and auditability.
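
To see whether MTTR is actually going down, you need to measure it. A minimal sketch, assuming each incident is recorded as an (opened, resolved) pair of timestamps; the function name `mean_time_to_resolution` is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the (resolved - opened) duration over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    durations = [resolved - opened for opened, resolved in incidents]
    # sum() needs a timedelta start value; dividing by an int keeps it a timedelta.
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
        (datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 2, 23, 30)),
    ]
    print(mean_time_to_resolution(incidents))
```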

Common Challenges

  • ⚠️ Alert Fatigue and burnout.
  • ⚠️ Poor escalation policy leads to delayed responses.
  • ⚠️ Requires cultural buy-in—“You build it, you run it.”
  • ⚠️ Scheduling complexity in global teams.

7. Best Practices & Recommendations

Security Tips

  • Rotate and audit notification channels (e.g., phone numbers, emails).
  • Use role-based access control (RBAC) for alert configuration.
  • Store runbooks in version-controlled systems.

Performance & Maintenance

  • Regularly review and refine alert rules.
  • Conduct post-incident reviews (PIRs).
  • Test escalation policies quarterly.
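
When reviewing alert rules, it helps to know which alerts fire often but rarely need action. The sketch below assumes you can export a history of (alert name, was it actionable) pairs from your paging tool; the thresholds and the name `noisy_alerts` are arbitrary starting points.

```python
from collections import Counter


def noisy_alerts(
    history: list[tuple[str, bool]],
    min_count: int = 5,
    max_actionable_ratio: float = 0.2,
) -> list[str]:
    """Return alert names that fired at least `min_count` times but were rarely actionable."""
    total: Counter = Counter()
    actionable: Counter = Counter()
    for name, was_actionable in history:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name
        for name, count in total.items()
        if count >= min_count and actionable[name] / count <= max_actionable_ratio
    )


if __name__ == "__main__":
    history = [("HighCPU", False)] * 9 + [("HighCPU", True)] + [("DiskFull", True)] * 2
    print(noisy_alerts(history))
```

Alerts on the resulting list are candidates for a higher threshold, a longer evaluation window, or removal.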

Compliance Alignment

  • Log all alert acknowledgements and responses.
  • Retain audit trails to support GDPR, HIPAA, and SOC 2 audits.
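
One simple way to log acknowledgements and responses is an append-only file of JSON lines. This is a sketch rather than a compliance recipe: the field names, the allowed actions, and the function `audit_record` are assumptions to adapt to your own audit requirements.

```python
import json
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"triggered", "acknowledged", "escalated", "resolved"}


def audit_record(alert_id: str, actor: str, action: str, at: datetime) -> str:
    """Serialize one audit event as a single JSON line with a UTC timestamp."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if at.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    record = {
        "alert_id": alert_id,
        "actor": actor,
        "action": action,
        "at": at.astimezone(timezone.utc).isoformat(),
    }
    # sort_keys keeps lines stable, which makes diffs and checksums easier.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(audit_record("A-1001", "alice", "acknowledged", datetime.now(timezone.utc)))
```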

Automation Ideas

  • Auto-create JIRA tickets on alert.
  • Trigger rollback in CI/CD if critical alert is triggered post-deploy.
  • Auto-ping the responder again if no acknowledgement arrives within X minutes.
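
The rollback idea boils down to a small decision rule that a pipeline step could evaluate. The sketch assumes a 30-minute observation window and a "critical" severity label; both are illustrative, as is the function name `should_roll_back`.

```python
from datetime import datetime, timedelta

# Illustrative observation window after a deployment.
ROLLBACK_WINDOW = timedelta(minutes=30)


def should_roll_back(
    deployed_at: datetime,
    alert_fired_at: datetime,
    severity: str,
    window: timedelta = ROLLBACK_WINDOW,
) -> bool:
    """Return True only for critical alerts that fire within the window after a deploy."""
    if severity != "critical":
        return False
    return deployed_at <= alert_fired_at <= deployed_at + window


if __name__ == "__main__":
    deployed = datetime(2024, 5, 1, 12, 0)
    print(should_roll_back(deployed, datetime(2024, 5, 1, 12, 10), "critical"))
```

Timing alone does not prove the deployment caused the alert, so many teams gate an automatic rollback behind a quick human confirmation.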

8. Comparison with Alternatives

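| Feature / Tool | On-call Rotation | Slack-only Alerts | Email Forwarding | PagerDuty | Opsgenie |
| --- | --- | --- | --- | --- | --- |
| Escalation Policies | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| Schedule Management | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| CI/CD Integration | ✅ Yes | ⚠️ Manual | ❌ No | ✅ Yes | ✅ Yes |
| Audit Trails | ✅ Yes | ❌ No | ⚠️ Partial | ✅ Yes | ✅ Yes |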

When to Choose On-call Rotation?

  • When system availability or security events require 24/7 coverage.
  • For teams with shared security and operational responsibilities.
  • If SLA and compliance mandates are strict.

9. Conclusion

On-call rotation is a foundational practice in DevSecOps for maintaining uptime, ensuring rapid response to security incidents, and meeting compliance needs. As systems become more dynamic and distributed, well-managed on-call processes will continue to be critical.

Future Trends

  • AI-driven alert prioritization.
  • Integration with chaos engineering for proactive issue detection.
  • Gamification to reduce burnout.
