SRE
Site Reliability Engineering
Reliability engineering
Production engineering
Operations
DevOps
Platform engineering
Service ownership
You build it, you run it
Shared responsibility
Production readiness review (PRR)
Launch checklist
Operational excellence
Reliability
Resilience
Fault tolerance
High availability (HA)
Scalability
Elasticity
Durability
Maintainability
Observability
Monitoring
Telemetry
Instrumentation
Reliability culture
Toil
Toil budget
Automation
Autoremediation
Self-healing
Runbook
Playbook
Operational runbook
Standard operating procedure (SOP)
On-call
Pager duty
Primary on-call
Secondary on-call
Escalation policy
Escalation chain
Follow-the-sun
Incident response
Incident management
Major incident
SEV1
SEV2
SEV3
Incident commander (IC)
Communications lead
Scribe
War room
Bridge line
Incident timeline
Impact assessment
Customer impact
Blameless postmortem
Post-incident review (PIR)
Root cause analysis (RCA)
5 Whys
Fishbone diagram
Action items
Corrective action
Preventive action
CAPA
Lessons learned
Incident retrospective
Change management
Change advisory board (CAB)
Change request
Change calendar
Maintenance window
Release management
Deployment
Rollback
Roll forward
Hotfix
Patch
Backport
Release train
Service catalog
System of record
Configuration management database (CMDB)
Asset inventory
Reliability testing
Load testing
Stress testing
Soak testing
Chaos engineering
Game day
Fault injection
Failure mode and effects analysis (FMEA)
Risk assessment
Risk register
Threat modeling
SLI
Service Level Indicator
SLO
Service Level Objective
SLA
Service Level Agreement
Error budget
Error budget burn
Burn rate
Multi-window burn rate
Budget policy
Freeze policy
Availability
Uptime
Downtime
Outage
Partial outage
Degradation
Latency
Tail latency
P50 latency
P90 latency
P95 latency
P99 latency
P99.9 latency
Throughput
RPS
QPS
Error rate
Success rate
Apdex
Saturation
Capacity
Utilization
Headroom
MTTR
Mean Time to Recovery
Mean Time to Restore
Mean Time to Resolution
MTTD
Mean Time to Detect
MTTA
Mean Time to Acknowledge
MTBF
Mean Time Between Failures
Change failure rate
Deployment frequency
Lead time for changes
Metrics
Time series
Counter
Gauge
Histogram
Summary
Percentile
Cardinality
Dimensional metrics
Label
Tag
Metric namespace
Metric scraping
Pull model
Push model
Prometheus
Alertmanager
PromQL
Recording rule
Alerting rule
Service discovery
Grafana
Dashboard
Panel
Annotation
Templating
Variables
SLO dashboard
Golden signals
Four golden signals
RED method
USE method
Rate (RED)
Errors (RED)
Duration (RED)
Utilization (USE)
Saturation (USE)
Errors (USE)
Health check
Liveness check
Readiness check
Synthetic monitoring
Canary check
Black-box monitoring
White-box monitoring
Heartbeat
Uptime check
Alert
Alarm
Page
Notification
Alert routing
Alert deduplication
Alert suppression
Silence
Maintenance mode
Alert correlation
Noise reduction
Alert fatigue
Threshold alert
Anomaly detection
Dynamic threshold
Baseline
Seasonality
SLI query
SLO compliance
Error budget policy
Service-level reporting
Logs
Structured logging
Unstructured logs
Log level
DEBUG
INFO
WARN
ERROR
FATAL
Log aggregation
Log shipping
Log forwarder
Log parsing
Log enrichment
Log sampling
Log retention
Log rotation
Centralized logging
Log indexing
Log search
Log analytics
Syslog
Journald
Fluentd
Fluent Bit
Logstash
Filebeat
Vector
OpenSearch
Elasticsearch
Kibana
OpenSearch Dashboards
Splunk
Graylog
Loki
Distributed tracing
Trace
Span
Span context
Trace ID
Span ID
Parent span
Root span
Context propagation
Baggage
Correlation ID
Request ID
Trace correlation
Log correlation
Sampling
Head-based sampling
Tail-based sampling
Probability sampling
Rate-limiting sampler
Trace exporter
Span exporter
OTLP
OpenTelemetry
OTel
OpenTracing
OpenCensus
OpenTelemetry Collector
Receiver
Processor
Exporter
Sampler
Batch processor
Resource
Resource attributes
Semantic conventions
Instrumentation library
Auto-instrumentation
Manual instrumentation
W3C Trace Context
traceparent
tracestate
B3 propagation
Jaeger
Zipkin
Tempo
Lightstep
Honeycomb
X-Ray
APM
Application Performance Monitoring
RUM
Real User Monitoring
Synthetic transactions
Service map
Dependency graph
Flame graph
Profiling
Continuous profiling
eBPF
PagerDuty
Opsgenie
VictorOps
incident.io
Status page
Statuspage (Atlassian)
Incident channel
Runbook automation
ChatOps
Slack
Teams
Zoom bridge
Incident bot
Circuit breaker
Bulkhead
Timeout
Retry
Exponential backoff
Jitter
Token bucket
Leaky bucket
Load shedding
Backpressure
Container
Container runtime
OCI
containerd
Image
Container image
Image registry
Kubernetes
Cluster
Control plane
API server
etcd
Scheduler
Node
Kubelet
Pod
Deployment (Kubernetes)
StatefulSet
DaemonSet
Service
Ingress
Namespace
ConfigMap
Secret
ServiceAccount
PersistentVolume (PV)
PersistentVolumeClaim (PVC)
StorageClass
CNI
CSI
Horizontal Pod Autoscaler (HPA)
Cluster Autoscaler
Pod disruption budget (PDB)
CustomResourceDefinition (CRD)
Operator
Admission controller
Helm
Helm chart
Kustomize
Service mesh
Sidecar
Envoy
Istio
Linkerd
CI
Continuous Integration
CD
Continuous Delivery
Continuous Deployment
Pipeline
GitOps
Argo CD
Flux
Jenkins
GitHub Actions
Infrastructure as Code (IaC)
Terraform
AWS CloudFormation
Pulumi
Ansible
Message queue
Kafka
Topic
Partition (Kafka)
Consumer group
Dead-letter queue (DLQ)
Disaster recovery (DR)
RTO
RPO
Active-active
Active-passive
Multi-region
Multi-AZ
AWS
EC2
S3
EBS
EFS
RDS
Aurora
DynamoDB
ElastiCache
EKS
ECS
Fargate
Lambda
API Gateway
CloudFront
CloudWatch
CloudWatch Logs
CloudTrail
IAM policy
KMS key
Secrets Manager
SSM Parameter Store
Route 53
VPC Flow Logs
WAF (AWS)
Shield
Auto Scaling Group (ASG)
Elastic Load Balancing (ELB)
ALB
NLB
EventBridge
SQS
SNS
Kinesis
OpenSearch Service
GCP
Compute Engine
GKE
Cloud Run
Cloud Functions
App Engine
Cloud Storage
Persistent Disk
Cloud SQL
Spanner
Bigtable
Firestore
Pub/Sub (GCP)
BigQuery
Cloud Monitoring
Cloud Logging
Cloud Trace
Cloud Profiler
Cloud IAM
Cloud Load Balancing
Cloud DNS
Cloud Armor
Azure
Virtual Machines (Azure)
AKS
Azure Functions
App Service
Blob Storage
Azure Files
Managed Disks
Azure SQL Database
Cosmos DB
Azure Cache for Redis
Service Bus
Event Hubs
Event Grid
Azure Monitor
Log Analytics
Application Insights
Microsoft Entra ID (formerly Azure Active Directory)
Managed Identity
Key Vault
Virtual Network (VNet)
Network Security Group (NSG)
Application Gateway
Azure Firewall
Azure CDN
Datadog
New Relic
Dynatrace
AppDynamics
Elastic APM
Splunk Observability
Grafana Cloud
Thanos
Cortex
Mimir
VictoriaMetrics
InfluxDB
Telegraf
StatsD
Prometheus Remote Write
Graceful degradation