Quick Definition (30–60 words)
An ALB is an Application Load Balancer that routes HTTP(S) and WebSocket traffic to application endpoints based on content, headers, and path. Analogy: ALB is the traffic conductor at a busy intersection directing cars by destination and type. Formal: ALB operates at Layer 7, enforcing routing, TLS termination, and application-aware health checks.
What is ALB?
What it is:
- An ALB is an application-aware load balancing service that routes requests using HTTP semantics, host headers, paths, and advanced rules.
- It provides TLS termination, path-based routing, header rewrites, WebSocket support, and integration with service discovery and target groups.
What it is NOT:
- Not a generic network TCP load balancer; ALB specifically targets Layer 7 HTTP/S and WebSocket flows.
- Not a full API gateway replacement when you need advanced features like API key management, complex rate-limiting, comprehensive WAF policies, or built-in transformations beyond basic rewrites.
Key properties and constraints:
- Layer 7 routing with host/path/header rules.
- TLS termination and certificate management (may integrate with managed certs).
- Health checks per target group and per-path.
- Sticky sessions via cookies (session affinity).
- Rate-limiting and WAF are often adjacent services, not always built-in.
- Limits vary by provider and region (connection limits, rule counts, certificates per load balancer); check provider docs.
Where it fits in modern cloud/SRE workflows:
- Edge routing for services exposed to clients and internal north-south traffic.
- Ingress controller role for Kubernetes clusters when wrapped with a controller.
- Integrated into CI/CD pipelines for zero-downtime deploys using target group switching and weighted routing.
- Key control point for security teams: TLS, WAF, and DDoS mitigations tie into ALB.
- Observability hub: access logs, request tracing headers, metrics feed into SRE dashboards and alerting.
Diagram description (text-only):
- Clients -> CDN or Edge Cache -> ALB -> TLS termination -> Listener rules evaluate host/path/header -> select target group -> route to backend instances or containers -> backend health checks return status -> ALB applies stickiness or retries -> responses flow back through ALB -> CDN or client.
ALB in one sentence
ALB is a Layer 7 load balancer that routes HTTP(S) and WebSocket traffic using content-aware rules, TLS termination, and health checks to distribute requests across application endpoints.
ALB vs related terms (TABLE REQUIRED)
| ID | Term | How it differs from ALB | Common confusion |
|---|---|---|---|
| T1 | NLB | Operates at Layer 4 and handles TCP/UDP, not HTTP routing | People expect HTTP features like path routing |
| T2 | Classic LB | Older generation combining Layer 4 and some Layer 7 features | Often assumed to be same as ALB |
| T3 | API Gateway | Has API management, throttling, auth features | Confused as replacement for ALB for simple routing |
| T4 | Ingress Controller | Kubernetes-native routing controller that may use ALB | People equate Ingress with ALB directly |
| T5 | CDN | Caches content close to users rather than routing to backends | Assumed to handle origin load balancing |
| T6 | Reverse Proxy | Software like Nginx works at Layer 7 like ALB but self-managed | Mistaken for managed ALB features like autoscaling |
| T7 | Service Mesh | Focuses on service-to-service communication inside clusters | Confused about replacing ALB for north-south traffic |
| T8 | WAF | Web Application Firewall blocks attacks by rules, not load balancing | People think WAF is built into all ALBs |
| T9 | TLS Terminator | Performs crypto operations only | Assumed to perform routing decisions like ALB |
| T10 | Edge Router | Sits at network edge for multiple protocols | People assume it includes ALB application rules |
Row Details (only if any cell says “See details below”)
None
Why does ALB matter?
Business impact:
- Revenue continuity: ALB ensures client requests reach healthy application instances, reducing downtime that directly impacts transactions and revenue.
- Trust and brand: Proper TLS termination and consistent routing preserve user trust and meet compliance requirements.
- Risk management: ALB centralizes attack surface controls (TLS policies, integration with WAF/DDoS), lowering business risk.
Engineering impact:
- Incident reduction: Health checks and smart routing isolate unhealthy targets automatically, reducing P1 pages.
- Velocity: ALB enables blue-green and canary deployments by switching target groups, supporting rapid releases.
- Scalability: Autoscaling targets behind ALB scale application capacity without changing client endpoints.
SRE framing:
- SLIs/SLOs: ALB provides SLIs like request success rate, latency at edge, and availability for SLOs.
- Error budgets: ALB incidents consume error budgets if they affect routing, TLS, or availability.
- Toil reduction: Automate routing, health checks, and certificate rotation to reduce manual work.
- On-call: ALB alerts should be precise; network-layer or backend issues should not page the ALB owner unless ALB behavior is the root cause.
What breaks in production (3–5 realistic examples):
- Misconfigured health checks mark healthy instances as unhealthy and push traffic to a small set of nodes, causing overload and increased latency.
- TLS certificate expiry on ALB causes browsers to block access, producing a sudden outage.
- Listener rule conflict routes traffic incorrectly after a deployment, causing 404s or security bypass.
- Burst traffic overwhelms backend target capacity because autoscaling was misconfigured or cooldowns too long.
- Disabled or misrouted access logs create blind spots during incident response, delaying root cause analysis.
Where is ALB used? (TABLE REQUIRED)
| ID | Layer/Area | How ALB appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge | Public HTTP gateway with TLS termination | Request rate, latency, TLS metrics | ALB service, CDN logs |
| L2 | Network | Routing for north-south flows | Connection counts, error rates | NLB and ALB tooling |
| L3 | Service | Ingress to a microservice or target group | Per-route 4xx/5xx responses | Kubernetes Ingress controllers |
| L4 | App | Host/path-based routing to app versions | Backend latency, success rate | Service discovery, metrics |
| L5 | Data | Not for data-plane-heavy streams | Bandwidth, errors | Monitoring tools |
| L6 | Kubernetes | ALB as Ingress via controller | Ingress events, pod health | Controller logs, kube metrics |
| L7 | Serverless | Fronting serverless endpoints with HTTP rules | Invocation latency, error metrics | Function platform logs |
| L8 | CI/CD | Deployment switching between target groups | Deployment success/failure | CI tools, telemetry |
| L9 | Observability | Source of access logs, traces, headers | Access logs, trace IDs | Logging and APM tools |
| L10 | Security | TLS policies, WAF integration | Blocked requests, anomalies | WAF, SIEM tools |
Row Details (only if needed)
None
When should you use ALB?
When it’s necessary:
- You need HTTP(S) or WebSocket routing based on host, path, or headers.
- TLS termination close to the edge is required.
- You want managed autoscaling and high availability for web traffic.
- You need native integration with cloud target groups and service discovery.
When it’s optional:
- Simple TCP services where Layer 4 load balancing suffices.
- Very small internal apps where reverse proxies per app are simpler.
- When a full API gateway is already handling advanced API management and rate-limiting.
When NOT to use / overuse it:
- Don’t use ALB for non-HTTP protocols like SSH or SMTP.
- Avoid chaining multiple ALBs unless necessary; it increases latency and complexity.
- Don’t offload business logic or complex request transformations to ALB; use an API gateway or application layer.
Decision checklist:
- If you need content-aware routing and TLS termination -> Use ALB.
- If you need advanced auth, API keys, or per-API quotas -> Use API gateway or combine ALB with API gateway.
- If the workload is pure TCP/UDP -> Use network-level LB or NLB.
Maturity ladder:
- Beginner: Single ALB fronting a monolith with basic health checks and TLS.
- Intermediate: ALB with multiple target groups, path-based routing, blue-green deploys, basic observability.
- Advanced: ALB integrated with WAF, automated certificate rotation, granular SLIs, canary weighted routing, and traffic-shifting automation.
How does ALB work?
Components and workflow:
- Listener: Accepts incoming connections on a port (80/443) and evaluates rules.
- Rules and priorities: List of conditions (host, path, headers) and actions.
- Target groups: Backends registered by instance ID, IP, or container port where traffic is forwarded.
- Health checks: Periodic probes per target group to mark targets healthy or unhealthy.
- Load balancer nodes: Managed compute that terminates TLS and forwards requests.
- Access logs and metrics: Request logs, latency histograms, and error counters for observability.
Data flow and lifecycle:
- Client connects to ALB listener.
- TLS handshake if HTTPS; ALB selects certificate.
- Listener evaluates request against rules by priority.
- Matching action forwards to a target group or redirects.
- ALB selects a healthy target using its load-balancing algorithm (commonly round-robin, optionally weighted).
- ALB forwards request, waits for backend response, possibly applying timeouts and retries.
- Response is returned to client; ALB logs request and updates metrics.
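The lifecycle above can be sketched as a small simulation. The rule shapes, target group names, and addresses below are illustrative assumptions, not any provider's API; they show only the priority-ordered match followed by round-robin target selection:

```python
from itertools import cycle

# Illustrative listener rules, evaluated in priority order (lowest number first).
# Each rule matches on host and/or path prefix and forwards to a target group.
RULES = [
    {"priority": 10, "host": "api.example.com", "path": "/v2/", "target_group": "api-v2"},
    {"priority": 20, "host": "api.example.com", "path": None, "target_group": "api-v1"},
    {"priority": 30, "host": None, "path": "/static/", "target_group": "static"},
]
DEFAULT_TARGET_GROUP = "web"

# Healthy targets per group; round-robin iterators model the default algorithm.
TARGETS = {
    "api-v2": cycle(["10.0.1.10", "10.0.1.11"]),
    "api-v1": cycle(["10.0.2.10"]),
    "static": cycle(["10.0.3.10"]),
    "web": cycle(["10.0.4.10", "10.0.4.11"]),
}

def route(host: str, path: str) -> tuple[str, str]:
    """Return (target_group, target) for a request, like a listener would."""
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        host_ok = rule["host"] is None or rule["host"] == host
        path_ok = rule["path"] is None or path.startswith(rule["path"])
        if host_ok and path_ok:
            group = rule["target_group"]
            return group, next(TARGETS[group])
    return DEFAULT_TARGET_GROUP, next(TARGETS[DEFAULT_TARGET_GROUP])

print(route("api.example.com", "/v2/users"))    # matches the api-v2 rule
print(route("api.example.com", "/v1/users"))    # falls through to api-v1
print(route("www.example.com", "/index.html"))  # no rule matches: default group
```

Note that priority order decides everything here: moving the catch-all host rule above the path rule would shadow it, which is exactly the rule-conflict failure mode discussed later.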
Edge cases and failure modes:
- Backend timeouts causing ALB to return 504.
- Misrouted traffic due to rule precedence errors causing unexpected backends.
- Sudden target group flapping if health checks are too strict.
- TLS policy mismatches causing handshake failures for certain clients.
- Cross-zone routing disabled causing uneven load if targets are unevenly distributed.
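The flapping edge case comes from how consecutive-threshold health checks behave. This sketch, with hypothetical threshold defaults, shows why a single transient failure should not flip a target, and why recovery takes several probes:

```python
class TargetHealth:
    """Consecutive-threshold health model: a target changes state only after
    `healthy_threshold` successes or `unhealthy_threshold` failures in a row."""

    def __init__(self, healthy_threshold: int = 3, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes disagreeing with the current state

    def probe(self, success: bool) -> bool:
        if success == self.healthy:
            self._streak = 0  # probe agrees with current state; reset
        else:
            self._streak += 1
            needed = self.healthy_threshold if not self.healthy else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy

t = TargetHealth(healthy_threshold=3, unhealthy_threshold=2)
print(t.probe(False))  # True: one transient failure is tolerated
print(t.probe(False))  # False: second consecutive failure flips the target
print(t.probe(True), t.probe(True), t.probe(True))  # False False True: recovery needs 3
```

Setting `unhealthy_threshold=1` in this model would flip a target on every transient blip, which is the "too strict" configuration that produces flapping in production.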
Typical architecture patterns for ALB
- Single ALB with multiple host-based rules: Good for small multi-tenant applications sharing a domain.
- ALB per application with CDN in front: Good for isolation and caching static content.
- ALB as Kubernetes Ingress via controller: Good for cluster-native workloads; integrates target groups with pod IPs.
- ALB fronting serverless endpoints: Connects HTTP services to functions or managed PaaS backends.
- ALB + API Gateway hybrid: ALB handles static routing, API gateway handles auth and rate-limits.
- Internal ALB for service-to-service north-south traffic: Keeps cross-team traffic internal without exposing to internet.
Failure modes & mitigation (TABLE REQUIRED)
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Backend flapping | Frequent 5xx spikes | Health checks misconfigured | Relax checks, adjust thresholds | Health check failures |
| F2 | TLS handshake failures | Clients report TLS errors | Expired or wrong cert | Rotate certs, update policies | TLS error counts |
| F3 | Rule conflict | Requests routed to wrong backend | Overlapping priorities | Reorder rules, test in staging | Unexpected 404s/301s |
| F4 | Autoscale lag | Increased latency and 5xx | Slow scale-up, long cooldowns | Tune autoscale policies | Instance launch metrics |
| F5 | Access log gaps | Missing forensic data | Logging disabled or misconfigured | Enable and route logs | Missing request IDs |
| F6 | High connection churn | CPU spikes on LB nodes | Poor client keepalive settings | Adjust timeouts and keepalives | Connection churn metric |
| F7 | DDoS or traffic spikes | High request bursts, outages | Insufficient WAF or limits | Engage WAF, rate limits, CDN | Sudden traffic surge metric |
| F8 | Internal routing loop | Elevated latency and timeouts | Circular redirect rules | Fix redirect logic, add guards | High upstream latency |
| F9 | Cross-zone imbalance | Uneven backend utilization | Cross-zone disabled | Enable cross-zone balancing | Per-AZ request counts |
| F10 | Header truncation | Auth failures downstream | Header size limits | Increase limits or trim headers | 4xx auth errors |
Row Details (only if needed)
None
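Failure F3 (rule conflict) is cheap to catch before deploy. This is a hypothetical pre-deployment lint, not a provider feature: it flags any rule whose path prefix is fully covered by a higher-priority rule on the same host, meaning it can never match:

```python
def shadowed_rules(rules):
    """Flag rules that can never match because a higher-priority rule on the
    same host has a path prefix covering theirs.
    Rule shape is illustrative: (priority, host, path_prefix)."""
    ordered = sorted(rules, key=lambda r: r[0])
    problems = []
    for i, (prio, host, path) in enumerate(ordered):
        for hi_prio, hi_host, hi_path in ordered[:i]:
            same_host = hi_host is None or hi_host == host
            if same_host and path.startswith(hi_path):
                problems.append((prio, path, f"shadowed by priority {hi_prio} ({hi_path})"))
                break
    return problems

rules = [
    (10, "api.example.com", "/"),        # catch-all mistakenly given top priority
    (20, "api.example.com", "/admin/"),  # can never match: shadowed by rule 10
    (30, None, "/static/"),              # fine: applies to other hosts too
]
for problem in shadowed_rules(rules):
    print(problem)
```

Running a check like this in CI against the staged rule set is a lighter-weight guard than discovering the 404s in production.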
Key Concepts, Keywords & Terminology for ALB
- ALB — Layer 7 load balancer for HTTP and WebSocket traffic — central request router — assuming Layer 4 features is a pitfall.
- Listener — Entry point configured with port/protocol — receives connections — misconfigured ports cause no traffic.
- Target group — Set of backend endpoints — used for routing and health checks — forgetting registration causes 503.
- Health check — Probes to determine target health — prevents routing to unhealthy nodes — overly strict checks cause flapping.
- Rule — Condition-action pair on listener — defines routing logic — priority errors cause unexpected routing.
- Path-based routing — Routes by URL path — enables microservice segregation — incorrect prefixes can break routes.
- Host-based routing — Routes by host header — useful for multi-tenant domains — missing host header breaks routing.
- Sticky sessions — Session affinity using cookies — maintains session to a target — prevents true stateless scaling.
- TLS termination — Decrypts TLS at ALB — reduces backend CPU usage — mismanagement risks cert expiry outages.
- TLS passthrough — Leaves TLS to the backend — needed for end-to-end encryption — typically not supported by Layer 7 ALBs; use a Layer 4 load balancer if required.
- Certificate — Public key for TLS — must be valid and rotated — expired certs cause client errors.
- WAF — Web Application Firewall — blocks attacks by rules — may be separate service integrated with ALB.
- CDN — Content Delivery Network — caches responses before ALB — reduces backend load — invalid caches cause stale content.
- Access logs — Request level logs including headers and paths — essential for forensics — disabling leads to blindspots.
- Connection draining — Allows in-flight requests to complete on targets being deregistered — prevents abrupt terminations.
- Weighted routing — Distribute traffic by weight across targets — enables canary releases — wrong weights cause leakage.
- Canary deployment — Gradual rollout to subset of traffic — reduces risk — needs monitoring to rollback quickly.
- Blue-green deployment — Swap active target groups or endpoints — minimizes downtime — needs DNS or LB switching.
- Cross-zone load balancing — Distributes traffic evenly across availability zones — prevents hotspotting — disabled causes imbalance.
- Idle timeout — Connection timeout setting — affects long-polling and WebSockets — too low breaks long connections.
- Keepalive — Maintains persistent connections — reduces backend connection overhead — misconfigured can keep stale connections.
- Rate limiting — Limits request rate — protects backends — may need integration with API gateway or WAF.
- Retry logic — Retries transient failures — protects clients from intermittent errors — may hide persistent failures.
- Circuit breaker — Stops sending traffic to failing components — reduces cascading failures — must be tuned to backend behavior.
- Observability — Metrics logs traces fed into monitoring systems — necessary for response and capacity planning — missing traces hampers debugging.
- SLIs — Service Level Indicators like p99 latency and availability — measurable signals — choose ones that reflect user experience.
- SLOs — Service Level Objectives derived from SLIs — operational goals — unrealistic SLOs cause wasted effort.
- Error budget — Allowable failure margin for releases — drives risk-taking decisions — burning budget too fast limits deployments.
- Access control lists — Rules controlling source access — protects internal ALBs — misconfigured ACL blocks legit traffic.
- Mutual TLS — Two-way TLS authentication — enforces client certs — complex rotation management is a pitfall.
- HTTP/2 — Protocol for multiplexed requests — reduces latency — some backends may not support it.
- WebSocket — Bidirectional persistent connections — requires idle timeout adjustment — broken by intermediate proxies with short timeouts.
- Header rewriting — Modify headers passing to backend — supports routing and security — incorrect rewrites break auth.
- Content-based routing — Decisions based on request body or headers — powerful but can be expensive — heavy parsing increases latency.
- Connection limit — Max concurrent connections the ALB supports — exceeding causes dropped traffic — monitor and scale.
- Target registration — Adding instances or IPs to target groups — mistakes leave apps unserved — automate registration.
- Health threshold — Number of consecutive successes/failures to mark healthy/unhealthy — tuning needed to avoid flapping.
- IPv6 support — Whether ALB handles IPv6 traffic — impacts global clients — not always available or configured.
- Internal ALB — Not internet-facing, used for internal traffic — reduces exposure — using internet ALB increases attack surface.
- DNS CNAME/ALIAS — DNS records pointing to ALB — must be updated on IP changes — using alias records often required.
- Rate-based WAF rules — Block based on threshold of requests — useful for bot mitigation — fine-tuning required to avoid false positives.
- Auto scaling integration — Dynamic scaling of backends based on metrics — avoids overload — poor metrics lead to scaling mismatches.
- Latency histogram — Distribution of response times — helps identify p99 and outliers — averages hide tail latency.
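Several of these terms come together in sticky sessions: the balancer pins a client to a target via a cookie. A minimal sketch, with a hypothetical cookie name and no real HTTP machinery:

```python
from itertools import cycle

TARGETS = cycle(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
COOKIE = "LB_AFFINITY"  # hypothetical cookie name, not a provider's

def handle(request_cookies: dict) -> tuple[str, dict]:
    """Return (target, cookies_to_set). A request carrying the affinity cookie
    keeps going to the same target; otherwise pick the next target and pin it."""
    if COOKIE in request_cookies:
        return request_cookies[COOKIE], {}
    target = next(TARGETS)
    return target, {COOKIE: target}

# First request has no cookie and gets pinned to a target.
target, set_cookies = handle({})
# Follow-up requests carrying the cookie stick to the same target.
again, _ = handle(set_cookies)
print(target, again, target == again)
```

This also illustrates the pitfall noted above: pinned traffic defeats stateless scaling, and unlike this sketch, a real ALB must re-pin the client when its target becomes unhealthy.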
How to Measure ALB (Metrics, SLIs, SLOs) (TABLE REQUIRED)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Request success rate | Availability from client view | Successful responses / total requests | 99.9% monthly | 4xx vs 5xx matter differently |
| M2 | Latency p95/p99 | User-perceived responsiveness | Histogram from ALB timings | p95 < 300ms, p99 < 1s | Backend queuing inflates tail |
| M3 | Error rate 5xx | Backend failures through ALB | 5xx count / total | < 0.1% | Retries can mask real errors |
| M4 | Backend healthy targets | Capacity available | Healthy targets per group | >= 2 per AZ | Flapping hides real capacity |
| M5 | TLS handshake failures | TLS problems between client and ALB | TLS error counts | 0 per hour | Some clients use legacy ciphers |
| M6 | Connection count | Concurrency and saturation risk | Active connections gauge | Depends on app | Long-lived websockets inflate this |
| M7 | Request per second | Traffic baseline | Sum requests per second | Varies by app | Bursts require spike handling |
| M8 | 4xx rate | Client errors and routing issues | 4xx count / total | Monitor trend | Automated clients can cause spikes |
| M9 | Access log volume | Logging completeness and scale | Log lines ingested | All requests logged | Sampling hides detail |
| M10 | Retry rate | Network or backend retries | Retry attempts / requests | Low single digits | Retries can cause amplification |
| M11 | Timeouts 504 | Backend or ALB timeouts | Count of 504 responses | 0 in SLO window | Long backend processing causes this |
| M12 | CPU of targets | Backend CPU pressure | Target metrics from hosts | Depends on instance | Lack of autoscale causes slowdowns |
| M13 | Latency broken down by route | Identifies slow endpoints | Per-rule histograms | Baseline per route | Aggregation hides hotspots |
| M14 | Cache hit ratio | If CDN in front of ALB | Cache hits / requests | > 70% for static | Dynamic content reduces ratio |
| M15 | DDoS signals | Attack detection | Anomalous traffic volumes | Alert on spikes | High false positives possible |
Row Details (only if needed)
None
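M1 and M2 can be computed directly from access-log records. The records and the nearest-rank percentile method below are illustrative; note how the tail percentile exposes an outlier that the other latencies hide:

```python
# Synthetic access-log records: (status_code, latency_ms).
records = [(200, 120), (200, 95), (502, 40), (200, 310), (404, 60),
           (200, 150), (503, 30), (200, 88), (200, 1400), (200, 101)]

def success_rate(records):
    # Count 5xx as failures; 4xx are client errors, so they count as served.
    ok = sum(1 for status, _ in records if status < 500)
    return ok / len(records)

def percentile(latencies, p):
    # Nearest-rank percentile: simple and adequate for a sketch.
    ordered = sorted(latencies)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [ms for _, ms in records]
print(f"success rate: {success_rate(records):.1%}")   # 80.0%
print(f"p50 latency:  {percentile(latencies, 50)} ms")
print(f"p95 latency:  {percentile(latencies, 95)} ms")
```

Here p50 is a comfortable 95 ms while p95 lands on the 1400 ms outlier, which is the "averages hide tail latency" gotcha from the table in miniature.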
Best tools to measure ALB
Tool — Cloud provider metrics (native)
- What it measures for ALB: Request counts, latency, healthy hosts, TLS metrics, access logs.
- Best-fit environment: Cloud-hosted ALB in same provider.
- Setup outline:
- Enable ALB metrics in provider console.
- Enable access logs to storage.
- Configure log lifecycle and export to analytics.
- Hook metrics into cloud monitoring.
- Create dashboards for SLIs.
- Strengths:
- Rich native integration and low latency.
- Accurate source for LB-specific metrics.
- Limitations:
- Vendor-specific and less customizable.
- Retention and export costs may apply.
Tool — Prometheus + exporters
- What it measures for ALB: Backend and controller metrics, exporter-sourced ALB metrics.
- Best-fit environment: Kubernetes or self-managed monitoring.
- Setup outline:
- Deploy cloud exporter or ALB controller metrics.
- Scrape exporter endpoints.
- Create recording rules for SLIs.
- Integrate with Alertmanager for alerting.
- Strengths:
- Highly customizable and open source.
- Good for cluster-native visibility.
- Limitations:
- Needs maintenance and scaling for high volumes.
- Exporter coverage varies by provider.
Tool — Distributed tracing (OpenTelemetry)
- What it measures for ALB: End-to-end latency, trace context propagation through ALB.
- Best-fit environment: Microservices and distributed systems.
- Setup outline:
- Instrument services with OpenTelemetry SDKs.
- Ensure ALB forwards trace headers.
- Collect traces centrally and connect to traces dashboard.
- Strengths:
- Pinpoints backend and network-induced latency.
- Visualizes dependency graphs.
- Limitations:
- Requires instrumentation; sampling trades off fidelity.
Tool — Log analytics (ELK / data lakes)
- What it measures for ALB: Access logs, request headers, error payloads.
- Best-fit environment: Teams needing deep search and forensics.
- Setup outline:
- Route ALB access logs to storage.
- Ingest logs into analytics pipeline.
- Parse fields and create dashboards.
- Strengths:
- Powerful ad-hoc investigation and alerting.
- Can retain long-term history.
- Limitations:
- Costly at large scale and needs storage management.
Tool — Application Performance Monitoring (APM)
- What it measures for ALB: End-to-end latency, error rates, traces correlated with ALB metrics.
- Best-fit environment: Production apps requiring deep performance insights.
- Setup outline:
- Install APM agents in services.
- Ensure incoming request tracing header integration.
- Correlate APM metrics with ALB request rates.
- Strengths:
- Rich UI for root cause analysis.
- Automatic anomaly detection.
- Limitations:
- License costs and potential overhead.
Recommended dashboards & alerts for ALB
Executive dashboard:
- Panels: Overall request success rate, p95/p99 latency, monthly availability, top 5 affected services.
- Why: Business-level health and SLA visibility.
On-call dashboard:
- Panels: Current request/s error rates, target group health, recent 5xx spikes, TLS failure rate, top slow routes.
- Why: Fast triage, identify if ALB or backend is root cause.
Debug dashboard:
- Panels: Per-route latency histogram, per-target CPU and memory, access logs tail, health check history, connection counts.
- Why: Deep debugging for incidents.
Alerting guidance:
- Page-worthy alerts: ALB total availability below SLO, TLS handshake failures affecting >1% of traffic, persistent target group unhealthy count >= threshold.
- Ticket-only alerts: Transient elevated latency not sustained beyond window, access log upload failure with retries in place.
- Burn-rate guidance: Use burn-rate calculations for error budget consumption; page if burn rate > 3x for 1 hour and error budget still significant.
- Noise reduction tactics: Deduplicate alerts by grouping by ALB name, use suppression windows during known maintenance, set thresholds per-application baselines.
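The burn-rate guidance above reduces to simple arithmetic: burn rate is the observed error rate divided by the error rate the SLO allows. A minimal sketch with assumed sample numbers:

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget,
    3.0 means the budget would be exhausted in a third of the SLO window."""
    observed_error_rate = failed / total
    budget_rate = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget_rate

# 99.9% SLO; over the last hour, 60 failures out of 10,000 requests.
rate = burn_rate(failed=60, total=10_000, slo=0.999)
print(f"burn rate: {rate:.1f}x")                    # 6.0x
print("page on-call" if rate > 3 else "ticket only")  # page on-call
```

In practice you would evaluate this over two windows (e.g. a fast 1-hour window and a slower multi-hour window) so a brief spike does not page but sustained burn does.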
Implementation Guide (Step-by-step)
1) Prerequisites
- Inventory of applications and domains.
- DNS plan for ALIAS/CNAME records.
- Certificates or certificate manager setup.
- IAM or role permissions for provisioning ALBs and logging.
- Monitoring and logging accounts prepared.
2) Instrumentation plan
- Decide SLIs and SLOs for request success and latency.
- Ensure services propagate trace headers.
- Enable ALB access logs and export to the logging system.
3) Data collection
- Stream access logs to durable storage.
- Scrape ALB metrics and backend metrics into the monitoring system.
- Configure trace ingestion for distributed tracing.
4) SLO design
- Create SLI definitions: success rate, p99 latency per route.
- Select SLO targets and error budgets based on business needs.
5) Dashboards
- Build executive, on-call, and debug dashboards as described above.
6) Alerts & routing
- Implement alerting thresholds and connect them to the on-call rotation.
- Configure escalation rules and runbooks.
7) Runbooks & automation
- Create runbooks for common failures: TLS expiry, target flapping, rule misconfiguration.
- Automate target registration, certificate rotation, and scale policies.
8) Validation (load/chaos/game days)
- Run load tests including TLS handshakes and long-lived connections.
- Execute chaos experiments: kill targets, inject latency.
- Perform game days simulating certificate expiry and large traffic spikes.
9) Continuous improvement
- Hold postmortems after incidents, with action items.
- Tune health checks and autoscale policies.
- Refine SLOs based on observed data.
Pre-production checklist:
- Health checks validated against staging endpoints.
- TLS certificates uploaded and valid.
- Access logs enabled and verified.
- Test routing rules with synthetic traffic.
- Monitoring alerts configured and tested.
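The "test routing rules with synthetic traffic" item can be scripted. This is a hypothetical check, with `resolve` standing in for a synthetic HTTP request per route against the staged load balancer:

```python
# Hypothetical pre-production check: assert that known paths resolve to the
# intended target groups before cutting traffic over.
EXPECTED = {
    ("app.example.com", "/api/users"): "api",
    ("app.example.com", "/static/a.css"): "static",
    ("admin.example.com", "/"): "admin",
}

def resolve(host: str, path: str) -> str:
    """Stand-in for querying the real listener's rule evaluation; in practice
    this would issue a synthetic request and inspect which backend answered."""
    if host == "admin.example.com":
        return "admin"
    if path.startswith("/static/"):
        return "static"
    if path.startswith("/api/"):
        return "api"
    return "web"

mismatches = [(key, got) for key, want in EXPECTED.items()
              if (got := resolve(*key)) != want]
print("routing OK" if not mismatches else f"mismatches: {mismatches}")
```

Keeping `EXPECTED` in version control alongside the rule definitions turns every routing change into a reviewable, testable diff.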
Production readiness checklist:
- Redundant AZs with targets in each.
- Minimum healthy targets per AZ verified.
- Autoscaling policies tested for spikes.
- Runbooks and on-call escalation in place.
- Canary deployment strategy defined.
Incident checklist specific to ALB:
- Check ALB metrics for spikes and TLS failures.
- Verify target group health and recent health check history.
- Tail access logs for affected requests.
- Confirm rule changes or deployments in last 30 minutes.
- If needed, reroute traffic to backup target group or disable faulty rules.
Use Cases of ALB
1) Multi-tenant web application – Context: Single domain with tenant-specific subdomains. – Problem: Need host-based routing to different backend clusters. – Why ALB helps: Routes by host header to specific target groups. – What to measure: Per-tenant request success rate and latency. – Typical tools: ALB, DNS alias, monitoring.
2) Kubernetes Ingress – Context: Kubernetes cluster exposing many services. – Problem: Need managed ingress with cloud load balancing. – Why ALB helps: Integrates with Ingress controllers to register pod IPs. – What to measure: Ingress latency, pod readiness, rule evaluation time. – Typical tools: ALB controller, Prometheus.
3) Serverless HTTP fronting – Context: Functions behind HTTP endpoints. – Problem: Functions need stable URL and TLS. – Why ALB helps: Fronts functions and applies routing rules. – What to measure: Invocation latency, cold-start impact. – Typical tools: ALB, serverless platform logs.
4) Blue-green deployments – Context: Risk-averse release process. – Problem: Need instant rollback and zero downtime. – Why ALB helps: Swap target groups for zero-downtime switchover. – What to measure: Error rate during switch, traffic split. – Typical tools: CI/CD orchestrator, ALB target groups.
5) WebSocket backend – Context: Real-time chat or streaming. – Problem: Persistent connections require correct timeouts. – Why ALB helps: Supports WebSocket with long idle timeouts. – What to measure: Connection counts, idle timeout errors. – Typical tools: ALB, application logs.
6) Path-based microservice routing – Context: Microservices with route prefixes. – Problem: Consolidate single public entry point. – Why ALB helps: Path-based routing to different target groups. – What to measure: Per-path latency and errors. – Typical tools: ALB, tracing.
7) Canary testing with weighted routing – Context: New version rollout. – Problem: Gradual exposure to a subset of traffic. – Why ALB helps: Weighted routing to split traffic by percentages. – What to measure: Canary error rate compared to baseline. – Typical tools: ALB, CI pipeline.
8) Internal service gateway – Context: Cross-team internal APIs. – Problem: Secure and monitor internal north-south traffic. – Why ALB helps: Internal ALB keeps traffic inside VPC with ACLs. – What to measure: Internal latency and success rates. – Typical tools: ALB internal mode, SIEM.
9) TLS offload for legacy backends – Context: Old services that don’t support TLS. – Problem: Need TLS at edge while keeping backends unchanged. – Why ALB helps: TLS termination and re-encrypt if needed. – What to measure: TLS failure counts and backend errors. – Typical tools: ALB, certificate manager.
10) Integration with WAF for security – Context: Web app under attack. – Problem: Block OWASP class attacks and bots. – Why ALB helps: Integrate WAF to block malicious requests at edge. – What to measure: Blocked requests and false positives. – Typical tools: ALB + WAF, SIEM.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes ingress with ALB
Context: A microservices platform in Kubernetes exposes multiple services to the internet.
Goal: Provide secure, path- and host-based routing with autoscaling.
Why ALB matters here: ALB integrates with Kubernetes controllers to register pods and provides managed TLS termination and routing.
Architecture / workflow: Client -> CDN -> ALB -> Kubernetes Ingress Controller -> Services (pods) -> Tracing/Logging.
Step-by-step implementation:
- Install ALB Ingress controller with appropriate IAM role.
- Define Ingress resources with host and path rules.
- Configure target groups to use pod IPs or node ports.
- Enable access logs and metrics export.
- Setup health checks to use application-specific endpoints.
What to measure: Per-route p95/p99, target group healthy count, pod restart rates.
Tools to use and why: ALB controller, Prometheus, Grafana, OpenTelemetry for traces.
Common pitfalls: Incorrect service annotations causing wrong port registration; health check endpoints that require auth.
Validation: Run load test and scale pods, verify ALB distributes traffic, check traces for end-to-end latency.
Outcome: Managed external ingress with predictable routing and SLO-aligned observability.
Scenario #2 — Serverless API fronted by ALB
Context: A managed serverless platform provides functions for an API.
Goal: Expose functions under a shared domain with TLS and routing.
Why ALB matters here: ALB provides the endpoint, TLS termination, and path-based routing to function endpoints.
Architecture / workflow: Client -> ALB -> Function gateway -> Function runtime -> Response.
Step-by-step implementation:
- Create ALB with HTTPS listener and certificate.
- Configure rules mapping paths to serverless function endpoints or targets.
- Enable health checks where supported or rely on platform health.
- Instrument function with traces and propagate headers.
What to measure: Invocation latency, cold start rate, error rates.
Tools to use and why: ALB logs, function metrics, APM for tracing.
Common pitfalls: Incorrect idle timeout causing function timeouts; rate-limits on function provider.
Validation: Synthetic requests and observe function scaling and latency.
Outcome: Secure, stable front door for serverless APIs.
Scenario #3 — Incident response: TLS expiry outage
Context: Production web app experiences mass failures when TLS cert expires on ALB.
Goal: Restore access quickly and prevent recurrence.
Why ALB matters here: ALB serves certificate and handshake; expiry blocks all HTTPS traffic.
Architecture / workflow: Client -> ALB with expired cert -> TLS fail -> No request reaches backend.
Step-by-step implementation:
- On-call receives TLS handshake failure alerts.
- Verify certificate expiry on ALB console or via monitoring.
- Upload new cert or attach managed cert.
- Validate handshake and route health.
- Create postmortem to adjust certificate rotation automation.
What to measure: TLS failure counts before and after, user impact metrics.
Tools to use and why: Provider console, monitoring, ticketing, runbook.
Common pitfalls: Cert uploaded but wrong domain; IAM role blocking cert access.
Validation: Browser tests and synthetic checks for TLS versions.
Outcome: Restored HTTPS access and automated rotation implemented.
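The postmortem action item (automated rotation) usually starts with an expiry check that alerts well before the certificate lapses. A minimal sketch; the 30-day threshold and how `not_after` is obtained are illustrative assumptions:

```python
# Certificate-expiry alerting sketch: compare the cert's notAfter timestamp
# against the current time and alert inside a renewal window.
from datetime import datetime, timezone

def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate expires (negative if past)."""
    return (not_after - now).days

def should_alert(not_after: datetime, now: datetime, threshold_days: int = 30) -> bool:
    """Alert when expiry falls within the threshold, or has already passed."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Wiring this into monitoring (rather than a calendar reminder) is what prevents the scenario above from recurring.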
Scenario #4 — Cost versus performance trade-off for ALB
Context: High-traffic static site with dynamic endpoints; costs on ALB ingress become significant.
Goal: Reduce ALB costs without degrading latency for dynamic endpoints.
Why ALB matters here: ALB costs scale with request count and data processed.
Architecture / workflow: Client -> CDN cache -> ALB for misses -> Dynamic backend.
Step-by-step implementation:
- Move static assets fully to CDN and origin shield.
- Configure CDN to handle more logic via edge workers to avoid ALB.
- Keep ALB for dynamic API calls only; route static paths straight to CDN.
- Monitor ALB request volume and costs.
What to measure: ALB request count, CDN cache hit ratio, latency for dynamic endpoints, cost per request.
Tools to use and why: CDN analytics, ALB metrics, cost reporting.
Common pitfalls: CDN cache misconfiguration causing cache misses; edge logic adding latency.
Validation: Compare cost and latency before and after the changes under load.
Outcome: Reduced ALB cost and preserved performance for dynamic traffic.
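The trade-off above reduces ALB volume in proportion to the CDN cache hit ratio; a back-of-envelope model makes the savings concrete. The linear per-request cost model and any prices plugged in are illustrative assumptions, not provider rates:

```python
# Cost model sketch: only CDN cache misses reach the ALB as origin traffic,
# so ALB request volume scales with (1 - cache_hit_ratio).
def alb_requests_after_cdn(total_requests: int, cache_hit_ratio: float) -> int:
    """Requests that still hit the ALB once the CDN absorbs cache hits."""
    return round(total_requests * (1.0 - cache_hit_ratio))

def alb_request_cost(requests: int, cost_per_million: float) -> float:
    """Illustrative linear cost: price per million requests times volume."""
    return requests / 1_000_000 * cost_per_million
```

At a 90% hit ratio, ten million client requests become one million ALB requests, which is why the cache hit ratio is the metric to watch first.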
Scenario #5 — Canary release with weighted ALB routing
Context: Deploy new service version with minimal risk.
Goal: Shift 5% traffic to canary and monitor.
Why ALB matters here: ALB supports weighted routing to target groups enabling canary tests.
Architecture / workflow: Client -> ALB -> Weighted target groups (v1 95%, v2 5%) -> Backends.
Step-by-step implementation:
- Deploy new target group for v2.
- Configure weighted routing on ALB listener rules.
- Monitor canary SLOs and compare error rates.
- Gradually increase weight or rollback on anomalies.
What to measure: Canary error rate, latency delta, resource usage.
Tools to use and why: ALB metrics, APM, CI/CD rollback automation.
Common pitfalls: Insufficient telemetry on canary; cross-AZ distribution differences.
Validation: Hold the canary steady for a defined window, then increase the weight or roll back.
Outcome: Safer releases with measurable impact.
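Weighted routing as described can be modeled as weighted random selection over target groups. The 95/5 split matches this scenario; the helper below is a sketch of the mechanism, not an ALB API:

```python
# Weighted target-group selection sketch: each request is routed to a group
# with probability proportional to its weight (v1 95%, v2 5%).
import random

WEIGHTS = {"v1": 95, "v2": 5}

def pick_target_group(weights: dict[str, int], rng: random.Random) -> str:
    """Choose one group, weighted by the configured traffic split."""
    groups = list(weights)
    return rng.choices(groups, weights=[weights[g] for g in groups], k=1)[0]
```

Over many requests the observed split converges on the configured weights, which is why canary comparisons need a large enough sample before increasing the weight.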
Scenario #6 — Internal service gateway with ALB
Context: Multiple internal teams share APIs inside VPC.
Goal: Secure internal traffic and centralize routing and monitoring.
Why ALB matters here: Internal ALB restricts exposure and integrates with IAM and ACLs.
Architecture / workflow: Internal client -> Internal ALB -> Internal services -> Observability.
Step-by-step implementation:
- Create internal ALB with private subnets.
- Configure security groups and ACLs for allowed sources.
- Enable internal access logs and monitoring.
- Integrate with service discovery for dynamic targets.
What to measure: Internal latency, auth failures, internal access patterns.
Tools to use and why: ALB, SIEM, service discovery tools.
Common pitfalls: Overly permissive ACLs; missing internal DNS records.
Validation: Internal clients exercise routes; validate SLOs.
Outcome: Secure internal API routing and centralized monitoring.
Common Mistakes, Anti-patterns, and Troubleshooting
1) Symptom: 5xx spikes after deployment -> Root cause: Health checks point to wrong path -> Fix: Update health check path and verify probe response.
2) Symptom: TLS errors from clients -> Root cause: Expired certificate -> Fix: Rotate certificate and automate renewal.
3) Symptom: Uneven load across AZs -> Root cause: Cross-zone balancing disabled -> Fix: Enable cross-zone balancing.
4) Symptom: Missing access logs -> Root cause: Logging disabled or permissions error -> Fix: Enable logs and verify IAM.
5) Symptom: High p99 latency -> Root cause: Backend queuing and CPU saturation -> Fix: Scale backends and optimize code.
6) Symptom: Frequent target flapping -> Root cause: Health check thresholds too strict -> Fix: Relax thresholds and add a warming period.
7) Symptom: Redirect loops -> Root cause: Misconfigured redirect rules -> Fix: Review rule precedence and add loop detection.
8) Symptom: WebSocket disconnects -> Root cause: Idle timeout too short -> Fix: Increase idle timeout for long connections.
9) Symptom: Unexpected 404s -> Root cause: Rule priority order incorrect -> Fix: Reorder rules and test with synthetic requests.
10) Symptom: High costs with many small requests -> Root cause: Static assets served through ALB, not CDN -> Fix: Move static content to the CDN edge.
11) Symptom: Canary leaked traffic -> Root cause: Weighting misconfigured -> Fix: Correct the weights or use a separate listener rule.
12) Symptom: Monitoring blind spot -> Root cause: Tracing headers not propagated -> Fix: Ensure ALB forwards trace headers and services accept them.
13) Symptom: Burst traffic causes outages -> Root cause: Insufficient autoscale policies -> Fix: Tune scale policies and pre-warm capacity.
14) Symptom: Auth failures downstream -> Root cause: Header rewrite removed auth token -> Fix: Preserve auth headers or perform auth at the ALB/gateway.
15) Symptom: Too many alerts -> Root cause: Low thresholds and no aggregation -> Fix: Raise thresholds, aggregate by ALB, add dedupe.
16) Symptom: Per-route slowdowns -> Root cause: Backend cold starts or DB contention -> Fix: Warm backends, scale the DB, add caching.
17) Symptom: Long deploy impact -> Root cause: No connection draining -> Fix: Enable connection draining before deregistration.
18) Symptom: Incorrect origin IPs in logs -> Root cause: Missing X-Forwarded-For configuration -> Fix: Ensure ALB sets X-Forwarded-For and the backend reads it.
19) Symptom: Header size errors -> Root cause: Headers exceed ALB limit -> Fix: Reduce header size or compress payloads.
20) Symptom: Security breaches -> Root cause: Weak TLS policy or open ACLs -> Fix: Harden TLS policy and restrict ingress.
21) Symptom: Inconsistent TLS ciphers for clients -> Root cause: TLS policy misconfigured -> Fix: Set supported cipher suites aligned with the client base.
22) Symptom: Slow rule evaluation -> Root cause: Too many rules with complex conditions -> Fix: Consolidate rules and use prefix matching.
23) Symptom: Backend IP changes not reflected -> Root cause: Static target registration instead of dynamic discovery -> Fix: Use service discovery or controller integration.
24) Symptom: Failed retries masking issues -> Root cause: Aggressive retries hide persistent backend failures -> Fix: Reduce automatic retries and surface errors.
25) Symptom: Observability missing request IDs -> Root cause: ALB not injecting or passing trace ID header -> Fix: Ensure header injection and propagate across services.
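Item 18 above is worth a concrete line of code: the TCP peer address the backend sees is the ALB's, so the client IP must come from X-Forwarded-For. A minimal sketch, assuming the conventional left-most-client layout and a trusted proxy chain:

```python
# X-Forwarded-For parsing sketch: the header accumulates one entry per hop
# ("client, proxy1, proxy2"), so the left-most entry is the original client
# when every intermediate proxy (including the ALB) is trusted.
def client_ip(xff_header: str) -> str:
    """Return the original client IP from an X-Forwarded-For header value."""
    return xff_header.split(",")[0].strip()
```

In environments where clients can set the header themselves, take the right-most untrusted entry instead; blindly trusting the left-most value is itself a spoofing risk.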
Observability pitfalls (at least 5):
- Missing access logs prevents accurate incident reconstruction -> Fix: Enable logs and retention.
- Tracing not propagated through ALB -> Fix: Ensure ALB forwards trace headers.
- Aggregated metrics hide per-route issues -> Fix: Create per-route SLIs.
- Sampling rates too high or too low -> Fix: Tune sampling to capture important traces without overload.
- No alert on missing telemetry -> Fix: Monitor logging pipeline health and configure alerts.
Best Practices & Operating Model
Ownership and on-call:
- Assign ALB ownership to infrastructure or platform team with SLAs for updates and incidents.
- Include ALB in on-call rotation and document escalation for DNS and cert issues.
Runbooks vs playbooks:
- Runbooks: Short prescriptive steps for known failures (TLS expiry, target flapping).
- Playbooks: Higher-level incident management patterns and cross-team coordination steps.
Safe deployments (canary/rollback):
- Use weighted routing and metrics-based automation to increase traffic to new versions.
- Trigger automatic rollback when the canary exceeds error thresholds.
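The rollback trigger can be a small pure function evaluated by deploy automation after each observation window. A sketch; the thresholds and the absolute-plus-delta rule are illustrative assumptions:

```python
# Canary rollback gate sketch: roll back if the canary errors too much
# outright, or degrades meaningfully versus the baseline version.
def should_rollback(canary_error_rate: float,
                    baseline_error_rate: float,
                    max_absolute: float = 0.05,
                    max_delta: float = 0.02) -> bool:
    """True when the canary breaches either the absolute or relative gate."""
    return (canary_error_rate > max_absolute
            or canary_error_rate - baseline_error_rate > max_delta)
```

Comparing against the baseline (not just a fixed threshold) keeps the gate meaningful when the whole system is degraded for unrelated reasons.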
Toil reduction and automation:
- Automate cert rotation, target registration, and logging configuration.
- Use infrastructure-as-code for ALB configuration and tests.
Security basics:
- Use strong TLS policies and prefer managed certificates.
- Integrate ALB with WAF and DDoS protection.
- Restrict ALB management access with least privilege roles.
Weekly/monthly routines:
- Weekly: Check ALB health, review failed health checks and spike patterns.
- Monthly: Rotate TLS certs as needed, review rule complexity, audit logging configuration.
What to review in postmortems related to ALB:
- Was ALB part of the failure chain or did it mask another failure?
- Were health checks and thresholds appropriate?
- Was observability sufficient to detect the issue early?
- Action items: improved automation, updated runbooks, changed thresholds.
Tooling & Integration Map for ALB
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Monitoring | Collects LB metrics and alerts | Cloud metrics, Prometheus | Native metrics are authoritative |
| I2 | Logging | Stores ALB access logs for analysis | Log analytics, APM | Retention impacts cost |
| I3 | Tracing | End-to-end request tracing | OpenTelemetry, APM | Requires header propagation |
| I4 | CI/CD | Automates deploys and target switches | GitOps pipelines | Use safe deploy strategies |
| I5 | WAF | Blocks malicious requests at edge | ALB WAF rules, SIEM | Tuning needed to avoid false positives |
| I6 | CDN | Caches content, reducing ALB load | Origin config, ALB | Improves cost and latency |
| I7 | IAM | Controls ALB provisioning and cert access | Cloud IAM roles | Least privilege for changes |
| I8 | Service discovery | Registers targets dynamically | DNS, Consul, Kubernetes | Prevent stale registrations |
| I9 | Chaos tools | Exercises failure modes | Chaos frameworks, monitoring | Test removal of targets and latency |
| I10 | Cost tools | Tracks ALB cost and usage | Billing dashboards | Useful for optimization |
Frequently Asked Questions (FAQs)
What exactly does ALB stand for?
ALB stands for Application Load Balancer, indicating Layer 7 routing for HTTP and WebSocket traffic.
Can ALB terminate TLS?
Yes, ALB commonly performs TLS termination and can use managed certificates or uploaded certs.
Is ALB suitable for WebSocket traffic?
Yes, ALBs typically support WebSocket with appropriate idle-timeout configuration.
Should I use ALB or an API Gateway?
Use ALB for simple routing and TLS termination; use an API gateway when you need API management features like throttling, auth, and API keys.
How do health checks work with ALB?
Health checks are configured per target group and probe endpoints to mark targets healthy or unhealthy based on consecutive successes or failures.
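The consecutive-success/failure behavior can be sketched as a small state machine; the threshold defaults below are typical values, not provider guarantees:

```python
# Health-check state machine sketch: a target flips state only after a full
# streak of probes disagreeing with its current state, which damps flapping.
class TargetHealth:
    def __init__(self, healthy_threshold: int = 3, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold    # successes to recover
        self.unhealthy_threshold = unhealthy_threshold  # failures to fail
        self.healthy = True
        self._streak = 0  # consecutive probes contradicting the current state

    def probe(self, success: bool) -> bool:
        """Feed one probe result; return the (possibly updated) health state."""
        if success == self.healthy:
            self._streak = 0  # probe agrees with current state, reset streak
        else:
            self._streak += 1
            needed = (self.unhealthy_threshold if self.healthy
                      else self.healthy_threshold)
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

This is also why overly strict thresholds cause the target-flapping pitfall listed earlier: a single slow probe can flip the state back and forth.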
Does ALB support IPv6?
Varies / depends on provider and region; check provider capabilities for IPv6 support.
Can I do sticky sessions with ALB?
Yes, ALB supports session affinity using cookies for sticky sessions.
How do I debug high p99 latency?
Trace requests end to end, break down per-route latency, and inspect backend resource metrics to find hotspots.
What happens when an ALB certificate expires?
Clients will receive TLS errors and browsers will block connections; rotate certs immediately and automate renewal.
How many rules can an ALB have?
Varies / depends on provider; check provider limits and design to minimize rule complexity.
Should ALB be public or internal?
Use a public ALB for internet-facing services and an internal ALB for private intra-VPC traffic.
How do I measure ALB availability?
Use SLIs like request success rate from client perspective and p99 latency; compute SLOs with business context.
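A minimal sketch of computing both SLIs from raw request samples (status codes and latencies); the nearest-rank percentile method is one common, simple choice:

```python
# SLI computation sketch: success rate over status codes, and p99 latency
# via the nearest-rank percentile on sorted samples.
import math

def success_rate(statuses: list[int]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    return sum(1 for s in statuses if s < 500) / len(statuses)

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank p99: the value at rank ceil(0.99 * n) in sorted order."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]
```

In production these are usually computed from pre-aggregated histograms rather than raw samples, but the definitions are the same.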
Can ALB perform rate limiting?
ALB may not provide built-in granular rate limiting; combine it with a WAF or API gateway for rate enforcement.
How do I test ALB changes safely?
Use staging environments, canary releases, and weighted routing to validate changes before full rollout.
What metrics are most important for ALB?
Request success rate, p99 latency, 5xx rate, healthy host count, TLS failures, and connection counts.
Are ALB access logs necessary?
Yes; they are essential for postmortems, security analysis, and deep debugging.
How do I reduce ALB costs?
Move static content to a CDN, consolidate rules, and maintain a high cache hit ratio to reduce ALB request volume.
Can ALB rewrite headers?
ALB can perform basic header manipulations, but advanced transformations should be handled in the application or an API gateway.
Conclusion
ALBs are a critical Layer 7 component for modern cloud architectures, providing content-aware routing, TLS termination, and integration points for observability and security. Properly instrumented and configured ALBs reduce incidents, enable safer deployments, and provide the observability needed for effective SRE practices.
Next 7 days plan:
- Day 1: Inventory ALBs, domains, and certificates; enable access logs where missing.
- Day 2: Define SLIs (success rate and p99 latency) and implement metrics collection.
- Day 3: Create executive and on-call dashboards for ALB health.
- Day 4: Implement or verify automated certificate rotation and health-check tuning.
- Day 5: Run a small canary with weighted routing for a non-critical service.
- Day 6: Conduct a game day simulating target flapping and TLS expiry.
- Day 7: Review findings and update runbooks, alerts, and automation tasks.
Appendix — ALB Keyword Cluster (SEO)
- Primary keywords
- Application Load Balancer
- ALB
- Layer 7 load balancer
- ALB tutorial
- ALB best practices
- Secondary keywords
- ALB architecture
- ALB metrics
- ALB health checks
- ALB TLS
- ALB routing rules
- Long-tail questions
- How does an application load balancer work in 2026
- How to measure ALB p99 latency
- How to configure TLS on ALB
- How to do canary releases with ALB
- How to troubleshoot ALB 504 timeouts
- What is the difference between ALB and NLB
- When to use ALB vs API gateway
- How to enable ALB access logs
- How to secure ALB with WAF
- How to set health checks for ALB target groups
- How to use ALB with Kubernetes
- How to configure WebSocket on ALB
- How to reduce ALB costs
- How to automate ALB certificate rotation
- How to implement weighted routing with ALB
- Related terminology
- Listener
- Target group
- Health check
- Listener rule
- Path based routing
- Host based routing
- Sticky sessions
- Cross zone load balancing
- Idle timeout
- Access logs
- TLS termination
- Mutual TLS
- WAF
- CDN
- Ingress controller
- Service discovery
- OpenTelemetry
- APM
- Prometheus
- Grafana
- Canary deployment
- Blue-green deployment
- Circuit breaker
- Retry policy
- Rate limiting
- Connection draining
- WebSocket support
- HTTP/2 support
- Certificate manager
- IAM roles
- Internal ALB
- Public ALB
- Autoscaling
- Cross-AZ distribution
- Access control list
- Header rewriting
- Content-based routing
- Latency histogram
- Error budget
- SLIs and SLOs
- Observability pipeline
- Trace propagation
- Load testing