Quick Definition
Semantic conventions are standardized naming and attribute rules for telemetry data that enable interoperability and consistent interpretation across tools and teams. Analogy: a shared dictionary for observability signals. More formally: a specification defining attribute keys, value types, and expected semantics for traces, metrics, and logs.
What are semantic conventions?
Semantic conventions are explicit rules and best practices that prescribe how to name, tag, and structure telemetry (traces, metrics, logs, events, and resources) so systems and people can reason about observability data consistently. They are not a specific tool, nor a runtime component; they are a contract between producers and consumers of telemetry.
What it is / what it is NOT
- It is a contract: naming, types, cardinality guidance.
- It is NOT implementation code or a vendor API.
- It is NOT a replacement for domain-specific labels or business context; it augments them.
Key properties and constraints
- Predictable attribute keys and value types.
- Guidance on cardinality to avoid cardinality explosion.
- Compatibility focus across languages and runtimes.
- Versioning and evolution constraints to avoid breaking consumers.
- Security-aware: avoid secrets and PII in attributes.
Where it fits in modern cloud/SRE workflows
- Instrumentation libraries embed conventions so telemetry is consistent.
- CI pipelines check that new telemetry follows conventions (linting).
- Observability backends map conventions to dashboards, SLIs, and alerts.
- Incident postmortems reference conventions when adding context or changing instrumentation.
Text-only diagram description
- Service A emits trace span with standardized attributes -> Telemetry collector normalizes attributes using semantic conventions -> Observability backend indexes normalized data -> Alerting rules and dashboards use known attribute keys to compute SLIs and SLOs -> Engineers debug using consistent filters and fields.
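The collector's normalization step in the diagram can be sketched as a simple key-renaming pass. This is a minimal illustration, not a real collector: the target keys follow OpenTelemetry-style HTTP naming, and the ad-hoc source keys are hypothetical examples of what inconsistent services might emit.

```python
# Sketch: mapping ad-hoc attribute names onto convention keys.
# Target keys are modeled on OpenTelemetry-style HTTP naming;
# source keys are hypothetical ad-hoc variants.
ALIASES = {
    "method": "http.request.method",
    "httpMethod": "http.request.method",
    "status": "http.response.status_code",
    "status_code": "http.response.status_code",
}

def normalize(attributes: dict) -> dict:
    """Rename known ad-hoc keys to their convention equivalents."""
    return {ALIASES.get(k, k): v for k, v in attributes.items()}

raw = {"httpMethod": "GET", "status": 200, "custom.team": "payments"}
print(normalize(raw))  # unknown keys (custom.team) pass through unchanged
```

Because every consumer downstream now filters on the same keys, dashboards and alerts no longer need per-service query variants.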
Semantic conventions in one sentence
A set of standardized attribute names, types, and usage rules that make telemetry data consistent, searchable, and interoperable across tools and teams.
Semantic conventions vs related terms
| ID | Term | How it differs from Semantic conventions | Common confusion |
|---|---|---|---|
| T1 | Schema | Schema defines data shape; conventions define naming and semantics | Often used interchangeably |
| T2 | API | API is an interface; conventions are naming rules for data sent via APIs | People expect vendor APIs to enforce conventions |
| T3 | Telemetry | Telemetry is raw data; conventions are rules applied to that data | Telemetry producers skip naming standards |
| T4 | Ontology | An ontology models concepts and relationships; conventions focus on attribute naming and types | Conventions are sometimes mislabeled as an ontology |
| T5 | Tagging | Tagging is ad-hoc labels; conventions are prescriptive tags | Teams use inconsistent tags |
| T6 | Data model | Data model is storage format; conventions are semantic layer | Model may change without updating conventions |
| T7 | Observability spec | Observability spec is broader; conventions are part of it | Overlapping scope causes confusion |
Why do semantic conventions matter?
Business impact (revenue, trust, risk)
- Faster incident resolution reduces downtime, protecting revenue.
- Consistent telemetry builds customer trust through reliable SLA reporting.
- Poor conventions lead to missed regulatory signals and data leakage risk.
Engineering impact (incident reduction, velocity)
- Engineers spend less time interpreting fields; they search and filter faster.
- Reusable dashboards and alerts across services accelerate new deployments.
- Reduced debugging toil increases developer velocity.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SLIs rely on known attribute keys to compute correctness and latency.
- SLOs are meaningful only if instrumentation is consistent across services.
- Error budgets and automated burn-rate policies depend on reliable telemetry semantics.
- Well-specified conventions reduce on-call cognitive load and toil of noisy alerts.
3–5 realistic “what breaks in production” examples
- Broken dependency mapping: A service emits inconsistent dependency attributes; topology views miss edges and incidents are routed to the wrong team.
- Alert flapping: High-cardinality user_id fields cause cardinality explosion; alerts generate massive noise and paging.
- SLA reporting mismatch: Two teams use different keys for “request.duration”; monthly SLA reports disagree with billing.
- Security leak: Unstructured log fields contain PII due to lack of convention forbidding secrets.
- Cost spike: Uncontrolled high-cardinality telemetry increases storage and query costs.
Where are semantic conventions used?
| ID | Layer/Area | How Semantic conventions appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / CDN | Standardized request attributes for client info | Traces, access logs | CDN logs and telemetry |
| L2 | Network | Labels for network hops and protocols | Network metrics and traces | Net observability tools |
| L3 | Service / API | Span attributes for endpoints and methods | Traces, metrics, logs | Tracing SDKs and APMs |
| L4 | Application | Standard field names for business ops | Application logs and custom metrics | Logging libs and metrics SDKs |
| L5 | Data layer | Conventions for DB calls and cache keys | DB spans and latency metrics | SQL tracing, DB APM |
| L6 | Kubernetes | Pod, container, and k8s resource attributes | Pod metrics and traces | K8s metadata collectors |
| L7 | Serverless / Functions | Cold-start, invocation attributes standardized | Traces, metrics, logs | Function runtimes |
| L8 | CI/CD | Build and deploy attributes standardized | Event logs and traces | CI systems and pipelines |
| L9 | Observability | Ingestion mapping and normalization rules | All telemetry types | Collectors and backends |
| L10 | Security / Audit | Standard fields for auth and identity events | Audit logs and security traces | SIEM and security tooling |
When should you use Semantic conventions?
When it’s necessary
- Cross-team metrics, SLIs, and SLOs depend on consistent attribute names.
- Multi-tenant systems where tenant and customer identifiers must be uniform.
- Regulatory or compliance reporting where fields must be auditable.
When it’s optional
- Single small project with few services and a single owner.
- Very early prototypes where speed matters over observability consistency.
When NOT to use / overuse it
- Avoid adding high-cardinality identifiers as convention defaults.
- Do not store secrets, raw PII, or full payloads as standard attributes.
- Don’t force conventions that conflict with essential domain-specific labels.
Decision checklist
- If multiple teams and shared dashboards -> adopt conventions.
- If you require automated SLOs across services -> adopt conventions.
- If single-owner experimental microservice -> lightweight convention optional.
- If telemetry cardinality unknown -> iterate and add cardinality guards.
Maturity ladder: Beginner -> Intermediate -> Advanced
- Beginner: Adopt a minimal set of attributes for HTTP and DB calls.
- Intermediate: Enforce via CI checks, central registry, and collectors.
- Advanced: Automated schema enforcement, telemetry transformations, cross-service contract testing, and versioned conventions with migration tooling.
How do semantic conventions work?
Components and workflow
- Specification: A human-readable and machine-parseable document that lists keys, types, cardinality, and examples.
- Instrumentation libraries: SDKs and auto-instrumentation implement conventions by emitting attributes.
- Collector/ingestion: Agents normalize incoming telemetry and map non-conforming keys.
- Backend mapping: Observability backends index and expose fields as standardized filters.
- CI/CD validation: Linting and tests check new instrumentation against the spec.
- Monitoring & alerts: Alerts and dashboards reference the standardized keys.
- Governance: Change control and versioning for convention updates.
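The specification and CI-validation components above can be sketched together: a machine-parseable spec plus a lint function a pipeline might run against emitted attributes. The keys and rules here are illustrative, not an official schema.

```python
# Sketch: a minimal machine-readable convention spec and a CI-style lint.
# Keys and rules are illustrative, not an official schema.
SPEC = {
    "service.name": {"type": str, "required": True},
    "http.request.method": {"type": str, "required": False},
    "tenant.id": {"type": str, "required": False},
}

def lint(attributes: dict) -> list[str]:
    """Return a list of spec violations for one telemetry record."""
    problems = []
    for key, rule in SPEC.items():
        if rule["required"] and key not in attributes:
            problems.append(f"missing required attribute: {key}")
    for key, value in attributes.items():
        rule = SPEC.get(key)
        if rule and not isinstance(value, rule["type"]):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems

print(lint({"http.request.method": 42}))
```

A CI job would fail the build on any non-empty result, giving producers feedback before non-conforming telemetry reaches the collector.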
Data flow and lifecycle
- Instrumentation emits telemetry -> Local SDK attaches resource attributes -> Collector receives and normalizes -> Backend stores events and indexes attributes -> Consumers (dashboards, alerts, automation) query using convention keys -> Feedback leads to spec updates and CI checks.
Edge cases and failure modes
- Unknown attribute keys: Collector can tag and store them but may be unmapped.
- Cardinality spikes: High-cardinality keys cause indexing costs and query slowdowns.
- Version drift: Older SDKs use deprecated keys leading to fragmented queries.
- Security leakage: Sensitive data accidentally emitted due to developer error.
Typical architecture patterns for Semantic conventions
- Agent + Collector normalization – Use when you run many language runtimes and want normalization responsibility centralized.
- SDK-first enforcement with CI linting – Use when teams control code and want early validation.
- API Gateway normalization – Use at the edge to ensure upstream services see uniform request attributes.
- Sidecar telemetry adapter – Use in Kubernetes when you prefer no-code changes in application pods.
- Event-stream normalization – Use for asynchronous pipelines where events need standardized metadata.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | High cardinality | Slow queries and cost spike | Per-request IDs in attributes | Strip or hash identifiers | Query latency and storage increase |
| F2 | Inconsistent keys | Dashboards show gaps | Multiple naming patterns | Centralize and map keys | Missing spans in topology |
| F3 | PII leakage | Compliance alert | Developers logging secrets | Block and redaction rules | Security log alerts |
| F4 | Version drift | New attributes not found | Old SDK versions | Enforce CI checks | Diverging attribute counts |
| F5 | Collector overload | Dropped telemetry | Unbounded ingestion | Rate limit and sampling | Ingestion error metrics |
| F6 | Alert noise | Frequent paging | Incorrect SLIs or keys | Tune SLOs and dedupe | High pager frequency |
Key Concepts, Keywords & Terminology for Semantic conventions
Below is an extensive glossary of terms relevant to semantic conventions. Each entry lists a short definition, why it matters, and a common pitfall.
- Attribute — A key-value pair on telemetry — Enables filtering and grouping — Pitfall: high cardinality.
- Tag — Synonym for attribute in some systems — Simple metadata on spans/metrics — Pitfall: inconsistent naming.
- Label — Another synonym used by metrics systems — Important for aggregation — Pitfall: labels differing by case.
- Resource — Identity of the emitting entity — Critical for ownership and context — Pitfall: missing service name.
- Span — Unit of work in tracing — Core for distributed traces — Pitfall: incomplete parent-child relations.
- Trace — Collection of spans for a request flow — Shows end-to-end latency — Pitfall: broken context propagation.
- Metric — Numeric time-series data — Useful for SLOs and trends — Pitfall: misinterpreted units.
- Log — Time-stamped event record — Good for debug context — Pitfall: unstructured logs with secrets.
- Semantic layer — The logical mapping of keys to meaning — Makes data interpretable — Pitfall: no formal spec.
- Cardinality — Number of unique values for a key — Drives cost and query performance — Pitfall: unguarded user_id tags.
- Sampling — Reducing telemetry by selecting subset — Manages volume — Pitfall: biased sampling.
- Normalization — Converting variations into standard form — Enables unified queries — Pitfall: lossy transformations.
- Auto-instrumentation — Runtime libraries that add telemetry automatically — Speeds adoption — Pitfall: lacks domain context.
- Manual instrumentation — Developer-added telemetry points — Precise but laborious — Pitfall: inconsistent naming.
- Ingestion pipeline — Collector and processors handling telemetry — Central control point — Pitfall: single point of failure.
- Indexing — Storing attributes for fast search — Improves queries — Pitfall: cost of indexing high-cardinality fields.
- Schema evolution — Changing spec over time — Needed for improvements — Pitfall: poor versioning plan.
- Contract testing — Tests ensuring producers match the spec — Catches regressions — Pitfall: missing tests in CI.
- Linting — Automated checks on code for conformance — Early feedback — Pitfall: false positives.
- Redaction — Removing sensitive fields from telemetry — Required for privacy — Pitfall: over-redaction losing context.
- Hashing — Pseudonymize identifiers — Balances traceability and privacy — Pitfall: weak hashing causing collisions.
- Sampling rate — The percentage of telemetry collected — Balances cost and fidelity — Pitfall: setting too low for error detection.
- SLI — Service Level Indicator — Measures user-facing behavior — Pitfall: poorly defined SLI means misleading SLOs.
- SLO — Service Level Objective — Target for SLI performance — Pitfall: unrealistic targets causing alert fatigue.
- Error budget — Allowance for SLO violations — Drives release decisions — Pitfall: miscalculated budget.
- Burn rate — Speed of error budget consumption — Used for escalation — Pitfall: improper thresholds.
- Observability lineage — Mapping from instrumentation to dashboards — Helps governance — Pitfall: stale lineage docs.
- Field mapping — Translating vendor fields to conventions — Ensures compatibility — Pitfall: mapping ambiguity.
- Telemetry contract — A machine-readable spec of emitted fields — Enables automated checks — Pitfall: missing ownership.
- Collector processor — Component that transforms telemetry — Enforces conventions at ingest — Pitfall: misconfiguration.
- Topology — Graph of service dependencies — Key for incident routing — Pitfall: incomplete or noisy edges.
- Context propagation — Passing trace IDs across boundaries — Required for distributed tracing — Pitfall: dropped headers.
- Sampling bias — When sampling skews representativeness — Misleads SLI calculations — Pitfall: sampling per-route inconsistently.
- Observability pipeline cost — Total cost of storing and querying telemetry — Needs governance — Pitfall: uncontrolled retention.
- Metric aggregation keys — Labels used to aggregate metrics — Defines SLI granularity — Pitfall: too-fine aggregation.
- Event enrichment — Adding metadata to events in pipeline — Adds value — Pitfall: late enrichment losing raw context.
- Telemetry contract registry — Central store of specs — Source of truth — Pitfall: not synced with code.
- Auto-remediation — Automation driven by telemetry semantics — Reduces toil — Pitfall: unsafe automations.
- Privacy-safe telemetry — Conventions to avoid PII — Compliance support — Pitfall: PII slipping into free-text message fields.
- Observability maturity — Level of tooling/process sophistication — Guides roadmap — Pitfall: skipping foundational steps.
- Vendor-neutral conventions — Standards that work across backends — Prevents lock-in — Pitfall: tooling specific extensions.
- Attribute typing — Declaring value types for attributes — Prevents misinterpretation — Pitfall: inconsistent types across services.
- Metric units — Standard units like ms, bytes — Critical for correct aggregation — Pitfall: unit mismatch in dashboards.
- Sampling decision — Determined by SDK or collector — Affects trace completeness — Pitfall: inconsistent sampling across services.
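Hashing and redaction, both defined above, are often combined into one scrubbing pass before export. A minimal sketch, assuming hypothetical field lists; a real pipeline would load these from the convention spec:

```python
import hashlib

# Sketch: pseudonymize high-cardinality identifiers and redact
# sensitive fields before export. Field names are hypothetical.
SENSITIVE = {"user.email", "auth.token"}
HASHED = {"user.id"}

def scrub(attributes: dict) -> dict:
    out = {}
    for key, value in attributes.items():
        if key in SENSITIVE:
            out[key] = "[REDACTED]"      # drop the value entirely
        elif key in HASHED:
            # keep traceability without storing the raw identifier
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            out[key] = value
    return out

print(scrub({"user.email": "a@b.example", "user.id": "42", "http.route": "/x"}))
```

Note the trade-off named in the glossary: truncated hashes risk collisions, while over-redaction loses debugging context.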
How to Measure Semantic conventions (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Attribute coverage | Percent services using key X | Count services emitting key / total | 90% initially | Some services offline |
| M2 | Attribute consistency | Ratio of services using same type | Check type against spec | 95% | Type coercion in SDKs |
| M3 | Cardinality per attribute | Unique values per time window | Count unique values per key per window | Set an explicit per-attribute budget | Bursts during batch jobs |
| M4 | Missing fields rate | Fraction of telemetry missing required keys | Missing / total events | <1% | Instrumentation deploys lag |
| M5 | Telemetry ingestion error rate | Failures during ingestion | Error count / total | <0.1% | Collector misconfig |
| M6 | SLI validity coverage | Percent SLIs backed by convention keys | SLIs with standard keys / total | 100% | Legacy SLIs |
| M7 | Alert noise rate | Pager events per incident | Pager count / time window | Team-specific; trend downward | Bad SLO definitions |
| M8 | Storage cost per metric | Billing per attribute group | Cost allocation metrics | Reduce over time | Hidden backend costs |
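Metric M3 (cardinality per attribute) can be computed with a sketch like this over a window of events; the event shapes are illustrative.

```python
from collections import defaultdict

# Sketch: per-attribute cardinality over a window of events (metric M3).
# Event shapes are illustrative.
def cardinality(events: list[dict]) -> dict[str, int]:
    seen = defaultdict(set)
    for event in events:
        for key, value in event.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}

events = [
    {"http.route": "/orders", "user.id": "u1"},
    {"http.route": "/orders", "user.id": "u2"},
    {"http.route": "/cart",   "user.id": "u3"},
]
print(cardinality(events))  # -> {'http.route': 2, 'user.id': 3}
```

Even in this tiny window, `user.id` grows with the user population while `http.route` stays bounded, which is exactly the distinction a cardinality budget guards.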
Best tools to measure Semantic conventions
Tool — OpenTelemetry
- What it measures for Semantic conventions: instrumentation coverage and standardized attributes.
- Best-fit environment: heterogeneous language environments and cloud-native stacks.
- Setup outline:
- Add SDK and auto-instrumentation to services.
- Configure resource attributes per service.
- Enable attribute and span processors.
- Connect to a collector for normalization.
- Run CI contract checks.
- Strengths:
- Vendor-neutral and wide language support.
- Extensible processors for normalization.
- Limitations:
- Does not provide backend storage; requires collector/backend.
- Spec evolution needs governance.
Tool — Prometheus
- What it measures for Semantic conventions: metric label consistency and cardinality.
- Best-fit environment: metrics-heavy, Kubernetes-native systems.
- Setup outline:
- Export metrics with consistent label names.
- Use recording rules for SLOs.
- Run metric linting in CI.
- Monitor label cardinalities.
- Strengths:
- Efficient time-series store for metrics.
- Strong ecosystem for alerts.
- Limitations:
- Not for traces; labels can cause high cardinality costs.
- Single-node metrics scraping patterns can miss ephemeral workloads.
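The "metric linting in CI" step in the setup outline might look like the following sketch. The snake_case rule and the banned high-cardinality list are assumptions of this example, not Prometheus requirements.

```python
import re

# Sketch: CI-style lint for metric label names. The snake_case rule and
# banned list are illustrative policy choices, not Prometheus rules.
LABEL_RE = re.compile(r"^[a-z][a-z0-9_]*$")
BANNED = {"user_id", "request_id"}  # known high-cardinality keys

def lint_labels(labels: list[str]) -> list[str]:
    problems = []
    for label in labels:
        if label in BANNED:
            problems.append(f"high-cardinality label not allowed: {label}")
        elif not LABEL_RE.match(label):
            problems.append(f"label not snake_case: {label}")
    return problems

print(lint_labels(["service", "http_route", "user_id", "HTTPRoute"]))
```

Running this against exporter definitions in CI catches label drift before it reaches the time-series store, where it is much more expensive to undo.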
Tool — Grafana
- What it measures for Semantic conventions: dashboard panels consuming convention keys.
- Best-fit environment: multi-data-source dashboards for ops and executives.
- Setup outline:
- Create templated panels using standard keys.
- Share dashboards and version them.
- Use provisioning to enforce dashboard standards.
- Strengths:
- Flexible visualization and templating.
- Good for cross-team dashboards.
- Limitations:
- Dashboards can diverge without governance.
- Query languages vary across backends.
Tool — Jaeger / Tempo
- What it measures for Semantic conventions: trace completeness and span attribute consistency.
- Best-fit environment: distributed tracing across microservices.
- Setup outline:
- Collect traces via OpenTelemetry.
- Ensure context propagation across services.
- Instrument key spans with convention attributes.
- Strengths:
- Good trace visualizations and dependency graphs.
- Low overhead tracing options.
- Limitations:
- Storage costs and sampling configuration needed.
- Deep analysis needs complementary logs/metrics.
Tool — SIEM / Security tools
- What it measures for Semantic conventions: audit and security attribute compliance.
- Best-fit environment: teams requiring security and compliance evidence.
- Setup outline:
- Map audit fields to convention keys.
- Perform ingestion-time redaction.
- Create alerts for PII leakage.
- Strengths:
- Centralized security investigations.
- Correlation of telemetry with security events.
- Limitations:
- High volume can be costly.
- Requires careful privacy controls.
Recommended dashboards & alerts for Semantic conventions
Executive dashboard
- Panels: Top-level telemetry coverage %, SLO compliance summary, Cost trend of telemetry, High-impact missing keys.
- Why: High-level health, cost, and compliance visibility.
On-call dashboard
- Panels: Current SLO burn rate, Top missing or inconsistent attributes, Recent topology changes, Active alerts grouped by service.
- Why: Fast triage and ownership.
Debug dashboard
- Panels: Recent traces without required keys, Attribute cardinality heatmaps, Raw logs with context, Per-service attribute coverage.
- Why: Deep debugging and instrumentation verification.
Alerting guidance
- What should page vs ticket:
- Page: SLO burn-rate escalation, collector down, ingestion failures causing data loss.
- Ticket: Missing non-critical attributes, minor drops in attribute coverage.
- Burn-rate guidance:
- Page at sustained burn rate >4x for critical SLOs.
- Use progressive thresholds for paging vs ticketing.
- Noise reduction tactics:
- Dedupe by fingerprinting root causes.
- Group alerts by service and top-level cause.
- Suppress during deploy windows if expected.
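The burn-rate guidance above can be sketched as a paging decision. The 4x page threshold mirrors the text; the ticket threshold and the SLO numbers are illustrative.

```python
# Sketch: page-vs-ticket decision from SLO burn rate. The >4x page
# threshold follows the guidance above; other numbers are illustrative.
def burn_rate(errors: int, requests: int, error_budget: float) -> float:
    """Observed error rate as a multiple of the budgeted rate."""
    if requests == 0:
        return 0.0
    return (errors / requests) / error_budget

def action(rate: float) -> str:
    if rate > 4.0:
        return "page"    # sustained fast burn: wake someone up
    if rate > 1.0:
        return "ticket"  # budget eroding, but not an emergency
    return "ok"

# 99.9% SLO -> 0.1% error budget; 60 errors in 10_000 requests ~ 6x burn
print(action(burn_rate(60, 10_000, 0.001)))
```

Real burn-rate alerting evaluates this over multiple windows (e.g., a short and a long window together) to avoid paging on brief spikes.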
Implementation Guide (Step-by-step)
1) Prerequisites
- Inventory services and languages.
- Define minimal required attributes and cardinality limits.
- Choose telemetry transport and backend.
- Assign convention owners and reviewers.
2) Instrumentation plan
- Identify critical paths and business transactions.
- Choose auto vs manual instrumentation per component.
- Define attribute naming for each service and domain.
3) Data collection
- Deploy OpenTelemetry SDKs and collectors.
- Configure processors for normalization, redaction, and sampling.
- Validate ingestion and indexing.
4) SLO design
- Map SLIs to standardized attributes.
- Define SLO windows and error budgets.
- Use aggregate keys compatible with conventions.
5) Dashboards
- Build templates referencing convention keys.
- Use templated variables for service selection.
- Create baseline panels for coverage and cardinality.
6) Alerts & routing
- Create alert rules referencing convention-backed SLIs.
- Route by ownership using resource attributes.
- Implement dedupe and grouping.
7) Runbooks & automation
- Provide runbooks that reference specific attribute keys.
- Script common remediations based on telemetry semantics.
- Automate rollbacks and canaries tied to SLO breaches.
8) Validation (load/chaos/game days)
- Run load tests to validate metric aggregation and cardinality.
- Execute chaos experiments to verify topology and tracing.
- Run game days simulating missing keys and collector failures.
9) Continuous improvement
- Collect feedback and measure attribute coverage trends.
- Iterate the spec and enforce it via CI.
- Rotate sensitive fields and refine cardinality rules.
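The CI enforcement this guide calls for can take the form of a contract test: assert that a span of a given kind carries the registry's required keys. The registry contents below are illustrative, modeled on OpenTelemetry-style naming rather than an official schema.

```python
# Sketch: a CI contract test that an emitted span carries the registry's
# required keys for its kind. Registry contents are illustrative,
# modeled on OpenTelemetry-style naming.
REGISTRY = {
    "http.server": {"http.request.method", "http.route", "http.response.status_code"},
    "db.client": {"db.system", "db.operation.name"},
}

def check_contract(kind: str, attributes: dict) -> set[str]:
    """Return the required keys that are missing for this span kind."""
    return REGISTRY.get(kind, set()) - attributes.keys()

span = {"http.request.method": "GET", "http.route": "/orders"}
print(check_contract("http.server", span))  # the status code is missing
```

In practice the registry would be loaded from the central spec (the "telemetry contract registry" in the glossary) so tests and production stay in sync.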
Checklists
Pre-production checklist
- Inventory done and owners assigned.
- SDKs and auto-instrumentation working locally.
- Minimal spec approved and published.
- CI linting added for new instrumentation.
Production readiness checklist
- Collector scaling tested.
- Dashboards and alerts validated end-to-end.
- On-call runbooks include attribute-driven plays.
- Redaction rules in place for PII.
Incident checklist specific to Semantic conventions
- Verify collector health and queue depths.
- Check recent deploys for instrumentation changes.
- Inspect SLI sources and attribute keys.
- Roll back instrumentation changes if causing noise.
Use Cases of Semantic conventions
- Cross-service SLOs
  - Context: Multi-service transactions spanning several services.
  - Problem: Aggregating latency across services requires common keys.
  - Why it helps: Standardized request id and endpoint keys enable full-chain SLIs.
  - What to measure: End-to-end latency, per-service latency, error rates.
  - Typical tools: Tracing SDKs, collector, metrics store.
- Multi-tenant billing reconciliation
  - Context: Usage attribution per tenant.
  - Problem: Inconsistent tenant_id naming breaks billing reports.
  - Why it helps: A single tenant attribute ensures correct cost allocation.
  - What to measure: Requests per tenant, compute usage, errors.
  - Typical tools: Metrics store, event pipelines.
- Security auditing
  - Context: Authentication and access logs for compliance.
  - Problem: Missing standardized user and auth fields hamper audits.
  - Why it helps: Standard audit attributes make queries automated and auditable.
  - What to measure: Auth success/failure counts, privilege changes.
  - Typical tools: SIEM, audit log pipeline.
- Dependency mapping
  - Context: Visualize service-to-service calls.
  - Problem: Inconsistent peer.service or db.instance attributes hide edges.
  - Why it helps: Conventions for peer attributes produce accurate topology.
  - What to measure: Call counts, latencies, error rates per dependency.
  - Typical tools: Tracing backend, topology visualizer.
- Cost control
  - Context: Observability cost management.
  - Problem: High-cardinality metrics balloon storage costs.
  - Why it helps: Conventions prescribe cardinality limits and hashing strategies.
  - What to measure: Cardinality per label, storage per dataset.
  - Typical tools: Cost allocation tools, metrics store.
- Automated runbook triggers
  - Context: Automatic remediation based on telemetry.
  - Problem: Alerts lack structured fields to automate recovery.
  - Why it helps: Standardized fields provide reliable automation inputs.
  - What to measure: Success rate of automated remediations.
  - Typical tools: Automation platform, alerting system.
- On-call handoffs
  - Context: Clear ownership during incidents.
  - Problem: Without standard service or team attributes, routing is delayed.
  - Why it helps: Resource attributes include team/contact for quick routing.
  - What to measure: Mean time to owner identification.
  - Typical tools: Incident management tools.
- Feature rollout monitoring
  - Context: Canary releases and experiments.
  - Problem: Can't filter by feature flag consistently.
  - Why it helps: A standard feature flag attribute allows targeted SLOs.
  - What to measure: Error rates by feature flag, performance deltas.
  - Typical tools: Tracing, metrics, experimentation platform.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes microservice debugging
Context: A Kubernetes cluster with multiple microservices written in different languages.
Goal: Consistent tracing and metric labels across pods for faster root-cause analysis.
Why semantic conventions matter here: Service, pod, and container attributes allow portable dashboards and ownership tagging.
Architecture / workflow: Apps instrumented with OpenTelemetry SDKs -> Collector sidecar normalizes attributes -> Backend shows topology and SLOs.
Step-by-step implementation:
- Agree on service and deployment attribute keys.
- Add the OpenTelemetry SDK or a sidecar to pods.
- Configure the collector to add k8s resource attributes.
- Add CI linting to ensure resource attributes are present.
- Build a debug dashboard for missing attributes.
What to measure: Span coverage, attribute coverage per pod, cardinality of pod labels.
Tools to use and why: OpenTelemetry, Kubernetes metadata collectors, Grafana.
Common pitfalls: Using pod name as a high-cardinality metric label.
Validation: Run a job that emits traces and confirm dashboards show the pod and service attributes.
Outcome: Faster MTTD due to consistent metadata.
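The resource-attribute step above can be sketched without any SDK dependency: read pod metadata from environment variables, as the Kubernetes Downward API commonly exposes it. The variable names here are assumptions of this example.

```python
import os

# Sketch: building Kubernetes resource attributes from environment
# variables (Downward API pattern). Variable names are illustrative.
def k8s_resource() -> dict:
    return {
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "unknown_service"),
        "k8s.namespace.name": os.environ.get("K8S_NAMESPACE", "default"),
        "k8s.pod.name": os.environ.get("K8S_POD_NAME", ""),
    }

print(k8s_resource())
```

Attaching these keys at the resource level means every span, metric, and log from the pod carries them, which is what makes ownership routing and topology views work.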
Scenario #2 — Serverless function SLOs (managed PaaS)
Context: Serverless functions across accounts and regions with managed logging/metrics.
Goal: Compute SLOs across functions with consistent cold-start and request attributes.
Why semantic conventions matter here: Functions must emit the same invocation and error attributes to align SLOs.
Architecture / workflow: Function runtime emits metrics with convention keys -> Central collector normalizes -> SLOs computed in metrics backend.
Step-by-step implementation:
- Define function invocation attributes: function.name, cold_start, memory_alloc.
- Add a wrapper layer or middleware to inject the standard attributes.
- Ensure redaction for input payloads.
- Aggregate metrics by function.name for SLOs.
What to measure: Invocation latency, cold-start rate, errors.
Tools to use and why: Function monitoring, metrics store, collector.
Common pitfalls: Provider-managed logs adding non-standard keys.
Validation: Simulate bursts to observe cold-start rates and SLOs.
Outcome: Unified SLO reporting across the serverless fleet.
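The wrapper step above might be sketched as a decorator. Telemetry is collected into a plain list for illustration; the attribute names (function.name, cold_start) come from this scenario, and a real function would export to a collector instead.

```python
import functools
import time

# Sketch: middleware injecting standard invocation attributes.
# EMITTED stands in for a real exporter; names follow this scenario.
EMITTED = []
_cold = {"value": True}  # process-level flag: first call is the cold start

def instrumented(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        cold, _cold["value"] = _cold["value"], False
        try:
            return fn(*args, **kwargs)
        finally:
            EMITTED.append({
                "function.name": fn.__name__,
                "cold_start": cold,
                "duration_ms": (time.monotonic() - start) * 1000,
            })
    return wrapper

@instrumented
def handler(event):
    return {"ok": True}

handler({}); handler({})
print(EMITTED[0]["cold_start"], EMITTED[1]["cold_start"])  # True False
```

Because every function in the fleet emits the same keys, SLOs can be aggregated by function.name without per-function query logic.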
Scenario #3 — Incident response and postmortem
Context: Production outage where dependency mapping failed.
Goal: Improve instrumentation so future incidents are trackable and owned.
Why semantic conventions matter here: The postmortem identifies missing peer.service attributes as a root cause.
Architecture / workflow: During the incident, on-call uses traces but topology is missing. The postmortem updates conventions and CI checks.
Step-by-step implementation:
- Reconstruct the incident using available telemetry.
- Identify missing attributes and add them to the spec.
- Implement SDK changes and CI contract tests.
- Run a game day to validate.
What to measure: Time to map service dependencies before vs after.
Tools to use and why: Tracing backend, CI test suites.
Common pitfalls: Not enforcing changes in CI, leading to drift.
Validation: Conduct postmortem follow-up verification.
Outcome: Reduced time-to-owner and clearer topology in the next incident.
Scenario #4 — Cost vs performance trade-off
Context: Storage costs rising due to high-cardinality logs and metrics.
Goal: Reduce observability cost while retaining actionable detail.
Why semantic conventions matter here: Conventions define which keys are high-cardinality and how to hash or sample them.
Architecture / workflow: Inventory attributes -> Classify cardinality -> Apply hashing and sampling rules in the collector -> Recompute SLOs to ensure fidelity.
Step-by-step implementation:
- Measure cardinality per attribute.
- Decide a hashing strategy for identifiers.
- Implement sampling on verbose logs and traces.
- Validate SLI accuracy under sampling.
What to measure: Storage cost, SLI degradation, cardinality metrics.
Tools to use and why: Metrics store, cost allocation, collector processors.
Common pitfalls: Over-aggressive sampling hiding rare errors.
Validation: Run price-performance scenarios and roll back if SLOs fail.
Outcome: Reduced costs with acceptable SLI fidelity.
Common Mistakes, Anti-patterns, and Troubleshooting
Each entry below lists symptom -> root cause -> fix.
- Symptom: Missing traces in topology -> Root cause: Context propagation broken -> Fix: Ensure trace headers pass through proxies.
- Symptom: Dashboards show empty metrics -> Root cause: Wrong attribute name in query -> Fix: Align queries to conventions.
- Symptom: Alert storms -> Root cause: High-cardinality label on SLI -> Fix: Remove label from SLI or aggregate.
- Symptom: Storage spike -> Root cause: Unbounded logging of request bodies -> Fix: Redact payloads and limit retention.
- Symptom: Compliance flag -> Root cause: PII emitted in attributes -> Fix: Enforce redaction and privacy checks.
- Symptom: CI lint failures -> Root cause: New instrumentation not following the spec -> Fix: Update code or spec and review.
- Symptom: False negatives in SLO -> Root cause: Sampling removed error traces -> Fix: Adjust sampling for error paths.
- Symptom: Multiple dashboards with different names -> Root cause: No dashboard templating -> Fix: Create shared templates.
- Symptom: High query latency -> Root cause: Too many indexed high-cardinality fields -> Fix: Reduce indexed attributes.
- Symptom: People use ad-hoc tags -> Root cause: No registry or governance -> Fix: Publish registry and require approvals.
- Symptom: Broken ownership routing -> Root cause: Missing team attribute -> Fix: Add standardized team/resource attributes.
- Symptom: Inconsistent units -> Root cause: Metrics emitted in ms and s -> Fix: Standardize units in spec.
- Symptom: Unreproducible tests -> Root cause: Instrumentation varies between environments -> Fix: Use test fixtures enforcing conventions.
- Symptom: Collector crashes -> Root cause: Unbounded attribute sizes -> Fix: Enforce max attribute size and sampling.
- Symptom: Alerts during deploys -> Root cause: Deployment-induced metric changes -> Fix: Use suppression windows and canary SLOs.
- Symptom: Missing dependency edges -> Root cause: Auto-instrumentation filtered certain frameworks -> Fix: Add explicit span instrumentation.
- Symptom: Long on-call escalations -> Root cause: Lack of standardized runbooks referencing attributes -> Fix: Build runbooks tied to keys.
- Symptom: Incorrect billing -> Root cause: tenant_id named inconsistently across services -> Fix: Centralize the tenant attribute and migrate data.
- Symptom: Queries failing in dashboards -> Root cause: Attribute type changed -> Fix: Migrate or cast types and update spec.
- Symptom: Slow search responses -> Root cause: Non-indexed but frequently queried fields -> Fix: Index required fields or change queries.
- Symptom: Observability blind spots -> Root cause: Sampling policy excludes important flows -> Fix: Create guarantees for critical paths.
- Symptom: Conflicting conventions between vendors -> Root cause: Vendor-specific extensions not mapped -> Fix: Map vendor fields to vendor-neutral keys.
- Symptom: Runaway cost from trace retention -> Root cause: No retention policy per data class -> Fix: Implement tiered retention and aggregation.
- Symptom: Alerts lack context -> Root cause: Missing correlation IDs in logs -> Fix: Add request IDs to logs and spans.
- Symptom: Debugging slow due to lack of context -> Root cause: Incomplete enrichment pipeline -> Fix: Enrich events earlier in pipeline.
Observability-specific pitfalls included above: cardinality, sampling bias, missing context propagation, inconsistent units, and redaction mistakes.
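Several of the fixes above (payload redaction, bounded attribute sizes) can be enforced centrally in a collector-style processor rather than in every service. A minimal Python sketch, assuming a hypothetical deny-list and size limit that a real deployment would load from a governed conventions registry:

```python
import hashlib

# Hypothetical deny-list and limit; real values would come from the
# organization's conventions registry, not be hard-coded.
SENSITIVE_KEYS = {"http.request.header.authorization", "user.email"}
MAX_ATTR_LEN = 256

def sanitize_attributes(attrs: dict) -> dict:
    """Redact sensitive keys and cap attribute sizes before export."""
    out = {}
    for key, value in attrs.items():
        if key in SENSITIVE_KEYS:
            # Replace the raw value with a short stable hash so
            # correlation stays possible without exposing PII.
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        elif isinstance(value, str) and len(value) > MAX_ATTR_LEN:
            # Truncation guards the collector against unbounded payloads.
            out[key] = value[:MAX_ATTR_LEN] + "[truncated]"
        else:
            out[key] = value
    return out
```

The same shape of transform can run as a collector processor so producers that forget the rules are still corrected before data reaches the backend.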
Best Practices & Operating Model
Ownership and on-call
- Assign a conventions owner and a data steward per team.
- Include an observability on-call rotation covering collector and indexer health.
- Define escalation paths tied to resource attributes.
Runbooks vs playbooks
- Runbooks: step-by-step remediation referencing convention keys.
- Playbooks: higher-level decision flows and owner mappings.
Safe deployments (canary/rollback)
- Use canaries to validate new instrumentation.
- Automate rollback triggers based on SLO burn rates.
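The rollback trigger above reduces to a burn-rate check over the canary's error budget. A sketch assuming a simple error-count SLI; the fast-burn threshold of 14 is a commonly cited default, not a value from this document:

```python
def should_roll_back(errors: int, total: int, slo_target: float,
                     max_burn_rate: float = 14.0) -> bool:
    """Return True when the canary's error-budget burn rate exceeds the limit.

    burn_rate = observed error ratio / allowed error ratio. A burn rate
    of 14 over a short window is a common fast-burn alert threshold.
    """
    if total == 0:
        return False  # no traffic yet; nothing to judge
    allowed = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    observed = errors / total
    return observed / allowed > max_burn_rate
```

Wiring this into the deploy pipeline turns "automate rollback" into a single comparison per evaluation window.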
Toil reduction and automation
- Automate telemetry contract checks in CI.
- Auto-remediate common collector issues with safeguards.
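A telemetry contract check of the kind CI would run can be a few lines of code. A sketch with a hypothetical contract of required keys and expected types; a real check would load the contract from the published registry:

```python
# Hypothetical contract: required attribute keys and their expected types.
CONTRACT = {
    "http.request.method": str,
    "http.response.status_code": int,
    "service.name": str,
}

def lint_attributes(attrs: dict) -> list:
    """Return a list of contract violations for one emitted attribute set."""
    problems = []
    for key, expected in CONTRACT.items():
        if key not in attrs:
            problems.append(f"missing required attribute: {key}")
        elif not isinstance(attrs[key], expected):
            problems.append(
                f"wrong type for {key}: expected {expected.__name__}, "
                f"got {type(attrs[key]).__name__}")
    return problems
```

Failing the build on a non-empty list gives producers the early feedback loop described in the tooling table below.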
Security basics
- Prohibit secrets in attributes; enforce redaction.
- Use hashing and tokenization for identifiers.
- Audit telemetry for PII regularly.
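Hashing identifiers is safer with a keyed hash, since a plain hash of a low-entropy value (an email or phone number) is vulnerable to dictionary attacks. A sketch using Python's standard hmac module; the pepper value here is a placeholder that would come from a secrets manager:

```python
import hashlib
import hmac

# Placeholder secret: in production, load from a secrets manager
# and rotate on a schedule.
PEPPER = b"rotate-me-regularly"

def tokenize(identifier: str) -> str:
    """Return a stable, non-reversible token for a user or tenant ID.

    HMAC with a secret pepper resists the dictionary attacks that a
    plain hash of a guessable identifier would not.
    """
    return hmac.new(PEPPER, identifier.encode(), hashlib.sha256).hexdigest()[:16]
```

Because the token is stable, it still supports correlation and billing queries without exposing the raw identifier.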
Weekly/monthly routines
- Weekly: Review attribute cardinality and top new keys.
- Monthly: Audit coverage for critical SLOs and enforcement in CI.
- Quarterly: Review spec and deprecate old keys.
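The weekly cardinality review can be scripted rather than eyeballed. A sketch that ranks attribute keys by distinct-value count over a sample of events, surfacing candidates for the deny-list:

```python
from collections import defaultdict

def cardinality_report(events: list, top_n: int = 5) -> list:
    """Rank attribute keys by distinct-value count across sampled events.

    High-ranking keys are cardinality-explosion candidates worth
    reviewing against the conventions spec.
    """
    values_per_key = defaultdict(set)
    for event in events:
        for key, value in event.items():
            values_per_key[key].add(value)
    ranked = sorted(((k, len(v)) for k, v in values_per_key.items()),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Running this against a day's sample and diffing against last week's report also catches "top new keys" automatically.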
What to review in postmortems related to Semantic conventions
- Was the required telemetry present during the incident?
- Any convention changes required?
- Did any attribute cause noise or cost spikes?
- Follow-up tasks for CI checks and runbook updates.
Tooling & Integration Map for Semantic conventions (TABLE REQUIRED)
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Instrumentation SDK | Emits telemetry with conventions | Languages, frameworks | Core for producer-side enforcement |
| I2 | Collector | Normalizes and processes telemetry | Backends and processors | Can enforce redaction and hashing |
| I3 | Tracing backend | Stores and visualizes traces | Collector and SDKs | Useful for dependency graphs |
| I4 | Metrics store | Time-series storage for SLIs | Scrapers and exporters | Good for SLOs and alerts |
| I5 | Logging platform | Stores structured logs | Log shippers and processors | Indexing decisions critical |
| I6 | CI linters | Validates telemetry contracts | VCS and build pipelines | Early feedback loop |
| I7 | Dashboarding | Visualizes convention-backed panels | Backends and templates | Collaboration surface |
| I8 | Alerting system | Routes alerts per attributes | Incident management | Needs integration with resource tags |
| I9 | Security SIEM | Audit and compliance analysis | Log and event sources | Enforce PII policies |
| I10 | Cost allocation | Allocates telemetry cost to owners | Billing and telemetry data | Guides cardinality trade-offs |
Row Details (only if needed)
- None
Frequently Asked Questions (FAQs)
What are semantic conventions in observability?
Semantic conventions are standardized rules for naming and typing telemetry attributes to ensure consistent meaning across systems and teams.
Are semantic conventions mandatory?
It depends: they are essential in multi-team and production environments but optional for small prototypes.
Do conventions increase telemetry cost?
They can reduce cost if they enforce low cardinality; misused, they can increase cost.
How do I handle PII in telemetry?
Redact or hash identifiers and enforce telemetry privacy rules in the collector.
How to enforce conventions automatically?
Use CI linting, SDKs that follow the spec, and collector processors that normalize or reject non-conforming attributes.
Can conventions be vendor-specific?
Yes, but vendor-neutral conventions prevent lock-in. Mapping layers can translate vendor fields.
How often should conventions change?
Spec updates should be infrequent and versioned; favor quarterly or as-needed changes, each with a migration plan.
How do conventions affect SLOs?
SLIs rely on consistent keys; conventions make SLO computation correct and repeatable.
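As an illustration, an availability SLI is only computable when every producer tags responses with the same status-code key. A sketch assuming the `http.response.status_code` convention; events missing the key are excluded, which is itself a coverage signal worth alerting on:

```python
def availability_sli(events: list) -> float:
    """Compute an availability SLI from conventionally tagged events.

    Relies on every producer using the same http.response.status_code
    key; without that convention the ratio silently undercounts.
    """
    codes = [e["http.response.status_code"] for e in events
             if "http.response.status_code" in e]
    if not codes:
        return 1.0  # no coded events in the window; treat as healthy
    good = sum(1 for c in codes if c < 500)
    return good / len(codes)
```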
What is the role of OpenTelemetry?
OpenTelemetry defines the most widely adopted semantic conventions and provides vendor-neutral SDKs and an ecosystem that implement them.
How to prevent cardinality explosion?
Limit allowed high-cardinality keys, hash identifiers, and monitor cardinality metrics.
Who owns the semantic conventions?
A central observability owner and per-team stewards share responsibility.
Can conventions be tested in CI?
Yes; contract tests and linters can validate instrumentation changes before merge.
How to map legacy telemetry to new conventions?
Use collector processors to transform keys and run parallel reporting during migration.
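A key-renaming transform is the core of such a migration. A Python sketch with a hypothetical legacy-to-convention mapping, standing in for what a collector processor would do in the pipeline:

```python
# Hypothetical mapping from legacy keys to convention keys; a real
# migration would source this from the conventions registry.
LEGACY_TO_CONVENTION = {
    "httpMethod": "http.request.method",
    "statusCode": "http.response.status_code",
    "svc": "service.name",
}

def migrate_keys(attrs: dict) -> dict:
    """Rename legacy attribute keys; unknown keys pass through unchanged."""
    return {LEGACY_TO_CONVENTION.get(k, k): v for k, v in attrs.items()}
```

Running old and new dashboards in parallel against the transformed stream lets teams verify parity before retiring legacy queries.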
What to do if a convention causes paging?
Tune SLOs, adjust alert thresholds, or change the attribute aggregation to reduce noise.
How to measure attribute coverage?
Compute the percentage of services emitting required keys over a time window.
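This coverage metric can be computed directly from the attribute keys each service was observed emitting. A minimal sketch, assuming per-service key sets gathered over the measurement window:

```python
def attribute_coverage(services: dict, required: set) -> float:
    """Fraction of services emitting all required attribute keys.

    `services` maps service name -> set of keys observed in the window.
    """
    if not services:
        return 0.0
    compliant = sum(1 for keys in services.values() if required <= keys)
    return compliant / len(services)
```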
Are conventions useful for security monitoring?
Yes; standardized audit fields make security queries and correlations reliable.
What prevents secrets from being emitted?
Code reviews, static linters, runtime redaction in collectors, and training together prevent secret leakage.
How do conventions interact with feature flags?
Feature flag attributes should be standardized to measure performance and errors by variant.
Conclusion
Semantic conventions are an operational contract that unlocks repeatable, scalable observability. They reduce incident time-to-resolution, avoid costly data growth, and enable meaningful SLIs and SLOs across diverse systems. A pragmatic, incremental approach—backed by SDKs, CI checks, and collector normalization—delivers the most value.
Next 7 days plan
- Day 1: Inventory critical services and assign owners.
- Day 2: Define a minimal convention set for HTTP, DB, and auth.
- Day 3: Add OpenTelemetry SDKs or sidecars to two pilot services.
- Day 4: Implement CI linting for attribute presence and types.
- Day 5: Build one executive and one on-call dashboard using the new keys.
Appendix — Semantic conventions Keyword Cluster (SEO)
- Primary keywords
- semantic conventions
- observability conventions
- telemetry conventions
- OpenTelemetry semantic conventions
- semantic naming for telemetry
- Secondary keywords
- attribute naming standard
- telemetry schema
- observability schema
- telemetry normalization
- resource attributes
- Long-tail questions
- what are semantic conventions in observability
- how to create semantic conventions for microservices
- semantic conventions for serverless functions
- impact of semantic conventions on SLOs
- how to measure semantic conventions coverage
- best practices for semantic conventions in kubernetes
- how to prevent PII in telemetry
- how to reduce cardinality with semantic conventions
- how to enforce semantic conventions in CI
- strategies for migrating legacy telemetry to new conventions
- semantic conventions vs telemetry schema differences
- how semantic conventions affect incident response
- how to build dashboards using semantic conventions
- semantic conventions for multi-tenant systems
- how to test semantic conventions automatically
- Related terminology
- attribute coverage
- tag cardinality
- resource metadata
- trace context propagation
- SLI definition
- SLO design
- error budget policy
- telemetry sampling
- ingestion normalization
- collector processors
- telemetry contract registry
- contract linting
- telemetry redaction
- hashing identifiers
- audit logging attributes
- topology mapping
- dependency graph keys
- observability pipeline
- metric units standardization
- metric label guidelines
- event enrichment
- runbooks tied to telemetry
- CI telemetry checks
- telemetry cost allocation
- canary instrumentation
- auto-instrumentation best practices
- manual instrumentation checklist
- telemetry schema evolution
- vendor-neutral telemetry
- observability maturity model
- telemetry privacy rules
- collector normalization rules
- ingestion error metrics
- dashboard templating
- alert deduplication
- burn-rate escalation
- telemetry retention tiers
- structured logging conventions
- feature flag telemetry
- serverless telemetry attributes
- kubernetes resource attributes
- security SIEM mapping
- telemetry contract testing
- attribute typing standards
- index vs raw telemetry trade-offs
- telemetry governance checklist
- telemetry owner assignment
- telemetry transformation rules
- telemetry observability health metrics
- telemetry cardinality heatmap