OpenTelemetry: OpenTelemetry is an open source observability framework created when CNCF merged the OpenTracing and OpenCensus projects.[65] OpenTracing offers “consistent, expressive, vendor-neutral APIs for popular platforms”[66] while the Google-created OpenCensus project acts as a “collection of language-specific libraries for instrumenting an application, collecting stats (metrics), and exporting data to a supported backend.”[67]
Under OpenTelemetry, the projects create a “complete telemetry system [that is] suitable for monitoring microservices and other types of modern, distributed systems — and [is] compatible with most major OSS and commercial backends.”[68] It is the “second most active” CNCF project.[69] In October 2020, AWS announced the public preview of its distro for OpenTelemetry.[70]
https://opentelemetry.io describes OpenTelemetry as “a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.”
OpenTelemetry is generally available across several languages and is suitable for production use.
You can follow OpenTelemetry’s blog here: https://opentelemetry.io/blog/
OpenTelemetry enables comprehensive observability by integrating distributed tracing, metrics, and logs across various application layers and environments. Examples include:
Distributed Tracing
- Tracking a request as it flows through multiple microservices, capturing spans for each service interaction, and visualizing the end-to-end latency and bottlenecks in systems like Jaeger or Zipkin.
- Instrumenting HTTP handlers, database queries, and RPC calls to create trace data, which helps diagnose where failures or performance issues occur.
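To make the request-tracking idea above concrete, here is a minimal sketch of how trace context can travel between services in a W3C Trace Context `traceparent` header. It uses only the Python standard library and is not the OpenTelemetry SDK; the names `SpanContext`, `new_root_context`, `parse_traceparent`, and `child_context` are hypothetical, and the parser only accepts the version `00` layout.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# traceparent layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, shared by every span in one request
    span_id: str   # 16 hex chars, unique to one span
    sampled: bool

    def to_traceparent(self) -> str:
        """Serialise the context for an outgoing HTTP or RPC call."""
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Read the caller's context; return None if the header is unusable."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: SpanContext) -> SpanContext:
    """A downstream span keeps the trace ID and gets a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A service that receives a request would call `parse_traceparent` on the incoming header, create a `child_context` for its own work, and send `to_traceparent()` on any outgoing call. Because every span carries the same trace ID, a backend such as Jaeger or Zipkin can stitch the spans into one end-to-end view.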
Metrics Collection
- Gathering infrastructure metrics such as CPU, memory usage, and network I/O, as well as custom application metrics like request duration, error counts, and throughput.
- Exporting metrics to platforms like Prometheus for real-time monitoring and alerting, enabling fast response to anomalies.
Logging Correlation
- Enriching application logs with trace and span IDs so that developers can link logs directly to traces, making it easier to contextually analyze incidents.
- Sending logs to log management systems like Loki or Elasticsearch, alongside metrics and trace data, for unified querying and troubleshooting.
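Log enrichment with trace and span IDs, as described above, can be sketched with Python's standard `logging` module. This is an illustration under stated assumptions rather than OpenTelemetry's own logging integration: the `_current_ids` context variable stands in for whatever holds the active span in a real tracer, and the logger name and log format are made up for the example.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active (trace_id, span_id); a real tracer
# would supply these values.
_current_ids = contextvars.ContextVar(
    "current_ids", default=("0" * 32, "0" * 16)
)


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = _current_ids.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


def demo() -> str:
    """Log one line inside a (simulated) active span and return the output."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    token = _current_ids.set(
        ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
    )
    try:
        log.info("payment declined")
    finally:
        _current_ids.reset(token)
    return buffer.getvalue()
```

With the IDs present on every line, a log backend such as Loki or Elasticsearch can be queried by `trace_id` to pull up exactly the log lines that belong to a slow or failed trace.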
Kubernetes OpenTelemetry example
- In a Kubernetes-based microservice architecture, OpenTelemetry is used to instrument all services. Traces track requests between services, metrics capture latency and error rates, and logs include trace context. This comprehensive telemetry allows teams to visualize SLAs, quickly investigate outages, and correlate issues between signals for rapid root cause analysis.
These patterns demonstrate how OpenTelemetry provides a holistic observability solution beyond siloed tracing, metrics, or logging, improving visibility and accelerating issue resolution in distributed architectures.
OpenTelemetry Timeline (Key Developments)
Pre-history (2010–2018): Foundations of Distributed Tracing
- 2010 – Google publishes Dapper, introducing large-scale distributed tracing concepts
- 2012–2015 – Emergence of tools like Zipkin and Jaeger
- 2016 – OpenTracing launched (vendor-neutral tracing APIs)
- 2017 – OpenCensus launched (metrics + tracing SDKs)
Problem: Two competing standards created fragmentation and adoption friction
2019: Birth of OpenTelemetry
- May 2019
- OpenTracing + OpenCensus officially merge into OpenTelemetry
- Accepted into CNCF as a Sandbox project
Strategic shift:
- One unified standard for telemetry (traces, metrics, logs)
- Vendor-neutral instrumentation layer
2020: First Production Readiness
- Tracing reaches beta and release candidates; the v1.0 tracing specification follows in February 2021
- Stable tracing specification and SDKs
- Signals confidence for production adoption
Impact:
- Tracing becomes the first mature pillar
- Vendors (Datadog, New Relic, etc.) begin aligning
2021: CNCF Incubation & Ecosystem Growth
- Aug 2021
- OpenTelemetry moves to CNCF Incubating
Key developments:
- Multi-language SDK expansion (10+ languages)
- Formalisation of OTLP (OpenTelemetry Protocol)
- Strong adoption across vendors and enterprises
2022: Metrics Maturity & “Three Pillars” Alignment
- Metrics API & SDK reach stability
- Logs integration matures (still evolving)
Outcome:
- First time a single framework supports all three signals:
- Traces
- Metrics
- Logs
2023–2024: Industry Standardisation Phase
- Widespread adoption across:
- Cloud providers (AWS, Azure, GCP)
- Observability vendors (Datadog, Splunk, Grafana)
Key trends:
- Auto-instrumentation becomes mainstream
- OpenTelemetry Collector becomes the de facto telemetry pipeline layer
- Deep integration with CNCF stack (Prometheus, Jaeger, etc.)
2025: Scaling Challenges & Maturity Work
- Focus on:
- Configuration standardisation (YAML/JSON)
- SDK self-observability
- Reducing operational complexity
Reality:
- OTel becomes powerful but operationally complex at scale
2026: CNCF Graduation 🎓
- May 11, 2026
- OpenTelemetry reaches CNCF Graduated status
This signals:
- Enterprise-grade maturity
- Strong governance and ecosystem stability
- Long-term industry standard for observability
🧠 Summary (Condensed View)
| Phase | Focus | Outcome |
|---|---|---|
| 2010–2018 | Fragmented tracing ecosystem | Competing standards |
| 2019 | Merge into OpenTelemetry | Unified vision |
| 2020–21 | Tracing stabilised | Production adoption |
| 2021 | CNCF incubation | Rapid ecosystem growth |
| 2022 | Metrics stabilised | Full observability stack |
| 2023–24 | Industry adoption | De facto standard |
| 2025 | Scaling & complexity | Maturity refinement |
| 2026 | CNCF graduation | Enterprise standard |
Unified Timeline: OpenTelemetry vs Commercial Platforms
Phase 1: Pre-OpenTelemetry (2010–2018)
Vendor-controlled instrumentation era
Vendors
- Datadog (founded 2010)
- Agent-based metrics + infra monitoring
- Later adds APM (tracing)
- New Relic
- Strong APM-first model
- Proprietary agents and SDKs
- Splunk
- Log-centric (machine data platform)
- Later moves into APM via acquisition (SignalFx, Omnition)
Key Characteristics
- Fully proprietary instrumentation
- Vendor lock-in at the agent + SDK layer
- Scaling issue:
- Each service tightly coupled to a vendor agent
- Difficult multi-vendor strategy
- High operational friction in polyglot environments
This is the problem OpenTelemetry was created to solve.
Phase 2: 2019–2020 (OpenTelemetry Emerges)
Vendors react cautiously
OpenTelemetry
- Merge of OpenTracing + OpenCensus
- CNCF sandbox
- Tracing reaches beta (2020); the v1.0 specification follows in early 2021
Vendor Positioning
- Datadog
- Initially resistant (protect proprietary APM agents)
- Begins adding OTLP ingestion endpoints
- New Relic
- Early strategic pivot:
- Promotes “open instrumentation” narrative
- Starts aligning SDKs with OTel
- Splunk
- Acquires SignalFx + Omnition (OTel-native tracing company)
- Becomes one of the biggest OTel contributors early
Scaling Context
- Microservices explode (Kubernetes, service meshes)
- Tracing becomes critical, but:
- Instrumentation effort grows with every added service, language, and framework
Vendors realise:
They cannot scale proprietary instrumentation across cloud-native ecosystems.
Phase 3: 2021–2022 (Adoption & Standardisation)
OTel becomes real in production
OpenTelemetry
- CNCF Incubation
- OTLP stabilised
- Metrics reach maturity (2022)
Vendor Adoption Patterns
🟣 Datadog
- Adds:
- OTLP ingest (traces + metrics)
- OTel Collector support
- Still pushes:
- Datadog Agent as primary path
Strategy:
“Support OTel, but keep users inside Datadog ecosystem”
🔵 New Relic
- Fully embraces OTel:
- OTel-native ingestion
- Promotes agentless / open instrumentation
- Drops pricing barriers (usage-based model shift)
Strategy:
“Win by being the most OpenTelemetry-friendly vendor”
🟢 Splunk
- Deep integration:
- Splunk Distribution of OpenTelemetry Collector
- Native OTLP pipelines
- Heavy contributor to OTel project
Strategy:
“Own the pipeline layer via OTel”
Scaling Problem (Critical Insight)
At this stage, OTel solves instrumentation, but creates:
- Pipeline sprawl (Collectors everywhere)
- Config complexity (YAML explosion)
- Cardinality + cost issues
Observability shifts from:
- “Can I collect telemetry?”
to:
- “Can I control cost and cardinality at scale?”
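The cardinality concern above comes down to simple arithmetic: in the worst case, the number of distinct metric series is the product of the number of distinct values of each label. The sketch below shows that calculation and one common mitigation, dropping unbounded labels before export. The label names and counts are invented for illustration.

```python
from math import prod
from typing import Dict, Iterable


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series count: the product of per-label value counts."""
    return prod(label_cardinalities.values())


def drop_labels(attrs: Dict[str, str], denylist: Iterable[str]) -> Dict[str, str]:
    """Remove unbounded labels (user IDs, request IDs) before export."""
    blocked = set(denylist)
    return {k: v for k, v in attrs.items() if k not in blocked}


bounded = {"service": 50, "route": 20, "status_code": 5}
print(worst_case_series(bounded))                          # 5000
print(worst_case_series({**bounded, "user_id": 10_000}))   # 50000000
```

Adding a single per-user label multiplies the series count by the number of users, which is why label hygiene tends to matter more for cost than the number of metrics defined.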
Phase 4: 2023–2024 (Mainstream Adoption)
OTel becomes the default
Market Reality
- OpenTelemetry becomes:
- Default instrumentation standard
- Expected in enterprise architectures
Vendor Differentiation Shifts
Datadog
- Focus:
- UX, correlation, AI features
- Still optimised for:
- Datadog-native pipelines
Key move:
- “OTel in → Datadog internal model”
New Relic
- Positions itself as:
- “Best backend for OTel data”
- Strong:
- Unified schema (NRDB)
Splunk
- Focus:
- Enterprise-scale ingestion + analytics
- OTel Collector becomes:
- First-class ingestion layer
Scaling Complexity (Now the Core Problem)
At scale (what you’d discuss in an SRE interview):
- Cardinality explosion
- Metrics labels blow up costs
- Sampling strategies
- Head vs tail sampling in traces
- Pipeline engineering
- Filtering, enrichment, routing
- Storage tiering
- Hot vs cold observability data
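The head vs tail sampling distinction in the list above can be sketched in a few lines. Head sampling decides when a trace starts, using only the trace ID; tail sampling waits until the trace is complete and can keep the interesting ones (errors, slow requests). This is a simplified illustration, not the sampler implementation of any particular SDK or Collector; the `Span` type and the thresholds are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str        # 32 hex chars
    name: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start. Deterministic, so every service that sees
    the same trace ID makes the same keep/drop decision."""
    threshold = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


def tail_sample(spans: List[Span], latency_ms: float) -> bool:
    """Decide after the trace completes: keep it if any span failed
    or was slower than the latency threshold."""
    return any(s.error or s.duration_ms >= latency_ms for s in spans)


fast_ok = [Span("a" * 32, "GET /cart", 12.0)]
slow = [Span("b" * 32, "GET /cart", 40.0), Span("b" * 32, "SELECT cart", 900.0)]
failed = [Span("c" * 32, "POST /pay", 30.0, error=True)]
```

Head sampling is cheap because dropped traces are never recorded, but it cannot know which traces will turn out to be errors. Tail sampling keeps the valuable traces at the price of buffering every span until the decision is made, which is one reason it is usually placed in the pipeline layer rather than in each service.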
Vendors now compete on:
“Who helps you manage observability complexity best?”
Phase 5: 2025–2026 (Maturity & Control Plane Thinking)
OpenTelemetry
- CNCF Graduation (2026)
- Focus:
- Stability
- Configuration standardisation
- Pipeline governance
Vendor Convergence
All three now:
- Support OTLP natively
- Support OpenTelemetry Collector
- Provide their own Collector distributions or custom builds
Strategic Shift
Observability architecture becomes:
[Instrumentation: OpenTelemetry]
↓
[Pipeline: OTel Collector / Vendor distros]
↓
[Backend: Datadog / New Relic / Splunk]
The battleground moves to:
| Layer | Who owns it |
|---|---|
| Instrumentation | OpenTelemetry |
| Pipeline | Shared (OTel + vendors) |
| Storage + UX | Vendors |
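The pipeline layer in the architecture above is where filtering, enrichment, and routing happen between instrumentation and backend. The sketch below models that flow in plain Python to show the shape of the idea; it is not how the OpenTelemetry Collector is configured (the Collector is driven by a configuration file), and the record fields, processor names, and backend names are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]
Processor = Callable[[Record], Optional[Record]]  # None means "drop"


def drop_health_checks(record: Record) -> Optional[Record]:
    """Filtering: discard noise before it reaches a paid backend."""
    return None if record.get("route") == "/healthz" else record


def add_environment(env: str) -> Processor:
    """Enrichment: stamp every record with deployment metadata."""
    def enrich(record: Record) -> Optional[Record]:
        return {**record, "deployment.environment": env}
    return enrich


def route_by_signal(record: Record) -> str:
    """Routing: pick a backend per signal type."""
    return "tracing-backend" if record["signal"] == "trace" else "metrics-backend"


def run_pipeline(
    records: Iterable[Record],
    processors: List[Processor],
    route: Callable[[Record], str],
) -> Dict[str, List[Record]]:
    exported: Dict[str, List[Record]] = {}
    for record in records:
        current: Optional[Record] = record
        for process in processors:
            current = process(current)
            if current is None:
                break  # a processor dropped the record
        if current is not None:
            exported.setdefault(route(current), []).append(current)
    return exported


incoming = [
    {"signal": "trace", "route": "/healthz"},
    {"signal": "trace", "route": "/cart"},
    {"signal": "metric", "route": "/cart"},
]
result = run_pipeline(
    incoming, [drop_health_checks, add_environment("prod")], route_by_signal
)
```

Because the instrumentation layer emits the same records regardless of destination, swapping or adding a backend is a change to the routing step only, which is what makes the pipeline layer contested ground between OpenTelemetry and the vendors.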
Key Takeaways (What Actually Matters)
1. OpenTelemetry commoditised instrumentation
- Vendors lost control of the data generation layer
2. Vendors adapted differently
| Vendor | Strategy |
|---|---|
| Datadog | Controlled openness (OTel-compatible, but agent-first) |
| New Relic | Full embrace of OTel |
| Splunk | Deep integration + pipeline ownership |
3. The real problem shifted
Before OTel:
- How do I instrument services?
After OTel:
- How do I:
- Control cost?
- Manage cardinality?
- Design pipelines?
- Sample intelligently?
4. Modern Observability = Data Engineering Problem
At scale, you’re effectively building:
- A real-time telemetry data platform
- With:
- Streaming pipelines (Collectors)
- Schema governance
- Cost optimisation