OpenTelemetry Collector

The OpenTelemetry Collector is the central data-plane component of the OpenTelemetry ecosystem. It’s designed to receive, process, and export telemetry data (metrics, logs, traces) in a vendor-neutral way.

Think of it as a programmable observability router sitting between your applications and your backend platforms (e.g. Datadog, Grafana, Prometheus).

Core Role in an Observability Architecture

At a high level, the collector solves three problems:

  1. Decoupling instrumentation from backends
    • Apps emit OTLP (OpenTelemetry Protocol)
    • Collector handles translation/export
    • You can swap backends without touching app code
  2. Centralized control of telemetry
    • Sampling, filtering, enrichment happen in one place
  3. Cost + performance optimization
    • Drop noisy data
    • Batch and compress before sending

Internal Architecture (Pipeline Model)

The collector uses a pipeline-based architecture. Each signal type (metrics/logs/traces) has its own pipeline:

Receivers → Processors → Exporters

1. Receivers (Ingress)

These ingest telemetry from various sources:

  • OTLP (gRPC / HTTP)
  • Prometheus scrape
  • Jaeger, Zipkin
  • Fluent Forward / filelog (logs, e.g. from Fluent Bit)

Example:

  • otlp → receives traces from instrumented apps
  • prometheus → scrapes metrics endpoints
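A receivers block wiring up both might look like the sketch below (the `app-metrics` job name and scrape target are illustrative; 4317/4318 are the OTLP defaults):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # default OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318   # default OTLP/HTTP port
  prometheus:
    config:
      scrape_configs:
        - job_name: app-metrics          # illustrative job name
          scrape_interval: 15s
          static_configs:
            - targets: [localhost:9090]  # illustrative target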

2. Processors (Transformation Layer)

This is where most of the engineering value lies.

Typical processors:

  • Batch → groups data for efficiency
  • Memory limiter → prevents OOM
  • Attributes → add/remove labels/tags
  • Sampling → reduce trace volume
  • Filter → drop unwanted telemetry

Example use cases:

  • Drop high-cardinality labels before Datadog ingestion
  • Sample 10% of traces in production
  • Add env=prod tag globally
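A sketch covering those three use cases with the attributes and probabilistic_sampler processors (`user_id` stands in for a hypothetical high-cardinality label):

processors:
  attributes:
    actions:
      - key: env
        value: prod
        action: upsert       # add env=prod to all telemetry
      - key: user_id         # hypothetical high-cardinality label
        action: delete
  probabilistic_sampler:
    sampling_percentage: 10  # keep roughly 10% of traces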

3. Exporters (Egress)

Send telemetry to backends:

  • Datadog
  • Prometheus (remote_write)
  • Grafana Loki / Tempo
  • OTLP (to another collector)
  • Kafka

This is where vendor lock-in is abstracted away.
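For instance, three of these can be declared side by side; a sketch where the endpoints and hostnames are placeholders:

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write  # placeholder endpoint
  otlp:
    endpoint: gateway-collector:4317  # forward to another collector; placeholder host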

Deployment Modes

1. Agent (Daemon Mode)

  • Runs on each host / node
  • Collects local telemetry
  • Low latency

Used in:

  • Kubernetes DaemonSets
  • VM-based workloads

2. Gateway (Central Mode)

  • Centralized collectors
  • Receives data from agents

Used for:

  • Global sampling decisions
  • Routing to multiple backends

3. Hybrid (Most Common in Enterprise)

[App] → [Agent Collector] → [Gateway Collector] → [Backends]

This gives:

  • Local buffering + resilience
  • Centralized policy enforcement
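On the agent side, this usually means exporting OTLP to the gateway rather than to a vendor directly. A sketch, assuming a hypothetical in-cluster gateway address:

exporters:
  otlp:
    endpoint: gateway-collector.observability.svc:4317  # hypothetical gateway address
    tls:
      insecure: true  # assumes plaintext traffic inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]  # the gateway applies policy and fans out to backends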

Key Capabilities (Why it’s widely adopted)

Vendor Neutrality

You instrument once → export anywhere:

  • Datadog
  • Grafana stack (Tempo/Loki/Mimir)
  • New Relic
  • Splunk

Protocol Translation

Example:

  • App emits OTLP
  • Collector exports to:
    • Datadog API
    • Prometheus format
    • Kafka
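A single metrics pipeline can fan out to all three at once. A sketch, assuming the datadog and prometheusremotewrite exporters declared earlier (broker address and topic are illustrative):

exporters:
  kafka:
    brokers: [kafka:9092]  # illustrative broker address
    topic: otlp_metrics    # illustrative topic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, prometheusremotewrite, kafka]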

Data Reduction (Critical for cost control)

  • Tail-based sampling (advanced)
  • Metric aggregation
  • Log filtering
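As a concrete log-filtering sketch, the filter processor can drop records below a severity using an OTTL condition (whether this threshold is appropriate is workload-specific):

processors:
  filter:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'  # drop anything below WARN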

Observability of Observability

The collector exposes its own metrics:

  • Queue length
  • Dropped spans
  • Export latency

This is crucial when debugging pipelines, for example when traces reach the collector but never appear in the backend.
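Internal metrics are served on a Prometheus endpoint (port 8888 by default); a sketch that raises the detail level (the metric names in the comment are examples and may vary by collector version):

service:
  telemetry:
    metrics:
      level: detailed  # exposes e.g. otelcol_exporter_queue_size,
                       # otelcol_exporter_send_failed_spans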

Example Minimal Config

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  memory_limiter:
    check_interval: 1s  # required; the collector refuses to start without it
    limit_mib: 512

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [datadog]

Advanced Features (Enterprise Use)

Tail-Based Sampling

  • Decides whether to keep or drop a trace only after the full trace has been seen
  • Useful for:
    • Keeping only traces that contain errors
    • Keeping slow requests
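A sketch of the tail_sampling processor implementing both policies (the policy names and the 500 ms threshold are arbitrary examples):

processors:
  tail_sampling:
    decision_wait: 10s  # buffer spans this long before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500  # arbitrary example threshold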

Multi-Destination Routing

Send:

  • Full data → cheap storage (S3 / Loki)
  • Sampled data → Datadog (cost control)
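One way to express this is two named pipelines fed by the same receiver; a sketch where otlp/archive stands in for a route to cheap storage:

service:
  pipelines:
    traces/full:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/archive]  # placeholder route to cheap storage
    traces/sampled:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [datadog]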

Load Balancing Exporters

Distributes traffic across multiple downstream collectors or endpoints; commonly used to route all spans of a given trace to the same tail-sampling collector.
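A sketch using the loadbalancing exporter with a static resolver (hostnames are placeholders):

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true  # assumes plaintext between collectors
    resolver:
      static:
        hostnames:
          - collector-1:4317  # placeholder
          - collector-2:4317  # placeholder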

Extensions

  • Health checks
  • pprof debugging
  • zPages (live debugging UI)
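Extensions are declared at the top level and enabled in the service section; a minimal sketch:

extensions:
  health_check:  # liveness/readiness endpoint, :13133 by default
  pprof:         # Go profiling endpoint
  zpages:        # live debugging pages, :55679 by default

service:
  extensions: [health_check, pprof, zpages]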

Common Pitfalls

These failure modes account for most “missing data” incidents:

  • Silent drops
    • Caused by the memory limiter or exporter queue overflow
  • Wrong pipeline wiring
    • Components defined but never referenced in service.pipelines
  • Exporter misconfiguration
    • Traces silently disappear with no obvious error
  • Overly aggressive sampling
    • Data appears to “vanish” because too little is kept

Mental Model

If you had to reduce it to one sentence:

The OpenTelemetry Collector is a stateful, programmable telemetry proxy with buffering, transformation, and routing capabilities.