OpenTelemetry Collector with Failover Connector, sending_queue, and file_storage
Resilient telemetry flow from Son Testing (Source) to Observability Dev (Aggregator + Backends)
How the Failover Connector Works
The failover connector is an internal component that acts as an exporter for the ingress pipeline and as a receiver for the downstream pipelines. It maintains a prioritized list of destination pipelines and routes telemetry to the first healthy one. While a lower-priority pipeline is active, it periodically retries the higher-priority pipelines and returns traffic to them once they are healthy again.
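The routing behavior described above can be sketched as a small toy model. This is an illustration only, not the connector's actual implementation: the class name FailoverRouter, the callable-per-pipeline interface, and the explicit `now` timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverRouter:
    """Toy model of health-based routing (hypothetical, not the real connector)."""

    # Pipelines in priority order; each callable returns True on success.
    pipelines: List[Callable[[str], bool]]
    retry_interval: float = 30.0
    active: int = 0          # index of the pipeline currently in use
    last_retry: float = 0.0  # when the retry timer was last (re)started

    def route(self, item: str, now: float) -> int:
        """Send item to the first healthy pipeline; return its index, or -1."""
        start = self.active
        # While demoted, probe from the top again once retry_interval has passed.
        if self.active > 0 and now - self.last_retry >= self.retry_interval:
            start = 0
            self.last_retry = now
        for idx in range(start, len(self.pipelines)):
            if self.pipelines[idx](item):
                if idx > self.active:
                    self.last_retry = now  # just demoted: start the retry timer
                self.active = idx
                return idx
        return -1  # every pipeline from `start` downward rejected the item
```

Note that the router itself stores nothing: an item is either accepted by some pipeline or reported as failed, which matches the "no storage" role the connector has in this design.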
Telemetry Sources
Applications · Pods · Host Metrics · Logs · Traces
OTLP (gRPC/HTTP)
Son Testing K8s Cluster
(Source — Telemetry Producers)
OpenTelemetry Collector
(Single Collector Process · Multiple Pipelines)
INGRESS PIPELINE (Receivers)
otlp receiver
Receives Traces, Metrics & Logs
Exports to Failover Connector
FAILOVER CONNECTOR
(Exporter for Ingress, Receiver for Downstream Pipelines)
Priority Levels (Health-Based Routing)
1. primary (highest priority)
2. failover (lower priority)
retry_interval: 30s (configurable)
Routes data to the first healthy pipeline
Routes data to the failover pipeline when the primary pipeline is unhealthy
PRIMARY PIPELINE (Fast Path · No Disk)
otlp exporter
sending_queue (memory)
FAILOVER PIPELINE (Durable Buffer Path)
otlp exporter
sending_queue (memory)
file_storage (Write-Ahead Log on Disk)
Observability Dev K8s Cluster
(Observability Platform)
Otel Aggregator Service (OTLP Receiver + Processing)
OTLP Receiver (ingests from Primary and Failover pipelines)
Processing: Batching · Resource Detection · Transformation · Routing
Backends: Mimir (Metrics) · Loki (Logs) · Tempo (Traces)
Primary OTLP (gRPC/HTTP): Live Data
Failover OTLP (gRPC/HTTP): Buffered or Live Data
Both pipelines target the same Aggregator endpoint
Important Notes
The connector does not buffer or persist data; it only routes based on health.
Buffering and persistence are provided by each pipeline's exporter: sending_queue in both pipelines, plus file_storage in the Failover pipeline.
Both pipelines send to the same aggregator endpoint.
After recovery, the connector routes new traffic to Primary while Failover drains backlog.
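To make the division of labor in these notes concrete, here is a minimal sketch of an exporter with a bounded in-memory queue. It is a simplified model, not the Collector's real sending_queue: the class BoundedQueueExporter, its `backend_up` flag, and the `consume`/`flush` methods are hypothetical names chosen for the example. The point it illustrates is that the exporter absorbs items until its queue is full, and only then reports a failure that a connector could act on.

```python
from collections import deque


class BoundedQueueExporter:
    """Toy exporter: buffers items in memory and rejects when the queue is full."""

    def __init__(self, queue_size: int):
        self.queue_size = queue_size
        self.queue = deque()
        self.backend_up = True   # stand-in for aggregator reachability
        self.delivered = []      # stand-in for data accepted by the backend

    def consume(self, item: str) -> bool:
        """Accept item into the queue; False signals failure to the caller."""
        if len(self.queue) >= self.queue_size:
            return False  # queue full: the error surfaces upstream
        self.queue.append(item)
        self.flush()
        return True

    def flush(self) -> None:
        """Deliver queued items in order while the backend is reachable."""
        while self.backend_up and self.queue:
            self.delivered.append(self.queue.popleft())
```

In this model a short backend hiccup is hidden entirely by the queue; only a sustained outage fills it and produces the failure that triggers a switch to the next pipeline.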
1. NORMAL OPERATION (Aggregator is Healthy)
Connector routes new incoming telemetry to the Primary Pipeline (highest priority, healthy).
Primary exporter sends data directly to the Otel Aggregator.
Failover pipeline is idle but ready.
Flow: Ingress → Failover Connector → Primary; Failover (idle)
Result: Low latency, normal flow.
2. BACKEND OUTAGE (2 HOURS) (Aggregator Unavailable)
Primary exporter's sending_queue starts buffering in memory. When the queue reaches its limit or export errors surface, the pipeline returns a failure to the connector.
Connector marks Primary as unhealthy and switches traffic to the Failover pipeline.
Failover exporter's sending_queue + file_storage buffer and persist telemetry safely for the duration of the outage.
Flow: Ingress → Failover Connector → Failover (active · buffering); Primary ✕ (unhealthy); Aggregator down
Result: No data loss (within capacity). Live traffic goes to failover; data is durable on disk.
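The "within capacity" qualifier can be checked with simple arithmetic: the disk buffer must hold everything produced during the outage. The helper below is a rough sizing sketch; the function name, the 1.5x headroom factor, and the 200 KiB/s ingest rate are illustrative assumptions, not measurements from this environment.

```python
def required_buffer_bytes(bytes_per_second: float, outage_seconds: float,
                          headroom: float = 1.5) -> float:
    """Estimate disk needed to hold telemetry for an outage, with a safety factor."""
    if bytes_per_second < 0 or outage_seconds < 0 or headroom < 1:
        raise ValueError("rate and duration must be >= 0, headroom must be >= 1")
    return bytes_per_second * outage_seconds * headroom


# Hypothetical rate: 200 KiB/s of telemetry during the 2-hour outage.
needed = required_buffer_bytes(200 * 1024, 2 * 3600)
print(f"{needed / 1024**3:.2f} GiB")
```

If the volume backing file_storage is smaller than this estimate for the real ingest rate, the failover pipeline would itself start rejecting data before a 2-hour outage ends.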
3. BACKEND RECOVERY: BOTH SEND (Aggregator is Healthy Again)
Connector retries Primary on retry_interval (e.g., 30s).
When Primary is healthy again, connector routes new incoming telemetry back to Primary.
Failover pipeline continues sending its buffered backlog (from memory and disk) until fully drained.
Both pipelines may send simultaneously for a period:
– Primary sends fresh/new telemetry
– Failover sends older/buffered telemetry
Flow: Ingress → Failover Connector → Primary (new data); Failover (draining backlog)
Result: System returns to normal, with no data lost during the outage window (within buffer capacity).
Key Components
Failover Connector: Health-based routing between pipelines (no storage).
Primary Pipeline: Fast path to backend with in-memory queue only.
Failover Pipeline: Buffering path with in-memory queue + persistent file_storage.
Otel Aggregator: Central ingest and routing to Mimir (metrics), Loki (logs), Tempo (traces).
Arrow Legend: Active live traffic · Failover / buffered · Idle / inactive