OpenTelemetry Collector · Failover Connector — Primary & Failover Queue Matrix

The Primary pipeline ships live to the aggregator; when the failover connector trips, the Failover pipeline's two-tier queue (sending_queue + file_storage on a PVC) takes over. The matrix shows how a queue_size of 10 000 / 15 000 / 20 000 / 25 000 behaves during 1-hour and 2-hour aggregator outages, at both block_on_overflow settings.

Ingest rate: 50 batches/s
Batch size: ~50 KB/batch
Primary queue: 1 000 batches · RAM
Failover storage: file_storage on PVC
Drain rate (healthy): 50 batches/s (= ingest)
retry_on_failure: on · max_elapsed_time: ∞

Primary + Failover queue behaviour — 1 h and 2 h aggregator outage

The Primary pipeline's 1 000-batch in-memory queue covers only ~20 s before the failover connector routes traffic to the Failover pipeline. All long-outage behaviour (and all loss or backpressure) is governed by the Failover queue knobs below.

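Failover queue configuration · 1-hour and 2-hour outage outcome
Rows compare queue_size (10 k → 25 k) × block_on_overflow (false / true). Times to fill assume 50 batches/s sustained ingest. Drain times assume the backend returns at full health and exports at 50 batches/s, so each one is simply the backlog divided by 50 batches/s. Because live ingest continues at the same rate, that export rate is only break-even, and the backlog actually shrinks only when drain exceeds ingest (see the drain-rate section below).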
primary_queue · in-memory only · 1 000 batches · ~20 s of buffer

| queue_size | block_on_overflow | Time to fill @ 50/s | Footprint | 1-hour and 2-hour outage outcome | Drain time after recovery | Recommended when… |
| --- | --- | --- | --- | --- | --- | --- |
| 1 000 | false (default) | ~20 s | RAM, ~50 MB | failover-connector: after the ~20 s primary buffer is exhausted, the connector trips and traffic is routed to the Failover pipeline for the duration of both the 1 h and 2 h scenarios. | Instant: primary drains as soon as backend is healthy; rejoin window is negligible. | Always: it's the “healthy path” before failover engages. |

failover_queue · sending_queue + file_storage on PVC · varies by row

| Failover queue_size | block_on_overflow | Time to fill @ 50/s | PVC footprint | 1-hour outage outcome (180 000 batches arrive) | 2-hour outage outcome (360 000 batches arrive) | Drain time after recovery | Recommended when… |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 000 | false | ~3 m 20 s (200 s) | ~500 MB | saved: 10 000; dropped: ~170 000. Queue fills in 3 m 20 s; for the remaining ~56 m the exporter drops the newest batches. enqueue_failed_* rises steadily. | saved: 10 000; dropped: ~350 000. Same pattern; drop window extends to ~1 h 57 m. | ~3 m 20 s of export time for the 10 k backlog at 50/s; live ingest continues at 50/s, see drain-rate note below | Tight PVC budget; short-outage tolerance only; bursty workload where newest-data loss is acceptable. |
| 10 000 | true | ~3 m 20 s, then stalls | ~500 MB | saved: 10 000; blocked upstream: ~170 000. After fill, the exporter blocks. Backpressure propagates to the receiver and OTLP clients, which must retry or drop themselves. | saved: 10 000; blocked upstream: ~350 000. Upstream bears the cost for ~1 h 57 m. | ~3 m 20 s | Producers can hold or retry themselves; collector must never drop. |
| 15 000 | false | ~5 m (300 s) | ~750 MB | saved: 15 000; dropped: ~165 000. Queue fills in 5 m; dropping window ~55 m. | saved: 15 000; dropped: ~345 000. Dropping window ~1 h 55 m. | ~5 m | Moderate PVC budget; balances buffering with disk cost. |
| 15 000 | true | ~5 m, then stalls | ~750 MB | saved: 15 000; blocked upstream: ~165 000 | saved: 15 000; blocked upstream: ~345 000 | ~5 m | Upstream can tolerate ~55 m of backpressure on a 1 h outage. |
| 20 000 | false | ~6 m 40 s (400 s) | ~1.0 GB | saved: 20 000; dropped: ~160 000. Dropping window ~53 m. | saved: 20 000; dropped: ~340 000. Dropping window ~1 h 53 m. | ~6 m 40 s | Sweet spot when PVC cost and headroom both matter. |
| 20 000 | true | ~6 m 40 s, then stalls | ~1.0 GB | saved: 20 000; blocked upstream: ~160 000 | saved: 20 000; blocked upstream: ~340 000 | ~6 m 40 s | Strong durability; upstream is resilient to ~53 m backpressure on 1 h. |
| 25 000 | false | ~8 m 20 s (500 s) | ~1.25 GB | saved: 25 000; dropped: ~155 000. Dropping window ~51 m 40 s. | saved: 25 000; dropped: ~335 000. Dropping window ~1 h 51 m 40 s. | ~8 m 20 s | Max headroom with bounded drop behaviour; watch PVC compaction. |
| 25 000 | true | ~8 m 20 s, then stalls | ~1.25 GB | saved: 25 000; blocked upstream: ~155 000 | saved: 25 000; blocked upstream: ~335 000 | ~8 m 20 s | Strongest single-node durability; requires resilient upstream. |

Can the failover drain be rate-limited? — Yes.

The collector exposes several knobs to throttle how fast the Failover pipeline exports its backlog after the aggregator returns; the drain configurations in the matrix below vary num_consumers and rate_limiter.

The matrix below shows effective drain performance at full rate, at 2× and 3× rate, and at half, third, and quarter rate. The key subtlety is that live ingest is still arriving at 50 batches/s, so only a drain rate greater than ingest actually shrinks the backlog.

Failover drain rate vs live ingest

Once the aggregator returns, the failover exporter starts draining. If the exporter is rate-limited, the effective backlog reduction per second = drain_rate − ingest_rate. Any configuration where drain ≤ ingest means the backlog never shrinks while producers are live — catch-up only happens when ingest falls below the drain rate.
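
The net-drain rule can be expressed as a small calculation. The sketch below is illustrative only: `catch_up_seconds` is a hypothetical helper name, and it assumes both the drain rate and the ingest rate stay constant for the whole catch-up period.

```python
from typing import Optional


def catch_up_seconds(backlog: int, drain_rate: float,
                     ingest_rate: float = 50.0) -> Optional[float]:
    """Seconds needed to clear `backlog` batches while ingest continues.

    Returns None when drain_rate <= ingest_rate, because the backlog then
    never shrinks (it holds steady at break-even or keeps growing).
    """
    net = drain_rate - ingest_rate  # effective backlog reduction per second
    if net <= 0:
        return None
    return backlog / net


if __name__ == "__main__":
    backlog = 25_000  # batches carried out of the outage
    for label, rate in (("full", 50), ("2x", 100), ("3x", 150), ("half", 25)):
        print(label, catch_up_seconds(backlog, rate))
    # Full-rate drain once live ingest has dipped to 25 batches/s
    print("full, ingest 25/s", catch_up_seconds(backlog, 50, ingest_rate=25))
```

Returning None instead of infinity keeps the "never catches up" case explicit, so a caller has to handle it rather than format a meaningless duration.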

Drain rate with and without rate limiting · live ingest = 50 batches/s
Assumes a 25 000-batch backlog carried out of a 1-hour outage (the largest saved amount in the matrix above). “Catch-up time” is measured while live ingest continues at 50/s.
| Drain config | Effective drain rate | Net drain (drain − 50/s ingest) | Catch-up time for 25 000-batch backlog | Notes |
| --- | --- | --- | --- | --- |
| Full rate (num_consumers ≥ 10 · no limit) | ~50 batches/s | 0 / s (break-even) | Never, while live ingest = 50/s. Only drains once ingest dips; if ingest falls to 25/s, catch-up ≈ 16 m 40 s. | Break-even default. Good enough once traffic is nightly-low. |
| 2× rate (num_consumers doubled · higher parallelism) | ~100 batches/s | +50 / s | ~8 m 20 s | If the backend/network can keep up, this is the fastest safe catch-up. |
| 3× rate (aggressive drain · bursty) | ~150 batches/s | +100 / s | ~4 m 10 s | Fastest, at the risk of saturating the aggregator / inducing a second outage. |
| Half rate (num_consumers halved · rate_limiter 25/s) | ~25 batches/s | −25 / s (backlog grows) | Never; backlog grows by 25/s. PVC fills again at 25/s × remaining outage-budget; same drop/block behaviour returns. | Only safe if ingest is simultaneously throttled (rate_limiter on ingress). |
| Third rate (rate_limiter ~16.7/s) | ~16.7 batches/s | −33.3 / s (backlog grows faster) | Never; backlog grows by ~33/s. | Hazardous with live ingest; usable only during post-incident off-hours. |
| Quarter rate (rate_limiter ~12.5/s) | ~12.5 batches/s | −37.5 / s | Never; backlog grows by 37.5/s. | Effectively non-draining under live load; avoid unless ingest is paused. |

Practical guidance