OpenTelemetry Collector · Failover Connector — Primary & Failover Queue Matrix

Primary pipeline ships live to the aggregator; when the failover connector trips, the failover pipeline's two-tier queue (sending_queue → file_storage on a PVC) takes over. The matrix shows how a queue_size of 10 000 / 15 000 / 20 000 / 25 000 behaves during 1-hour and 2-hour aggregator outages, at both block_on_overflow settings.

Ingest rate: 50 batches/s
Batch size: ~50 KB/batch
Primary queue: 1 000 batches · RAM
Failover storage: file_storage on PVC
Drain rate (healthy): 50 batches/s · = ingest
retry_on_failure: on · max_elapsed_time ∞

Primary + Failover queue behaviour — 1 h and 2 h aggregator outage

The Primary pipeline's 1 000-batch in-memory queue covers only ~20 s before the failover connector routes traffic to the Failover pipeline. All long-outage behaviour (and all loss or backpressure) is governed by the Failover queue knobs below.

Failover queue configuration · 1-hour and 2-hour outage outcome
Rows compare `queue_size` (10 k → 25 k) × `block_on_overflow` (false / true). Times to fill assume 50 batches/s sustained ingest. Drain times assume the backend returns at full health and the exporter runs at 50 batches/s; they give the time needed to export the outage-era backlog, since the queue depth itself does not shrink while live ingest continues at the same 50 batches/s.
Failover queue_size	block_on_overflow	Time to fill @ 50 / s	PVC footprint	1-hour outage outcome (180 000 batches arrive)	2-hour outage outcome (360 000 batches arrive)	Drain time after recovery	Recommended when…
primary_queue in-memory only · 1 000 batches · ~20 s of buffer
`1 000`	`false (default)`	~20 s	RAM ~50 MB	failover-connector: after the ~20 s primary buffer is exhausted, the connector trips and traffic is routed to the Failover pipeline for the duration of both the 1 h and 2 h scenarios.		Instant: primary drains as soon as the backend is healthy; rejoin window is negligible.	Always; it is the “healthy path” before failover engages.
failover_queue sending_queue + file_storage on PVC · varies by row
`10 000`	`false`	~3 m 20 s (200 s)	~500 MB	saved: 10 000 · dropped: ~170 000. Queue fills in 3 m 20 s; for the remaining ~56 m the newest batches are dropped at the exporter. `enqueue_failed_*` rises steadily.	saved: 10 000 · dropped: ~350 000. Same pattern; drop window extends to ~1 h 57 m.	~3 m 20 s to export the 10 k outage backlog at 50/s; the queue depth itself does not shrink while live ingest continues at 50/s (see drain-rate note below)	Tight PVC budget; short-outage tolerance only; bursty workload where newest-data loss is acceptable.
`10 000`	`true`	~3 m 20 s, then stalls	~500 MB	saved: 10 000 · blocked upstream: ~170 000. After the queue fills, the exporter blocks. Backpressure propagates to the receiver and OTLP clients, which must retry or drop on their own.	saved: 10 000 · blocked upstream: ~350 000. Upstream bears the cost for ~1 h 57 m.	~3 m 20 s	Producers can hold or retry themselves; collector must never drop.
`15 000`	`false`	~5 m (300 s)	~750 MB	saved: 15 000 · dropped: ~165 000. Queue fills in 5 m; dropping window ~55 m.	saved: 15 000 · dropped: ~345 000. Dropping window ~1 h 55 m.	~5 m	Moderate PVC budget; balances buffering with disk cost.
`15 000`	`true`	~5 m, then stalls	~750 MB	saved: 15 000 · blocked upstream: ~165 000	saved: 15 000 · blocked upstream: ~345 000	~5 m	Upstream can tolerate ~55 m of backpressure on a 1 h outage.
`20 000`	`false`	~6 m 40 s (400 s)	~1.0 GB	saved: 20 000 · dropped: ~160 000. Dropping window ~53 m.	saved: 20 000 · dropped: ~340 000. Dropping window ~1 h 53 m.	~6 m 40 s	Sweet spot when PVC cost and headroom both matter.
`20 000`	`true`	~6 m 40 s, then stalls	~1.0 GB	saved: 20 000 · blocked upstream: ~160 000	saved: 20 000 · blocked upstream: ~340 000	~6 m 40 s	Strong durability; upstream is resilient to ~53 m of backpressure on a 1 h outage.
`25 000`	`false`	~8 m 20 s (500 s)	~1.25 GB	saved: 25 000 · dropped: ~155 000. Dropping window ~51 m 40 s.	saved: 25 000 · dropped: ~335 000. Dropping window ~1 h 51 m 40 s.	~8 m 20 s	Max headroom with bounded drop behaviour; watch PVC compaction.
`25 000`	`true`	~8 m 20 s, then stalls	~1.25 GB	saved: 25 000 · blocked upstream: ~155 000	saved: 25 000 · blocked upstream: ~335 000	~8 m 20 s	Strongest single-node durability; requires resilient upstream.
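
The fill times, PVC footprints, and saved/overflow counts in the matrix follow from simple arithmetic on the stated parameters. The sketch below reproduces them, assuming a constant 50 batches/s ingest, ~50 KB per batch, and decimal units (1 GB = 1 000 MB); the names `OutageOutcome` and `outage_outcome` are illustrative and not part of any collector API.

```python
from dataclasses import dataclass

INGEST_RATE = 50      # batches/s, from the parameter list
BATCH_SIZE_KB = 50    # ~50 KB per batch, from the parameter list


@dataclass
class OutageOutcome:
    fill_seconds: float   # time until the failover queue is full
    footprint_mb: float   # on-disk size of a full queue (decimal MB)
    saved: int            # batches held in the queue when the outage ends
    overflow: int         # dropped (block_on_overflow=false) or blocked upstream (true)


def outage_outcome(queue_size: int, outage_seconds: int,
                   ingest_rate: int = INGEST_RATE,
                   batch_size_kb: int = BATCH_SIZE_KB) -> OutageOutcome:
    """Model one matrix row for an outage of the given length."""
    arrived = ingest_rate * outage_seconds
    saved = min(queue_size, arrived)
    return OutageOutcome(
        fill_seconds=queue_size / ingest_rate,
        footprint_mb=queue_size * batch_size_kb / 1000,
        saved=saved,
        overflow=arrived - saved,
    )


if __name__ == "__main__":
    for size in (10_000, 15_000, 20_000, 25_000):
        one_h = outage_outcome(size, 3600)
        two_h = outage_outcome(size, 7200)
        print(size, one_h.fill_seconds, one_h.footprint_mb,
              one_h.overflow, two_h.overflow)
```

The model ignores the 1 000-batch primary queue and any per-batch storage overhead on the PVC, so real footprints may be somewhat larger.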

Can the failover drain be rate-limited? — Yes.

The collector exposes several knobs to throttle how fast the Failover pipeline exports its backlog after the aggregator returns:

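sending_queue knobs: num_consumers (worker count, default 10) bounds export parallelism, so it scales the drain rate roughly linearly until the backend or network becomes the bottleneck; halving it roughly halves throughput.
Pipeline processors: a rate_limiter processor placed before the exporter clamps the rate at which data enters the exporter queue. tail_sampling reduces volume by discarding traces rather than pacing them, and transform only rewrites telemetry, so neither acts as a rate limiter by itself. Processors sit upstream of the sending_queue, so they shape what enters the queue, not how fast an existing backlog leaves it.
Memory limiter: memory_limiter (commonly run as a processor) refuses new data when collector memory is high; this relieves pressure by slowing ingest and affects drain only indirectly.
Batch + timeout: a batch processor with a larger timeout and a larger send_batch_size sends fewer, larger requests, which reduces effective req/s to the backend; a smaller send_batch_size has the opposite effect.
Exporter settings: otlp exporter's sending_queue.num_consumers, and client compression / max_request_size all shape throughput.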
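
As a rough mental model for the num_consumers knob, the achievable drain rate can be estimated as workers divided by per-request latency. This is only a sketch under stated assumptions: each consumer exports one batch at a time, the backend is the only bottleneck, and the 0.2 s request latency is a hypothetical value chosen so that 10 consumers yield the 50 batches/s used throughout this page. `estimated_drain_rate` is an illustrative helper, not a collector function.

```python
def estimated_drain_rate(num_consumers: int, request_latency_s: float) -> float:
    """Upper bound on batches/s when each consumer exports one batch at a time."""
    if num_consumers <= 0:
        raise ValueError("num_consumers must be positive")
    if request_latency_s <= 0:
        raise ValueError("request_latency_s must be positive")
    return num_consumers / request_latency_s


if __name__ == "__main__":
    HYPOTHETICAL_LATENCY_S = 0.2  # assumed, not measured
    for consumers in (5, 10, 20):
        rate = estimated_drain_rate(consumers, HYPOTHETICAL_LATENCY_S)
        print(f"{consumers} consumers -> ~{rate:.0f} batches/s")
```

In practice latency usually rises as parallelism grows, so measured throughput should be expected to fall short of this linear estimate.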

The matrix below shows effective drain performance at full rate, at 2× and 3× rate, and at 1/2, 1/3, and 1/4 rate. The key subtlety is that live ingest is still arriving at 50 batches/s, so only a drain rate greater than ingest actually shrinks the backlog.

Failover drain rate vs live ingest

Once the aggregator returns, the failover exporter starts draining. If the exporter is rate-limited, the effective backlog reduction per second = drain_rate − ingest_rate. Any configuration where drain ≤ ingest means the backlog never shrinks while producers are live — catch-up only happens when ingest falls below the drain rate.

Drain rate with and without rate limiting · live ingest = 50 batches/s
Assumes a 25 000-batch backlog carried out of a 1-hour outage (worst-case saved amount). “Catch-up time” is measured while live ingest continues at 50/s.
Drain config	Effective drain rate	Net drain (drain − 50/s ingest)	Catch-up time for 25 000-batch backlog	Notes
`Full rate` num_consumers ≥ 10 · no limit	~50 batches/s	0 / s (break-even)	never, while live ingest = 50/s. The backlog only drains once ingest dips; if ingest falls to 25/s, catch-up ≈ 16 m 40 s.	Break-even default. Adequate once traffic drops to its nightly low.
`2× rate` num_consumers doubled · higher parallelism	~100 batches/s	+50 / s	~8 m 20 s	If the backend/network can keep up, this is the fastest safe catch-up.
`3× rate` aggressive drain · bursty	~150 batches/s	+100 / s	~4 m 10 s	Fastest — risk of saturating aggregator / inducing a second outage.
`Half rate` num_consumers halved · rate_limiter 25/s	~25 batches/s	−25 / s (backlog grows)	never; the backlog grows by 25/s until queue_size is reached. A queue that is already full at recovery overflows at 25/s immediately, so the same drop/block behaviour continues.	Only safe if ingest is simultaneously throttled (rate_limiter on ingress).
`Third rate` rate_limiter ~16.7/s	~16.7 batches/s	−33.3 / s (backlog grows faster)	never; the backlog grows by ~33.3/s until queue_size is reached	Hazardous with live ingest; usable only during post-incident off-hours.
`Quarter rate` rate_limiter ~12.5/s	~12.5 batches/s	−37.5 / s	never; the backlog grows by 37.5/s until queue_size is reached	Effectively non-draining under live load; avoid unless ingest is paused.
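
The catch-up column is the backlog divided by the net drain rate (drain − ingest). A minimal sketch of that calculation, assuming constant rates; `catch_up_seconds` is an illustrative name, and it returns `None` for the "never" rows where drain does not exceed ingest.

```python
from typing import Optional


def catch_up_seconds(backlog: int, drain_rate: float,
                     ingest_rate: float = 50.0) -> Optional[float]:
    """Seconds to clear the backlog while live ingest continues, or None if it never clears."""
    net = drain_rate - ingest_rate
    if net <= 0:
        return None  # break-even or growing backlog
    return backlog / net


if __name__ == "__main__":
    BACKLOG = 25_000  # worst-case saved amount from the table caption
    for label, drain in (("full", 50.0), ("2x", 100.0), ("3x", 150.0), ("half", 25.0)):
        print(label, catch_up_seconds(BACKLOG, drain))
    # Full-rate drain once ingest dips to 25 batches/s
    print("full, ingest 25/s", catch_up_seconds(BACKLOG, 50.0, ingest_rate=25.0))
```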

Practical guidance

Size PVC for outage duration, not loss tolerance. A 1 h outage at 50 batches/s = 180 000 batches = ~9 GB. No queue_size in this table buffers a full 1 h without drops or blocking.
Prefer block_on_overflow = true if OTLP clients can buffer; prefer false if they can't and newest-data loss is acceptable.
For safe catch-up, configure drain rate greater than steady ingest. At 50 batches/s sustained, target ≥ 100 batches/s drain until the backlog is gone — then revert to a conservative rate to protect the aggregator.
Combine drain limiting with ingest limiting. Capping the failover pipeline's ingress below the drain rate (for example with a rate_limiter processor on ingress, not just on the exporter) is the only way a below-ingest drain rate actually reduces backlog.
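
The sizing guidance above can be turned into a small calculator. This is a sketch under the page's assumptions (constant ingest, ~50 KB per batch, decimal GB); the helper names and the optional `headroom_pct` parameter are hypothetical, and real PVC sizing should also allow for storage overhead that is not modelled here.

```python
def queue_size_for_outage(outage_seconds: int, ingest_rate: int = 50,
                          headroom_pct: int = 0) -> int:
    """Batches the failover queue must hold to absorb the whole outage."""
    needed = ingest_rate * outage_seconds
    # Integer ceiling of needed * (100 + headroom_pct) / 100
    return (needed * (100 + headroom_pct) + 99) // 100


def pvc_gb(queue_size: int, batch_size_kb: int = 50) -> float:
    """Approximate on-disk size of a full queue in decimal GB."""
    return queue_size * batch_size_kb / 1_000_000


def drain_rate_for_deadline(backlog: int, deadline_seconds: float,
                            ingest_rate: float = 50.0) -> float:
    """Drain rate needed to clear the backlog by the deadline under live ingest."""
    return ingest_rate + backlog / deadline_seconds


if __name__ == "__main__":
    size = queue_size_for_outage(3600)
    print(size, pvc_gb(size))                    # full 1 h outage
    print(drain_rate_for_deadline(25_000, 500))  # clear 25 k in 8 m 20 s
```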