Queuing Behaviour of Otel Collector

OpenTelemetry Collector — sending_queue behaviour during a backend outage

Single pipeline, in-memory queue only — how queue_size (250 vs 1000) and block_on_overflow (true / false) change what happens when the aggregator is unreachable

How the sending_queue works
Every exporter has an in-memory sending_queue that buffers outgoing batches while the exporter is busy or the backend is slow.

When the backend is healthy, batches flow through quickly and the queue stays near-empty.

When the backend is down, the exporter retries with exponential backoff (default max_elapsed_time 5 min) and the queue fills at the incoming rate. Two knobs govern what happens next:

queue_size — capacity in batches (default 1000).
block_on_overflow — when full, drop (false, default) or block the caller (true).
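A minimal exporter sketch showing these knobs (values are illustrative; block_on_overflow is a newer sending_queue field, so verify the exact option names against your collector version):

  exporters:
    otlp:
      endpoint: aggregator.observability.svc:4317   # hypothetical aggregator address
      sending_queue:
        enabled: true
        num_consumers: 10          # workers draining the queue
        queue_size: 1000           # capacity in batches (default)
        block_on_overflow: false   # false = drop when full (default); true = block the caller
      retry_on_failure:
        enabled: true
        max_elapsed_time: 5m       # default; retries for an item stop (and it is dropped) after this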
[Diagram: telemetry sources (applications, pods, host metrics, logs, traces) in the Son Testing K8s cluster send OTLP (gRPC/HTTP) into a single-pipeline OpenTelemetry Collector — otlp receiver → processors (batching, resource detection, transformation; memory_limiter can refuse data here if downstream stalls) → otlp exporter. The exporter is a producer-consumer: ingest fills the in-memory sending_queue, workers drain it to the backend; queue_size and block_on_overflow govern capacity and overflow; retry_on_failure uses exponential backoff with max_elapsed_time 5m; watch otelcol_exporter_queue_size / _capacity and otelcol_exporter_enqueue_failed_*. The exporter sends to the Otel Aggregator Service (OTLP receiver + processing) in the Observability Dev K8s cluster, which routes to Mimir (metrics), Loki (logs) and Tempo (traces). During a backend outage the aggregator is unreachable, retries fail and the queue fills.]

Important Notes
  • No file_storage — the queue is RAM only; a Collector restart loses everything still in it.
  • queue_size is measured in batches (default sizer); pick it from peak throughput × outage-budget (worked example below).
  • Retries stop after max_elapsed_time (default 5 min) — the oldest items are then dropped even if the queue has room.
  • block_on_overflow = true trades drops for backpressure: receivers slow, clients retry, upstream may buffer or drop.
  • Watch otelcol_exporter_queue_size vs _capacity and enqueue_failed_*.
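A quick way to size the queue from the notes above: queue_size ≈ peak ingest (batches/s) × outage budget (s). At the 10 batches/s assumed throughout this page, a 25 s budget gives 10 × 25 = 250 batches and a 100 s budget gives 10 × 100 = 1000 batches, which are exactly the two sizes compared below.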
1 NORMAL OPERATION (backend healthy)
  • Batches are enqueued and workers drain them immediately.
  • Queue depth stays near zero — the buffer is only ever a few items deep.
  • queue_size and block_on_overflow have no observable effect.
[Diagram: ingest → sending_queue (depth ~1 / 1000, near-empty) → backend ✓]
Result: Steady state, no drops, low memory.
2 OUTAGE STARTS — QUEUE FILLS (backend unreachable, retries in flight)
  • Exports fail; the exporter retries with exponential backoff.
  • New batches keep arriving and are enqueued instead of sent — depth climbs at the incoming rate.
  • Time-to-full depends on queue_size:
[Diagram: at 10 batches/s, queue_size = 250 fills in ~25 s; queue_size = 1000 fills in ~100 s. Same ingest rate → a larger queue buys more time before overflow.]
Result: No loss yet — clock is ticking against queue capacity.
3a OVERFLOW · block_on_overflow = false (default — queue is full, new data is dropped)
  • Enqueue returns failure immediately; the incoming batch is dropped at the exporter and never reaches retry logic.
  • Logs: "Dropping data because sending_queue is full"; metric otelcol_exporter_enqueue_failed_* increments.
  • Upstream (receiver, client) is unaffected — ingest continues at full rate, but the collector silently sheds load.
[Diagram: ingest → sending_queue FULL (250 or 1000) → dropped, enqueue_failed++ · backend ✕]
Result: Data loss. Collector stays responsive; sources keep streaming.
3b OVERFLOW · block_on_overflow = true (queue is full, caller is blocked until space frees)
  • The call into the exporter blocks; if capacity frees before the caller’s timeout, the item is still enqueued.
  • Backpressure propagates: processors pause → receiver slows → OTLP clients see timeouts/errors and retry or buffer themselves.
  • Trades exporter-side drops for upstream-side drops or latency. Good when upstream can hold data (SDK retry) or slow down.
[Diagram: ingest ← backpressure · sending_queue FULL, caller waits · backend ✕]
Result: No exporter drops, but upstream slows or refuses — loss moves out of the collector.
Behaviour matrix — what the collector does when the aggregator is down. Assumes steady ingest of 10 batches/s and default retry_on_failure.max_elapsed_time = 5m. Times are illustrative — real throughput depends on batch size and processor load.

  • queue_size 250 · block_on_overflow false · time to fill ~25 seconds (250 / 10) · memory cost: low (~250 batches in RAM)
      On overflow: exporter drops incoming batches; enqueue_failed_* increments.
      Effect upstream (receiver / client): none — ingest continues at full rate; the loss is invisible to producers.
      Trade-off · when to choose: cheap, safe from OOM, but tiny outage budget. Fine when SDK retry is strong or some data loss is acceptable.

  • queue_size 250 · block_on_overflow true · time to fill ~25 seconds, then stalls · memory cost: low (~250 batches in RAM)
      On overflow: caller blocks waiting for space; no exporter-side drops.
      Effect upstream: backpressure hits processors & receivers quickly (~25 s in). OTLP clients see timeouts, will retry or drop.
      Trade-off · when to choose: push loss upstream fast. Best when producers can buffer or are explicitly designed to slow down.

  • queue_size 1000 · block_on_overflow false (default) · time to fill ~100 seconds (1000 / 10) · memory cost: ~4× row 1 (~1000 batches)
      On overflow: same as row 1 once full — drops at the exporter, enqueue_failed_* rises.
      Effect upstream: none — ingest unaffected; the larger buffer just delays when drops start.
      Trade-off · when to choose: common default. Rides short blips; outages > ~100 s (or > the 5 min retry window) still lose the oldest data.

  • queue_size 1000 · block_on_overflow true · time to fill ~100 seconds, then stalls · memory cost: ~4× row 2 (~1000 batches)
      On overflow: caller blocks; no exporter-side drops until the retry window lapses on old items.
      Effect upstream: backpressure arrives later (~100 s) but just as hard — and lasts until the backend recovers.
      Trade-off · when to choose: maximum in-memory durability without a disk queue. Risk: OOM if the outage outlasts memory headroom.
OpenTelemetry Collector — file_storage persistent queue behaviour during a backend outage

Two-tier queue: in-memory sending_queue in front of a persistent file_storage extension on a PVC — how queue_size (15 000 vs 25 000) and block_on_overflow (false / true) change what happens when the aggregator is unreachable

How the two-tier queue works
The exporter still has a small in-memory sending_queue for worker handoff; behind it, the file_storage extension persists every enqueued batch to a PVC before the producer is acknowledged.

Write path: producer → WAL on disk → ack. Read path: workers consume from disk, attempt export, ack + delete on success, retry on failure.

Two knobs set behaviour during a backend outage:

queue_size — capacity in batches (here 15 000 or 25 000). At 10 batches/s that’s ~25 min or ~42 min of buffer.
block_on_overflow — when disk queue is full, drop (false) or block caller (true).

Key benefit vs in-memory: on Collector restart the queue persists. Work already on disk resumes draining when the backend recovers.
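A minimal sketch of the two-tier configuration, wiring the file_storage extension into the exporter's queue (paths and sizes are illustrative; verify field names against your collector version):

  extensions:
    file_storage:
      directory: /var/lib/otelcol        # mounted from the PVC
      compaction:
        on_start: true
        directory: /var/lib/otelcol      # compaction scratch space on the same volume

  exporters:
    otlp:
      endpoint: aggregator.observability.svc:4317   # hypothetical aggregator address
      sending_queue:
        enabled: true
        storage: file_storage            # persist the queue through the extension
        queue_size: 15000                # ~25 min of buffer at 10 batches/s
        block_on_overflow: false
      retry_on_failure:
        enabled: true
        max_elapsed_time: 0              # 0 = never give up; items leave the queue only on success

  service:
    extensions: [file_storage]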
[Diagram: same topology, but the otlp exporter uses a two-tier queue — the in-memory sending_queue hands off to the file_storage extension on a PVC (/var/lib/otelcol; ~750 MB at 15 k batches, ~1.25 GB at 25 k), queue_size 15 k / 25 k, persists across restarts; workers drain the disk queue to the backend with retry_on_failure (exponential backoff, max_elapsed_time = ∞); watch otelcol_exporter_queue_size / _capacity and otelcol_exporter_enqueue_failed_*. During a backend outage the aggregator is unreachable, retries fail and the file queue fills.]

Important Notes
  • file_storage is a Collector extension; attach it to the exporter’s sending_queue.storage.
  • Queue survives restarts & crashes. On startup, workers resume from where they left off.
  • Every enqueue is a disk write — expect IOPS & fsync cost. Use a fast PVC class; avoid network storage with unpredictable latency.
  • Size the PVC 1.5–2× the batch-count budget for compaction headroom. Monitor kubelet_volume_stats_used_bytes.
  • If PVC fills, enqueue fails the same way queue_size limits do — block_on_overflow still decides drop vs block.
  • Single-writer by design — bind to a StatefulSet, not a Deployment; one Pod per PVC (see the sketch after these notes).
  • Backpressure still possible at the memory tier in front of it; the memory queue is small and just a handoff buffer.
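A hypothetical StatefulSet fragment illustrating the one-Pod-per-PVC binding referenced above (names, image tag and sizes are assumptions, not a tested manifest):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: otel-collector                  # hypothetical name
  spec:
    serviceName: otel-collector
    replicas: 1                           # single writer per PVC
    selector:
      matchLabels: {app: otel-collector}
    template:
      metadata:
        labels: {app: otel-collector}
      spec:
        containers:
          - name: otel-collector
            image: otel/opentelemetry-collector-contrib:latest   # assumption; pin a specific version in practice
            volumeMounts:
              - name: queue
                mountPath: /var/lib/otelcol   # the file_storage directory
    volumeClaimTemplates:
      - metadata:
          name: queue
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: fast-ssd          # assumption: a low-latency storage class
          resources:
            requests:
              storage: 2Gi                    # ~1.5–2× the 1.25 GB budget for compaction headroom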
1 NORMAL OPERATION (backend healthy)
  • Batches are written to the WAL, then drained immediately by workers; the PVC holds only a small working set.
  • Disk queue depth hovers near zero. Write amplification is roughly 1× enqueue-write + 1× delete-marker per batch.
  • queue_size & block_on_overflow have no observable effect.
[Diagram: ingest → sending_queue handoff → file_storage (depth ~10 / 25 000, near-empty) → backend ✓]
Result: Steady state. Disk writes are absorbed; no drops; tiny PVC footprint.
2 queue_size = 15 000 · block_on_overflow = false buffers ~25 min of outage, then drops
  • Backend unreachable; retries fail; every batch persists to disk. Queue climbs at 10 batches/s — full in ~1 500 s (~25 min).
  • After overflow, enqueue returns failure and newest batches are dropped at the exporter; enqueue_failed_* climbs. Already-queued data still waits on disk.
  • PVC footprint ≈ 750 MB (15 000 × ~50 KB).
[Diagram: file_storage depth (queue_size = 15 000) reaches full in ~25 min @ 10/s; ingest → sending_queue handoff → file_storage FULL (15 000) → dropped, enqueue_failed++ · backend ✕]
Result: Newest telemetry lost after 25 min; older queued data survives & will flush when backend returns.
3 queue_size = 15 000 · block_on_overflow = true buffers ~25 min, then blocks the caller
  • Same fill behaviour — full in ~25 min at 10 batches/s.
  • Once full, the exporter blocks until the backend drains an item from disk. Backpressure propagates: processors pause → receiver slows → OTLP clients see timeouts and retry or buffer themselves.
  • Good when producers can hold data; risky when upstream has no buffer of its own.
[Diagram: file_storage depth (queue_size = 15 000) full in ~25 min, then stalls; ingest ← backpressure · sending_queue caller waits · file_storage FULL, blocked · backend ✕]
Result: No exporter drops; loss moves upstream to OTLP clients. PVC still ~750 MB.
4 queue_size = 25 000 · block_on_overflow = false buffers ~42 min of outage, then drops
  • Same mechanics as scenario 2 — bigger headroom. Queue fills in ~2 500 s (~42 min) at 10 batches/s.
  • After overflow, new batches are dropped at the exporter; older batches on disk continue waiting and will flush when backend returns.
  • PVC footprint ≈ 1.25 GB (25 000 × ~50 KB) — plus compaction headroom.
[Diagram: file_storage depth (queue_size = 25 000) full in ~42 min @ 10/s; ingest → sending_queue handoff → file_storage FULL (25 000) → dropped, enqueue_failed++ · backend ✕]
Result: Bigger outage budget (42 min) for the cost of 1.25 GB disk; past that, drops resume.
5 queue_size = 25 000 · block_on_overflow = true buffers ~42 min, then blocks the caller
  • Same fill curve — full in ~42 min. This is the strongest durability configuration without sharding.
  • Once full, the exporter blocks; upstream sees timeouts. If outage outlasts ~42 min and clients can’t hold data, loss happens at producers.
  • Risk: PVC must comfortably hold > 1.25 GB — watch for compaction and fsync pressure.
[Diagram: file_storage depth (queue_size = 25 000) full in ~42 min, then stalls; ingest ← backpressure · sending_queue caller waits · file_storage FULL, blocked · backend ✕]
Result: Maximum durability; collector stays bounded; loss pushed to producers after ~42 min.
6 RESTART DURING OUTAGE — QUEUE SURVIVES (the core reason to use file_storage)
  • Collector Pod crashes / is rolled / is OOM-killed mid-outage. In-memory queues would lose everything buffered.
  • With file_storage, the PVC is re-mounted on the new Pod. Exporter reads WAL head offset and resumes draining the same batches.
  • Only loss: anything in the small memory queue at the moment of crash (handoff window).
[Diagram: file_storage depth before and after restart — ~12 000 batches on disk before the Pod restart; the PVC is re-attached; ~12 000 batches are preserved on the new Pod and workers resume draining to the backend when it is up.]
Result: Near-zero loss across restarts. The defining advantage over the in-memory queue.
Behaviour matrix — in-memory only vs file_storage persistent queue, during a backend outage. Assumes steady ingest of 10 batches/s, ~50 KB per batch, default retry_on_failure. Times and disk sizes are illustrative — real throughput depends on batch size and processor load.

sending_queue in-memory only (no file_storage)
  • queue_size 250 · block_on_overflow false · time to fill ~25 seconds · survives restart: no — RAM-only · storage cost: ~250 batches in RAM (~12 MB)
      On overflow: exporter drops incoming batches; enqueue_failed_* increments.
      Effect upstream: none — ingest continues at full rate; loss invisible to producers.
      Trade-off: cheap, OOM-safe, tiny outage budget. OK when SDK retry is strong.
  • queue_size 250 · block_on_overflow true · time to fill ~25 s, then stalls · survives restart: no — RAM-only · storage cost: ~250 batches in RAM (~12 MB)
      On overflow: caller blocks; no exporter-side drops.
      Effect upstream: backpressure ~25 s in; OTLP clients see timeouts.
      Trade-off: push loss upstream fast; use when producers can buffer.
  • queue_size 1000 · block_on_overflow false (default) · time to fill ~100 seconds · survives restart: no — RAM-only · storage cost: ~1000 batches in RAM (~50 MB)
      On overflow: drops at the exporter; enqueue_failed_* rises.
      Effect upstream: none — ingest unaffected; the larger buffer delays drops.
      Trade-off: common default. Rides short blips; > ~100 s still loses the oldest data.
  • queue_size 1000 · block_on_overflow true · time to fill ~100 s, then stalls · survives restart: no — RAM-only · storage cost: ~1000 batches in RAM (~50 MB)
      On overflow: caller blocks; no exporter-side drops until the retry window lapses.
      Effect upstream: backpressure arrives later but lasts until the backend recovers.
      Trade-off: max in-memory durability; risk of OOM on long outages.

file_storage persistent queue on a PVC (in-memory tier in front)
  • queue_size 15 000 · block_on_overflow false · time to fill ~25 minutes (1 500 s) · survives restart: yes — persists & resumes · storage cost: ~750 MB PVC (plus compaction headroom)
      On overflow: after ~25 min the newest batches are dropped at the exporter; enqueue_failed_* rises. Already-persisted batches still wait on disk and flush when the backend returns.
      Effect upstream: none — ingest continues at full rate; loss invisible to producers.
      Trade-off: good balance, ~25 min of buffer with bounded disk. Accepts tail-drops on long outages.
  • queue_size 15 000 · block_on_overflow true · time to fill ~25 min, then stalls · survives restart: yes — persists & resumes · storage cost: ~750 MB PVC
      On overflow: caller blocks when the disk queue is full; no exporter-side drops.
      Effect upstream: backpressure after ~25 min; OTLP clients retry or drop. Upstream becomes the bottleneck.
      Trade-off: pushes loss upstream; best when producers can hold or slow down.
  • queue_size 25 000 · block_on_overflow false · time to fill ~42 minutes (2 500 s) · survives restart: yes — persists & resumes · storage cost: ~1.25 GB PVC
      On overflow: same mechanics as 15 000 but with ~17 more minutes of headroom before drops begin.
      Effect upstream: none — ingest unaffected for the full ~42 min window.
      Trade-off: larger outage budget at modest disk cost; still bounded drop behaviour.
  • queue_size 25 000 · block_on_overflow true · time to fill ~42 min, then stalls · survives restart: yes — persists & resumes · storage cost: ~1.25 GB PVC (watch compaction / fsync)
      On overflow: caller blocks; no exporter-side drops until the retry window lapses on the oldest items.
      Effect upstream: backpressure after ~42 min; the hardest guarantee without sharding.
      Trade-off: maximum single-node durability. Risk: a slow PVC means slow ingest even when the backend is healthy.
OpenTelemetry Collector with Failover Connector, sending_queue and file_storage

Resilient telemetry flow from Son Testing (Source) to Observability Dev (Aggregator + Backends)

How the Failover Connector Works
The failover connector is an internal component that acts as an exporter for the ingress pipeline and as a receiver for the downstream pipelines. It maintains a prioritized list of destinations and routes telemetry to the first healthy pipeline, periodically retrying higher-priority pipelines and returning traffic to them once they become healthy again.
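A minimal sketch of the connector definition, assuming the contrib failover connector (interval value is illustrative; verify field names against your collector version):

  connectors:
    failover:
      priority_levels:
        - [traces/primary]      # level 1: preferred pipeline
        - [traces/failover]     # level 2: used while level 1 is unhealthy
      retry_interval: 30s       # how often higher-priority levels are re-tried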
[Diagram: telemetry sources send OTLP (gRPC/HTTP) into a single Collector process with multiple pipelines. The ingress pipeline (otlp receiver) exports to the failover connector, which routes to the first healthy pipeline by priority — 1. primary (highest priority), 2. failover — retrying higher priorities on retry_interval 30s (configurable). The primary pipeline is the fast path with an in-memory sending_queue only; the failover pipeline is the durable buffer path with sending_queue + file_storage (write-ahead log on disk). Both pipelines target the same Otel Aggregator endpoint, which processes and routes to Mimir (metrics), Loki (logs) and Tempo (traces).]

Important Notes
  • The connector does not buffer or persist data — it only routes based on health.
  • Buffering and persistence are provided by the exporters (sending_queue and file_storage) in each pipeline.
  • Both pipelines send to the same aggregator endpoint (the pipeline wiring is sketched after these notes).
  • After recovery, the connector routes new traffic to Primary while Failover drains backlog.
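A matching sketch of the pipeline wiring (pipeline and exporter names are assumptions): the connector is the exporter of the ingress pipeline and the receiver of both downstream pipelines, and only the failover exporter's queue is backed by file_storage.

  service:
    pipelines:
      traces/ingress:
        receivers: [otlp]
        processors: [batch]
        exporters: [failover]            # the connector, acting as this pipeline's exporter
      traces/primary:
        receivers: [failover]            # the connector, acting as this pipeline's receiver
        exporters: [otlp/primary]        # in-memory sending_queue only
      traces/failover:
        receivers: [failover]
        exporters: [otlp/failover]       # sending_queue backed by the file_storage extension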
1 NORMAL OPERATION (Aggregator is Healthy)
  • Connector routes new incoming telemetry to the Primary Pipeline (highest priority, healthy).
  • Primary exporter sends data directly to the Otel Aggregator.
  • Failover pipeline is idle but ready.
[Diagram: ingress → failover connector → Primary (active); Failover idle.]
Result: Low latency, normal flow.
2 BACKEND OUTAGE (2 HOURS) (Aggregator Unavailable)
  • The Primary exporter’s sending_queue starts buffering in memory. Once it hits its limits or export errors surface, the pipeline reports failure to the connector.
  • Connector marks Primary as unhealthy and switches traffic to the Failover pipeline.
  • Failover exporter’s sending_queue + file_storage buffer and persist telemetry safely for the duration of the outage.
[Diagram: aggregator down — ingress → failover connector; Primary marked unhealthy; Failover active and buffering.]
Result: No data loss (within capacity). Live traffic goes to failover; data is durable on disk.
3 BACKEND RECOVERY — BOTH SEND (Aggregator is Healthy Again)
  • Connector retries Primary on retry_interval (e.g., 30s).
  • When Primary is healthy again, connector routes new incoming telemetry back to Primary.
  • Failover pipeline continues sending its buffered backlog (from memory and disk) until fully drained.
  • Both pipelines may send simultaneously for a period:
      – Primary sends fresh/new telemetry
      – Failover sends older/buffered telemetry
[Diagram: ingress → failover connector → Primary (new data); Failover draining backlog.]
Result: System returns to normal while ensuring no data loss during the outage window.
Key Components
  • Failover Connector: health-based routing between pipelines (no storage).
  • Primary Pipeline: fast path to the backend with an in-memory queue only.
  • Failover Pipeline: buffering path with an in-memory queue + persistent file_storage.
  • Otel Aggregator: central ingest and routing to Mimir (metrics), Loki (logs), Tempo (traces).
OpenTelemetry Collector · Failover Connector — Primary & Failover Queue Matrix

The Primary pipeline ships live to the aggregator; when the failover connector trips, the Failover pipeline’s two-tier queue (sending_queue + file_storage on a PVC) takes over. The matrix shows how a queue_size of 10 000 / 15 000 / 20 000 / 25 000 behaves during 1-hour and 2-hour aggregator outages, at both block_on_overflow settings.

  • Ingest rate: 50 batches/s
  • Batch size: ~50 KB/batch
  • Primary queue: 1 000 batches · RAM
  • Failover storage: file_storage on PVC
  • Drain rate (healthy): 50 batches/s (= ingest)
  • retry_on_failure: on · max_elapsed_time = ∞

Primary + Failover queue behaviour — 1 h and 2 h aggregator outage

The Primary pipeline’s 1 000-batch in-memory queue covers only ~20 s before the failover connector routes traffic to the Failover pipeline. All long-outage behaviour (and all loss or backpressure) is governed by the Failover queue knobs below.

Failover queue configuration · 1-hour and 2-hour outage outcome. Rows compare queue_size (10 k → 25 k) × block_on_overflow (false / true). Times to fill assume 50 batches/s sustained ingest; drain times assume the backend returns at full health and runs at break-even (50 batches/s) while live ingest continues.

primary_queue — in-memory only · 1 000 batches · ~20 s of buffer
  • queue_size 1 000 · block_on_overflow false (default) · time to fill ~20 s · storage: RAM, ~50 MB
      1 h and 2 h outage outcome: after the ~20 s primary buffer is exhausted, the failover connector trips and traffic is routed to the Failover pipeline for the duration of both scenarios.
      Drain after recovery: instant — primary drains as soon as the backend is healthy; the rejoin window is negligible.
      Recommended when: always — it is the “healthy path” before failover engages.

failover_queue — sending_queue + file_storage on PVC · varies by row
  • queue_size 10 000 · block_on_overflow false · time to fill ~3 m 20 s (200 s) · PVC footprint ~500 MB
      1-hour outage (180 000 batches arrive): saved 10 000, dropped ~170 000. The queue fills in 3 m 20 s; for the remaining ~56 m the newest batches are dropped at the exporter and enqueue_failed_* rises steadily.
      2-hour outage (360 000 batches arrive): saved 10 000, dropped ~350 000. Same pattern; the drop window extends to ~1 h 57 m.
      Drain after recovery: ~3 m 20 s (draining the 10 k backlog while live ingest continues at 50/s — see the drain-rate note below).
      Recommended when: tight PVC budget; short-outage tolerance only; bursty workload where newest-data loss is acceptable.
  • queue_size 10 000 · block_on_overflow true · time to fill ~3 m 20 s, then stalls · PVC footprint ~500 MB
      1-hour outage: saved 10 000, blocked upstream ~170 000. After fill the exporter blocks; backpressure propagates to the receiver and OTLP clients, which retry or drop themselves.
      2-hour outage: saved 10 000, blocked upstream ~350 000. Upstream bears the cost for ~1 h 57 m.
      Drain after recovery: ~3 m 20 s.
      Recommended when: producers can hold or retry themselves; the collector must never drop.
  • queue_size 15 000 · block_on_overflow false · time to fill ~5 m (300 s) · PVC footprint ~750 MB
      1-hour outage: saved 15 000, dropped ~165 000. Queue fills in 5 m; dropping window ~55 m.
      2-hour outage: saved 15 000, dropped ~345 000. Dropping window ~1 h 55 m.
      Drain after recovery: ~5 m.
      Recommended when: moderate PVC budget; balances buffering with disk cost.
  • queue_size 15 000 · block_on_overflow true · time to fill ~5 m, then stalls · PVC footprint ~750 MB
      1-hour outage: saved 15 000, blocked upstream ~165 000.
      2-hour outage: saved 15 000, blocked upstream ~345 000.
      Drain after recovery: ~5 m.
      Recommended when: upstream can tolerate ~55 m of backpressure on a 1 h outage.
  • queue_size 20 000 · block_on_overflow false · time to fill ~6 m 40 s (400 s) · PVC footprint ~1.0 GB
      1-hour outage: saved 20 000, dropped ~160 000. Dropping window ~53 m.
      2-hour outage: saved 20 000, dropped ~340 000. Dropping window ~1 h 53 m.
      Drain after recovery: ~6 m 40 s.
      Recommended when: sweet spot when PVC cost and headroom both matter.
  • queue_size 20 000 · block_on_overflow true · time to fill ~6 m 40 s, then stalls · PVC footprint ~1.0 GB
      1-hour outage: saved 20 000, blocked upstream ~160 000.
      2-hour outage: saved 20 000, blocked upstream ~340 000.
      Drain after recovery: ~6 m 40 s.
      Recommended when: strong durability is needed and upstream is resilient to ~53 m of backpressure on a 1 h outage.
  • queue_size 25 000 · block_on_overflow false · time to fill ~8 m 20 s (500 s) · PVC footprint ~1.25 GB
      1-hour outage: saved 25 000, dropped ~155 000. Dropping window ~51 m 40 s.
      2-hour outage: saved 25 000, dropped ~335 000. Dropping window ~1 h 51 m.
      Drain after recovery: ~8 m 20 s.
      Recommended when: maximum headroom with bounded drop behaviour; watch PVC compaction.
  • queue_size 25 000 · block_on_overflow true · time to fill ~8 m 20 s, then stalls · PVC footprint ~1.25 GB
      1-hour outage: saved 25 000, blocked upstream ~155 000.
      2-hour outage: saved 25 000, blocked upstream ~335 000.
      Drain after recovery: ~8 m 20 s.
      Recommended when: strongest single-node durability; requires resilient upstream.

Can the failover drain be rate-limited? — Yes.

The collector exposes several knobs to throttle how fast the Failover pipeline exports its backlog after the aggregator returns:

  • sending_queue knobs: num_consumers (worker count) directly scales the drain rate — halving it halves throughput. The default is 10 workers (see the sketch after this list).
  • Pipeline processors: insert a rate-limiting, sampling, or transform processor before the exporter to clamp the outbound rate or volume.
  • memory_limiter: the memory_limiter processor can refuse data when the collector is under memory pressure, indirectly throttling what reaches the exporter.
  • Batch + timeout: a batch processor with a larger timeout and larger send_batch_size sends fewer, bigger requests, reducing the effective request rate to the backend.
  • Exporter settings: the otlp exporter’s sending_queue.num_consumers, plus client compression and request-size limits, all shape throughput.
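A sketch of a throttled failover exporter under the assumptions above: half the default workers plus larger batches. This shapes request rate and parallelism rather than guaranteeing an exact batches/s figure; names and values are illustrative.

  processors:
    batch/failover:
      send_batch_size: 2048     # larger batches mean fewer requests per second
      timeout: 10s

  exporters:
    otlp/failover:
      endpoint: aggregator.observability.svc:4317   # hypothetical aggregator address
      compression: gzip
      sending_queue:
        storage: file_storage
        queue_size: 25000
        num_consumers: 5        # half the default 10 workers, roughly half the drain parallelism

  service:
    pipelines:
      traces/failover:
        receivers: [failover]
        processors: [batch/failover]
        exporters: [otlp/failover]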

The matrix below shows effective drain performance at full rate, at accelerated rates (2× and 3×), and at reduced rates (1/2, 1/3 and 1/4) — the key subtlety is that live ingest is still arriving at 50 batches/s, so only a drain rate greater than ingest actually shrinks the backlog.

Failover drain rate vs live ingest

Once the aggregator returns, the failover exporter starts draining. If the exporter is rate-limited, the effective backlog reduction per second = drain_rate − ingest_rate. Any configuration where drain ≤ ingest means the backlog never shrinks while producers are live — catch-up only happens when ingest falls below the drain rate.

Drain rate with and without rate limiting · live ingest = 50 batches/s. Assumes a 25 000-batch backlog carried out of a 1-hour outage (worst-case saved amount). “Catch-up time” is measured while live ingest continues at 50/s.

  • Full rate (num_consumers ≥ 10 · no limit) · effective drain ~50 batches/s · net drain 0 / s (break-even) · catch-up: never while live ingest = 50/s
      Only drains once ingest dips; if ingest falls to 25/s, catch-up ≈ 16 m 40 s. Break-even default; good enough once traffic is nightly-low.
  • 2× rate (num_consumers doubled · higher parallelism) · ~100 batches/s · net drain +50 / s · catch-up ~8 m 20 s
      If the backend/network can keep up, this is the fastest safe catch-up.
  • 3× rate (aggressive drain · bursty) · ~150 batches/s · net drain +100 / s · catch-up ~4 m 10 s
      Fastest — risk of saturating the aggregator / inducing a second outage.
  • Half rate (num_consumers halved · rate_limiter 25/s) · ~25 batches/s · net drain −25 / s (backlog grows) · catch-up: never — backlog grows by 25/s
      The PVC fills again at 25/s × the remaining outage budget; the same drop/block behaviour returns. Only safe if ingest is simultaneously throttled (rate_limiter on ingress).
  • Third rate (rate_limiter ~16.7/s) · ~16.7 batches/s · net drain −33.3 / s (backlog grows faster) · catch-up: never — backlog grows by ~33/s
      Hazardous with live ingest; usable only during post-incident off-hours.
  • Quarter rate (rate_limiter ~12.5/s) · ~12.5 batches/s · net drain −37.5 / s · catch-up: never — backlog grows by 37.5/s
      Effectively non-draining under live load — avoid unless ingest is paused.

Practical guidance

  • Size PVC for outage duration, not loss tolerance. A 1 h outage at 50 batches/s = 180 000 batches = ~9 GB. No queue_size in this table buffers a full 1 h without drops or blocking.
  • Prefer block_on_overflow = true if OTLP clients can buffer; prefer false if they can’t and newest-data loss is acceptable.
  • For safe catch-up, configure drain rate greater than steady ingest. At 50 batches/s sustained, target ≥ 100 batches/s drain until the backlog is gone — then revert to a conservative rate to protect the aggregator.
  • Combine drain limiting with ingest limiting. A rate_limiter processor on the failover pipeline’s ingress (not just the exporter) is the only way a below-ingest drain rate actually reduces backlog.