OpenTelemetry Collector — sending_queue behaviour during a backend outage
Single pipeline, in-memory queue only — how queue_size (250 vs 1000) and block_on_overflow (true / false) change what happens when the aggregator is unreachable
Behaviour matrix — what the collector does when the aggregator is down. Assumes steady ingest of 10 batches/s and default retry_on_failure.max_elapsed_time = 5m. Times are illustrative — real throughput depends on batch size and processor load.

| queue_size | block_on_overflow | Time to fill @ 10/s | Behaviour when full | Backpressure | Memory footprint | Recommended when… |
| --- | --- | --- | --- | --- | --- | --- |
| 250 | false (default) | ~25 seconds (250 / 10) | Drops at the exporter; enqueue_failed_* rises. | None — ingest continues at full rate; the loss is invisible to producers. | Low (~250 batches in RAM) | Cheap, safe from OOM, but tiny outage budget. Fine when SDK retry is strong or some data loss is acceptable. |
| 250 | true | ~25 seconds — then stalls | Caller blocks waiting for space; no exporter-side drops. | Backpressure hits processors & receivers quickly (~25 s in). OTLP clients see timeouts, will retry or drop. | Low (~250 batches in RAM) | Push loss upstream fast. Best when producers can buffer or are explicitly designed to slow down. |
| 1000 | false (default) | ~100 seconds (1000 / 10) | Same as row 1 once full — drops at the exporter, enqueue_failed_* rises. | None — ingest unaffected; larger buffer just delays when drops start. | ~4× row 1 (~1000 batches) | Common default. Rides short blips; outages > ~100 s drop the newest data, and the 5 min retry window eventually discards the oldest queued items too. |
| 1000 | true | ~100 seconds — then stalls | Caller blocks; no exporter-side drops until the retry window lapses on old items. | Backpressure arrives later (~100 s) but just as hard — and lasts until the backend recovers. | ~4× row 2 (~1000 batches) | Maximum in-memory durability without a disk queue. Risk: OOM if outage outlasts memory headroom. |
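For reference, all four rows above are expressed through the exporter's sending_queue block. A minimal sketch, assuming a recent collector release where sending_queue supports block_on_overflow (the endpoint is a placeholder):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  otlp:
    endpoint: aggregator.observability.svc:4317   # placeholder aggregator address
    retry_on_failure:
      enabled: true
      max_elapsed_time: 5m      # default retry window assumed by the matrix
    sending_queue:
      enabled: true
      queue_size: 1000          # batches held in RAM (250 vs 1000 in the rows above)
      block_on_overflow: true   # true = stall callers when full; false = drop newest
      num_consumers: 10         # parallel senders draining the queue

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```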
OpenTelemetry Collector — file_storage persistent queue behaviour during a backend outage
Two-tier queue: in-memory sending_queue in front of a persistent file_storage extension on a PVC — how queue_size (15 000 vs 25 000) and block_on_overflow (false / true) change what happens when the aggregator is unreachable
Behaviour matrix — in-memory only vs file_storage persistent queue, during a backend outage. Assumes steady ingest of 10 batches/s, ~50 KB per batch, default retry_on_failure. Times and disk sizes are illustrative — real throughput depends on batch size and processor load.

| queue_size | block_on_overflow | Time to fill @ 10/s | Behaviour when full | Backpressure | Survives restart? | Footprint | Recommended when… |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **in-memory sending_queue only** | | | | | | | |
| 1000 | false (default) | ~100 s | Drops newest at the exporter; enqueue_failed_* rises. | None — ingest unaffected. | no — RAM-only | ~1000 batches in RAM (~50 MB) | Common default. Rides short blips; > ~100 s still drops data. |
| 1000 | true | ~100 s — then stalls | Caller blocks; no exporter-side drops until retry window lapses. | Backpressure arrives later but lasts until backend recovers. | no — RAM-only | ~1000 batches in RAM (~50 MB) | Max in-memory durability; risk of OOM on long outages. |
| **file_storage persistent queue on a PVC (in-memory tier in front)** | | | | | | | |
| 15 000 | false | ~25 minutes (1 500 s) | After 25 min, newest batches are dropped at the exporter; enqueue_failed_* rises. Already-persisted batches still wait on disk and flush when the backend returns. | None — ingest continues at full rate; loss invisible to producers. | yes — persists & resumes | ~750 MB PVC (plus compaction headroom) | Good balance: ~25 min of buffer with bounded disk. Accepts tail-drops on long outages. |
| 15 000 | true | ~25 min — then stalls | Caller blocks when disk queue full; no exporter-side drops. | Backpressure after ~25 min; OTLP clients retry or drop. Upstream becomes the bottleneck. | yes — persists & resumes | ~750 MB PVC | Pushes loss upstream; best when producers can hold or slow down. |
| 25 000 | false | ~42 minutes (2 500 s) | Same mechanics as 15 000 but with ~17 more minutes of headroom before drops begin. | None — ingest unaffected for the full ~42 min window. | yes — persists & resumes | ~1.25 GB PVC | Larger outage budget at modest disk cost; still bounded drop behaviour. |
| 25 000 | true | ~42 min — then stalls | Caller blocks; no exporter-side drops until retry window lapses on oldest items. | Backpressure after ~42 min; hardest guarantee without sharding. | yes — persists & resumes | ~1.25 GB PVC (watch compaction / fsync) | Maximum single-node durability. Risk: slow PVC = slow ingest even when backend is healthy. |
OpenTelemetry Collector with Failover Connector, sending_queue and file_storage
Resilient telemetry flow from Son Testing (Source) to Observability Dev (Aggregator + Backends)
Primary pipeline ships live to the aggregator; when the failover connector trips, the failover pipeline’s two-tier queue (sending_queue → file_storage on a PVC) takes over. The matrix shows how a queue_size of 10 000 / 15 000 / 20 000 / 25 000 behaves during 1-hour and 2-hour aggregator outages, at both block_on_overflow settings.
- Ingest rate: 50 batches/s
- Batch size: ~50 KB/batch
- Primary queue: 1 000 batches · RAM
- Failover storage: file_storage on PVC
- Drain rate (healthy): 50 batches/s (= ingest)
- retry_on_failure: on · max_elapsed_time ∞
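The routing described above hinges on the failover connector sitting between an ingest pipeline and the two delivery pipelines. A sketch of that wiring (pipeline names and the retry cadence are illustrative; consult the failoverconnector README for the full option set):

```yaml
connectors:
  failover:
    priority_levels:
      - [traces/primary]    # level 0: live path to the aggregator
      - [traces/failover]   # level 1: two-tier queue on the PVC
    retry_interval: 30s     # how often to probe whether the primary path is healthy again

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/primary:
      receivers: [failover]
      exporters: [otlp/primary]    # 1 000-batch in-memory sending_queue
    traces/failover:
      receivers: [failover]
      exporters: [otlp/failover]   # sending_queue backed by file_storage
```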
Primary + Failover queue behaviour — 1 h and 2 h aggregator outage
The Primary pipeline’s 1 000-batch in-memory queue covers only ~20 s before the failover connector routes traffic to the Failover pipeline. All long-outage behaviour (and all loss or backpressure) is governed by the Failover queue knobs below.
Failover queue configuration · 1-hour and 2-hour outage outcome. Rows compare queue_size (10 k → 25 k) × block_on_overflow (false / true). Times to fill assume 50 batches/s sustained ingest; drain times assume the recovered backend absorbs the backlog at ~50 batches/s over and above live ingest (at break-even or below the backlog never shrinks; see the drain-rate matrix further down).
| Failover queue_size | block_on_overflow | Time to fill @ 50/s | PVC footprint | 1-hour outage outcome (180 000 batches arrive) | 2-hour outage outcome (360 000 batches arrive) | Drain time after recovery | Recommended when… |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **primary_queue — in-memory only · 1 000 batches · ~20 s of buffer** | | | | | | | |
| 1 000 | false (default) | ~20 s | RAM ~50 MB | After the ~20 s primary buffer is exhausted, the failover connector trips and traffic is routed to the Failover pipeline. | Same; failover carries the full 2 h outage. | Instant — primary drains as soon as backend is healthy; rejoin window is negligible. | Always — it's the "healthy path" before failover engages. |
| **failover_queue — sending_queue + file_storage on PVC · varies by row** | | | | | | | |
| 10 000 | false | ~3 m 20 s (200 s) | ~500 MB | saved: 10 000 · dropped: ~170 000. Queue fills in 3 m 20 s; the remaining ~56 m drops newest batches at the exporter. enqueue_failed_* rises steadily. | saved: 10 000 · dropped: ~350 000. Same pattern; the drop window extends to ~1 h 57 m. | ~3 m 20 s (drain the 10 k backlog while live ingest continues at 50/s — see drain-rate note below) | Tight PVC budget; short-outage tolerance only; bursty workload where newest-data loss is acceptable. |
| 10 000 | true | ~3 m 20 s — then stalls | ~500 MB | saved: 10 000 · blocked upstream: ~170 000. After fill, exporter blocks. Backpressure propagates to receiver and OTLP clients; they retry or drop themselves. | saved: 10 000 · blocked upstream: ~350 000. Upstream bears the cost for ~1 h 57 m. | ~3 m 20 s | Producers can hold or retry themselves; collector must never drop. |
| 15 000 | false | ~5 m (300 s) | ~750 MB | saved: 15 000 · dropped: ~165 000. Queue fills in 5 m; dropping window ~55 m. | saved: 15 000 · dropped: ~345 000. Dropping window ~1 h 55 m. | ~5 m | Moderate PVC budget; balances buffering with disk cost. |
| 15 000 | true | ~5 m — then stalls | ~750 MB | saved: 15 000 · blocked upstream: ~165 000 | saved: 15 000 · blocked upstream: ~345 000 | ~5 m | Upstream can tolerate ~55 m of backpressure on a 1 h outage. |
| 20 000 | false | ~6 m 40 s (400 s) | ~1.0 GB | saved: 20 000 · dropped: ~160 000. Dropping window ~53 m. | saved: 20 000 · dropped: ~340 000. Dropping window ~1 h 53 m. | ~6 m 40 s | Sweet spot when PVC cost and headroom both matter. |
| 20 000 | true | ~6 m 40 s — then stalls | ~1.0 GB | saved: 20 000 · blocked upstream: ~160 000 | saved: 20 000 · blocked upstream: ~340 000 | ~6 m 40 s | Strong durability; upstream is resilient to ~53 m backpressure on 1 h. |
| 25 000 | false | ~8 m 20 s (500 s) | ~1.25 GB | saved: 25 000 · dropped: ~155 000. Dropping window ~51 m 40 s. | saved: 25 000 · dropped: ~335 000. Dropping window ~1 h 51 m. | ~8 m 20 s | Max headroom with bounded drop behaviour; watch PVC compaction. |
The collector exposes several knobs to throttle how fast the Failover pipeline exports its backlog after the aggregator returns:
- sending_queue knobs: num_consumers (worker count) directly scales the drain rate — halving it halves throughput. Default is 10 workers.
- Pipeline processors: insert a rate_limiter / tail_sampling / transform processor before the exporter to clamp outbound rate.
- Memory guard: the memory_limiter processor refuses data when the collector runs hot, which indirectly throttles the pipeline, drain included.
- Batch + timeout: a batch processor with a larger timeout and larger send_batch_size cuts the effective request rate to the backend (fewer, larger requests).
- Exporter settings: the otlp exporter's sending_queue.num_consumers plus client-side compression and request sizing all shape throughput.
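Two of those knobs sketched together: fewer queue workers plus larger, less frequent batches. Numbers are illustrative starting points, not tuned values:

```yaml
processors:
  batch:
    timeout: 10s            # flush at most every 10 s under light traffic
    send_batch_size: 2048   # larger batches -> fewer requests to the backend

exporters:
  otlp/failover:
    endpoint: aggregator.observability.svc:4317   # placeholder
    sending_queue:
      enabled: true
      num_consumers: 4      # down from the default 10; roughly 0.4x the drain rate
      queue_size: 15000
      storage: file_storage
```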
The matrix below shows effective drain performance at full, half, third, and quarter rate — the key subtlety is that live ingest is still arriving at 50 batches/s, so only a drain rate greater than ingest actually shrinks the backlog.
Failover drain rate vs live ingest
Once the aggregator returns, the failover exporter starts draining. If the exporter is rate-limited, the effective backlog reduction per second = drain_rate − ingest_rate. Any configuration where drain ≤ ingest means the backlog never shrinks while producers are live — catch-up only happens when ingest falls below the drain rate.
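Concretely, catch-up time follows from the two rates. A worked example using the table's 25 000-batch backlog:

```
catch_up_time = backlog / (drain_rate - ingest_rate)   # meaningful only when drain > ingest

e.g. backlog = 25 000 batches, drain = 50/s, ingest dips to 25/s:
     25 000 / (50 - 25) = 1 000 s ≈ 16 m 40 s
```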
Drain rate with and without rate limiting · live ingest = 50 batches/s. Assumes a 25 000-batch backlog carried out of a 1-hour outage (worst-case saved amount). "Catch-up time" is measured while live ingest continues at 50/s.

| Drain config | Effective drain rate | Net drain (drain − 50/s ingest) | Catch-up time for 25 000-batch backlog | Notes |
| --- | --- | --- | --- | --- |
| Full rate · num_consumers ≥ 10 · no limit | ~50 batches/s | 0 / s (break-even) | never — while live ingest = 50/s. Only drains once ingest dips; if ingest falls to 25/s, catch-up ≈ 16 m 40 s. | Break-even default. Good enough once traffic is nightly-low. |
| Half rate · rate_limiter ~25/s | ~25 batches/s | −25 / s (backlog grows) | never — backlog grows by 25/s. PVC fills again at 25/s × remaining outage budget; same drop/block behaviour returns. | Only safe if ingest is simultaneously throttled (rate_limiter on ingress). |
| Third rate · rate_limiter ~16.7/s | ~16.7 batches/s | −33.3 / s (backlog grows faster) | never — backlog grows by 33/s | Hazardous with live ingest; usable only during post-incident off-hours. |
| Quarter rate · rate_limiter ~12.5/s | ~12.5 batches/s | −37.5 / s | never — backlog grows by 37.5/s | Effectively non-draining under live load — avoid unless ingest is paused. |
Practical guidance
- Size PVC for outage duration, not loss tolerance. A 1 h outage at 50 batches/s = 180 000 batches = ~9 GB. No queue_size in this table buffers a full 1 h without drops or blocking.
- Prefer block_on_overflow = true if OTLP clients can buffer; prefer false if they can't and newest-data loss is acceptable.
- For safe catch-up, configure drain rate greater than steady ingest. At 50 batches/s sustained, target ≥ 100 batches/s drain until the backlog is gone — then revert to a conservative rate to protect the aggregator.
- Combine drain limiting with ingest limiting. A rate_limiter processor on the failover pipeline's ingress (not just the exporter) is the only way a below-ingest drain rate actually reduces backlog. A sketch of that placement follows below.
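A sketch of that ingress-side placement, using the memory_limiter processor that ships with the collector (a dedicated rate-limiting processor, where available, slots into the same position; limits are placeholders):

```yaml
processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is sampled
    limit_mib: 1500         # hard ceiling; new data is refused above this
    spike_limit_mib: 300    # headroom reserved for short bursts

service:
  pipelines:
    traces/failover:
      receivers: [failover]
      processors: [memory_limiter, batch]   # throttle ingress before batching/export
      exporters: [otlp/failover]
```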