Primary pipeline ships live to the aggregator; when the failover connector trips, the failover pipeline's two-tier queue (sending_queue → file_storage on a PVC) takes over. The matrix shows how a queue_size of 10 000 / 15 000 / 20 000 / 25 000 behaves during 1-hour and 2-hour aggregator outages, at both block_on_overflow settings.
The Primary pipeline's 1 000-batch in-memory queue covers only ~20 s before the failover connector routes traffic to the Failover pipeline. All long-outage behaviour (and all loss or backpressure) is governed by the Failover queue knobs below.
| Failover queue_size | block_on_overflow | Time to fill @ 50 / s | PVC footprint | 1-hour outage outcome (180 000 batches arrive) | 2-hour outage outcome (360 000 batches arrive) | Drain time after recovery | Recommended when… |
|---|---|---|---|---|---|---|---|
| primary_queue: in-memory only · 1 000 batches · ~20 s of buffer | | | | | | | |
| 1 000 | false (default) | ~20 s | RAM ~50 MB | Failover connector: after the ~20 s primary buffer is exhausted, the connector trips and traffic is routed to the Failover pipeline for the duration of both the 1 h and 2 h scenarios. | Same as the 1-hour case. | Instant: primary drains as soon as backend is healthy; rejoin window is negligible. | Always: it's the “healthy path” before failover engages. |
| failover_queue: sending_queue + file_storage on PVC · varies by row | | | | | | | |
| 10 000 | false | ~3 m 20 s (200 s) | ~500 MB | saved: 10 000 · dropped: ~170 000. Queue fills in 3 m 20 s; the remaining ~57 m drops newest batches at the exporter. enqueue_failed_* rises steadily. | saved: 10 000 · dropped: ~350 000. Same pattern; drop window extends to ~1 h 57 m. | ~3 m 20 s at a net drain of 50/s (for example, 100/s export while live ingest continues at 50/s; see drain-rate note below) | Tight PVC budget; short-outage tolerance only; bursty workload where newest-data loss is acceptable. |
| 10 000 | true | ~3 m 20 s, then stalls | ~500 MB | saved: 10 000 · blocked upstream: ~170 000. After fill, exporter blocks. Backpressure propagates to receiver and OTLP clients; they retry or drop themselves. | saved: 10 000 · blocked upstream: ~350 000. Upstream bears the cost for ~1 h 57 m. | ~3 m 20 s | Producers can hold or retry themselves; collector must never drop. |
| 15 000 | false | ~5 m (300 s) | ~750 MB | saved: 15 000 · dropped: ~165 000. Queue fills in 5 m; dropping window ~55 m. | saved: 15 000 · dropped: ~345 000. Dropping window ~1 h 55 m. | ~5 m | Moderate PVC budget; balances buffering with disk cost. |
| 15 000 | true | ~5 m, then stalls | ~750 MB | saved: 15 000 · blocked upstream: ~165 000 | saved: 15 000 · blocked upstream: ~345 000 | ~5 m | Upstream can tolerate ~55 m of backpressure on a 1 h outage. |
| 20 000 | false | ~6 m 40 s (400 s) | ~1.0 GB | saved: 20 000 · dropped: ~160 000. Dropping window ~53 m. | saved: 20 000 · dropped: ~340 000. Dropping window ~1 h 53 m. | ~6 m 40 s | Sweet spot when PVC cost and headroom both matter. |
| 20 000 | true | ~6 m 40 s, then stalls | ~1.0 GB | saved: 20 000 · blocked upstream: ~160 000 | saved: 20 000 · blocked upstream: ~340 000 | ~6 m 40 s | Strong durability; upstream is resilient to ~53 m backpressure on 1 h. |
| 25 000 | false | ~8 m 20 s (500 s) | ~1.25 GB | saved: 25 000 · dropped: ~155 000. Dropping window ~51 m 40 s. | saved: 25 000 · dropped: ~335 000. Dropping window ~1 h 51 m 40 s. | ~8 m 20 s | Max headroom with bounded drop behaviour; watch PVC compaction. |
| 25 000 | true | ~8 m 20 s, then stalls | ~1.25 GB | saved: 25 000 · blocked upstream: ~155 000 | saved: 25 000 · blocked upstream: ~335 000 | ~8 m 20 s | Strongest single-node durability; requires resilient upstream. |
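
The arithmetic behind each row can be reproduced with a short script. This is a minimal sketch, assuming a constant ingest of 50 batches/s and an average batch size of ~50 KB (the size implied by the ~500 MB footprint for 10 000 batches). The names `Outcome` and `outage_outcome` are illustrative only and are not part of the collector.

```python
from dataclasses import dataclass

INGEST_PER_S = 50   # batches/s, the rate used throughout the matrix
BATCH_KB = 50       # assumed average batch size (500 MB / 10 000 batches)


@dataclass
class Outcome:
    fill_seconds: float      # time until the failover queue is full
    pvc_mb: float            # raw footprint of a full queue on the PVC
    saved: int               # batches held in the queue when the outage ends
    overflow: int            # dropped (false) or blocked upstream (true)
    overflow_seconds: float  # length of the drop / backpressure window


def outage_outcome(queue_size: int, outage_seconds: int,
                   ingest_per_s: int = INGEST_PER_S,
                   batch_kb: int = BATCH_KB) -> Outcome:
    """Model one matrix row: a full-length outage at constant ingest."""
    arrived = ingest_per_s * outage_seconds
    saved = min(queue_size, arrived)
    fill_seconds = queue_size / ingest_per_s
    return Outcome(
        fill_seconds=fill_seconds,
        pvc_mb=queue_size * batch_kb / 1000,
        saved=saved,
        overflow=arrived - saved,
        overflow_seconds=max(0.0, outage_seconds - fill_seconds),
    )


if __name__ == "__main__":
    for size in (10_000, 15_000, 20_000, 25_000):
        for hours in (1, 2):
            o = outage_outcome(size, hours * 3600)
            print(f"queue_size={size} outage={hours}h "
                  f"fill={o.fill_seconds:.0f}s pvc={o.pvc_mb:.0f}MB "
                  f"saved={o.saved} overflow={o.overflow} "
                  f"window={o.overflow_seconds / 60:.1f}m")
```

The `overflow` figure is the same for both block_on_overflow settings; the setting only decides whether those batches are dropped at the exporter or pushed back onto the producers.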
The collector exposes several knobs to throttle how fast the Failover pipeline exports its backlog after the aggregator returns:

- num_consumers (worker count) directly scales the drain rate: halving it halves throughput. Default is 10 workers.
- rate_limiter / tail_sampling / transform processor before the exporter to clamp outbound rate.
- memory_limiter extension can shed load when the collector is hot, indirectly throttling drain.
- batch processor with a larger timeout + smaller send_batch_size reduces effective req/s to the backend.
- otlp exporter's sending_queue.num_consumers, and client compression / max_request_size all shape throughput.

The matrix below shows effective drain performance from 3× rate down to quarter rate. The key subtlety is that live ingest is still arriving at 50 batches/s, so only a drain rate greater than ingest actually shrinks the backlog.
Once the aggregator returns, the failover exporter starts draining. Whether or not the exporter is rate-limited, the net backlog reduction per second = drain_rate − ingest_rate. Any configuration where drain ≤ ingest means the backlog never shrinks while producers are live; catch-up only happens when ingest falls below the drain rate.
| Drain config | Effective drain rate | Net drain (drain − 50/s ingest) | Catch-up time for 25 000-batch backlog | Notes |
|---|---|---|---|---|
| Full rate: num_consumers ≥ 10 · no limit | ~50 batches/s | 0 / s (break-even) | never while live ingest = 50/s. Only drains once ingest dips; if ingest falls to 25/s, catch-up ≈ 16 m 40 s. | Break-even default. Good enough once traffic is nightly-low. |
| 2× rate: num_consumers doubled · higher parallelism | ~100 batches/s | +50 / s | ~8 m 20 s | If the backend/network can keep up, this is the fastest safe catch-up. |
| 3× rate: aggressive drain · bursty | ~150 batches/s | +100 / s | ~4 m 10 s | Fastest, at the risk of saturating aggregator / inducing a second outage. |
| Half rate: num_consumers halved · rate_limiter 25/s | ~25 batches/s | −25 / s (backlog grows) | never; backlog grows by 25/s. PVC fills again at 25/s × remaining outage-budget; same drop/block behaviour returns. | Only safe if ingest is simultaneously throttled (rate_limiter on ingress). |
| Third rate: rate_limiter ~16.7/s | ~16.7 batches/s | −33.3 / s (backlog grows faster) | never; backlog grows by 33/s | Hazardous with live ingest; usable only during post-incident off-hours. |
| Quarter rate: rate_limiter ~12.5/s | ~12.5 batches/s | −37.5 / s | never; backlog grows by 37.5/s | Effectively non-draining under live load; avoid unless ingest is paused. |
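
The catch-up column follows from dividing the backlog by the net drain rate. Below is a minimal sketch of that calculation, assuming constant drain and ingest rates for the whole catch-up period; `catch_up_seconds` is a hypothetical helper, not a collector API.

```python
import math


def catch_up_seconds(backlog: int, drain_per_s: float,
                     ingest_per_s: float = 50.0) -> float:
    """Seconds needed to clear `backlog` batches.

    Returns math.inf when the drain rate does not exceed live ingest,
    because the backlog then stays flat or grows.
    """
    net = drain_per_s - ingest_per_s
    if net <= 0:
        return math.inf
    return backlog / net


if __name__ == "__main__":
    for label, drain in [("full", 50), ("2x", 100), ("3x", 150), ("half", 25)]:
        t = catch_up_seconds(25_000, drain)
        shown = "never" if math.isinf(t) else f"{t / 60:.1f} min"
        print(f"{label:>4} rate ({drain}/s): {shown}")
```

Passing a lower `ingest_per_s` models the off-peak case: at full rate with ingest at 25/s, the 25 000-batch backlog clears in 1 000 s, matching the ≈ 16 m 40 s in the first row.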
- 180 000 batches = ~9 GB. No queue_size in this table buffers a full 1 h without drops or blocking.
- Use block_on_overflow = true if OTLP clients can buffer; prefer false if they can't and newest-data loss is acceptable.
- Target ≥ 100 batches/s drain until the backlog is gone, then revert to a conservative rate to protect the aggregator.
- A rate_limiter processor on the failover pipeline's ingress (not just the exporter) is the only way a below-ingest drain rate actually reduces backlog.
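
To size a queue that rides out a whole outage without dropping or blocking, the same numbers can be inverted. This is a rough sketch, assuming 50 batches/s and ~50 KB per batch as above; it counts raw payload only and does not model any on-disk overhead of the storage backend. `required_queue` is an illustrative name.

```python
def required_queue(outage_seconds: int, ingest_per_s: int = 50,
                   batch_kb: int = 50) -> tuple[int, float]:
    """Return (queue_size, pvc_gb) needed to buffer an entire outage."""
    queue_size = ingest_per_s * outage_seconds
    pvc_gb = queue_size * batch_kb / 1_000_000  # KB -> GB, decimal units
    return queue_size, pvc_gb


if __name__ == "__main__":
    for hours in (1, 2):
        size, gb = required_queue(hours * 3600)
        print(f"{hours} h outage: queue_size={size}, PVC >= {gb:.1f} GB")
```

For a 1-hour outage this gives 180 000 batches and 9 GB, the figure quoted in the first takeaway; a real PVC would need extra room on top of that.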