Two-tier queue: an in-memory sending_queue in front of a persistent file_storage extension on a PVC. How queue_size (15 000 vs 25 000) and block_on_overflow (false / true) change what happens when the aggregator is unreachable.
| queue_size | block_on_overflow | time to fill (@ 10 batches/s) | on overflow | effect upstream (receiver / client) | survives restart? | storage cost | trade-off · when to choose |
|---|---|---|---|---|---|---|---|
| sending_queue in-memory only (no file_storage) | | | | | | | |
| 250 | false | ~25 s | Exporter drops incoming batches; enqueue_failed_* increments. | None: ingest continues at full rate; loss is invisible to producers. | no (RAM-only) | ~250 batches in RAM (~12 MB) | Cheap, OOM-safe, tiny outage budget. OK when SDK retry is strong. |
| 250 | true | ~25 s, then stalls | Caller blocks; no exporter-side drops. | Backpressure ~25 s in; OTLP clients see timeouts. | no (RAM-only) | ~250 batches in RAM (~12 MB) | Pushes loss upstream fast; use when producers can buffer. |
| 1 000 | false (default) | ~100 s | Exporter drops incoming batches; enqueue_failed_* rises. | None: ingest unaffected; the larger buffer only delays drops. | no (RAM-only) | ~1 000 batches in RAM (~50 MB) | Common default. Rides out short blips; outages longer than ~100 s still drop newly arriving batches. |
| 1 000 | true | ~100 s, then stalls | Caller blocks; no exporter-side drops until the retry window lapses. | Backpressure arrives later but lasts until the backend recovers. | no (RAM-only) | ~1 000 batches in RAM (~50 MB) | Max in-memory durability; risk of OOM on long outages. |
| file_storage persistent queue on a PVC (in-memory tier in front) | | | | | | | |
| 15 000 | false | ~25 min (1 500 s) | After ~25 min, newest batches are dropped at the exporter; enqueue_failed_* rises. Already-persisted batches wait on disk and flush when the backend returns. | None: ingest continues at full rate; loss is invisible to producers. | yes (persists and resumes) | ~750 MB PVC (plus compaction headroom) | Good balance: ~25 min of buffer with bounded disk. Accepts tail drops on long outages. |
| 15 000 | true | ~25 min, then stalls | Caller blocks when the disk queue is full; no exporter-side drops. | Backpressure after ~25 min; OTLP clients retry or drop. Upstream becomes the bottleneck. | yes (persists and resumes) | ~750 MB PVC | Pushes loss upstream; best when producers can hold or slow down. |
| 25 000 | false | ~42 min (2 500 s) | Same mechanics as 15 000, with ~17 more minutes of headroom before drops begin. | None: ingest unaffected for the full ~42 min window. | yes (persists and resumes) | ~1.25 GB PVC | Larger outage budget at modest disk cost; drop behaviour is still bounded. |
| 25 000 | true | ~42 min, then stalls | Caller blocks; no exporter-side drops until the retry window lapses on the oldest items. | Backpressure after ~42 min; strongest guarantee without sharding. | yes (persists and resumes) | ~1.25 GB PVC (watch compaction / fsync) | Maximum single-node durability. Risk: a slow PVC means slow ingest even when the backend is healthy. |
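
The "time to fill" and "storage cost" columns above follow from simple arithmetic, sketched below in Python. This is a rough estimate, not a measurement. It assumes that queue_size counts batches, that batches arrive at a steady 10 per second (as in the column header), and that each batch is about 50 KB, a figure inferred from the table's own numbers (~12 MB for 250 batches) rather than stated in the source. The names `queue_budget` and `QueueBudget` are hypothetical helpers for illustration only.

```python
from dataclasses import dataclass

# Assumptions (not measured): steady arrival rate taken from the table header,
# and ~50 KB per batch as implied by the table (~12 MB / 250 batches).
BATCHES_PER_SECOND = 10
BATCH_BYTES = 50_000


@dataclass(frozen=True)
class QueueBudget:
    queue_size: int          # queue capacity, counted in batches
    seconds_to_fill: float   # outage duration the queue can absorb
    storage_mb: float        # approximate RAM or PVC footprint when full


def queue_budget(queue_size: int,
                 rate: float = BATCHES_PER_SECOND,
                 batch_bytes: int = BATCH_BYTES) -> QueueBudget:
    """Estimate how long a total outage takes to fill the queue, and its size."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return QueueBudget(
        queue_size=queue_size,
        seconds_to_fill=queue_size / rate,
        storage_mb=queue_size * batch_bytes / 1_000_000,
    )


if __name__ == "__main__":
    # The four queue sizes compared in the table.
    for size in (250, 1_000, 15_000, 25_000):
        b = queue_budget(size)
        print(f"{size:>6} batches: {b.seconds_to_fill:.0f} s "
              f"({b.seconds_to_fill / 60:.1f} min), ~{b.storage_mb:.1f} MB")
```

Re-running the estimate with a measured batch rate and batch size for your own pipeline gives a more honest outage budget than the round figures in the table. The PVC should also be sized above the raw estimate to leave room for compaction.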