OpenTelemetry Collector — sending_queue behaviour during a backend outage

Single pipeline, in-memory queue only: how `queue_size` (250 vs 1000) and `block_on_overflow` (true / false) change what happens when the aggregator is unreachable

Behaviour matrix: what the collector does when the aggregator is down
Assumes a steady ingest of 10 batches/s and the default `retry_on_failure.max_elapsed_time` of `5m`. Times are illustrative; real throughput depends on batch size and processor load.
queue_size	block_on_overflow	time to fill (@ 10 batches/s)	on overflow	effect upstream (receiver / client)	memory cost	trade-off · when to choose
`250`	`false`	~25 seconds (250 / 10)	The exporter rejects incoming batches; the `enqueue_failed_*` metrics increase.	None: ingest continues at full rate, and the loss is invisible to producers.	Low (~250 batches in RAM)	Cheap and safe from OOM, but the outage budget is tiny. Fine when SDK-side retry is strong or some data loss is acceptable.
`250`	`true`	~25 seconds, then stalls	The caller blocks waiting for space; no exporter-side drops.	Backpressure reaches processors and receivers quickly (~25 s in). OTLP clients see timeouts and either retry or drop.	Low (~250 batches in RAM)	Pushes the loss decision upstream quickly. Best when producers can buffer or are explicitly designed to slow down.
`1000`	`false` (default)	~100 seconds (1000 / 10)	Same as row 1 once full: the exporter rejects incoming batches and `enqueue_failed_*` rises.	None: ingest is unaffected; the larger buffer only delays when drops start.	~4× row 1 (~1000 batches)	Common default. Rides out short blips; in outages longer than ~100 s new batches are rejected, and batches being retried are still dropped once the 5 min retry window lapses.
`1000`	`true`	~100 seconds, then stalls	The caller blocks; no exporter-side drops except batches whose 5 min retry window lapses.	Backpressure arrives later (~100 s) but just as hard, and lasts until the backend recovers.	~4× row 2 (~1000 batches)	Maximum in-memory durability without a disk queue. Risk: OOM if 1000 batches, plus the requests blocked upstream, exceed the memory headroom.
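The "time to fill" and "on overflow" columns can be sanity-checked with a toy model of the queue. The Python sketch below is hypothetical (`simulate_outage` and `OutageResult` are illustrative names, not collector APIs) and assumes the matrix's steady 10 batches/s with the backend fully down. It also assumes nothing drains from the queue and ignores the retry window, so it shows only the first-order behaviour of each row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageResult:
    fill_time_s: Optional[int]  # second at which the queue first became full (None = never)
    queued: int                 # batches sitting in the queue when the outage ends
    dropped: int                # batches rejected at the exporter (block_on_overflow=False)
    blocked_s: int              # seconds in which callers had to wait (block_on_overflow=True)


def simulate_outage(queue_size: int, block_on_overflow: bool,
                    ingest_per_s: int = 10, outage_s: int = 300) -> OutageResult:
    """Toy, 1-second-tick model of an in-memory sending_queue during a full outage.

    Simplifications (hypothetical model, not exporterhelper code):
    - every export fails, so nothing ever drains from the queue;
    - the retry window (max_elapsed_time) and queue consumers are ignored;
    - with blocking, overflow stays upstream instead of being counted as dropped.
    """
    queued = dropped = blocked_s = 0
    fill_time_s: Optional[int] = None
    for second in range(1, outage_s + 1):
        accepted = min(queue_size - queued, ingest_per_s)  # room left this tick
        queued += accepted
        overflow = ingest_per_s - accepted
        if fill_time_s is None and queued == queue_size:
            fill_time_s = second
        if overflow:
            if block_on_overflow:
                blocked_s += 1        # producers stall; backpressure reaches receivers
            else:
                dropped += overflow   # enqueue fails; batches are lost at the exporter
    return OutageResult(fill_time_s, queued, dropped, blocked_s)


if __name__ == "__main__":
    # The four rows of the behaviour matrix, for a 5-minute outage.
    for size in (250, 1000):
        for block in (False, True):
            r = simulate_outage(size, block)
            print(f"queue_size={size:<5} block_on_overflow={block!s:<5} "
                  f"full at {r.fill_time_s}s, dropped={r.dropped}, blocked={r.blocked_s}s")
```

For a 300 s outage the model reports the queue filling at 25 s (`250`) and 100 s (`1000`). The non-blocking rows then reject 2750 and 2000 batches at the exporter, while the blocking rows drop nothing locally but keep callers waiting for 275 s and 200 s, which is the backpressure the matrix describes.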