OpenTelemetry Collector — sending_queue behaviour during a backend outage

Single pipeline, in-memory queue only — how queue_size (250 vs 1000) and block_on_overflow (true / false) change what happens when the aggregator is unreachable

How the sending_queue works
Exporters built on the Collector's exporterhelper (including the otlp exporter used here) have an in-memory sending_queue that buffers outgoing batches while the exporter is busy or the backend is slow.

When the backend is healthy, batches flow through quickly and the queue stays near-empty.

When the backend is down, the exporter retries with exponential backoff (default max_elapsed_time 5 min) and the queue fills at the incoming rate. Two knobs govern what happens next:

  • queue_size: capacity in batches (default 1000).
  • block_on_overflow: when the queue is full, drop the incoming batch (false, default) or block the caller (true).
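The retry behaviour described above can be sketched as a simple schedule. This is a minimal illustration, not the Collector's implementation: only max_elapsed_time = 5 min comes from this document. The initial interval (5 s), multiplier (1.5) and interval cap (30 s) are assumed values, and the jitter a real backoff applies is left out so the numbers are deterministic. The function name backoff_schedule is hypothetical.

```python
def backoff_schedule(initial_interval=5.0, multiplier=1.5,
                     max_interval=30.0, max_elapsed_time=300.0):
    """Return the list of waits (seconds) between retry attempts.

    Assumed parameters: everything except max_elapsed_time is an
    illustrative value. Jitter is omitted on purpose.
    """
    waits = []
    elapsed = 0.0
    interval = initial_interval
    # Keep scheduling retries while the next wait still fits in the budget.
    while elapsed + interval <= max_elapsed_time:
        waits.append(interval)
        elapsed += interval
        # Grow the wait exponentially, but never beyond the cap.
        interval = min(interval * multiplier, max_interval)
    return waits


waits = backoff_schedule()
print("retries before giving up:", len(waits))
print("first waits (s):", waits[:6])
print("total time spent waiting (s):", sum(waits))
```

Under these assumptions a batch gets a bounded number of attempts inside the 5 minute window; once the window is exhausted the batch is given up on, regardless of how much room the queue has.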
[Figure: architecture diagram. Telemetry sources (applications, pods, host metrics; logs, traces) in the Son Testing K8s cluster send OTLP (gRPC/HTTP) to the OpenTelemetry Collector (single pipeline, in-memory sending_queue only): otlp receiver → processors (batching, resource detection, transformation; memory_limiter can refuse here if downstream stalls) → otlp exporter (producer-consumer: ingest fills the sending_queue at the tail, workers drain it from the head to the backend; retry_on_failure with exponential backoff, max_elapsed_time 5m). The exporter sends OTLP (gRPC/HTTP) to the Otel Aggregator Service (OTLP receiver + processing: batching, resource detection, transformation, routing) in the Observability Dev K8s cluster, which feeds Mimir (metrics), Loki (logs) and Tempo (traces). During a backend outage the aggregator's OTLP receiver is unreachable, retries fail and the queue fills. Metrics shown: otelcol_exporter_queue_size / _capacity, otelcol_exporter_enqueue_failed_*.]
Important Notes
  • No file_storage — the queue is RAM only; a Collector restart loses everything still in it.
  • queue_size is measured in batches (default sizer); pick it from peak throughput × outage-budget.
  • Retries stop after max_elapsed_time (default 5 min): a batch that has been retried for that long is dropped, even if the queue still has room.
  • block_on_overflow = true trades drops for backpressure: receivers slow, clients retry, upstream may buffer or drop.
  • Watch otelcol_exporter_queue_size against otelcol_exporter_queue_capacity, and the otelcol_exporter_enqueue_failed_* counters.
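The sizing rule in the notes (peak throughput × outage budget) can be turned into a small calculation, together with a rough RAM estimate for the resulting queue. This is a sketch under stated assumptions: the function names and the 64 KiB average batch size are hypothetical, and real memory use depends on how batches are represented in the Collector process.

```python
import math


def queue_size_for(peak_batches_per_s, outage_budget_s):
    """Batches the queue must hold to ride out an outage of the given
    length at peak ingest, assuming nothing drains during the outage."""
    return math.ceil(peak_batches_per_s * outage_budget_s)


def queue_memory_bytes(queue_size, avg_batch_bytes):
    """Rough upper bound on RAM held by a full queue.
    avg_batch_bytes is an assumed figure, not a measured one."""
    return queue_size * avg_batch_bytes


size = queue_size_for(peak_batches_per_s=10, outage_budget_s=100)
ram = queue_memory_bytes(size, avg_batch_bytes=64 * 1024)
print("queue_size needed:", size)
print("approx. RAM when full (MiB):", ram / (1024 * 1024))
```

Checking the RAM figure against the Collector's memory limit before raising queue_size is the point of the second function: a bigger in-memory queue only helps if the process can actually hold it.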
1 NORMAL OPERATION (backend healthy)
  • Batches are enqueued and workers drain them immediately.
  • Queue depth stays near zero — the buffer is only ever a few items deep.
  • queue_size and block_on_overflow have no observable effect.
[Figure: Ingest → sending_queue (near-empty, depth ~1 / 1000) → Backend ✓]
Result: Steady state, no drops, low memory.
2 OUTAGE STARTS — QUEUE FILLS (backend unreachable, retries in flight)
  • Exports fail; the exporter retries with exponential backoff.
  • New batches keep arriving and are enqueued instead of sent — depth climbs at the incoming rate.
  • Time-to-full depends on queue_size:
      • queue_size = 250 → ~25 s @ 10 batches/s
      • queue_size = 1000 → ~100 s @ 10 batches/s
  • Same ingest rate → a larger queue buys more time before overflow.
Result: No loss yet — clock is ticking against queue capacity.
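The time-to-full figures above follow from dividing the free capacity by the net fill rate. A minimal sketch, assuming a constant ingest rate; the function name and the drain_per_s and depth parameters are illustrative additions that cover a backend that is slow rather than fully down.

```python
def seconds_until_full(queue_size, ingest_per_s, drain_per_s=0.0, depth=0):
    """Seconds until the queue overflows at a constant net fill rate.

    During a full outage drain_per_s is 0, so the queue fills at the
    ingest rate. Returns infinity when the queue is not growing.
    """
    net_rate = ingest_per_s - drain_per_s
    if net_rate <= 0:
        return float("inf")
    return (queue_size - depth) / net_rate


for capacity in (250, 1000):
    t = seconds_until_full(capacity, ingest_per_s=10)
    print(f"queue_size={capacity}: full after ~{t:.0f} s")
```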
3a OVERFLOW · block_on_overflow = false (default — queue is full, new data is dropped)
  • Enqueue returns failure immediately; the incoming batch is dropped at the exporter and never reaches retry logic.
  • Logs: "Dropping data because sending_queue is full"; metric otelcol_exporter_enqueue_failed_* increments.
  • Upstream (receiver, client) is unaffected: ingest continues at full rate while the collector sheds load without signalling producers.
[Figure: Ingest → sending_queue FULL (250 or 1000) → dropped, enqueue_failed++ · Backend ✕]
Result: Data loss. Collector stays responsive; sources keep streaming.
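The drop-on-overflow behaviour can be modelled with a bounded queue and a non-blocking put. This is an analogy built on Python's standard queue module, not the Collector's code; the enqueue_failed counter stands in for the otelcol_exporter_enqueue_failed_* metrics and offer_all is a hypothetical helper.

```python
import queue


def offer_all(q, batches):
    """Try to enqueue every batch without blocking.
    Returns how many were rejected because the queue was full."""
    enqueue_failed = 0
    for batch in batches:
        try:
            q.put_nowait(batch)      # fails immediately when full
        except queue.Full:
            enqueue_failed += 1      # batch is dropped, never retried
    return enqueue_failed


# Backend is down, so nothing drains the queue while 300 batches arrive.
sending_queue = queue.Queue(maxsize=250)
dropped = offer_all(sending_queue, range(300))
print("queued:", sending_queue.qsize(), "dropped:", dropped)
```

Note which batches survive in this model: the first 250 stay queued and the newest 50 are rejected, which matches the description above of incoming data being dropped once the queue is full.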
3b OVERFLOW · block_on_overflow = true (queue is full, caller is blocked until space frees)
  • The call into the exporter blocks; if capacity frees before the caller's timeout, the item is still enqueued.
  • Backpressure propagates: processors pause → receiver slows → OTLP clients see timeouts/errors and retry or buffer themselves.
  • Trades exporter-side drops for upstream-side drops or latency. Good when upstream can hold data (SDK retry) or slow down.
[Figure: Ingest ← backpressure · sending_queue FULL, caller waits · Backend ✕]
Result: No exporter drops, but upstream slows or refuses — loss moves out of the collector.
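The blocking variant can be sketched the same way with a blocking put that carries a timeout. Again this is an analogy using Python's standard library, with the timeout standing in for the caller's deadline; blocking_enqueue and the tiny queue size of 2 are hypothetical choices to keep the example short.

```python
import queue
import threading
import time


def blocking_enqueue(q, batch, timeout_s):
    """Wait up to timeout_s for space. True if the batch was enqueued."""
    try:
        q.put(batch, block=True, timeout=timeout_s)
        return True
    except queue.Full:
        return False             # caller's deadline expired first


q = queue.Queue(maxsize=2)
q.put("b1")
q.put("b2")                      # queue is now full

# Backend still down: nothing drains, so the caller times out.
timed_out = not blocking_enqueue(q, "b3", timeout_s=0.05)


def worker():
    time.sleep(0.05)             # backend recovers shortly afterwards
    q.get()                      # a worker drains one batch


t = threading.Thread(target=worker)
t.start()
# Space frees before the deadline, so the waiting batch is accepted.
accepted = blocking_enqueue(q, "b3", timeout_s=2.0)
t.join()
print("timed out while down:", timed_out, "| accepted after recovery:", accepted)
```

The first call shows where the loss moves to: the batch is not dropped by the queue, but the caller gives up and has to retry, buffer or drop it itself.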
Behaviour matrix: what the collector does when the aggregator is down
Assumes steady ingest of 10 batches/s and default retry_on_failure.max_elapsed_time = 5m. Times are illustrative; real throughput depends on batch size and processor load.

Row 1: queue_size = 250 · block_on_overflow = false
  • Time to fill (@ 10 batches/s): ~25 seconds (250 / 10).
  • On overflow: Exporter drops incoming batches; enqueue_failed_* increments.
  • Effect upstream (receiver / client): None. Ingest continues at full rate; the loss is invisible to producers.
  • Memory cost: Low (~250 batches in RAM).
  • Trade-off · when to choose: Cheap, safe from OOM, but tiny outage budget. Fine when SDK retry is strong or some data loss is acceptable.

Row 2: queue_size = 250 · block_on_overflow = true
  • Time to fill (@ 10 batches/s): ~25 seconds, then stalls.
  • On overflow: Caller blocks waiting for space; no exporter-side drops.
  • Effect upstream (receiver / client): Backpressure hits processors and receivers quickly (~25 s in). OTLP clients see timeouts and will retry or drop.
  • Memory cost: Low (~250 batches in RAM).
  • Trade-off · when to choose: Pushes loss upstream fast. Best when producers can buffer or are explicitly designed to slow down.

Row 3: queue_size = 1000 · block_on_overflow = false (default)
  • Time to fill (@ 10 batches/s): ~100 seconds (1000 / 10).
  • On overflow: Same as row 1 once full: drops at the exporter, enqueue_failed_* rises.
  • Effect upstream (receiver / client): None. Ingest is unaffected; the larger buffer just delays when drops start.
  • Memory cost: ~4× row 1 (~1000 batches).
  • Trade-off · when to choose: Common default. Rides out short blips; outages longer than ~100 s drop newly arriving data, and batches that exhaust the 5 min retry window are dropped as well.

Row 4: queue_size = 1000 · block_on_overflow = true
  • Time to fill (@ 10 batches/s): ~100 seconds, then stalls.
  • On overflow: Caller blocks; no exporter-side drops until the retry window lapses on old items.
  • Effect upstream (receiver / client): Backpressure arrives later (~100 s) but just as hard, and lasts until the backend recovers.
  • Memory cost: ~4× row 2 (~1000 batches).
  • Trade-off · when to choose: Maximum in-memory durability without a disk queue. Risk: OOM if the outage outlasts memory headroom.