About AI Tokens

The term “AI tokens” usually refers to the smallest units of text that a large language model (LLM) reads or produces. Tokens are not always full words; often they are word pieces, characters, or even punctuation.

Here’s a clear breakdown:

What is a Token?

A token is a chunk of text. Depending on the language and tokenizer, a token might be:

  • A whole short word → “cat”
  • Part of a long word → “inter-”, “-national”
  • A punctuation mark → “,”
  • A single character in some languages → Chinese, Japanese
  • Even whitespace or special system symbols

LLMs don’t think in letters or words—they think in tokens.

Why Tokens Matter

1. Pricing

Most AI models bill based on the number of tokens:

  • Input tokens → what you send to the model
  • Output tokens → what the model generates

More text = more tokens = higher cost.

2. Context Window

Models have a maximum number of tokens they can process at once (their context window).
Example: a 128k-token context model can handle roughly a 300–400 page book (128,000 tokens is roughly 96,000 English words at about 0.75 words per token, or a few hundred pages at ~300 words per page).

3. Speed

More tokens → slower responses.

4. Precision

Tokenization affects:

  • How a model understands spelling variations
  • How it interprets compound words
  • How it handles programming languages or math

How Tokenization Works (Simple Example)

Sentence:
“ChatGPT loves learning.”

A tokenizer might split it into:

  • “Chat”
  • “G”
  • “PT”
  • “ loves”
  • “ learning”
  • “.”

Total: 6 tokens

Another tokenizer might split it differently depending on how it was trained.
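
You can inspect a split like this yourself with a tokenizer library. The sketch below uses OpenAI’s open-source tiktoken package with the cl100k_base encoding (an assumption for illustration; any BPE tokenizer with an encode/decode API works the same way), and the exact pieces and count depend on the encoding you load.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # encoding used by several GPT-family models

text = "ChatGPT loves learning."
token_ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # decode each ID back to its text piece

print(len(token_ids), "tokens:", pieces)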

Types of AI Tokens

1. Text Tokens

Used by LLMs for natural language.

2. Code Tokens

Optimized for programming languages; code-focused models such as Code Llama may use tokenization patterns tuned for source code.

3. Special Tokens

Used internally:

  • <|system|>
  • <|user|>
  • <|assistant|>
  • End-of-text markers

These help the model structure conversations.

Tokenization Methods You Might See

  • BPE (Byte Pair Encoding) — Very common, used by GPT models.
  • SentencePiece — Used by some Google models.
  • WordPiece — Used in BERT.
  • Unigram LM — Another method used internally by some models.

All aim to compress text efficiently.

Tokens vs Cryptocurrency Tokens

People sometimes confuse the two.
AI tokens ≠ crypto tokens (unless you’re talking about blockchain-based AI projects).
In AI language models, token simply means a chunk of text.

Monetizing AI Tokens

When people talk about “monetizing AI tokens,” they usually mean charging money based on how many tokens an AI model uses. This happens in three main contexts:


1. Monetizing AI Usage (the most common meaning)

This refers to how companies like OpenAI, Anthropic, Google, or API providers make money.

They charge per input token and output token.

Example:

  • You send 2,000 tokens → billed
  • The AI replies with 1,000 tokens → billed
  • Total = 3,000 tokens

This is the core economic unit of LLMs.
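
To make the 2,000-in / 1,000-out example above concrete, here is a minimal pricing sketch. The per-million-token rates are made-up placeholders, not any provider’s actual pricing; the point is only that input and output tokens are metered and priced separately.

# Hypothetical per-1M-token prices; real rates vary by provider and model.
PRICE_PER_1M_INPUT = 3.00    # USD per 1,000,000 input tokens (placeholder)
PRICE_PER_1M_OUTPUT = 12.00  # USD per 1,000,000 output tokens (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request, billed separately for input and output tokens."""
    return (input_tokens / 1_000_000) * PRICE_PER_1M_INPUT \
         + (output_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT

# The example above: 2,000 input tokens + 1,000 output tokens = 3,000 tokens total.
print(f"${request_cost(2_000, 1_000):.4f}")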

Why monetize tokens?

Because tokens are:

  • Predictable units of compute
  • Directly tied to GPU usage and cost
  • Easy to meter for billing
  • Scalable for apps and businesses

This is the same reason cloud providers bill per compute-second.


2. Token Monetization for Developers / App Builders

If you build an app using an LLM (chatbot, agent, game, coding tool, etc.), your cost is in tokens, so:

  • You pay wholesale token prices to the model provider
  • You resell or mark up the service to your users
  • Profit comes from margin, efficiency, or value-add features

So when a founder says:

“We need to monetize tokens.”

They mean:

“We need to charge users enough to cover the model’s token costs and turn a profit.”

3. Internal Economics: How AI Companies Monetize Tokens

Behind the scenes:

  • GPUs run inference
  • GPUs cost money per second
  • Token throughput determines cost per query

So:
Tokens = compute time = money

AI labs map token pricing to:

  • Electricity and GPU costs
  • Model size
  • Hardware efficiency
  • Demand and market competition

Tokens are essentially micropayments for LLM compute.

Summary

When people talk about monetizing AI tokens, they almost always mean:

Charging users based on the number of tokens processed by an AI model—which ties directly to compute cost.

Why Token Observability Matters

1. Cost Control & Margin Protection

Tokens ≈ compute.
Compute = the single largest cost center for a hyperscaler running LLM inference.

Monitoring token metrics helps answer:

  • Which workloads are consuming the most tokens?
  • Are certain customers causing spiky or abusive token usage?
  • Are model changes increasing token-per-request cost?

Without token-level data, you can’t accurately understand or optimize unit economics.

2. Capacity Planning & Scaling

Token throughput is directly tied to:

  • GPU utilization
  • Model saturation
  • Latency under load
  • Queueing behavior

Hyperscalers use token telemetry to:

  • Predict peak demand
  • Scale GPU clusters
  • Allocate inference servers
  • Tune batching efficiency

Token rate = the strongest predictor of load.

3. Performance Monitoring

Token-level metrics reveal performance bottlenecks:

  • Drops in tokens/sec → GPU underutilization or network issues
  • Slow output token generation → model regression or hardware throttling
  • Sudden input-token spikes → possible DDoS or abusive workload

Token observability gives real-time performance signals that logs and traces alone cannot.

4. Abuse Detection & Security

Token patterns can reveal:

  • Automated scraping
  • Prompt injection attempts
  • Misuse of free-tier accounts
  • Traffic laundering
  • API key sharing

Hyperscalers often build token anomaly detectors to block or throttle bad actors.

5. Customer Billing & Transparency

Tokens are the billing unit, so observability supports:

  • Accurate metering
  • Customer usage dashboards
  • Invoice reconciliation
  • Disputes handling

If you can’t monitor tokens precisely, you can’t bill precisely.

6. Product Insights and Research

Token telemetry helps model teams understand:

  • Which features produce high output-token inflation
  • Which models yield the most efficient tokens-per-task
  • How users structure prompts
  • The distribution of prompt lengths across verticals

This feeds into model optimization and product strategy.

What Hyperscalers Actually Monitor

A serious AI provider typically records:

Per-request metrics

  • Input tokens
  • Output tokens
  • Total tokens
  • Tokens/s (input and output separately)
  • Latency per stage (tokenization, inference, streaming, post-processing)

Aggregated metrics

  • Avg. tokens per customer per day
  • Avg. tokens per model
  • Peak token throughput
  • GPU tokens/sec efficiency
  • Cost-per-1M tokens (by model & hardware)

Anomaly signals

  • Token spikes
  • Sudden distribution shifts
  • Abnormal output-token growth
  • Token storms (malicious loops)

How Hyperscalers Use Token Observability Operationally

Real-time dashboards show:

  • Tokens/sec per cluster
  • Tokens/sec per model
  • Cost per token by model family
  • GPU efficiency mapped to token throughput

Alerts trigger when:

  • Tokens/sec drop below cluster baseline
  • Output token rate spikes
  • Billing anomalies arise
  • Context-length usage nears limits
  • Tokenization error rate rises

Token observability is treated similarly to CPU load, I/O throughput, and memory pressure in traditional cloud infrastructure, but it is even more essential here, because tokens are the product.

Conclusion

For a hyperscaler, yes — it is extremely important to monitor token-level data. Tokens are the backbone of:

  • Infrastructure efficiency
  • Cost control
  • Security
  • Billing
  • Product insights
  • Scaling
  • Model performance

Monitoring tokens is equivalent to monitoring compute, cost, customer experience, and revenue.

The Full Observability Stack for LLM Platforms

Metrics → Traces → Logs → Token-level Metadata → Derived Intelligence

This stack looks similar to traditional cloud observability, but AI adds new layers that hyperscalers must track.

Metrics (the high-level, real-time health check)

Metrics are numeric, aggregated, time-series values that tell you how the system is behaving right now.

Standard infra metrics (still essential)

  • CPU utilization
  • GPU utilization
  • GPU memory pressure
  • Disk & network I/O
  • Latency percentiles (p50, p95, p99)

AI-specific metrics (new)

These are introduced because LLM behavior depends on tokens:

Input Tokens/sec

High input load means:

  • Users are sending long prompts
  • Prompt/chat augmentation systems are expanding context

Output Tokens/sec

Drops indicate:

  • GPU/TPU throttling
  • Model regression
  • Saturated clusters
  • Poor batching efficiency

Tokens per request (avg, p95, p99)

Useful for:

  • Capacity planning
  • Billing accuracy
  • Detecting abuse (e.g., extremely long conversations)

Context Window Utilization %

When users approach ~80–100% of max tokens:

  • Latency spikes
  • GPU memory spikes
  • Errors rise (context overflow)
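
A minimal sketch of the check behind this metric, assuming you know the model’s context limit and the token counts for a request; the 0.8 warning threshold is an arbitrary example.

def context_utilization(input_tokens, output_tokens, system_tokens, context_limit):
    """Fraction of the model's context window consumed by a request."""
    return (input_tokens + output_tokens + system_tokens) / context_limit

usage = context_utilization(input_tokens=95_000, output_tokens=6_000,
                            system_tokens=1_500, context_limit=128_000)
if usage >= 0.8:                       # arbitrary example threshold
    print(f"warning: context window {usage:.0%} full, overflow and latency risk rising")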

Cost per 1M tokens

Internally tracked even if not exposed externally.

Batching Efficiency

LLM servers batch requests to keep GPUs fully fed.

Token metrics drive batching decisions.
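
A toy sketch of token-aware batching, assuming a FIFO queue of pending requests that each carry a token estimate. Real inference servers (continuous batching, paged KV caches) are far more sophisticated, but the token-budget idea is the same.

def build_batch(queue, max_batch_tokens=16_384, max_batch_size=32):
    """Pack requests from the front of the queue until the token budget or size cap is hit."""
    batch, budget = [], max_batch_tokens
    for req in queue:
        cost = req["input_tokens"] + req["max_new_tokens"]   # worst-case token footprint
        if cost > budget or len(batch) >= max_batch_size:
            break                                            # keep FIFO order; stop at the budget
        batch.append(req)
        budget -= cost
    return batch

queue = [{"id": i, "input_tokens": 900, "max_new_tokens": 500} for i in range(40)]
batch = build_batch(queue)
print(len(batch), "requests,", sum(r["input_tokens"] + r["max_new_tokens"] for r in batch), "tokens")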

Traces (the per-request story)

Traces show how a single request flows through the AI system, end-to-end.

Why traces are critical for LLMs

LLM inference has many stages:

  1. Receive request
  2. Tokenize input
  3. Validate safety filters
  4. Route to the correct model
  5. Reserve GPU memory
  6. Batch with other requests
  7. Run inference
  8. Stream output tokens
  9. Safety redaction or compression
  10. Return to customer

A trace shows timing for each stage.

AI-specific trace spans

Hyperscalers add spans such as:

  • tokenization_time_ms
  • model_loading_time_ms (if cold start)
  • batch_queue_wait_ms
  • first_token_latency_ms (how fast generation starts)
  • avg_output_tokens_per_s (generation speed)
  • safety_filter_decisions
  • cache_hit/miss for retrieval-augmented generation (RAG)

This is where token-level metadata attaches to the request story.
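
A minimal sketch of attaching token metadata to a span with the OpenTelemetry Python SDK. The span and attribute names mirror the list above but are conventions of this example, not a standard, and the decode step is a stand-in for the real GPU loop.

# pip install opentelemetry-sdk
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # print spans locally
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("model-server")

def fake_decode(input_tokens: int) -> int:
    """Stand-in for the real GPU decode loop; returns an output token count."""
    time.sleep(0.05)
    return 128

def generate(input_tokens: int) -> int:
    with tracer.start_as_current_span("model_server.generate") as span:
        start = time.monotonic()
        output_tokens = fake_decode(input_tokens)
        elapsed = time.monotonic() - start
        span.set_attribute("input_tokens", input_tokens)
        span.set_attribute("output_tokens", output_tokens)
        span.set_attribute("latency_ms", int(elapsed * 1000))
        span.set_attribute("tokens_per_second", output_tokens / elapsed)
        return output_tokens

generate(1423)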

Logs (the granular, textual details)

Logs are fine-grained raw information useful for debugging and audits.

Standard logs

  • Errors
  • Warnings
  • Timeouts
  • API failures
  • Model load/unload events

AI-specific logs

These include:

  • Tokenizer failures
  • Abnormally large token counts
  • Prompt-injection detection
  • Malformed input formatting passed to the model
  • Model safety blocks
  • Batching decision logs
  • GPU kernel execution logs
  • Per-layer inference anomalies

Logs are essential for diagnosing:

  • Why a GPU crashed
  • Why latency spiked
  • Why a user got incomplete output
  • Why safety filters triggered

Token-Level Metadata (the new layer unique to LLM systems)

This is the observability layer that did not exist before LLMs.

Token metadata can be attached:

  • At request-level (summary)
  • At span-level (trace)
  • At event-level (log mini-records)

What token metadata includes

Per-request

  • input_token_count
  • output_token_count
  • system_token_count (system + tool messages)
  • total_token_count

Streaming-level

  • tokens_per_chunk
  • time_between_chunks (latency signal)
  • decoding_sampling_metadata (temperature, top_p, frequency penalties)

User behavior

  • Average tokens per conversation
  • Max tokens per session
  • Token distribution patterns per customer

Billing

  • Which user consumed how many tokens
  • What model they used
  • Which organization it belongs to

Why token metadata matters

It enables:

  • Accurate billing
  • Anomaly detection
  • Performance regression detection
  • GPU efficiency optimization
  • Fair usage limits
  • Safety monitoring
  • Product insights

It is the single most important observability layer for scaling AI reliably.

Derived Intelligence Layer (the hyperscaler “secret sauce”)

This is where hyperscalers turn raw token data into business and operational intelligence.

Examples:

Predictive scaling

Use tokens/sec + historical trends to forecast when to spin up more GPU instances.
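
A deliberately simple sketch of the idea: extrapolate a linear trend from recent tokens/sec samples and convert the forecast into a GPU count using an assumed per-GPU throughput. Real systems use proper forecasting models; every number here is a placeholder.

def forecast_gpus(history, horizon_minutes=60,
                  tokens_per_sec_per_gpu=800.0,            # assumed per-GPU throughput
                  headroom=1.3):                           # safety margin above the forecast
    """Very rough linear-trend forecast of token load, converted into a GPU count."""
    deltas = [b - a for a, b in zip(history, history[1:])]  # per-minute changes
    slope = sum(deltas) / len(deltas)                       # average growth per minute
    forecast = history[-1] + slope * horizon_minutes        # projected tokens/sec in an hour
    return max(1, round(forecast * headroom / tokens_per_sec_per_gpu))

history = [110_000, 113_000, 118_000, 121_000, 126_000]     # tokens/sec, one sample per minute
print(forecast_gpus(history), "GPUs of capacity needed for the next hour")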

Token anomaly detection

Detect:

  • Token storms (abusive loops)
  • Sudden prompt-length explosions
  • Mass scraping
  • Token patterns typical of jailbreak attempts

Per-customer cost modeling

Compute:

  • cost_per_token_per_customer
  • margin_per_customer
  • expected_token_growth

Model performance regression detection

If output tokens/sec drops after a new model version → revert or investigate.

Token distribution insights

Understand:

  • How customers structure their prompts
  • How long typical conversations last
  • Whether your models are too verbose

How These Layers Work Together

Here’s the flow:

  • Metrics → quick health check & real-time monitoring
  • Traces → deep visibility into each request’s path
  • Logs → detailed debugging + forensic analysis
  • Token-level metadata → AI-specific insight for billing, cost, performance, and safety
  • Derived intelligence → forecasting, anomaly detection, business insights

Without token-level metadata, the rest of the observability stack fails to diagnose why LLM systems behave the way they do.

How to architect token observability pipelines

(distributed design, telemetry ingestion, GPU node metrics, deduping, sampling, retention, privacy, etc.)

Goals of a Token Observability Pipeline

Before wiring anything, you design around a few core goals:

  • Billing – exact input/output tokens per tenant, per model, per time period
  • Cost & capacity – tokens/sec per cluster/model for GPU planning
  • Performance – latency vs tokens, tokens/sec, context usage
  • Safety & abuse detection – unusual token patterns, storms, spikes
  • Product insight – how people actually “spend” their tokens

Everything in the pipeline should serve at least one of these.

High-Level Architecture

Think in four layers:

  1. Producers – where token data is generated
  2. Collectors/Agents – local sidecars or SDKs for telemetry
  3. Transport – message bus / metrics pipeline
  4. Backends – time-series DB, log store, data warehouse, feature stores

A typical stack could be:

  • Producers: API Gateway, Orchestrator, Model Servers
  • Collectors: OpenTelemetry agents, custom sidecars
  • Transport: Kafka / Pulsar (events), Prometheus remote write (metrics)
  • Backends:
    • Time-series: Prometheus / Cortex / Mimir / Thanos
    • Logs: Elasticsearch / OpenSearch / ClickHouse
    • Warehouse: BigQuery / Snowflake / Redshift
    • Online store: Redis / Feature Store for real-time detection

Where Token Data Is Collected

You usually instrument at multiple layers to cross-check and avoid blind spots:

1. API Gateway

  • Knows: tenant, API key, endpoint, region, status code
  • Can record: high-level token counts per request (from response headers / body)
  • Good for billing & rate limiting.

2. Orchestration Layer

(Your “brain”: routes calls to models, tools, RAG, function calling)

  • Knows: which model, which tools, which pipeline was used
  • Can log:
    • input_token_count
    • output_token_count
    • system_token_count
    • effective context_size
    • retries / fallbacks
  • Good for cost attribution, per-feature usage, A/B tests.

3. Model / Inference Servers

  • Closest to GPUs
  • Know:
    • exact tokenization
    • decode speed tokens/sec
    • batching behavior
  • Emit:
    • tokens_in / tokens_out
    • first_token_latency_ms
    • tokens_per_second
    • batch_size & batch_tokens
  • Critical for performance & hardware efficiency.

4. GPU / Hardware Telemetry

  • Expose: GPU utilization, memory, kernel errors
  • Correlate with token throughput from inference layer

You want correlated token metrics at each hop: gateway ↔ orchestrator ↔ model server.

Data Model / Schema for Token Events

Define a canonical token event (or a couple of them) that all services emit.

Example: TokenUsageEvent

Core fields:

  • request_id
  • trace_id / span_id (for linking to traces)
  • timestamp
  • tenant_id / org_id / user_id (or hashed)
  • model_id (e.g. gpt-4.1-mini)
  • region / cluster

Token details:

  • input_tokens
  • output_tokens
  • system_tokens (system + tool messages)
  • total_tokens
  • context_window_used (percentage or absolute)
  • generation_parameters (temperature, top_p, etc.)

Operational:

  • status (success, timeout, error code)
  • latency_ms (end-to-end)
  • first_token_latency_ms
  • tokens_per_second

You can then derive metrics from these events, rather than letting every service invent its own schema.
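
A sketch of that canonical event as a Python dataclass, so every service serializes the same fields. The field names follow the lists above and are illustrative rather than a fixed standard.

from dataclasses import dataclass, asdict, field
import json, time, uuid

@dataclass
class TokenUsageEvent:
    tenant_id: str
    model_id: str
    region: str
    input_tokens: int
    output_tokens: int
    system_tokens: int
    latency_ms: int
    first_token_latency_ms: int
    status: str = "success"
    request_id: str = field(default_factory=lambda: f"req_{uuid.uuid4().hex[:8]}")
    timestamp: float = field(default_factory=time.time)

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens + self.system_tokens

    def to_json(self) -> str:
        payload = asdict(self)
        payload["total_tokens"] = self.total_tokens   # derived, but convenient downstream
        return json.dumps(payload)

event = TokenUsageEvent("org_abc123", "gpt-4.1", "us-west", 1423, 816, 52, 904, 172)
print(event.to_json())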

Real-Time Path: Metrics & Alerts

From the stream of token events, you build live metrics.

Steps

  1. Emit counters & gauges from services:
    • tokens_in_total{model, region, tenant}
    • tokens_out_total{model, region, tenant}
    • requests_total{status, model}
    • tokens_per_request_bucket histograms
  2. Scrape or push into a metrics backend (e.g. Prometheus).
  3. Aggregate & alert:
    • Alerts on tokens/sec dropping (possible outage)
    • Tokens/sec spiking (possible abuse or launch)
    • Latency vs tokens (p95 > SLO)
    • Context utilization near 100% (risk of errors)

Dashboards (examples)

  • Capacity dashboard
    • Tokens/sec by model & region
    • GPU utilization vs tokens/sec
    • Batch efficiency vs tokens
  • Billing / finance dashboard
    • Tokens/day per tenant & model
    • Cost extrapolated from tokens
  • Reliability dashboard
    • Error rate vs tokens
    • Latency percentiles segmented by token buckets (0–1k, 1k–8k, etc.)

Metrics are usually aggregated, not per-request; they give the “health overview”.
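
A minimal sketch of emitting the counters and histogram named in the steps above with the prometheus_client Python library; the metric and label names follow this document’s examples, and the bucket boundaries and port are arbitrary. (High-cardinality labels such as tenant need the same care discussed later under dimension reduction.)

# pip install prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

TOKENS_IN = Counter("llm_input_tokens_total", "Total input tokens received",
                    ["model", "region", "tenant"])
TOKENS_OUT = Counter("llm_output_tokens_total", "Total output tokens generated",
                     ["model", "region", "tenant"])
TOKENS_PER_REQUEST = Histogram("llm_tokens_per_request", "Total tokens per request",
                               ["model"], buckets=(512, 1024, 4096, 16384, 65536))

def record_request(model, region, tenant, input_tokens, output_tokens):
    """Called once per completed request; Prometheus scrapes the aggregates."""
    TOKENS_IN.labels(model, region, tenant).inc(input_tokens)
    TOKENS_OUT.labels(model, region, tenant).inc(output_tokens)
    TOKENS_PER_REQUEST.labels(model).observe(input_tokens + output_tokens)

if __name__ == "__main__":
    start_http_server(9100)                       # exposes /metrics for scraping
    record_request("gpt-4.1", "us-west", "acme-corp", 1423, 816)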

Batch / Analytics Path

All token events should also land in a data lake / warehouse for deep analysis.

Pipeline

  1. Services emit TokenUsageEvent to Kafka (or similar).
  2. Stream is:
    • Mirrored to warehouse (e.g. via Kafka Connect / Flink / Beam)
    • Optionally pre-aggregated per minute/hour for heavy tenants
  3. In warehouse, build:
    • Billing tables: tokens per org/model/day
    • Product analytics: average tokens per feature/workflow
    • Cost modeling: map tokens → GPU hours → $$
    • Forecasting: time-series of tokens usage

Uses

  • Finance: margin per customer, pricing strategy
  • Product: which features are driving token usage?
  • Infra: predicting when another GPU cluster is needed

This batch layer is where data scientists and analysts live.
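
A sketch of the first hop of this pipeline: publishing each token usage event to a Kafka topic with the kafka-python client. The topic name, broker addresses, and raw-JSON serialization are assumptions; many teams use Avro or Protobuf with a schema registry instead.

# pip install kafka-python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],        # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

def publish_token_event(event: dict) -> None:
    """Key by tenant so a tenant's events stay ordered within a partition."""
    producer.send("token-usage-events", key=event["tenant_id"], value=event)

publish_token_event({
    "request_id": "req_7e2f90d1", "tenant_id": "org_abc123", "model": "gpt-4.1",
    "region": "us-west", "input_tokens": 1423, "output_tokens": 816, "total_tokens": 2239,
})
producer.flush()   # block until buffered events are delivered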

Sampling, Aggregation & Retention

Token data is high-volume. Hyperscalers must be smart about storage.

Techniques

  • Event sampling:
    • Keep 100% of billing-relevant fields
    • Sample detailed trace/log info (e.g. 1–5% of requests)
  • Time-based aggregation:
    • Raw events kept for X days
    • Hourly/daily aggregates retained for months/years
  • Dimension reduction:
    • Only keep the necessary tags: model, region, tenant, status
    • Avoid high-cardinality chaos like arbitrary user-supplied IDs unless hashed carefully

Example policy

  • Raw TokenUsageEvent: retained 7–30 days
  • Aggregated per-tenant-per-day tokens: retained for years (for billing & compliance)
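
A sketch of the sampling rule above: billing-relevant summaries are always kept, while full trace/log detail is retained for only a small fraction of requests (the 2% rate is an arbitrary example). Hashing the request ID keeps the decision deterministic, so every service samples the same requests.

import hashlib

DETAIL_SAMPLE_RATE = 0.02   # keep full traces/logs for ~2% of requests (example value)

def keep_detailed_telemetry(request_id: str) -> bool:
    """Deterministic sampling: the same request_id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64    # uniform value in [0, 1)
    return bucket < DETAIL_SAMPLE_RATE

def emit(event: dict) -> None:
    print("billing summary:", {k: event[k] for k in ("request_id", "tenant_id", "total_tokens")})
    if keep_detailed_telemetry(event["request_id"]):      # detailed telemetry only for the sample
        print("detailed trace/logs kept for", event["request_id"])

emit({"request_id": "req_7e2f90d1", "tenant_id": "org_abc123", "total_tokens": 2239})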

Privacy, Safety, and Multi-Tenant Concerns

Tokens are generated from user prompts, which may be sensitive. Your pipeline must be privacy-conscious.

Key principles

  • Store counts, not contents
    • Token observability rarely needs raw text.
    • Keep input_tokens=1523, not the actual prompt.
  • Anonymize tenant/user identifiers
    • Hash or pseudonymize user IDs
    • Ensure per-tenant isolation in dashboards
  • Redact or tokenize PII before analytics
    • If you store prompt samples for quality analysis, pass them through PII redaction / classification.
  • Access control
    • Billing team can see tokens per tenant, not prompts.
    • Infra team can see model & cluster metrics, not user info.
    • Safety / trust teams might have restricted access to sampled content.
  • Multi-region constraints
    • Keep token events in-region (e.g. EU vs US) for data residency.
    • Have region-local pipelines with a global metadata view (but no raw sensitive content crossing borders).
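
A sketch of the “store counts, not contents” and pseudonymization principles above: token counts pass through untouched, while user identifiers are replaced with a keyed hash before the event leaves the service. In practice the key would live in a secrets manager; here it is a placeholder.

import hmac, hashlib

PSEUDONYMIZATION_KEY = b"replace-with-a-managed-secret"   # placeholder; never hard-code in production

def pseudonymize(identifier: str) -> str:
    """Keyed hash (HMAC-SHA256): stable for joins, not reversible without the key."""
    return hmac.new(PSEUDONYMIZATION_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def sanitize_event(event: dict) -> dict:
    clean = dict(event)
    clean["user_id"] = pseudonymize(event["user_id"])
    clean.pop("prompt_text", None)    # counts only; raw prompt text never enters the pipeline
    return clean

print(sanitize_event({"user_id": "user_9df2", "input_tokens": 1523, "prompt_text": "..."}))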

If you picture it all together:

Services emit structured TokenUsageEvents → collected & streamed → real-time metrics & alerts + warehouse analytics → used for billing, capacity, safety, and product decisions.

What token telemetry looks like in practice

Metrics Examples (Prometheus Style)

Metrics are aggregated, not per-request. They are great for dashboards and alerts.

Token Counters

# HELP llm_input_tokens_total Total input tokens received by model.
# TYPE llm_input_tokens_total counter
llm_input_tokens_total{model="gpt-4.1", region="us-west"} 12938480012

# HELP llm_output_tokens_total Total output tokens generated by model.
# TYPE llm_output_tokens_total counter
llm_output_tokens_total{model="gpt-4.1", region="us-west"} 14589377421

Tokens per Second

llm_tokens_per_second{model="gpt-4.1", region="us-west"} 125430
llm_tokens_per_second{model="gpt-4.1-mini", region="eu-central"} 903210

Latency Buckets By Token Size

llm_request_latency_bucket{model="gpt-4.1",le="100",token_bucket="0_512"} 58291
llm_request_latency_bucket{model="gpt-4.1",le="300",token_bucket="512_4096"} 12893

Context Utilization Gauge

llm_context_usage_ratio{model="gpt-4.1", tenant="acme-corp"} 0.81

Batch Efficiency

llm_batch_size{model="gpt-4.1",gpu_id="A100-7"} 14
llm_batch_tokens{model="gpt-4.1",gpu_id="A100-7"} 17340

Log Examples (JSON Structured Logs)

Logs are per request, used for debugging, anomaly detection, and billing audits.

Gateway Log

{
  "timestamp": "2025-02-14T05:33:23.022Z",
  "service": "api-gateway",
  "request_id": "req_7e2f90d1",
  "tenant_id": "org_abc123",
  "model": "gpt-4.1",
  "region": "us-west",
  "input_tokens": 1423,
  "output_tokens": 816,
  "total_tokens": 2239,
  "status": 200,
  "latency_ms": 904
}

Model Server Log (GPU Level)

{
  "timestamp": "2025-02-14T05:33:23.145Z",
  "service": "model-server",
  "request_id": "req_7e2f90d1",
  "gpu_id": "A100-7",
  "model": "gpt-4.1",
  "input_tokens": 1423,
  "output_tokens": 816,
  "first_token_latency_ms": 172,
  "decode_speed_tokens_per_s": 745,
  "batch_size": 12,
  "batch_tokens": 16384
}

Suspicious Activity Log

{
  "timestamp": "2025-02-14T05:33:23.533Z",
  "service": "abuse-detector",
  "tenant_id": "org_free_trial_9981",
  "reason": "token_storm",
  "avg_output_tokens_last_50_requests": 4500,
  "spike_factor": 3.5,
  "action": "rate_limited"
}
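
A sketch of the kind of rule that could produce a log like the one above: compare a tenant’s recent average output tokens against its longer-term baseline and flag the spike factor. The window sizes and the 3× threshold are illustrative, not a recommendation.

from collections import defaultdict, deque

RECENT_WINDOW = 50       # requests in the "recent" average (matches the log field above)
SPIKE_THRESHOLD = 3.0    # flag when the recent average exceeds the baseline by this factor

history = defaultdict(lambda: deque(maxlen=1000))    # per-tenant output-token history

def record_and_check(tenant_id: str, output_tokens: int):
    h = history[tenant_id]
    h.append(output_tokens)
    if len(h) <= RECENT_WINDOW:
        return None
    recent = list(h)[-RECENT_WINDOW:]
    baseline = list(h)[:-RECENT_WINDOW]
    spike_factor = (sum(recent) / len(recent)) / max(1.0, sum(baseline) / len(baseline))
    if spike_factor >= SPIKE_THRESHOLD:
        return {"tenant_id": tenant_id, "reason": "token_storm",
                "avg_output_tokens_last_50_requests": sum(recent) / len(recent),
                "spike_factor": round(spike_factor, 1), "action": "rate_limited"}
    return None

for n in [300] * 200 + [4500] * 60:                  # normal traffic, then a storm
    alert = record_and_check("org_free_trial_9981", n)
print(alert)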

Trace Examples (OpenTelemetry Spans)

Traces show the entire lifecycle of one LLM request.

High-Level Trace

request (trace_id=abcd-1234)
 ├─ gateway.validate_request
 ├─ orchestrator.route_request
 │    ├─ rag.retrieve_documents
 │    ├─ rag.embed_query
 │    └─ rag.generate_context
 └─ model_server.generate (GPU inference)
       ├─ tokenize_input
       ├─ wait_for_batch
       ├─ run_decoder
       └─ stream_output

Example Span JSON

{
  "trace_id": "abcd-1234",
  "span_id": "span-12ab",
  "name": "model_server.generate",
  "start": "2025-02-14T05:33:23.124Z",
  "end": "2025-02-14T05:33:23.871Z",
  "attributes": {
    "model": "gpt-4.1",
    "region": "us-west",
    "input_tokens": 1423,
    "output_tokens": 816,
    "context_window_used": 0.62,
    "first_token_latency_ms": 172,
    "tokens_per_second": 745
  }
}

Tokenization Span

{
  "trace_id": "abcd-1234",
  "span_id": "span-1",
  "name": "tokenize_input",
  "attributes": {
    "tokenizer": "gpt4-tokenizer",
    "input_chars": 6318,
    "output_tokens": 1423,
    "time_ms": 12
  }
}

Batch Queue Span

{
  "name": "batch_queue_wait",
  "attributes": {
    "wait_time_ms": 49,
    "batch_size": 12,
    "batch_tokens": 16384
  }
}

These spans are essential for debugging latency regressions or GPU underutilization.

Full Token Usage Event (Canonical Event)

This is the unified event that powers:

  • billing
  • analytics
  • performance modeling
  • safety signals
  • customer usage dashboards

Example: TokenUsageEvent

{
  "request_id": "req_7e2f90d1",
  "timestamp": "2025-02-14T05:33:23.871Z",
  "tenant_id": "org_abc123",
  "user_id": "user_9df2",
  "model": "gpt-4.1",
  "region": "us-west",
  "input_tokens": 1423,
  "output_tokens": 816,
  "system_tokens": 52,
  "total_tokens": 2291,
  "latency_ms": 904,
  "first_token_latency_ms": 172,
  "tokens_per_second": 745,
  "context_window_used": 0.62,
  "status": "success",
  "generation_parameters": {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 2048
  }
}

This is typically what flows through Kafka, lands in warehouses, and drives billing tables.

Example of an Actual Derived Metric Query

In the analytics layer (Snowflake, BigQuery, etc.):

Compute daily cost per tenant

-- Assumes a small model_pricing lookup table (model, price_per_token); adapt names to your schema.
SELECT
  DATE(e.timestamp) AS day,
  e.tenant_id,
  e.model,
  SUM(e.total_tokens) AS tokens_used,
  SUM(e.total_tokens) * p.price_per_token AS estimated_cost
FROM token_usage_events AS e
JOIN model_pricing AS p
  ON p.model = e.model
GROUP BY 1, 2, 3, p.price_per_token;

Example Dashboard Cards (what SREs or PMs see)

Real-time alert: output tokens/sec dropped

Model: gpt-4.1
Region: us-west
Current Output Tokens/sec: 93,000
Baseline: 148,000
Drop: -37.2%

Billing Overview

Tenant: Acme Corp
Tokens (last 24h): 148,839,223
Est. Cost: $2,670.18
Primary Models Used: gpt-4.1, gpt-4.1-mini

Context Utilization

p95 context_window_used: 92%
Potential overflow risk: HIGH

Summary

Token telemetry in a real system includes:

  • Metrics → counts, rates, latencies
  • Logs → detailed request-level records
  • Traces → timing and structure of each request
  • Canonical token events → unified schema for billing + analytics

And together, they create the observability picture a hyperscaler needs to run large-scale LLM infrastructure.

A full end-to-end worked example of diagnosing an issue using token telemetry

The Incident

Symptom:
Pager goes off:

“🚨 ALERT: llm_output_tokens_per_second for gpt-4.1 in eu-west dropped 40% vs baseline (5-minute window).”

User-facing symptoms:

  • Customers see slower response times
  • Some timeouts at the tail (p99+)

We’ll diagnose this only using the telemetry layers (metrics → traces → logs → tokens).

Detection – Metrics

You open your “LLM Fleet – Regional Health” dashboard.

You see:

  • llm_output_tokens_per_second{model="gpt-4.1", region="eu-west"}
    • Baseline: ~120k tokens/sec
    • Now: ~70k tokens/sec

Other key metrics:

  • llm_requests_per_second → unchanged → load is steady
  • llm_input_tokens_per_second → steady → same volume & prompt sizes
  • llm_gpu_utilization on that cluster: dropped from ~90% to ~55%
  • llm_request_latency_p95 increased from 800ms → 1.6s

So:

  • Same number of requests and input tokens
  • GPUs are underutilized
  • Fewer output tokens/sec
  • Latency worse

This smells like generation speed / batching / model server rather than traffic.

First Triage – Narrowing Down

You slice metrics further:

  • Breakdown by model version:
    • gpt-4.1-v4 (new canary) vs gpt-4.1-v3 (stable)
  • You see:
    • For gpt-4.1-v4:
      • tokens_per_second: way lower
      • first_token_latency_ms: much higher
    • For gpt-4.1-v3: looks normal

You confirm with a metric:

llm_tokens_per_second{model="gpt-4.1-v4", region="eu-west"}
llm_tokens_per_second{model="gpt-4.1-v3", region="eu-west"}

Result:

  • v4 is degraded, v3 is fine.

You also check release dashboard and see:

“Today 10:02 UTC: Rolled out gpt-4.1-v4 to 50% traffic in eu-west.”

So now you know:

  • Regression likely tied to new model version
  • Not a global infra issue

Deep Dive – Traces + Token Metadata

You open a distributed tracing UI (e.g. Tempo/Jaeger/Datadog APM) and filter:

  • service = "model-server"
  • model = "gpt-4.1-v4"
  • Region eu-west
  • Last 15 minutes

Pick a couple of slow traces (p95+ latency).

Trace structure (simplified)

request (1,700ms)
 ├─ gateway.validate_request (10ms)
 ├─ orchestrator.route_request (40ms)
 └─ model_server.generate (1,600ms)
       ├─ tokenize_input (15ms)
       ├─ batch_queue_wait (140ms)
       ├─ run_decoder (1,350ms)
       └─ stream_output (95ms)

Within model_server.generate span, attributes show:

{
  "model": "gpt-4.1-v4",
  "input_tokens": 950,
  "output_tokens": 820,
  "context_window_used": 0.45,
  "first_token_latency_ms": 420,
  "tokens_per_second": 390
}

You compare with a healthy trace for gpt-4.1-v3:

{
  "model": "gpt-4.1-v3",
  "input_tokens": 960,
  "output_tokens": 810,
  "context_window_used": 0.46,
  "first_token_latency_ms": 180,
  "tokens_per_second": 750
}

Same token profiles, but:

  • First-token latency: 420ms vs 180ms
  • Tokens/sec: 390 vs 750

So:

  • It’s not user behavior (tokens), but how the new model handles them.

Logs – Looking for Corroborating Detail

Next, you inspect model-server logs for v4 in eu-west.

Sample log entry (degraded):

{
  "timestamp": "2025-02-14T10:12:03.145Z",
  "service": "model-server",
  "request_id": "req_7e2f90d1",
  "gpu_id": "A100-7",
  "model": "gpt-4.1-v4",
  "input_tokens": 947,
  "output_tokens": 823,
  "first_token_latency_ms": 437,
  "decode_speed_tokens_per_s": 381,
  "batch_size": 6,
  "batch_tokens": 8920,
  "note": "model_uses_new_sampling_kernel=true"
}

You compare logs from v3:

{
  "model": "gpt-4.1-v3",
  "first_token_latency_ms": 176,
  "decode_speed_tokens_per_s": 761,
  "batch_size": 16,
  "batch_tokens": 16840
}

You notice two critical patterns:

  1. Batch size is much smaller for v4 (6 vs 16)
  2. Decode speed (tokens/sec) is low despite available GPU headroom

You also pull GPU metrics and see:

  • GPU utilization: 55–60% (so we’re under-using hardware)
  • No surge in GPU errors / restarts

So token telemetry tells you:

  • Same input_tokens / output_tokens per request
  • But per-GPU batch_tokens is significantly lower for v4
  • And generation speed per token is slower

This points to:

  • Either a scheduling / batching bug
  • Or v4 being more computationally expensive per token than planned
  • Or a misconfigured kernel/precision setting (e.g. using fp32 instead of fp16/bf16)

Root Cause – Putting the Story Together

You check deployment configs / change logs for v4 and find:

  • A new feature flag: enable_safe_sampling_kernel=true
  • A note: “Fallback to unoptimized kernel path when this flag is set for v4.”

You run an internal test script that sends a standardized prompt to both v3 and v4, and compare token telemetry from that controlled test:

  • v3:
    • output_tokens: ~900
    • tokens_per_second: ~760
  • v4:
    • output_tokens: ~900
    • tokens_per_second: ~380

So even under identical prompt/token conditions, v4 is 2× slower.

Correlating:

  • Field in logs: "note": "using_safe_sampling_kernel=fallback_cpu_path"
  • Batch tokens are smaller because slower decode → less batch throughput

Final diagnosis:

The new gpt-4.1-v4 release accidentally enabled a slow sampling kernel fallback, halving decode speed and causing output_tokens/sec to drop, leading to underutilized GPUs and higher latency.

You got there largely through token telemetry:

  • tokens/sec
  • first_token_latency_ms
  • batch_size / batch_tokens
  • per-model / per-version breakdown

Fix – Mitigation & Verification

Mitigation steps

  1. Roll back traffic from v4 → v3 in eu-west (or disable slow kernel flag).
  2. Watch metrics:
    • llm_output_tokens_per_second{model="gpt-4.1", region="eu-west"} returns to ~120k
    • llm_request_latency_p95 drops back to baseline
    • llm_gpu_utilization returns to ~90%
  3. Confirm via traces:
    • tokens_per_second back to ~750
    • first_token_latency_ms back to ~180
    • batch_tokens similar to pre-incident values
  4. Close the incident once SLOs are met and stable.

Postmortem & Improvements

From here, you’d use the same token telemetry for learning and prevention:

  1. Pre-deploy load tests must compare tokens/sec and latency per token for new model versions vs baseline.
  2. Add an automatic canary guardrail:
    • If tokens/sec for the new version falls more than X% below the control version → automatic rollback (a minimal sketch follows this list)
  3. Add alerts:
    • llm_tokens_per_second{model_version} deviation from baseline
    • llm_batch_tokens dropping below threshold despite steady request volume
  4. Improve dashboards to show:
    • tokens/sec per version
    • tokens/sec per GPU
    • correlation of tokens_per_second with gpu_utilization
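
For the guardrail in item 2, a minimal sketch of the comparison logic, assuming you can query recent tokens/sec for the canary and control versions; the 20% threshold is an example, and the values shown are the ones observed during this incident.

MAX_REGRESSION = 0.20   # roll back if canary tokens/sec is >20% below control (example value)

def canary_tokens_per_sec_ok(canary_tps: float, control_tps: float) -> bool:
    """True if the canary's token throughput stays within the allowed regression band."""
    return canary_tps >= control_tps * (1 - MAX_REGRESSION)

# Values from the incident (gpt-4.1-v4 canary vs gpt-4.1-v3 control):
if not canary_tokens_per_sec_ok(canary_tps=390.0, control_tps=750.0):
    print("canary guardrail tripped: rolling back the new model version")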

Token-level metrics are what make this precise:

  • You’re not just seeing “latency is bad”
  • You’re seeing how the model’s relationship with tokens changed:
    • Slower tokens/sec
    • Smaller effective batch tokens
    • Same token shapes per request