“AI tokens” usually refer to the smallest units of text that a large language model (LLM) reads or produces. They are not full words—often they’re word pieces, characters, or even punctuation.
Here’s a clear breakdown:
What is a Token?
A token is a chunk of text. Depending on the language and tokenizer, a token might be:
- A whole short word → “cat”
- Part of a long word → “inter-”, “-national”
- A punctuation mark → “,”
- A single character in some languages → Chinese, Japanese
- Even whitespace or special system symbols
LLMs don’t think in letters or words—they think in tokens.
Why Tokens Matter
1. Pricing
Most AI models bill based on the number of tokens:
- Input tokens → what you send to the model
- Output tokens → what the model generates
More text = more tokens = higher cost.
2. Context Window
Models have a maximum number of tokens they can process at once (their context window).
Example: a 128k-token context model can handle roughly a 300–400 page book.
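A rough back-of-envelope check of that estimate, assuming ~0.75 English words per token and ~250–300 words per printed page (both are approximations, not fixed constants):

```python
# Back-of-envelope: how many book pages fit in a 128k-token context window?
# Assumptions (English prose, approximate): ~0.75 words per token, 250-300 words per page.
context_tokens = 128_000
words = context_tokens * 0.75                      # ~96,000 words
pages_low, pages_high = words / 300, words / 250   # denser vs. lighter page layouts
print(f"~{pages_low:.0f}-{pages_high:.0f} pages")  # -> ~320-384 pages
```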
3. Speed
More tokens → slower responses.
4. Precision
Tokenization affects:
- How a model understands spelling variations
- How it interprets compound words
- How it handles programming languages or math
How Tokenization Works (Simple Example)
Sentence:
“ChatGPT loves learning.”
A tokenizer might split it into:
- “Chat”
- “G”
- “PT”
- “ loves”
- “ learning”
- “.”
Total: 6 tokens
Another tokenizer might split it differently depending on how it was trained.
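To see real splits yourself, here is a minimal sketch using OpenAI's open-source tiktoken library; cl100k_base is one widely used encoding, and other encodings will split the same sentence differently:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # one common GPT encoding
text = "ChatGPT loves learning."

token_ids = enc.encode(text)                    # list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # decode each ID back to its text piece

print(len(token_ids), pieces)
# Example output (encoding-dependent): 6 ['Chat', 'G', 'PT', ' loves', ' learning', '.']
```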
Types of AI Tokens
1. Text Tokens
Used by LLMs for natural language.
2. Code Tokens
Optimized for programming languages; code-focused models such as Code Llama may use tokenization patterns better suited to source code.
3. Special Tokens
Used internally:
- Chat-role markers such as <|system|>, <|user|>, <|assistant|>
- End-of-text markers
These help the model structure conversations.
Tokenization Methods You Might See
- BPE (Byte Pair Encoding) — Very common, used by GPT models.
- SentencePiece — Used by some Google models.
- WordPiece — Used in BERT.
- Unigram LM — Another method used internally by some models.
All aim to compress text efficiently.
Tokens vs Cryptocurrency Tokens
People sometimes confuse the two.
AI tokens ≠ crypto tokens (unless you’re talking about blockchain-based AI projects).
In AI language models, token simply means a chunk of text.
Monetizing AI Tokens
When people talk about “monetizing AI tokens,” they usually mean charging money based on how many tokens an AI model uses. This happens in three main contexts:
1. Monetizing AI Usage (the most common meaning)
This refers to how companies like OpenAI, Anthropic, Google, or API providers make money.
They charge per input token and output token.
Example:
- You send 2,000 tokens → billed
- The AI replies with 1,000 tokens → billed
- Total = 3,000 tokens
This is the core economic unit of LLMs.
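A minimal sketch of how that per-token billing adds up; the prices below are made-up placeholders, not any provider's actual rates:

```python
# Hypothetical pricing (illustrative only; real rates vary by provider and model).
PRICE_PER_1M_INPUT = 3.00    # USD per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT = 12.00  # USD per 1M output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under the assumed pricing."""
    return (input_tokens / 1_000_000) * PRICE_PER_1M_INPUT \
         + (output_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT

# The example above: 2,000 input + 1,000 output = 3,000 billed tokens.
print(f"${request_cost(2_000, 1_000):.4f}")  # -> $0.0180 at the assumed prices
```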
Why monetize tokens?
Because tokens are:
- Predictable units of compute
- Directly tied to GPU usage and cost
- Easy to meter for billing
- Scalable for apps and businesses
This is the same reason cloud providers bill per compute-second.
2. Token Monetization for Developers / App Builders
If you build an app using an LLM (chatbot, agent, game, coding tool, etc.), your cost is in tokens, so:
- You pay wholesale token prices to the model provider
- You resell or mark up the service to your users
- Profit comes from margin, efficiency, or value-add features
So when a founder says:
“We need to monetize tokens.”
They mean:
“We need to charge users enough to cover the model’s token costs and turn a profit.”
3. Internal Economics: How AI Companies Monetize Tokens
Behind the scenes:
- GPUs run inference
- GPUs cost money per second
- Token throughput determines cost per query
So:
Tokens = compute time = money
AI labs map token pricing to:
- Electricity and GPU costs
- Model size
- Hardware efficiency
- Demand and market competition
Tokens are essentially micropayments for LLM compute.
Summary
When people talk about monetizing AI tokens, they almost always mean:
Charging users based on the number of tokens processed by an AI model—which ties directly to compute cost.
Why Token Observability Matters
1. Cost Control & Margin Protection
Tokens ≈ compute.
Compute = the single largest cost center for a hyperscaler running LLM inference.
Monitoring token metrics helps answer:
- Which workloads are consuming the most tokens?
- Are certain customers causing spiky or abusive token usage?
- Are model changes increasing token-per-request cost?
Without token-level data, you can’t accurately understand or optimize unit economics.
2. Capacity Planning & Scaling
Token throughput is directly tied to:
- GPU utilization
- Model saturation
- Latency under load
- Queueing behavior
Hyperscalers use token telemetry to:
- Predict peak demand
- Scale GPU clusters
- Allocate inference servers
- Tune batching efficiency
Token rate = the strongest predictor of load.
3. Performance Monitoring
Token-level metrics reveal performance bottlenecks:
- Drops in tokens/sec → GPU underutilization or network issues
- Slow output token generation → model regression or hardware throttling
- Sudden input-token spikes → possible DDoS or abusive workload
Token observability gives real-time performance signals that logs and traces alone cannot.
4. Abuse Detection & Security
Token patterns can reveal:
- Automated scraping
- Prompt injection attempts
- Misuse of free-tier accounts
- Traffic laundering
- API key sharing
Hyperscalers often build token anomaly detectors to block or throttle bad actors.
5. Customer Billing & Transparency
Tokens are the billing unit, so observability supports:
- Accurate metering
- Customer usage dashboards
- Invoice reconciliation
- Disputes handling
If you can’t monitor tokens precisely, you can’t bill precisely.
6. Product Insights and Research
Token telemetry helps model teams understand:
- Which features produce high output-token inflation
- Which models yield the most efficient tokens-per-task
- How users structure prompts
- The distribution of prompt lengths across verticals
This feeds into model optimization and product strategy.
What Hyperscalers Actually Monitor
A serious AI provider typically records:
Per-request metrics
- Input tokens
- Output tokens
- Total tokens
- Tokens/s (input and output separately)
- Latency per stage (tokenization, inference, streaming, post-processing)
Aggregated metrics
- Avg. tokens per customer per day
- Avg. tokens per model
- Peak token throughput
- GPU tokens/sec efficiency
- Cost-per-1M tokens (by model & hardware)
Anomaly signals
- Token spikes
- Sudden distribution shifts
- Abnormal output-token growth
- Token storms (malicious loops)
How Hyperscalers Use Token Observability Operationally
Real-time dashboards show:
- Tokens/sec per cluster
- Tokens/sec per model
- Cost per token by model family
- GPU efficiency mapped to token throughput
Alerts trigger when:
- Tokens/sec drop below cluster baseline
- Output token rate spikes
- Billing anomalies arise
- Context-length usage nears limits
- Tokenization error rate rises
Token observability is treated similarly to:
- CPU load
- I/O throughput
- Memory pressure
in traditional cloud infrastructure — but even more essential, because tokens are the product.
Conclusion
For a hyperscaler, yes — it is extremely important to monitor token-level data. Tokens are the backbone of:
- Infrastructure efficiency
- Cost control
- Security
- Billing
- Product insights
- Scaling
- Model performance
Monitoring tokens is equivalent to monitoring compute, cost, customer experience, and revenue.
The Full Observability Stack for LLM Platforms
Metrics → Traces → Logs → Token-level Metadata → Derived Intelligence
This stack looks similar to traditional cloud observability, but AI adds new layers that hyperscalers must track.
Metrics (the high-level, real-time health check)
Metrics are numeric, aggregated, time-series values that tell you how the system is behaving right now.
Standard infra metrics (still essential)
- CPU utilization
- GPU utilization
- GPU memory pressure
- Disk & network I/O
- Latency percentiles (p50, p95, p99)
AI-specific metrics (new)
These are introduced because LLM behavior depends on tokens:
Input Tokens/sec
High input load means:
- Users are sending long prompts
- Prompt/chat augmentation systems are expanding context
Output Tokens/sec
Drops indicate:
- GPU/TPU throttling
- Model regression
- Saturated clusters
- Poor batching efficiency
Tokens per request (avg, p95, p99)
Useful for:
- Capacity planning
- Billing accuracy
- Detecting abuse (e.g., extremely long conversations)
Context Window Utilization %
When users approach ~80–100% of max tokens:
- Latency spikes
- GPU memory spikes
- Errors rise (context overflow)
Cost per 1M tokens
Internally tracked even if not exposed externally.
Batching Efficiency
LLM servers batch requests to keep GPUs fully fed.
Token metrics drive batching decisions.
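As a sketch of how a model server might expose these metrics, here is a minimal example using the Python prometheus_client library; the metric names mirror the ones above but are conventions assumed here, not a standard:

```python
# pip install prometheus-client
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Token throughput, sliceable by model / region / tenant
# (exported as llm_input_tokens_total / llm_output_tokens_total).
INPUT_TOKENS = Counter("llm_input_tokens", "Input tokens received",
                       ["model", "region", "tenant"])
OUTPUT_TOKENS = Counter("llm_output_tokens", "Output tokens generated",
                        ["model", "region", "tenant"])

# Per-request token distribution (capacity planning, abuse detection).
TOKENS_PER_REQUEST = Histogram("llm_tokens_per_request", "Total tokens per request",
                               buckets=[256, 1024, 4096, 16384, 65536, 131072])

# Context window utilization as a 0.0-1.0 gauge.
CONTEXT_USAGE = Gauge("llm_context_usage_ratio", "Fraction of context window used", ["model"])

def record_request(model, region, tenant, input_tokens, output_tokens, context_ratio):
    INPUT_TOKENS.labels(model, region, tenant).inc(input_tokens)
    OUTPUT_TOKENS.labels(model, region, tenant).inc(output_tokens)
    TOKENS_PER_REQUEST.observe(input_tokens + output_tokens)
    CONTEXT_USAGE.labels(model).set(context_ratio)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    record_request("gpt-4.1", "us-west", "acme-corp", 1423, 816, 0.62)
```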
Traces (the per-request story)
Traces show how a single request flows through the AI system, end-to-end.
Why traces are critical for LLMs
LLM inference has many stages:
- Receive request
- Tokenize input
- Validate safety filters
- Route to the correct model
- Reserve GPU memory
- Batch with other requests
- Run inference
- Stream output tokens
- Safety redaction or compression
- Return to customer
A trace shows timing for each stage.
AI-specific trace spans
Hyperscalers add spans such as:
- tokenization_time_ms
- model_loading_time_ms (if cold start)
- batch_queue_wait_ms
- first_token_latency_ms (how fast generation starts)
- avg_output_tokens_per_s (generation speed)
- safety_filter_decisions
- cache_hit/miss for retrieval-augmented generation (RAG)
This is where token-level metadata attaches to the request story.
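A minimal sketch of attaching those token-level attributes to a span with the OpenTelemetry Python API; it assumes an SDK and exporter are configured elsewhere, and run_decoder is a placeholder for the real inference call:

```python
# pip install opentelemetry-api opentelemetry-sdk
import time
from opentelemetry import trace

tracer = trace.get_tracer("model-server")

def run_decoder(prompt_tokens):
    """Placeholder for the actual GPU inference call (hypothetical)."""
    return [101, 102, 103]

def generate(prompt_tokens):
    with tracer.start_as_current_span("model_server.generate") as span:
        span.set_attribute("input_tokens", len(prompt_tokens))

        start = time.perf_counter()
        output_tokens = run_decoder(prompt_tokens)
        elapsed = max(time.perf_counter() - start, 1e-6)  # avoid div-by-zero for the stub

        span.set_attribute("output_tokens", len(output_tokens))
        span.set_attribute("tokens_per_second", len(output_tokens) / elapsed)
        return output_tokens
```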
Logs (the granular, textual details)
Logs are fine-grained raw information useful for debugging and audits.
Standard logs
- Errors
- Warnings
- Timeouts
- API failures
- Model load/unload events
AI-specific logs
These include:
- Tokenizer failures
- Abnormally large token counts
- Prompt-injection detection
- Bad formatting given to LLMs
- Model safety blocks
- Batching decision logs
- GPU kernel execution logs
- Per-layer inference anomalies
Logs are essential for diagnosing:
- Why a GPU crashed
- Why latency spiked
- Why a user got incomplete output
- Why safety filters triggered
Token-Level Metadata (the new layer unique to LLM systems)
This is the observability layer that did not exist before LLMs.
Token metadata can be attached:
- At request-level (summary)
- At span-level (trace)
- At event-level (log mini-records)
What token metadata includes
Per-request
- input_token_count
- output_token_count
- system_token_count (system + tool messages)
- total_token_count
Streaming-level
- tokens_per_chunk
- time_between_chunks (latency signal)
- decoding_sampling_metadata (temperature, top_p, frequency penalties)
User behavior
- Average tokens per conversation
- Max tokens per session
- Token distribution patterns per customer
Billing
- Which user consumed how many tokens
- What model they used
- Which organization it belongs to
Why token metadata matters
It enables:
- Accurate billing
- Anomaly detection
- Performance regression detection
- GPU efficiency optimization
- Fair usage limits
- Safety monitoring
- Product insights
It is the single most important observability layer for scaling AI reliably.
Derived Intelligence Layer (the hyperscaler “secret sauce”)
This is where hyperscalers turn raw token data into business and operational intelligence.
Examples:
Predictive scaling
Use tokens/sec + historical trends to forecast when to spin up more GPU instances.
Token anomaly detection
Detect:
- Token storms (abusive loops)
- Sudden prompt-length explosions
- Mass scraping
- Token patterns typical of jailbreak attempts
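A minimal sketch of one way to flag token storms: compare each tenant's latest output-token count to its own trailing baseline. The 50-request window and the 3x spike factor are arbitrary assumptions:

```python
from collections import deque

class TokenStormDetector:
    """Flags tenants whose output-token usage spikes far above their own recent baseline."""

    def __init__(self, window=50, spike_factor=3.0):
        self.window = window              # recent requests kept per tenant
        self.spike_factor = spike_factor  # how far above baseline counts as a storm
        self.history = {}                 # tenant_id -> deque of recent output token counts

    def observe(self, tenant_id, output_tokens):
        hist = self.history.setdefault(tenant_id, deque(maxlen=self.window))
        is_storm = False
        if len(hist) >= 10:               # wait for some baseline before judging
            baseline = sum(hist) / len(hist)
            is_storm = output_tokens > self.spike_factor * baseline
        hist.append(output_tokens)
        return is_storm

detector = TokenStormDetector()
for tokens in [800, 900, 850, 820, 870, 910, 840, 880, 860, 890, 4500]:
    if detector.observe("org_free_trial_9981", tokens):
        print("token_storm detected -> throttle or rate-limit tenant")
```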
Per-customer cost modeling
Compute:
- cost_per_token_per_customer
- margin_per_customer
- expected_token_growth
Model performance regression detection
If output tokens/sec drops after a new model version → revert or investigate.
Token distribution insights
Understand:
- How customers structure their prompts
- How long typical conversations last
- Whether your models are too verbose
How These Layers Work Together
Here’s the flow:
| Layer | Purpose |
|---|---|
| Metrics | Quick health check & real-time monitoring |
| Traces | Deep visibility into each request’s path |
| Logs | Detailed debugging + forensic analysis |
| Token-level metadata | AI-specific insight for billing, cost, performance, and safety |
| Derived intelligence | Forecasting, anomaly detection, business insights |
Without token-level metadata, the rest of the observability stack fails to diagnose why LLM systems behave the way they do.
How to architect token observability pipelines
(distributed design, telemetry ingestion, GPU node metrics, deduping, sampling, retention, privacy, etc.)
Goals of a Token Observability Pipeline
Before wiring anything, you design around a few core goals:
- Billing – exact input/output tokens per tenant, per model, per time period
- Cost & capacity – tokens/sec per cluster/model for GPU planning
- Performance – latency vs tokens, tokens/sec, context usage
- Safety & abuse detection – unusual token patterns, storms, spikes
- Product insight – how people actually “spend” their tokens
Everything in the pipeline should serve at least one of these.
High-Level Architecture
Think in four layers:
- Producers – where token data is generated
- Collectors/Agents – local sidecars or SDKs for telemetry
- Transport – message bus / metrics pipeline
- Backends – time-series DB, log store, data warehouse, feature stores
A typical stack could be:
- Producers: API Gateway, Orchestrator, Model Servers
- Collectors: OpenTelemetry agents, custom sidecars
- Transport: Kafka / Pulsar (events), Prometheus remote write (metrics)
- Backends:
- Time-series: Prometheus / Cortex / Mimir / Thanos
- Logs: Elasticsearch / OpenSearch / ClickHouse
- Warehouse: BigQuery / Snowflake / Redshift
- Online store: Redis / Feature Store for real-time detection
Where Token Data Is Collected
You usually instrument at multiple layers to cross-check and avoid blind spots:
1. API Gateway
- Knows: tenant, API key, endpoint, region, status code
- Can record: high-level token counts per request (from response headers / body)
- Good for billing & rate limiting.
2. Orchestration Layer
(Your “brain”: routes calls to models, tools, RAG, function calling)
- Knows: which model, which tools, which pipeline was used
- Can log:
- input_token_count
- output_token_count
- system_token_count
- effective context_size
- retries / fallbacks
- Good for cost attribution, per-feature usage, A/B tests.
3. Model / Inference Servers
- Closest to GPUs
- Know:
- exact tokenization
- decode speed (tokens/sec)
- batching behavior
- Emit:
- tokens_in / tokens_out
- first_token_latency_ms
- tokens_per_second
- batch_size & batch_tokens
- Critical for performance & hardware efficiency.
4. GPU / Hardware Telemetry
- Expose: GPU utilization, memory, kernel errors
- Correlate with token throughput from inference layer
You want correlated token metrics at each hop: gateway ↔ orchestrator ↔ model server.
Data Model / Schema for Token Events
Define a canonical token event (or a couple of them) that all services emit.
Example: TokenUsageEvent
Core fields:
- request_id
- trace_id / span_id (for linking to traces)
- timestamp
- tenant_id / org_id / user_id (or hashed)
- model_id (e.g. gpt-4.1-mini)
- region / cluster
Token details:
- input_tokens
- output_tokens
- system_tokens (system + tool messages)
- total_tokens
- context_window_used (percentage or absolute)
- generation_parameters (temperature, top_p, etc.)
Operational:
- status (success, timeout, error code)
- latency_ms (end-to-end)
- first_token_latency_ms
- tokens_per_second
You can then derive metrics from these events, rather than letting every service invent its own schema.
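A sketch of that canonical event as a typed record; the field names follow the lists above, while the exact types and defaults are assumptions:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class TokenUsageEvent:
    # Core fields
    request_id: str
    trace_id: str
    tenant_id: str
    model_id: str
    region: str
    # Token details
    input_tokens: int
    output_tokens: int
    system_tokens: int = 0
    context_window_used: float = 0.0      # fraction of the max context, 0.0-1.0
    generation_parameters: dict = field(default_factory=dict)
    # Operational
    status: str = "success"
    latency_ms: int = 0
    first_token_latency_ms: int = 0
    tokens_per_second: float = 0.0
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens + self.system_tokens

    def to_record(self) -> dict:
        """Serializable form for the message bus / warehouse."""
        return {**asdict(self), "total_tokens": self.total_tokens}
```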
Real-Time Path: Metrics & Alerts
From the stream of token events, you build live metrics.
Steps
- Emit counters & gauges from services:
- tokens_in_total{model, region, tenant}
- tokens_out_total{model, region, tenant}
- requests_total{status, model}
- tokens_per_request_bucket histograms
- Scrape or push into a metrics backend (e.g. Prometheus).
- Aggregate & alert:
- Alerts on tokens/sec dropping (possible outage)
- Tokens/sec spiking (possible abuse or launch)
- Latency vs tokens (p95 > SLO)
- Context utilization near 100% (risk of errors)
Dashboards (examples)
- Capacity dashboard
- Tokens/sec by model & region
- GPU utilization vs tokens/sec
- Batch efficiency vs tokens
- Billing / finance dashboard
- Tokens/day per tenant & model
- Cost extrapolated from tokens
- Reliability dashboard
- Error rate vs tokens
- Latency percentiles segmented by token buckets (0–1k, 1k–8k, etc.)
Metrics are usually aggregated, not per-request; they give the “health overview”.
Batch / Analytics Path
All token events should also land in a data lake / warehouse for deep analysis.
Pipeline
- Services emit TokenUsageEvent to Kafka (or similar).
- Stream is:
- Mirrored to warehouse (e.g. via Kafka Connect / Flink / Beam)
- Optionally pre-aggregated per minute/hour for heavy tenants
- In warehouse, build:
- Billing tables: tokens per org/model/day
- Product analytics: average tokens per feature/workflow
- Cost modeling: map tokens → GPU hours → $$
- Forecasting: time series of token usage
Uses
- Finance: margin per customer, pricing strategy
- Product: which features are driving token usage?
- Infra: predicting when another GPU cluster is needed
This batch layer is where data scientists and analysts live.
Sampling, Aggregation & Retention
Token data is high-volume. Hyperscalers must be smart about storage.
Techniques
- Event sampling:
- Keep 100% of billing-relevant fields
- Sample detailed trace/log info (e.g. 1–5% of requests)
- Time-based aggregation:
- Raw events kept for X days
- Hourly/daily aggregates retained for months/years
- Dimension reduction:
- Only keep the necessary tags: model, region, tenant, status
- Avoid high-cardinality chaos like arbitrary user-supplied IDs unless hashed carefully
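One common way to implement the "keep billing, sample detail" split is deterministic hash-based sampling keyed on request_id, so every service makes the same keep/drop decision without coordination; a minimal sketch (the rates shown are arbitrary):

```python
import hashlib

def keep_detailed_telemetry(request_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministic head sampling: the same request_id always yields the same decision,
    so gateway, orchestrator, and model server stay consistent."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform value in [0, 1)
    return bucket < sample_rate

# Billing counts are always emitted; detailed traces/logs only for the sampled slice.
for i in range(5):
    rid = f"req_{i:04d}"
    decision = "keep detail" if keep_detailed_telemetry(rid, sample_rate=0.4) else "counts only"
    print(rid, decision)
```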
Example policy
- Raw TokenUsageEvent: retained 7–30 days
- Aggregated per-tenant-per-day tokens: retained for years (for billing & compliance)
Privacy, Safety, and Multi-Tenant Concerns
Tokens are generated from user prompts, which may be sensitive. Your pipeline must be privacy-conscious.
Key principles
- Store counts, not contents
- Token observability rarely needs raw text.
- Keep input_tokens=1523, not the actual prompt.
- Anonymize tenant/user identifiers
- Hash or pseudonymize user IDs
- Ensure per-tenant isolation in dashboards
- Redact or tokenize PII before analytics
- If you store prompt samples for quality analysis, pass them through PII redaction / classification.
- Access control
- Billing team can see tokens per tenant, not prompts.
- Infra team can see model & cluster metrics, not user info.
- Safety / trust teams might have restricted access to sampled content.
- Multi-region constraints
- Keep token events in-region (e.g. EU vs US) for data residency.
- Have region-local pipelines with a global metadata view (but no raw sensitive content crossing borders).
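For the "hash or pseudonymize user IDs" point, a minimal sketch using a keyed HMAC so raw identifiers never enter the telemetry stream; in practice the key would come from a secret manager, and the environment variable name here is just illustrative:

```python
import hashlib
import hmac
import os

# Illustrative only: in production this key lives in a KMS / secret manager.
PSEUDONYM_KEY = os.environ.get("TELEMETRY_PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymize(raw_id: str) -> str:
    """Stable keyed pseudonym: the same user always maps to the same telemetry identity,
    but the raw ID cannot be recovered from telemetry alone."""
    return hmac.new(PSEUDONYM_KEY, raw_id.encode(), hashlib.sha256).hexdigest()[:16]

event = {
    "tenant_id": pseudonymize("org_abc123"),
    "user_id": pseudonymize("user_9df2"),
    "input_tokens": 1423,   # counts, not contents
    "output_tokens": 816,
}
print(event)
```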
If you picture it all together:
Services emit structured TokenUsageEvents → collected & streamed → real-time metrics & alerts + warehouse analytics → used for billing, capacity, safety, and product decisions.
What token telemetry looks like in practice
Metrics Examples (Prometheus Style)
Metrics are aggregated, not per-request. They are great for dashboards and alerts.
Token Counters
# HELP llm_input_tokens_total Total input tokens received by model.
# TYPE llm_input_tokens_total counter
llm_input_tokens_total{model="gpt-4.1", region="us-west"} 12938480012
# HELP llm_output_tokens_total Total output tokens generated by model.
# TYPE llm_output_tokens_total counter
llm_output_tokens_total{model="gpt-4.1", region="us-west"} 14589377421
Tokens per Second
llm_tokens_per_second{model="gpt-4.1", region="us-west"} 125430
llm_tokens_per_second{model="gpt-4.1-mini", region="eu-central"} 903210
Latency Buckets By Token Size
llm_request_latency_bucket{model="gpt-4.1",le="100",token_bucket="0_512"} 58291
llm_request_latency_bucket{model="gpt-4.1",le="300",token_bucket="512_4096"} 12893
Context Utilization Gauge
llm_context_usage_ratio{model="gpt-4.1", tenant="acme-corp"} 0.81
Batch Efficiency
llm_batch_size{model="gpt-4.1",gpu_id="A100-7"} 14
llm_batch_tokens{model="gpt-4.1",gpu_id="A100-7"} 17340
Log Examples (JSON Structured Logs)
Logs are per request, used for debugging, anomaly detection, and billing audits.
Gateway Log
{
"timestamp": "2025-02-14T05:33:23.022Z",
"service": "api-gateway",
"request_id": "req_7e2f90d1",
"tenant_id": "org_abc123",
"model": "gpt-4.1",
"region": "us-west",
"input_tokens": 1423,
"output_tokens": 816,
"total_tokens": 2239,
"status": 200,
"latency_ms": 904
}
Model Server Log (GPU Level)
{
"timestamp": "2025-02-14T05:33:23.145Z",
"service": "model-server",
"request_id": "req_7e2f90d1",
"gpu_id": "A100-7",
"model": "gpt-4.1",
"input_tokens": 1423,
"output_tokens": 816,
"first_token_latency_ms": 172,
"decode_speed_tokens_per_s": 745,
"batch_size": 12,
"batch_tokens": 16384
}
Suspicious Activity Log
{
"timestamp": "2025-02-14T05:33:23.533Z",
"service": "abuse-detector",
"tenant_id": "org_free_trial_9981",
"reason": "token_storm",
"avg_output_tokens_last_50_requests": 4500,
"spike_factor": 3.5,
"action": "rate_limited"
}
Trace Examples (OpenTelemetry Spans)
Traces show the entire lifecycle of one LLM request.
High-Level Trace
request (trace_id=abcd-1234)
├─ gateway.validate_request
├─ orchestrator.route_request
│ ├─ rag.retrieve_documents
│ ├─ rag.embed_query
│ └─ rag.generate_context
└─ model_server.generate (GPU inference)
├─ tokenize_input
├─ wait_for_batch
├─ run_decoder
└─ stream_output
Example Span JSON
{
"trace_id": "abcd-1234",
"span_id": "span-12ab",
"name": "model_server.generate",
"start": "2025-02-14T05:33:23.124Z",
"end": "2025-02-14T05:33:23.871Z",
"attributes": {
"model": "gpt-4.1",
"region": "us-west",
"input_tokens": 1423,
"output_tokens": 816,
"context_window_used": 0.62,
"first_token_latency_ms": 172,
"tokens_per_second": 745
}
}
Tokenization Span
{
"trace_id": "abcd-1234",
"span_id": "span-1",
"name": "tokenize_input",
"attributes": {
"tokenizer": "gpt4-tokenizer",
"input_chars": 6318,
"output_tokens": 1423,
"time_ms": 12
}
}
Batch Queue Span
{
"name": "batch_queue_wait",
"attributes": {
"wait_time_ms": 49,
"batch_size": 12,
"batch_tokens": 16384
}
}
These spans are essential for debugging latency regressions or GPU underutilization.
Full Token Usage Event (Canonical Event)
This is the unified event that powers:
- billing
- analytics
- performance modeling
- safety signals
- customer usage dashboards
Example: TokenUsageEvent
{
"request_id": "req_7e2f90d1",
"timestamp": "2025-02-14T05:33:23.871Z",
"tenant_id": "org_abc123",
"user_id": "user_9df2",
"model": "gpt-4.1",
"region": "us-west",
"input_tokens": 1423,
"output_tokens": 816,
"system_tokens": 52,
"total_tokens": 2291,
"latency_ms": 904,
"first_token_latency_ms": 172,
"tokens_per_second": 745,
"context_window_used": 0.62,
"status": "success",
"generation_parameters": {
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 2048
}
}
This is typically what flows through Kafka, lands in warehouses, and drives billing tables.
Example of an Actual Derived Metric Query
In the analytics layer (Snowflake, BigQuery, etc.):
Compute daily cost per tenant
SELECT
DATE(timestamp) AS day,
tenant_id,
model,
SUM(total_tokens) AS tokens_used,
SUM(total_tokens) * price_per_token(model) AS estimated_cost  -- price_per_token() is assumed to be a pricing UDF or a join against a rate table
FROM token_usage_events
GROUP BY 1,2,3;
Example Dashboard Cards (what SREs or PMs see)
Real-time alert: output tokens/sec dropped
Model: gpt-4.1
Region: us-west
Current Output Tokens/sec: 93,000
Baseline: 148,000
Drop: -37.2%
Billing Overview
Tenant: Acme Corp
Tokens (last 24h): 148,839,223
Est. Cost: $2,670.18
Primary Models Used: gpt-4.1, gpt-4.1-mini
Context Utilization
p95 context_window_used: 92%
Potential overflow risk: HIGH
Summary
Token telemetry in a real system includes:
- Metrics → counts, rates, latencies
- Logs → detailed request-level records
- Traces → timing and structure of each request
- Canonical token events → unified schema for billing + analytics
And together, they create the observability picture a hyperscaler needs to run large-scale LLM infrastructure.
A full end-to-end worked example of diagnosing an issue using token telemetry
The Incident
Symptom:
Pager goes off:
“🚨 ALERT: llm_output_tokens_per_second for gpt-4.1 in eu-west dropped 40% vs baseline (5-minute window).”
User-facing symptoms:
- Customers see slower response times
- Some timeouts at the tail (p99+)
We’ll diagnose this only using the telemetry layers (metrics → traces → logs → tokens).
Detection – Metrics
You open your “LLM Fleet – Regional Health” dashboard.
You see:
- llm_output_tokens_per_second{model="gpt-4.1", region="eu-west"}
- Baseline: ~120k tokens/sec
- Now: ~70k tokens/sec
Other key metrics:
- llm_requests_per_second → unchanged → load is steady
- llm_input_tokens_per_second → steady → same volume & prompt sizes
- llm_gpu_utilization on that cluster: dropped from ~90% to ~55%
- llm_request_latency_p95 increased from 800ms → 1.6s
So:
- Same number of requests and input tokens
- GPUs are underutilized
- Fewer output tokens/sec
- Latency worse
This smells like generation speed / batching / model server rather than traffic.
First Triage – Narrowing Down
You slice metrics further:
- Breakdown by model version: gpt-4.1-v4 (new canary) vs gpt-4.1-v3 (stable)
- You see:
- For gpt-4.1-v4: tokens_per_second is much lower, first_token_latency_ms is much higher
- For gpt-4.1-v3: looks normal
You confirm with a metric:
llm_tokens_per_second{model="gpt-4.1-v4", region="eu-west"}
llm_tokens_per_second{model="gpt-4.1-v3", region="eu-west"}
Result:
- v4 is degraded, v3 is fine.
You also check the release dashboard and see:
“Today 10:02 UTC: Rolled out gpt-4.1-v4 to 50% traffic in eu-west.”
So now you know:
- Regression likely tied to new model version
- Not a global infra issue
Deep Dive – Traces + Token Metadata
You open a distributed tracing UI (e.g. Tempo/Jaeger/Datadog APM) and filter:
- service = "model-server"
- model = "gpt-4.1-v4"
- Region: eu-west
Pick a couple of slow traces (p95+ latency).
Trace structure (simplified)
request (1,700ms)
├─ gateway.validate_request (10ms)
├─ orchestrator.route_request (40ms)
└─ model_server.generate (1,600ms)
├─ tokenize_input (15ms)
├─ batch_queue_wait (140ms)
├─ run_decoder (1,350ms)
└─ stream_output (95ms)
Within the model_server.generate span, the attributes show:
{
"model": "gpt-4.1-v4",
"input_tokens": 950,
"output_tokens": 820,
"context_window_used": 0.45,
"first_token_latency_ms": 420,
"tokens_per_second": 390
}
You compare with a healthy trace for gpt-4.1-v3:
{
"model": "gpt-4.1-v3",
"input_tokens": 960,
"output_tokens": 810,
"context_window_used": 0.46,
"first_token_latency_ms": 180,
"tokens_per_second": 750
}
Same token profiles, but:
- First-token latency: 420ms vs 180ms
- Tokens/sec: 390 vs 750
So:
- It’s not user behavior (tokens), but how the new model handles them.
Logs – Looking for Corroborating Detail
Next, you inspect model-server logs for v4 in eu-west.
Sample log entry (degraded):
{
"timestamp": "2025-02-14T10:12:03.145Z",
"service": "model-server",
"request_id": "req_7e2f90d1",
"gpu_id": "A100-7",
"model": "gpt-4.1-v4",
"input_tokens": 947,
"output_tokens": 823,
"first_token_latency_ms": 437,
"decode_speed_tokens_per_s": 381,
"batch_size": 6,
"batch_tokens": 8920,
"note": "model_uses_new_sampling_kernel=true"
}
You compare logs from v3:
{
"model": "gpt-4.1-v3",
"first_token_latency_ms": 176,
"decode_speed_tokens_per_s": 761,
"batch_size": 16,
"batch_tokens": 16840
}
You notice two critical patterns:
- Batch size is much smaller for v4 (6 vs 16)
- Decode speed (tokens/sec) is low despite available GPU headroom
You also pull GPU metrics and see:
- GPU utilization: 55–60% (so we’re under-using hardware)
- No surge in GPU errors / restarts
So token telemetry tells you:
- Same input_tokens / output_tokens per request
- But per-GPU batch_tokens is significantly lower for v4
- And generation speed per token is slower
This points to:
- Either a scheduling / batching bug
- Or v4 being more computationally expensive per token than planned
- Or a misconfigured kernel/precision setting (e.g. using fp32 instead of fp16/bf16)
Root Cause – Putting the Story Together
You check deployment configs / change logs for v4 and find:
- A new feature flag: enable_safe_sampling_kernel=true
- A note: “Fallback to unoptimized kernel path when this flag is set for v4.”
You run an internal test script that sends a standardized prompt to both v3 and v4, and compare token telemetry from that controlled test:
- v3: output_tokens ~900, tokens_per_second ~760
- v4: output_tokens ~900, tokens_per_second ~380
So even under identical prompt/token conditions, v4 is 2× slower.
Correlating:
- Field in logs: "note": "using_safe_sampling_kernel=fallback_cpu_path"
- Batch tokens are smaller because slower decode → less batch throughput
Final diagnosis:
The new gpt-4.1-v4 release accidentally enabled a slow sampling kernel fallback, halving decode speed and causing output tokens/sec to drop, leading to underutilized GPUs and higher latency.
You got there largely through token telemetry:
- tokens/sec
- first_token_latency_ms
- batch_size / batch_tokens
- per-model / per-version breakdown
Fix – Mitigation & Verification
Mitigation steps
- Roll back traffic from v4 → v3 in eu-west (or disable the slow kernel flag).
- Watch metrics:
- llm_output_tokens_per_second{model="gpt-4.1", region="eu-west"} returns to ~120k
- llm_request_latency_p95 drops back to baseline
- llm_gpu_utilization returns to ~90%
- Confirm via traces:
- tokens_per_second back to ~750
- first_token_latency_ms back to ~180
- batch_tokens similar to pre-incident values
- Close the incident once SLOs are met and stable.
Postmortem & Improvements
From here, you’d use the same token telemetry for learning and prevention:
- Pre-deploy load tests must compare tokens/sec and latency per token for new model versions vs baseline.
- Add an automatic canary guardrail:
- If tokens/sec for the new version falls more than X% below the control version → auto rollback (a minimal sketch of this check follows this list).
- Add alerts:
- llm_tokens_per_second{model_version} deviation from baseline
- llm_batch_tokens dropping below threshold despite steady request volume
- Improve dashboards to show:
- tokens/sec per version
- tokens/sec per GPU
- correlation of tokens_per_second with gpu_utilization
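As a concrete illustration of the canary guardrail mentioned above, a minimal sketch that compares canary vs control tokens/sec and recommends a rollback past an assumed 20% drop threshold:

```python
def canary_guardrail(canary_tps: float, control_tps: float, max_drop: float = 0.20) -> str:
    """Return 'rollback' if the canary's tokens/sec falls more than max_drop below the
    control version's; the threshold is an assumption, not a fixed rule."""
    if control_tps <= 0:
        return "hold"   # no usable baseline -> don't auto-decide
    drop = (control_tps - canary_tps) / control_tps
    return "rollback" if drop > max_drop else "continue"

# Numbers from this incident: v4 decoded at ~390 tokens/sec vs ~750 for v3.
print(canary_guardrail(canary_tps=390, control_tps=750))  # -> rollback (a ~48% drop)
```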
Token-level metrics are what make this precise:
- You’re not just seeing “latency is bad”
- You’re seeing how the model’s relationship with tokens changed:
- Slower tokens/sec
- Smaller effective batch tokens
- Same token shapes per request