AI is not killing observability as a discipline, but it is fundamentally changing how observability is done. The traditional model of “collect everything, store everything, and let humans investigate later” is becoming increasingly impractical in AI-driven infrastructures.

1. Telemetry volume is exploding

Modern systems produce far more telemetry than they did five years ago.

An AI factory may contain:

Tens of thousands of GPUs
Hundreds of thousands of CPU cores
High-speed fabrics (RoCE, InfiniBand)
Kubernetes
Distributed storage (Ceph, Lustre, GPFS)
AI inference services
LLM gateways

Each component exports metrics, logs, traces and events.

For example, with illustrative figures:

2020

100 servers
↓
100 million metrics/day

2026

20,000 GPUs
30,000 CPUs
5,000 switches

↓

Several trillion data points/day

Humans cannot meaningfully explore that volume.

2. Dashboards don’t scale

Traditional observability assumes people sit looking at Grafana dashboards.

Reality:

nobody watches 400 dashboards
nobody remembers 2,000 PromQL queries
nobody notices slow drift

Instead people increasingly ask:

“Why did training become slower?”

AI investigates.

Not humans.

3. Alert fatigue becomes unmanageable

Large organisations can generate volumes such as:

50,000 alerts/day
100,000 log anomalies/day

Historically:

Prometheus
↓
Alertmanager
↓
PagerDuty
↓
Human

Future:

Prometheus
↓
AI correlation
↓
Root cause
↓
Human receives one explanation

Instead of:

127 alerts

The engineer receives:

GPU node gpu-128 experienced ECC errors causing NCCL retries which slowed training by 18%.
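
As a minimal sketch of the collapsing step, the Python below groups alerts that share a node label. The alert names and the `node` key are hypothetical, and real correlation would also use time windows and topology rather than a single label.

```python
from collections import defaultdict


def group_alerts(alerts, key="node"):
    """Group alert names by the value of a shared label (hypothetical `node` key)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.get(key, "unknown")].append(alert["name"])
    return dict(groups)


# Hypothetical alerts; a real feed would come from the alerting pipeline.
alerts = [
    {"name": "GPUEccErrors", "node": "gpu-128"},
    {"name": "NCCLRetriesHigh", "node": "gpu-128"},
    {"name": "TrainingStepSlow", "node": "gpu-128"},
    {"name": "DiskAlmostFull", "node": "storage-03"},
]

grouped = group_alerts(alerts)
for node, names in grouped.items():
    print(f"{node}: {len(names)} alerts -> {', '.join(names)}")
```

Grouping only reduces noise. Turning a group into an explanation still needs the causal reasoning described in the later sections.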

4. Humans query telemetry directly less often

Traditional workflow

Grafana
↓
Zoom into the time range
↓
PromQL
↓
Logs
↓
Tempo
↓
Find issue

Future

"Why are customer requests slower?"
↓
AI
↓
queries everything
↓
returns explanation

Natural language replaces much of manual exploration.

5. AI is becoming the first investigator

Large enterprises increasingly build systems like:

Telemetry
↓
LLM
↓
Reasoning
↓
Correlation
↓
Recommendation

Instead of asking engineers to join the dots.

6. Sampling changes everything

Historically:

Store every log.

Now:

AI decides

Keep
Discard
Summarise
Compress

Observability becomes intelligent instead of passive.
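
A rough sketch of the keep, discard and summarise decisions is shown below. It uses fixed rules as a stand-in for an AI policy, so the keywords, the `repeat_threshold` parameter and the sample log lines are all assumptions for illustration. Compression is left out.

```python
from collections import Counter


def triage_logs(lines, repeat_threshold=3):
    """Return (kept, summaries) using simple rules in place of a learned policy."""
    counts = Counter(lines)
    kept, summaries = [], []
    for line, n in counts.items():
        if "ERROR" in line:
            kept.extend([line] * n)              # keep: errors are stored in full
        elif n >= repeat_threshold:
            summaries.append(f"{line} (x{n})")   # summarise: collapse repeats
        elif "DEBUG" in line:
            continue                             # discard: low-value debug noise
        else:
            kept.extend([line] * n)              # keep: everything else
    return kept, summaries


# Hypothetical log lines for illustration.
lines = ["INFO health check ok"] * 5 + [
    "DEBUG cache probe",
    "ERROR timeout connecting to postgres",
    "INFO deploy finished",
]

kept, summaries = triage_logs(lines)
print("kept:", kept)
print("summaries:", summaries)
```

This sketch drops timestamps and ordering, which a real pipeline would need to preserve for later investigation.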

7. Root cause becomes graph reasoning

Today’s tools often correlate:

metric
+
trace
+
log

Future systems correlate:

topology
Kubernetes
network
storage
deployments
Git commits
feature flags
incidents
Slack discussions
runbooks

into one knowledge graph.

AI reasons across all of it.

8. AI reduces the need for query-language expertise

Today an SRE may spend years learning:

PromQL
LogQL
TraceQL
Elasticsearch
Loki
Tempo
Mimir
Kubernetes
networking

Future engineers may simply ask:

Why did latency increase?

The AI handles the underlying queries.

Expertise shifts from query syntax to validating conclusions and understanding system design.

9. Observability vendors are changing

Most major platforms are investing heavily in AI assistants.

Examples include:

Grafana Labs
Datadog
Dynatrace
New Relic
Elastic
Splunk

They’re moving from:

dashboards

to:

AI copilots
automatic investigations
causal analysis
anomaly explanations
remediation suggestions

10. Cost is becoming the biggest problem

Telemetry storage costs can grow faster than engineering budgets.

For many enterprises:

Telemetry generated

100 TB/day

Engineers never inspect 99% of it.

AI can:

summarise repetitive logs
identify low-value telemetry
adapt sampling dynamically
retain only diagnostically useful data

This can significantly reduce storage and processing costs while preserving investigative value.

What is actually dying?

It is not observability itself, but the manual workflows around it.

Declining:

Manual dashboard creation
Hand-written alert rules for every scenario
Endless log searching
Human correlation across tools
Memorising complex query languages

Growing:

AI-assisted investigations
Automated root-cause analysis
Predictive anomaly detection
Natural-language querying
Knowledge-graph reasoning
Automated remediation

What this means for SREs

For someone with your background in observability platforms, this shift is an opportunity rather than a threat. The value moves away from being the person who can write the most intricate PromQL query and toward designing the telemetry architecture, ensuring data quality, integrating AI agents with monitoring systems, and deciding when AI-generated conclusions can be trusted.

The emerging role is less “dashboard builder” and more Observability Platform Engineer or AI Operations (AIOps) Engineer—someone who builds systems that enable AI to investigate production issues safely and accurately. That combination of platform engineering, distributed systems, and AI integration is likely to be in increasing demand over the next decade.

What is AI Observability?

AI-era observability is moving from human-driven inspection to machine-assisted reasoning over telemetry, topology, history and operational knowledge.

The key shift is this:

Old observability:

Metrics + logs + traces
        ↓
Dashboards and alerts
        ↓
Human investigates
        ↓
Human decides
        ↓
Human fixes


AI-era observability:

Metrics + logs + traces + topology + deployments + runbooks + incidents
        ↓
AI correlation and reasoning layer
        ↓
Probable cause, blast radius, next action
        ↓
Human approval or automated remediation

Below is a detailed breakdown of the six growing areas listed above.

1. AI-assisted investigations

What it means

AI-assisted investigation is where an AI system acts like a junior SRE investigator sitting beside you.

It does not necessarily fix the issue automatically. Its main job is to reduce the time spent asking basic investigative questions.

Instead of you manually jumping between:

Grafana → Prometheus/Mimir → Loki → Tempo → Kubernetes → Git → Slack → Runbooks

you ask something like:

Why did checkout latency increase after 14:05?

The AI then queries multiple systems and returns a structured investigation.

What it does

A good AI investigation assistant can:

Detect the relevant service, namespace, cluster or tenant.
Pull related metrics.
Search logs around the incident window.
Inspect traces for slow spans.
Compare current behaviour against baseline behaviour.
Check recent deployments.
Check Kubernetes events.
Check node, pod, container and network health.
Retrieve relevant runbooks.
Summarise likely causes.
Recommend next diagnostic steps.

Example

You ask:

Why is the inference API slower?

The AI investigates:

1. Latency increased at 10:17.
2. p95 rose from 420 ms to 1.8 s.
3. Error rate did not increase.
4. GPU utilisation remained high.
5. Queue depth increased.
6. New model version was deployed at 10:12.
7. Logs show repeated batching timeout warnings.
8. Traces show delay before GPU execution, not during execution.

Result:

Likely issue:
The model service is queueing requests before GPU execution.

Probable cause:
The new batching configuration increased max_batch_wait_ms from 20 ms to 250 ms.

Recommended action:
Roll back the batching config or reduce the batch wait threshold.

That is much faster than manually checking ten dashboards.

What data it needs

AI-assisted investigation works best when it has access to:

Metrics:
- RED metrics: rate, errors, duration
- USE metrics: utilisation, saturation, errors
- Kubernetes pod/node metrics
- GPU metrics
- Network metrics
- Storage metrics

Logs:
- Application logs
- Kubernetes events
- System logs
- Ingress/controller logs
- Deployment logs

Traces:
- Request path
- Slow spans
- Upstream/downstream dependencies
- Database/storage/API calls

Context:
- Deployment history
- Git commits
- Feature flags
- Config changes
- Runbooks
- Incident history
- Service ownership

Without context, AI just summarises telemetry. With context, it can investigate.

SRE value

For SREs, this means less time doing mechanical investigation and more time validating the diagnosis.

The future SRE skill is not just:

Can I write PromQL?

It becomes:

Can I design telemetry so AI can reason correctly?
Can I validate the AI's conclusion?
Can I prevent unsafe remediation?
Can I encode good operational knowledge into the platform?

2. Automated root-cause analysis

What it means

Automated root-cause analysis, or automated RCA, is the process of identifying the most likely initiating cause of a production issue without relying entirely on manual human correlation.

It tries to answer:

What actually started the incident?

Not merely:

What symptoms are currently visible?

This distinction matters.

Symptom versus root cause

Example incident:

Customer latency is high.
API pods are slow.
Database queries are slow.
Storage latency is high.
Ceph OSDs are rebalancing.
One storage node has a failing disk.

The symptoms are:

High API latency
Slow database responses
Increased request duration
More timeout warnings

The probable root cause is:

A failing disk caused Ceph recovery/rebalancing,
which increased storage latency,
which slowed the database,
which slowed the API.

Automated RCA attempts to build that causal chain.

How automated RCA works

There are several techniques.

1. Temporal correlation

The system checks what changed first.

10:01 disk errors begin
10:03 Ceph recovery starts
10:05 storage latency rises
10:07 database latency rises
10:09 API latency rises
10:10 customer alerts fire

The earliest credible abnormal event is often close to the root cause.
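
A minimal sketch of temporal correlation, assuming the abnormal events have already been detected and timestamped: sort them and treat the earliest as the first candidate. The function name `build_timeline` is hypothetical.

```python
from datetime import datetime


def build_timeline(events):
    """Sort (HH:MM, description) events by time of day."""
    return sorted(events, key=lambda e: datetime.strptime(e[0], "%H:%M"))


# The events from the example above, deliberately out of order.
events = [
    ("10:09", "API latency rises"),
    ("10:01", "disk errors begin"),
    ("10:10", "customer alerts fire"),
    ("10:05", "storage latency rises"),
    ("10:03", "Ceph recovery starts"),
    ("10:07", "database latency rises"),
]

timeline = build_timeline(events)
candidate = timeline[0]
print("Earliest abnormal event:", candidate)
```

Ordering alone only nominates a candidate. It relies on synchronised clocks and should be combined with the topology and change checks that follow.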

2. Topology-aware analysis

The system understands dependencies.

frontend
   ↓
checkout-api
   ↓
postgres
   ↓
ceph/rbd
   ↓
osd-node-07

If osd-node-07 is unhealthy, and all dependent systems are degraded, the RCA engine can infer blast radius.
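
The blast-radius inference can be sketched as a traversal of the dependency map in reverse. The `DEPENDS_ON` dictionary below encodes the chain from the example; the structure and function name are assumptions for illustration, not any vendor's API.

```python
from collections import deque

# Each component maps to the components it depends on.
DEPENDS_ON = {
    "frontend": ["checkout-api"],
    "checkout-api": ["postgres"],
    "postgres": ["ceph/rbd"],
    "ceph/rbd": ["osd-node-07"],
    "osd-node-07": [],
}


def blast_radius(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    affected, queue, seen = [], deque([failed]), {failed}
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                affected.append(parent)
                queue.append(parent)
    return affected


affected = blast_radius("osd-node-07", DEPENDS_ON)
print(affected)
```

The result is only as good as the dependency map, which is why stale service maps are a common failure mode for this technique.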

3. Change correlation

The system checks recent changes:

Deployments
Config changes
Feature flags
Kernel updates
Node drains
Network changes
Storage migrations
Certificate rotations
DNS changes
Autoscaling events

Many incidents are change-induced. A useful RCA system always asks:

What changed recently?

4. Statistical anomaly ranking

The system ranks abnormal signals.

For example:

Signal                         Abnormality score
GPU ECC errors                 0.98
NCCL retry count               0.94
Training step duration         0.91
CPU usage                      0.22
Memory usage                   0.18

The AI focuses on the strongest abnormal signals.
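
One simple way to produce such a ranking is a z-score against a recent baseline, sketched below. This is an assumption for illustration: the scores in the table above are on a 0 to 1 scale, whereas z-scores are unbounded, and the signal names and sample values here are invented.

```python
from statistics import mean, stdev


def zscore(baseline, current):
    """Absolute distance of `current` from the baseline mean, in standard deviations."""
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0 if current == baseline[0] else float("inf")
    return abs(current - mean(baseline)) / sigma


def rank_signals(signals):
    """signals maps name -> (baseline samples, current value). Most abnormal first."""
    scored = {name: zscore(base, cur) for name, (base, cur) in signals.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical baselines and current readings.
signals = {
    "cpu_usage_percent": ([50, 55, 52, 48, 51], 54),
    "gpu_ecc_errors": ([0, 1, 0, 0, 1], 40),
    "nccl_retry_count": ([5, 6, 4, 5, 5], 30),
}

ranked = rank_signals(signals)
for name, score in ranked:
    print(f"{name:20s} {score:8.1f}")
```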

5. Causal graph reasoning

This is more advanced.

Instead of treating metrics as isolated time series, the system builds a causal model:

Bad disk
  → Ceph recovery
  → Storage latency
  → Database latency
  → API latency
  → Customer impact

This is much closer to how an experienced SRE thinks.

Example automated RCA output

Incident:
Checkout latency p95 increased from 300 ms to 2.4 s.

Likely root cause:
PostgreSQL read latency increased due to degraded Ceph RBD volume performance.

Evidence:
- API latency increased at 13:42.
- PostgreSQL read latency increased at 13:39.
- Ceph pool latency increased at 13:36.
- OSD 12 reported slow ops and disk errors at 13:34.
- No relevant application deployment occurred in the previous hour.

Blast radius:
- checkout-api
- payment-api
- order-history-api

Recommended action:
- Mark OSD 12 out if disk errors continue.
- Move affected workload if possible.
- Check Ceph recovery/backfill limits.
- Consider temporarily raising API timeout thresholds.

What makes automated RCA hard

Automated RCA is difficult because distributed systems are messy.

Common problems:

Correlation is not causation.
Multiple things can break at once.
Telemetry may be missing.
Logs may be noisy.
Clocks may not be perfectly synchronised.
Service dependency maps may be stale.
The root cause may be outside the monitored system.

This is why good automated RCA usually gives:

Probable cause
Confidence level
Supporting evidence
Contradicting evidence
Recommended next checks

It should not pretend to be certain when it is not.

3. Predictive anomaly detection

What it means

Predictive anomaly detection tries to detect abnormal behaviour before it becomes a major incident.

Traditional alerting says:

Alert when disk usage > 90%.

Predictive alerting says:

Disk usage is growing at a rate that will hit 90% in 11 hours.

That is a major shift.

Traditional threshold alerting

Example:

Alert: DiskAlmostFull
Condition: disk_used_percent > 90

This is simple and useful, but it misses context.

A disk at 85% may be fine if it grows slowly.

A disk at 60% may be dangerous if it is growing rapidly.
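
The difference between those two disks can be made concrete with a linear trend. The sketch below fits a least-squares line to recent samples and estimates the time until the 90% threshold. Linear growth is an assumption, and the sample values are invented.

```python
def hours_until_threshold(samples, threshold=90.0):
    """samples: list of (hour, used_percent).

    Returns hours from the last sample until the fitted line reaches
    `threshold`, or None if usage is not growing or cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / den
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - samples[-1][0])


# Hypothetical hourly samples: one disk at 85% growing slowly, one at 60% growing fast.
slow = hours_until_threshold([(0, 85.0), (1, 85.1), (2, 85.2), (3, 85.3)])
fast = hours_until_threshold([(0, 60.0), (1, 63.0), (2, 66.0), (3, 69.0)])
print(f"85% disk: about {slow:.0f} hours to threshold")
print(f"60% disk: about {fast:.0f} hours to threshold")
```

In this example the disk at 60% reaches the threshold well before the disk at 85%, which a static threshold alert would not reveal.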

Predictive anomaly detection

Predictive systems look at behaviour over time:

Normal pattern:
- CPU rises during business hours
- drops overnight
- spikes during batch processing

Abnormal pattern:
- CPU rises at midnight
- no scheduled job exists
- memory grows continuously
- request rate is normal

The system detects that the pattern is unusual, even if no hard threshold has been crossed.

Types of predictive anomalies

1. Trend-based prediction

Useful for capacity planning.

Disk usage will reach 90% in 3 days.
Mimir object storage will exceed budget in 12 days.
Kafka partition disk will fill in 9 hours.
Ceph pool will hit near-full ratio this weekend.

2. Seasonality-aware anomaly detection

Useful for normal daily/weekly cycles.

Example:

CPU at 80% at 10:00 Monday may be normal.
CPU at 80% at 03:00 Sunday may be abnormal.

The system learns expected patterns.
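
A minimal sketch of learning expected patterns is to keep a separate baseline for each weekday and hour. The history values, the `tolerance` of three standard deviations and the one-point floor on the deviation are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """history: iterable of (weekday, hour, cpu_percent). One baseline per time slot."""
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    # stdev needs at least two samples, so sparse slots are skipped.
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) >= 2}


def is_abnormal(baseline, weekday, hour, value, tolerance=3.0):
    """True if `value` is far from what is normal for that weekday and hour."""
    if (weekday, hour) not in baseline:
        return False  # no history for this slot, so do not guess
    mu, sigma = baseline[(weekday, hour)]
    return abs(value - mu) > tolerance * max(sigma, 1.0)


# Hypothetical history: busy Monday mornings, quiet Sunday nights.
history = [("Mon", 10, v) for v in (78, 82, 80, 79)]
history += [("Sun", 3, v) for v in (12, 15, 11, 14)]
baseline = build_baseline(history)

monday = is_abnormal(baseline, "Mon", 10, 80)
sunday = is_abnormal(baseline, "Sun", 3, 80)
print("80% CPU at 10:00 Monday abnormal?", monday)
print("80% CPU at 03:00 Sunday abnormal?", sunday)
```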

3. Multivariate anomaly detection

Looks at several signals together.

For example:

Request rate: normal
Error rate: normal
Latency: high
CPU: normal
Database latency: high
Network retransmits: high

Individually, some metrics may not trigger alerts. Together, they reveal an abnormal condition.

4. Behavioural drift detection

Useful in AI and ML platforms.

Example:

Training jobs are completing successfully,
but average step time has increased by 12% over two weeks.

No incident has occurred yet, but performance is drifting.

5. Saturation prediction

Very useful for SRE.

GPU memory saturation likely within 40 minutes.
Kubernetes node memory pressure likely in 2 hours.
Ceph recovery will saturate backend network.
Kafka consumer lag will exceed SLO in 25 minutes.

Example

A predictive anomaly detector observes:

Mimir ingest rate: stable
Object storage write latency: slowly increasing
Compactor duration: increasing
Query latency: increasing
Store-gateway cache hit rate: decreasing

It predicts:

Within 6 hours, users will experience slow dashboard loads.

The remediation might be:

Scale store-gateways.
Check object storage latency.
Increase cache.
Review compactor backlog.

Why it matters

Predictive anomaly detection changes operations from:

React after customer impact

to:

Intervene before customer impact

That is the core SRE value.

4. Natural-language querying

What it means

Natural-language querying allows engineers to ask operational questions in plain English instead of writing PromQL, LogQL, TraceQL, SQL or Elasticsearch queries manually.

Example:

Show me p95 latency for checkout-api over the last 6 hours,
split by Kubernetes namespace.

The AI converts that into the right query.

Traditional workflow

You need to know the query language:

histogram_quantile(
  0.95,
  sum by (le, namespace) (
    rate(http_request_duration_seconds_bucket{
      service="checkout-api"
    }[5m])
  )
)

With natural-language querying:

What is checkout-api p95 latency by namespace for the last 6 hours?

The AI generates and executes the query.

Where this is useful

Natural-language querying is useful across:

Metrics:
- Prometheus
- Mimir
- Thanos
- VictoriaMetrics

Logs:
- Loki
- Elasticsearch/OpenSearch
- ClickHouse

Traces:
- Tempo
- Jaeger
- OpenTelemetry backends

Databases:
- PostgreSQL
- BigQuery
- Snowflake
- ClickHouse

Cloud APIs:
- Kubernetes
- AWS
- Azure
- GCP
- OpenStack

Example questions

Which services had the largest increase in error rate in the last hour?

Show me pods that restarted after the latest deployment.

Find logs for payment-api where timeout errors increased.

Which traces spent the most time waiting on PostgreSQL?

Which Kubernetes nodes have high network retransmits?

Show me Ceph OSDs with rising latency and degraded placement groups.

Which GPU nodes show ECC errors or thermal throttling?

The important part: semantic mapping

Natural-language querying is not just text-to-query.

It needs to understand your telemetry naming.

For example, you may ask:

Show API latency.

But your metrics may be called:

http_request_duration_seconds_bucket
nginx_ingress_controller_request_duration_seconds_bucket
istio_request_duration_milliseconds_bucket
app_http_server_duration_bucket

The AI needs a semantic layer that maps human concepts to real telemetry.

Good natural-language querying needs

Metric catalogue
Label documentation
Service ownership map
Namespace conventions
Dashboard metadata
Runbook links
Known-good query examples
SLO definitions
Deployment metadata

Without that, the AI may generate syntactically valid but operationally useless queries.

Risk: hallucinated queries

Natural-language querying can be dangerous if it invents metric names.

Bad output:

rate(checkout_latency_seconds[5m])

But that metric may not exist.

Better behaviour:

I could not find a metric named checkout_latency_seconds.
I found http_request_duration_seconds_bucket with service="checkout-api".
Using that instead.

The AI should verify queries against the actual telemetry backend.
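
A minimal sketch of that verification step is shown below. It extracts metric names that appear directly before a `{` or `[` in a generated PromQL string and checks them against a catalogue. The regular expression is a deliberate simplification rather than a PromQL parser: it misses bare metric names and can be confused by brackets inside label values. The catalogue here is hard-coded; in practice it could be loaded from the backend, for example from the Prometheus HTTP API endpoint that lists values of the `__name__` label.

```python
import re

# Identifier immediately followed by a label selector or a range selector.
SELECTOR = re.compile(r"([a-zA-Z_:][a-zA-Z0-9_:]*)\s*[\[{]")


def unknown_metrics(query, known_metrics):
    """Return metric names used in `query` that are not in the catalogue."""
    return sorted({name for name in SELECTOR.findall(query) if name not in known_metrics})


# Hypothetical catalogue of metric names that exist in the backend.
catalogue = {"http_request_duration_seconds_bucket"}

bad = unknown_metrics("rate(checkout_latency_seconds[5m])", catalogue)
good = unknown_metrics(
    'rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])',
    catalogue,
)
print("unknown in first query:", bad)
print("unknown in second query:", good)
```

A non-empty result is a signal to stop and look for a real metric instead of executing the query.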

SRE impact

SREs will still need to understand PromQL, LogQL and traces, but less time will be spent manually composing queries.

The valuable skill becomes designing the semantic layer:

Good metric names
Useful labels
Consistent service metadata
Accurate ownership data
Clear runbooks
Well-documented SLOs

5. Knowledge-graph reasoning

What it means

Knowledge-graph reasoning connects operational facts into a graph so AI can reason over relationships.

Traditional observability stores data like this:

Metric:
checkout-api p95 latency = 2.1s

Log:
timeout connecting to postgres

Trace:
checkout-api → postgres took 1.8s

Kubernetes:
postgres pod moved to node-12

Infrastructure:
node-12 has disk pressure

A knowledge graph connects those facts:

checkout-api
   depends_on → postgres
   runs_in → namespace prod
   owned_by → payments-team

postgres
   runs_on → node-12
   uses → ceph-rbd-volume-44

node-12
   has_condition → disk_pressure

ceph-rbd-volume-44
   backed_by → ceph-pool-prod

Now the AI can reason across relationships.
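
The graph above can be sketched as a list of subject, relation, object triples, with a traversal that follows dependency-style edges and reports any unhealthy condition it reaches. The choice of which relations to follow, and the function name, are assumptions for illustration.

```python
# The facts from the example above, as (subject, relation, object) triples.
TRIPLES = [
    ("checkout-api", "depends_on", "postgres"),
    ("checkout-api", "runs_in", "namespace prod"),
    ("checkout-api", "owned_by", "payments-team"),
    ("postgres", "runs_on", "node-12"),
    ("postgres", "uses", "ceph-rbd-volume-44"),
    ("node-12", "has_condition", "disk_pressure"),
    ("ceph-rbd-volume-44", "backed_by", "ceph-pool-prod"),
]


def find_conditions(start, triples,
                    follow=("depends_on", "runs_on", "uses", "backed_by")):
    """Walk outgoing edges from `start` and collect (path, condition) pairs."""
    findings, stack, seen = [], [(start, [start])], {start}
    while stack:
        node, path = stack.pop()
        for subj, rel, obj in triples:
            if subj != node:
                continue
            if rel == "has_condition":
                findings.append((" -> ".join(path), obj))
            elif rel in follow and obj not in seen:
                seen.add(obj)
                stack.append((obj, path + [obj]))
    return findings


findings = find_conditions("checkout-api", TRIPLES)
for path, condition in findings:
    print(f"{path}: {condition}")
```

Starting from the slow service, the traversal reaches the disk pressure on node-12 without anyone having to know in advance where postgres is scheduled.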

Why graphs matter

Most incidents are not isolated.

They involve chains:

Application → runtime → Kubernetes → node → network → storage → hardware

Dashboards show symptoms. Graphs show relationships.

Example graph

Customer impact
   ↑
checkout-api latency
   ↑
postgres query latency
   ↑
RBD volume latency
   ↑
Ceph OSD slow ops
   ↑
failing NVMe device

A graph-based system can move up and down this chain.

What goes into the graph

A strong observability knowledge graph includes:

Services
APIs
Databases
Queues
Kubernetes namespaces
Pods
Nodes
Clusters
Storage volumes
Ceph pools
Network devices
Load balancers
Ingress controllers
Deployments
Git commits
Feature flags
SLOs
Alerts
Incidents
Runbooks
Owners
Escalation paths
Cloud resources
OpenStack projects
GPU nodes
Training jobs

How the graph is built

Data sources may include:

Kubernetes API
Prometheus/Mimir labels
OpenTelemetry resource attributes
Service mesh telemetry
CMDB
Terraform state
GitOps repositories
CI/CD systems
Incident management tools
Cloud APIs
OpenStack APIs
Ceph APIs
Network controllers
Runbooks and docs

The graph is continuously updated.

Example reasoning

Question:

Why are training jobs slower on rack 3?

The graph helps the AI discover:

Training-job-982
   runs_on → gpu-node-31, gpu-node-32, gpu-node-33
   located_in → rack-3
   uses_network → leaf-switch-3a
   uses_storage → lustre-client
   depends_on → metadata-server-2

Telemetry shows:

leaf-switch-3a has rising packet drops
NCCL retries increased
GPU utilisation shows a sawtooth pattern
training step time increased

AI conclusion:

The training slowdown is likely caused by network instability on rack 3,
not by GPU compute saturation.

That is knowledge-graph reasoning.

Why this matters for AI data centres

AI/HPC environments are dependency-heavy.

A single training workload may depend on:

GPU health
GPU memory
NVLink/NVSwitch
PCIe
RoCE/InfiniBand
Leaf-spine network
Storage bandwidth
Metadata servers
Container runtime
Kubernetes scheduler
Slurm scheduler
Image registry
Secrets
DNS
Authentication
Object storage

A flat dashboard cannot represent that well. A graph can.

6. Automated remediation

What it means

Automated remediation is when the system not only detects and diagnoses an issue, but also takes corrective action.

This is the most powerful and most dangerous part of AI-era observability.

It moves from:

Observe → Alert → Human fixes

to:

Observe → Diagnose → Decide → Act → Verify

Simple automated remediation

Low-risk examples:

Restart a failed pod.
Scale a deployment from 3 to 5 replicas.
Clear a stuck job.
Rotate a saturated log file.
Drain a bad Kubernetes node (outside production, or with approval).
Open an incident ticket.
Create a Slack/PagerDuty summary.
Roll back a known-bad deployment.
Increase queue consumers.

Advanced automated remediation

Higher-risk examples:

Move workloads away from degraded storage.
Change Ceph recovery/backfill settings.
Disable a feature flag.
Rebalance Kafka partitions.
Quarantine a GPU node.
Remove a bad node from a load balancer.
Apply a network policy change.
Trigger disaster recovery failover.
Patch a vulnerable service.

These require stronger guardrails.

The remediation loop

A safe remediation system should work like this:

1. Detect
   Something abnormal happened.

2. Diagnose
   Determine probable cause and confidence.

3. Propose
   Generate a remediation plan.

4. Check policy
   Is this action allowed?
   Is the blast radius acceptable?
   Is approval required?

5. Act
   Execute the change.

6. Verify
   Did the metric improve?
   Did errors reduce?
   Did customer impact stop?

7. Roll back
   If not improved, revert or escalate.

8. Learn
   Record the incident and outcome.
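
The loop above can be sketched as a single function that takes each step as a caller-supplied callable. This is an illustrative skeleton under stated assumptions, not a production controller: detection is assumed to have already fired, and every callable, the confidence cut-off of 0.8 and the example values are hypothetical.

```python
def remediate(diagnose, propose, policy_allows, act, verify, roll_back, record):
    """Run one pass of the remediation loop and return the outcome."""
    cause, confidence = diagnose()                 # 2. Diagnose
    plan = propose(cause)                          # 3. Propose
    if not policy_allows(plan, confidence):        # 4. Check policy
        record(plan, "escalated: policy check failed")
        return "escalated"
    act(plan)                                      # 5. Act
    if verify():                                   # 6. Verify
        record(plan, "resolved")                   # 8. Learn
        return "resolved"
    roll_back(plan)                                # 7. Roll back
    record(plan, "rolled back")
    return "rolled_back"


# Hypothetical wiring: each lambda stands in for a real integration.
log = []
outcome = remediate(
    diagnose=lambda: ("bad deployment v2.7.4", 0.9),
    propose=lambda cause: "roll back checkout-api to v2.7.3",
    policy_allows=lambda plan, confidence: confidence >= 0.8,
    act=lambda plan: log.append(f"executed: {plan}"),
    verify=lambda: True,
    roll_back=lambda plan: log.append(f"reverted: {plan}"),
    record=lambda plan, result: log.append(f"{result}: {plan}"),
)
print(outcome)
print(log)
```

Keeping the policy check as a separate, injected step means it can be owned and audited independently of the AI component that proposes the plan.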

Example

Issue:

checkout-api error rate increased after deployment.

AI investigation:

New version deployed at 09:03.
Errors began at 09:05.
Only pods running version v2.7.4 are affected.
Previous version v2.7.3 had no errors.

Remediation proposal:

Roll back checkout-api from v2.7.4 to v2.7.3.

Policy check:

Allowed because:
- service has rollback automation
- error rate exceeds SLO threshold
- last known-good version exists
- no database migration detected

Action:

kubectl rollout undo deployment/checkout-api

Verification:

Error rate returned to baseline after 4 minutes.
p95 latency returned to normal.
Incident summary created.

Guardrails are essential

Automated remediation must not be a reckless agent with production write access.

Good guardrails include:

Read-only by default
Approval required for high-risk actions
Change windows
Blast-radius limits
Dry-run mode
Policy-as-code
RBAC
Audit logs
Rollback plans
Rate limits
Canary execution
Human confirmation for destructive actions

For example:

Allowed automatically:
- restart one unhealthy pod
- scale a stateless service within limits
- create an incident ticket

Requires approval:
- drain production node
- rollback payment service
- modify firewall/network policy
- change Ceph recovery settings
- fail over database
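
The two lists above map naturally onto a small policy check with a default of deny. The action identifiers below are hypothetical names for the listed actions, and a real system would more likely express this in a policy-as-code tool than in application code.

```python
# Hypothetical identifiers for the actions listed above.
AUTO_ALLOWED = {
    "restart_pod",
    "scale_stateless_service",
    "create_incident_ticket",
}
NEEDS_APPROVAL = {
    "drain_production_node",
    "rollback_payment_service",
    "modify_network_policy",
    "change_ceph_recovery_settings",
    "failover_database",
}


def decide(action, approved_by=None):
    """Return 'execute', 'wait_for_approval' or 'deny' for a proposed action."""
    if action in AUTO_ALLOWED:
        return "execute"
    if action in NEEDS_APPROVAL:
        return "execute" if approved_by else "wait_for_approval"
    return "deny"  # anything not explicitly listed is refused


print(decide("restart_pod"))
print(decide("failover_database"))
print(decide("failover_database", approved_by="on-call SRE"))
print(decide("delete_namespace"))
```

Refusing unlisted actions by default keeps the agent from improvising a remediation that nobody has reviewed.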

How these six areas fit together

They are not separate ideas. They form a pipeline.

Natural-language querying
        ↓
Lets humans ask better questions

AI-assisted investigations
        ↓
Gathers evidence automatically

Knowledge-graph reasoning
        ↓
Understands relationships and dependencies

Automated root-cause analysis
        ↓
Identifies probable initiating cause

Predictive anomaly detection
        ↓
Finds issues before they become incidents

Automated remediation
        ↓
Fixes or mitigates the issue

A mature AI observability platform combines all six.

Practical architecture for an AI observability platform

A realistic architecture could look like this:

Telemetry sources
  ├─ Prometheus / Mimir metrics
  ├─ Loki logs
  ├─ Tempo traces
  ├─ Kubernetes events
  ├─ Ceph / storage metrics
  ├─ GPU metrics
  ├─ Network telemetry
  ├─ CI/CD events
  ├─ Git commits
  └─ Incident history

        ↓

Data normalization layer
  ├─ OpenTelemetry attributes
  ├─ Service naming standards
  ├─ Environment labels
  ├─ Owner/team labels
  └─ SLO metadata

        ↓

Context layer
  ├─ Runbooks
  ├─ Architecture docs
  ├─ Past incidents
  ├─ Known failure modes
  ├─ Deployment history
  └─ Dependency maps

        ↓

AI reasoning layer
  ├─ LLM
  ├─ RAG over runbooks/docs
  ├─ Query generation
  ├─ Anomaly detection
  ├─ Causal graph reasoning
  └─ RCA ranking

        ↓

Action layer
  ├─ Human-readable incident summary
  ├─ Suggested next steps
  ├─ Ticket creation
  ├─ Slack/PagerDuty update
  ├─ Safe automation
  └─ Approved remediation

What you would build first as an SRE

I would not start with fully automated remediation. That is too risky.

The sensible maturity path is:

Stage 1: AI-assisted read-only investigation

Build a tool that can answer:

What changed?
What alerts fired?
What services are affected?
What logs are unusual?
What traces are slow?
What runbook applies?

No write actions.

Stage 2: Natural-language query assistant

Allow engineers to ask:

Show me p95 latency by service.
Find logs for this incident window.
Show me failed pods after the deployment.
Compare today’s error rate with yesterday.

The assistant should show the generated query so the engineer can verify it.

Stage 3: Incident summariser

Generate structured summaries:

Incident:
Impact:
Start time:
Affected services:
Probable cause:
Evidence:
Actions taken:
Current status:
Recommended next steps:

This alone can save significant operational time.

Stage 4: RCA recommendation engine

Add correlation with:

Deployments
Kubernetes events
Node health
Storage health
Network telemetry
Recent config changes

Output probable root cause with confidence.

Stage 5: Predictive alerting

Start with safer predictions:

Disk will fill.
Object storage usage will exceed budget.
Kafka lag will breach SLO.
Ceph pool will hit near-full.
Certificate will expire.
GPU nodes are showing increasing ECC errors.

Stage 6: Human-approved remediation

The AI proposes actions, but humans approve.

Example:

Recommended action:
Drain node gpu-17 and reschedule workloads.

Reason:
GPU ECC errors increased and training retries are affecting jobs.

Approval required:
Yes.

Stage 7: Limited automatic remediation

Only allow automation for narrow, reversible, low-risk actions.

Restart crashed pod
Scale stateless deployment
Restart failed consumer
Create incident ticket
Disable noisy alert temporarily with expiry

Main risks

AI observability can go wrong if the system has poor telemetry or too much authority.

1. Bad telemetry in, bad reasoning out

If labels are inconsistent, traces are incomplete, or logs are unstructured, AI conclusions will be weak.

2. Hallucinated root cause

The AI may sound confident while being wrong.

Always require:

Evidence
Confidence
Alternative theories
Query links
Raw data references

3. Unsafe remediation

A bad automated action can make an incident worse.

Example:

AI sees high memory.
AI restarts all pods.
All pods restart at once.
Outage gets worse.

That is why blast-radius control matters.
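
One form of blast-radius control is to cap how many pods an automated action may touch at once. The sketch below splits a restart into batches; the 25% default and the pod names are assumptions, and the idea is similar in spirit to the `maxUnavailable` setting of a Kubernetes rolling update.

```python
import math


def restart_batches(pods, max_unavailable_fraction=0.25):
    """Split pods into batches so at most the given fraction restarts at once."""
    if not 0 < max_unavailable_fraction <= 1:
        raise ValueError("max_unavailable_fraction must be in (0, 1]")
    # Always allow at least one pod per batch so progress is possible.
    batch_size = max(1, math.floor(len(pods) * max_unavailable_fraction))
    return [pods[i:i + batch_size] for i in range(0, len(pods), batch_size)]


# Hypothetical pod names.
pods = [f"checkout-api-{i}" for i in range(8)]
batches = restart_batches(pods)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Between batches, a real system would verify health before continuing, so a bad action stops after the first batch instead of reaching every pod.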

4. Hidden cost explosion

AI investigation can generate expensive backend queries.

A poorly controlled AI assistant may run huge queries across logs, traces and metrics.

You need:

Query limits
Timeouts
Caching
Sampling
Tenant controls
Cost visibility

5. Security and access control

The AI should not see or do everything.

It needs RBAC:

Read-only access for most users
Sensitive log masking
No secret exposure
Audit trail
Approval for write actions
Tenant isolation

The big picture

These six capabilities are the future of observability:

Capability                     Main purpose                       Human role
AI-assisted investigations     Speed up incident analysis         Validate findings
Automated RCA                  Identify probable cause            Judge evidence
Predictive anomaly detection   Prevent incidents earlier          Tune models and thresholds
Natural-language querying      Make telemetry easier to access    Verify generated queries
Knowledge-graph reasoning      Understand system relationships    Maintain accurate topology
Automated remediation          Fix or mitigate issues             Define guardrails and approve risk

The core change is this:

Observability is no longer just about collecting telemetry.

It is becoming a reasoning system over telemetry.

For SREs, the opportunity is to become the person who builds and governs that reasoning system: telemetry quality, context, automation safety, incident workflows, and trust boundaries.

Commercial AI Observability

Commercial companies are building AI into observability in two directions:

AI for observability — using AI to investigate, correlate, explain, predict and remediate production issues.
Observability for AI — monitoring LLMs, agents, RAG pipelines, vector databases, model quality, hallucinations, token cost, latency, drift and safety.

So the product shift is not just “add a chatbot to dashboards.” The bigger move is toward an AI operations layer that sits above metrics, logs, traces, events, topology and runbooks.

Telemetry + topology + deployments + logs + traces + incidents + runbooks
                                ↓
                       AI reasoning layer
                                ↓
     Explain issue → find cause → predict risk → recommend/execute action

1. Datadog

Datadog is building AI into its platform around Bits AI, Watchdog, and LLM/Agent Observability.

Datadog’s Watchdog is its AI engine for automated alerts, insights and root-cause analysis across Datadog telemetry. It continuously monitors infrastructure and surfaces important signals to help teams detect, troubleshoot and resolve issues.

Datadog’s Bits AI SRE is positioned as an always-on AI SRE agent that helps handle troubleshooting and alerts, with Datadog describing it as able to pinpoint root causes faster by using Datadog’s incident and telemetry context.

Datadog is also pushing Bits AI Agents and Agent Builder, where teams can build custom AI agents that investigate issues, make decisions and take action using Datadog and third-party data, with prebuilt actions across cloud, security, CI/CD and collaboration tooling.

For the second direction, Datadog has Agent Observability / LLM Observability, aimed at tracing, evaluating and improving LLM-powered applications and AI agents. Datadog says each LLM application request can be represented as a trace, allowing teams to investigate root cause, operational performance, quality, privacy and safety.

In plain SRE terms, Datadog is building:

Datadog AI direction:

Watchdog
  → automatic anomaly detection
  → automated insights
  → RCA suggestions

Bits AI
  → natural-language investigation
  → AI SRE assistant
  → incident summarisation
  → workflow automation

Bits AI Agents
  → custom agentic workflows
  → investigation agents
  → remediation/documentation agents

LLM / Agent Observability
  → traces for LLM calls
  → prompt/response monitoring
  → quality, privacy, safety checks
  → AI-agent debugging

Datadog is also doing deeper model work: its Toto time-series foundation model is specifically designed for observability time-series forecasting and was trained partly on Datadog observability data.

2. Dynatrace

Dynatrace has probably been the most explicit about putting causal AI at the centre of observability.

Its AI engine is Davis AI / Dynatrace Intelligence. Dynatrace describes its AI approach as combining predictive AI, causal AI and generative AI over unified observability and security data to automate workflows.

Dynatrace’s key differentiator is that it does not want the AI to merely correlate metrics. It wants the platform to understand causality: what caused what, what depends on what, and what failure actually triggered the incident. Dynatrace describes causal AI as using causal and deterministic techniques to determine underlying causes and effects rather than just relying on correlation.

Dynatrace also presents Dynatrace Intelligence as combining deterministic insights with agentic action for prevention, remediation and optimisation at scale.

For AI workloads, Dynatrace has AI and LLM Observability for monitoring, optimising and securing generative AI apps, LLMs and agentic workflows, with emphasis on performance, explainability and compliance.

In SRE terms, Dynatrace is building:

Dynatrace AI direction:

Davis AI / Dynatrace Intelligence
  → anomaly detection
  → causal root-cause analysis
  → topology-aware problem detection
  → predictive risk detection
  → generative explanations
  → workflow automation

Causal AI
  → dependency-aware analysis
  → fault-tree-style reasoning
  → root cause, not just symptom correlation

AI and LLM Observability
  → GenAI app monitoring
  → LLM and agentic workflow visibility
  → explainability
  → compliance-oriented monitoring

The important point: Dynatrace is trying to make observability less like “search through telemetry” and more like automated dependency-aware diagnosis.
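
To make "dependency-aware diagnosis" concrete, here is a minimal sketch in Python. It is not Dynatrace's algorithm; it only illustrates the idea that an unhealthy service whose own dependencies are all healthy is a better root-cause candidate than one that sits on top of another failing service. The topology, the service names and the function `likely_root_causes` are hypothetical, and only direct dependencies are checked.

```python
from typing import Dict, List, Set


def likely_root_causes(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> List[str]:
    """Return unhealthy services that have no unhealthy direct dependency."""
    roots = []
    for service in sorted(unhealthy):
        deps = depends_on.get(service, [])
        # If something this service depends on is also failing, treat this
        # service as a symptom rather than a cause.
        if not any(dep in unhealthy for dep in deps):
            roots.append(service)
    return roots


# Hypothetical topology: checkout -> payments -> postgres, checkout -> cart
topology = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": [],
    "postgres": [],
}
alerting = {"checkout", "payments", "postgres"}

print(likely_root_causes(topology, alerting))  # ['postgres']
```

Three services are alerting, but only one is reported, which is the difference between symptom correlation and topology-aware analysis. A real system would also need accurate, current topology data, which is why ownership and dependency maps matter so much.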

3. Splunk

Splunk is building AI into observability through Splunk AI Assistant in Observability Cloud, broader AI Observability, and AI/agent monitoring.

Splunk’s AI Assistant in Observability Cloud is a chat interface inside Splunk Observability Cloud that draws on metrics, traces, logs and alerts.

Splunk says the AI Assistant can analyse data across APM, Infrastructure Monitoring, Database Monitoring, RUM and log analytics to help with root-cause analysis.

Splunk is also building “observability for AI” capabilities. Its Splunk Observability for AI is described as full-fidelity monitoring and troubleshooting across AI applications and the AI infrastructure components used to build them.

Splunk’s AI Agent Monitoring aims to detect and correlate degraded AI agent or model performance. It tracks operational metrics such as latency and errors alongside quality, security and spend signals such as hallucinations, bias, drift, accuracy, cost and token usage.

Splunk’s AI Observability positioning is broader: observe and optimise performance, quality, cost and security across agents, LLMs, vector databases and infrastructure.

In SRE terms, Splunk is building:

Splunk AI direction:

AI Assistant in Observability Cloud
  → natural-language investigations
  → logs + metrics + traces + alerts analysis
  → RCA assistance
  → incident summarisation

AI Observability
  → AI application monitoring
  → AI infrastructure monitoring
  → agent performance tracking
  → LLM quality and safety monitoring

AI Agent Monitoring
  → latency and errors
  → hallucination tracking
  → bias/drift/accuracy
  → token and cost visibility
  → model and agent reliability

Splunk’s direction is closely aligned with its historical strength: search, correlation and operational analytics, now wrapped in AI-assisted investigation and AI workload monitoring.

4. New Relic

New Relic is building AI into its platform through New Relic AI, AI-powered observability features, and AI Monitoring / LLM observability.

New Relic says New Relic AI can help instrument systems, generate system health reports and identify alert coverage gaps for full-stack observability.

New Relic has also positioned its platform as AI-powered observability that correlates telemetry across the stack to isolate root cause and reduce operational toil.

For LLM applications, New Relic AI monitoring captures telemetry from AI-powered apps through APM agents and collects data from external LLMs and vector stores.

New Relic’s AI monitoring focuses on troubleshooting, comparing and optimising LLM prompts and responses for performance, cost and quality issues such as hallucination, bias and toxicity.

It also supports LLM observability through OpenLIT integration, which automatically generates traces and metrics for LLM and VectorDB performance and cost analysis.

In SRE terms, New Relic is building:

New Relic AI direction:

New Relic AI
  → AI assistant for DevOps
  → system health reports
  → alert coverage analysis
  → instrumentation help

AI-powered observability
  → telemetry correlation
  → root-cause isolation
  → faster troubleshooting

AI Monitoring / LLM Observability
  → prompt/response analysis
  → LLM latency and error tracking
  → cost analysis
  → hallucination, bias and toxicity signals
  → VectorDB visibility

New Relic’s direction is about making its “all-in-one observability” platform more assistant-driven and making AI workloads first-class observable systems.

What they are all converging on

All four vendors are converging on the same broad architecture:

1. Collect telemetry
   metrics, logs, traces, events, profiles, topology

2. Normalise context
   services, owners, deployments, dependencies, SLOs, runbooks

3. Apply AI
   anomaly detection, query generation, summarisation, RCA, prediction

4. Explain
   what happened, why it happened, what changed, what is affected

5. Act
   create ticket, page team, suggest fix, trigger workflow, remediate safely

6. Observe AI itself
   LLM calls, prompts, responses, token cost, model quality, hallucinations,
   safety, drift, vector DBs, RAG pipelines, agent workflows

The big product categories are:

AI capability	What vendors are building
AI assistant	Chat interface over observability data
AI SRE agent	Investigates incidents and proposes actions
Automated RCA	Finds likely root cause using telemetry and topology
Predictive anomaly detection	Spots problems before thresholds are breached
Natural-language querying	Converts plain English into PromQL, LogQL, SQL, trace/log queries
Incident summarisation	Explains impact, timeline, evidence and next steps
Runbook automation	Recommends or triggers operational workflows
AI workload monitoring	Monitors LLMs, agents, prompts, responses, cost and quality
Governance/safety	Tracks hallucination, toxicity, bias, privacy and compliance risks
Cost optimisation	Reduces telemetry waste and tracks LLM/token spend
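
The "predictive anomaly detection" row can be illustrated with a deliberately simple stand-in: fit a straight line to recent samples and estimate when it will cross a threshold. Vendors use far richer models; this sketch only shows what "before thresholds are breached" means in practice. The function name `steps_until_breach` and the sample data are assumptions, and the sketch handles upward breaches only.

```python
from typing import List, Optional


def steps_until_breach(samples: List[float], threshold: float) -> Optional[float]:
    """Fit a least-squares line to evenly spaced samples and estimate how many
    steps after the last sample the line reaches the threshold.

    Returns None if there are too few samples or the trend is not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    # Steps are counted from the most recent sample (index n - 1).
    return max(0.0, crossing_x - (n - 1))


# Hypothetical disk usage percentages, one sample per hour
disk_used_pct = [70.0, 72.0, 74.0, 76.0, 78.0]
print(steps_until_breach(disk_used_pct, 90.0))  # 6.0
```

Here the disk is still below a 90% threshold, so a static alert stays silent, yet the trend says it will be reached in about six more sampling intervals.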

The strategic reason they are doing this

The observability market is under pressure from three directions.

First, telemetry volumes are exploding. Kubernetes, microservices, edge, GPU clusters, AI workloads and distributed storage produce far more telemetry than humans can manually inspect.

Second, SRE teams are overloaded. Vendors are trying to sell “lower MTTR” and “less operational toil” by making the platform do more triage and correlation automatically.

Third, AI applications create new observability requirements. Traditional APM can tell you latency and error rate, but AI systems also need visibility into prompts, responses, hallucinations, drift, token usage, model quality, RAG retrieval quality, vector database behaviour and agent decisions.

So vendors are not just adding AI because it is fashionable. They are defending and expanding their core observability business.

What this means for an SRE / Observability Platform Engineer

The skill shift is significant.

Old value:

Build dashboards.
Write alert rules.
Know PromQL and LogQL.
Search logs manually.
Correlate incidents by experience.

New value:

Design telemetry that AI can reason over.
Standardise labels and service metadata.
Maintain accurate topology and ownership maps.
Connect observability to deployment and incident data.
Create safe remediation workflows.
Validate AI-generated RCA.
Control cost, access and blast radius.
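
"Standardise labels and service metadata" is the most mechanical item on that list, so it is a good one to sketch. The Python below lints a label set against a house standard. The required label names and allowed environments are assumptions for illustration, not any vendor's or project's convention.

```python
from typing import Dict, List

# Hypothetical house standard; adjust to whatever your platform agrees on.
REQUIRED_LABELS = ("service", "team", "environment")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def lint_labels(labels: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one telemetry label set."""
    problems = []
    for name in REQUIRED_LABELS:
        if not labels.get(name):
            problems.append(f"missing label: {name}")
    env = labels.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


print(lint_labels({"service": "payments", "environment": "production"}))
# ['missing label: team', 'unknown environment: production']
```

A check like this could run in CI or in the collection pipeline. The point is that an AI layer can only join telemetry to owners and deployments if those labels exist and mean the same thing everywhere.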

The winners will not simply be the engineers who know the most dashboards. The winners will be the engineers who can build a trusted operational intelligence layer over metrics, logs, traces, topology and automation.

AI Strategies of New Observability Products

Coralogix is releasing the most explicit “AI observability product suite.” Cribl is positioning itself as the telemetry data layer for AI-era observability. Tsuga is newer and appears to be building an AI-native, bring-your-own-cloud observability architecture rather than simply adding an AI assistant to an old SaaS model.

Quick comparison

Company	AI direction	Product maturity from public material
Coralogix	AI Center, AI guardrails, AI evaluations, AI-SPM, Olly AI observability agent	Very explicit productised AI offering
Cribl	Cribl AI, Copilot, AI-guided Search Investigations, telemetry for humans and agents	Strong AI-assisted telemetry/data-management direction
Tsuga	BYOC observability for the AI era, agent-native observability, MCP/CLI for customer-owned agents	Newer; more architectural and agent-native positioning

1. Coralogix: AI observability as a full product suite

Coralogix is clearly releasing AI-focused products. Its main AI platform is AI Center, which Coralogix describes as a complete platform for AI-powered applications combining observability, guardrails, evaluations, and AI Security Posture Management in one place. It monitors LLM interactions for health, performance, cost, latency, errors, security and quality issues.

The key Coralogix AI products are:

Coralogix AI Center
  ├─ AI Observability
  ├─ AI Guardrails
  ├─ AI Evaluations
  ├─ AI Security Posture Management
  ├─ AI Application Discovery
  └─ AI Explorer / Application Drilldown

What Coralogix is targeting

Coralogix is not just monitoring servers. It is monitoring AI application behaviour:

Prompt
  ↓
LLM call
  ↓
Response
  ↓
Evaluation
  ↓
Guardrail decision
  ↓
Security / quality / cost signal
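
The flow above can be sketched as a small Python pipeline. This is not Coralogix's API or SDK. The `Interaction` class, the `evaluate` function, the signal names and the email regex are all hypothetical, and the model call is a stub.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Very rough email pattern, used here only as a toy PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Interaction:
    prompt: str
    response: str
    signals: List[str] = field(default_factory=list)
    blocked: bool = False


def evaluate(interaction: Interaction, max_response_chars: int = 2000) -> Interaction:
    """Toy evaluation plus guardrail step for one prompt/response pair."""
    if EMAIL_RE.search(interaction.response):
        interaction.signals.append("pii:email_in_response")
        interaction.blocked = True  # guardrail decision
    if len(interaction.response) > max_response_chars:
        interaction.signals.append("cost:long_response")
    return interaction


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "Contact alice@example.com for a refund."


prompt = "How do I get a refund?"
result = evaluate(Interaction(prompt, fake_llm(prompt)))
print(result.blocked, result.signals)  # True ['pii:email_in_response']
```

Each interaction ends up carrying its own quality, security and cost signals, which is what makes it possible to drill from an organisation-level trend down to a single prompt/response pair.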

Its AI Center monitoring gives an organisation-level view of LLM usage and lets teams drill from a trend down to a specific application and even a specific prompt/response interaction.

It also supports OpenTelemetry GenAI semantic conventions, so teams can send GenAI spans into Coralogix AI Center without needing a Coralogix-specific SDK.

Olly: Coralogix’s AI observability agent

Coralogix also has Olly, which it describes as an AI-native observability agent. Olly lets users ask natural-language questions and get answers across logs, metrics, traces and alerts.

In practice, this is the “AI SRE assistant” layer:

Human asks:
“Why is payment latency rising?”

Olly checks:
  ├─ logs
  ├─ metrics
  ├─ traces
  ├─ alerts
  ├─ correlations
  └─ possible root causes

Then returns:
  ├─ explanation
  ├─ evidence
  ├─ affected services
  └─ recommended next steps

Coralogix also positions Olly as more than a simple assistant: it says Olly uses specialised agents for log analysis, trace exploration, metrics interpretation, security research, code debugging, correlation analysis and hypothesis generation.

My read on Coralogix

Coralogix is trying to own AI production reliability:

Monitor AI apps
Evaluate AI outputs
Detect prompt injection / PII / toxicity
Track token cost
Find bad model behaviour
Use AI to investigate normal production incidents

So yes: Coralogix is strongly AI-focused.

2. Cribl: AI platform for telemetry, not classic dashboard observability

Cribl’s AI angle is different. Cribl is not primarily trying to be another Datadog-style full-stack UI. It is positioning itself as the AI Platform for Telemetry: the collection, routing, shaping, searching and governance layer for machine data used by humans and AI agents. Cribl’s homepage describes the platform as giving enterprises choice and control for telemetry, and says it helps manage and analyse telemetry for both humans and agents.

The AI-focused Cribl areas are:

Cribl AI
  ├─ Copilot
  ├─ Copilot Editor
  ├─ AI-guided Search Investigations
  ├─ Natural-language queries
  ├─ AI-assisted pipeline creation
  ├─ AI telemetry parsing
  └─ AI-ready telemetry routing

Cribl AI and Copilot

Cribl says its AI capabilities help teams create and modify pipelines, queries and configurations using natural language. It also says Cribl Copilot provides troubleshooting guidance, answers product/configuration questions and helps teams resolve issues faster.

This matters because a lot of observability toil is not just dashboards. It is:

Parse this log format.
Map this schema.
Route this data.
Drop this noisy field.
Mask this sensitive value.
Send this stream to the SIEM.
Send this other stream to cheaper storage.

Cribl’s AI is aimed at reducing that data-engineering toil.

Copilot Editor

Cribl’s Copilot Editor uses AI to help with schema mapping, translating logs across systems and building telemetry pipelines that clean, filter and route events.

That is important because AI-era observability needs clean, standardised telemetry. A reasoning agent is only useful if the data has usable structure.

Raw logs
  ↓
AI-assisted parsing
  ↓
Schema mapping
  ↓
Enrichment / masking / routing
  ↓
Search / SIEM / observability backend / AI agent
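
As a rough illustration of those stages, here is a plain-Python sketch of one event passing through parse, drop, mask and route steps. It is not Cribl pipeline configuration or Cribl's API. The field names, the destination names and the card-number regex are assumptions made for the example.

```python
import json
import re
from typing import Dict, Tuple

CARD_RE = re.compile(r"\b\d{13,16}\b")  # crude stand-in for a card-number detector
NOISY_FIELDS = {"debug_blob"}           # hypothetical field nobody queries


def process(raw_line: str) -> Tuple[str, Dict[str, object]]:
    """Parse one JSON log line, clean it, and pick a destination."""
    event = json.loads(raw_line)                   # parse
    for name in NOISY_FIELDS:                      # drop noisy fields
        event.pop(name, None)
    if isinstance(event.get("message"), str):      # mask sensitive values
        event["message"] = CARD_RE.sub("****", event["message"])
    # Route: auth events go to the SIEM, everything else to cheaper storage.
    destination = "siem" if event.get("category") == "auth" else "archive"
    return destination, event


line = '{"category": "auth", "message": "card 4111111111111111 declined", "debug_blob": "x"}'
print(process(line))
# ('siem', {'category': 'auth', 'message': 'card **** declined'})
```

The AI-assisted part of the vendor pitch is generating and maintaining rules like these from natural language, rather than hand-writing a parser and regex per log source.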

AI-guided Cribl Search Investigations

Cribl Search has an Investigations feature in preview. The docs describe it as a guided workspace where users explore incidents and telemetry using natural-language prompts. It helps analyse telemetry, identify patterns and document findings without manually building every query.

That means Cribl is moving into the AI-assisted investigation workflow:

Alert or question
  ↓
Natural-language investigation
  ↓
Generated queries
  ↓
Pattern discovery
  ↓
Findings captured in one workspace

Cribl’s AI observability thesis

Cribl’s recent AI observability messaging is that AI observability is a telemetry problem, not just a dashboard problem. It argues that LLM apps generate prompts, completions, tool calls, retrieval steps, token counts, model choices, policy events and infrastructure signals, and that those need to be collected and shaped for different teams and tools.

My read on Cribl

Cribl is not saying:

“We are the AI RCA dashboard.”

It is saying:

“We are the telemetry control plane that makes AI investigations possible.”

That is strategically clever. AI agents need cheap, governed, high-quality access to large telemetry volumes. Cribl wants to be the pipe, filter, schema and search layer underneath that.

3. Tsuga: AI-native observability architecture, still early

Tsuga is the newest of the three and, judging by public material, less mature than Coralogix and Cribl, but it is very clearly positioning itself around the AI-era observability problem.

Tsuga describes itself as a bring-your-own-cloud observability platform for logs, metrics, traces and APM, deployed inside the customer’s AWS account using infrastructure-as-code. It says customers get the control of self-hosted infrastructure without the operational burden of running it.

Its newer positioning is explicitly AI-era focused. Tsuga announced a $35 million Series A on June 23, 2026, saying it is building “observability for the AI era” inside the customer’s cloud so the customer’s data and AI do not leave their control.

Tsuga’s AI claim

Tsuga’s argument is architectural:

Traditional observability:
  telemetry leaves your cloud
  vendor stores it
  cost rises with volume
  AI agents require broad access to vendor-hosted data

Tsuga model:
  observability runs inside your cloud
  telemetry stays inside your perimeter
  AI runs on your own data
  agents can use complete telemetry without exporting sensitive context

Tsuga says its AI tools run on the customer’s data inside the customer’s perimeter. It also says automated root-cause analysis runs on complete, unsampled data, and that its MCP server and CLI let engineering teams build their own agents on that foundation inside their own security boundary.

That MCP point is important. It suggests Tsuga is not only building an observability UI; it is exposing observability context to AI agents.

Agent-native observability

Tsuga has a specific Agent-Native Observability page. It says Tsuga is built so AI agents can use observability data effectively, affordably and inside the customer environment. It highlights agent-first APIs, MCPs, CLIs and query interfaces designed to return relevant context rather than raw data dumps.

That is a very modern product angle.

AI agent asks:
“What changed before this incident?”

Tsuga should return:
  ├─ relevant metrics
  ├─ relevant logs
  ├─ deployment context
  ├─ service ownership
  ├─ topology
  └─ probable causal evidence

Not:
  └─ 10GB of raw logs
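
A minimal sketch of "relevant context rather than raw data dumps", in Python. This is not Tsuga's API. The function `incident_context`, the log record shape and the five-minute lookback are hypothetical and chosen only to show the idea of summarising before handing data to an agent.

```python
from collections import Counter
from typing import Dict, List


def incident_context(logs: List[Dict], incident_ts: int,
                     lookback_s: int = 300, top_n: int = 3) -> Dict:
    """Summarise error logs in the window before an incident
    instead of returning the raw log lines."""
    window = [
        entry for entry in logs
        if incident_ts - lookback_s <= entry["ts"] <= incident_ts
        and entry["level"] == "error"
    ]
    by_service = Counter(entry["service"] for entry in window)
    return {
        "window_seconds": lookback_s,
        "error_count": len(window),
        "top_services": by_service.most_common(top_n),
    }


logs = [
    {"ts": 100, "service": "api", "level": "info"},
    {"ts": 900, "service": "db", "level": "error"},
    {"ts": 950, "service": "db", "level": "error"},
    {"ts": 980, "service": "api", "level": "error"},
]
print(incident_context(logs, incident_ts=1000))
# {'window_seconds': 300, 'error_count': 3, 'top_services': [('db', 2), ('api', 1)]}
```

An agent receiving this small summary spends far fewer tokens than one handed the full log stream, and can still ask a follow-up query for the specific service that looks suspicious.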

What is less clear with Tsuga

Publicly, Tsuga looks less like:

Named AI assistant with lots of screenshots and feature modules

and more like:

AI-native observability architecture:
  BYOC
  complete telemetry
  agent APIs
  MCP
  automated RCA
  customer-owned AI boundary

So my assessment is: yes, Tsuga is AI-focused, but its public product story is currently more architectural and agent-native, rather than feature-by-feature in the way Coralogix’s is.

The strategic differences

Coralogix: “Observe and govern AI applications”

Coralogix is focused on production AI application reliability:

LLM monitoring
AI guardrails
Evaluations
AI security posture
Prompt/response visibility
Olly AI investigation agent

Best fit:

Teams deploying LLM apps and agents who need monitoring, safety, cost tracking and AI-assisted troubleshooting.

Cribl: “Prepare and control telemetry for AI”

Cribl is focused on the telemetry substrate:

Collect once
Shape data
Mask sensitive fields
Route anywhere
Search cheaply
Let humans and agents investigate
Use AI to build pipelines and queries

Best fit:

Large enterprises drowning in telemetry volume, SIEM costs, log routing complexity and multi-tool data sprawl.

Tsuga: “Run AI-era observability inside your own cloud”

Tsuga is focused on sovereign, cost-controlled, agent-native observability:

BYOC deployment
Telemetry stays in your cloud
AI and agents run inside your boundary
Automated RCA on unsampled data
MCP/CLI for custom SRE agents

Best fit:

Regulated, European, AI-native or high-scale companies that do not want telemetry, prompts, incident history and operational context exported to a third-party SaaS cloud.

The bigger market pattern

These newer players are attacking the incumbents from three angles:

1. Cost
   AI generates more telemetry.
   Per-GB SaaS observability becomes painful.

2. Data control
   AI telemetry includes prompts, responses, business context and security-sensitive data.
   Customers do not always want that in a vendor cloud.

3. Agent-readiness
   Future observability is not just dashboards for humans.
   AI agents need APIs, context retrieval, governed telemetry access and automated RCA.

So the new wave is less about “AI as a dashboard chatbot” and more about building the data foundation for AI-driven operations.

The sharpest summary is:

Coralogix = AI observability product suite
Cribl     = AI-ready telemetry control plane
Tsuga     = AI-native sovereign observability architecture

For an SRE/observability platform engineer, these companies are worth watching because they indicate where the next jobs and platform designs are going: telemetry engineering, AI-readable context, agent-safe access, automated RCA, guardrails and cost-controlled observability architectures.

Open-source AI Observability

AI adoption in open-source observability is happening, but it is different from what Datadog, Dynatrace, Splunk and New Relic are doing.

The commercial vendors are embedding AI directly into their SaaS platforms. The open-source ecosystem is mostly building the standards, collectors, SDKs, self-hostable platforms and agent interfaces that allow AI observability to work without vendor lock-in.

The big shift is this:

Old open-source observability:

Prometheus / Loki / Tempo / Grafana / OpenTelemetry
        ↓
Collect, store, query, dashboard, alert


AI-era open-source observability:

OpenTelemetry + collectors + traces + logs + metrics + AI metadata
        ↓
LLM / agent / RAG / GPU / vector DB visibility
        ↓
AI assistants, AI SRE agents, natural-language querying, RCA

1. Grafana: open observability stack + AI features around it

Grafana Labs is moving in two directions.

First, it is keeping the open observability stack relevant for AI-era workloads: Grafana, Loki, Mimir, Tempo, Pyroscope and Alloy remain the core telemetry stack.

Second, it is adding AI-powered layers on top, especially in Grafana Cloud.

Grafana’s AI Observability product is built on OpenTelemetry and is aimed at teams running LLM agents in production. It monitors agent activity, traces conversations, tracks costs and evaluates quality. Grafana documents SDK support for Go, Python, TypeScript, Java and .NET, plus integrations with frameworks such as LangChain, LangGraph, OpenAI Agents and Vercel AI SDK.

Grafana also has Grafana Assistant, an AI-powered observability agent. It lets users ask questions like “Show me CPU usage” or “Create a dashboard for my database,” and it works across metrics, logs, traces, profiles and databases. Grafana says it can run investigations, manage dashboards, build/refine queries and help users navigate Grafana resources.

The important nuance: Grafana Assistant is not the same thing as open-source Grafana itself. It is primarily a Grafana Cloud AI capability, though Grafana documents a self-managed Assistant app that connects to a Grafana Cloud stack with reduced functionality.

Grafana’s most open-source-relevant AI move is probably Grafana Alloy. Alloy is Grafana Labs’ open-source OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces and profiles. It gives Grafana a standard collector layer for AI-era telemetry pipelines.

So Grafana’s strategy is:

Grafana AI strategy:

Open-source base:
  Grafana
  Loki
  Mimir
  Tempo
  Pyroscope
  Alloy

AI observability:
  LLM / agent traces
  cost tracking
  quality evaluation
  AI workload dashboards

AI assistant:
  natural-language querying
  dashboard creation
  investigation assistance
  query generation

Strategic direction:
  keep the OSS stack open,
  but place high-value AI workflows in Grafana Cloud.

2. OpenTelemetry: the standard layer for AI observability

OpenTelemetry is not a company; it is a CNCF open-source project. Its role is different from Grafana’s.

OpenTelemetry is becoming the standard telemetry schema and instrumentation layer for AI systems.

OpenTelemetry describes itself as an open-source observability framework for cloud-native software, providing APIs, libraries, agents and collector services for capturing telemetry. It also emphasises vendor-neutral instrumentation, meaning you instrument once and export to different backends.

For AI, the key development is OpenTelemetry semantic conventions for generative AI. OpenTelemetry has been extending its conventions so GenAI telemetry can capture model parameters, response metadata, token usage, traces, metrics and events for model interactions.

That matters because LLM systems need new telemetry fields that normal web apps did not need:

Traditional app telemetry:
  service.name
  http.status_code
  duration
  error
  route
  database call

AI app telemetry:
  model name
  prompt
  completion
  token count
  tool call
  retrieval step
  vector DB query
  embedding model
  cost
  temperature
  hallucination score
  safety evaluation
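
To show what the AI-specific fields look like in practice, here is a small Python sketch that builds span attributes for one model call. The attribute keys follow the OpenTelemetry GenAI semantic conventions as I understand them at the time of writing. Those conventions are still evolving, so check the current specification before relying on exact names. The helper `genai_span_attributes` and the model name are hypothetical.

```python
from typing import Dict, Union

AttributeValue = Union[str, int, float]


def genai_span_attributes(model: str, temperature: float,
                          input_tokens: int, output_tokens: int) -> Dict[str, AttributeValue]:
    """Build a flat attribute map for a single chat-style model call."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_span_attributes("example-model", 0.2, 120, 48)
total_tokens = attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
print(total_tokens)  # 168
```

With the OpenTelemetry Python SDK, a map like this would typically be applied to a span with `span.set_attribute(key, value)`. Signals such as hallucination scores usually come from a separate evaluation step rather than from the model call itself.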

OpenTelemetry is not trying to become an AI assistant. Its value is that it gives the ecosystem a common language for AI telemetry.

So OpenTelemetry’s strategy is:

OpenTelemetry AI strategy:

Standardise:
  spans
  metrics
  logs/events
  attributes
  semantic conventions

Support:
  LLM calls
  model interactions
  prompts/responses
  token usage
  latency
  errors
  provider metadata

Enable:
  Grafana
  SigNoz
  Langfuse
  OpenLIT
  Elastic
  New Relic
  Datadog
  custom platforms

Strategic direction:
  become the neutral telemetry contract for AI applications.

3. OpenLIT: open-source LLM observability on OpenTelemetry

OpenLIT is a good example of the new generation of open-source AI observability projects.

It describes itself as an open-source LLM observability and AI engineering platform built on OpenTelemetry. Its positioning is self-hosted, privacy-first and vendor-neutral.

This is important because many companies do not want prompts, responses, user inputs, sensitive data or AI-agent traces going straight into a third-party SaaS.

OpenLIT’s direction is:

OpenLIT strategy:

Monitor:
  LLM calls
  latency
  token usage
  cost
  model behaviour
  vector DBs
  GPU usage

Deploy:
  self-hosted
  OpenTelemetry-native
  privacy-first

Best fit:
  teams building AI apps who want open-source AI observability
  without committing to a commercial platform first.

4. Langfuse: open-source LLM tracing and evaluation

Langfuse is another major open-source AI observability project.

It focuses on LLM application tracing: capturing prompts, model responses, token usage, latency, tool calls and retrieval steps. Langfuse also provides AI-engineering features such as LLM-as-judge evaluation, prompt management, experiments and datasets, and it can be self-hosted.

Langfuse is less like “Grafana for all infrastructure” and more like “observability and evaluation for LLM applications.”

Its strategy is:

Langfuse strategy:

Trace:
  prompt
  response
  tool call
  RAG step
  latency
  token usage
  cost

Evaluate:
  quality
  scoring
  experiments
  prompt versions
  datasets

Best fit:
  AI product teams who need to debug and improve LLM apps,
  not just monitor infrastructure.
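
Cost is the easiest of those trace fields to derive, because it is just token counts multiplied by a price. The Python sketch below is not Langfuse's SDK. The price table is invented for the example, and real prices vary by provider and model.

```python
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"example-model": {"input": 0.5, "output": 1.5}}


def trace_cost(calls: List[Dict]) -> float:
    """Sum the cost of every model call recorded in one trace."""
    total = 0.0
    for call in calls:
        price = PRICE_PER_1K[call["model"]]
        total += call["input_tokens"] / 1000 * price["input"]
        total += call["output_tokens"] / 1000 * price["output"]
    return round(total, 6)


calls = [
    {"model": "example-model", "input_tokens": 2000, "output_tokens": 1000},
    {"model": "example-model", "input_tokens": 4000, "output_tokens": 0},
]
print(trace_cost(calls))  # 4.5
```

Attributing cost per trace, rather than per month, is what lets a team see that one agent workflow or one prompt version is responsible for most of the spend.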

5. SigNoz: open-source observability with AI-agent access

SigNoz is moving from being an open-source Datadog/New Relic alternative into a more AI-aware observability platform.

SigNoz describes itself as an open-source observability tool powered by OpenTelemetry, covering logs, metrics, traces, dashboards, alerts and LLM/AI observability. It also advertises an MCP server for bringing telemetry into coding agents and an AI teammate called Noz for incident investigation, alert tuning and dashboard building.

This is significant because it shows a broader open-source pattern: observability platforms are not just adding AI dashboards; they are exposing telemetry to AI agents.

SigNoz direction:

OpenTelemetry-native observability
  +
LLM/AI observability
  +
MCP access for coding agents
  +
AI teammate for investigations and dashboards

That is where open-source observability is going: not just dashboards for humans, but context APIs for agents.

6. HolmesGPT: open-source AI SRE agent

HolmesGPT is another important example because it is not primarily about observing LLM apps. It is about using AI to investigate production incidents.

HolmesGPT describes itself as an open-source AI agent for investigating production incidents and finding root causes across Kubernetes, VMs, cloud providers, databases and SaaS platforms. It is listed as a CNCF sandbox project.

That puts it closer to the Datadog Bits AI / Dynatrace Davis AI direction, but in open-source form.

HolmesGPT strategy:

Input:
  alerts
  Kubernetes state
  metrics
  logs
  cloud context
  runbooks

AI task:
  investigate incident
  gather evidence
  find probable root cause
  explain next action

Best fit:
  platform teams wanting an open-source AI SRE layer
  over existing observability tools.

The overall open-source adoption pattern

Open-source observability is adopting AI in four layers.

1. AI telemetry standards

This is where OpenTelemetry is most important.

Goal:
  make AI applications observable in a standard way

Examples:
  GenAI semantic conventions
  token usage attributes
  model request spans
  prompt/response events
  tool-call spans

This is foundational. Without standard AI telemetry, every vendor and OSS project invents incompatible schemas.

2. AI workload observability

This is where Grafana AI Observability, OpenLIT, Langfuse and SigNoz fit.

Goal:
  monitor LLM apps, agents and RAG pipelines

Signals:
  latency
  token cost
  prompt/response quality
  hallucination risk
  model errors
  vector DB retrieval
  tool calls
  agent steps

3. AI-assisted operations

This is where Grafana Assistant, HolmesGPT, SigNoz Noz and similar tools fit.

Goal:
  help humans investigate production systems faster

Capabilities:
  natural-language querying
  alert explanation
  dashboard generation
  root-cause hints
  log summarisation
  incident summaries

4. Agent-native observability

This is the newest layer.

Goal:
  let AI agents consume observability data safely

Interfaces:
  MCP servers
  CLI tools
  API access
  context retrieval
  guarded query execution
  evidence-based RCA
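
"Guarded query execution" can be sketched as a thin policy wrapper between an agent and the telemetry backend. This Python example is an assumption about what such a guard might look like, not the behaviour of any specific MCP server or product. The limits, the datasource names and the `fake_backend` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueryRequest:
    datasource: str
    query: str
    range_seconds: int


class QueryRejected(Exception):
    """Raised when an agent's query violates the access policy."""


ALLOWED_DATASOURCES = {"metrics", "logs"}  # hypothetical policy
MAX_RANGE_SECONDS = 6 * 3600
MAX_ROWS = 100


def guarded_execute(request: QueryRequest,
                    run: Callable[[QueryRequest], List[dict]]) -> List[dict]:
    """Check the request against policy, run it, and cap the result size."""
    if request.datasource not in ALLOWED_DATASOURCES:
        raise QueryRejected(f"datasource not allowed: {request.datasource}")
    if request.range_seconds > MAX_RANGE_SECONDS:
        raise QueryRejected("time range too wide")
    return run(request)[:MAX_ROWS]


def fake_backend(request: QueryRequest) -> List[dict]:
    # Stand-in for a real query engine.
    return [{"row": i} for i in range(500)]


rows = guarded_execute(QueryRequest("logs", "level=error", 3600), fake_backend)
print(len(rows))  # 100
```

The guard limits blast radius in two ways: it stops an agent reaching data it should not see, and it stops a badly formed query from scanning weeks of telemetry or flooding the agent's context.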

This matters because future AI coding agents and SRE agents will need access to production telemetry to debug issues. The observability stack must become queryable by both humans and machines.

The key difference from commercial observability

Commercial vendors are building polished AI experiences inside their own SaaS platforms.

Open-source observability is building the portable foundations:

Layer	Open-source approach
Instrumentation	OpenTelemetry SDKs and semantic conventions
Collection	OpenTelemetry Collector, Grafana Alloy
Storage/query	Grafana LGTM, SigNoz, ClickHouse-based stacks
AI app tracing	OpenLIT, Langfuse, OTel GenAI conventions
AI SRE	HolmesGPT, MCP-enabled tools
Agent access	MCP, APIs, CLI workflows

The strategic difference is:

Commercial vendors:
  "Use our platform and our AI will help you."

Open-source ecosystem:
  "Instrument once, own your data, expose telemetry to any backend or AI agent."

What this means for SREs and observability engineers

The valuable skill is moving from only operating dashboards to building an AI-readable telemetry platform.

That means:

You need:
  consistent OpenTelemetry attributes
  clean service names
  good resource metadata
  deployment markers
  trace/log/metric correlation
  AI workload spans
  token/cost metrics
  evaluation signals
  MCP or API access for agents
  guardrails around sensitive telemetry
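
Of the items above, trace/log/metric correlation is the one that most directly determines whether an AI layer can reason across signals. Here is a minimal Python sketch, assuming logs carry a `trace_id` field. The record shapes, the 500 ms threshold and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def logs_by_trace(logs: List[Dict]) -> Dict[str, List[str]]:
    """Index log messages by the trace they belong to."""
    grouped = defaultdict(list)
    for entry in logs:
        trace_id = entry.get("trace_id")
        if trace_id:
            grouped[trace_id].append(entry["message"])
    return dict(grouped)


def explain_slow_spans(spans: List[Dict], logs: List[Dict], slow_ms: int = 500) -> List[Dict]:
    """Attach correlated log messages to every span at or above slow_ms."""
    index = logs_by_trace(logs)
    return [
        {"trace_id": s["trace_id"], "name": s["name"], "logs": index.get(s["trace_id"], [])}
        for s in spans
        if s["duration_ms"] >= slow_ms
    ]


spans = [
    {"trace_id": "t1", "name": "GET /pay", "duration_ms": 1200},
    {"trace_id": "t2", "name": "GET /cart", "duration_ms": 40},
]
logs = [
    {"trace_id": "t1", "message": "db connection pool exhausted"},
    {"trace_id": "t2", "message": "cache hit"},
    {"message": "log line without trace context"},
]
print(explain_slow_spans(spans, logs))
# [{'trace_id': 't1', 'name': 'GET /pay', 'logs': ['db connection pool exhausted']}]
```

The third log line has no trace context and is silently unusable for correlation, which is exactly the kind of gap that consistent instrumentation is meant to close.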

For a homelab or professional platform, the modern open-source direction would be:

Applications / AI agents
        ↓
OpenTelemetry SDKs + GenAI semantic conventions
        ↓
Grafana Alloy or OpenTelemetry Collector
        ↓
Mimir / Loki / Tempo / ClickHouse / SigNoz / Langfuse / OpenLIT
        ↓
Grafana dashboards + AI assistant / HolmesGPT / MCP-enabled agents

The sharp summary:

Grafana is making the open observability stack AI-aware.

OpenTelemetry is becoming the standard language for AI telemetry.

OpenLIT, Langfuse and SigNoz are making LLM apps observable.

HolmesGPT-style tools are turning open telemetry into AI-assisted SRE investigations.

So, yes: open-source observability is adopting AI quickly, but the centre of gravity is different. The open-source world is less about one vendor-owned AI brain and more about open telemetry, self-hostable AI observability, and agent-ready operations.