Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to IT operations. Its goal is to create and maintain highly reliable, scalable, and efficient systems.

SRE was pioneered by Google to address the challenge of operating large-scale services while balancing reliability with the pace of feature development.

Core Objectives of SRE

Reliability – Keep services available and functioning correctly.
Scalability – Ensure systems can handle growth in users and traffic.
Performance – Maintain acceptable response times and user experience.
Efficiency – Automate repetitive operational work.
Risk Management – Balance innovation with system stability.

Key Concepts

Service Level Indicators (SLIs)

Metrics that measure service performance, such as:

Request latency
Error rate
Availability (uptime)

Service Level Objectives (SLOs)

Target values for SLIs, for example:

99.9% monthly availability
95% of requests completed within 200 ms
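
As a rough illustration of how SLIs are compared against SLO targets, the sketch below computes two indicators from a list of request records. The `Request` type, the function names, and the request-based definition of availability are assumptions made for this example; real systems usually derive these numbers from monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability(requests):
    """Availability SLI: fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to evaluate")
    return sum(r.ok for r in requests) / len(requests)


def fraction_within(requests, threshold_ms):
    """Latency SLI: fraction of requests completed within threshold_ms."""
    if not requests:
        raise ValueError("no requests to evaluate")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slos(requests, availability_target=0.999,
               latency_target=0.95, threshold_ms=200.0):
    """True only if both SLIs meet their SLO targets."""
    return (availability(requests) >= availability_target
            and fraction_within(requests, threshold_ms) >= latency_target)
```

The default targets mirror the two example SLOs above; both must hold for the service to be within its objectives.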

Error Budgets

The acceptable amount of unreliability implied by the SLO. If a service has a 99.9% uptime target, the remaining 0.1% is the “error budget”: roughly 43 minutes of downtime in a 30-day month. Teams can spend this budget on risky changes such as new feature releases, and slow releases down when the budget is nearly exhausted.
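
The arithmetic behind an error budget can be sketched in a few lines. The function names and the 30-day window are assumptions chosen for illustration, not a standard API.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of budget left; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

A team might compare `budget_remaining` against a threshold before approving a risky release.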

Typical SRE Responsibilities

Monitoring and alerting
Incident response and on-call support
Capacity planning
Automation and tooling
Performance optimization
Disaster recovery planning
Postmortem analysis after outages

Example

Suppose an e-commerce website experiences outages during holiday sales.

An SRE team might:

Set up monitoring to detect issues early.
Automate server scaling when traffic increases.
Create alerts for high error rates.
Conduct postmortems after incidents to prevent recurrence.
Define an SLO of 99.95% availability.
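
One way the "alerts for high error rates" step could work is a sliding window over recent requests. This is a minimal sketch: the class name, window size, and threshold are hypothetical, and a production setup would normally express the rule in the monitoring system rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window_size` requests is too high."""

    def __init__(self, window_size=100, threshold=0.05):
        self.window = deque(maxlen=window_size)  # oldest results drop off automatically
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome (True for success, False for error)."""
        self.window.append(ok)

    def error_rate(self):
        if not self.window:
            return 0.0
        return self.window.count(False) / len(self.window)

    def firing(self):
        # Wait for a full window so a single early failure does not page anyone.
        return (len(self.window) == self.window.maxlen
                and self.error_rate() > self.threshold)
```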

SRE vs. DevOps

SRE	DevOps
A specific engineering discipline	A broader culture and philosophy
Focuses heavily on reliability metrics and automation	Focuses on collaboration between development and operations
Uses concepts like SLOs and error budgets	Encourages continuous delivery and shared responsibility

A common saying is: “SRE is one way to implement DevOps.”

Skills Commonly Required for SRE Roles

Linux/Unix administration
Cloud platforms (e.g., AWS, GCP)
Networking fundamentals
Programming (Python, Go, Java, etc.)
Containers and orchestration (Docker, Kubernetes)
Monitoring tools (Prometheus, Grafana)
CI/CD pipelines
Incident management

In short, SRE is about using engineering and automation to keep systems reliable while enabling rapid software delivery.

SRE for AI Hyperscalers

Traditional SRE focuses on keeping applications and services reliable. AI/hyperscale SRE extends this to the massive distributed compute, storage, networking, and accelerator infrastructure that powers AI training, inference, and cloud platforms.

Companies such as Google Cloud, Microsoft Azure, Amazon Web Services, CoreWeave, Graphcore, Nscale, and Crusoe all require variants of this role.

Traditional SRE vs. AI Infrastructure SRE

Traditional SaaS SRE	AI Infrastructure SRE
Web services	GPU clusters
APIs	Distributed training
Databases	AI storage systems
Application latency	GPU utilization
Service availability	Training job success
HTTP traffic	RDMA/InfiniBand traffic
Kubernetes	Kubernetes + Slurm
CPU metrics	GPU metrics

A traditional SRE might ask:

Is the website available?

An AI SRE asks:

Why did a 2,000-GPU training run fail after 18 hours?

What AI SREs Actually Operate

Compute Layer

AI SREs manage clusters containing:

NVIDIA H100/H200/B200 GPUs
AMD MI300 GPUs
Graphcore IPUs
GPU servers
Bare-metal provisioning systems

Responsibilities:

Firmware management
GPU health validation
Hardware lifecycle management
Capacity planning
Cluster upgrades

Accelerator Fabric

AI workloads depend on ultra-low-latency networks.

Typical technologies:

NVIDIA InfiniBand
RoCEv2
RDMA
NVLink
NVSwitch

The SRE becomes part network engineer.

Example troubleshooting:

PFC misconfiguration
ECN tuning
Packet loss
Congestion hotspots
Link flaps
Switch failures

Because collective operations keep every GPU in step, even a small amount of packet loss can slow an entire distributed training job.

AI Storage

Training jobs may consume petabytes of data.

Storage technologies include:

VAST Data
DDN
Weka
Lustre
BeeGFS
Ceph
Object storage

Key metrics:

Throughput
IOPS
Metadata latency
GPU starvation

Common question:

Are GPUs idle because the storage cannot feed data fast enough?
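
A back-of-the-envelope check for that question is to compare the read throughput the GPUs need with what the storage system actually delivers. The sketch below assumes a simple model in which every GPU consumes a fixed number of samples per second; the function names and parameters are hypothetical.

```python
def required_read_gb_per_s(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read throughput (GB/s) needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def is_storage_bound(required_gb_per_s, measured_gb_per_s):
    """True if storage delivers less than the GPUs can consume."""
    return measured_gb_per_s < required_gb_per_s
```

For example, 512 GPUs each reading 1,000 samples of 500 KB per second need about 256 GB/s in aggregate under this model.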

Kubernetes

Modern AI environments frequently run:

Kubernetes
GPU Operator
KubeRay
Kubeflow
Volcano
Argo Workflows

SRE responsibilities:

Node lifecycle
Scheduling
Cluster upgrades
GPU sharing
Resource quotas
Autoscaling

HPC Schedulers

Many AI companies still use:

Slurm
PBS
LSF

Responsibilities:

Queue management
Fair-share policies
Accounting
Job placement
Multi-tenant isolation

Example:

A user requests 512 GPUs.

The SRE determines:

Where the GPUs exist
Whether they are healthy
Which fabric topology is optimal
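
The three questions above can be sketched as a greedy placement heuristic: skip unhealthy GPUs, then fill the request from the leaf switches with the most healthy GPUs so the job spans as few switches as possible. This is an illustration only; the input format is an assumption, and it is not how Slurm or any other scheduler implements topology-aware placement.

```python
from collections import defaultdict


def place_job(nodes, gpus_needed):
    """nodes: list of (node_name, leaf_switch, healthy_gpus) tuples.

    Returns a list of (node_name, gpus_taken), or None if the request
    cannot be satisfied with the healthy GPUs available.
    """
    by_switch = defaultdict(list)
    for name, switch, healthy in nodes:
        if healthy > 0:  # ignore nodes with no healthy GPUs
            by_switch[switch].append((name, healthy))

    # Switches with the most healthy GPUs first, to minimise switches spanned.
    ordered = sorted(by_switch.values(),
                     key=lambda ns: sum(h for _, h in ns), reverse=True)

    placement, remaining = [], gpus_needed
    for switch_nodes in ordered:
        for name, healthy in sorted(switch_nodes, key=lambda n: n[1], reverse=True):
            if remaining == 0:
                break
            take = min(healthy, remaining)
            placement.append((name, take))
            remaining -= take
    return placement if remaining == 0 else None
```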

Observability at Hyperscale

This is often where senior AI SREs spend most of their time.

Observability includes:

Infrastructure Metrics

CPU
Memory
Disk
Network

GPU Telemetry

Temperature
Power draw
Memory usage
ECC errors
Utilization
NVLink bandwidth
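
To show how such telemetry might feed a health check, the sketch below flags a GPU based on a single sample. The `GpuSample` fields and the thresholds are hypothetical values chosen for illustration; real limits depend on the hardware and the vendor's guidance.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_id: str
    temperature_c: float
    power_w: float
    uncorrectable_ecc: int
    utilization_pct: float


def health_issues(sample, max_temp_c=85.0, min_util_pct=50.0):
    """Return a list of issue labels for one telemetry sample (empty if healthy)."""
    issues = []
    if sample.uncorrectable_ecc > 0:
        issues.append("ecc")
    if sample.temperature_c > max_temp_c:
        issues.append("overheating")
    if sample.utilization_pct < min_util_pct:
        issues.append("underutilized")
    return issues
```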

Scheduler Metrics

Queue depth
Job success rate
Pending jobs
Resource fragmentation

Storage Metrics

Throughput
Latency
Metadata performance

Application Metrics

Training loss
Tokens/sec
Model throughput

Tools commonly include:

Prometheus
Grafana
Mimir
Loki
Tempo
OpenTelemetry


Reliability Challenges Unique to AI

Traditional SaaS outage:

Website unavailable.

AI outage:

Training run failed after consuming £100,000 of GPU time.

Examples:

GPU Failure

One GPU reports uncorrectable ECC memory errors.

Result:

NCCL failures
Job termination
Training restart

Network Congestion

One leaf switch becomes congested.

Result:

Collective operations slow
GPUs wait on synchronization
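
The reason one congested switch hurts every GPU is that a synchronous training step finishes only when the slowest rank finishes. The sketch below models that with per-rank step times; it is a simplified model with hypothetical function names, not a measurement tool.

```python
def step_time_ms(per_rank_ms):
    """A synchronous step takes as long as its slowest rank."""
    return max(per_rank_ms)


def time_lost_to_stragglers_ms(per_rank_ms):
    """Total time the faster ranks spend waiting on the slowest one."""
    slowest = max(per_rank_ms)
    return sum(slowest - t for t in per_rank_ms)
```

With three ranks at 100 ms and one at 250 ms behind the congested switch, the step takes 250 ms and the other ranks wait a combined 450 ms.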

Storage Bottleneck

Data pipeline cannot deliver training data fast enough.

Result:

GPU utilization drops sharply, for example from 95% to 40%

Scheduler Fragmentation

Enough GPUs exist overall but not in contiguous groups.

Result:

Large jobs cannot start
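
Fragmentation can be made concrete with a small model. The sketch assumes a job needs whole nodes of 8 GPUs each, which is a simplification of "contiguous groups"; the function names are hypothetical.

```python
def can_start(free_gpus_per_node, nodes_needed, gpus_per_node=8):
    """True if enough completely free nodes exist for the job."""
    full_nodes = sum(1 for free in free_gpus_per_node if free >= gpus_per_node)
    return full_nodes >= nodes_needed


def stranded_gpus(free_gpus_per_node, gpus_per_node=8):
    """Free GPUs on partially used nodes, unusable by whole-node jobs."""
    return sum(free for free in free_gpus_per_node if free < gpus_per_node)
```

With free GPUs per node of `[8, 4, 4, 4, 4, 8]`, the cluster has 32 free GPUs in total, yet a job needing four full nodes cannot start because 16 of those GPUs are stranded.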

AI SRE Success Metrics

Instead of focusing only on uptime, AI companies often care about:

Metric	Why
GPU Utilization	GPUs are extremely expensive
Job Success Rate	Failed jobs waste money
Training Throughput	Faster model development
Cluster Efficiency	Revenue and cost control
Mean Time To Recovery	Minimize lost compute time
Resource Availability	Customer satisfaction
Infrastructure Cost per GPU Hour	Business profitability
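
Two of these metrics can be computed directly from job records. The sketch below uses a hypothetical `Job` record and treats all GPU time of a failed job as wasted, which is an upper bound: jobs that resume from checkpoints lose less.

```python
from dataclasses import dataclass


@dataclass
class Job:
    gpus: int
    hours: float
    succeeded: bool


def job_success_rate(jobs):
    """Fraction of jobs that completed successfully."""
    if not jobs:
        raise ValueError("no jobs to evaluate")
    return sum(j.succeeded for j in jobs) / len(jobs)


def wasted_gpu_hours(jobs):
    """GPU hours consumed by failed jobs (ignores checkpoint recovery)."""
    return sum(j.gpus * j.hours for j in jobs if not j.succeeded)


def cost_of_failures(jobs, cost_per_gpu_hour):
    """Upper-bound cost of failed jobs at a given price per GPU hour."""
    return wasted_gpu_hours(jobs) * cost_per_gpu_hour
```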

What Hyperscalers Want

Companies such as Graphcore, CoreWeave, Nscale, and Crusoe need SREs to know:

Linux internals
Kubernetes
Networking
GPU infrastructure
Storage systems
Automation (Python/Go)
Observability
Incident management
Capacity engineering
Distributed systems

At senior levels, the role becomes less about fixing servers and more about answering questions such as:

Why are GPUs underutilized?
Why are training jobs failing?
How can we increase cluster efficiency?
How do we observe 100,000+ accelerators?
How do we automate fleet operations?
How do we reduce cost per training run?

BLU // SAS

Bristol Linux Unix Systems Automation Security