Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to IT operations. Its goal is to create and maintain highly reliable, scalable, and efficient systems.

SRE was pioneered by Google to address the challenge of operating large-scale services while balancing reliability with the pace of feature development.

Core Objectives of SRE

  1. Reliability – Keep services available and functioning correctly.
  2. Scalability – Ensure systems can handle growth in users and traffic.
  3. Performance – Maintain acceptable response times and user experience.
  4. Efficiency – Automate repetitive operational work.
  5. Risk Management – Balance innovation with system stability.

Key Concepts

Service Level Indicators (SLIs)

Metrics that measure service performance, such as:

  • Request latency
  • Error rate
  • Availability (uptime)

Service Level Objectives (SLOs)

Target values for SLIs, for example:

  • 99.9% monthly availability
  • 95% of requests completed within 200 ms
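As a rough illustration of how SLIs and SLOs relate, the sketch below computes two request-based SLIs from a small in-memory sample and compares them with targets. The `Request` record, the function names, and the sample data are all hypothetical. Availability is measured here as the fraction of successful requests, which is one common choice but not the same thing as time-based uptime.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability_sli(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Fraction of requests completed within threshold_ms."""
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= target


# Hypothetical sample: 96 fast successes, 3 slow successes, 1 failure.
sample = (
    [Request(120.0, True)] * 96
    + [Request(350.0, True)] * 3
    + [Request(900.0, False)]
)

print("availability SLI:", availability_sli(sample))
print("latency SLI:", latency_sli(sample))
print("latency SLO (95% within 200 ms) met:", meets_slo(latency_sli(sample), 0.95))
print("availability SLO (99.9%) met:", meets_slo(availability_sli(sample), 0.999))
```

In a real system these values would normally come from a metrics backend rather than a Python list; the point is only that an SLI is a measured ratio and an SLO is a threshold applied to it.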

Error Budgets

The acceptable amount of unreliability based on the SLO. If a service has a 99.9% uptime target, the remaining 0.1% downtime is the “error budget.” Teams can use this budget to release new features while managing risk.
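To make the arithmetic concrete, here is a minimal sketch that converts an availability target into minutes of allowed downtime. It assumes a 30-day window, and the function names are illustrative only.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
# After 10.8 minutes of downtime, about 75% of the budget is left.
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75
```

A team might, for example, slow down releases once the remaining fraction gets close to zero. That policy is a team decision and is not encoded in the sketch.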

Typical SRE Responsibilities

  • Monitoring and alerting
  • Incident response and on-call support
  • Capacity planning
  • Automation and tooling
  • Performance optimization
  • Disaster recovery planning
  • Postmortem analysis after outages

Example

Suppose an e-commerce website experiences outages during holiday sales.

An SRE team might:

  • Set up monitoring to detect issues early.
  • Automate server scaling when traffic increases.
  • Create alerts for high error rates.
  • Conduct postmortems after incidents to prevent recurrence.
  • Define an SLO of 99.95% availability.
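One way to connect the error-rate alert to the availability SLO in this example is a burn-rate check: how fast the service is consuming its error budget compared with the rate the SLO allows. The sketch below is an assumption about how such a check could look. The function names and the threshold of 10 are hypothetical, not values taken from the example.

```python
def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the allowed
    pace; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0
    observed_error_rate = errors / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_alert(errors, total, slo_target=0.9995, threshold=10.0):
    """Fire when the budget burns at least `threshold` times too fast.

    The threshold is an illustrative value, not a recommendation.
    """
    return burn_rate(errors, total, slo_target) >= threshold


# Holiday-sale traffic: 100 failed requests out of 10,000 in the window.
print(should_alert(errors=100, total=10_000))  # True  (burn rate about 20)
print(should_alert(errors=2, total=10_000))    # False (burn rate about 0.4)
```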

SRE vs. DevOps

  SRE                                                   | DevOps
  A specific engineering discipline                     | A broader culture and philosophy
  Focuses heavily on reliability metrics and automation | Focuses on collaboration between development and operations
  Uses concepts like SLOs and error budgets             | Encourages continuous delivery and shared responsibility

A common saying is: “SRE is one way to implement DevOps.”

Skills Commonly Required for SRE Roles

  • Linux/Unix administration
  • Cloud platforms (e.g., AWS, GCP)
  • Networking fundamentals
  • Programming (Python, Go, Java, etc.)
  • Containers and orchestration (Docker, Kubernetes)
  • Monitoring tools (Prometheus, Grafana)
  • CI/CD pipelines
  • Incident management

In short, SRE is about using engineering and automation to keep systems reliable while enabling rapid software delivery.

SRE for AI Hyperscalers

Traditional SRE focuses on keeping applications and services reliable. AI/hyperscale SRE extends this to the massive distributed compute, storage, networking, and accelerator infrastructure that powers AI training, inference, and cloud platforms.

Companies such as Google Cloud, Microsoft Azure, Amazon Web Services, CoreWeave, Graphcore, Nscale, and Crusoe all require variants of this role.

Traditional SRE vs. AI Infrastructure SRE

  Traditional SaaS SRE | AI Infrastructure SRE
  Web services         | GPU clusters
  APIs                 | Distributed training
  Databases            | AI storage systems
  Application latency  | GPU utilization
  Service availability | Training job success
  HTTP traffic         | RDMA/InfiniBand traffic
  Kubernetes           | Kubernetes + Slurm
  CPU metrics          | GPU metrics

A traditional SRE might ask:

Is the website available?

An AI SRE asks:

Why did a 2,000-GPU training run fail after 18 hours?


What AI SREs Actually Operate

Compute Layer

AI SREs manage clusters containing:

  • NVIDIA H100/H200/B200 GPUs
  • AMD MI300 GPUs
  • Graphcore IPUs
  • GPU servers
  • Bare-metal provisioning systems

Responsibilities:

  • Firmware management
  • GPU health validation
  • Hardware lifecycle management
  • Capacity planning
  • Cluster upgrades

Accelerator Fabric

Distributed AI workloads depend on high-bandwidth, ultra-low-latency networks.

Typical technologies:

  • NVIDIA InfiniBand
  • RoCEv2
  • RDMA
  • NVLink
  • NVSwitch

The SRE becomes part network engineer.

Example troubleshooting:

  • PFC misconfiguration
  • ECN tuning
  • Packet loss
  • Congestion hotspots
  • Link flaps
  • Switch failures

Because the GPUs in a distributed training job synchronize at every step, even a small amount of packet loss on one link can slow the entire job.


AI Storage

Training jobs may consume petabytes of data.

Storage technologies include:

  • VAST Data
  • DDN
  • Weka
  • Lustre
  • BeeGFS
  • Ceph
  • Object storage

Key metrics:

  • Throughput
  • IOPS
  • Metadata latency
  • GPU starvation

Common question:

Are GPUs idle because the storage cannot feed data fast enough?
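A back-of-the-envelope way to approach that question is to estimate the read throughput the job needs and compare it with what the storage system delivers. The sketch below assumes every sample is read from storage once, with no caching or prefetching. All names and numbers are hypothetical.

```python
def required_read_gb_per_s(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read throughput (GB/s) needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def is_storage_bound(required_gb_per_s, measured_gb_per_s):
    """True when storage delivers less than the job needs."""
    return measured_gb_per_s < required_gb_per_s


# Hypothetical job: 512 GPUs, 2,000 samples/s per GPU, 150 kB per sample.
needed = required_read_gb_per_s(512, 2_000, 150_000)
print(f"needed: {needed:.1f} GB/s")
print("storage bound at 100 GB/s:", is_storage_bound(needed, 100.0))
print("storage bound at 200 GB/s:", is_storage_bound(needed, 200.0))
```

If the measured throughput is below the estimate, idle GPUs are at least plausibly explained by storage. If it is well above, the bottleneck is more likely elsewhere, for example in data preprocessing or the network.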


Kubernetes

Modern AI environments frequently run:

  • Kubernetes
  • GPU Operator
  • KubeRay
  • Kubeflow
  • Volcano
  • Argo Workflows

SRE responsibilities:

  • Node lifecycle
  • Scheduling
  • Cluster upgrades
  • GPU sharing
  • Resource quotas
  • Autoscaling

HPC Schedulers

Many AI companies still use:

  • Slurm
  • PBS
  • LSF

Responsibilities:

  • Queue management
  • Fair-share policies
  • Accounting
  • Job placement
  • Multi-tenant isolation

Example:

A user requests 512 GPUs.

The SRE determines:

  • Where the GPUs exist
  • Whether they are healthy
  • Which fabric topology is optimal
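The three questions above can be sketched as a small placement function. This is a simplified illustration, not how Slurm or any other scheduler works internally: it assumes a hypothetical inventory in which each node carries a leaf-switch label, and it treats "optimal topology" as meaning "span as few leaf switches as possible".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    leaf: str        # leaf switch the node is attached to (hypothetical label)
    free_gpus: int
    healthy: bool


def place_job(nodes, gpus_needed):
    """Return {node_name: gpus_to_use}, or None if the request cannot be met.

    Greedy strategy: skip unhealthy nodes, then fill from the leaf switches
    with the most free GPUs first so the job spans as few leaves as possible.
    """
    by_leaf = defaultdict(list)
    for node in nodes:
        if node.healthy and node.free_gpus > 0:
            by_leaf[node.leaf].append(node)

    leaves = sorted(
        by_leaf.values(),
        key=lambda members: sum(n.free_gpus for n in members),
        reverse=True,
    )

    placement = {}
    remaining = gpus_needed
    for members in leaves:
        for node in sorted(members, key=lambda n: n.free_gpus, reverse=True):
            if remaining == 0:
                break
            take = min(node.free_gpus, remaining)
            placement[node.name] = take
            remaining -= take
        if remaining == 0:
            break
    return placement if remaining == 0 else None


inventory = [
    Node("a1", "leaf-a", 8, True),
    Node("a2", "leaf-a", 8, True),
    Node("b1", "leaf-b", 8, True),
    Node("b2", "leaf-b", 8, False),  # failed health validation
]
print(place_job(inventory, 16))  # {'a1': 8, 'a2': 8}
print(place_job(inventory, 32))  # None: only 24 healthy GPUs are free
```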

Observability at Hyperscale

This is often where senior AI SREs spend most of their time.

Observability includes:

Infrastructure Metrics

  • CPU
  • Memory
  • Disk
  • Network

GPU Telemetry

  • Temperature
  • Power draw
  • Memory usage
  • ECC errors
  • Utilization
  • NVLink bandwidth

Scheduler Metrics

  • Queue depth
  • Job success rate
  • Pending jobs
  • Resource fragmentation

Storage Metrics

  • Throughput
  • Latency
  • Metadata performance

Application Metrics

  • Training loss
  • Tokens/sec
  • Model throughput

Tools commonly include:

  • Prometheus
  • Grafana
  • Mimir
  • Loki
  • Tempo
  • OpenTelemetry


Reliability Challenges Unique to AI

Traditional SaaS outage:

Website unavailable.

AI outage:

Training run failed after consuming £100,000 of GPU time.

Examples:

GPU Failure

One GPU reports ECC errors.

Result:

  • NCCL failures
  • Job termination
  • Training restart
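A common mitigation is to check GPU health before a job is scheduled, so that a suspect device is drained rather than allowed to take a long run down with it. The sketch below shows the shape of such a check. The `GpuSample` fields and every threshold are illustrative assumptions, not vendor guidance, and in practice the readings would come from a telemetry exporter rather than hand-built objects.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_id: str
    ecc_uncorrectable: int  # uncorrectable ECC errors since last reset
    ecc_correctable: int    # correctable ECC errors since last reset
    temperature_c: float


def classify(sample, max_correctable=100, max_temp_c=90.0):
    """Return 'drain', 'warn', or 'ok' for one GPU.

    The default thresholds are placeholders chosen for illustration.
    """
    if sample.ecc_uncorrectable > 0:
        return "drain"  # take the GPU out of scheduling
    if sample.ecc_correctable > max_correctable:
        return "warn"
    if sample.temperature_c > max_temp_c:
        return "warn"
    return "ok"


fleet = [
    GpuSample("gpu-0", 0, 3, 65.0),
    GpuSample("gpu-1", 2, 40, 70.0),
    GpuSample("gpu-2", 0, 500, 72.0),
]
for gpu in fleet:
    print(gpu.gpu_id, classify(gpu))
```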

Network Congestion

One leaf switch becomes congested.

Result:

  • Collective operations slow
  • GPUs wait on synchronization

Storage Bottleneck

Data pipeline cannot deliver training data fast enough.

Result:

  • GPU utilization drops from 95% to 40%

Scheduler Fragmentation

Enough GPUs exist overall but not in contiguous groups.

Result:

  • Large jobs cannot start
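The effect is easy to reproduce with a toy model. The sketch below assumes a job must fit entirely inside one group (a rack, a leaf switch, or an NVLink domain, depending on the cluster) and uses a simple, hypothetical fragmentation score. Real schedulers apply richer placement rules.

```python
def total_free(free_by_group):
    """Free GPUs across the whole cluster."""
    return sum(free_by_group.values())


def can_start(free_by_group, gpus_needed):
    """True if some single group has enough free GPUs for the job."""
    return max(free_by_group.values(), default=0) >= gpus_needed


def fragmentation(free_by_group):
    """0.0 when all free GPUs sit in one group; approaches 1.0 as they scatter."""
    total = total_free(free_by_group)
    if total == 0:
        return 0.0
    return 1.0 - max(free_by_group.values()) / total


# Hypothetical cluster: 8 GPUs are free overall, but no group has more than 3.
free = {"rack-1": 3, "rack-2": 2, "rack-3": 3}
print(total_free(free) >= 8)  # True: enough GPUs exist overall
print(can_start(free, 8))     # False: no single group can host the job
print(fragmentation(free))    # 0.625
```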

AI SRE Success Metrics

Instead of focusing only on uptime, AI companies often care about:

  Metric                           | Why
  GPU Utilization                  | GPUs are extremely expensive
  Job Success Rate                 | Failed jobs waste money
  Training Throughput              | Faster model development
  Cluster Efficiency               | Revenue and cost control
  Mean Time To Recovery            | Minimize lost compute time
  Resource Availability            | Customer satisfaction
  Infrastructure Cost per GPU Hour | Business profitability
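Several of these metrics reduce to simple ratios over job and incident records. The sketch below computes three of them from hypothetical data. The `Job` record and the function names are assumptions, and the wasted-hours figure is an upper bound because it treats all work from a failed job as lost.

```python
from dataclasses import dataclass


@dataclass
class Job:
    gpus: int
    hours: float
    succeeded: bool


def job_success_rate(jobs):
    """Fraction of jobs that finished successfully."""
    return sum(j.succeeded for j in jobs) / len(jobs)


def wasted_gpu_hours(jobs):
    """GPU-hours consumed by failed jobs (upper bound on lost work)."""
    return sum(j.gpus * j.hours for j in jobs if not j.succeeded)


def mean_time_to_recovery(recovery_minutes):
    """Average minutes from detection to recovery across incidents."""
    return sum(recovery_minutes) / len(recovery_minutes)


jobs = [
    Job(gpus=8, hours=2.0, succeeded=True),
    Job(gpus=2000, hours=18.0, succeeded=False),  # the 2,000-GPU run that failed
    Job(gpus=64, hours=5.0, succeeded=True),
    Job(gpus=16, hours=1.0, succeeded=True),
]
print(job_success_rate(jobs))               # 0.75
print(wasted_gpu_hours(jobs))               # 36000.0
print(mean_time_to_recovery([30, 60, 90]))  # 60.0
```

The example shows why success rate alone can mislead: three of four jobs succeeded, yet the single failure accounts for almost all of the GPU-hours consumed.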

What Hyperscalers Want

Companies such as Graphcore, CoreWeave, Nscale, and Crusoe need SREs to know:

  1. Linux internals
  2. Kubernetes
  3. Networking
  4. GPU infrastructure
  5. Storage systems
  6. Automation (Python/Go)
  7. Observability
  8. Incident management
  9. Capacity engineering
  10. Distributed systems

At senior levels, the role becomes less about fixing servers and more about answering questions such as:

  • Why are GPUs underutilized?
  • Why are training jobs failing?
  • How can we increase cluster efficiency?
  • How do we observe 100,000+ accelerators?
  • How do we automate fleet operations?
  • How do we reduce cost per training run?