Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to IT operations. Its goal is to create and maintain highly reliable, scalable, and efficient systems.
SRE was pioneered by Google to address the challenge of operating large-scale services while balancing reliability with the pace of feature development.
Core Objectives of SRE
- Reliability – Keep services available and functioning correctly.
- Scalability – Ensure systems can handle growth in users and traffic.
- Performance – Maintain acceptable response times and user experience.
- Efficiency – Automate repetitive operational work.
- Risk Management – Balance innovation with system stability.
Key Concepts
Service Level Indicators (SLIs)
Metrics that measure service performance, such as:
- Request latency
- Error rate
- Availability (uptime)
Service Level Objectives (SLOs)
Target values for SLIs, for example:
- 99.9% monthly availability
- 95% of requests completed within 200 ms
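As a rough illustration of how an SLI is measured against an SLO, the sketch below computes an error rate and the fraction of requests completed within 200 ms from a list of request records. The `Request` type and the function names are hypothetical and exist only for this example; real systems would normally derive these values from a metrics backend.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def error_rate(requests: List[Request]) -> float:
    """SLI: fraction of requests that failed."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def fraction_within(requests: List[Request], threshold_ms: float) -> float:
    """SLI: fraction of requests completed within threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def meets_latency_slo(requests: List[Request],
                      threshold_ms: float = 200.0,
                      target: float = 0.95) -> bool:
    """SLO check: at least `target` of requests finish within threshold_ms."""
    return fraction_within(requests, threshold_ms) >= target
```

With 19 fast, successful requests and one slow, failed request, the latency SLO above is just met (19 of 20 within 200 ms) and the error rate is 5%.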
Error Budgets
The acceptable amount of unreliability implied by the SLO. If a service has a 99.9% uptime target, the remaining 0.1% of downtime (about 43 minutes in a 30-day month) is the “error budget.” Teams can spend this budget on releasing new features while managing risk.
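The arithmetic behind an error budget can be sketched in a few lines. The function names and the 30-day window are assumptions made for this example, not part of any standard API.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after the downtime already observed in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9995), 1))  # 21.6
```

A negative result from `budget_remaining_minutes` means the budget for the window is already exhausted.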
Typical SRE Responsibilities
- Monitoring and alerting
- Incident response and on-call support
- Capacity planning
- Automation and tooling
- Performance optimization
- Disaster recovery planning
- Postmortem analysis after outages
Example
Suppose an e-commerce website experiences outages during holiday sales.
An SRE team might:
- Set up monitoring to detect issues early.
- Automate server scaling when traffic increases.
- Create alerts for high error rates.
- Conduct postmortems after incidents to prevent recurrence.
- Define an SLO of 99.95% availability.
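One way to picture the "alerts for high error rates" step is a sliding-window check over recent request outcomes. This is a minimal sketch: the class name, window size, threshold, and minimum sample count are all illustrative assumptions, and a production setup would more likely express the rule in its monitoring system.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests is too high."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> None:
        """Store one request outcome; the oldest outcome falls out of the window."""
        self.results.append(ok)

    def firing(self) -> bool:
        """Return True if the windowed error rate exceeds the threshold."""
        if len(self.results) < self.min_samples:
            return False  # too little data to judge
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results) > self.threshold
```

The `min_samples` guard is a design choice to avoid paging on a handful of requests during quiet periods.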
SRE vs. DevOps
| SRE | DevOps |
|---|---|
| A specific engineering discipline | A broader culture and philosophy |
| Focuses heavily on reliability metrics and automation | Focuses on collaboration between development and operations |
| Uses concepts like SLOs and error budgets | Encourages continuous delivery and shared responsibility |
A common saying is: “SRE is one way to implement DevOps.”
Skills Commonly Required for SRE Roles
- Linux/Unix administration
- Cloud platforms (e.g., AWS, GCP)
- Networking fundamentals
- Programming (Python, Go, Java, etc.)
- Containers and orchestration (Docker, Kubernetes)
- Monitoring tools (Prometheus, Grafana)
- CI/CD pipelines
- Incident management
In short, SRE is about using engineering and automation to keep systems reliable while enabling rapid software delivery.
SRE for AI Hyperscalers
Traditional SRE focuses on keeping applications and services reliable. AI/Hyperscale SRE extends this into managing massive distributed compute, storage, networking, and accelerator infrastructure that powers AI training, inference, and cloud platforms.
Companies such as Google Cloud, Microsoft Azure, Amazon Web Services, CoreWeave, Graphcore, Nscale, and Crusoe all require variants of this role.
Traditional SRE vs. AI Infrastructure SRE
| Traditional SaaS SRE | AI Infrastructure SRE |
|---|---|
| Web services | GPU clusters |
| APIs | Distributed training |
| Databases | AI storage systems |
| Application latency | GPU utilization |
| Service availability | Training job success |
| HTTP traffic | RDMA/InfiniBand traffic |
| Kubernetes | Kubernetes + Slurm |
| CPU metrics | GPU metrics |
A traditional SRE might ask:
Is the website available?
An AI SRE asks:
Why did a 2,000-GPU training run fail after 18 hours?
What AI SREs Actually Operate
Compute Layer
AI SREs manage clusters containing:
- NVIDIA H100/H200/B200 GPUs
- AMD MI300 GPUs
- Graphcore IPUs
- GPU servers
- Bare-metal provisioning systems
Responsibilities:
- Firmware management
- GPU health validation
- Hardware lifecycle management
- Capacity planning
- Cluster upgrades
Accelerator Fabric
AI workloads depend on ultra-low-latency networks.
Typical technologies:
- NVIDIA InfiniBand
- RoCEv2
- RDMA
- NVLink
- NVSwitch
The SRE becomes part network engineer.
Example troubleshooting:
- PFC misconfiguration
- ECN tuning
- Packet loss
- Congestion hotspots
- Link flaps
- Switch failures
Because collective operations synchronize every GPU in a job, even a small amount of packet loss on one link can slow an entire distributed training job.
AI Storage
Training jobs may consume petabytes of data.
Storage technologies include:
- VAST Data
- DDN
- Weka
- Lustre
- BeeGFS
- Ceph
- Object storage
Key metrics:
- Throughput
- IOPS
- Metadata latency
- GPU starvation
Common question:
Are GPUs idle because the storage cannot feed data fast enough?
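A back-of-the-envelope estimate can help answer that question. The sketch below assumes each GPU consumes a fixed number of samples per second at a fixed sample size, with no caching; the function names and example numbers are hypothetical.

```python
def required_read_gb_per_s(num_gpus: int,
                           samples_per_sec_per_gpu: float,
                           bytes_per_sample: float) -> float:
    """Aggregate read throughput (GB/s) needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def utilization_ceiling(required_gb_per_s: float,
                        delivered_gb_per_s: float) -> float:
    """Upper bound on GPU utilization if storage is the only limit."""
    if required_gb_per_s <= 0:
        return 1.0
    return min(1.0, delivered_gb_per_s / required_gb_per_s)


# Illustrative numbers only: 512 GPUs, 2,000 samples/s each, 150 kB per sample.
need = required_read_gb_per_s(512, 2000, 150_000)
print(round(need, 1))                                # 153.6
print(round(utilization_ceiling(need, 61.44), 2))    # 0.4
```

If the measured storage throughput is well below the required figure, the GPUs are likely starved for data rather than compute-bound.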
Kubernetes
Modern AI environments frequently run:
- Kubernetes
- GPU Operator
- KubeRay
- Kubeflow
- Volcano
- Argo Workflows
SRE responsibilities:
- Node lifecycle
- Scheduling
- Cluster upgrades
- GPU sharing
- Resource quotas
- Autoscaling
HPC Schedulers
Many AI companies still use:
- Slurm
- PBS
- LSF
Responsibilities:
- Queue management
- Fair-share policies
- Accounting
- Job placement
- Multi-tenant isolation
Example:
A user requests 512 GPUs.
The SRE determines:
- Where the GPUs exist
- Whether they are healthy
- Which fabric topology is optimal
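The three questions above can be sketched as a simple placement routine. This is a greedy illustration, not how Slurm or any particular scheduler works: the `Node` type, its fields, and the rule "prefer the leaf switches with the most free GPUs" are assumptions for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    leaf: str        # leaf switch the node is attached to
    free_gpus: int
    healthy: bool


def place_job(nodes: List[Node],
              gpus_needed: int) -> Optional[List[Tuple[str, int]]]:
    """Pick (node, gpu_count) pairs on healthy nodes, spanning few leaf switches."""
    by_leaf = defaultdict(list)
    for node in nodes:
        if node.healthy and node.free_gpus > 0:
            by_leaf[node.leaf].append(node)

    # Fill the leaves with the most free GPUs first so the job spans few switches.
    groups = sorted(by_leaf.values(),
                    key=lambda g: sum(n.free_gpus for n in g),
                    reverse=True)

    chosen, remaining = [], gpus_needed
    for group in groups:
        for node in group:
            if remaining <= 0:
                break
            take = min(node.free_gpus, remaining)
            chosen.append((node.name, take))
            remaining -= take
        if remaining <= 0:
            break
    return chosen if remaining <= 0 else None  # None: not enough healthy GPUs
```

Unhealthy nodes are excluded before placement, so a request can fail even when the raw GPU count looks sufficient.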
Observability at Hyperscale
This is often where senior AI SREs spend most of their time.
Observability includes:
Infrastructure Metrics
- CPU
- Memory
- Disk
- Network
GPU Telemetry
- Temperature
- Power draw
- Memory usage
- ECC errors
- Utilization
- NVLink bandwidth
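To show how such telemetry might feed a health check, here is a minimal sketch that flags a GPU from one telemetry sample. The dictionary keys and the thresholds are hypothetical; they are not the field names of DCGM or any other real exporter, and appropriate limits depend on the hardware.

```python
from typing import Dict, List


def gpu_health_issues(sample: Dict[str, float],
                      max_temp_c: float = 85.0,
                      min_busy_util_pct: float = 10.0) -> List[str]:
    """Return a list of issue labels for one GPU telemetry sample."""
    issues = []
    if sample["temperature_c"] > max_temp_c:
        issues.append("over_temperature")
    if sample["ecc_uncorrectable"] > 0:
        issues.append("ecc_errors")
    # A GPU that is allocated to a job but nearly idle may be stalled or starved.
    if sample["job_running"] and sample["utilization_pct"] < min_busy_util_pct:
        issues.append("idle_while_allocated")
    return issues
```

An empty list means the sample passed every check; any label would typically be grounds to drain the node for further validation.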
Scheduler Metrics
- Queue depth
- Job success rate
- Pending jobs
- Resource fragmentation
Storage Metrics
- Throughput
- Latency
- Metadata performance
Application Metrics
- Training loss
- Tokens/sec
- Model throughput
Tools commonly include:
- Prometheus
- Grafana
- Mimir
- Loki
- Tempo
- OpenTelemetry
This aligns closely with the observability platforms you have built and operated.
Reliability Challenges Unique to AI
Traditional SaaS outage:
Website unavailable.
AI outage:
Training run failed after consuming £100,000 of GPU time.
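The cost of a failed run can be estimated with simple arithmetic. The sketch below assumes a flat price per GPU-hour (the £2.80 figure is purely illustrative) and that work since the last checkpoint is lost; checkpointing is an assumption added for this example, not something the scenario above specifies.

```python
from typing import Optional


def hours_lost(hours_run: float,
               checkpoint_interval_hours: Optional[float] = None) -> float:
    """Hours of work lost at failure time; all of it if there are no checkpoints."""
    if checkpoint_interval_hours is None:
        return hours_run
    return hours_run % checkpoint_interval_hours


def lost_compute_cost(num_gpus: int, lost_hours: float,
                      price_per_gpu_hour: float) -> float:
    """Cost of the GPU time that must be redone."""
    return num_gpus * lost_hours * price_per_gpu_hour


PRICE = 2.80  # assumed price per GPU-hour, in GBP

no_ckpt = lost_compute_cost(2000, hours_lost(18), PRICE)
with_ckpt = lost_compute_cost(2000, hours_lost(18, 4), PRICE)
print(round(no_ckpt), round(with_ckpt))
```

Under these assumptions, a 2,000-GPU run that fails after 18 hours loses roughly £100,800 without checkpoints and roughly £11,200 with a checkpoint every 4 hours.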
Examples:
GPU Failure
One GPU reports ECC errors.
Result:
- NCCL failures
- Job termination
- Training restart
Network Congestion
One leaf switch becomes congested.
Result:
- Collective operations slow
- GPUs wait on synchronization
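The reason one congested switch hurts the whole job is that a synchronous training step finishes only when the slowest rank finishes. A minimal sketch of that effect, using made-up per-rank timings:

```python
import statistics
from typing import List


def step_time_ms(per_rank_ms: List[float]) -> float:
    """A synchronous step takes as long as its slowest rank."""
    return max(per_rank_ms)


def straggler_slowdown(per_rank_ms: List[float]) -> float:
    """How much slower the step is than a typical (median) rank."""
    return step_time_ms(per_rank_ms) / statistics.median(per_rank_ms)


# Seven ranks take 100 ms; one rank behind a congested switch takes 250 ms.
timings = [100.0] * 7 + [250.0]
print(step_time_ms(timings))        # 250.0
print(straggler_slowdown(timings))  # 2.5
```

Here a single slow rank makes every GPU in the job run at 2.5 times the typical step time.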
Storage Bottleneck
Data pipeline cannot deliver training data fast enough.
Result:
- GPU utilization drops from 95% to 40%
Scheduler Fragmentation
Enough GPUs exist overall but not in contiguous groups.
Result:
- Large jobs cannot start
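Fragmentation can be made concrete with a small check. The sketch assumes, for illustration only, that a job must fit entirely inside one group of GPUs (for example one pod or fabric island); the group names and the fragmentation ratio defined here are hypothetical.

```python
from typing import Dict


def can_start(free_by_group: Dict[str, int], gpus_needed: int) -> bool:
    """True if any single group has enough free GPUs for the whole job."""
    return any(free >= gpus_needed for free in free_by_group.values())


def fragmentation(free_by_group: Dict[str, int]) -> float:
    """Share of free GPUs that lie outside the largest free group (0 = none)."""
    total = sum(free_by_group.values())
    if total == 0:
        return 0.0
    return 1.0 - max(free_by_group.values()) / total


free = {"pod-a": 96, "pod-b": 64, "pod-c": 112}
print(sum(free.values()))      # 272 free GPUs in total
print(can_start(free, 128))    # False
```

In this example 272 GPUs are free, yet a 128-GPU job cannot start because no single group has more than 112.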
AI SRE Success Metrics
Instead of focusing only on uptime, AI companies often care about:
| Metric | Why |
|---|---|
| GPU Utilization | GPUs are extremely expensive |
| Job Success Rate | Failed jobs waste money |
| Training Throughput | Faster model development |
| Cluster Efficiency | Revenue and cost control |
| Mean Time To Recovery | Minimize lost compute time |
| Resource Availability | Customer satisfaction |
| Infrastructure Cost per GPU Hour | Business profitability |
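A few of these metrics reduce to short formulas. The sketch below uses hypothetical function names and simplified definitions; in particular, "cost per useful GPU-hour" here simply divides total cost by the GPU-hours that were actually utilized.

```python
from typing import List


def job_success_rate(succeeded: int, total: int) -> float:
    """Fraction of submitted jobs that completed successfully."""
    return succeeded / total if total else 1.0


def mean_time_to_recovery(recovery_minutes: List[float]) -> float:
    """Average minutes from failure to restored service across incidents."""
    if not recovery_minutes:
        return 0.0
    return sum(recovery_minutes) / len(recovery_minutes)


def cost_per_useful_gpu_hour(total_cost: float, gpu_hours_available: float,
                             utilization: float) -> float:
    """Infrastructure cost divided by GPU-hours that did useful work."""
    useful_hours = gpu_hours_available * utilization
    if useful_hours <= 0:
        raise ValueError("no useful GPU-hours in this period")
    return total_cost / useful_hours
```

The last function shows why utilization matters commercially: with the same total cost, halving utilization doubles the cost of each useful GPU-hour.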
What Hyperscalers Want
Companies such as Graphcore, CoreWeave, Nscale, and Crusoe need SREs to know:
- Linux internals
- Kubernetes
- Networking
- GPU infrastructure
- Storage systems
- Automation (Python/Go)
- Observability
- Incident management
- Capacity engineering
- Distributed systems
At senior levels, the role becomes less about fixing servers and more about answering questions such as:
- Why are GPUs underutilized?
- Why are training jobs failing?
- How can we increase cluster efficiency?
- How do we observe 100,000+ accelerators?
- How do we automate fleet operations?
- How do we reduce cost per training run?