Scripting, Programming and Development for SRE

Shell Scripting for SRE: From Beginner to Advanced

1. What Is A Shell?

User → Shell (bash / zsh) → Linux Kernel → Hardware

Shell acts as:

  • Command interpreter
  • Automation engine
  • Glue between tools
  • System administration interface

Common shells:

Shell   Purpose
Bash    Industry standard
Zsh     Modern interactive shell
Sh      POSIX shell
Fish    User friendly
Ksh     Enterprise UNIX

Key SRE point:

Most production automation still relies heavily on Bash.


2. Shell Basics

Variables

NAME="Son"
echo "$NAME"

Output:

Son

Reading Input

read -r HOST
echo "$HOST"

Environment Variables

export ENV=prod

View:

env
printenv

Common examples:

PATH
HOME
USER
SHELL
HOSTNAME

3. Command Substitution

Capture command output.

DATE=$(date)

or

HOST=$(hostname)

4. Arithmetic

COUNT=$((1+1))
echo $COUNT

Output:

2

5. Conditions

If Statement

if [ -f file.txt ]
then
    echo "Exists"
fi

Comparison Operators

Operator   Meaning
-eq        equal
-ne        not equal
-gt        greater than
-lt        less than
-ge        greater than or equal
-le        less than or equal

String Tests

if [ "$ENV" = "prod" ]; then echo "Production"; fi

6. Loops

For Loop

for i in {1..5}
do
    echo "$i"
done

While Loop

while IFS= read -r host
do
    echo "$host"
done < hosts.txt

7. Functions

Reusable code.

backup() {
    echo "Running backup"
}

Call:

backup

8. Arrays

HOSTS=("web1" "web2" "web3")

Access:

echo "${HOSTS[0]}"

Loop:

for h in "${HOSTS[@]}"; do echo "$h"; done

Intermediate SRE Skills


9. Exit Codes

Every command returns:

0 = success
1-255 = failure

Example:

systemctl restart nginx

if [ $? -ne 0 ]
then
    echo "Failed"
fi

Better:

if systemctl restart nginx
then
    echo OK
fi

10. Pipes

Linux superpower.

ps aux | grep nginx
kubectl get pods | grep Error
journalctl | grep timeout

11. Redirection

Operator   Meaning
>          redirect stdout (overwrite)
>>         redirect stdout (append)
2>         redirect stderr
&>         redirect both stdout and stderr (Bash)

Example:

./backup.sh > backup.log 2>&1

12. Text Processing

grep

grep ERROR logfile

awk

awk '{print $1}'

sed

sed 's/prod/dev/g'

sort

sort file

uniq (collapses adjacent duplicate lines only, so sort first)

sort file | uniq

13. File Processing

find /var/log -name "*.log"
xargs
cut
tr

14. Process Management

View:

ps aux

Monitor:

top
htop

Kill:

kill
pkill
killall

15. Logging

logger "Backup started"

Syslog:

/var/log/messages
journalctl

Advanced SRE Shell


16. Strict Mode

Production scripts should start with:

#!/usr/bin/env bash

set -euo pipefail

Meaning:

-e = exit immediately when a command fails
-u = treat unset variables as an error
-o pipefail = a pipeline fails if any command in it fails, not only the last

17. Traps

Cleanup handling.

trap cleanup EXIT

Example:

cleanup() {
    rm -f /tmp/file
}

18. Error Handling

if ! command
then
    exit 1
fi

19. Parallel Execution

job1 &
job2 &
wait

20. Lock Files

Prevent multiple runs.

flock

Example:

flock -n /tmp/script.lock ./backup.sh

21. API Calls

Modern SRE automation.

curl

Example:

curl https://api.example.com

JSON:

jq

Example:

curl -s https://api.example.com | jq .

22. Kubernetes Automation

kubectl get pods

Loop:

for pod in $(kubectl get pods -o name); do echo "$pod"; done

23. SSH Automation

ssh host command

Parallel:

parallel-ssh

or

for host in $(cat hosts.txt); do ssh "$host" uptime; done

24. Monitoring Scripts

Disk usage:

df -h

Memory:

free -m

CPU:

mpstat

Network:

ss -tulpn

25. Production Shell Script Structure

#!/usr/bin/env bash

set -euo pipefail

log()
{
    echo "$(date) $*"
}

check_prereqs()
{
    :
}

main()
{
    check_prereqs
    log "Starting"
}

main "$@"

Shell Scripting Tools Every SRE Should Know

Category          Tools
Text Processing   grep, awk, sed, tr, cut
JSON              jq
YAML              yq
HTTP              curl
Debugging         set -x, bash -x
Parallelism       xargs -P, GNU parallel
Scheduling        cron, systemd timers
Monitoring        top, htop, vmstat, iostat
Networking        ss, netstat, tcpdump
Kubernetes        kubectl
Containers        docker, podman

Senior SRE Shell Interview Topics

Beginner

  • Variables
  • Loops
  • Functions
  • Conditions

Intermediate

  • Pipes
  • Redirection
  • grep/awk/sed
  • Exit codes

Advanced

  • set -euo pipefail
  • traps
  • flock
  • process substitution
  • arrays
  • xargs -P
  • jq
  • API automation

Expert

  • Production-grade shell frameworks
  • Kubernetes automation
  • CI/CD shell pipelines
  • Parallel execution
  • Distributed operations via SSH
  • Resilient error handling
  • Observability automation

Key Takeaway

A strong SRE uses shell as the operating system automation language:

Linux → Shell → Automation → Monitoring → Troubleshooting → Platform Reliability

Shell Fundamentals → Automation → Text Processing → System Administration → Kubernetes → Production Scripting → Senior SRE Techniques

Python Programming for SRE

From Shell-Like Scripts to Reusable Automation Libraries

Python is the SRE language for automation, APIs, data handling, tooling, integrations, and reusable operational code.


1. Python as “Better Shell”

Simple Python scripts often replace shell scripts when logic grows.

Shell-style task:

for host in $(cat hosts.txt); do
    ping -c 1 "$host"
done

Python equivalent:

from pathlib import Path
import subprocess

for host in Path("hosts.txt").read_text().splitlines():
    result = subprocess.run(["ping", "-c", "1", host])
    print(host, result.returncode)

Use Python when you need:

Shell is fine for    Python is better for
Simple commands      Complex logic
Pipes                APIs and JSON
One-liners           Error handling
Small glue scripts   Reusable tools
Quick checks         Maintainable automation
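
The "Error handling" row is where the difference tends to show first. As a minimal sketch, assuming a hypothetical helper called run_cmd (not part of these notes), a command can be run with a timeout and its exit code inspected instead of being ignored:

```python
import subprocess
import sys


def run_cmd(cmd: list[str], timeout: float = 10.0) -> tuple[int, str]:
    """Run a command and return (exit code, stripped stdout).

    run_cmd is a hypothetical helper name used for illustration.
    Exit codes 124 (timeout) and 127 (command not found) follow
    common shell conventions; they are a choice, not a requirement.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # inspect returncode ourselves instead of raising
        )
    except subprocess.TimeoutExpired:
        return 124, ""
    except FileNotFoundError:
        return 127, ""
    return result.returncode, result.stdout.strip()


# The current interpreter is used as a portable stand-in for a real command.
code, out = run_cmd([sys.executable, "-c", "print('ok')"])
print(code, out)
```

In a fleet script the same helper could wrap ping or ssh, with the caller deciding which exit codes count as failures.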

2. Beginner Python for SRE

Core concepts:

name = "blusas"
hosts = ["web1", "web2", "db1"]

for host in hosts:
    print(f"Checking {host}")

Useful basics:

Concept         SRE Use
Variables       Store config
Lists / dicts   Hosts, services, metadata
Loops           Repeat checks
Functions       Reusable actions
Files           Read inventories/logs
Exceptions      Handle failures
Exit codes      CI/CD and cron jobs

3. Python Script Structure

A production-friendly script should look like this:

#!/usr/bin/env python3

import argparse
import logging
import sys


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", required=True)
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO)
    logging.info("Checking host %s", args.host)

    return 0


if __name__ == "__main__":
    sys.exit(main())

Why this matters:

Pattern      Benefit
main()       Clean entry point
argparse     CLI options
logging      Operational visibility
sys.exit()   Correct exit codes
Type hints   Easier review
Functions    Easier testing
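
To make the sys.exit() and "easier testing" rows concrete, here is a small sketch of a check whose main() returns a non-zero code on failure. The function free_percent and the flags --path and --min-free-pct are hypothetical names chosen for this illustration:

```python
import argparse
import logging
import shutil
from typing import Optional, Sequence


def free_percent(path: str) -> float:
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some virtual filesystems report no capacity
        return 0.0
    return usage.free / usage.total * 100


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Return 0 when enough space is free, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Fail when free disk space is low")
    parser.add_argument("--path", default="/")
    parser.add_argument("--min-free-pct", type=float, default=10.0)
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv[1:]

    logging.basicConfig(level=logging.INFO)
    free = free_percent(args.path)
    if free < args.min_free_pct:
        logging.error("%s: %.1f%% free, below %.1f%%", args.path, free, args.min_free_pct)
        return 1  # a non-zero exit code marks the cron or CI job as failed
    logging.info("%s: %.1f%% free", args.path, free)
    return 0


# Passing argv explicitly lets tests call main() without a subprocess.
exit_code = main(["--path", ".", "--min-free-pct", "0"])
print(exit_code)
```

In a real script the last two lines would be replaced by the usual `if __name__ == "__main__": sys.exit(main())` guard, so the return value becomes the process exit code.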

4. Environments and Dependencies

Never rely only on “whatever Python is installed”.

Use virtual environments:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install requests pyyaml

Freeze dependencies:

pip freeze > requirements.txt

Install later:

pip install -r requirements.txt

Modern project layout:

sre-toolkit/
    pyproject.toml
    src/
        sre_toolkit/
            __init__.py
            checks.py
            clients.py
    tests/
        test_checks.py

5. Essential SRE Modules

Module               Use
os                   Environment variables
sys                  Exit codes, arguments
pathlib              File paths
subprocess           Run shell commands
json                 Parse API responses
yaml / pyyaml        Kubernetes/config files
requests             HTTP APIs
logging              Logs
argparse             CLI tools
datetime             Timestamps
concurrent.futures   Parallel checks
asyncio              Async network tasks
paramiko             SSH automation
boto3                AWS automation
kubernetes           Kubernetes API
prometheus_client    Export metrics

6. API Automation

Python is excellent for API-driven operations.

import requests

response = requests.get("https://api.example.com/health", timeout=5)
response.raise_for_status()

data = response.json()
print(data)

Typical SRE uses:

API Target       Example
Kubernetes API   Check pods, nodes, events
Grafana API      Create dashboards
Prometheus API   Query alerts and metrics
GitLab API       Pipeline automation
Cloud APIs       Provision or inspect infra
Storage APIs     Health checks and capacity
Redfish/IPMI     Hardware automation
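
Most of these APIs return JSON that still has to be interpreted. As a sketch for the Prometheus row, the function below (down_instances, a hypothetical name) reads an instant-query response and lists the targets whose value is 0. The sample payload follows the general shape of the Prometheus HTTP API response, but the instance names and timestamps are made up:

```python
import json

# Illustrative payload shaped like a Prometheus instant-query response
# for a query such as `up`; instances and timestamps are invented.
SAMPLE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "web1:9100"}, "value": [1700000000, "1"]},
      {"metric": {"__name__": "up", "instance": "db1:9100"}, "value": [1700000000, "0"]}
    ]
  }
}
"""


def down_instances(payload: dict) -> list[str]:
    """Return instances whose sample value is 0 in an instant-vector result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


print(down_instances(json.loads(SAMPLE)))
```

With a live server, the dictionary passed to down_instances would come from response.json() instead of the embedded sample.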

7. Error Handling

Bad automation fails silently. Good automation fails clearly.

import logging
import requests

try:
    r = requests.get("https://api.example.com/health", timeout=5)
    r.raise_for_status()
except requests.Timeout:
    logging.error("API timed out")
except requests.HTTPError as err:
    logging.error("API returned HTTP error: %s", err)
except requests.RequestException as err:
    logging.error("API request failed: %s", err)

8. Functions for Reuse

Instead of copying code:

import requests

def check_http(url: str) -> bool:
    response = requests.get(url, timeout=5)
    return response.status_code == 200

Then reuse:

services = [
    "https://grafana.example.com",
    "https://prometheus.example.com",
]

for service in services:
    print(service, check_http(service))

9. Classes and Methods

Classes help package operational logic so other engineers can reuse it.

import requests


class HealthClient:
    def __init__(self, base_url: str, timeout: int = 5):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def check(self) -> bool:
        response = requests.get(
            f"{self.base_url}/health",
            timeout=self.timeout,
        )
        return response.status_code == 200

    def version(self) -> str:
        response = requests.get(
            f"{self.base_url}/version",
            timeout=self.timeout,
        )
        response.raise_for_status()
        return response.json()["version"]

Use it:

client = HealthClient("https://grafana.example.com")

if client.check():
    print(client.version())

This is how small scripts evolve into shared SRE libraries.
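
A shared library also needs to be testable without a live service. One possible approach, sketched here with a hypothetical TestableHealthClient and a hypothetical fetch_status callable, is to inject the HTTP call so that a Mock from the standard library can stand in for the network:

```python
from typing import Callable
from unittest.mock import Mock


class TestableHealthClient:
    """Variant of the health client idea with the HTTP call injected.

    fetch_status is a hypothetical callable (url, timeout) -> HTTP status
    code. In production it could wrap requests.get; in tests it is a Mock.
    """

    def __init__(
        self,
        base_url: str,
        fetch_status: Callable[[str, int], int],
        timeout: int = 5,
    ):
        self.base_url = base_url.rstrip("/")
        self.fetch_status = fetch_status
        self.timeout = timeout

    def check(self) -> bool:
        try:
            status = self.fetch_status(f"{self.base_url}/health", self.timeout)
        except OSError:
            return False  # connection-level errors count as unhealthy
        return status == 200


# The Mock replaces the network, so no real service is contacted.
fake_fetch = Mock(return_value=200)
client = TestableHealthClient("https://grafana.example.com/", fake_fetch)
print(client.check())
fake_fetch.assert_called_once_with("https://grafana.example.com/health", 5)
```

Injecting the dependency is a design choice; patching the HTTP library with unittest.mock.patch is a common alternative when the class cannot be changed.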


10. Advanced Python for SRE

Advanced Skill      SRE Value
Packaging           Share tools internally
Unit testing        Safer automation
Mocking             Test APIs without real systems
Type hints          Cleaner reviews
Dataclasses         Structured config/models
Concurrency         Faster fleet checks
Async IO            Efficient API/network automation
Retries/backoff     Resilient automation
Plugins             Extensible tooling
CI/CD integration   Quality gates
Metrics export      Build custom exporters

Example dataclass:

from dataclasses import dataclass


@dataclass
class Service:
    name: str
    url: str
    owner: str
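
As a usage sketch, a dataclass like this can turn loosely structured inventory data into typed objects. The load_services helper and the inventory entries below are hypothetical, and frozen=True is an optional choice that makes instances read-only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    url: str
    owner: str


def load_services(raw: list[dict]) -> list[Service]:
    """Build Service objects; a missing or unexpected key raises TypeError."""
    return [Service(**entry) for entry in raw]


# Hypothetical inventory, e.g. the result of parsing a YAML or JSON file.
inventory = [
    {"name": "grafana", "url": "https://grafana.example.com", "owner": "observability"},
    {"name": "prometheus", "url": "https://prometheus.example.com", "owner": "observability"},
]

services = load_services(inventory)
by_name = {svc.name: svc for svc in services}
print(by_name["grafana"].url)
```

Because construction fails on a missing field, a malformed inventory entry is caught when the file is loaded instead of halfway through a run.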

Example parallel checks:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as pool:
    results = pool.map(check_http, services)
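
The snippet above assumes check_http and services already exist and lets any exception from a single target propagate. A slightly fuller sketch, using a hypothetical check_all helper and a fake check function so it runs without network access, could look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def check_all(
    targets: list[str],
    check: Callable[[str], bool],
    max_workers: int = 10,
) -> dict[str, bool]:
    """Run check on every target in parallel; an exception counts as unhealthy."""

    def safe_check(target: str) -> bool:
        try:
            return check(target)
        except Exception:  # one bad target must not abort the whole sweep
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(targets, pool.map(safe_check, targets)))


def fake_check(url: str) -> bool:
    """Stand-in for a real HTTP check so the sketch needs no network."""
    if "broken" in url:
        raise ConnectionError(url)
    return url.startswith("https://")


results = check_all(
    [
        "https://grafana.example.com",
        "http://legacy.example.com",
        "https://broken.example.com",
    ],
    fake_check,
)
print(results)
```

Threads suit this kind of I/O-bound work; max_workers bounds how many targets are contacted at once.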

11. Production SRE Python Checklist

Requirement             Why
Clear CLI arguments     Easy operation
Logging, not print      Better troubleshooting
Timeouts everywhere     Avoid hung scripts
Retries with backoff    Handle transient failure
Config files/env vars   Avoid hardcoding
Secrets outside code    Security
Unit tests              Safer changes
Type hints              Better maintainability
Exit codes              CI/CD compatible
README/runbook          Shareable with team
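
The "Retries with backoff" row is the one most often left out of scripts. A minimal standard-library sketch follows; the names retry and backoff_delays, the default delays, and the choice of "full jitter" are assumptions made for this illustration:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bounds for the sleeps between attempts: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


def retry(
    func: Callable[[], T],
    attempts: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (OSError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or the attempts are used up."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random time up to the exponential bound.
            sleep(random.uniform(0, delays[attempt]))
    raise ValueError("attempts must be at least 1")


print(backoff_delays(4))
```

Only errors listed in retry_on are retried, so programming mistakes such as a KeyError still fail immediately. The sleep parameter exists so tests can skip the real waiting.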

12. Senior SRE Python Use Cases

Area                Example
Incident response   Auto-collect logs, events, metrics
Kubernetes          Detect CrashLoopBackOff, bad nodes, failed jobs
Observability       Query Prometheus, create Grafana dashboards
Cloud automation    Audit resources, tags, cost, security
CI/CD               Validate deployment manifests
Storage             Capacity, replication, object checks
Networking          Ping, DNS, HTTP, TLS validation
Hardware            Redfish health and firmware inventory
Platform tooling    Internal CLIs and reusable libraries

Summary

Python starts like shell scripting, but grows into structured, reusable engineering automation.

The progression is:

Simple scripts
→ Functions
→ CLI tools
→ Modules
→ Classes
→ Packages
→ Shared SRE platforms

For SREs, Python is the bridge between manual operations and reliable platform automation.

Golang Development for SRE

From Using Go-Based Tools to Building Cloud-Native Platforms

Go (Golang) has become the de facto language of cloud infrastructure, Kubernetes, observability, service meshes, and modern platform engineering.

Unlike Python, which is often used for automation scripts, Go is commonly used to build the actual platforms, controllers, exporters, operators, agents, and distributed systems that SREs operate.


1. Why Go Matters to SREs

Many of the tools used daily by SREs are written in Go.

Tool                    Purpose
Kubernetes              Container orchestration
Helm                    Package management
Prometheus              Metrics
Grafana Agent / Alloy   Telemetry collection
Loki                    Log aggregation
Tempo                   Distributed tracing
Thanos                  Long-term metrics
Mimir                   Scalable Prometheus
Cilium                  eBPF networking
Terraform               Infrastructure as Code
Docker                  Containers
containerd              Runtime
MinIO                   Object storage
Consul                  Service discovery
Vault                   Secrets management
etcd                    Distributed key-value store
ArgoCD                  GitOps

Key point:

Much of modern cloud-native infrastructure is built in Go.


2. Go Templates in Daily SRE Work

Many SREs use Go before writing Go.


Helm Templates

Helm uses Go templating.

Example:

apiVersion: v1
kind: Service

metadata:
  name: {{ .Release.Name }}

spec:
  type: {{ .Values.service.type }}
Values:

service:
  type: ClusterIP

Output:

type: ClusterIP

Grafana Alert Templates

{{ .Labels.instance }}

Alertmanager Templates

{{ .CommonLabels.alertname }}

ArgoCD Templates

{{ .metadata.name }}

External Secrets

{{ .secret.username }}

Beginner Go


3. Your First Go Program

package main

import "fmt"

func main() {
    fmt.Println("Hello SRE")
}

Execution:

go run main.go

Build:

go build main.go

Produces:

main

Single binary.

No interpreter required.


4. Program Structure

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Hello")
}

Key pieces:

Component   Purpose
package     Namespace
import      Dependencies
func        Function
main        Entry point

5. Data Types

var name string
var count int
var healthy bool

Examples:

name := "grafana"
pods := 5
healthy := true

Common types:

Type       Example
string     hostname
int        replicas
float64    latency
bool       health
[]string   hosts
map        labels

6. Collections

Slice

hosts := []string{
    "web1",
    "web2",
}

Loop:

for _, host := range hosts {
    fmt.Println(host)
}

Maps

labels := map[string]string{
    "env": "prod",
}

Lookup:

fmt.Println(labels["env"])

7. Functions

func check(host string) bool {
    return true
}

Usage:

healthy := check("web1")

Intermediate Go


8. Packages

Organize reusable code.

project/

cmd/
pkg/
internal/

Example:

cmd/
└── sre-tool

pkg/
└── monitoring

internal/
└── config

Import:

import "sre-tool/pkg/monitoring"

9. Structs

Structs are Go’s primary data model.

type Service struct {
    Name string
    URL  string
}

Create:

svc := Service{
    Name: "Grafana",
    URL:  "https://grafana",
}

10. Methods

Attach behavior to structs.

func (s Service) Healthy() bool {
    return true
}

Usage:

svc.Healthy()

11. JSON Processing

Very common in APIs.

type Health struct {
    Status string `json:"status"`
}

Decode:

var health Health
err := json.NewDecoder(resp.Body).Decode(&health)

Used everywhere:

  • Kubernetes APIs
  • Prometheus APIs
  • Grafana APIs
  • Cloud APIs
  • Redfish APIs

12. HTTP Clients

SRE automation often talks to APIs.

resp, err := http.Get(url)

Production:

client := http.Client{
    Timeout: 5 * time.Second,
}

13. Modules

Modern dependency management.

Initialize:

go mod init sre-tool

Creates:

go.mod

Add package:

go get github.com/prometheus/client_golang

Tidy (adds missing dependencies, removes unused ones):

go mod tidy

Advanced Go


14. Goroutines

Go’s best-known feature: lightweight concurrency.

Run concurrently:

go checkNode()
go checkStorage()
go checkNetwork()

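No manual thread management is required: the Go runtime schedules goroutines onto OS threads.


Without goroutines (sequential):

Task A → Task B → Task C

With goroutines (concurrent):

Task A, Task B and Task C run at the same time.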

Perfect for:

  • Fleet health checks
  • API polling
  • Monitoring agents
  • Exporters

15. Channels

Safe communication between goroutines.

results := make(chan string)

go func() {
    results <- "healthy"
}()

Receive:

msg := <-results

16. Worker Pools

Massively useful for SRE tooling.

jobs := make(chan Job)
results := make(chan Result)

Multiple workers:

for w := 1; w <= 10; w++ {
    go worker(jobs, results)
}

Use cases:

  • Check 10,000 servers
  • Scan clusters
  • Query APIs
  • Gather inventory

17. Interfaces

Go’s abstraction mechanism.

type Checker interface {
    Check() error
}

Implement:

type HTTPChecker struct{}
type DNSChecker struct{}

Both satisfy:

Check()

Benefits:

  • Extensible code
  • Plugin architectures
  • Easier testing

18. Context

Critical in production systems.

ctx, cancel := context.WithTimeout(
    context.Background(),
    5*time.Second,
)
defer cancel() // release the context's resources when done

Used for:

  • API timeouts
  • Kubernetes clients
  • Database operations
  • Distributed systems

19. Error Handling

Go favors explicit errors.

result, err := doWork()

if err != nil {
    return err
}

This pattern appears throughout Go codebases.


20. Package Design

Typical enterprise Go project:

sre-toolkit/

cmd/
└── sre-tool

internal/
├── config
├── api
├── logging
└── monitoring

pkg/
├── prometheus
├── kubernetes
├── grafana
└── redfish

go.mod

21. Building Production Services

Typical components:

  • CLI
  • Config
  • Logging
  • Metrics
  • API Client
  • Business Logic
  • Exporter / Service

22. Senior SRE Go Use Cases

Area                   Example
Kubernetes Operators   Custom controllers
Prometheus Exporters   Custom metrics
Grafana Plugins        Dashboards
Monitoring Agents      Node collectors
Redfish Automation     Hardware management
Fleet Management       Server inventory
Cloud Automation       Infrastructure tooling
Storage Automation     Capacity and health
Service Mesh           Network observability
Internal Platforms     Shared engineering tools

23. Production Go Best Practices

Practice                 Why
Use context everywhere   Prevent hangs
Structured logging       Better debugging
Metrics exposure         Observability
Unit tests               Reliability
Interfaces               Extensibility
Worker pools             Scalability
Timeouts                 Safety
Dependency injection     Testability
Small packages           Maintainability
Semantic versioning      Safe releases

Summary

The Go learning journey for an SRE typically looks like:

Using Go-based tools
→ Using Go templates (Helm)
→ Reading Go code
→ Writing small programs
→ Structs & Packages
→ Modules & APIs
→ Concurrency (Goroutines)
→ Interfaces
→ Production Services
→ Operators & Platform Engineering

Key Takeaway

Python is often used to automate systems.

Go is often used to build the systems being automated.

Shell → Automation
Python → Platform Automation
Go → Cloud-Native Platforms (Kubernetes, Observability Systems, Infrastructure Services)


Go Ecosystem → Templates → Language Basics → Packages → Modules → Concurrency → Interfaces → Production Design → Senior SRE Use Cases → Summary.