Git for SREs: from basic to advanced

Git is a distributed version control system. For an SRE, Git is not only for application source code. It becomes the control plane for:

Software versioning
Tracking code changes, releases, hotfixes, rollbacks, and collaboration.

Infrastructure as Code
Managing Terraform, Ansible, Helm, Kubernetes manifests, OpenStack configs, CI/CD pipelines, observability configs, and runbooks.

GitOps
Using Git as the declared source of truth for infrastructure and platform state, with automated agents applying changes into environments.

1. Git basics

What Git solves

Before Git, teams often worked with:

app-v1.tar.gz
app-v2-final.tar.gz
app-v2-final-fixed.tar.gz
app-v2-prod-hotfix.tar.gz

This becomes unmanageable.

Git gives you:

who changed what
when it changed
why it changed
what files changed
how to revert it
how to compare it
how to merge it
how to release it

For SREs, this matters because production systems change constantly. Git gives traceability.

2. Core Git concepts

Repository

A Git repository is a project tracked by Git.

git init

Or clone an existing one:

git clone https://github.com/org/project.git

A repository contains:

working directory  -> your local files
staging area       -> changes prepared for commit
commit history     -> permanent snapshots
remote             -> shared copy, e.g. GitHub/GitLab

Working tree

This is the current directory you edit.

Check status:

git status

Example:

modified: terraform/main.tf
untracked: ansible/inventory.yml

Staging area

Before committing, you stage changes:

git add terraform/main.tf

Stage everything:

git add .

Commit

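A commit is a snapshot of the staged content, recorded together with an author, a timestamp, a message, and a pointer to its parent commit.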

git commit -m "Add Terraform module for VPC networking"

Good commit messages explain intent:

Add Prometheus scrape config for node exporters
Fix Loki retention config for production cluster
Refactor Terraform security group module

Poor messages:

fix
changes
stuff
update

Log

View history:

git log

Compact view:

git log --oneline --graph --decorate --all

Very useful SRE alias:

alias glog='git log --oneline --graph --decorate --all'

Diff

See what changed:

git diff

See staged changes:

git diff --staged

Compare two commits:

git diff abc123 def456

For SREs, git diff is critical before applying infrastructure changes.

3. Branching

A branch is an independent line of development.

git branch feature/add-alertmanager-rules
git checkout feature/add-alertmanager-rules

Modern command (Git 2.23 or later), which creates the branch and switches to it in one step:

git switch -c feature/add-alertmanager-rules

Typical branch names:

feature/add-mimir-alerts
bugfix/fix-nginx-timeout
hotfix/prod-loki-retention
infra/add-openstack-network
docs/update-runbook

Branches allow work without directly changing main.
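
A naming convention like this can be checked automatically, for example in a CI job or a local hook. The following is a minimal sketch in Python. It assumes the prefixes from the list above and lowercase words separated by hyphens; that pattern is an assumed team convention, not something Git itself enforces.

```python
import re

# Prefixes come from the examples above; the lowercase-and-hyphens rule
# after the slash is an assumed team convention, not a Git requirement.
BRANCH_NAME = re.compile(r"(feature|bugfix|hotfix|infra|docs)/[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_branch_name(name: str) -> bool:
    """Return True if the whole name follows the assumed convention."""
    return BRANCH_NAME.fullmatch(name) is not None


for name in ["feature/add-mimir-alerts", "hotfix/prod-loki-retention", "my_branch"]:
    print(name, is_valid_branch_name(name))
```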

4. Merge

Merging combines branches.

git switch main
git merge feature/add-alertmanager-rules

Example:

main
  A---B---C
       \
        D---E feature

After merge:

main
  A---B---C-------M
       \         /
        D---E---

The merge commit records the integration.

5. Rebase

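Rebase replays your branch's commits on top of another branch. The replayed commits are new commits with new hashes, which is why they appear as D' and E' in the diagram below.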

git switch feature/add-alertmanager-rules
git rebase main

Before:

main:    A---B---C
feature:      \---D---E

After:

main:    A---B---C
                  \---D'---E'

Use rebase to keep a clean history.

Common use:

git fetch origin
git rebase origin/main

Do not casually rebase shared branches unless the team agrees.

6. Pull, fetch, push

Fetch

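Downloads new commits and updates remote-tracking branches such as origin/main, but does not modify your local branches or working tree:

git fetch origin

This is safe to run at any time.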

Pull

Fetches, then integrates the upstream branch into your current branch, by merge or by rebase depending on your configuration:

git pull

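Often better, because it keeps your local commits on top of the fetched history instead of creating a merge commit: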

git pull --rebase

Push

Uploads your branch:

git push origin feature/add-alertmanager-rules

Set upstream:

git push -u origin feature/add-alertmanager-rules

7. Pull requests / merge requests

In GitHub: Pull Request.
In GitLab: Merge Request.

For SRE work, a PR/MR should show:

what changed
why it changed
risk level
how it was tested
rollback plan
related ticket/change request

Example SRE MR description:

## Summary
Adds Prometheus alert rules for Kubernetes node disk pressure.

## Risk
Low. Alert-only change. No runtime workload impact.

## Testing
Validated with promtool:
promtool check rules alerts/node-disk.yml

## Rollback
Revert this MR or remove the alert rule file.

8. Tags and releases

Tags mark important points in history.

git tag v1.2.0
git push origin v1.2.0

Annotated tag:

git tag -a v1.2.0 -m "Release v1.2.0"

SRE usage:

application release versions
Terraform module versions
Helm chart versions
Ansible role versions
container image tags
rollback anchors

Example, checking out a tagged release (this leaves the repository in detached HEAD state):

git checkout v1.2.0

9. Git for software versioning

Software teams use Git to manage:

features
bug fixes
release branches
hotfixes
semantic versions
changelogs
build pipelines
deployment promotion

Semantic versioning

Common format:

MAJOR.MINOR.PATCH

Example:

1.4.2

Meaning:

MAJOR: breaking changes
MINOR: backward-compatible features
PATCH: backward-compatible bug fixes

Examples:

1.4.2 -> 1.4.3  patch fix
1.4.2 -> 1.5.0  new feature
1.4.2 -> 2.0.0  breaking change

For SREs, semantic versioning helps understand upgrade risk.
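
The upgrade-risk reading can be automated. The sketch below classifies the change between two versions. It assumes plain MAJOR.MINOR.PATCH strings with an optional leading "v"; pre-release and build suffixes such as 1.5.0-rc1 are not handled. The function names are illustrative.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', accepting an optional leading 'v' as in v1.2.0."""
    major, minor, patch = text.removeprefix("v").split(".")
    return int(major), int(minor), int(patch)


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that changed between two versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        return "not an upgrade"
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


for old, new in [("1.4.2", "1.4.3"), ("1.4.2", "1.5.0"), ("1.4.2", "2.0.0")]:
    print(f"{old} -> {new}: {classify_bump(old, new)}")
```

A pipeline could use a check like this to require extra approval for major bumps.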

10. Common Git workflows

Trunk-based development

Most changes go through short-lived branches into main.

main
 |
 +-- short feature branch
 +-- quick MR
 +-- merge

Advantages:

fast delivery
less merge pain
good for CI/CD
encourages small changes

Best for mature teams with strong tests and automation.

Git Flow

Older, more structured model:

main
develop
feature/*
release/*
hotfix/*

Advantages:

clear release process
useful for slower release cycles

Disadvantages:

more branch complexity
slower integration
larger merge conflicts
less ideal for continuous delivery

Environment branch model

Common but risky:

dev
staging
prod

This is sometimes used for infrastructure, but it can become messy because the branches drift apart over time and changes must be merged or cherry-picked between them.

Better pattern for IaC:

main
envs/dev/
envs/staging/
envs/prod/

Same branch, different directories.

11. Git for Infrastructure as Code

For SREs, Git is where infrastructure definitions live.

Examples:

terraform/
ansible/
kubernetes/
helm/
packer/
cloud-init/
openstack/
ceph/
slurm/
grafana/
prometheus/
loki/
tempo/
mimir/

Infrastructure becomes reviewable and repeatable.

Example Terraform repository layout

infra/
├── modules/
│   ├── network/
│   ├── compute/
│   ├── security-group/
│   └── object-storage/
├── envs/
│   ├── dev/
│   │   └── main.tf
│   ├── staging/
│   │   └── main.tf
│   └── prod/
│       └── main.tf
└── README.md

SRE workflow:

git switch -c infra/add-prod-network
terraform fmt
terraform validate
terraform plan
git add .
git commit -m "Add production OpenStack network module"
git push

The MR should include the Terraform plan output or CI-generated plan.

Example Kubernetes repository layout

platform-k8s/
├── clusters/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── apps/
│   ├── grafana/
│   ├── prometheus/
│   ├── loki/
│   ├── mimir/
│   └── tempo/
├── base/
├── overlays/
└── README.md

With Kustomize:

base/
  deployment.yaml
  service.yaml

overlays/prod/
  kustomization.yaml
  replica-patch.yaml

Example observability config in Git

observability/
├── prometheus/
│   ├── scrape-configs/
│   └── alert-rules/
├── grafana/
│   ├── dashboards/
│   └── datasources/
├── loki/
│   └── recording-rules/
├── mimir/
│   └── alertmanager/
└── tempo/

Benefits:

alerts are reviewed
dashboards are versioned
rollbacks are possible
production config is auditable
changes can be tested in CI

12. GitOps

GitOps means Git is the source of truth for desired system state.

Instead of manually running:

kubectl apply -f deployment.yaml

You commit the desired state to Git. Then a controller applies it.

Common GitOps tools:

Argo CD
Flux CD
Fleet
Jenkins X
GitLab Agent for Kubernetes

GitOps flow

Engineer changes YAML/Helm/Kustomize
        |
        v
Pull request / merge request
        |
        v
Review + CI validation
        |
        v
Merge to main
        |
        v
GitOps controller detects change
        |
        v
Applies desired state to cluster
        |
        v
Reports sync / drift / health

GitOps mental model

Git contains:

desired state

The cluster contains:

actual state

GitOps continuously reconciles:

actual state -> desired state

If someone manually changes the cluster, the GitOps tool detects drift.
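
The reconcile step can be pictured as a comparison between two maps of resources. This is a simplified sketch, not how any particular controller is implemented: resources are plain dictionaries keyed by a made-up "kind/name" string, and real tools compare live Kubernetes objects and may only delete when pruning is enabled.

```python
def plan_reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (action, resource) steps that move actual state towards desired state."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    # Anything running that Git no longer declares is a candidate for deletion.
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("delete", name))
    return steps


git_state = {"deployment/api": {"replicas": 3}, "service/api": {"port": 80}}
cluster_state = {"deployment/api": {"replicas": 5}, "configmap/old": {}}
print(plan_reconcile(git_state, cluster_state))
```

When the two maps are equal the plan is empty, which corresponds to a fully synced state.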

13. GitOps with Kubernetes

Example app:

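apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.4.2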

To deploy a new version, change:

image: registry.example.com/api:v1.4.3

Commit and merge.

The GitOps controller applies it.

14. GitOps with Helm

Repository:

apps/
└── grafana/
    ├── Chart.yaml
    ├── values-dev.yaml
    ├── values-staging.yaml
    └── values-prod.yaml

Example production values:

replicas: 2

resources:
  requests:
    cpu: 500m
    memory: 1Gi

persistence:
  enabled: true
  size: 20Gi

The GitOps tool deploys the Helm release from Git, using the values file for each environment.

15. GitOps with Terraform

Terraform GitOps is more sensitive because it modifies infrastructure.

Typical flow:

MR opened
  -> terraform fmt
  -> terraform validate
  -> terraform plan
  -> security scan
  -> approval
  -> terraform apply

Common tools:

Atlantis
Spacelift
Terraform Cloud
Terragrunt pipelines
GitLab CI
GitHub Actions

For production, apply should usually require approval.

16. SRE methodology: Git as operational control

For SREs, Git supports:

change control
incident rollback
auditability
disaster recovery
configuration management
access review
platform standardisation
repeatability

A mature SRE team avoids undocumented production changes.

Bad:

ssh prod-node-01
vim /etc/nginx/nginx.conf
systemctl reload nginx

Better:

change config in Git
open MR
CI validates
merge
GitOps applies
monitor rollout

Emergency changes may still happen, but they should be backfilled into Git afterwards.

17. Advanced Git commands for SREs

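Restore file

Discard uncommitted changes to a file in the working tree:

git restore file.yaml

Unstage a file

Remove a file from the staging area while keeping its working tree changes:

git restore --staged file.yaml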

Checkout file from another branch

git checkout main -- path/to/file.yaml

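Modern equivalent (git checkout updates both the index and the working tree, so pass both flags to match):

git restore --source main --staged --worktree path/to/file.yaml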

Revert commit

Safe for shared branches:

git revert abc123

This creates a new commit that undoes the previous one.

Best for production rollback.

Reset commit

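Dangerous, especially if the commit has already been pushed:

git reset --hard HEAD~1

This moves your branch back one commit and discards any uncommitted changes in the index and working tree. If the commit was already pushed, your branch now diverges from the remote and can only be pushed with a force push.

Use carefully.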

Cherry-pick

Apply one commit onto another branch:

git cherry-pick abc123

Useful for hotfixes:

fix goes into main
same fix cherry-picked into release branch

Bisect

Find which commit introduced a problem:

git bisect start
git bisect bad
git bisect good v1.2.0

Then Git walks through commits until the bad one is found.

Very useful for regressions.
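
Bisect is a binary search over history, which is why it needs few test runs even across many commits. The sketch below shows the idea on a simple list. It assumes a linear history ordered oldest to newest, with the first commit known good and the last known bad; real Git history is a graph, which git bisect handles for you.

```python
from typing import Callable


def first_bad(commits: list[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit; commits[0] must be good and commits[-1] bad."""
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle   # the regression is at middle or earlier
        else:
            good = middle  # the regression is after middle
    return commits[bad]


history = [f"c{i}" for i in range(8)]  # hypothetical commit names c0..c7
print(first_bad(history, lambda commit: int(commit[1:]) >= 5))
```

In practice the is_bad check is a test script, which git bisect run can execute automatically at each step.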

Blame

Find who changed a line and when:

git blame config.yaml

Use professionally. It is for investigation, not accusation.

Stash

Temporarily save local changes:

git stash

Restore:

git stash pop

Useful during urgent context switching.

Worktree

Have multiple branches checked out at once:

git worktree add ../prod-hotfix hotfix/prod-fix

Useful for SREs handling urgent hotfixes while keeping normal work untouched.

18. Git security for SREs

Never commit secrets

Do not commit:

passwords
API tokens
SSH private keys
cloud credentials
kubeconfigs
database URLs with passwords
TLS private keys

Use:

Vault
SOPS
Sealed Secrets
External Secrets Operator
cloud secret managers
Kubernetes secrets generated by pipeline

Secret scanning

Use tools such as:

gitleaks
trufflehog
git-secrets
GitHub secret scanning
GitLab secret detection

Example:

gitleaks detect
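
To show roughly what such scanners look for, here is a toy pattern-based check in Python. It is a sketch only: the three patterns are illustrative assumptions, and real tools such as gitleaks ship far larger rule sets and also scan commit history.

```python
import re

# Illustrative patterns only; real scanners use many more, carefully tuned rules.
PATTERNS = {
    "private key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded password": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, finding) pairs for lines that look like secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


print(scan("replicas: 3\npassword: supersecret123\n"))
```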

Signed commits

For regulated or high-trust environments:

git commit -S -m "Update production alert rules"

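A signature lets others verify that the commit was created by the holder of the signing key. It requires a GPG, SSH, or X.509 signing key to be configured in Git.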

Branch protection

Production repositories should require:

MR/PR review
passing CI
no force-push to main
signed commits where required
CODEOWNERS approval
security scanning
status checks

19. CODEOWNERS

Example:

/terraform/prod/ @platform-team @sre-leads
/kubernetes/prod/ @sre-team
/security/ @security-team
/observability/ @observability-team

This ensures sensitive areas get reviewed by the right people.
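
The lookup a hosting platform performs can be sketched as follows. This is a simplified model, assuming directory-prefix patterns only and that the last matching line wins; the real CODEOWNERS syntax also supports globs and, on GitLab, sections.

```python
def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Turn CODEOWNERS text into (pattern, owners) rules, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """Simplified matching: directory-prefix patterns only, last match wins."""
    matched: list[str] = []
    for pattern, owners in rules:
        if ("/" + path).startswith(pattern):
            matched = owners
    return matched


rules = parse_codeowners("/terraform/prod/ @platform-team @sre-leads\n/kubernetes/prod/ @sre-team\n")
print(owners_for("terraform/prod/main.tf", rules))
```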

20. CI/CD with Git

Git events trigger automation:

push
merge request
tag
release
schedule
manual approval

Example GitLab CI:

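stages:
  - validate
  - plan
  - apply

terraform_validate:
  stage: validate
  script:
    # init without a backend is enough for validation
    - terraform init -backend=false
    - terraform fmt -check
    - terraform validate

terraform_plan:
  stage: plan
  script:
    - terraform init
    # save the plan so that apply runs exactly what was reviewed
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan

terraform_apply:
  stage: apply
  when: manual
  script:
    - terraform init
    - terraform apply tfplan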

For SREs, CI protects production from bad changes.

21. Testing infrastructure changes

Before merge, test:

syntax
formatting
schema validation
policy compliance
security
dry-run
diff
plan
integration behaviour

Examples:

terraform fmt -check
terraform validate
terraform plan
ansible-lint
yamllint
kubeconform
kubectl diff
helm lint
helm template
promtool check rules
conftest test

22. Policy as Code

Git can enforce standards.

Examples:

no public S3 buckets
no privileged Kubernetes pods
no LoadBalancer in dev
all resources must have owner labels
production changes require approval
no plaintext secrets

Tools:

OPA
Conftest
Kyverno
Gatekeeper
Checkov
Terrascan
tfsec

Example policy idea:

Deny Kubernetes workloads using privileged: true
unless namespace is explicitly approved.
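
The logic of that policy is small enough to sketch in Python. In practice it would be written for a policy engine such as OPA or Kyverno; this version only illustrates the rule. It assumes a Deployment-shaped manifest already loaded into a dictionary, checks only regular containers (not init containers), and uses a hypothetical allow-list.

```python
APPROVED_NAMESPACES = {"kube-system"}  # hypothetical allow-list


def violations(manifest: dict) -> list[str]:
    """Return one message per privileged container outside approved namespaces."""
    namespace = manifest.get("metadata", {}).get("namespace", "default")
    if namespace in APPROVED_NAMESPACES:
        return []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    return [
        f"container {container.get('name')} is privileged in namespace {namespace}"
        for container in pod_spec.get("containers", [])
        if container.get("securityContext", {}).get("privileged") is True
    ]


manifest = {
    "metadata": {"name": "api", "namespace": "payments"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "securityContext": {"privileged": True}},
    ]}}},
}
print(violations(manifest))
```

A CI job would fail the merge request whenever the returned list is not empty.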

23. GitOps drift detection

Drift means production differs from Git.

Example:

Git says replicas: 3
Cluster has replicas: 5

Possible causes:

manual kubectl edit
autoscaler
emergency change
failed sync
controller conflict
wrong environment overlay

GitOps tools report sync and health status. Argo CD, for example, uses values such as:

Synced
OutOfSync
Healthy
Degraded
Progressing
Missing

SRE response:

identify drift
understand whether intentional
reconcile from Git
or commit the required change back to Git
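
Deciding whether drift is intentional often comes down to which fields another controller is allowed to own. The sketch below compares two flat field maps and skips an ignore list. The field names and the flat structure are assumptions for illustration; real tools diff full Kubernetes objects and have their own ways to ignore fields.

```python
def find_drift(desired: dict, actual: dict, ignore: frozenset[str] = frozenset()) -> dict[str, tuple]:
    """Return {field: (git_value, cluster_value)} for every field that differs."""
    drift = {}
    for field in sorted(desired.keys() | actual.keys()):
        if field in ignore:
            continue  # e.g. a field that an autoscaler is allowed to manage
        git_value, live_value = desired.get(field), actual.get(field)
        if git_value != live_value:
            drift[field] = (git_value, live_value)
    return drift


git_spec = {"replicas": 3, "image": "api:v1.4.2"}
live_spec = {"replicas": 5, "image": "api:v1.4.2"}
print(find_drift(git_spec, live_spec))
print(find_drift(git_spec, live_spec, ignore=frozenset({"replicas"})))
```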

24. GitOps anti-patterns

Storing secrets directly in Git

Bad:

password: supersecret123

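Better, reference a secret stored outside Git (the field name below is illustrative and depends on your chart or operator):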

externalSecretRef:
  name: database-password

Manual production changes

Bad:

kubectl edit deployment api

Better:

change Git
review
merge
sync

Too many environment branches

Bad:

dev branch
test branch
staging branch
prod branch

Often leads to drift.

Better:

main branch
envs/dev
envs/staging
envs/prod

Giant pull requests

Bad:

changed Terraform, Helm, alerts, dashboards, network policy and database config together

Better:

small, reviewable, reversible changes

25. Git for incident response

During incidents, Git helps answer:

what changed recently?
who changed it?
was there a deployment?
what config changed?
can we revert it?
which version was previously healthy?

Useful commands:

git log --since="2 hours ago"
git diff HEAD~1 HEAD
git show abc123
git revert abc123

For Kubernetes:

git diff HEAD~1 HEAD -- clusters/prod/

For Terraform:

git log -- terraform/prod/

26. Git rollback strategies

Application rollback

Change image tag back:

image: app:v1.4.2

instead of:

image: app:v1.4.3

Commit, merge, sync.

Config rollback

git revert abc123

Terraform rollback

Be careful. Reverting Terraform code does not always safely reverse infrastructure state.

You must inspect:

terraform plan

Rollback may delete resources.
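
One way to make that inspection harder to skip is to check the machine-readable plan for deletions. The sketch below assumes the plan was saved and exported with terraform show -json, and reads the resource_changes entries of that output. The function name is illustrative.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """List resource addresses whose planned actions include a delete."""
    plan = json.loads(plan_json)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]


# Hand-written sample in the same shape; the resource addresses are made up.
sample = json.dumps({"resource_changes": [
    {"address": "openstack_networking_network_v2.prod", "change": {"actions": ["update"]}},
    {"address": "openstack_compute_instance_v2.api", "change": {"actions": ["delete", "create"]}},
]})
print(destructive_changes(sample))
```

A replacement shows up as a delete combined with a create, so it is reported as well.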

Helm rollback

If using Helm directly:

helm rollback grafana 12

With GitOps, prefer changing Git back to the known-good values.

27. Git repository strategies for SRE teams

Mono-repo

One large repo:

platform/
├── terraform/
├── kubernetes/
├── observability/
├── ansible/
└── docs/

Advantages:

single source of truth
easy cross-system changes
centralised review

Disadvantages:

can become large
permissions harder
CI can become complex

Multi-repo

Separate repos:

terraform-infra
k8s-platform
observability-config
ansible-roles
service-catalog

Advantages:

clear ownership
smaller repos
separate permissions

Disadvantages:

cross-repo coordination harder
versioning complexity

Hybrid

Common mature pattern:

terraform modules repo
environment infra repo
k8s platform repo
app deployment repos
observability repo

28. Git for OpenStack, Kubernetes and AI/HPC platforms

For an SRE working with cloud and HPC-style infrastructure, Git can manage:

OpenStack

Nova configs
Neutron networks
Cinder backend configs
Glance images
Heat templates
Terraform OpenStack provider code
Ansible OpenStack deployment configs
Ceph integration settings

Kubernetes

cluster manifests
CNI configs
ingress controllers
storage classes
Helm releases
network policies
RBAC
operators

Ceph

cephadm specs
Rook manifests
pool definitions
storage class configs
monitoring rules

Slurm / HPC

slurm.conf
gres.conf
cgroup.conf
Prometheus exporters
GPU health checks
node provisioning scripts
job accounting config

Observability

Prometheus rules
Grafana dashboards
Loki pipelines
Tempo sampling config
Mimir overrides
OpenTelemetry Collector configs
Alertmanager routes
SLO definitions

29. Advanced SRE Git practices

Make every production change traceable

Every production change should have:

commit
review
CI result
deployment record
rollback path
owner
ticket/change reference

Use small commits

Good:

Add node disk pressure alert
Add runbook link to alert
Tune alert threshold after staging test

Bad:

Big observability update

Use conventional commits

Example:

feat: add Grafana dashboard for Slurm GPUs
fix: correct Loki retention period
chore: update Terraform provider version
docs: add OpenStack recovery runbook

This helps automation generate changelogs.
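
A rough idea of how such automation can work: parse each commit subject and group it under a changelog heading. This is a minimal sketch; the heading names are assumptions, and established changelog generators handle many more cases such as footers and breaking-change notes.

```python
import re
from collections import defaultdict

# type, optional (scope), optional "!" marking a breaking change, then ": description"
SUBJECT = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$")

# Hypothetical mapping from commit type to changelog heading.
HEADINGS = {"feat": "Features", "fix": "Fixes", "docs": "Documentation", "chore": "Maintenance"}


def group_for_changelog(subjects: list[str]) -> dict[str, list[str]]:
    """Group commit subjects by heading; anything unrecognised goes under 'Other'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for subject in subjects:
        match = SUBJECT.match(subject)
        if match is None:
            groups["Other"].append(subject)  # not a conventional commit
            continue
        heading = HEADINGS.get(match["type"], "Other")
        groups[heading].append(match["desc"])
    return dict(groups)


print(group_for_changelog([
    "feat: add Grafana dashboard for Slurm GPUs",
    "fix: correct Loki retention period",
    "Big observability update",
]))
```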

Use protected environments

For production:

manual approval
restricted deployers
change window checks
automated rollback signals

Use deployment metadata

Every deployment should expose:

git commit SHA
version
build timestamp
branch
pipeline URL

Example app endpoint:

{
  "version": "1.4.2",
  "commit": "a1b2c3d",
  "build_time": "2026-06-14T10:00:00Z"
}

This makes incident debugging much easier.
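
A build step can generate this metadata from the CI environment. The sketch below assumes it runs inside a GitLab CI job, where CI_COMMIT_SHORT_SHA, CI_COMMIT_REF_NAME and CI_PIPELINE_URL are predefined variables; outside CI it falls back to "unknown". How the version is chosen is left to the caller.

```python
import json
import os
from datetime import datetime, timezone


def build_info(env: dict[str, str], version: str) -> dict[str, str]:
    """Collect deployment metadata from CI environment variables."""
    return {
        "version": version,
        "commit": env.get("CI_COMMIT_SHORT_SHA", "unknown"),
        "branch": env.get("CI_COMMIT_REF_NAME", "unknown"),
        "pipeline_url": env.get("CI_PIPELINE_URL", "unknown"),
        # UTC timestamp in the same format as the example above
        "build_time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


print(json.dumps(build_info(dict(os.environ), version="1.4.2"), indent=2))
```

The resulting JSON can be baked into the image and served from a version endpoint.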

30. What an SRE should be able to say in an interview

A strong answer:

Git is the audit trail and collaboration mechanism for both software and infrastructure. For SRE, I use it to manage application releases, Terraform, Kubernetes manifests, Helm values, Ansible, observability configs, alert rules and runbooks. Changes should go through pull requests, CI validation, policy checks, peer review and controlled deployment. With GitOps, Git becomes the desired state, and tools like Argo CD or Flux reconcile that state into Kubernetes. This reduces manual drift, improves rollback, and makes production changes auditable.

31. Practical SRE Git skill checklist

You should be comfortable with:

clone, branch, commit, push, pull
merge and rebase
diff and log
revert and cherry-pick
tags and releases
resolving merge conflicts
writing good commit messages
reviewing pull requests
using CI/CD pipelines
managing Terraform through Git
managing Kubernetes through Git
GitOps with Argo CD or Flux
secret scanning
branch protection
CODEOWNERS
incident rollback using Git
drift detection

32. The key mindset

For a junior engineer, Git is where code is stored.

For a DevOps engineer, Git is where automation starts.

For an SRE, Git is the operational source of truth.

For a platform engineer, Git is the interface between humans, automation, infrastructure and production reality.

Git Aliases for SRE

Add this to ~/.bashrc or ~/.zshrc (aliases placed in ~/.profile are only read by login shells):

# -------------------------------------------------------------------
# Git aliases for SRE / Platform / DevOps work
# -------------------------------------------------------------------

# Status / inspection
alias gs='git status -sb'
alias gst='git status'
alias gd='git diff'
alias gds='git diff --staged'
alias gdc='git diff --cached'   # identical to gds: --staged is a synonym for --cached
alias gshow='git show --stat --oneline'
alias gsh='git show'
alias gl='git log --oneline --decorate --graph --all'
alias gla='git log --oneline --decorate --graph --all --stat'
alias glp='git log --patch'
alias glast='git log -1 --stat'
alias gbl='git blame'
alias gcount='git shortlog -sn'

# Branches
alias gb='git branch'
alias gba='git branch -a'
alias gbd='git branch -d'
alias gbD='git branch -D'
alias gco='git checkout'
alias gsw='git switch'
alias gswc='git switch -c'
alias gmain='git switch main'
alias gmaster='git switch master'

# Add / commit
alias ga='git add'
alias gaa='git add .'
alias gap='git add -p'
alias gc='git commit'
alias gcm='git commit -m'
alias gca='git commit --amend'
alias gcan='git commit --amend --no-edit'

# Fetch / pull / push
alias gf='git fetch'
alias gfa='git fetch --all --prune'
alias gp='git push'
alias gpu='git push -u origin HEAD'
alias gpf='git push --force-with-lease'
alias gpl='git pull'
alias gpr='git pull --rebase'
alias gup='git fetch origin && git rebase origin/main'

# Merge / rebase
alias gm='git merge'
alias gr='git rebase'
alias gri='git rebase -i'
alias grc='git rebase --continue'
alias gra='git rebase --abort'
alias gmc='git merge --continue'
alias gma='git merge --abort'

# Restore / reset
alias grs='git restore'
alias grst='git restore --staged'
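alias grhard='git reset --hard'   # discards uncommitted changes
alias grsoft='git reset --soft'
alias gclean='git clean -fd'      # deletes untracked files and directories; preview with git clean -nd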
alias gundo='git reset --soft HEAD~1'

# Stash
alias gstash='git stash'
alias gstashp='git stash pop'
alias gstasha='git stash apply'
alias gstashl='git stash list'
alias gstashd='git stash drop'

# Tags / releases
alias gt='git tag'
alias gta='git tag -a'
alias gtl='git tag --list'
alias gtp='git push origin --tags'

# Cherry-pick / revert
alias gcp='git cherry-pick'
alias gcpc='git cherry-pick --continue'
alias gcpa='git cherry-pick --abort'
alias grev='git revert'
alias grevc='git revert --continue'
alias greva='git revert --abort'

# Remote
alias grv='git remote -v'
alias gro='git remote show origin'

# Useful SRE investigation aliases
alias gchanged='git diff --name-only HEAD~1 HEAD'
alias grecent='git log --since="24 hours ago" --oneline --decorate --all'
alias gprodlog='git log --oneline --decorate --graph --all -- envs/prod terraform/prod clusters/prod'
alias gwho='git shortlog -sn --all'
alias gconflicts='git diff --name-only --diff-filter=U'

# Safety / validation helpers
alias gignored='git status --ignored'
alias guntracked='git ls-files --others --exclude-standard'
alias gignoredfiles='git ls-files --ignored --exclude-standard -o'
alias groot='cd "$(git rev-parse --show-toplevel)"'

# Worktree
alias gw='git worktree'
alias gwl='git worktree list'
alias gwa='git worktree add'
alias gwr='git worktree remove'

Most important aliases to memorise

gs      # short status
gd      # unstaged diff
gds     # staged diff
gaa     # add everything
gap     # interactively stage hunks
gcm     # commit with message
gl      # readable graph log
gfa     # fetch all remotes and prune stale remote-tracking branches
gpr     # pull with rebase
gpu     # push current branch and set upstream
gpf     # safer force push
gundo   # undo last commit but keep changes
grev    # revert a bad commit safely
gstash  # temporarily save work
gstashp # restore stashed work
groot   # jump to repo root

Why SREs use these

The main problems they solve are speed, safety, and incident response.

gs, gd, and gds help you avoid committing accidental changes.

gap lets you split messy work into clean, reviewable commits.

gl, gshow, gbl, and grecent help during incidents when you need to answer: “what changed recently?”

gfa, gpr, and gpu make normal branch workflow faster.

gpf uses --force-with-lease, which is safer than raw --force.

grev is the production-safe rollback command because it creates a new undo commit instead of rewriting shared history.

gundo is useful before pushing when your last local commit needs reworking.

gstash and gstashp are useful when you are interrupted by urgent production work.

gprodlog is useful in IaC/GitOps repos where production files live under paths like envs/prod, terraform/prod, or clusters/prod.

GitLab Community Edition

GitLab CE means GitLab Community Edition. It is the self-hosted, open-source edition of GitLab. It provides:

Git repository hosting
Merge requests
Issue tracking
Wiki
Container registry
CI/CD pipelines
GitLab runners
Webhooks
Access control
Branch protection
Deploy keys/tokens
Project/group management

For an SRE, GitLab CE is useful because it can become the internal platform for:

application delivery
infrastructure as code
Terraform pipelines
Ansible automation
Kubernetes deployments
GitOps workflows
observability config management
release management
incident rollback

1. GitLab CE architecture

A basic GitLab CE installation usually contains:

GitLab web UI
GitLab Rails application
Gitaly
PostgreSQL
Redis
Sidekiq
Nginx
GitLab Shell
GitLab Workhorse
Container Registry
GitLab Runner

Main components

GitLab Rails

The main web application.

Handles:

users
projects
groups
merge requests
issues
CI/CD configuration
permissions
API

Gitaly

GitLab’s Git storage service.

Handles Git repository access:

clone
fetch
push
diff
commit browsing
repository metadata

For larger setups, Gitaly performance matters a lot.

PostgreSQL

Stores GitLab metadata:

users
groups
projects
permissions
pipeline records
merge request data
issue data
CI/CD metadata

The actual Git repository data is not stored in PostgreSQL.

Redis

Used for caching and background job coordination.

Sidekiq

Processes background jobs:

pipeline scheduling
email sending
webhooks
merge request updates
repository housekeeping
import/export jobs

GitLab Workhorse

A smart reverse proxy between Nginx and Rails.

Handles:

large Git HTTP traffic
file uploads
archive downloads
repository requests

GitLab Shell

Handles SSH Git operations:

git clone git@gitlab.example.com:group/project.git
git push

Container Registry

Optional but very useful.

Used to store Docker/OCI images:

registry.gitlab.example.com/group/project/app:1.2.3

GitLab Runner

Executes CI/CD jobs.

This is the part SREs usually care about most.

2. Typical GitLab CE installation

The common installation method is the Omnibus package.

Example for Ubuntu/Debian:

sudo apt update
sudo apt install -y curl openssh-server ca-certificates tzdata perl

curl https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.deb.sh | sudo bash

sudo EXTERNAL_URL="https://gitlab.example.com" apt install gitlab-ce

Then reconfigure:

sudo gitlab-ctl reconfigure

Check status:

sudo gitlab-ctl status

Restart:

sudo gitlab-ctl restart

View logs:

sudo gitlab-ctl tail

3. Main GitLab config file

The main config file is:

/etc/gitlab/gitlab.rb

After changing it, run:

sudo gitlab-ctl reconfigure

Important settings:

external_url 'https://gitlab.example.com'

gitlab_rails['time_zone'] = 'Europe/London'

gitlab_rails['gitlab_shell_ssh_port'] = 22

nginx['redirect_http_to_https'] = true

letsencrypt['enable'] = true

For internal TLS or reverse proxy setups, you may configure Nginx differently.

4. GitLab CE backup and restore

Backups are critical.

Create backup:

sudo gitlab-backup create

Backups are usually written to:

/var/opt/gitlab/backups/

Also back up:

/etc/gitlab/gitlab.rb
/etc/gitlab/gitlab-secrets.json

These are essential for restoring the instance.

A proper SRE backup strategy should include:

scheduled backups
off-host storage
restore testing
database consistency
registry backup
artifact backup
repository backup
secret file backup
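
Scheduled backups are only useful if someone notices when they stop. The sketch below checks the age of the newest backup archive. It assumes archive names begin with the creation time as Unix epoch seconds, as in names like 1493107454_2018_04_25_10.6.4-ce_gitlab_backup.tar; verify this against the files on your own instance. The 24-hour threshold is a hypothetical value.

```python
import time


def backup_age_hours(filename: str, now_epoch: float) -> float:
    """Age of a backup, taken from the epoch seconds at the start of its name."""
    created_epoch = int(filename.split("_", 1)[0])
    return (now_epoch - created_epoch) / 3600


def is_stale(filename: str, now_epoch: float, max_age_hours: float = 24) -> bool:
    # max_age_hours is a hypothetical threshold; match it to your backup schedule.
    return backup_age_hours(filename, now_epoch) > max_age_hours


example = "1493107454_2018_04_25_10.6.4-ce_gitlab_backup.tar"
print(is_stale(example, now_epoch=time.time()))
```

A check like this can feed an alert, but it does not replace regular restore testing.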

5. GitLab Runner

GitLab Runner is the agent that executes pipeline jobs.

GitLab itself creates and queues jobs.
The runner polls GitLab for pending jobs and actually runs them.

Basic flow:

Developer pushes code
        ↓
GitLab creates pipeline
        ↓
Job waits for runner
        ↓
Runner picks up job
        ↓
Runner executes script
        ↓
Runner sends logs/status/artifacts back to GitLab

6. Runner installation

On Ubuntu/Debian:

curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash

sudo apt install gitlab-runner

Check service:

sudo systemctl status gitlab-runner

Start/enable:

sudo systemctl enable --now gitlab-runner

7. Registering a runner

You register a runner against GitLab.

Typical command:

sudo gitlab-runner register

It asks for:

GitLab URL
registration/authentication token
runner description
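tags (only with a legacy registration token; with a runner authentication token, tags are set in the GitLab UI when the runner is created)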
executor type
default image if Docker executor

Example:

sudo gitlab-runner register \
  --url "https://gitlab.example.com" \
  --token "RUNNER_AUTH_TOKEN" \
  --description "docker-runner-01" \
  --executor "docker" \
  --docker-image "ubuntu:24.04"

Runner config is stored in:

/etc/gitlab-runner/config.toml

Restart after changes:

sudo systemctl restart gitlab-runner

8. Runner types

Instance runner

Available to all projects.

Good for:

shared CI workloads
general build jobs
small internal platforms

Risk:

less isolation
capacity contention
possible secret exposure if misconfigured

Group runner

Available to projects in a group.

Good for:

platform team repos
environment-specific runners
team-level isolation

Project runner

Assigned to one project.

Good for:

sensitive deployments
production infrastructure repos
regulated workloads
privileged jobs

9. Runner executors

The executor determines how jobs run.

Shell executor

Runs jobs directly on the runner host.

Example:

executor = "shell"

Advantages:

simple
fast
good for controlled internal automation
easy access to host tools

Disadvantages:

weak isolation
jobs can modify runner host
dependency conflicts
not ideal for untrusted code

Use for:

Ansible control node
simple scripts
internal admin tasks
trusted infra jobs

Avoid for:

untrusted projects
public repositories
multi-tenant workloads

Docker executor

Runs each job inside a container.

Example:

executor = "docker"

Advantages:

clean job environment
reproducible builds
better isolation than shell
easy per-job images
good for most CI/CD workloads

Disadvantages:

Docker-in-Docker needs care
volume/cache permissions can be annoying
privileged mode can be risky

Use for:

builds
tests
linting
Terraform plans
Helm validation
container image builds

Kubernetes executor

Runs each CI job as a Kubernetes pod.

Advantages:

scalable
ephemeral
good isolation
native cloud/platform fit
works well for large CI estates

Disadvantages:

more complex
requires Kubernetes cluster
RBAC and network policy design needed
cache/artifact configuration required

Use for:

large CI platforms
multi-team environments
elastic runner capacity
cloud-native organisations

SSH executor

Runs jobs over SSH on remote machines.

Less common now.

Use only for specific legacy workflows.

10. Runner tags

Tags match jobs to runners.

Runner registered with tags:

docker
linux
terraform
prod

Job uses:

job:
  tags:
    - terraform

GitLab schedules the job only on runners with matching tags.
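
The matching rule is a subset check: a runner qualifies only if it has every tag the job lists. A minimal sketch in Python, with made-up runner names. It ignores other settings that affect scheduling, such as whether a runner accepts untagged jobs or is restricted to protected branches.

```python
def eligible_runners(job_tags: set[str], runners: dict[str, set[str]]) -> list[str]:
    """A runner qualifies only if it carries every tag the job asks for."""
    return sorted(name for name, tags in runners.items() if job_tags <= tags)


runners = {
    "docker-runner-01": {"docker", "linux", "terraform", "prod"},
    "shell-runner-01": {"shell", "ansible"},
}
print(eligible_runners({"terraform"}, runners))
print(eligible_runners({"terraform", "gpu"}, runners))
```

A job whose tags match no runner stays pending, which is a common cause of stuck pipelines.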

Good tag strategy:

docker
shell
k8s
terraform
ansible
prod
staging
gpu
arm64
x86_64
privileged

Avoid vague tags:

runner1
test
misc

11. Runner config example

Example Docker runner:

concurrent = 4
check_interval = 3

[[runners]]
  name = "docker-runner-01"
  url = "https://gitlab.example.com"
  token = "TOKEN"
  executor = "docker"

  [runners.docker]
    image = "ubuntu:24.04"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

Important settings:

concurrent   total jobs this runner process can run
executor     shell/docker/kubernetes/etc.
image        default container image
privileged   needed for some Docker builds, but risky
volumes      cache/shared mounts

12. Runner security

Runner security is critical.

Key risks

secrets leaked to logs
untrusted code running on privileged runner
shared runner accessing production credentials
Docker socket exposure
persistent workspace contamination
branch from fork accessing secrets

Safer practices

use protected runners for production
use protected variables
avoid shell executor for untrusted code
avoid Docker socket mounting where possible
prefer short-lived credentials
use masked variables
use scoped deploy tokens
separate build and deploy runners
separate dev/staging/prod runners
restrict who can modify .gitlab-ci.yml
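
Some of these practices can be checked automatically. The sketch below assumes a [[runners]] entry from config.toml has already been parsed into a Python dict (for example with tomllib on Python 3.11+). The function name and warning texts are made up for illustration, and the checks cover only three of the risks listed above.

```python
def audit_runner(runner):
    """Return a list of warnings for one [[runners]] entry given as a dict."""
    warnings = []
    docker = runner.get("docker", {})
    if runner.get("executor") == "shell":
        warnings.append("shell executor: jobs run directly on the host")
    if docker.get("privileged", False):
        warnings.append("privileged = true: jobs get broad access to the host")
    for volume in docker.get("volumes", []):
        # Volume entries look like "/host/path:/container/path" or "/cache".
        if volume.split(":")[0] == "/var/run/docker.sock":
            warnings.append("Docker socket mounted: jobs control the host daemon")
    return warnings


example = {
    "name": "docker-runner-01",
    "executor": "docker",
    "docker": {"privileged": False, "volumes": ["/cache"]},
}
print(audit_runner(example))  # []
```

Running a check like this in a scheduled job against every runner host is one way to notice configuration drift on production runners.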

13. Basic GitLab CI/CD

Pipeline config lives in:

.gitlab-ci.yml

Minimal example:

stages:
  - test

test:
  stage: test
  image: alpine:latest
  script:
    - echo "Running tests"
    - echo "Done"

When pushed, GitLab creates a pipeline.

14. Stages and jobs

A pipeline contains jobs.

Jobs are grouped into stages.

Example:

stages:
  - lint
  - test
  - build
  - deploy

lint:
  stage: lint
  script:
    - echo "Linting"

test:
  stage: test
  script:
    - echo "Testing"

build:
  stage: build
  script:
    - echo "Building"

deploy:
  stage: deploy
  script:
    - echo "Deploying"

Default behaviour:

all jobs in a stage run in parallel
next stage starts only when previous stage succeeds
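
As a rough model of this default ordering, the Python sketch below groups jobs into "waves", one per stage. The function and job names are hypothetical, and the model ignores runner capacity, which in practice limits how many jobs of a wave really run at once.

```python
def stage_waves(stages, jobs):
    """Group jobs into waves, one wave per stage, in stage order.

    stages: ordered list of stage names.
    jobs:   dict mapping job name -> stage name.
    """
    waves = []
    for stage in stages:
        wave = sorted(name for name, job_stage in jobs.items() if job_stage == stage)
        if wave:  # stages without jobs are skipped
            waves.append(wave)
    return waves


waves = stage_waves(
    ["lint", "test", "build", "deploy"],
    {"lint": "lint", "unit": "test", "integration": "test", "build": "build"},
)
print(waves)  # [['lint'], ['integration', 'unit'], ['build']]
```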

15. Using images

With Docker/Kubernetes runners, each job can define an image:

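terraform_plan:
  image:
    name: hashicorp/terraform:latest
    entrypoint: [""]
  script:
    - terraform version
    - terraform init
    - terraform plan

The entrypoint override matters: the hashicorp/terraform image sets terraform itself as the entrypoint, and GitLab Runner needs a shell in the container to run the script. Many single-tool images (helm, kubectl, git, promtool) behave the same way.

Better: pin versions.

image:
  name: hashicorp/terraform:1.9.8
  entrypoint: [""]

Avoid unpinned latest for production pipelines.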

16. Variables

Define variables globally:

variables:
  TF_IN_AUTOMATION: "true"
  TF_INPUT: "false"

Use them:

job:
  script:
    - echo "$TF_IN_AUTOMATION"

Sensitive variables should be stored in the GitLab UI, never in the repository:

Settings → CI/CD → Variables

Use:

masked
protected
environment-scoped

17. Artifacts

Artifacts are files saved after a job.

Example:

build:
  stage: build
  script:
    - mkdir dist
    - echo "binary" > dist/app
  artifacts:
    paths:
      - dist/
    expire_in: 1 week

Use artifacts for:

compiled binaries
test reports
Terraform plans
coverage reports
SBOMs
generated manifests

18. Cache

Cache speeds up pipelines.

Example:

cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - .npm/
    - vendor/

Artifacts and cache are different:

cache      speeds future jobs
artifact   passes output or stores result

19. Rules

rules control when jobs run.

Example:

test:
  script:
    - echo "test"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == "main"'

Run only on main:

deploy_prod:
  script:
    - echo "deploy prod"
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

Manual job:

deploy_prod:
  script:
    - echo "deploy prod"
  when: manual

Better with rules:

deploy_prod:
  script:
    - echo "deploy prod"
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual
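
Rules are evaluated top to bottom and the first match decides. The Python sketch below models that behaviour in simplified form: plain Python predicates stand in for GitLab's expression syntax, and only the if and when keys are handled. All names are illustrative.

```python
def evaluate_rules(rules, variables):
    """Return the `when` value of the first matching rule, or None if none match.

    rules: list of dicts with an optional "if" predicate (a function of
           the variables dict) and an optional "when" value.
    """
    for rule in rules:
        condition = rule.get("if")
        if condition is None or condition(variables):
            return rule.get("when", "on_success")
    return None  # no rule matched: the job is left out of the pipeline


deploy_prod_rules = [
    {"if": lambda v: v.get("CI_COMMIT_BRANCH") == "main", "when": "manual"},
]

print(evaluate_rules(deploy_prod_rules, {"CI_COMMIT_BRANCH": "main"}))       # manual
print(evaluate_rules(deploy_prod_rules, {"CI_COMMIT_BRANCH": "feature-x"}))  # None
```

Because the first match wins, broad rules placed early can shadow more specific rules below them.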

20. Protected branches and protected variables

For production:

main branch protected
production variables protected
production runner protected
deployment job manual
approval required through MR

This prevents feature branches from accessing production secrets.

21. Basic application pipeline

Example:

stages:
  - lint
  - test
  - build
  - deploy

variables:
  IMAGE_TAG: "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

lint:
  stage: lint
  image: node:22
  script:
    - npm ci
    - npm run lint

test:
  stage: test
  image: node:22
  script:
    - npm ci
    - npm test

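build_image:
  stage: build
  image: docker:27
  services:
    - docker:27-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    # Log in to the project's container registry before pushing.
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$IMAGE_TAG" .
    - docker push "$IMAGE_TAG"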

deploy_staging:
  stage: deploy
  image: alpine:latest
  script:
    - echo "Deploy $IMAGE_TAG to staging"
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

22. Docker image builds

There are several approaches.

Docker-in-Docker

image: docker:27

services:
  - docker:27-dind

variables:
  DOCKER_TLS_CERTDIR: "/certs"

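build:
  script:
    # Log in to the project's container registry before pushing.
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"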

Requires privileged runner in many setups.

Risk:

privileged containers
larger attack surface
careful runner isolation needed

Kaniko

Good for Kubernetes runners.

build:
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

Advantage:

builds container images without Docker daemon
safer for Kubernetes-based CI

Buildah / Podman

Useful in Red Hat-style environments.

23. Terraform pipeline

Basic Terraform pipeline:

stages:
  - validate
  - plan
  - apply

variables:
  TF_IN_AUTOMATION: "true"
  TF_INPUT: "false"

terraform_fmt:
  stage: validate
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform fmt -check -recursive

terraform_validate:
  stage: validate
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init -backend=false
    - terraform validate

terraform_plan:
  stage: plan
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan
    expire_in: 1 day

terraform_apply:
  stage: apply
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - terraform_plan
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

Production improvements:

remote backend
state locking
manual approval
protected variables
protected runner
separate plan/apply credentials
policy checks
cost estimation
drift detection

24. Ansible pipeline

stages:
  - lint
  - syntax
  - deploy

ansible_lint:
  stage: lint
  image: cytopia/ansible-lint:latest
  script:
    - ansible-lint .

syntax_check:
  stage: syntax
  image: alpine/ansible:latest
  script:
    - ansible-playbook site.yml --syntax-check

deploy_prod:
  stage: deploy
  image: alpine/ansible:latest
  script:
    - ansible-playbook -i inventories/prod site.yml
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

Use runners carefully here. Ansible often needs network access to internal infrastructure.

25. Kubernetes / Helm pipeline

stages:
  - validate
  - deploy

helm_lint:
  stage: validate
  image:
    name: alpine/helm:3.15.4
    entrypoint: [""]
  script:
    - helm lint charts/myapp

helm_template:
  stage: validate
  image:
    name: alpine/helm:3.15.4
    entrypoint: [""]
  script:
    # Render with the values of the environment being deployed (staging here).
    - helm template myapp charts/myapp -f values-staging.yaml > rendered.yaml
  artifacts:
    paths:
      - rendered.yaml

deploy_staging:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - kubectl apply -f rendered.yaml
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

For production, GitOps is often better than direct kubectl apply.

26. GitOps-style GitLab pipeline

Instead of deploying directly, the pipeline updates a deployment repo.

Example:

app repo builds image
        ↓
pipeline pushes image
        ↓
pipeline updates image tag in GitOps repo
        ↓
Argo CD / Flux applies change

This gives better auditability.

Example flow:

update_gitops_repo:
  stage: deploy
  image:
    name: alpine/git:latest
    entrypoint: [""]
  script:
    - git clone https://oauth2:${GITOPS_TOKEN}@gitlab.example.com/platform/gitops.git
    - cd gitops
    - sed -i "s/tag:.*/tag: ${CI_COMMIT_SHORT_SHA}/" apps/myapp/values-prod.yaml
    - git config user.email "ci@gitlab.example.com"
    - git config user.name "GitLab CI"
    - git add apps/myapp/values-prod.yaml
    - git commit -m "deploy: myapp ${CI_COMMIT_SHORT_SHA}"
    - git push
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

27. Advanced pipeline control: needs

By default, stages are sequential.

needs allows DAG-style pipelines.

Example:

stages:
  - test
  - package
  - deploy

unit_tests:
  stage: test
  script:
    - echo "unit tests"

security_scan:
  stage: test
  script:
    - echo "security scan"

package:
  stage: package
  needs:
    - unit_tests
  script:
    - echo "package without waiting for security_scan"

This makes pipelines faster.
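
How much faster depends on job durations. The Python sketch below compares the two orderings for the example above, using made-up durations in minutes. The function name and numbers are assumptions, and runner capacity is ignored.

```python
def finish_times(durations, needs):
    """Earliest finish time per job when each job starts as soon as its needs finish.

    durations: dict job -> minutes.
    needs:     dict job -> list of jobs it waits for.
    """
    done = {}

    def finish(job):
        if job not in done:
            start = max((finish(dep) for dep in needs.get(job, [])), default=0)
            done[job] = start + durations[job]
        return done[job]

    for job in durations:
        finish(job)
    return done


durations = {"unit_tests": 3, "security_scan": 10, "package": 2}

# Stage ordering: package waits for every job in the test stage.
by_stage = finish_times(durations, {"package": ["unit_tests", "security_scan"]})
# With needs: package waits only for unit_tests.
by_needs = finish_times(durations, {"package": ["unit_tests"]})

print(by_stage["package"], by_needs["package"])  # 12 5
```

The total pipeline time is still bounded by the slowest path, here security_scan, so needs helps most when the slow job is not on the path to deployment.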

28. Parallel jobs

test:
  stage: test
  parallel: 4
  script:
    - echo "Running test shard $CI_NODE_INDEX of $CI_NODE_TOTAL"

Useful for:

large test suites
matrix builds
multi-platform validation
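
GitLab only sets CI_NODE_INDEX (starting at 1) and CI_NODE_TOTAL; splitting the work is up to the job. A minimal Python sketch of one way to do that, with hypothetical test file names:

```python
import os


def shard(items, node_index, node_total):
    """Return the slice of items for a 1-based node_index out of node_total."""
    ordered = sorted(items)  # every node must see the same order
    return ordered[node_index - 1::node_total]


tests = ["test_api.py", "test_auth.py", "test_cli.py", "test_db.py", "test_ui.py"]
index = int(os.environ.get("CI_NODE_INDEX", "1"))
total = int(os.environ.get("CI_NODE_TOTAL", "1"))
print(shard(tests, index, total))
```

Striding through a sorted list keeps shards deterministic, but it does not balance by test duration; real suites often shard by recorded timings instead.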

29. Matrix builds

test:
  stage: test
  parallel:
    matrix:
      - PYTHON_VERSION: ["3.11", "3.12"]
        OS: ["ubuntu", "debian"]
  image: python:$PYTHON_VERSION
  script:
    - echo "Testing on $OS with Python $PYTHON_VERSION"
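
The matrix above expands to one job per combination of values. The expansion is a plain Cartesian product, sketched here in Python (the function name is made up for the example):

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict of variables per combination of the matrix values."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


jobs = expand_matrix({"PYTHON_VERSION": ["3.11", "3.12"], "OS": ["ubuntu", "debian"]})
for job in jobs:
    print(job)
print(len(jobs))  # 4
```

Job counts grow multiplicatively with every added variable, so large matrices can exhaust runner capacity quickly.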

30. Child pipelines

Useful for mono-repos.

Parent:

stages:
  - trigger

terraform:
  stage: trigger
  trigger:
    include: terraform/.gitlab-ci.yml

kubernetes:
  stage: trigger
  trigger:
    include: kubernetes/.gitlab-ci.yml

Benefits:

smaller pipeline files
domain-specific CI
better monorepo scalability

31. Multi-project pipelines

A pipeline in one project can trigger another.

Example:

trigger_deploy:
  stage: deploy
  trigger:
    project: platform/gitops
    branch: main

Useful for:

app repo triggering platform deployment repo
build repo triggering release repo
infra repo triggering environment repo

32. Includes and templates

Avoid huge .gitlab-ci.yml files.

Example:

include:
  - local: ci/templates/terraform.yml
  - local: ci/templates/security.yml

Include from another project:

include:
  - project: platform/ci-templates
    file: /terraform/base.yml

This allows centralised SRE CI standards.

33. YAML anchors

Useful for reuse.

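.default_terraform: &default_terraform
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  before_script:
    - terraform version
    - terraform init

plan:
  <<: *default_terraform
  script:
    - terraform plan

The &default_terraform anchor must be declared on the hidden job before the *default_terraform alias can reference it. Anchors only work within a single file.

However, GitLab CI has its own extends, which is often clearer and also works across included files.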

34. Extends

.terraform_base:
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  before_script:
    - terraform version
    - terraform init

terraform_plan:
  extends: .terraform_base
  script:
    - terraform plan

Good for platform-wide consistency.

35. before_script and after_script

before_script:
  - echo "Prepare environment"

after_script:
  - echo "Cleanup"

Per-job:

job:
  before_script:
    - echo "Job-specific setup"
  script:
    - echo "Main job"
  after_script:
    - echo "Collect logs"

36. Environments

GitLab environments model deployment targets.

deploy_staging:
  stage: deploy
  script:
    - echo "deploy"
  environment:
    name: staging
    url: https://staging.example.com

Production:

deploy_prod:
  stage: deploy
  script:
    - echo "deploy prod"
  environment:
    name: production
    url: https://example.com
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

Benefits:

deployment history
environment visibility
manual controls
rollback awareness

37. Review apps

Review apps create temporary environments for merge requests.

Example:

review_app:
  stage: deploy
  script:
    - echo "Deploy review app for $CI_COMMIT_REF_SLUG"
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    url: https://$CI_COMMIT_REF_SLUG.review.example.com
    on_stop: stop_review_app
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

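stop_review_app:
  stage: deploy
  variables:
    GIT_STRATEGY: none
  script:
    - echo "Destroy review app"
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      when: manual

The stop job needs rules that match review_app so both jobs land in the same pipeline. GIT_STRATEGY: none lets it run even after the source branch has been deleted.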

Useful for:

testing branch-specific changes
previewing UI/API changes
validating infrastructure modules

38. Resource groups

Prevent concurrent production deploys.

deploy_prod:
  stage: deploy
  script:
    - echo "deploy prod"
  resource_group: production

This ensures only one production deployment runs at a time.

Very important for SRE-controlled deployments.

39. Retry and timeout

flaky_test:
  script:
    - ./run-tests.sh
  retry: 2
  timeout: 30 minutes

Use carefully. Retrying hides real failures if abused.

40. Allow failure

experimental_scan:
  script:
    - ./scan.sh
  allow_failure: true

Good for:

new checks being introduced
non-blocking advisory scans
experimental jobs

Not good for critical checks.

41. Manual gates

deploy_prod:
  stage: deploy
  script:
    - ./deploy-prod.sh
  when: manual

Better:

deploy_prod:
  stage: deploy
  script:
    - ./deploy-prod.sh
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

Use for:

production deploys
Terraform apply
database migrations
dangerous maintenance tasks

42. Pipeline schedules

Use scheduled pipelines for:

nightly builds
drift detection
dependency scanning
backup verification
certificate expiry checks
Terraform plan against production
container rebuilds

Example scheduled job:

drift_detection:
  stage: validate
  script:
    - terraform init
    - terraform plan -detailed-exitcode
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'

43. Advanced Terraform drift detection

terraform_drift:
  stage: validate
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init
    - terraform plan -detailed-exitcode
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  allow_failure: true

Terraform detailed exit codes:

0 = no changes
1 = error
2 = changes detected

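For stricter behaviour, wrap it. GitLab Runner stops a job script at the first failing command, so capture the exit code with || rather than reading $? on the following line:

code=0
terraform plan -detailed-exitcode || code=$?

if [ "$code" -eq 0 ]; then
  echo "No drift"
elif [ "$code" -eq 2 ]; then
  echo "Drift detected"
  exit 1
else
  echo "Terraform error"
  exit 1
fi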

44. Pipeline for observability config

Example:

stages:
  - validate
  - deploy

prometheus_rules:
  stage: validate
  image:
    name: prom/prometheus:v2.55.0
    entrypoint: [""]
  script:
    - promtool check rules prometheus/rules/*.yaml

alertmanager_config:
  stage: validate
  image:
    name: prom/alertmanager:v0.27.0
    entrypoint: [""]
  script:
    - amtool check-config alertmanager/alertmanager.yml

grafana_dashboards:
  stage: validate
  image: python:3.12
  script:
    - python scripts/validate-dashboards.py grafana/dashboards/

deploy_observability:
  stage: deploy
  script:
    - echo "Deploy via GitOps or API"
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

This is a strong SRE use case.
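
The pipeline above calls scripts/validate-dashboards.py without showing it. A minimal sketch of what such a script might contain follows. The required keys are an assumed house rule, not a Grafana requirement, and all function names are hypothetical.

```python
import json
import sys
from pathlib import Path

REQUIRED_KEYS = ("title", "uid", "panels")  # assumed house rules


def dashboard_problems(dashboard):
    """Return a list of problems found in one parsed dashboard."""
    if not isinstance(dashboard, dict):
        return ["top level must be a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in dashboard]
    if not isinstance(dashboard.get("panels", []), list):
        problems.append("panels must be a list")
    return problems


def validate_directory(directory):
    """Check every *.json file under directory; return {path: [problems]}."""
    failures = {}
    for path in sorted(Path(directory).rglob("*.json")):
        try:
            dashboard = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            failures[str(path)] = [f"invalid JSON: {exc}"]
            continue
        problems = dashboard_problems(dashboard)
        if problems:
            failures[str(path)] = problems
    return failures


def main(argv):
    """Print problems and return a process exit code (0 = all dashboards pass)."""
    failures = validate_directory(argv[1])
    for path, problems in failures.items():
        print(f"{path}: {'; '.join(problems)}", file=sys.stderr)
    return 1 if failures else 0
```

In the CI job the file would end with sys.exit(main(sys.argv)), so that any failing dashboard fails the validate stage.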

45. Pipeline for Kubernetes manifests

stages:
  - validate

kubeconform:
  stage: validate
  image:
    # The default image has no shell; the -alpine variant does.
    name: ghcr.io/yannh/kubeconform:latest-alpine
    entrypoint: [""]
  script:
    - /kubeconform -strict -summary manifests/*.yaml

kubectl_dry_run:
  stage: validate
  image: bitnami/kubectl:latest
  script:
    - kubectl apply --dry-run=client -f manifests/

Better with server-side validation in a test cluster:

kubectl apply --dry-run=server -f manifests/

46. Policy checks

Example with Conftest:

policy_check:
  stage: validate
  image: openpolicyagent/conftest:latest
  script:
    - conftest test manifests/

Typical policies:

no privileged containers
CPU/memory limits required
owner label required
no latest image tag
no public ingress without annotation
production requires replicas >= 2
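
Conftest policies themselves are written in Rego. The Python sketch below is not a Conftest policy; it only illustrates the logic behind three of the rules above, applied to a pod spec that has already been parsed into a dict. The function name and messages are made up.

```python
def pod_spec_violations(pod_spec):
    """Return policy violations for a Kubernetes pod spec given as a dict."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Look for a tag only in the last path segment, so that a registry
        # port such as registry:5000/app is not mistaken for a tag.
        has_tag = ":" in image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if has_tag else "latest"
        if tag == "latest":
            violations.append(f"{name}: image tag must be pinned")
        if container.get("securityContext", {}).get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: {resource} limit is required")
    return violations


spec = {"containers": [{"name": "web", "image": "nginx"}]}
for violation in pod_spec_violations(spec):
    print(violation)
```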

47. Pipeline for GitLab Runner health

For SRE-managed GitLab, monitor runners.

Useful checks:

runner online/offline
runner job queue time
runner failure rate
runner disk space
runner CPU/memory pressure
Docker daemon health
Kubernetes executor pod failures
cache backend latency
artifact upload failures

Operational commands:

sudo gitlab-runner verify
sudo gitlab-runner list
sudo gitlab-runner status
sudo journalctl -u gitlab-runner -f

48. Runner scaling

Small setup

1 GitLab CE VM
1 or 2 Docker runners
local disk cache
manual production deploys

Medium setup

GitLab CE VM
separate runner VMs
runner tags by workload
S3-compatible cache
container registry
protected production runner

Larger setup

GitLab CE with external PostgreSQL/Redis
multiple runners
Kubernetes executor
autoscaling runners
object storage for artifacts/cache
monitoring and alerting
backup and restore testing

49. Runner capacity planning

Important metrics:

pipeline duration
queued duration
job concurrency
CPU usage
memory usage
disk I/O
network throughput
cache hit rate
artifact upload time
container pull time
failure rate

Symptoms of undercapacity:

jobs pending for a long time
pipelines blocked waiting for runners
Docker pull time dominates
runner host high load
frequent job timeouts
disk full on runner

Fixes:

increase concurrent
add runners
split runner pools
use caching
pre-pull common images
use Kubernetes executor
avoid oversized artifacts
optimise pipeline DAG
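
A rough first estimate of how many job slots a runner pool needs comes from Little's law: average busy slots = job arrival rate × average job duration. The sketch below applies it with a headroom factor. The function name, the 1.5 default and the example numbers are assumptions, not GitLab guidance.

```python
import math


def required_concurrency(jobs_per_hour, avg_job_minutes, headroom=1.5):
    """Estimate job slots needed, using Little's law plus a headroom factor."""
    average_busy_slots = jobs_per_hour * avg_job_minutes / 60
    return math.ceil(average_busy_slots * headroom)


# 120 jobs per hour at 4 minutes each keeps 8 slots busy on average.
print(required_concurrency(jobs_per_hour=120, avg_job_minutes=4))  # 12
```

Averages hide bursts, so compare the estimate against queued duration at peak times before changing concurrent.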

50. Useful GitLab CI variables

Common built-in variables:

CI_COMMIT_SHA
CI_COMMIT_SHORT_SHA
CI_COMMIT_BRANCH
CI_COMMIT_TAG
CI_COMMIT_REF_SLUG
CI_PIPELINE_SOURCE
CI_PROJECT_DIR
CI_PROJECT_NAME
CI_PROJECT_PATH
CI_REGISTRY
CI_REGISTRY_IMAGE
CI_JOB_ID
CI_JOB_URL
CI_PIPELINE_ID
CI_PIPELINE_URL
CI_ENVIRONMENT_NAME
CI_DEFAULT_BRANCH

Example:

script:
  - echo "Commit: $CI_COMMIT_SHA"
  - echo "Branch: $CI_COMMIT_BRANCH"
  - echo "Image: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

51. Common SRE GitLab CI/CD patterns

Pattern 1: validate everything on MR

validate:
  script:
    - ./ci/validate.sh
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

Purpose:

catch errors before merge
protect main
improve review quality

Pattern 2: deploy only from main

deploy:
  script:
    - ./deploy.sh
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

Purpose:

only reviewed code reaches shared environments

Pattern 3: production deploy is manual

deploy_prod:
  script:
    - ./deploy-prod.sh
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

Purpose:

human gate for production

Pattern 4: tags create releases

release:
  script:
    - ./release.sh
  rules:
    - if: '$CI_COMMIT_TAG'

Purpose:

versioned release process

Pattern 5: scheduled drift detection

drift:
  script:
    - ./terraform-drift.sh
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'

Purpose:

detect infrastructure drift

52. Advanced pipeline example for SRE/IaC repo

stages:
  - lint
  - validate
  - security
  - plan
  - apply

variables:
  TF_IN_AUTOMATION: "true"
  TF_INPUT: "false"

workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == "main"'
    - if: '$CI_PIPELINE_SOURCE == "schedule"'

terraform_fmt:
  stage: lint
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform fmt -check -recursive

yamllint:
  stage: lint
  image: cytopia/yamllint:latest
  script:
    - yamllint .

terraform_validate:
  stage: validate
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init -backend=false
    - terraform validate

checkov:
  stage: security
  image: bridgecrew/checkov:latest
  script:
    - checkov -d .
  allow_failure: false

terraform_plan:
  stage: plan
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan
    expire_in: 1 day
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == "main"'

terraform_apply:
  stage: apply
  image:
    name: hashicorp/terraform:1.9.8
    entrypoint: [""]
  script:
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - terraform_plan
  resource_group: production-infra
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual

53. Common GitLab CI/CD mistakes

Using shell runners for everything

Problem:

poor isolation
host contamination
security risk
hard-to-reproduce builds

Better:

Docker or Kubernetes executor for most jobs
shell only for trusted admin automation

Putting secrets in .gitlab-ci.yml

Bad:

variables:
  PASSWORD: "supersecret"

Better:

GitLab CI/CD protected masked variables
Vault integration
short-lived tokens
environment-scoped secrets

Using latest everywhere

Bad:

image: terraform:latest

Better:

image: hashicorp/terraform:1.9.8

No branch protection

Problem:

anyone can modify main
production variables exposed
deployment jobs unsafe

Better:

protected main
protected prod variables
protected prod runners
MR approvals
CODEOWNERS

One huge pipeline

Problem:

slow
hard to debug
hard to reuse
poor ownership

Better:

includes
templates
child pipelines
domain-specific jobs
DAG with needs

54. Troubleshooting GitLab runners

Job stuck pending

Likely causes:

no runner assigned
runner offline
tag mismatch
protected runner cannot run unprotected branch
runner reached concurrency limit
runner locked to another project

Check:

sudo gitlab-runner verify
sudo gitlab-runner list
sudo systemctl status gitlab-runner
sudo journalctl -u gitlab-runner -f

Job fails immediately

Likely causes:

bad image
bad entrypoint
script syntax error
missing shell
runner cannot pull image
authentication problem

Docker build fails

Likely causes:

Docker daemon unavailable
privileged mode missing
DinD TLS mismatch
registry login failed
disk full
network cannot pull base image

Artifacts fail to upload

Likely causes:

GitLab storage issue
Nginx/body size limit
object storage problem
large artifact
network timeout

Cache not working

Likely causes:

wrong cache key
cache backend missing
runner cache path misconfigured
different runners with no shared cache
permissions issue

Production job cannot access secret

Likely causes:

variable is protected but branch is not protected
variable environment scope does not match
masked variable contains unsupported characters
runner is not protected
job is running from MR/fork
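
The first two causes can be reasoned about with a small model. The Python sketch below is a simplification of GitLab's behaviour, not its implementation: the dict keys mirror the variable settings in the UI, and shell-style wildcard matching stands in for GitLab's environment scope matching.

```python
from fnmatch import fnmatchcase


def variable_visible(variable, ref_protected, environment=None):
    """Return True if a CI/CD variable would be exposed to a job (simplified)."""
    if variable.get("protected", False) and not ref_protected:
        return False  # protected variables need a protected branch or tag
    scope = variable.get("environment_scope", "*")
    if scope == "*":
        return True  # the default scope applies to every job
    return environment is not None and fnmatchcase(environment, scope)


prod_secret = {"protected": True, "environment_scope": "production"}
print(variable_visible(prod_secret, ref_protected=False, environment="production"))  # False
print(variable_visible(prod_secret, ref_protected=True, environment="production"))   # True
```

A job with no environment: block never matches a scoped variable, which is a common reason a secret appears to be missing.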

55. What an experienced SRE should say

A strong SRE explanation:

GitLab CE gives us self-hosted Git, merge requests, access control and CI/CD. The key operational component is GitLab Runner, which executes jobs using shell, Docker, Kubernetes or other executors. I would separate runners by trust level and workload: general Docker runners for build/test, restricted protected runners for production deploys, and possibly Kubernetes runners for scalable ephemeral CI. Pipelines should validate on merge requests, deploy only from protected branches, use masked/protected variables, generate artifacts such as Terraform plans, and use manual gates for production. Advanced usage includes DAG pipelines with needs, child pipelines, reusable includes, policy-as-code, scheduled drift detection, GitOps deployment flows, and resource groups to prevent concurrent production changes.

56. Practical SRE checklist

You should know how to:

install GitLab CE
configure /etc/gitlab/gitlab.rb
run gitlab-ctl reconfigure
backup and restore GitLab
install GitLab Runner
register shell, Docker and Kubernetes runners
use runner tags
secure protected runners
write .gitlab-ci.yml
use stages, jobs, variables, artifacts and cache
write rules
use manual gates
build container images
run Terraform plan/apply
run Ansible jobs
validate Kubernetes and Helm manifests
use includes and templates
use child pipelines
use scheduled pipelines
detect drift
protect production branches and variables
troubleshoot stuck jobs
monitor runner health

57. Core summary

For an SRE:

GitLab CE = self-hosted Git platform
GitLab Runner = execution engine
.gitlab-ci.yml = automation definition
CI = validate, test, scan, build
CD = deploy, release, reconcile
GitOps = Git becomes desired state

A mature GitLab setup should make production changes:

reviewed
tested
auditable
repeatable
reversible
secure
observable

That is why GitLab CI/CD is one of the most important practical tools for SRE, platform engineering and infrastructure automation.

GitHub for SREs: From Basic to Advanced

Most engineers initially think of GitHub as “Git in the cloud.”

That is true, but for modern SREs, platform engineers, cloud engineers, and DevOps teams, GitHub is really a software delivery platform consisting of:

Git Repositories
Pull Requests
Issues
Projects
Actions (CI/CD)
Packages
Container Registry
Security Scanning
Dependabot
Code Owners
Environments
Secrets Management
Webhooks
Apps & Integrations
REST APIs
GraphQL APIs
GitOps Integration

GitHub is effectively:

GitLab SaaS equivalent
+
Marketplace ecosystem
+
Developer platform
+
API platform

1. GitHub Architecture

At a high level:

Developer
    |
    v
 GitHub Repository
    |
    +--> Pull Requests
    |
    +--> Actions
    |
    +--> Security
    |
    +--> Webhooks
    |
    +--> APIs
    |
    +--> Packages

GitHub stores:

source code
infrastructure code
documentation
Helm charts
Terraform modules
Kubernetes manifests
GitHub Actions workflows

2. GitHub Cloud vs GitHub Enterprise

GitHub.com

SaaS service.

GitHub manages:

servers
backups
scaling
availability
security
upgrades

You manage:

repositories
users
permissions
workflows

GitHub Enterprise Server

Self-hosted.

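Comparable in role to a self-managed GitLab instance, but it is a licensed commercial product with no free edition like GitLab CE.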

Used by:

banks
government
defence
regulated environments

Provides:

private deployment
air-gapped environments
full data ownership
custom integrations

3. GitHub Repository Structure

Example:

repo/
├── .github/
│   ├── workflows/
│   ├── CODEOWNERS
│   └── dependabot.yml
├── terraform/
├── kubernetes/
├── ansible/
├── src/
└── README.md

Special GitHub folder:

.github/

Contains:

Actions workflows
issue templates
PR templates
dependabot config
security policies

4. GitHub Authentication

Historically:

username/password

No longer supported: GitHub removed password authentication for Git operations in August 2021.

Modern methods:

PAT (Personal Access Token)
SSH Keys
GitHub App Tokens
OIDC Tokens
Deploy Keys
GITHUB_TOKEN

5. Personal Access Tokens (PATs)

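A common way to authenticate scripts and Git over HTTPS.

Example:

git clone https://github.com/org/project.git

When prompted, enter the PAT in place of a password. The token can also be embedded in the URL:

https://username:TOKEN@github.com/org/project.git

Avoid the URL form outside throwaway CI jobs: the token ends up in .git/config and in shell history.

Classic PATs are limited with coarse scopes such as:

repo
workflow
read:packages
write:packages
admin:org

Fine-grained PATs are restricted to selected repositories, with read or write access set per permission.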

Best practice:

least privilege
short lifetime
rotation

6. SSH Keys

Very common.

Generate:

ssh-keygen -t ed25519

Add public key:

GitHub
→ Settings
→ SSH and GPG keys

Clone:

git clone git@github.com:org/repo.git

Benefits:

secure
easy automation
widely used

7. GitHub Apps

Modern integration mechanism.

Instead of:

long-lived PAT

GitHub Apps use:

signed JWT
short-lived installation access tokens (valid for one hour)
fine-grained permissions

Used by:

ArgoCD
Dependabot
Renovate
Backstage
Jenkins
Atlantis
Terraform Cloud

Preferred over PATs.

8. GitHub REST API

GitHub exposes almost everything through APIs.

Example:

curl \
  -H "Authorization: Bearer TOKEN" \
  https://api.github.com/repos/org/repo

Use cases:

create repositories
manage PRs
create issues
manage runners
read workflow status
manage secrets
query commits
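
The curl call above translates directly to Python's standard library. This is a minimal sketch: the function names are made up, org/repo and TOKEN are placeholders as in the curl example, and the two extra headers are the media type and API version GitHub's documentation recommends sending.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_repo_request(owner, repo, token):
    """Build (but do not send) a GET request for repository metadata."""
    return urllib.request.Request(
        f"{API_ROOT}/repos/{owner}/{repo}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def fetch_repo(owner, repo, token):
    """Send the request and return the decoded JSON body (needs network access)."""
    request = build_repo_request(owner, repo, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


request = build_repo_request("org", "repo", "TOKEN")
print(request.get_method(), request.full_url)  # GET https://api.github.com/repos/org/repo
```

Separating request construction from sending keeps the code testable without touching the network or a real token.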

9. GitHub GraphQL API

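More flexible than REST for read-heavy automation: one query returns exactly the fields requested, across related objects, in a single round trip.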

Example:

{
  repository(name:"repo", owner:"org") {
    pullRequests(first:10) {
      nodes {
        title
        state
      }
    }
  }
}

Useful for:

automation
dashboards
reporting
large-scale repository management
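
GraphQL queries are sent as a JSON POST to a single endpoint. The sketch below wraps the query above with variables instead of hard-coded names. The function name is hypothetical, and TOKEN, org and repo remain placeholders.

```python
import json
import urllib.request

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 10) {
      nodes {
        title
        state
      }
    }
  }
}
"""


def build_graphql_request(owner, name, token):
    """Build (but do not send) a POST request carrying the query and its variables."""
    payload = json.dumps({"query": QUERY, "variables": {"owner": owner, "name": name}})
    return urllib.request.Request(
        "https://api.github.com/graphql",
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_graphql_request("org", "repo", "TOKEN")
print(request.get_method())  # POST
```

Passing owner and name as variables avoids building query strings by hand, which matters once values come from user input.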

10. Webhooks

GitHub can notify systems when events occur.

Example:

push
pull_request (a merge arrives as action "closed" with merged set to true)
issues
release
workflow_run (workflow completed)

Example:

GitHub
   |
   +----> Jenkins
   |
   +----> Slack
   |
   +----> ArgoCD
   |
   +----> Internal Platform
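
A receiver should verify that a delivery really came from GitHub. When a webhook secret is configured, GitHub sends an X-Hub-Signature-256 header containing "sha256=" followed by the HMAC-SHA256 hex digest of the raw request body. A minimal verification sketch follows; the function name and example values are made up.

```python
import hashlib
import hmac


def signature_is_valid(secret, body, signature_header):
    """Check the X-Hub-Signature-256 header against the raw request body.

    secret and body are bytes; signature_header is the header value or None.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header or ""
    # compare_digest avoids leaking how many characters matched via timing.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


secret = b"example-webhook-secret"
body = b'{"action": "opened"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(signature_is_valid(secret, body, header))           # True
print(signature_is_valid(secret, b"tampered", header))    # False
```

The check must run on the raw bytes as received; re-serialising parsed JSON changes the bytes and breaks the signature.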

11. GitHub Actions

This is GitHub’s CI/CD platform.

Equivalent to:

GitLab CI/CD
Jenkins Pipelines
Azure DevOps Pipelines
CircleCI

Actions are defined in:

.github/workflows/

Example:

.github/workflows/build.yml

12. Basic Workflow

Example:

name: Build

on:
  push:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - run: echo "Hello World"

Workflow:

Push
  |
  v
GitHub Actions
  |
  v
Runner
  |
  v
Job executes

13. Workflow Structure

name:
on:
jobs:
steps:

Example:

name: CI

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - run: npm test

14. Events

Actions can trigger on events.

Examples:

on:
  push:

on:
  pull_request:

on:
  release:

on:
  workflow_dispatch:

on:
  schedule:

15. Manual Pipelines

Equivalent to GitLab:

on:
  workflow_dispatch:

Provides:

Run Workflow button

Useful for:

Terraform apply
Production deploy
Database migration
Disaster recovery

16. Scheduled Workflows

Equivalent to cron.

Example:

on:
  schedule:
    - cron: "0 2 * * *"

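Runs daily at 02:00 UTC. Schedules are always evaluated in UTC and can start late when GitHub Actions is under heavy load.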

Useful for:

drift detection
certificate checks
backup validation
dependency updates

17. Jobs

A workflow contains jobs.

Example:

jobs:
  lint:
  test:
  build:

Jobs run in parallel by default.

Unlike GitLab, there are no stages: ordering comes only from needs.

18. Dependencies

Example:

jobs:
  build:

  deploy:
    needs: build

Equivalent to:

GitLab needs:

Creates DAG pipelines.

19. Runners

GitHub Actions jobs execute on runners.

Equivalent to GitLab Runners.

Options:

GitHub-hosted
Self-hosted
Larger runners
ARM runners
GPU runners

20. GitHub Hosted Runners

GitHub provides:

runs-on: ubuntu-latest

Examples:

runs-on: ubuntu-latest
runs-on: windows-latest
runs-on: macos-latest

Advantages:

easy
maintained
ephemeral
secure

Disadvantages:

limited customization
usage costs

21. Self Hosted Runners

You provide infrastructure.

Example:

runs-on: self-hosted

Common labels:

runs-on:
  - self-hosted
  - linux
  - terraform

Useful for:

internal deployments
private networks
GPU builds
Kubernetes management
Terraform applies

22. Self Hosted Runner Architecture

GitHub
   |
   v
Self Hosted Runner
   |
   +---- Terraform
   +---- Kubernetes
   +---- Ansible
   +---- Internal APIs

Runner polls GitHub.

Receives jobs.

Executes locally.

23. Actions Marketplace

Huge GitHub advantage.

Examples:

uses: actions/checkout@v4

uses: docker/build-push-action@v6

uses: hashicorp/setup-terraform@v3

uses: azure/setup-kubectl@v4

Thousands available.

24. Reusable Actions

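A composite action bundles several steps behind one uses: line. It lives in a directory containing an action.yml file.

Example:

.github/actions/setup/action.yml

runs:
  using: composite

Workflows in the same repository call it with uses: ./.github/actions/setup. Publishing it from a shared repository makes it reusable across the organization.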

25. Reusable Workflows

Example:

jobs:
  call-workflow:
    uses: org/platform/.github/workflows/terraform.yml@main

Equivalent to GitLab CI templates. The called workflow must declare on: workflow_call.

Very useful for platform teams.

26. Secrets

GitHub stores secrets.

Repository
Environment
Organization

Example:

AWS_ACCESS_KEY_ID

Access:

${{ secrets.AWS_ACCESS_KEY_ID }}

Never hardcode credentials.

27. Variables

Example:

${{ vars.ENVIRONMENT }}

Useful for:

regions
URLs
cluster names
project IDs

28. Environments

Examples:

dev
staging
prod

Provide:

approval gates
secret scoping
deployment history
protection rules

29. Production Protection

Example:

Production Environment

Requires:
  Approval
  Specific reviewers
  Protected branch

Equivalent to GitLab protected environments.

30. Basic CI Pipeline

name: CI

on:
  pull_request:

jobs:
  test:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - run: npm ci

      - run: npm test

31. Docker Build Pipeline

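jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}

This builds an OCI-compatible container image and pushes it. The example assumes GitHub Container Registry (ghcr.io) as the target and an all-lowercase repository path, since image names must be lowercase.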

32. Terraform Pipeline

jobs:
  terraform:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - run: terraform fmt -check

      - run: terraform init

      - run: terraform validate

      - run: terraform plan

33. Advanced Terraform

Production pattern:

PR
 ↓
fmt
 ↓
validate
 ↓
security scan
 ↓
plan
 ↓
approval
 ↓
apply

Tools:

Checkov
tfsec
OPA
Conftest
Infracost

34. Kubernetes Pipeline

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: azure/setup-kubectl@v4

      - run: kubectl apply -f manifests/

Better:

GitOps

rather than direct deployment.

35. GitOps Pattern

Preferred architecture:

App Repo
    |
    +--> Build Image
    |
    +--> Push Image
    |
    +--> Update GitOps Repo
              |
              +--> ArgoCD
              |
              +--> Flux

This creates:

audit trail
rollback
change control

36. OIDC Authentication

Modern best practice.

Instead of:

AWS Keys
Azure Secrets
GCP Service Accounts

Use:

GitHub OIDC

Example:

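permissions:
  id-token: write
  contents: read

id-token: write lets the job request an OIDC token. contents: read is listed explicitly because setting permissions removes every scope that is not named.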

Benefits:

short-lived credentials
no stored cloud secrets
better security

Used heavily by:

AWS IAM
Azure Entra ID
Google Cloud Workload Identity Federation

37. Matrix Builds

Run multiple builds.

Example:

strategy:
  matrix:
    os:
      - ubuntu-latest
      - windows-latest

    version:
      - "3.11"
      - "3.12"

Quote version numbers: an unquoted 3.10 is parsed by YAML as the number 3.1.

Creates:

Ubuntu + 3.11
Ubuntu + 3.12
Windows + 3.11
Windows + 3.12

Automatically.

38. Parallel Testing

Example:

strategy:
  matrix:
    shard: [1,2,3,4]

Useful for:

large test suites

39. Artifact Management

Example:

uses: actions/upload-artifact@v4

Store:

Terraform plans
reports
binaries
SBOMs
test results

40. Cache

Example:

uses: actions/cache@v4

Speeds up:

npm
maven
pip
terraform providers

41. Security Scanning

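GitHub's security features include:

CodeQL code scanning
Secret scanning
Dependency review
Dependabot alerts and updates

Dependabot is available on all repositories. Code scanning and secret scanning are free for public repositories; private repositories generally need a paid GitHub Advanced Security licence.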

42. Dependabot

Automatically opens pull requests that update dependencies.

Configured in:

.github/dependabot.yml

Great for:

Terraform providers
Helm charts
npm packages
Python packages
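
A configuration file could look like the sketch below. The directories and the weekly schedule are assumptions, and the github-actions entry is an addition to the list above that keeps the action versions used in workflows current.

```yaml
# .github/dependabot.yml
# Sketch only: directories and schedules are illustrative.
version: 2
updates:
  - package-ecosystem: "terraform"
    directory: "/terraform"
    schedule:
      interval: "weekly"

  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"

  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"

  # Keeps the actions referenced in workflow files up to date.
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```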

43. CodeQL

GitHub’s SAST engine.

Example:

uses: github/codeql-action/init@v3

Supported languages include:

Go
Python
Java
JavaScript
C#
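
The init step on its own only prepares the analysis; a later analyze step runs the queries and uploads the results. A minimal sketch for a Python repository follows (the language choice and the triggers are assumptions; compiled languages usually need a build step between init and analyze).

```yaml
# Sketch only: assumes a Python codebase and these triggers.
name: codeql

on:
  pull_request:
  push:
    branches: [main]

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # upload results to code scanning
      contents: read
    steps:
      - uses: actions/checkout@v4

      # Prepares the CodeQL database for the chosen language.
      - uses: github/codeql-action/init@v3
        with:
          languages: python

      # Runs the queries and uploads the findings.
      - uses: github/codeql-action/analyze@v3
```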

44. Branch Protection

Recommended:

Require PR
Require review
Require checks
Require signed commits
Require linear history

Protect:

main
production
release/*

45. CODEOWNERS

Example:

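# Teams are written as @org/team-name; a leading slash anchors the path to the repository root.
/terraform/prod/ @org/platform-team
/kubernetes/prod/ @org/sre-team

Automatically requests reviews from the listed owners when a pull request touches a matching path.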

46. Common SRE Use Cases

GitHub Actions excels at:

Infrastructure

Terraform
OpenTofu
CloudFormation
Pulumi

Kubernetes

Helm validation
Manifest validation
GitOps updates

Observability

Prometheus rule validation
Grafana dashboard validation
Loki config validation
OpenTelemetry config validation

Security

SAST
Secrets scanning
Policy checks
Compliance checks
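
Two of these checks could be sketched as a pull request workflow like the one below. The chart path, release name, and rules directory are hypothetical. The sketch also assumes helm and Docker are available on the runner and that promtool is on the PATH inside the prom/prometheus image.

```yaml
# Sketch only: paths and names are hypothetical.
name: validate

on:
  pull_request:

jobs:
  helm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Static checks on the chart structure and templates.
      - run: helm lint charts/my-app

      # Render the chart to catch templating errors before deployment.
      - run: helm template my-app charts/my-app > /dev/null

  prometheus-rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Run promtool from the Prometheus image; the shell inside the
      # container expands the glob.
      - run: |
          docker run --rm -v "$PWD:/work" -w /work \
            --entrypoint /bin/sh prom/prometheus \
            -c 'promtool check rules rules/*.yml'
```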

47. Advanced Enterprise Patterns

Large organizations often use:

Reusable workflows
OIDC
Self-hosted runners
GitHub Apps
GitOps
Environment approvals
CODEOWNERS
Security gates
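
As a sketch of the first item, a platform team could publish a reusable workflow and let each team call it. The repository name, file name, version tag, and input name are all hypothetical. The two YAML documents below would live in two separate files.

```yaml
# File 1 (hypothetical): my-org/platform-workflows/.github/workflows/terraform-plan.yml
on:
  workflow_call:
    inputs:
      working-directory:
        type: string
        required: true

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ${{ inputs.working-directory }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -input=false
---
# File 2 (hypothetical): a team repository calling the shared workflow
on:
  pull_request:

jobs:
  terraform:
    uses: my-org/platform-workflows/.github/workflows/terraform-plan.yml@v1
    with:
      working-directory: terraform/prod
    # Passes the caller's secrets through; works within the same organization or enterprise.
    secrets: inherit
```

Pinning the call to a tag or commit SHA lets the platform team change the shared workflow without surprising its consumers.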

Architecture:

Developers
      |
      v
Pull Request
      |
      v
Actions Workflow
      |
      +--> Lint
      +--> Tests
      +--> Security
      +--> Terraform Plan
      +--> Build Image
      |
      v
Approval
      |
      v
GitOps Repository
      |
      v
Argo CD / Flux
      |
      v
Production

48. What an SRE should say in an interview

A strong answer:

GitHub is more than Git hosting; it’s a developer platform that exposes repositories, APIs, webhooks, GitHub Apps, Actions, Packages and security tooling. GitHub Actions provides CI/CD through workflow definitions stored in the repository. For production infrastructure I would use pull requests, branch protection, CODEOWNERS, reusable workflows, self-hosted runners where internal access is needed, OIDC instead of long-lived cloud credentials, security scanning, Terraform plan/apply separation, and GitOps deployments through Argo CD or Flux. This gives an auditable, automated and secure software delivery platform.