Step 1: Check PVE and GPU

PVE0 has the GPU:

sont@sont-LOQ-15ARP9:~$ ssh pve0
Linux pve0 7.0.2-6-pve #1 SMP PREEMPT_DYNAMIC PMX 7.0.2-6 (2026-05-20T08:55Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Jun 26 17:56:47 2026 from 192.168.1.115
sont@pve0:~$ sudo -i
root@pve0:~# lspci -nn | grep -Ei 'nvidia|vga|3d|display|audio'
00:1b.0 Audio device [0403]: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller [8086:3a3e]
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
03:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)

The GPU and its audio function share IOMMU group 16, with no other devices in the group:

IOMMU Group 16
	03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
	03:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
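The group listing above can be reproduced from sysfs. The following is a minimal sketch, assuming the standard /sys/bus/pci/devices layout; the function name and the optional second argument (an alternate sysfs root, useful only for testing) are illustrative and not part of the original notes.

```shell
# List the PCI addresses that share an IOMMU group with the given device.
# $1: full PCI address, e.g. 0000:03:00.0
# $2: sysfs root (hypothetical parameter for testing; defaults to /sys)
iommu_group_members() {
  addr=$1
  sysfs=${2:-/sys}
  group_dir="$sysfs/bus/pci/devices/$addr/iommu_group/devices"
  if [ ! -d "$group_dir" ]; then
    echo "no IOMMU group for $addr (is IOMMU enabled?)" >&2
    return 1
  fi
  ls -1 "$group_dir"
}
```

On pve0, `iommu_group_members 0000:03:00.0` would be expected to print only the two GPU functions if the group is as clean as the listing above suggests.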

Step 2: Enable IOMMU on Proxmox

Check CPU vendor:

lscpu | grep -i vendor

For Intel, edit GRUB:

nano /etc/default/grub

Set:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

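For AMD (the AMD IOMMU driver is normally enabled by default, so amd_iommu=on is harmless but usually not required; iommu=pt is the setting that matters):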

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
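The choice between the two parameter sets can be scripted from the lscpu output. This is a small sketch under the assumption that lscpu prints a "Vendor ID:" line with GenuineIntel or AuthenticAMD; the function name is hypothetical, and the parameters it prints are simply the ones listed in this step.

```shell
# Map the CPU vendor reported by lscpu to the kernel parameters used in this step.
# Reads lscpu output on stdin; fails for an unrecognized vendor.
iommu_kernel_params() {
  vendor=$(grep -E '^Vendor ID:' | head -n1 |
    awk -F: '{gsub(/[[:space:]]/, "", $2); print $2}')
  case $vendor in
    GenuineIntel) echo "intel_iommu=on iommu=pt" ;;
    AuthenticAMD) echo "amd_iommu=on iommu=pt" ;;
    *) echo "unrecognized CPU vendor: $vendor" >&2; return 1 ;;
  esac
}
```

Usage would look like `lscpu | iommu_kernel_params`; the result still has to be placed into GRUB_CMDLINE_LINUX_DEFAULT by hand.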

Update bootloader:

update-grub

Load VFIO modules (on kernel 6.2 and later, vfio_virqfd is built into vfio, so that line can be omitted):

cat >/etc/modules-load.d/vfio.conf <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF

Bind the GPU to vfio-pci. The IDs below are the GTX 970 and its audio function from Step 1; replace them with yours if your card differs:

cat >/etc/modprobe.d/vfio.conf <<'EOF'
options vfio-pci ids=10de:13c2,10de:0fbb disable_vga=1
EOF
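The ids= value can be derived from the lspci output rather than typed by hand. A minimal sketch, assuming `lspci -nn` prints each vendor:device pair in square brackets as in Step 1; the function name is illustrative.

```shell
# Build the ids= value for vfio-pci from `lspci -nn` output on stdin.
# Extracts every [vendor:device] pair and joins them with commas.
# Class codes such as [0300] have no colon and are therefore skipped.
vfio_ids_from_lspci() {
  grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' | tr -d '[]' | paste -sd, -
}
```

For the card in Step 1, `lspci -nn -s 03:00 | vfio_ids_from_lspci` should yield `10de:13c2,10de:0fbb`.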

Blacklist host NVIDIA/Nouveau drivers on the Proxmox host:

cat >/etc/modprobe.d/blacklist-gpu.conf <<'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
EOF

Rebuild initramfs:

update-initramfs -u -k all
reboot

After reboot:

dmesg | grep -Ei 'iommu|dmar'
lspci -nnk -s 03:00.0
lspci -nnk -s 03:00.1

Expected:

Kernel driver in use: vfio-pci
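The check for the expected driver line can be automated. This is a sketch that assumes the usual `lspci -nnk` layout, where the bound driver appears on an indented "Kernel driver in use:" line; both function names are hypothetical.

```shell
# Print the kernel driver bound to a device, given `lspci -nnk -s <addr>` output on stdin.
bound_driver() {
  sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p' | head -n1
}

# Succeed only if the device on stdin is bound to vfio-pci.
is_vfio_bound() {
  [ "$(bound_driver)" = "vfio-pci" ]
}
```

A possible use is `lspci -nnk -s 03:00.0 | is_vfio_bound && echo OK`, repeated for 03:00.1.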

Step 3: Pass the GPU into the OpenStack compute VM

Shut down the OpenStack compute VM first.

Example if your GPU compute VM ID is 1212:

qm shutdown 1212

Set machine type and CPU model:

qm set 1212 --machine q35
qm set 1212 --cpu host

Enable NUMA for the VM (nested virtualization itself relies on --cpu host above and on nested KVM being enabled on the Proxmox host):

qm set 1212 --numa 1

Pass through both GPU functions:

qm set 1212 --hostpci0 03:00.0,pcie=1
qm set 1212 --hostpci1 03:00.1,pcie=1

For some NVIDIA GPUs, especially when the card is also the host's boot/display GPU, you may need:

qm set 1212 --hostpci0 03:00.0,pcie=1,x-vga=1

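To give the compute VM a virtual IOMMU, which Step 4 relies on for nested passthrough (Intel vIOMMU shown; it requires the q35 machine type set above):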

qm set 1212 --args '-machine kernel_irqchip=split -device intel-iommu,intremap=on,caching-mode=on'

Start the VM:

qm start 1212

Inside the OpenStack compute VM:

lspci -nn | grep -Ei 'nvidia|vga|3d|audio'

Expected (the bus address inside the VM may differ from the host's 03:00.x):

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ... [10de:xxxx]
01:00.1 Audio device [0403]: NVIDIA Corporation ... [10de:xxxx]

At this point, the OpenStack compute VM can see the physical GPU.

Step 4 — Prepare the OpenStack compute VM for PCI passthrough

Inside the OpenStack compute VM, verify KVM and IOMMU:

grep -Ec '(vmx|svm)' /proc/cpuinfo
ls -ld /sys/kernel/iommu_groups/*

Install tools:

sudo apt update
sudo apt install -y pciutils

Check the GPU:

lspci -nnk | grep -A4 -Ei 'nvidia|vga|3d'

For OpenStack passthrough, the GPU should usually be bound to vfio-pci on the OpenStack compute VM, not used by the compute VM itself.

Create VFIO config inside the OpenStack compute VM:

sudo tee /etc/modules-load.d/vfio.conf >/dev/null <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF

The IDs below are the GTX 970 and its audio function from Step 1; replace them with your actual GPU IDs if they differ:

sudo tee /etc/modprobe.d/vfio.conf >/dev/null <<'EOF'
options vfio-pci ids=10de:13c2,10de:0fbb disable_vga=1
EOF

Blacklist Nouveau/NVIDIA inside the compute VM:

sudo tee /etc/modprobe.d/blacklist-gpu.conf >/dev/null <<'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
EOF

Update initramfs:

sudo update-initramfs -u -k all
sudo reboot

After reboot:

lspci -nnk -s 01:00.0

Expected:

Kernel driver in use: vfio-pci

Step 5 — Configure Nova PCI passthrough in Kolla-Ansible

On your Kolla deployment host/controller, activate your Kolla environment:

source /opt/kolla-venv/bin/activate

Confirm your inventory variable:

echo "$KOLLA_INVENTORY"

Example:

export KOLLA_INVENTORY=/etc/kolla/multinode

Check your OpenStack nodes:

ansible -i "$KOLLA_INVENTORY" all -m ping

Now create Kolla Nova config override directories:

sudo mkdir -p /etc/kolla/config/nova
sudo mkdir -p /etc/kolla/config/nova/gpu

You need two levels of config:

Controller / scheduler / API side: define the alias.
Compute node side: define which PCI devices are available.

Nova’s current PCI passthrough syntax uses [pci] device_spec and [pci] alias. Older examples may show passthrough_whitelist; prefer device_spec for current Nova. OpenStack documents that the device request is made through flavor extra specs using pci_passthrough:alias.

Create a global Nova override for the alias:

sudo tee /etc/kolla/config/nova.conf >/dev/null <<'EOF'
[pci]
alias = { "vendor_id": "10de", "product_id": "13c2", "device_type": "type-PCI", "name": "nvidia-gpu" }
EOF

Replace 13c2 with your GPU product ID if your card is not a GTX 970.

Then create a host-specific compute override for the GPU compute node. Replace gpu with the exact hostname from your Kolla inventory:

sudo tee /etc/kolla/config/nova/gpu/nova.conf >/dev/null <<'EOF'
[pci]
device_spec = { "vendor_id": "10de", "product_id": "13c2", "device_type": "type-PCI" }
alias = { "vendor_id": "10de", "product_id": "13c2", "device_type": "type-PCI", "name": "nvidia-gpu" }
EOF

Important: the alias and device_spec must match in vendor_id, product_id, and device_type. Red Hat’s OpenStack PCI passthrough guidance also warns that device_spec on Compute nodes and alias on the control plane must use the same device_type for the same device.
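A quick consistency check between the two override files can catch ID typos before reconfiguring. The sketch below is an assumption-laden helper, not part of Kolla or Nova: it expects each option on a single line with double-quoted JSON values, as in the files above, and only compares vendor_id and product_id. Both function names are hypothetical.

```shell
# Print JSON field $3 from the first "$2 = {...}" line in file $1.
pci_json_field() {
  grep -E "^$2[[:space:]]*=" "$1" | head -n1 |
    sed -nE "s/.*\"$3\"[[:space:]]*:[[:space:]]*\"([^\"]*)\".*/\1/p"
}

# Compare the alias in the global override with device_spec in the host override.
# $1: global nova.conf override, $2: host-specific nova.conf override
check_pci_ids_match() {
  for field in vendor_id product_id; do
    a=$(pci_json_field "$1" alias "$field")
    d=$(pci_json_field "$2" device_spec "$field")
    if [ -z "$a" ] || [ "$a" != "$d" ]; then
      echo "mismatch in $field: alias='$a' device_spec='$d'" >&2
      return 1
    fi
  done
  echo "alias and device_spec agree on vendor_id and product_id"
}
```

With the paths used in this step, the call would be `check_pci_ids_match /etc/kolla/config/nova.conf /etc/kolla/config/nova/gpu/nova.conf`.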

Step 6 — Reconfigure Nova with Kolla-Ansible

Run prechecks first:

source /opt/kolla-venv/bin/activate
kolla-ansible prechecks -i "$KOLLA_INVENTORY" --tags nova

Then reconfigure Nova:

source /opt/kolla-venv/bin/activate
kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags nova

Check Nova containers:

docker ps --format 'table {{.Names}}\t{{.Status}}' | grep nova

On the GPU compute VM, check the Nova compute logs:

docker logs nova_compute --tail 200

Look for PCI discovery messages:

docker logs nova_compute 2>&1 | grep -Ei 'pci|vfio|nvidia|device_spec|alias'

Step 7 — Verify OpenStack sees the GPU compute node

Load OpenStack credentials:

source /etc/kolla/admin-openrc.sh

Check services:

openstack compute service list
openstack hypervisor list
openstack hypervisor show gpu

Check resource providers:

openstack resource provider list

Then inspect the GPU compute resource provider:

openstack resource provider list | grep gpu

PCI passthrough devices do not always show up in openstack hypervisor show, so the logs are often more useful at this stage.

Check Nova scheduler logs:

docker logs nova_scheduler --tail 200 | grep -Ei 'pci|alias|placement|resource'

Step 8 — Create a GPU flavor

Create a small GPU flavor:

openstack flavor create g1.gpu \
  --ram 8192 \
  --disk 40 \
  --vcpus 4

Attach the PCI alias:

openstack flavor set g1.gpu \
  --property "pci_passthrough:alias"="nvidia-gpu:1"

Verify:

openstack flavor show g1.gpu

Expected property:

pci_passthrough:alias='nvidia-gpu:1'

Step 9 — Prepare a GPU-ready image

Use Ubuntu 24.04 cloud image:

wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

Upload to Glance:

openstack image create ubuntu-24.04-gpu \
  --file noble-server-cloudimg-amd64.img \
  --disk-format qcow2 \
  --container-format bare \
  --public

Optional but useful:

openstack image set ubuntu-24.04-gpu \
  --property hw_qemu_guest_agent=yes

Step 10 — Boot the GPU instance

Find network:

openstack network list

Find keypair:

openstack keypair list

Boot:

openstack server create gpu-test-01 \
  --image ubuntu-24.04-gpu \
  --flavor g1.gpu \
  --network private \
  --key-name your-key

Watch build status:

openstack server list
openstack server show gpu-test-01
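Rather than re-running the two commands by hand, the build can be polled until it settles. A minimal sketch, assuming the openstack client is on PATH and credentials are loaded; the function name, retry count, and delay are illustrative choices.

```shell
# Poll a server until it reaches ACTIVE or ERROR.
# $1: server name, $2: max polls (default 60), $3: seconds between polls (default 5)
# Returns 0 on ACTIVE, 1 on ERROR, 2 if the CLI call fails, 3 on timeout.
wait_for_server() {
  name=$1
  tries=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(openstack server show "$name" -f value -c status) || return 2
    case $status in
      ACTIVE) echo "$name is ACTIVE"; return 0 ;;
      ERROR)  echo "$name is in ERROR" >&2; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $name (last status: $status)" >&2
  return 3
}
```

For example, `wait_for_server gpu-test-01` waits up to about five minutes with the defaults; a non-zero return is the cue to look at the scheduler and compute logs.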

If it fails with No valid host, check:

docker logs nova_scheduler --tail 300
docker logs nova_compute --tail 300

Common causes:

PCI alias not defined on controller side
device_spec missing on compute side
wrong product_id
GPU not bound to vfio-pci
GPU compute node disabled
IOMMU not visible inside compute VM
nested passthrough unsupported by host/guest combination

Step 11 — Verify GPU inside the OpenStack instance

SSH into the instance:

ssh ubuntu@<floating-ip>

Check PCI visibility:

lspci -nn | grep -Ei 'nvidia|vga|3d'

Expected:

NVIDIA Corporation ...

Install NVIDIA driver:

sudo apt update
sudo apt install -y ubuntu-drivers-common
ubuntu-drivers devices

Install the recommended driver:

sudo ubuntu-drivers install
sudo reboot

After reboot:

nvidia-smi

Expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI ... Driver Version ... CUDA Version ...                           |
| GPU  Name ...                                                                |
+-----------------------------------------------------------------------------+

That completes the core deliverable:

GPU-enabled VM via OpenStack
nvidia-smi works

Step 12 — Install CUDA test tooling

Inside the OpenStack GPU instance:

sudo apt update
sudo apt install -y build-essential dkms linux-headers-$(uname -r)

Install CUDA toolkit from Ubuntu packages:

sudo apt install -y nvidia-cuda-toolkit

Check:

nvcc --version || true
nvidia-smi

Run a simple GPU stress check:

sudo apt install -y git make g++
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make
./gpu_burn 60

Watch GPU activity:

watch -n1 nvidia-smi

Step 13 — Run PyTorch on the GPU

Inside the instance:

sudo apt install -y python3-venv python3-pip
python3 -m venv ~/venvs/gpu
source ~/venvs/gpu/bin/activate
pip install --upgrade pip

Install the PyTorch CUDA wheel. The exact wheel depends on the current PyTorch/CUDA release, so check the PyTorch selector when you do this. A typical command looks like:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Test:

python - <<'PY'
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
    x = torch.rand(4096, 4096, device="cuda")
    y = x @ x
    print("Matrix result:", y[0][0].item())
PY

Expected:

CUDA available: True
GPU count: 1
GPU name: <your NVIDIA GPU>

Step 14 — Make the GPU visible to Slurm with GRES

If you are building Slurm on top of OpenStack instances, the GPU node is now simply a Slurm worker with an NVIDIA GPU.

Install Slurm worker packages on the GPU instance:

sudo apt update
sudo apt install -y slurmd munge

Confirm the NVIDIA driver from Step 11 is working:

nvidia-smi

Create GRES config:

sudo tee /etc/slurm/gres.conf >/dev/null <<'EOF'
Name=gpu Type=nvidia File=/dev/nvidia0
EOF
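If an instance ever carries more than one GPU, the gres.conf lines can be generated from the device nodes instead of written by hand. This is a sketch that assumes the NVIDIA driver exposes GPUs as /dev/nvidia0, /dev/nvidia1, and so on; the function name and the optional directory argument (for testing) are hypothetical, and Type=nvidia simply mirrors the line above.

```shell
# Emit one gres.conf line per NVIDIA GPU device node.
# $1: device directory (hypothetical parameter for testing; defaults to /dev)
gres_conf_lines() {
  devdir=${1:-/dev}
  # nvidia[0-9]* matches nvidia0, nvidia1, ... but not nvidiactl or nvidia-uvm
  for dev in "$devdir"/nvidia[0-9]*; do
    [ -e "$dev" ] || continue   # the glob matched nothing
    echo "Name=gpu Type=nvidia File=$dev"
  done
}
```

The output could be reviewed and then written with `gres_conf_lines | sudo tee /etc/slurm/gres.conf`.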

On the Slurm controller, configure the node in slurm.conf.

Example (GresTypes=gpu must be set in slurm.conf for the Gres= entry to be accepted):

GresTypes=gpu
NodeName=gpu-test-01 CPUs=4 RealMemory=7500 Gres=gpu:nvidia:1 State=UNKNOWN
PartitionName=gpu Nodes=gpu-test-01 Default=YES MaxTime=INFINITE State=UP

Restart Slurm controller and worker:

On controller:

sudo systemctl restart slurmctld

On GPU instance:

sudo systemctl restart munge
sudo systemctl restart slurmd

Check:

sinfo
scontrol show node gpu-test-01

Expected:

Gres=gpu:nvidia:1

Submit a GPU job:

cat > gpu-test.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --output=gpu-test.out

hostname
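nvidia-smi
# PyTorch was installed into the virtualenv from Step 13, so activate it first
source ~/venvs/gpu/bin/activate
python3 - <<'PY'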
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
PY
EOF

sbatch gpu-test.sbatch

Check:

squeue
cat gpu-test.out

Step 15 — Performance and validation checklist

Run these from the OpenStack instance:

lspci -nn | grep -Ei 'nvidia|vga|3d'
nvidia-smi
nvidia-smi topo -m
nvidia-smi dmon -s pucvmt

Run CUDA/PyTorch test:

python3 - <<'PY'
import torch, time
device = "cuda"
a = torch.randn((8192, 8192), device=device)
b = torch.randn((8192, 8192), device=device)
torch.cuda.synchronize()
start = time.time()
c = a @ b
torch.cuda.synchronize()
print("Seconds:", time.time() - start)
print("GPU:", torch.cuda.get_device_name(0))
PY

Check from OpenStack side:

openstack server show gpu-test-01
openstack flavor show g1.gpu
openstack compute service list

Check Nova logs:

docker logs nova_compute --tail 200 | grep -Ei 'pci|claim|vfio|nvidia'
docker logs nova_scheduler --tail 200 | grep -Ei 'pci|alias|claim|filter'

Common failure modes

1. GPU not isolated in IOMMU group

Symptom:

vfio: group is not viable

Fix:

Check IOMMU group:

for g in /sys/kernel/iommu_groups/*; do
  echo "Group ${g##*/}"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done

You may need to move the GPU to another PCIe slot or enable motherboard ACS/IOMMU options.

2. GPU still using Nouveau or NVIDIA driver on compute node

Symptom:

Kernel driver in use: nouveau

Fix:

sudo modprobe -r nouveau
sudo update-initramfs -u -k all
sudo reboot

Confirm:

lspci -nnk -s 01:00.0

Expected:

Kernel driver in use: vfio-pci

3. OpenStack says `No valid host`

Likely causes:

Wrong vendor_id/product_id
Alias only configured on compute but not controller/API/scheduler
device_spec only configured globally but not on GPU compute
Nova containers not reconfigured
GPU compute node disabled
PCI device already consumed by another instance
Nested passthrough not working

Commands:

openstack compute service list
docker logs nova_scheduler --tail 300
docker logs nova_compute --tail 300

4. Instance boots but no GPU appears

Inside the instance:

lspci -nn | grep -Ei 'nvidia|vga|3d'

If empty, Nova/libvirt did not attach the PCI device.

Check on the GPU compute node:

docker logs nova_compute --tail 300 | grep -Ei 'pci|vfio|libvirt|qemu'

5. `nvidia-smi` fails inside the instance

If lspci shows the GPU but nvidia-smi fails, the passthrough path probably worked, and the problem is in the guest driver: it is missing, failed to build, or does not match the running kernel.

Check:

lspci -nnk | grep -A4 -Ei 'nvidia|vga|3d'
dkms status
uname -r

Then reinstall:

sudo apt purge -y 'nvidia-*'
sudo apt autoremove -y
sudo ubuntu-drivers install
sudo reboot

Final success criteria

You are done when all of these work:

openstack server create gpu-test-01 --flavor g1.gpu ...
openstack server show gpu-test-01
ssh ubuntu@<floating-ip>
lspci -nn | grep -i nvidia
nvidia-smi
python -c 'import torch; print(torch.cuda.is_available())'
scontrol show node gpu-test-01 | grep Gres
sbatch gpu-test.sbatch

The Phase 4 deliverable becomes:

GPU-enabled VM via OpenStack:      complete
nvidia-smi works:                  complete
AI frameworks on GPU:              complete
GPU visible to Slurm via GRES:     complete

BLU // SAS

GPU Pass-Through PVE and Configuration

Bristol Linux Unix Systems Automation Security
