Nova Test GPU Instance

GPU passthrough verification summary: gpu-test-01

You successfully proved that OpenStack Nova can schedule a VM to the gpu compute host and pass through the NVIDIA GTX 970 as a PCI device.

The final working VM was gpu-test-01, with:

Flavor: g1.gtx970
Image: ubuntu-24.04-gpu
Network: gpu-private
Fixed IP: 10.10.10.36
Host: gpu
Status: ACTIVE

The key successful OpenStack state was:

OS-EXT-SRV-ATTR:host     gpu
OS-EXT-STS:vm_state      active
OS-EXT-STS:task_state    None
addresses                gpu-private=10.10.10.36
status                   ACTIVE

1. Creating the GPU test VM

You first created the server using the GPU-enabled flavor:

NET_ID="$(openstack network show gpu-private -f value -c id)"

openstack server create gpu-test-01 \
--flavor g1.gtx970 \
--image ubuntu-24.04-gpu \
--nic net-id="$NET_ID" \
--key-name sont-key \
--security-group default \
--availability-zone nova:gpu

The important parts of this command are:

--flavor g1.gtx970

This flavor contains the PCI passthrough request:

pci_passthrough:alias = nvidia-gpu:1

That tells Nova:

This VM requires one PCI device matching the alias nvidia-gpu.

The other important part is:

--availability-zone nova:gpu

That pins the VM to the gpu compute node, which is the host that owns the physical GTX 970. The ZONE:HOST form is a forced-host request, which Nova's default policy allows only for admin users.


2. Initial alias problem

The first failure was:

PCI alias nvidia-gtx970 is not defined

The cause was a mismatch between the flavor and Nova config.

The flavor originally requested:

nvidia-gtx970:1

but Nova had defined:

nvidia-gpu

The fix was to make the flavor use the same alias that Nova knew about:

openstack flavor unset g1.gtx970 --property "pci_passthrough:alias"

openstack flavor set g1.gtx970 \
--property "pci_passthrough:alias"="nvidia-gpu:1"

After that, Nova accepted the request and scheduled the instance onto the gpu host.
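
A mismatch like this can be caught before booting by comparing the alias the flavor requests with the alias names Nova defines. The following is a minimal sketch, not something run in the original session: the helper names are hypothetical, and it assumes the alias is defined in nova.conf as a JSON-style "alias = { ... }" line that contains a "name" key.

```shell
# Hypothetical helpers (not from the original session).

# Print the alias names found in nova.conf-style text on stdin.
# Assumes lines such as: alias = { "vendor_id": "10de", ..., "name": "nvidia-gpu" }
conf_alias_names() {
    grep -E '^[[:space:]]*alias[[:space:]]*=' \
        | grep -oE '"name"[[:space:]]*:[[:space:]]*"[^"]+"' \
        | sed -E 's/.*"([^"]+)"$/\1/'
}

# Strip the ":count" suffix from a flavor request, e.g. nvidia-gpu:1 -> nvidia-gpu.
flavor_alias_name() {
    printf '%s\n' "${1%%:*}"
}

# Succeed only if the requested alias is among the names read from stdin.
alias_is_defined() {
    conf_alias_names | grep -qxF -- "$(flavor_alias_name "$1")"
}

# Example (the nova.conf path is an assumption and depends on the deployment):
#   alias_is_defined nvidia-gtx970:1 < /etc/kolla/nova-api/nova.conf || echo "alias not defined"
```

The alias currently set on the flavor can be read with "openstack flavor show g1.gtx970 -c properties" and passed to alias_is_defined by hand.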


3. Build delay during image conversion

The VM sat in:

status: BUILD
task_state: spawning
vm_state: building

for a long time.

This was not a GPU problem. The logs showed Nova had reached:

Claim successful on node gpu
Creating image(s)

The process check showed:

qemu-img convert -t none -O raw -f qcow2 \
/var/lib/nova/instances/_base/<image>.part \
/var/lib/nova/instances/_base/<image>.converted

So Nova was converting the Glance qcow2 image into a raw base image on the gpu node.

Disk I/O showed the disk was saturated:

sda %util: ~99%
write throughput: around 1 MB/s

That explained the slow build. The correct action was to wait, not delete the VM. Once qemu-img convert finished, the VM became ACTIVE.
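
One way to confirm that the conversion is still making progress, rather than hung, is to sample the cumulative write counter that Linux exposes in /proc/<pid>/io. This is a sketch under assumptions: the helper names are hypothetical, and reading another process's io file normally needs root on the gpu host.

```shell
# Hypothetical helpers (not from the original session).

# Print the write_bytes counter from a /proc/<pid>/io-style file given as $1.
# The exact-match test skips the cancelled_write_bytes line.
io_write_bytes() {
    awk '$1 == "write_bytes:" { print $2 }' "$1"
}

# Bytes written per second between two samples: before, after, interval in seconds.
write_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Example (run as root on the gpu host while the build is in progress):
#   pid="$(pgrep -f 'qemu-img convert' | head -n1)"
#   a="$(io_write_bytes "/proc/$pid/io")"; sleep 10
#   b="$(io_write_bytes "/proc/$pid/io")"
#   write_rate "$a" "$b" 10
```

A rate that stays above zero means the build is slow but alive, which supports waiting instead of deleting the VM.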

For future builds, uploading a raw image to Glance would avoid this slow first-boot conversion, at the cost of a larger file to upload and to transfer to the compute node:

qemu-img convert -f qcow2 -O raw noble-server-cloudimg-amd64.img noble-server-cloudimg-amd64.raw

openstack image create ubuntu-24.04-gpu-raw \
--file noble-server-cloudimg-amd64.raw \
--disk-format raw \
--container-format bare \
--public

4. Verifying the VM booted correctly

The console log proved the VM booted and cloud-init finished:

Ubuntu 24.04.4 LTS gpu-test-01 ttyS0
cloud-init finished
Authorized keys from /home/ubuntu/.ssh/authorized_keys for user ubuntu

That confirmed:

Ubuntu booted successfully
cloud-init completed
SSH key was injected correctly
The ubuntu user has your key

So the SSH timeout described in section 7 was not caused by the guest OS or a missing key.
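
The same check can be scripted against the console log. This is a minimal sketch: the function name is hypothetical, and the two patterns are simply the markers quoted above, so they may need adjusting for other images.

```shell
# Hypothetical helper (not from the original session).
# Read console log text on stdin; succeed only if both boot markers are present.
console_boot_ok() {
    log="$(cat)"
    printf '%s\n' "$log" | grep -qiE 'cloud-init.*finished' || return 1
    printf '%s\n' "$log" | grep -qF 'Authorized keys from' || return 1
}

# Example:
#   openstack console log show gpu-test-01 | console_boot_ok && echo "boot OK"
```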


5. Verifying the GPU was attached by libvirt

You checked the running libvirt domain:

ansible -i "$KOLLA_INVENTORY" gpu -m shell -a '
echo "=== libvirt domain ==="
docker exec nova_libvirt virsh list --all

echo "=== GPU hostdev in XML ==="
docker exec nova_libvirt virsh dumpxml instance-00000001 | grep -Ei "hostdev|vendor|product|0x10de|0x13c2|pci|source|address" -A10 -B5 || true
'

The domain was running:

 Id   Name                State
-----------------------------------
 1    instance-00000001   running

The decisive passthrough evidence was:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev0'/>
</hostdev>

That shows Nova and libvirt attached the host PCI device at address 0000:01:00.0 to the VM through VFIO.

This means the OpenStack passthrough chain works:

Nova flavor requests PCI alias
Nova scheduler selects gpu host
nova-compute claims PCI device
libvirt starts VM
VFIO hostdev is attached to the guest
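
To tie the hostdev entry back to a physical device, the source address can be pulled out of the domain XML and looked up on the host. This is a sketch, assuming the XML layout shown above, where <source> sits on its own line and attributes use single quotes; the function name is hypothetical.

```shell
# Hypothetical helper (not from the original session).
# Read libvirt domain XML on stdin and print each passthrough source
# address in lspci form (DDDD:BB:SS.F).
hostdev_pci_addresses() {
    awk '
        # Return the value of attribute "name" on this line, without a 0x prefix.
        function attr(line, name,    s) {
            if (!match(line, name "=\047[^\047]*\047")) return ""
            s = substr(line, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
            sub(/^0x/, "", s)
            return s
        }
        /<hostdev /   { in_hostdev = 1 }
        /<\/hostdev>/ { in_hostdev = 0; in_source = 0 }
        in_hostdev && /<source>/   { in_source = 1 }
        in_hostdev && /<\/source>/ { in_source = 0 }
        # Only the address inside <source> is the host device; any other
        # <address> in the hostdev block is the slot seen by the guest.
        in_hostdev && in_source && /<address / {
            printf "%s:%s:%s.%s\n", attr($0, "domain"), attr($0, "bus"), attr($0, "slot"), attr($0, "function")
        }
    '
}

# Example (on the gpu host):
#   docker exec nova_libvirt virsh dumpxml instance-00000001 | hostdev_pci_addresses
#   lspci -nn -s 0000:01:00.0
```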

6. Security group verification

The VM’s Neutron port was:

Port ID: 21463eff-7c0b-4fa0-90af-43b876226ce8
IP: 10.10.10.36
MAC: fa:16:3e:13:15:e2
Status: ACTIVE
Host: gpu

The security group attached to the port was:

0bc4d73d-f18f-4dfd-8560-0886b1ace50e

You confirmed it already had:

TCP 22 ingress from 0.0.0.0/0
ICMP ingress from 0.0.0.0/0
IPv4 egress to 0.0.0.0/0

So the SSH timeout was not caused by the security group.


7. Why direct SSH from ctrl did not work

This command timed out:

ssh -i ~/.ssh/id_ed25519_kolla ubuntu@10.10.10.36

The reason is that 10.10.10.36 is on the private tenant network:

gpu-private

The controller host root namespace does not automatically have a route into that tenant network.

So this was the problem:

ctrl root namespace  ---> no direct route ---> 10.10.10.36

The VM is alive, SSH is running, and the key is present. The missing piece is the correct Neutron network namespace path.


8. Finding the Neutron namespace

To find the namespaces on the controller/network node:

ip netns list

For your private network, the DHCP namespace is based on the Neutron network ID:

gpu-private network ID:
54829687-5a62-4d95-a7d0-42f3e30f7dbf

So the DHCP namespace is:

qdhcp-54829687-5a62-4d95-a7d0-42f3e30f7dbf
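
The naming rule can be scripted so the namespace does not have to be typed by hand. This is a minimal sketch with hypothetical function names; it assumes the DHCP-agent layout described here, where the namespace only exists on a node that runs the DHCP agent for that network.

```shell
# Hypothetical helpers (not from the original session).

# Build the DHCP namespace name from a Neutron network ID.
qdhcp_namespace() {
    printf 'qdhcp-%s\n' "$1"
}

# Read `ip netns list` output on stdin; succeed if namespace $1 is listed.
namespace_listed() {
    awk -v ns="$1" '$1 == ns { found = 1 } END { exit !found }'
}

# Example (on the controller/network node):
#   NET_ID="$(openstack network show gpu-private -f value -c id)"
#   ip netns list | namespace_listed "$(qdhcp_namespace "$NET_ID")" && echo "namespace present"
```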

You can test all namespaces like this:

for ns in $(ip netns list | awk '{print $1}'); do
echo "=== $ns ==="
sudo ip netns exec "$ns" ping -c 2 -W 2 10.10.10.36 || true
done

Then test SSH/TCP 22:

for ns in $(ip netns list | awk '{print $1}'); do
echo "=== $ns ==="
sudo ip netns exec "$ns" timeout 3 bash -c '</dev/tcp/10.10.10.36/22' && echo "SSH TCP OK" || echo "SSH TCP FAIL"
done

9. Explaining the working namespace SSH command

The command is:

sudo ip netns exec qdhcp-54829687-5a62-4d95-a7d0-42f3e30f7dbf \
ssh -i /home/sont/.ssh/id_ed25519_kolla ubuntu@10.10.10.36

Breakdown:

sudo

Required because entering a Linux network namespace needs root privileges (CAP_SYS_ADMIN).

ip netns exec

Runs a command inside a specific Linux network namespace.

qdhcp-54829687-5a62-4d95-a7d0-42f3e30f7dbf

This is the Neutron DHCP namespace for the gpu-private tenant network.

That namespace has an interface directly attached to the tenant network, so it can reach:

10.10.10.36

where the controller root namespace could not.

ssh -i /home/sont/.ssh/id_ed25519_kolla ubuntu@10.10.10.36

This runs SSH from inside the tenant network namespace using your OpenStack keypair private key.

In plain English:

Enter the Neutron DHCP namespace for gpu-private, then SSH from there into the VM using the injected keypair.

This bypasses the lack of routing from the controller host root namespace.
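
Since this command is needed every time you reconnect, it can be wrapped in a small function. The wrapper below is a hypothetical convenience, not part of the original session; its defaults are the namespace, key path and address from this session, and DRY_RUN is an illustrative switch that prints the command instead of running it.

```shell
# Hypothetical wrapper (not from the original session).
VM_NS="${VM_NS:-qdhcp-54829687-5a62-4d95-a7d0-42f3e30f7dbf}"
VM_KEY="${VM_KEY:-/home/sont/.ssh/id_ed25519_kolla}"
VM_ADDR="${VM_ADDR:-ubuntu@10.10.10.36}"

# SSH into the VM from inside the DHCP namespace; extra arguments are
# passed to ssh as the remote command.
vm_ssh() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo sudo ip netns exec "$VM_NS" ssh -i "$VM_KEY" "$VM_ADDR" "$@"
    else
        sudo ip netns exec "$VM_NS" ssh -i "$VM_KEY" "$VM_ADDR" "$@"
    fi
}

# Example:
#   vm_ssh                 # interactive login
#   vm_ssh lspci -nn       # run one command in the guest
```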


10. Verifying GPU access inside the VM

Once logged into gpu-test-01, first install PCI tools if needed:

sudo apt update
sudo apt install -y pciutils

Then check for the NVIDIA GPU:

lspci -nn | grep -Ei 'nvidia|vga|3d'

The output should include the GTX 970 (GM204) device:

NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2]

That proves the guest OS can see the passed-through GPU.

To check kernel binding inside the guest:

lspci -nnk | grep -Ei 'nvidia|vga|3d|kernel driver' -A3

Before the NVIDIA driver is installed, the device may show no kernel driver at all or may be bound to nouveau, depending on the image. Either is normal.
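
To read the binding without scanning the grep context by eye, the lspci output can be filtered per device. This is a sketch with a hypothetical function name; it matches on the NVIDIA vendor ID 10de and prints nothing when no driver is bound.

```shell
# Hypothetical helper (not from the original session).
# Read `lspci -nnk` output on stdin; print the kernel driver bound to each
# device whose vendor ID is 10de (NVIDIA).
nvidia_bound_driver() {
    awk '
        # Device header lines start in column 1; detail lines are indented.
        /^[^ \t]/ { is_nvidia = ($0 ~ /\[10de:[0-9a-f]+\]/) }
        is_nvidia && /Kernel driver in use:/ { print $NF }
    '
}

# Example (inside the guest):
#   lspci -nnk | nvidia_bound_driver
```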

Then install the NVIDIA driver:

sudo apt install -y ubuntu-drivers-common
ubuntu-drivers devices
sudo ubuntu-drivers install
sudo reboot

After reboot, reconnect through the namespace:

sudo ip netns exec qdhcp-54829687-5a62-4d95-a7d0-42f3e30f7dbf \
ssh -i /home/sont/.ssh/id_ed25519_kolla ubuntu@10.10.10.36

Then run:

nvidia-smi

The expected result is nvidia-smi output that lists the GTX 970.

That is the final end-to-end confirmation:

OpenStack scheduled GPU VM
libvirt attached GPU via VFIO
Ubuntu booted
guest sees NVIDIA PCI device
NVIDIA driver loads
nvidia-smi works
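
The last step can be turned into a pass/fail check by using the CSV query mode of nvidia-smi. This is a minimal sketch with a hypothetical function name; it assumes the model string appears in the name field, and the exact wording of that field can vary between driver versions.

```shell
# Hypothetical helper (not from the original session).
# Read `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`
# output on stdin; succeed if a line names the given model and reports a
# driver version.
gpu_reported() {
    awk -F', *' -v model="$1" '
        index($1, model) && $2 != "" { ok = 1 }
        END { exit !ok }
    '
}

# Example (inside the guest, after the driver install and reboot):
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | gpu_reported "GTX 970" && echo "GPU OK"
```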

Final status

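Your GPU passthrough setup has reached the key milestone:

OpenStack GPU passthrough is working at the Nova and libvirt level: the GPU is attached to the running VM through VFIO. The guest-side checks in section 10 complete the end-to-end confirmation.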

The only remaining operational task is choosing a cleaner long-term access model:

1. Continue using ip netns exec for lab testing
2. Add a router and floating IP network
3. Boot GPU VMs directly on a provider network
4. Add controlled routing from ctrl into gpu-private