RabbitMQ troubleshooting summary
The RabbitMQ issue started while you were reconfiguring Nova for GPU passthrough. The Kolla-Ansible Nova reconfigure failed at:
TASK [service-rabbitmq : nova | Ensure RabbitMQ users exist]
That told us the problem was not Nova GPU passthrough directly. Nova was failing because Kolla could not reliably talk to RabbitMQ to create or verify the service users.
1. Initial symptom: RabbitMQ container was running but unhealthy
You checked the RabbitMQ container:
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' | grep -Ei 'rabbit|NAME'
Result:
rabbitmq Up ... (unhealthy)
Then you inspected the healthcheck:
docker inspect rabbitmq --format '{{json .State.Health}}' | python3 -m json.tool
The healthcheck repeatedly failed with:
Failed to connect and authenticate to rabbit@ctrl in 5000 ms
That established the real failure domain: RabbitMQ CLI could not contact/authenticate to the Erlang node rabbit@ctrl.
2. RabbitMQ diagnostics showed Erlang node connection failure
You ran:
docker exec rabbitmq rabbitmq-diagnostics status
The diagnostics showed:
attempted to contact: [rabbit@ctrl]
rabbit@ctrl:
* connected to epmd (port 4369) on ctrl
* epmd reports node 'rabbit' uses port 25672
* can't establish TCP connection to the target node, reason: econnrefused
This was important because it proved:
epmd on 4369 was reachable
RabbitMQ advertised node traffic on 25672
The CLI could not connect to 25672
So the issue was not simply “container down”. The container and epmd were up, so the fault had to lie in how the node was reached: hostname resolution, listener binding, or node state.
3. First root cause found: bad /etc/hosts
You checked hostname resolution inside the container and on the host:
docker exec rabbitmq getent hosts ctrl
docker exec rabbitmq cat /etc/hosts
hostname
hostname -f
getent hosts ctrl
cat /etc/hosts
Both the container and host showed:
127.0.1.1 ctrl ctrl
That was the first major problem. RabbitMQ was running as:
rabbit@ctrl
but ctrl resolved to a loopback address. RabbitMQ CLI tools locate the node by resolving the host part of its name, so with this mapping they connected to 127.0.1.1:25672 and were refused there. The file also showed that cloud-init was managing /etc/hosts, meaning manual edits could be overwritten.
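A quick way to spot this class of problem is to list every name a hosts file maps to a loopback address. The following is a minimal sketch, not part of the original session; the function name loopback_names is made up for illustration, and it only handles IPv4 loopback entries.

```shell
# List names that a hosts file maps to an IPv4 loopback address (127.x.x.x),
# ignoring "localhost" itself. Reads hosts-file content on stdin.
# loopback_names is a hypothetical helper name, not an existing tool.
loopback_names() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i != "localhost" && !seen[$i]++) print $i
            }
        }
    '
}

# Demo using the mapping recorded above.
printf '127.0.0.1 localhost\n127.0.1.1 ctrl ctrl\n' | loopback_names
```

On a real node the same function can be fed with `loopback_names < /etc/hosts`, or with the output of `docker exec rabbitmq cat /etc/hosts` to check the container's view.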
Expected fix
ctrl, cmp, and gpu should resolve to their real management IPs, not loopback.
The expected model was initially:
ctrl -> real controller IP
cmp -> real compute IP
gpu -> real GPU compute IP
You asked for an Ansible playbook, and the intended fix was to:
disable cloud-init host-file rewriting
remove bad 127.0.1.1 mappings
write explicit OpenStack node mappings
validate getent hosts output
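The middle two steps of that list (remove bad mappings, write explicit ones) can be sketched in plain shell as a filter that prints a cleaned hosts file without touching the real one. This is an illustrative sketch rather than the Ansible playbook that was requested: render_hosts is a hypothetical name, the 192.0.2.x addresses are documentation placeholders, and disabling cloud-init's rewriting is not covered here.

```shell
# Print a cleaned hosts file. Reads the current content on stdin, drops every
# 127.0.1.1 line and any line that already names one of the given hosts, then
# appends the explicit "IP name [alias...]" mappings passed as arguments.
# render_hosts is a hypothetical helper; it never writes to /etc/hosts itself.
render_hosts() {
    names=""
    for mapping in "$@"; do
        names="$names ${mapping#* }"         # keep everything after the IP
    done
    awk -v names="$names" '
        BEGIN { n = split(names, list, " "); for (i = 1; i <= n; i++) drop[list[i]] = 1 }
        /^[ \t]*#/        { print; next }    # keep comment lines untouched
        $1 == "127.0.1.1" { next }           # drop the bad loopback mapping
        { for (i = 2; i <= NF; i++) if ($i in drop) next }
        { print }
    '
    for mapping in "$@"; do
        printf '%s\n' "$mapping"
    done
}

# Demo with placeholder addresses (192.0.2.0/24 is reserved for documentation).
printf '127.0.0.1 localhost\n127.0.1.1 ctrl ctrl\n' |
    render_hosts '192.0.2.10 ctrl' '192.0.2.11 cmp'
```

Writing the result to a temporary file first, for example `render_hosts ... < /etc/hosts > /tmp/hosts.new`, leaves room to review the output before installing it.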
Core validation command:
ansible -i "$KOLLA_INVENTORY" all -m shell -a '
echo HOST=$(hostname);
echo FQDN=$(hostname -f || true);
echo CTRL=$(getent hosts ctrl);
echo CMP=$(getent hosts cmp);
echo GPU=$(getent hosts gpu);
'
4. Kolla-Ansible command ordering issue
While trying to stop/redeploy services, this command failed:
kolla-ansible -i "$KOLLA_INVENTORY" stop --tags nova,neutron,glance,placement
and also this form failed:
kolla-ansible -i "$KOLLA_INVENTORY" --tags rabbitmq deploy
Your installed Kolla-Ansible wrapper required this command shape instead:
kolla-ansible <command> -i "$KOLLA_INVENTORY" --tags <tags>
Correct examples:
kolla-ansible deploy -i "$KOLLA_INVENTORY" --tags rabbitmq
kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags nova
kolla-ansible stop -i "$KOLLA_INVENTORY" --tags nova,neutron,glance,placement
This was a tooling/CLI syntax issue, not a RabbitMQ issue, but it affected the recovery workflow.
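To avoid retyping the argument order, the accepted command shape can be wrapped in a small function. This is a sketch under assumptions: kolla_run is a hypothetical name, and KOLLA_BIN is a made-up override variable that exists only so the wrapper can be dry-run with echo.

```shell
# Run kolla-ansible in the argument order this install accepts:
#   kolla-ansible <command> -i "$KOLLA_INVENTORY" --tags <tags>
# kolla_run and KOLLA_BIN are hypothetical names introduced for this sketch.
kolla_run() {
    if [ "$#" -ne 2 ]; then
        echo "usage: kolla_run <command> <tags>" >&2
        return 2
    fi
    if [ -z "${KOLLA_INVENTORY:-}" ]; then
        echo "KOLLA_INVENTORY is not set" >&2
        return 1
    fi
    # KOLLA_BIN=echo turns the call into a dry run that prints the arguments.
    "${KOLLA_BIN:-kolla-ansible}" "$1" -i "$KOLLA_INVENTORY" --tags "$2"
}
```

For example, `KOLLA_BIN=echo kolla_run deploy rabbitmq` prints the arguments that would be passed, and `kolla_run deploy rabbitmq` runs the real command.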
5. RabbitMQ state reset was attempted
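Because RabbitMQ remained unhealthy after hostname repair attempts, we moved to a full RabbitMQ state reset. This wipes all users, queues, and messages, which is acceptable in a homelab but not on a system whose broker state matters.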
First, you identified the real RabbitMQ data mount:
docker inspect rabbitmq --format '{{range .Mounts}}{{println .Type .Name .Source "->" .Destination}}{{end}}'
Result:
volume rabbitmq /var/lib/docker/volumes/rabbitmq/_data -> /var/lib/rabbitmq
volume kolla_logs /var/lib/docker/volumes/kolla_logs/_data -> /var/log/kolla
So the actual persistent RabbitMQ state path was:
/var/lib/docker/volumes/rabbitmq/_data
The clean reset approach was:
docker stop rabbitmq || true
docker rm rabbitmq || true
BACKUP_DIR="$HOME/rabbitmq-backup-$(date +%F-%H%M)"
mkdir -p "$BACKUP_DIR"
sudo tar czf "$BACKUP_DIR/rabbitmq-volume-before-wipe.tgz" \
-C /var/lib/docker/volumes/rabbitmq/_data . 2>/dev/null || true
sudo find /var/lib/docker/volumes/rabbitmq/_data -mindepth 1 -maxdepth 1 -exec rm -rf {} +
sudo rm -rf /var/lib/docker/volumes/kolla_logs/_data/rabbitmq
sudo mkdir -p /var/lib/docker/volumes/kolla_logs/_data/rabbitmq
kolla-ansible deploy -i "$KOLLA_INVENTORY" --tags rabbitmq
Expected result
A fresh RabbitMQ should create a new .erlang.cookie, a new mnesia directory containing a rabbit@ctrl database directory, and eventually become healthy.
Your fresh output did show new state being created:
.erlang.cookie
mnesia
rabbit@ctrl
rabbitmq.pid
and RabbitMQ logs showed a fresh node path with 0 record(s) recovered, meaning the old quorum queue state was no longer the main issue.
6. New problem found: IP mismatch between ctrl and RabbitMQ bind address
After the reset, you found this:
docker exec rabbitmq ss -tlnp | grep -E '4369|5672|15672|25672'
The output showed:
LISTEN 192.168.1.51:25672
LISTEN 192.168.1.50:15672
LISTEN 0.0.0.0:4369
At the same time, getent hosts ctrl returned:
192.168.1.50 ctrl
That was the second key root cause. RabbitMQ was running as:
rabbit@ctrl
but the Erlang distribution listener was bound to:
192.168.1.51:25672
while ctrl resolved to:
192.168.1.50
So RabbitMQ CLI tried to reach rabbit@ctrl via 192.168.1.50:25672, but the Erlang distribution listener was actually on 192.168.1.51:25672.
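The comparison made here by eye (resolved address versus listener address) can be scripted. The sketch below is an illustration only: listener_ip and listener_matches are hypothetical names, and the parser handles IPv4 listeners only.

```shell
# Print the local IPv4 address bound to a TCP port, given `ss -tln` style
# output on stdin. Scans every field, so it works on full ss lines and on
# the abbreviated form shown above. listener_ip is a hypothetical helper.
listener_ip() {   # usage: ss -tln | listener_ip <port>
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ":")
                if (n == 2 && parts[2] == port && parts[1] ~ /^[0-9.]+$/) {
                    print parts[1]
                    exit
                }
            }
        }
    '
}

# Succeed only when the name resolves to the address the listener is bound
# to. A wildcard bind (0.0.0.0) accepts connections on any local address.
listener_matches() {   # usage: listener_matches <resolved_ip> <listener_ip>
    [ "$2" = "0.0.0.0" ] || [ "$1" = "$2" ]
}
```

A possible use is `listener_matches "$(docker exec rabbitmq getent hosts ctrl | awk '{print $1}')" "$(docker exec rabbitmq ss -tln | listener_ip 25672)"`, which fails for the 192.168.1.50 versus 192.168.1.51 state described above.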
7. Final diagnosis: VIP confused with host IP
You then inspected Kolla-generated RabbitMQ config:
docker exec rabbitmq cat /etc/rabbitmq/rabbitmq-env.conf
docker exec rabbitmq cat /etc/rabbitmq/rabbitmq.conf
docker exec rabbitmq cat /etc/rabbitmq/advanced.config
sudo grep -RniE '192\.168\.1\.51|192\.168\.1\.50|interface|listeners|distribution|node_ip|api_interface|network_interface|tunnel_interface' \
/etc/kolla/rabbitmq /etc/kolla/globals.yml /etc/kolla/config 2>/dev/null
RabbitMQ config showed it was intentionally binding to 192.168.1.51:
ERL_EPMD_ADDRESS=192.168.1.51
listeners.tcp.1 = 192.168.1.51:5672
management.listener.ip = 192.168.1.51
inet_dist_use_interface = {192,168,1,51}
Meanwhile /etc/kolla/globals.yml showed:
network_interface: "eth0"
api_interface: "eth0"
kolla_internal_vip_address: "192.168.1.50"
kolla_external_vip_address: "192.168.1.50"
So the correct model is:
192.168.1.51 = ctrl host management/API interface
192.168.1.50 = Kolla VIP
The mistake was mapping ctrl to the VIP:
Wrong:
192.168.1.50 ctrl
The correct mapping should be:
Correct:
192.168.1.51 ctrl
192.168.1.50 kolla-vip openstack-vip
Final expected /etc/hosts model
On all OpenStack nodes:
127.0.0.1 localhost
192.168.1.51 ctrl
192.168.1.X cmp
192.168.1.Y gpu
192.168.1.50 kolla-vip openstack-vip
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Replace 192.168.1.X and 192.168.1.Y with the actual cmp and gpu management IPs.
The key rule:
Node hostname -> physical/VM node management IP
Kolla VIP -> separate alias, not the node hostname
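That rule can be checked mechanically by comparing what a node name resolves to against the VIP in globals.yml. The following is a hedged sketch: vip_from_globals and classify_node_ip are hypothetical names, and the parser assumes the simple one-line `key: "value"` form shown above.

```shell
# Extract kolla_internal_vip_address from globals.yml content on stdin.
# Assumes the one-line key: "value" form; vip_from_globals is hypothetical.
vip_from_globals() {
    awk '/^kolla_internal_vip_address:/ { gsub(/"/, "", $2); print $2; exit }'
}

# Classify the address a node hostname resolves to.
classify_node_ip() {   # usage: classify_node_ip <resolved_ip> <vip>
    case "$1" in
        "")     echo unresolved ;;
        127.*)  echo loopback ;;     # e.g. the 127.0.1.1 mapping
        "$2")   echo vip ;;          # hostname wrongly mapped to the VIP
        *)      echo ok ;;
    esac
}
```

On the controller this could be run as `classify_node_ip "$(getent hosts ctrl | awk '{print $1}')" "$(vip_from_globals < /etc/kolla/globals.yml)"`; anything other than ok points at one of the two hostname mistakes described in this summary.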
Final recovery command sequence
After fixing /etc/hosts so ctrl resolves to 192.168.1.51, recreate RabbitMQ so the container gets the corrected host mapping:
docker stop rabbitmq || true
docker rm rabbitmq || true
source /opt/kolla-venv/bin/activate
kolla-ansible deploy -i "$KOLLA_INVENTORY" --tags rabbitmq
Then verify:
docker exec rabbitmq getent hosts ctrl
docker exec rabbitmq ss -tlnp | grep -E '4369|5672|15672|25672'
docker exec rabbitmq rabbitmq-diagnostics ping
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep rabbit
Expected:
docker exec rabbitmq getent hosts ctrl
192.168.1.51 ctrl
Expected listener alignment:
192.168.1.51:5672
192.168.1.51:15672
192.168.1.51:25672
Expected health:
Ping succeeded
rabbitmq Up ... (healthy)
After RabbitMQ is healthy
Reconfigure RabbitMQ and the OpenStack services that depend on it:
source /opt/kolla-venv/bin/activate
kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags rabbitmq
kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags placement,glance,neutron,nova
Then validate:
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep -Ei 'rabbit|nova|neutron|glance|placement'
source /etc/kolla/admin-openrc.sh
openstack compute service list
openstack network agent list
openstack image list
Root cause in one line
RabbitMQ failed because the node name rabbit@ctrl depended on ctrl resolving to the controller’s real management IP, but /etc/hosts first mapped ctrl to loopback and later mapped it to the Kolla VIP. The final fix is to map ctrl to 192.168.1.51, keep 192.168.1.50 as a VIP alias, recreate RabbitMQ, and then rerun Kolla service reconfiguration.
Authentication issue and kolla_toolbox troubleshooting summary
After the RabbitMQ hostname, VIP, listener, and container health problems were fixed, the Kolla-Ansible Nova reconfigure still failed at:
TASK [service-rabbitmq : nova | Ensure RabbitMQ users exist]
The task output was hidden because Kolla marks the result with no_log: true, so the failure had to be diagnosed indirectly.
1. RabbitMQ itself was proven healthy
You first confirmed the RabbitMQ container was no longer the problem.
Commands used:
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' | grep -Ei 'rabbit|NAME'
docker exec rabbitmq getent hosts ctrl
docker exec rabbitmq ss -tlnp | grep -E '4369|5672|15672|25672'
docker exec rabbitmq rabbitmq-diagnostics ping
docker exec rabbitmq rabbitmq-diagnostics listeners
docker exec rabbitmq rabbitmq-diagnostics check_running
docker exec rabbitmq rabbitmq-diagnostics check_local_alarms
Expected and observed good state:
rabbitmq Up ... (healthy)
192.168.1.51 ctrl
Ping succeeded
RabbitMQ on node rabbit@ctrl is fully booted and running
Node rabbit@ctrl reported no local alarms
The important listener alignment was correct:
192.168.1.51:5672 AMQP
192.168.1.51:25672 Erlang distribution / RabbitMQ CLI
192.168.1.51:15672 RabbitMQ management API
This ruled out the earlier problems:
ctrl resolving to 127.0.1.1
ctrl resolving to the Kolla VIP
RabbitMQ binding to the wrong IP
RabbitMQ container unhealthy
RabbitMQ not fully booted
2. RabbitMQ user and vhost state was checked
You listed RabbitMQ users, vhosts, and permissions:
docker exec rabbitmq rabbitmqctl list_users
docker exec rabbitmq rabbitmqctl list_vhosts
docker exec rabbitmq rabbitmqctl list_permissions -p /
Observed state:
Listing users ...
user tags
openstack [administrator]
Listing vhosts ...
name
/
Listing permissions for vhost "/" ...
user configure write read
openstack .* .* .*
That showed RabbitMQ had the expected shared OpenStack messaging user:
openstack
with full permissions on the default vhost:
/
3. RabbitMQ management API authentication was verified
Because Kolla often manages RabbitMQ users through the RabbitMQ API or CLI tooling, you tested the management API directly.
Without credentials:
curl -sS -i http://192.168.1.51:15672/api/overview | head -40
Expected result:
HTTP/1.1 401 Unauthorized
That was the desired result. It meant the management API was reachable and was enforcing authentication.
Then the stored Kolla RabbitMQ password was tested:
RABBIT_PASS="$(awk '/^rabbitmq_password:/ {print $2}' /etc/kolla/passwords.yml)"
docker exec rabbitmq rabbitmqctl authenticate_user openstack "$RABBIT_PASS"
Observed:
Authenticating user "openstack" ...
Success
Then API auth was tested:
curl -sS -u "openstack:${RABBIT_PASS}" \
http://192.168.1.51:15672/api/whoami | python3 -m json.tool
Observed:
{
"name": "openstack",
"tags": [
"administrator"
],
"is_internal_user": true
}
You also tested both the VIP and host IP:
for ip in 192.168.1.50 192.168.1.51; do
echo "=== Testing RabbitMQ API on $ip ==="
curl -sS -i -u "openstack:${RABBIT_PASS}" "http://${ip}:15672/api/whoami" | head -20
done
Both returned:
HTTP/1.1 200 OK
That ruled out:
bad openstack RabbitMQ password
bad openstack RabbitMQ permissions
RabbitMQ management API unreachable
VIP/API access problem
4. Manual RabbitMQ user creation was tested
To prove RabbitMQ itself could create users and permissions, you ran a direct test:
docker exec rabbitmq rabbitmqctl add_user test_kolla_user testpassword123
docker exec rabbitmq rabbitmqctl set_permissions -p / test_kolla_user ".*" ".*" ".*"
docker exec rabbitmq rabbitmqctl authenticate_user test_kolla_user testpassword123
docker exec rabbitmq rabbitmqctl delete_user test_kolla_user
Observed:
Adding user "test_kolla_user" ...
Setting permissions for user "test_kolla_user" in vhost "/" ...
Authenticating user "test_kolla_user" ...
Success
Deleting user "test_kolla_user" ...
That proved RabbitMQ’s own user database and permission system were working. The continuing Kolla failure therefore had to be outside RabbitMQ itself.
5. The hidden Kolla task was inspected
You searched for the failing task:
grep -Rni "Ensure RabbitMQ users exist" /opt/kolla-venv/share/kolla-ansible/ansible/roles
It was found here:
/opt/kolla-venv/share/kolla-ansible/ansible/roles/service-rabbitmq/tasks/main.yml
You inspected the task:
TASK_FILE="$(grep -Rli "Ensure RabbitMQ users exist" /opt/kolla-venv/share/kolla-ansible/ansible/roles | head -1)"
echo "$TASK_FILE"
sed -n '1,220p' "$TASK_FILE"
The key part was:
- name: "{{ project_name }} | Ensure RabbitMQ users exist"
kolla_toolbox:
container_engine: "{{ kolla_container_engine }}"
module_name: rabbitmq_user
module_args:
user: "{{ item.user }}"
password: "{{ item.password }}"
node: "rabbit@{{ hostvars[service_rabbitmq_delegate_host]['ansible_facts']['hostname'] }}"
update_password: always
vhost: "{{ item.vhost }}"
configure_priv: ".*"
read_priv: ".*"
tags: "{{ item.tags | default([]) | join(',') }}"
write_priv: ".*"
user: rabbitmq
This was the key discovery.
The RabbitMQ user-management task was not being executed from inside the rabbitmq container. It was being executed by the kolla_toolbox container, as the rabbitmq user, using the rabbitmq_user Ansible module.
That changed the diagnosis.
The question became:
Can kolla_toolbox resolve ctrl correctly?
Can kolla_toolbox reach rabbit@ctrl?
Does kolla_toolbox have the same Erlang cookie as RabbitMQ?
6. kolla_toolbox was checked and found stale
You ran:
echo "=== kolla_toolbox container ==="
docker ps -a --format 'table {{.Names}}\t{{.Status}}' | grep -Ei 'toolbox|NAME'
echo "=== kolla_toolbox hostname resolution ==="
docker exec kolla_toolbox getent hosts ctrl || true
docker exec kolla_toolbox hostname || true
docker exec kolla_toolbox hostname -f || true
echo "=== rabbitmq user inside kolla_toolbox ==="
docker exec kolla_toolbox getent passwd rabbitmq || true
docker exec kolla_toolbox id rabbitmq || true
echo "=== rabbitmqctl from kolla_toolbox ==="
docker exec -u rabbitmq kolla_toolbox rabbitmqctl -n rabbit@ctrl list_users || true
Initial kolla_toolbox state showed:
kolla_toolbox Up 5 hours
127.0.1.1 ctrl ctrl
So kolla_toolbox still had the old bad hostname mapping.
That explained the first failure mode from kolla_toolbox:
connected to epmd on ctrl
epmd reports node rabbit uses port 25672
can't establish TCP connection to the target node, reason: econnrefused
At that point, the issue was:
RabbitMQ container: ctrl -> 192.168.1.51
Host: ctrl -> 192.168.1.51
kolla_toolbox: ctrl -> 127.0.1.1
So Kolla’s RabbitMQ task was failing because kolla_toolbox had stale /etc/hosts data.
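Because each container keeps its own copy of /etc/hosts, it can help to compare what every relevant container resolves in one pass. This is a minimal sketch with a hypothetical helper name (stale_resolvers); the docker loop in the comment is how it might be fed, and is not something that was run in the session.

```shell
# Read "container ip" pairs on stdin and print every container whose
# resolved address differs from the expected one.
# stale_resolvers is a hypothetical helper name.
stale_resolvers() {   # usage: ... | stale_resolvers <expected_ip>
    awk -v want="$1" '
        $2 != want {
            got = ($2 == "" ? "nothing" : $2)
            printf "%s resolves to %s, expected %s\n", $1, got, want
        }
    '
}

# Possible way to feed it (requires docker, so it is left commented out):
# for c in rabbitmq kolla_toolbox; do
#     printf '%s %s\n' "$c" "$(docker exec "$c" getent hosts ctrl | awk '{print $1}')"
# done | stale_resolvers 192.168.1.51

# Demo with the state recorded above.
printf 'rabbitmq 192.168.1.51\nkolla_toolbox 127.0.1.1\n' | stale_resolvers 192.168.1.51
```

Empty output means every listed container agrees with the expected address.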
7. kolla_toolbox was rebuilt
The recommended repair was to remove and recreate kolla_toolbox:
source /opt/kolla-venv/bin/activate
docker stop kolla_toolbox || true
docker rm kolla_toolbox || true
kolla-ansible deploy -i "$KOLLA_INVENTORY" --tags common
After rebuilding, you tested again:
echo "=== kolla_toolbox resolution ==="
docker exec kolla_toolbox getent hosts ctrl
docker exec kolla_toolbox hostname
docker exec kolla_toolbox hostname -f
echo "=== kolla_toolbox rabbitmqctl test ==="
docker exec -u rabbitmq kolla_toolbox rabbitmqctl -n rabbit@ctrl list_users
The hostname issue was fixed:
192.168.1.51 ctrl
ctrl
ctrl
So rebuilding kolla_toolbox successfully fixed the stale host resolution.
8. Second kolla_toolbox issue: Erlang cookie mismatch
After the rebuild, the rabbitmqctl test from kolla_toolbox still failed, but the error changed:
TCP connection succeeded but Erlang distribution failed
suggestion: check if the Erlang cookie is identical
That was a major improvement.
It meant:
hostname resolution: fixed
TCP connectivity to RabbitMQ: fixed
Erlang authentication: still broken
The problem was now specifically the Erlang cookie.
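The two error messages seen so far map cleanly onto two different layers, which can be captured in a small classifier. This is only a sketch: failure_layer is a hypothetical name, and it recognises just the two messages quoted in this summary, not the full set of errors the RabbitMQ CLI can print.

```shell
# Map rabbitmqctl / rabbitmq-diagnostics error text (stdin) to the layer that
# most likely needs attention. Only the two messages quoted in this summary
# are recognised; anything else is reported as "other".
# failure_layer is a hypothetical helper name.
failure_layer() {
    input="$(cat)"
    case "$input" in
        *"TCP connection succeeded but Erlang distribution failed"*)
            echo cookie ;;     # reachable, but Erlang authentication failed
        *"can't establish TCP connection to the target node"*)
            echo network ;;    # name resolution or listener binding problem
        *)
            echo other ;;
    esac
}
```

It could be fed with `docker exec -u rabbitmq kolla_toolbox rabbitmqctl -n rabbit@ctrl list_users 2>&1 | failure_layer`.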
RabbitMQ CLI tools authenticate to the RabbitMQ Erlang node using the .erlang.cookie. Since RabbitMQ had previously been wiped and recreated, it had a fresh cookie. kolla_toolbox still had a different cookie.
The earlier cookie check used these commands:
docker exec rabbitmq sh -c 'sha256sum /var/lib/rabbitmq/.erlang.cookie; ls -l /var/lib/rabbitmq/.erlang.cookie'
docker exec kolla_toolbox sh -c 'find / -maxdepth 5 -type f -name .erlang.cookie -exec ls -l {} \; -exec sha256sum {} \; 2>/dev/null'
RabbitMQ had:
/var/lib/rabbitmq/.erlang.cookie
and kolla_toolbox also had:
/var/lib/rabbitmq/.erlang.cookie
but they did not match.
9. Cookie copy was attempted
You copied the cookie from RabbitMQ into kolla_toolbox:
docker cp rabbitmq:/var/lib/rabbitmq/.erlang.cookie /tmp/.erlang.cookie.rabbitmq
docker cp /tmp/.erlang.cookie.rabbitmq kolla_toolbox:/var/lib/rabbitmq/.erlang.cookie
docker exec kolla_toolbox chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
docker exec kolla_toolbox chmod 400 /var/lib/rabbitmq/.erlang.cookie
The copy itself succeeded:
Successfully copied 20B to /tmp/.erlang.cookie.rabbitmq
Successfully copied 20B to kolla_toolbox:/var/lib/rabbitmq/.erlang.cookie
But ownership and permission changes failed:
chown: Operation not permitted
chmod: Operation not permitted
Then a hash check from the default container user failed:
sha256sum: /var/lib/rabbitmq/.erlang.cookie: Permission denied
That showed the file was present but the permissions/ownership still needed to be repaired as root inside the container.
10. Corrected cookie repair approach
The next step was to inspect the cookie's current ownership and mode as root inside kolla_toolbox:
docker exec -u 0 kolla_toolbox sh -c '
ls -la /var/lib/rabbitmq
ls -l /var/lib/rabbitmq/.erlang.cookie || true
'
Then install the copied cookie with correct owner and mode:
docker cp rabbitmq:/var/lib/rabbitmq/.erlang.cookie /tmp/.erlang.cookie.rabbitmq
docker cp /tmp/.erlang.cookie.rabbitmq kolla_toolbox:/tmp/.erlang.cookie.rabbitmq
docker exec -u 0 kolla_toolbox sh -c '
install -o rabbitmq -g rabbitmq -m 400 /tmp/.erlang.cookie.rabbitmq /var/lib/rabbitmq/.erlang.cookie
rm -f /tmp/.erlang.cookie.rabbitmq
'
Expected cookie validation:
echo "=== rabbitmq cookie ==="
docker exec -u 0 rabbitmq sh -c '
sha256sum /var/lib/rabbitmq/.erlang.cookie
ls -l /var/lib/rabbitmq/.erlang.cookie
'
echo "=== kolla_toolbox cookie ==="
docker exec -u 0 kolla_toolbox sh -c '
sha256sum /var/lib/rabbitmq/.erlang.cookie
ls -l /var/lib/rabbitmq/.erlang.cookie
'
Expected result:
both SHA256 hashes match
both files owned by rabbitmq:rabbitmq
mode is 400 or 600 (accessible by the owner only, which Erlang requires for the cookie file)
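The hash and mode checks can also be expressed as two small predicates that work on local copies of the cookie files. This is a sketch under assumptions: same_cookie and owner_only are hypothetical names, `stat -c` is the GNU coreutils form, and ownership by rabbitmq:rabbitmq still has to be confirmed inside each container with ls -l.

```shell
# Succeed when two cookie files have identical content (compared by SHA-256).
# same_cookie and owner_only are hypothetical helper names.
same_cookie() {   # usage: same_cookie <file_a> <file_b>
    a="$(sha256sum < "$1" | awk '{print $1}')"
    b="$(sha256sum < "$2" | awk '{print $1}')"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Succeed when the file is accessible by its owner only (mode 400 or 600).
# Uses GNU stat; other platforms need a different format flag.
owner_only() {    # usage: owner_only <file>
    mode="$(stat -c '%a' "$1")" || return 1
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}
```

One possible use is to docker cp each container's cookie into a private temporary directory on the host, run `same_cookie` on the two copies, and delete them afterwards.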
Then test as the same user Kolla uses:
docker exec -u rabbitmq kolla_toolbox sh -c '
sha256sum /var/lib/rabbitmq/.erlang.cookie
rabbitmqctl -n rabbit@ctrl list_users
'
Expected:
Listing users ...
user tags
openstack [administrator]
11. Final expected recovery step
Once kolla_toolbox can run:
docker exec -u rabbitmq kolla_toolbox rabbitmqctl -n rabbit@ctrl list_users
successfully, retry Nova:
source /opt/kolla-venv/bin/activate
kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags nova
If Nova succeeds, the authentication and toolbox layers are fixed.
Root cause chain
The complete root cause chain was:
1. RabbitMQ was unhealthy because hostname/IP resolution was wrong.
2. ctrl first resolved to 127.0.1.1.
3. ctrl was then incorrectly mapped to the Kolla VIP, 192.168.1.50.
4. RabbitMQ actually needed ctrl to resolve to the real controller IP, 192.168.1.51.
5. RabbitMQ was reset, and once ctrl resolved to 192.168.1.51 and the container was recreated, it became healthy.
6. The openstack RabbitMQ user and password were valid.
7. Manual RabbitMQ user creation worked.
8. Kolla still failed because the RabbitMQ user task runs from kolla_toolbox.
9. kolla_toolbox was stale and still resolved ctrl to 127.0.1.1.
10. Rebuilding kolla_toolbox fixed hostname resolution.
11. kolla_toolbox then reached RabbitMQ over TCP but failed Erlang distribution auth.
12. Final issue: kolla_toolbox had the wrong Erlang cookie after RabbitMQ was recreated.
Key technical lesson
The Kolla-Ansible RabbitMQ user-management task depends on three layers being correct:
RabbitMQ container:
- must be healthy
- must listen on the correct IP
- must have the expected openstack user/password
Host/container DNS:
- ctrl must resolve to the real controller management IP
- ctrl must not resolve to loopback
- ctrl must not resolve to the Kolla VIP
kolla_toolbox:
- must have current host resolution
- must have rabbitmqctl available
- must have the same /var/lib/rabbitmq/.erlang.cookie as the RabbitMQ node
The final actionable fix was:
rebuild kolla_toolbox to refresh /etc/hosts
copy/repair the RabbitMQ Erlang cookie inside kolla_toolbox
rerun kolla-ansible reconfigure -i "$KOLLA_INVENTORY" --tags nova
