VMM crashes when guest programs BAR to unallocatable MMIO address #7938

@CMGS

Description

Describe the bug

While debugging Windows virtio-win 0.1.285 NetKVM driver issues (#7925), we observed a scenario where the guest programs a PCI BAR to an address that falls outside the MMIO64 allocator range. This causes move_bar to fail, but the BAR config register has already been updated by detect_bar_reprogramming(). The resulting inconsistency between config space and the MMIO bus mapping leads to virtio device activation failure (Number of enabled queues lower than min), crashing the VMM.

In our case, the trigger was a driver bug (wrong used_len in ctrl_queue response caused the driver to reset and attempt BAR reprogramming to a high address). After fixing the driver-side issue, this crash path is no longer hit. However, the underlying VMM behavior — crashing on a failed BAR move — seems worth hardening regardless of the trigger.

There are two independent issues:

1. BAR config register not rolled back on failed move

detect_bar_reprogramming() in pci/src/configuration.rs updates bars[].addr before move_bar is called. When move_bar fails (e.g., address outside allocator range), the config register says the BAR is at the new address, but the MMIO bus mapping is still at the old address. The guest reads back the new address, assumes it worked, and configures queues accordingly — but the device backend is at the old address. This causes queue setup to partially fail, and activate() rejects the device.
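The rollback idea can be shown with a minimal, self-contained sketch. The types and function names below are illustrative stand-ins, not the actual cloud-hypervisor code; the real detect_bar_reprogramming()/move_bar interaction spans several components.

```rust
// Sketch: keep the BAR config register consistent with the MMIO bus
// mapping by rolling back the register when the move fails.

struct Bar {
    addr: u64,
}

struct Device {
    bar: Bar,
}

// Hypothetical stand-in for move_bar's failure mode: the new address
// lies outside the allocator range.
fn move_bar(new_addr: u64, mmio_end: u64) -> Result<(), String> {
    if new_addr >= mmio_end {
        Err(format!("failed allocating new MMIO range at {:#x}", new_addr))
    } else {
        Ok(())
    }
}

impl Device {
    fn reprogram_bar(&mut self, new_addr: u64, mmio_end: u64) -> Result<(), String> {
        let old_addr = self.bar.addr;
        // detect_bar_reprogramming() has already written the new address
        // into the config register by this point:
        self.bar.addr = new_addr;
        if let Err(e) = move_bar(new_addr, mmio_end) {
            // Rollback: restore the register so the guest reads back the
            // address that is actually mapped on the bus.
            self.bar.addr = old_addr;
            return Err(e);
        }
        Ok(())
    }
}

fn main() {
    let mut dev = Device { bar: Bar { addr: 0xe000_0000 } };
    // Guest programs the BAR beyond the allocator range: the move fails,
    // but the register rolls back to the old, still-mapped address.
    let res = dev.reprogram_bar(1u64 << 45, 1u64 << 40);
    assert!(res.is_err());
    assert_eq!(dev.bar.addr, 0xe000_0000);
    println!("bar addr after failed move: {:#x}", dev.bar.addr);
}
```

Without the rollback, the final assertion would fail: the register would read back the unmapped new address, which is exactly the inconsistency the guest driver then trips over.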

2. MMIO64 allocator loses ~4 GiB at top of address space

create_mmio_allocators() computes segment size as (range / alignment) * alignment, which truncates up to one alignment unit (4 GiB with 4 GiB alignment). Addresses near the top of the physical address space fall in this gap and cannot be allocated. While this didn't cause our issue directly (it was the driver bug), it's a latent issue for hosts with 46-bit physical addressing where the gap is at ~64 TiB.
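The truncation is a few lines of arithmetic. In the sketch below, only the 46-bit ceiling and the 4 GiB alignment come from this report; the region start (0xE000_0000) and single-segment layout are assumed values for illustration, and the resulting gap is 512 MiB rather than the full 4 GiB worst case.

```rust
fn main() {
    // Assumed start of the MMIO64 region (illustrative value only):
    let start: u64 = 0xE000_0000;
    // Top of a 46-bit physical address space (64 TiB):
    let end: u64 = 1 << 46;
    let alignment: u64 = 4 << 30; // 4 GiB

    let range = end - start;
    // Truncating to a multiple of the alignment drops up to one unit:
    let size = (range / alignment) * alignment;
    let allocator_end = start + size;
    let gap = end - allocator_end;

    println!("unallocatable gap at top of address space: {:#x} bytes", gap);

    // The failing BAR address from the error log lands inside that gap:
    let bar_addr: u64 = 0x3fff_ffd8_0000;
    assert!(bar_addr >= allocator_end && bar_addr < end);
}
```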

Error log (before fix):

WARN: Failed moving device BAR: failed allocating new MMIO range: 0x3fffffd80000->0x3fffffe00000(0x80000)
ERROR: Number of enabled queues lower than min: 1 vs 2
Fatal error: VmmThread(ActivateVirtioDevices(...))

Version

cloud-hypervisor v51.0.0 (main branch)

VM configuration

cloud-hypervisor \
  --kernel CLOUDHV.fd \
  --disk path=windows.qcow2,num_queues=2,queue_size=512 \
  --cpus boot=2,kvm_hyperv=on \
  --memory size=4G \
  --net tap=tap0,mac=52:54:00:12:34:56,num_queues=2 \
  --serial tty --console off

Guest: Windows 11 25H2, virtio-win 0.1.285
Host: Linux 6.17, 46-bit physical addressing

Defensive fixes

We have two commits that harden the VMM against this class of failure. We're not sure whether upstream considers this worth merging, since the original trigger was a driver issue, but we're sharing it in case it's useful:

Branch: https://github.com/CMGS/cloud-hypervisor/tree/fix/bar-rollback-defensive

  1. pci: rollback BAR address on failed move_bar — restores config register and bar_regions to old values when move_bar fails, keeping device state consistent with MMIO mapping
  2. vmm: extend last MMIO64 allocator to cover full range — gives the last PCI segment allocator all remaining space up to end of device area, eliminating the alignment truncation gap
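The second fix can be sketched as follows. The region bounds and segment count are made-up example values, and the loop is a simplified stand-in for create_mmio_allocators(), not the actual patch; it only illustrates the idea of letting the last segment absorb the truncation remainder.

```rust
fn main() {
    // Illustrative layout: 4 GiB start, 46-bit ceiling, two PCI segments.
    let start: u64 = 1 << 32;
    let end: u64 = 1 << 46;
    let alignment: u64 = 4 << 30; // 4 GiB
    let num_segments: u64 = 2;

    // Naive per-segment size, truncated to the alignment:
    let aligned_size = ((end - start) / num_segments / alignment) * alignment;

    // Without the fix, the truncation leaves a full 4 GiB unreachable
    // at the top of the device area in this layout:
    assert_eq!(end - (start + num_segments * aligned_size), 4 << 30);

    // With the fix: the last segment takes all remaining space up to `end`.
    let mut segments = Vec::new();
    for i in 0..num_segments {
        let seg_start = start + i * aligned_size;
        let seg_size = if i == num_segments - 1 {
            end - seg_start // extend last segment to cover the full range
        } else {
            aligned_size
        };
        segments.push((seg_start, seg_size));
    }

    let (last_start, last_size) = *segments.last().unwrap();
    assert_eq!(last_start + last_size, end); // no gap at the top
    println!("last segment: {:#x}..{:#x}", last_start, last_start + last_size);
}
```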

Logs

See the error log above — the crash is deterministic when triggered.
