pci: expand sub-page VFIO BAR mmap to page size#7939

Merged
rbradford merged 1 commit into cloud-hypervisor:main from saravan2:64KiB
Apr 12, 2026

Member

@saravan2 saravan2 commented Mar 31, 2026

On aarch64 with 64K host pages, VFIO passthrough of devices with sub-page BARs (e.g. 16K NVMe BAR0) crashes with EINVAL from KVM_SET_USER_MEMORY_REGION, which requires memory_size to be a multiple of the host page size.

Expand the mmap to page size instead of rejecting it, matching QEMU's approach. The kernel's vfio_pci_probe_mmaps() already verifies that sub-page BARs are page-aligned and reserves the remainder of the page, so this is safe.

The expanded mmap region will not overlap with the relocated MSI-X trap region because fixup_msix_region() ensures MSI-X relocation at >= page_size offset.
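The size rule behind the failure can be sketched in a few lines. This is a minimal illustration, not the code in pci/src/vfio.rs: `is_page_multiple` and `align_up` are hypothetical helper names, and the only rule modelled is the one stated above (memory_size must be a multiple of the host page size).

```rust
/// Hypothetical helper: true if `size` is a whole number of host pages.
fn is_page_multiple(size: u64, page_size: u64) -> bool {
    size % page_size == 0
}

/// Hypothetical helper: round `size` up to the next page boundary.
/// Assumes `page_size` is a power of two.
fn align_up(size: u64, page_size: u64) -> u64 {
    (size + page_size - 1) & !(page_size - 1)
}

fn main() {
    const KIB: u64 = 1024;
    let bar_size = 16 * KIB; // the 16K NVMe BAR0 from the description

    for page_size in [4 * KIB, 16 * KIB, 64 * KIB] {
        println!(
            "page={}K accepted_as_is={} expanded_size={}K",
            page_size / KIB,
            is_page_multiple(bar_size, page_size),
            align_up(bar_size, page_size) / KIB
        );
    }
}
```

With 4K or 16K host pages the 16K BAR is already a page multiple; only the 64K case needs the expansion.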

Validation

Host: aarch64, 64K pages (kernel 6.11), VFIO passthrough.

Device Under Test

Intel NVMe DC SSD [3DNAND, Sentinel Rock Controller] — 16K BAR0 with MSI-X in the same BAR.

  0006:01:00.0 Non-Volatile memory controller: Intel Corporation NVMe DC SSD
    Region 0: Memory at 630040010000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [50] MSI-X: Enable- Count=136 Masked-
            Vector table: BAR=0 offset=00002000
            PBA: BAR=0 offset=00003000
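
As a worked check of the lspci output above, the sketch below computes how much of BAR0 the MSI-X structures occupy. It relies on the PCI-defined sizes (16 bytes per MSI-X table entry, one pending bit per vector packed into 64-bit words); the function names are hypothetical and not taken from the cloud-hypervisor sources.

```rust
/// Size in bytes of an MSI-X table with `vectors` entries (16 bytes each).
fn msix_table_size(vectors: u64) -> u64 {
    vectors * 16
}

/// Size in bytes of the Pending Bit Array: one bit per vector,
/// rounded up to whole 64-bit words.
fn msix_pba_size(vectors: u64) -> u64 {
    (vectors + 63) / 64 * 8
}

fn main() {
    let vectors = 136; // Count=136 from lspci
    let table_offset: u64 = 0x2000;
    let pba_offset: u64 = 0x3000;
    let bar_size: u64 = 0x4000; // 16K
    let host_page: u64 = 0x10000; // 64K

    let table_end = table_offset + msix_table_size(vectors);
    let pba_end = pba_offset + msix_pba_size(vectors);

    println!("table: {:#x}..{:#x}", table_offset, table_end);
    println!("pba:   {:#x}..{:#x}", pba_offset, pba_end);
    println!("fits in BAR: {}", pba_end <= bar_size);
    // Both structures sit in host page 0, together with the rest of the BAR.
    println!(
        "host page of table: {}, of pba: {}",
        table_offset / host_page,
        pba_offset / host_page
    );
}
```

The table and the PBA share the single 64K host page with the rest of the BAR.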

Launch

sudo /home/saravanand/cloud-hypervisor/target/aarch64-unknown-linux-musl/release/cloud-hypervisor \
    --cpus boot=4 \
    --memory size=0 \
    --memory-zone id=mem0,size=1G id=mem1,size=1G \
    --numa \
        guest_numa_id=0,cpus=[0-1],distances=[1@20,2@15],memory_zones=mem0 \
        guest_numa_id=1,cpus=[2-3],distances=[0@20,2@25],memory_zones=mem1 \
        guest_numa_id=2,device_id=nvme0,distances=[0@15,1@25] \
    --firmware /home/saravanand/test-generic-initiator/CLOUDHV_EFI.fd \
    --disk path=/home/saravanand/test-generic-initiator/ubuntu-24.04-server-cloudimg-arm64.raw \
           path=/home/saravanand/test-generic-initiator/seed.img \
    --device id=nvme0,path=/sys/bus/pci/devices/0006:01:00.0 \
    --net "tap=,mac=,ip=192.168.249.1,mask=255.255.255.0" \
    --serial tty \
    --console off

Functional

  • Guest enumerates NVMe controller and creates block device (/dev/nvme0n1)
  • Guest lspci shows doubled virtual BAR (128K) with MSI-X relocated to upper half
  • dmesg confirms 4/0/0 default/read/poll queues — full NVMe operation

Performance (fio 4K random read, io_uring, iodepth=64, numjobs=4, 60s, 3 runs)

ubuntu@ubuntu:~$ sudo fio \
    --name=randread \
    --ioengine=io_uring \
    --direct=1 \
    --bs=4k \
    --iodepth=64 \
    --numjobs=4 \
    --rw=randread \
    --size=1G \
    --runtime=60 \
    --time_based \
    --filename=/dev/nvme0n1 \
    --group_reporting

Median IOPS: 747,589 (no regression from the pre-#7904 baseline)

Member

@phip1611 phip1611 left a comment

I'm not a PCI expert, but I think the changes are good - thanks for the contribution! I'm worried about code readability and think we can do a little better here.

Comment thread pci/src/vfio.rs Outdated
Comment thread pci/src/vfio.rs Outdated
@saravan2 saravan2 marked this pull request as ready for review April 11, 2026 00:55
@saravan2 saravan2 requested a review from a team as a code owner April 11, 2026 00:55
Member Author

saravan2 commented Apr 11, 2026

Before #7904, generate_sparse_areas() read the relocated MSI-X offsets from
msix.cap (already modified by fixup_msix_region()).

For 16K BAR device on a 64K page host:

  1. fixup_msix_region() relocates MSI-X from 0x2000 to 0x10000 (upper half of a 128K virtual BAR)
  2. generate_sparse_areas(region_size=0x4000) inserts a hole at
    align_page_size_down(0x10000) = 0x10000, beyond the physical 16K BAR
  3. The hole-carving loop sees range_offset (0x10000) > current_offset (0) and creates
    a sparse area {offset: 0, size: 0x10000} = 64K
  4. This 64K sparse area exceeds the physical BAR (16K) but happens to equal the host page
    size. The kernel's vfio_pci_probe_mmaps() reserves the full page for sub-page BARs,
    so mmap(64K) succeeds. KVM_SET_USER_MEMORY_REGION(size=64K) succeeds because 64K
    is page-aligned

The out-of-range MSI-X hole accidentally inflated the sparse area to exactly page size,
masking the sub-page BAR problem.

#7904 added an offset < region_size guard to skip out-of-range holes. Now
generate_sparse_areas() correctly produces a 16K sparse area matching the physical BAR.
But KVM_SET_USER_MEMORY_REGION rejects 16K with EINVAL because it is not a multiple
of the 64K host page size. Passthrough fails at VM boot.
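
The before and after behaviour can be reproduced with a simplified model of the hole-carving loop. This is a sketch under stated assumptions, not the actual generate_sparse_areas(): the `sparse_areas` function, its `skip_out_of_range` flag, and the single combined MSI-X hole of size 0x1000 are all hypothetical stand-ins chosen to mirror the steps above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct SparseArea {
    offset: u64,
    size: u64,
}

fn align_down(value: u64, page: u64) -> u64 {
    value & !(page - 1)
}

fn align_up(value: u64, page: u64) -> u64 {
    (value + page - 1) & !(page - 1)
}

/// Simplified model: carve page-aligned `holes` (offset, size) out of
/// [0, region_size). `skip_out_of_range` models the guard added in #7904.
fn sparse_areas(
    region_size: u64,
    holes: &[(u64, u64)],
    page: u64,
    skip_out_of_range: bool,
) -> Vec<SparseArea> {
    let mut ranges: Vec<(u64, u64)> = holes
        .iter()
        .copied()
        .filter(|&(offset, _)| !skip_out_of_range || offset < region_size)
        .map(|(offset, size)| (align_down(offset, page), align_up(offset + size, page)))
        .collect();
    ranges.sort();

    let mut areas = Vec::new();
    let mut current = 0u64;
    for (start, end) in ranges {
        if start > current {
            // Everything between the previous hole and this one is mappable.
            areas.push(SparseArea { offset: current, size: start - current });
        }
        current = current.max(end);
    }
    if region_size > current {
        areas.push(SparseArea { offset: current, size: region_size - current });
    }
    areas
}

fn main() {
    let page: u64 = 0x10000; // 64K host page
    let region_size: u64 = 0x4000; // 16K physical BAR
    // Relocated MSI-X hole at 0x10000; the 0x1000 size is illustrative only.
    let holes: [(u64, u64); 1] = [(0x10000, 0x1000)];

    for guard in [false, true] {
        for area in sparse_areas(region_size, &holes, page, guard) {
            println!(
                "guard={} offset={:#x} size={:#x} page_multiple={}",
                guard,
                area.offset,
                area.size,
                area.size % page == 0
            );
        }
    }
}
```

Without the guard the model yields one 64K area; with it, one 16K area that is no longer a multiple of the host page size.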

This patch explicitly expands sub-page sparse area mmaps to host page size in
map_mmio_regions(), and ensures fixup_msix_region() makes the virtual BAR lower half
at least one page so the expanded KVM slot cannot overlap the relocated MSI-X trap region.
The result is the same 64K mmap as the accidental pre-#7904 path, but achieved correctly.
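
The overlap argument can be checked numerically. The sketch below assumes a simplified layout policy in which MSI-X is relocated to the start of the upper half of the virtual BAR and the lower half is max(BAR size, minimum lower-half size); `msix_offset_in_virtual_bar` and `slot_overlaps_msix` are hypothetical names, and the real fixup_msix_region() may compute its sizes differently.

```rust
fn align_up(value: u64, page: u64) -> u64 {
    (value + page - 1) & !(page - 1)
}

/// Hypothetical policy: the relocated MSI-X region starts where the lower
/// half of the virtual BAR ends, and the lower half is never smaller than
/// `min_lower_half`.
fn msix_offset_in_virtual_bar(bar_size: u64, min_lower_half: u64) -> u64 {
    bar_size.max(min_lower_half)
}

/// True if a sparse area at offset 0, expanded to whole pages, would reach
/// into the relocated MSI-X trap region.
fn slot_overlaps_msix(area_size: u64, page: u64, msix_offset: u64) -> bool {
    align_up(area_size, page) > msix_offset
}

fn main() {
    let bar_size: u64 = 0x4000; // 16K physical BAR
    let page: u64 = 0x10000; // 64K host page

    // Lower half at least one page: MSI-X lands at 0x10000.
    let with_floor = msix_offset_in_virtual_bar(bar_size, page);
    // No floor (illustrative counter-case): MSI-X would land at 0x4000.
    let without_floor = msix_offset_in_virtual_bar(bar_size, 0);

    println!(
        "with floor:    msix at {:#x}, overlap={}",
        with_floor,
        slot_overlaps_msix(bar_size, page, with_floor)
    );
    println!(
        "without floor: msix at {:#x}, overlap={}",
        without_floor,
        slot_overlaps_msix(bar_size, page, without_floor)
    );
}
```

The counter-case shows why the page-size floor on the lower half matters: without it, the expanded 64K slot would cover the trap region.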

Member

@likebreath likebreath left a comment

Thank you for the contribution with detailed comments and performance regression test.

Some comments below. Please also don't forget to update the commit message and comments added.

Comment thread pci/src/vfio.rs Outdated
Comment thread pci/src/vfio.rs
Comment thread pci/src/vfio.rs Outdated
@saravan2
Member Author

Some comments below. Please also don't forget to update the commit message and comments added.

Addressed all of them. Updated comments, commit message and PR description. Thanks for your review.

@saravan2 saravan2 requested a review from likebreath April 12, 2026 01:23
Comment thread pci/src/vfio.rs
);
return Err(VfioPciError::MmapArea);
}
info!(
Member

nit: Include the BAR ID (region.index) here and in the error above for easier debugging?

Member Author

Added region.index information to the messages.

On aarch64 with 64K host pages, VFIO passthrough of devices with
sub-page BARs (e.g. 16K NVMe BAR0) crashes with EINVAL from
KVM_SET_USER_MEMORY_REGION, which requires memory_size to be a
multiple of the host page size.

Expand the mmap to page size instead of rejecting it, matching
QEMU's approach. The kernel's vfio_pci_probe_mmaps() already
verifies that sub-page BARs are page-aligned and reserves the
remainder of the page, so expansion is safe at offset 0. Reject
sub-page sparse areas at non-zero offsets where this guarantee
does not apply.

The expanded mmap region will not overlap with the relocated MSI-X
trap region because fixup_msix_region() ensures MSI-X relocation
at >= page_size offset.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
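
The rule stated in the commit message (expand at offset 0, reject sub-page areas elsewhere) can be sketched as a small decision function. This is an illustration only: `mmap_size_for_area` and `MmapSizeError` are hypothetical names, and the merged code in map_mmio_regions() returns VfioPciError::MmapArea and logs region.index rather than using these types.

```rust
#[derive(Debug, PartialEq)]
enum MmapSizeError {
    /// A sub-page sparse area that does not start at offset 0; the kernel's
    /// page reservation guarantee is not assumed to apply here.
    SubPageAtNonZeroOffset { offset: u64, size: u64 },
}

/// Decide the mmap length for a sparse area of a BAR.
/// Areas of a page or more are passed through unchanged in this sketch.
fn mmap_size_for_area(offset: u64, size: u64, page_size: u64) -> Result<u64, MmapSizeError> {
    if size >= page_size {
        return Ok(size);
    }
    if offset != 0 {
        return Err(MmapSizeError::SubPageAtNonZeroOffset { offset, size });
    }
    // Sub-page area at offset 0: expand to one full host page.
    Ok(page_size)
}

fn main() {
    let page: u64 = 0x10000; // 64K host page

    // 16K BAR0 at offset 0: expanded to 64K.
    println!("{:?}", mmap_size_for_area(0, 0x4000, page));
    // Sub-page area at a non-zero offset: rejected.
    println!("{:?}", mmap_size_for_area(0x1000, 0x1000, page));
    // Same BAR on a 4K-page host: no expansion needed.
    println!("{:?}", mmap_size_for_area(0, 0x4000, 0x1000));
}
```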
@rbradford rbradford enabled auto-merge April 12, 2026 10:44
@rbradford rbradford added this pull request to the merge queue Apr 12, 2026
Merged via the queue into cloud-hypervisor:main with commit 23a980c Apr 12, 2026
37 of 38 checks passed
@saravan2 saravan2 deleted the 64KiB branch April 17, 2026 07:00

4 participants