block: Add async QCOW2 backend with io_uring#7882

Open
weltling wants to merge 32 commits into cloud-hypervisor:main from weltling:qcow-async

Conversation

@weltling
Member

@weltling weltling commented Mar 23, 2026

Adds QcowDiskAsync, an async QCOW2 backend using io_uring for data I/O. Automatically selected when io_uring is available; falls back to QcowDiskSync otherwise.

Design

Each virtio queue gets its own QcowAsync instance with an independent file descriptor (via dup) and io_uring ring, enabling parallel data I/O across queues. Metadata (QcowMetadata) is shared via Arc<RwLock<QcowState>>, allowing concurrent read access. Pending requests are batched into a single io_uring::enter() call to amortize syscall overhead.

The implementation uses DiskBackend::Next, implementing AsyncFullDiskFile rather than the legacy async_io::DiskFile trait.
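The per-queue sharing scheme described above can be sketched as follows. This is a simplified, hypothetical model, not the PR's actual code: `QcowState`, `QueueHandle`, and `per_queue_handles` are illustrative names, and `File::try_clone()` stands in for the `dup()` call.

```rust
use std::fs::File;
use std::io;
use std::sync::{Arc, RwLock};

// Illustrative stand-in for the shared QCOW2 metadata state.
struct QcowState {
    cluster_bits: u32, // plus L1/L2 tables, refcount data, caches, ...
}

// One handle per virtio queue: an independent descriptor for data I/O,
// plus the shared, read-mostly metadata.
struct QueueHandle {
    file: File,                       // try_clone() dup()s the descriptor
    metadata: Arc<RwLock<QcowState>>, // concurrent readers, exclusive writers
}

fn per_queue_handles(base: &File, queues: usize) -> io::Result<Vec<QueueHandle>> {
    let metadata = Arc::new(RwLock::new(QcowState { cluster_bits: 16 }));
    (0..queues)
        .map(|_| {
            Ok(QueueHandle {
                // Separate descriptor per queue; data I/O stays positional
                // (preadv/pwritev), so the shared file offset is irrelevant.
                file: base.try_clone()?,
                metadata: Arc::clone(&metadata),
            })
        })
        .collect()
}
```

Each queue then drives its own io_uring instance against its own descriptor, while metadata lookups take the shared read lock only briefly.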

Changes

Refactoring: Extract shared code from qcow_sync.rs into reusable modules:

  • qcow_common.rs: positional I/O helpers (pread64/pwrite64) and iovec scatter/gather
  • qcow/backing.rs: backing file construction (RawBacking, Qcow2MetadataBacking, shared_backing_from)

Async backend:

  • QcowDiskAsync: DiskFile + AsyncDiskFile (size, geometry, sparse, resize, per queue fd creation)
  • QcowAsync: AsyncIo (read, write, fsync, punch_hole, write_zeroes via io_uring)

Batching: Group pending I/O submissions before io_uring::enter()

Integration: Route QCOW2 to QcowDiskAsync in device_manager.rs when io_uring is supported

Async scope

This implementation makes single cluster allocated reads fully async via io_uring. The remaining paths use synchronous I/O with synthetic completions through the same eventfd mechanism, so callers see a uniform async interface regardless:

  • Writes: synchronous (metadata allocation must complete before the data write)
  • Multi mapping reads: synchronous per cluster preadv loop
  • Compressed / backing file reads: synchronous (CPU bound decompression, backing file seek)
  • punch_hole, write_zeroes, fsync: synchronous (single syscall, no io_uring benefit)
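The fast/slow split above boils down to one rule: only a single allocated mapping covering the whole request goes through io_uring. A sketch with illustrative names (`Mapping`, `ReadPath`, `pick_read_path` are not the PR's actual types):

```rust
// Illustrative model of the dispatch rule: everything except a single
// allocated mapping covering the full request takes the synchronous
// fallback and completes synthetically via the shared eventfd.
enum Mapping {
    Allocated { host_offset: u64, len: u64 },
    Unallocated { len: u64 },
    Compressed { len: u64 },
}

#[derive(Debug, PartialEq)]
enum ReadPath {
    IoUring,      // fully async data read
    SyncFallback, // per-cluster preadv loop / decompression / backing file
}

fn pick_read_path(mappings: &[Mapping], request_len: u64) -> ReadPath {
    match mappings {
        // Exactly one allocated mapping covering the whole request: async.
        [Mapping::Allocated { len, .. }] if *len >= request_len => ReadPath::IoUring,
        _ => ReadPath::SyncFallback,
    }
}
```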

The planned progression is described in the block refactoring RFC linked from #7694. Cycle 3 covers async writes with COW, compression handling, and batched multi mapping submissions.

Performance (async vs main, MiB/s)

Uncompressed QCOW2:

| Test | main | async | Delta |
|------|------|-------|-------|
| read | 1579 | 1812 | +15% |
| random_read | 38 | 98 | +156% |
| read_warm | 1799 | 2239 | +24% |
| mq_read | 4559 | 4867 | +7% |
| mq_random_read | 104 | 133 | +28% |
| mq_read_warm | 5634 | 6174 | +10% |

Overlay with backing files:

| Test | main | async | Delta |
|------|------|-------|-------|
| backing_qcow2_read | 1333 | 1485 | +11% |
| backing_qcow2_read_warm | 1793 | 2198 | +23% |
| mq_backing_qcow2_read | 4383 | 5595 | +28% |
| mq_backing_qcow2_read_warm | 5764 | 7513 | +30% |
| backing_raw_read | 1384 | 1628 | +18% |
| backing_raw_read_warm | 1869 | 2181 | +17% |
| mq_backing_raw_read | 4703 | 5956 | +27% |
| mq_backing_raw_read_warm | 6207 | 7580 | +22% |

Overlay random reads (backing files):

| Test | main | async | Delta |
|------|------|-------|-------|
| backing_qcow2_random_read | 36 | 35 | -3% |
| backing_raw_random_read | 43 | 30 | -29% |
| mq_backing_qcow2_random_read | 95 | 87 | -9% |
| mq_backing_raw_random_read | 114 | 88 | -23% |

These are low absolute values (30 to 114 MiB/s) in the same regime as the uncompressed random_read numbers. The random_read regression exists on main as well and is unrelated to the async backend.

Compressed (zlib/zstd): Single queue random reads at parity. Single queue sequential reads 9 to 16% slower. Multi queue 11 to 24% slower. Compressed reads are CPU bound with decompression dominating, and the io_uring submission path adds overhead without benefit. Moving decompression out of the metadata lock is planned as follow up.

Full benchmark logs attached (30 tests each, 5 iterations per test).

perf_main.log
perf_qcow_async.log

Member

@phip1611 phip1611 left a comment


Thanks for working on this! Code wise, mostly looking good. My io_uring knowledge is very limited. I've consulted Claude Code with Sonnet 4.6 model for review. So please take the following remarks with a grain of salt - I can't assess if these are valid or not

Show Claude Code (Sonnet 4.6) Remarks

io_uring-related issues

Confirmed bugs

  • next_completed_request missing cq.sync() — without it the kernel-side CQ head is
    not refreshed before reading, so completions can be silently missed on rings not using
    IORING_SETUP_SQPOLL
  • Fast-path short reads — io_uring returns the raw readv result (which may be less than
    total_len), while the slow path always reports total_len as the result; upper layers see
    inconsistent completion values

Partial submit on error

  • If submit_batch_requests errors mid-batch (e.g. a metadata failure on an Out request),
    SQEs already pushed to the SQ but not yet sync()'d are abandoned — their user_data
    completions never arrive and the virtio queue stalls waiting for them

Asymmetric eventfd signaling

  • The slow path explicitly calls eventfd.write(1) to signal completions; the fast path relies
    on the kernel writing to the eventfd via register_eventfd — this is correct but
    undocumented, making the two paths harder to reason about together

split() lifetime coupling

  • io_uring.split() is called in both read_vectored and submit_batch_requests; holding two
    split guards simultaneously would cause a borrow conflict — currently safe because they are in
    separate call frames, but fragile if the call structure changes

submit_batch_requests misattributes submit errors

  • The submitter.submit() error is wrapped in AsyncIoError::ReadVectored even when the batch
    contained only write requests — there is no appropriate variant for a generic submission
    failure, so errors are misattributed


fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
Member

I think this could be written as

// Drain io_uring completions first, then synthetic ones.
self.io_uring
    .completion()
    .next()
    .map(|entry| (entry.user_data(), entry.result()))
    .or_else(|| self.completion_list.pop_front())

General question: What are synthetic ones in this context?

Member Author

Indeed, I rewrote this. Also just saw raw_async.rs follows the same pattern (except it doesn't need the synthetic queue), so it's more consistent.

Synthetic means completions for operations that were handled synchronously, such as writes, multi mapping reads, and compressed or backing file reads. They simulate a completion but were never actually processed through io_uring. Some will become async in the follow up patches, though; writes are planned for cycle 3 as a separate effort due to complexity.
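A minimal model of the two completion sources (the `Completions` struct and the VecDeque stand-in for the CQ are illustrative, not the PR's code):

```rust
use std::collections::VecDeque;

// Toy model: real io_uring CQEs and synthetic completions (from
// synchronously handled operations) drain through one method, so the
// caller cannot tell which path served a request.
struct Completions {
    cq: VecDeque<(u64, i32)>,        // stand-in for the io_uring completion queue
    synthetic: VecDeque<(u64, i32)>, // queued by the synchronous slow path
}

impl Completions {
    fn next_completed_request(&mut self) -> Option<(u64, i32)> {
        // Drain io_uring completions first, then synthetic ones.
        self.cq.pop_front().or_else(|| self.synthetic.pop_front())
    }
}
```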

Thanks


fn write_zeroes(&mut self, offset: u64, length: u64, user_data: u64) -> AsyncIoResult<()> {
    self.punch_hole(offset, length, user_data)
Member

Maybe add a comment here?

// unallocated QCOW2 clusters inherently read as zero

Member Author

Added a comment here with an explanation, matching what qcow_sync.rs has, too. While unallocated clusters reading as zero is a spec guarantee, here it is primarily deallocate_bytes that does the zeroing.

Thanks

{
info!("Using asynchronous QCOW2 disk file (io_uring)");

#[cfg(not(feature = "io_uring"))]
Member

this looks a little weird - any idea how we could rewrite it in a way that reads more naturally?

Member Author

This follows the existing pattern used by VHD and RAW io_uring backends in the same function.

Restructuring would affect all the involved image types. One option could be hoisting this same #[cfg] just once to the top into a single global if branch, so the compiler would still see the dead branch when the feature is off.
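A hedged sketch of that idea; `pick_backend` and the returned strings are purely illustrative, not device_manager.rs code:

```rust
// Hypothetical shape of hoisting the feature check once, so each image
// type no longer needs its own #[cfg] block. With the feature disabled
// the first branch compiles away and everything falls back to sync.
fn pick_backend(io_uring_supported: bool) -> &'static str {
    #[cfg(feature = "io_uring")]
    {
        if io_uring_supported {
            return "QcowDiskAsync";
        }
    }
    let _ = io_uring_supported; // silence the unused warning when the feature is off
    "QcowDiskSync"
}
```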

Thanks

for req in batch_request {
    match req.request_type {
        RequestType::In => {
            let address = req.offset as u64;
Member

The inner code of these match arms is kinda huge. Combined with the embedded continue, it is difficult to read. Do you see potential to move some parts into helper functions?

Member Author

Now that you mention it - yep, it became big. I went for extracting some reusable common parts and splitting into helpers. It's still a fair amount of code, but hopefully it's now easier to handle. I'm also using #[inline] to curb the function call overhead.

Thanks

host_offset,
length,
} => {
let _ = self.data_file.file_mut().punch_hole(host_offset, length);
Member

we're silently ignoring a potential error here. Might be worth logging an error?

host_offset,
length,
} => {
let _ = self
Member

we're silently ignoring a potential error here. Might be worth logging an error?

Member Author

These let _ = are intentional, same as the other case, as the underlying filesystem may not support FALLOC_FL_PUNCH_HOLE or FALLOC_FL_ZERO_RANGE. The error is not fatal. The QCOW2 metadata has already been updated at this point, so future reads will return zeros regardless. The data file operations are purely an optimization to reclaim physical space if sparse=on.

The same pattern is used in the sync part. Also please consider the comment in the metadata layer itself.

Wrt logging - in cases where it's not supported, every DISCARD or WRITE_ZEROES would then emit a warning, which would probably be quite noisy.

Thanks

}

if needs_submit {
sq.sync();
Member

sq.sync() is called after the loop only when needs_submit is true. If SQEs were pushed but the loop also hit an error mid-batch (e.g. a metadata error on an Out request returning ?), the SQEs already pushed to sq but not yet sync'd will be abandoned.

Is this a problem? I know very little about io_uring

Member Author

A very good question. I went and checked the io-uring crate code again. The SubmissionQueue Drop impl does store the tail, so pushed SQEs become visible in the ring. However, without a submit() call they won't be processed until the next submission. To my understanding, at that point they'd produce CQEs for already failed requests.

That said, this is a pre-existing pattern: raw_async.rs behaves the same way if sq.push() fails mid batch. This edge case is shared.

A proper fix would need to either roll back the tail on error or submit and drain the orphaned SQEs before returning. I'll look into this as a follow up.

Thanks

}
sync_completions.push((req.user_data, total_len as i32));
}
_ => {}
Member

are we sure that we do not need sq.sync() here? What if the request is Flush or Discard - doesn't the backend need to perform actions for that?

Member Author

Good catch! Today anything other than In and Out is filtered out, see block.rs. But it should fail hard here. I've added an unreachable! call to uncover that, just in case.

Thanks


fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
Member

When the ring was constructed without IORING_SETUP_SQPOLL, it might be necessary to call cq.sync() beforehand 🤔

But I could also be wrong - not an expert with io_uring

Member Author

I've double checked the io-uring crate code and it shows that completion() creates a fresh CompletionQueue on each call. For that, it also loads the tail pointer from the kernel, which means it operates on the same data anyway. The existing RAW io_uring backend raw_async.rs uses the same pattern.

Thanks

@russell-islam
Contributor

The majority of disk I/O at small block sizes spans more than one cluster mapping. This PR actually does almost 90% of the reads synchronously. What benefit can we get from this partially implemented async I/O?

@weltling
Member Author

Thanks for working on this! Code wise, mostly looking good. My io_uring knowledge is very limited. I've consulted Claude Code with Sonnet 4.6 model for review. So please take the following remarks with a grain of salt - I can't assess if these are valid or not.

Thanks for the review and for being transparent about using Claude Code! I use AI tools too, they're genuinely helpful for spotting patterns and sanity checking assumptions, especially in unfamiliar territory like io_uring. I'm learning io_uring as I go myself, same as you, so I'm happy to double check on any questions or do further investigation together to make sure we get this right.

Show Claude Code (Sonnet 4.6) Remarks

io_uring-related issues

Confirmed bugs

* **`next_completed_request` missing `cq.sync()`** — without it the kernel-side CQ head is
  not refreshed before reading, so completions can be silently missed on rings not using
  `IORING_SETUP_SQPOLL`

Addressed inline, false positive. completion() creates a brand new object, so there is no stale state to refresh.

* **Fast-path short reads** — io_uring returns the raw `readv` result (which may be less than
  `total_len`), while the slow path always reports `total_len` as the result; upper layers see
  inconsistent completion values

A QCOW2 file is always a regular file. The fast path operates on clusters fully allocated and contiguous on disk. The same logic exists in raw_async.rs. We're working on a regular file descriptor, not a socket or the like.

Partial submit on error

* If `submit_batch_requests` errors mid-batch (e.g. a metadata failure on an `Out` request),
  SQEs already pushed to the SQ but not yet `sync()`'d are abandoned — their `user_data`
  completions never arrive and the virtio queue stalls waiting for them

Addressed inline. It's valid, and it's a pre-existing pattern.

Asymmetric eventfd signaling

* The slow path explicitly calls `eventfd.write(1)` to signal completions; the fast path relies
  on the kernel writing to the eventfd via `register_eventfd` — this is correct but
  undocumented, making the two paths harder to reason about together

It's actually the same eventfd. It's registered with io_uring for kernel driven signaling, and manually written to for synthetic completions. That's what unifies both paths under a single notification mechanism. The epoll loop doesn't need to know where the completion came from. See QcowAsync::new().

split() lifetime coupling

* `io_uring.split()` is called in both `read_vectored` and `submit_batch_requests`; holding two
  split guards simultaneously would cause a borrow conflict — currently safe because they are in
  separate call frames, but fragile if the call structure changes

It's unclear why Claude calls this fragile. If holding two split guards through &mut self could cause a conflict, it would fail to compile. So it's not fragile; should the call structure change, the code will need to be restructured.

submit_batch_requests misattributes submit errors

* The `submitter.submit()` error is wrapped in `AsyncIoError::ReadVectored` even when the batch
  contained only write requests — there is no appropriate variant for a generic submission
  failure, so errors are misattributed

Valid. I just fixed this, using the more appropriate AsyncIoError::SubmitBatchRequests now.

Overall - a nice demonstration of AI assisted review catching real patterns (the partial submit issue, the misattributed error variant) alongside false positives (the cq.sync() one). Good to scrutinize both together. Thanks for taking the time to go through these and happy to dig deeper into any of the points if needed.

Thanks!

@weltling
Member Author

The majority of disk I/O at small block sizes spans more than one cluster mapping. This PR actually does almost 90% of the reads synchronously. What benefit can we get from this partially implemented async I/O?

Thanks for looking at this.

Could you share how you arrived at the 90% sync figure? I would like to understand the reasoning, because the math I get points in the opposite direction.

The fast path condition is a single allocated cluster mapping covering the full request. The default QCOW2 cluster size is 64K. With 4K block I/O, a read crosses a cluster boundary only when its starting offset falls in the last 4095 bytes of a cluster, that is 4095/65536 = ~6.25% of random offsets. So about 94% of random 4K reads land entirely within one cluster and, if allocated, take the async io_uring path. With smaller block sizes it gets even better: 512 byte I/O only crosses a boundary at 511/65536 = ~0.78% of offsets, so over 99% of those reads hit the fast path. Smaller block sizes make the async path more likely, not less. For aligned sequential reads, 4K divides evenly into 64K clusters, so 100% take the async path.
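For the record, the arithmetic can be checked mechanically; `crossing_probability` is just an illustration of the formula, not code from the PR:

```rust
// Probability that a read of `block` bytes at a uniformly random,
// byte-granular offset crosses a cluster boundary: the start must fall
// in the last `block - 1` bytes of a cluster.
fn crossing_probability(block: u64, cluster: u64) -> f64 {
    (block - 1) as f64 / cluster as f64
}
```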

The sync fallback is not about block size. It fires for unallocated clusters, compressed clusters, backing file reads, and writes. Those are image structure properties. On a warmed up uncompressed QCOW2 without backing, virtually all clusters are allocated, so virtually all reads go through io_uring.

Even with only async reads landed so far, the performance results in the PR description already show measurable gains for uncompressed QCOW2 workloads. Those gains will only grow as the remaining sync paths move to io_uring in follow up cycles.

Regarding the "partially implemented" aspect - this is Cycle 1 of the refactoring plan tracked in #7560. The follow up cycles cover async writes, async backing file reads, and async compressed cluster reads. Each of those adds substantial complexity and is scoped separately intentionally, not because it was overlooked.

Thanks

@russell-islam
Contributor

I misunderstood your code, thanks for the clarification. What about a block size of 1MB?

Ok(())
}

// Writes are synchronous. Async writes require a multi step
Contributor

If you are going to support async write later please put a TODO here.

Member Author

Done, thanks.

@weltling
Member Author

I misunderstood your code, thanks for the clarification. What about a block size of 1MB?

With 1MB I/O and 64K clusters, a single request spans 16 clusters, so it produces multiple mappings and takes the sync fallback. That is correct.

A couple of things to consider:

  • Cloud Hypervisor supports all cluster sizes from 9 to 21 bit as defined by the QCOW2 spec, with default 16 bit. With larger clusters, more I/O lands in the fast path. For example with 1MB clusters, a 1MB aligned read is a single mapping and goes through io_uring.
  • The tradeoff of larger clusters is increased internal fragmentation and COW amplification on partial writes, so the default 64K is a reasonable middle ground for most workloads.
  • The current batching batches across virtio requests, not across cluster mappings within a single request. When one read spans 16 clusters, those 16 preadv ops are still synchronous. Submitting them to io_uring as a batch can be done as a follow up.
  • Adjacent allocated mappings could also be coalesced into a single preadv spanning multiple clusters, avoiding the per cluster overhead entirely. The metadata layer already returns mappings in guest address order, so detecting contiguous host offsets should be straightforward.
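The coalescing idea in the last bullet could look roughly like this. A sketch only, with hypothetical names (`HostMapping`, `coalesce`); it assumes the metadata layer hands back mappings in guest address order, as stated above:

```rust
// Merge mappings whose host ranges are contiguous on disk, so one
// preadv can cover several clusters instead of one syscall per cluster.
#[derive(Debug, PartialEq, Clone, Copy)]
struct HostMapping {
    host_offset: u64,
    len: u64,
}

fn coalesce(mappings: &[HostMapping]) -> Vec<HostMapping> {
    let mut out: Vec<HostMapping> = Vec::new();
    for m in mappings {
        match out.last_mut() {
            // Previous run ends exactly where this mapping starts: extend it.
            Some(prev) if prev.host_offset + prev.len == m.host_offset => prev.len += m.len,
            _ => out.push(*m),
        }
    }
    out
}
```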

This is the initial async implementation. Not all possible optimizations are in place yet and I am sure this will be a returning point for multiple rounds of improvements. The performance data in the description was measured with the default 64K cluster size and standard fio workloads. The gains reflect the common case where the majority of reads hit the fast path.

As a side note - a QCOW2 performance tuning guide would also be a good topic to address in the future. I'll make a note to do some writing when time permits.

Thanks

@DemiMarie
Contributor

Would it make sense to use something more scalable than RwLock?

@weltling
Member Author

Would it make sense to use something more scalable than RwLock?

Good question. RwLock is fundamental to the current QCOW metadata design. std::sync::RwLock is the conservative choice here, providing poisoning tracking for free across several complex multistep mutation paths in metadata.

parking_lot::RwLock would have lower overhead but drops poisoning and adds an external dependency. Could be revisited if contention shows up at higher queue counts.

If you have a specific alternative in mind, I'd be interested to hear it.

Thanks

@DemiMarie
Contributor

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

@phip1611
Member

phip1611 commented Mar 24, 2026

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

Please note that we at Cyberus Technology have comprehensive benchmarking infrastructure (fio throughput, IOPS) including visualizations in Grafana - we might be able to help investigate such optimizations

@weltling
Member Author

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

The suggestions are ambitious but certainly interesting.

RCU is a natural fit for the read path conceptually, the L2 lookup is a pure shared read. The difficulty is on the write side. Today the metadata mutations are already multistep (allocate cluster, update L2, adjust refcounts, evict cache, grow L1), and some are recursive. Beyond that, QCOW2 features like snapshots, dirty bitmaps, and extended L2 entries would add more structures that must stay consistent with each other under writes.

Adapting those to RCU grace periods or hazard pointer retire semantics means the "copy" side needs to capture a consistent view across multiple interdependent structures, not just swap one pointer. That is substantial additional complexity. That's also why I deliberately chose the simplicity of one coarse lock over multiple fine grained locks/mutexes. As in: which object should be protected when a single operation touches 5 structures? :)

On the 2M IOPS ceiling, the lock protects only the metadata lookup (two array indexes + a cache map get), not the data I/O itself. The actual read goes through io_uring after the lock is released. Per op hold time is very short. Whether the lock becomes the bottleneck at higher queue counts is worth profiling before committing to a more complex scheme.

@phip1611 that benchmarking infrastructure sounds useful for validating whether metadata locking is the limiter at scale. Happy to collaborate on that.

Worth investigating as a separate effort once the async refactoring is complete and there is a stable baseline to measure against.

Thanks

@weltling weltling force-pushed the qcow-async branch 2 times, most recently from 5034690 to a8cac2c on March 28, 2026 at 12:33
@phip1611
Member

I was thinking of a read-optimized design, like RCU,

I think this is out of scope here. Is there a follow-up ticket?

Member

@phip1611 phip1611 left a comment

Code-wise, looking good. Thanks for your work on that! It is hard for me to verify the io_uring/async specific parts. Could you add a unit test or an integration test? This would give me more confidence for an approval.

Thanks for your great work!


match result {
Ok(actions) => {
for action in actions {
Member

Could the for loop, or at least its body, be moved to a helper function? The long code block doesn't read that nicely.

Member Author

Yep, moved the deallocation action body into the apply_dealloc_action() helper, which made the punch_hole() shorter.

Thanks

@phip1611 phip1611 mentioned this pull request Mar 30, 2026
@weltling
Member Author

I was thinking of a read-optimized design, like RCU,

I think this out of scope here. Is there a follow-up ticket?

Yep, it's out of scope here, as it would be another broad refactoring on top of this one once complete. On my side there's also no plan to change the locking model; the focus is rather on the features made possible by the separated metadata I/O and on further optimizations on top of the initial implementation.

Thanks

@weltling
Member Author

Code-wise, looking good. Thanks for your work on that! It is hard for me to verify the io_uring/async-specific parts. Could you add a unit test or an integration test? This would give me more confidence for an approval.

Thanks for your great work!

Thanks for the thorough reviews!

I've added async-implementation-specific unit tests covering the most important situations: read/write roundtrip, cross-cluster-boundary reads, mixed batch requests, etc. More unit tests can be added as the work progresses, and more will certainly come in the form of micro-benchmarks.

For the integration tests: the existing integration tests already cover QcowAsync, since io_uring is available in that environment. That's the same pattern used for RAW - if io_uring is available, it's preferred. That's also why I haven't added more integration tests. Otherwise, there are some ideas; I'm going to add more during the further optimizations (async writes, decompression rework, and much more).

Thanks!

@russell-islam
Contributor

This PR adds partial async support; shouldn't the PR description/subject mention that?

@weltling
Member Author

This PR adds partial async support; shouldn't the PR description/subject mention that?

I've added an "Async scope" section to the description that breaks down which paths are async and which remain synchronous, with the rationale for each.

That said, "partial" is nuanced here. The backend implements the full AsyncIo trait and every operation goes through the async interface. What varies is the internal path - not all I/O can go through io_uring, so it will never be 100% io_uring. What can move to io_uring in follow-up cycles is planned through the RFC.

Thanks

weltling added 22 commits April 7, 2026 21:21
QCOW2 images support both sparse operations and the zero flag.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Delegates to QcowMetadata::resize. Rejects resize when a backing
file is present, same as the sync backend.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Marker supertrait combining all composable capability traits.
QcowDiskAsync now satisfies the full DiskFile contract.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Per queue I/O worker that uses io_uring for asynchronous reads
against fully allocated clusters. The struct holds the shared
metadata, data file, optional backing reader, the io_uring
instance and a synthetic completion list.

Feature gated on io_uring in lib.rs.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add the AsyncIo trait impl with notifier and next_completed_request
filled in. The remaining methods are stubbed with unimplemented
and will be filled in by subsequent commits.

next_completed_request drains io_uring completions first, then
falls back to the synthetic completion list.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Single allocated cluster reads are submitted to io_uring for true
async completion. Mixed mapping reads (zero, compressed, backing,
multi cluster) fall back to synchronous pread64 with synthetic
completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Synchronous per cluster write path - gather guest data from iovecs,
map each cluster through QcowMetadata, and pwrite to the allocated
host offset. Partial cluster writes with a backing file read the
backing data first so map_cluster_for_write can perform COW.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Flush dirty metadata caches and sync the underlying file via
QcowMetadata::flush, then signal synthetic completion.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Deallocate clusters through QcowMetadata::deallocate_bytes, then
apply the resulting DeallocAction list (punch hole or write zeroes
at host offsets). write_zeroes delegates to punch_hole since
unallocated QCOW2 clusters inherently read as zero.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
try_clone shares the Arc wrapped metadata and backing file.
new_async_io creates a QcowAsync worker with its own io_uring
instance for the given ring depth.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When io_uring is available and not disabled, open QCOW2 images
with QcowDiskAsync for asynchronous reads. Falls back to
QcowDiskSync otherwise.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement batch_requests_enabled() and submit_batch_requests() for
QcowAsync. Without batching, each read_vectored call performs its own
io_uring submit() syscall. With batching, the virtio queue handler
collects all pending requests and submits them in a single call,
pushing multiple SQEs before one submit() syscall.

Each request in the batch is classified through the metadata layer.
Requests that hit the fast path (single allocated cluster mapping)
are pushed to the io_uring submission queue. Requests that require
the slow path (compressed, backing, zero fill, or mixed mappings)
are completed synchronously and queued as synthetic completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for punch hole completion. The test
verifies that a punch hole request reports successful completion
and that the deallocated range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for write zeroes completion. The test
verifies that a write zeroes request reports successful completion
and that the zeroed range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes a byte pattern through
write_vectored, reads it back through read_vectored, and verifies
the data matches. This exercises the core async write and read
paths end to end.

Also add an async_write helper for future tests.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes distinct patterns into two
adjacent clusters, then issues a single read spanning the cluster
boundary. Verifies that multi mapping read resolution returns the
correct data from both clusters.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that submits a batch of interleaved
write and read requests via submit_batch_requests. Verifies that
all completions arrive with the correct user_data and that the
read back data matches the written data.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that reads from a range that was never
written. Verifies the fundamental QCOW contract that unallocated
clusters return zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes 4K into the middle of a
cluster, then reads the entire cluster back. Verifies that the
written region matches and surrounding bytes remain zero. This
exercises the COW path where unwritten parts of a newly allocated
cluster must be zero filled.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write data, punch hole to deallocate, then rewrite the same
range and verify the new contents read back correctly.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write a distinct byte pattern into each of eight consecutive
clusters in a single operation, then read the full range back
and verify per cluster contents.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The unit test container runs with Docker default seccomp
profile which blocks io_uring_setup, io_uring_enter and
io_uring_register. This causes all qcow_async unit tests to
fail with EPERM when creating an io_uring instance.

Add --security-opt seccomp=unconfined to the unit test docker
run invocation. The container already has --device access and
cap_net_admin, so this does not materially change the security
posture.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
@weltling
Member Author

weltling commented Apr 7, 2026

I have followed this PR through multiple review rounds. At this point, though, given the substantial qcow2 and io_uring involvement, I do not feel confident enough to give it a safe approval. That said, the code has improved a lot since the initial version.

Thanks for the thorough reviews across all the rounds, the feedback has genuinely improved the code. Completely understand the hesitation given the qcow2 and io_uring depth.

The async backend already shows measurable gains on the read side, particularly for random reads, multi queue workloads, and backing file reads. The existing RAW async backend using the same io_uring patterns has been running in production for a while now. The QCOW2 side follows those established patterns closely, and the unit tests cover the key paths. The integration tests also exercise this backend transparently on any io_uring capable host.

That said, if there are specific areas that feel uncertain, happy to add more targeted tests or documentation to build confidence. Let me know what would be most useful.

Thanks

@DemiMarie DemiMarie left a comment

If possible, please do not merge this PR until #7971 is in. #7971 contains important memory safety fixes to the existing async I/O backends. This code also uses io_uring without proper lifetime management for guest memory and so has the same problems.

If this PR needs to be merged before #7971, please wrap the existing RawFileAsync rather than using io_uring directly. #7971 will turn RawFileAsync into a safe wrapper around io_uring.

Comment on lines +229 to +234
fn read_vectored(
&mut self,
offset: libc::off_t,
iovecs: &[libc::iovec],
user_data: u64,
) -> AsyncIoResult<()> {
Contributor

A safe function containing libc::iovec is almost always a bad idea. libc::iovec contains a raw pointer, which safe code can set to anything.

Member Author

Agreed, this is a concern with the AsyncIo trait definition itself. The read_vectored and write_vectored signatures taking &[libc::iovec] in a safe function predate this PR. Every backend implementing the trait has the same exposure. Fixing it at the trait level is the right approach, which is what #7971 targets.

Thanks

Comment on lines +248 to +259
// SAFETY: fd is valid and iovecs point to valid guest memory.
unsafe {
sq.push(
&opcode::Readv::new(types::Fd(fd), iovecs.as_ptr(), iovecs.len() as u32)
.offset(host_offset)
.build()
.user_data(user_data),
)
.map_err(|_| {
AsyncIoError::ReadVectored(Error::other("Submission queue is full"))
})?;
};
Contributor

This safety justification isn’t sufficient. First, safe Rust code can set the pointers and lengths in the iovecs to anything. Second, there is no guarantee that the guest memory will not be freed before the operation completes. See #7971 and #7955.

#7971 fixes the RawFileAsync and RawFileAsyncAio backends, but not this one because it isn’t merged yet. Please use the APIs introduced in #7971 instead of adding new unsafe uses of io_uring.

As per #7955, I believe this is not strictly theoretical. Right now, triggering this bug requires hot-unplugging a virtio-pmem device, but in the future it could be triggered by any guest with a virtio-GPU device (implemented via the generic vhost-user frontend). @alyssais is going to be sending a PR for this.

Member Author

Yep, the same pattern and the same gap exist in RawFileAsync::read_vectored today.

#7971 fixes that by tracking in-flight operations and holding guest memory until the CQE is reaped, which is the correct approach. I'm happy to adapt to those APIs once #7971 lands; or, if this PR goes first, happy to contribute the adaptation as part of the #7971 integration.

Thanks

@DemiMarie
Contributor

@weltling I don’t think this PR makes the existing situation worse. However, it does make fixing it more difficult. In particular, it’s much better to have a single, safe abstraction around io_uring (and possibly AIO), rather than trying to reimplement them.

If any of the APIs in #7971 aren’t suitable for your needs, please let me know. My goal is that all disk I/O in Cloud Hypervisor should go through them.

@weltling
Member Author

weltling commented Apr 8, 2026

@weltling I don’t think this PR makes the existing situation worse. However, it does make fixing it more difficult. In particular, it’s much better to have a single, safe abstraction around io_uring (and possibly AIO), rather than trying to reimplement them.

If any of the APIs in #7971 aren’t suitable for your needs, please let me know. My goal is that all disk I/O in Cloud Hypervisor should go through them.

@DemiMarie thanks for the detailed review and for linking #7955 and #7971.

Both points are valid. The AsyncIo trait taking &[libc::iovec] in a safe function is a design issue in the trait itself, not introduced by this PR, but that doesn't make it less of a concern. And the guest memory lifetime concern is real - if io_uring holds references to guest memory that gets unmapped, that's unsound regardless of how unlikely the trigger is today.

I've looked at #7971, and the approach of tracking in-flight operations by user_data and holding a guest memory snapshot until the CQE is reaped makes sense. Those APIs will also be helpful when implementing async writes for QCOW2. I'm going to take a more detailed look at those PRs.

This PR currently follows the same pattern as the existing RawFileAsync backend, which #7971 also fixes. The sequencing between the two PRs needs to be coordinated - whichever lands last will need to adapt and rebase on top of the other. Happy to rebase this one on top of #7971 once it lands. If this one happens to go first, the adaptation would be on the #7971 side.

Thanks

@weltling
Member Author

weltling commented Apr 8, 2026

@DemiMarie to note here also: wrapping RawFileAsync isn't directly applicable here, because QCOW2 reads involve address translation before the io_uring read. QcowAsync does its own sq.push() at the resolved host offset after the metadata lookup; it's not a pass-through to a raw file. That said, if #7971 exposes safe primitives for io_uring submission with guest memory lifetime tracking, QcowAsync can adopt those instead of calling io_uring::IoUring directly.

Thanks

@DemiMarie
Contributor

@weltling What if BatchRequestInner allowed setting its offset and decreasing its size? That would allow RawFileAsync to be used.
