block: Add async QCOW2 backend with io_uring#7882

Open
weltling wants to merge 32 commits into cloud-hypervisor:main from weltling:qcow-async

Conversation

@weltling
Member

@weltling weltling commented Mar 23, 2026

Adds QcowDiskAsync, an async QCOW2 backend using io_uring for data I/O. Automatically selected when io_uring is available; falls back to QcowDiskSync otherwise.

Design

Each virtio queue gets its own QcowAsync instance with an independent file descriptor (via dup) and io_uring ring, enabling parallel data I/O across queues. Metadata (QcowMetadata) is shared via Arc<RwLock<QcowState>>, allowing concurrent read access. Pending requests are batched into a single io_uring::enter() call to amortize syscall overhead.

The implementation uses DiskBackend::Next, implementing AsyncFullDiskFile rather than the legacy async_io::DiskFile trait.
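The per-queue sharing scheme described above can be sketched as follows. This is a simplified, hypothetical model, not the PR's actual code: `QcowState`, `QueueHandle`, and `per_queue_handles` are illustrative names, and `File::try_clone()` stands in for the `dup()` call.

```rust
use std::fs::File;
use std::io;
use std::sync::{Arc, RwLock};

// Illustrative stand-in for the shared QCOW2 metadata state.
struct QcowState {
    cluster_bits: u32, // plus L1/L2 tables, refcount data, caches, ...
}

// One handle per virtio queue: an independent descriptor for data I/O,
// plus the shared, read-mostly metadata.
struct QueueHandle {
    file: File,                       // try_clone() dup()s the descriptor
    metadata: Arc<RwLock<QcowState>>, // concurrent readers, exclusive writers
}

fn per_queue_handles(base: &File, queues: usize) -> io::Result<Vec<QueueHandle>> {
    let metadata = Arc::new(RwLock::new(QcowState { cluster_bits: 16 }));
    (0..queues)
        .map(|_| {
            Ok(QueueHandle {
                // Separate descriptor per queue; data I/O stays positional
                // (preadv/pwritev), so the shared file offset is irrelevant.
                file: base.try_clone()?,
                metadata: Arc::clone(&metadata),
            })
        })
        .collect()
}
```

Each queue then drives its own io_uring instance against its own descriptor, while metadata lookups take the shared read lock only briefly.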

Changes

Refactoring: Extract shared code from qcow_sync.rs into reusable modules:

  • qcow_common.rs: positional I/O helpers (pread64/pwrite64) and iovec scatter/gather
  • qcow/backing.rs: backing file construction (RawBacking, Qcow2MetadataBacking, shared_backing_from)

Async backend:

  • QcowDiskAsync: DiskFile + AsyncDiskFile (size, geometry, sparse, resize, per queue fd creation)
  • QcowAsync: AsyncIo (read, write, fsync, punch_hole, write_zeroes via io_uring)

Batching: Group pending I/O submissions before io_uring::enter()

Integration: Route QCOW2 to QcowDiskAsync in device_manager.rs when io_uring is supported

Async scope

This implementation makes single cluster allocated reads fully async via io_uring. The remaining paths use synchronous I/O with synthetic completions through the same eventfd mechanism, so callers see a uniform async interface regardless:

  • Writes: synchronous (metadata allocation must complete before the data write)
  • Multi mapping reads: synchronous per cluster preadv loop
  • Compressed / backing file reads: synchronous (CPU bound decompression, backing file seek)
  • punch_hole, write_zeroes, fsync: synchronous (single syscall, no io_uring benefit)
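The fast/slow split above boils down to one rule: only a single allocated mapping covering the whole request goes through io_uring. A sketch with illustrative names (`Mapping`, `ReadPath`, `pick_read_path` are not the PR's actual types):

```rust
// Illustrative model of the dispatch rule: everything except a single
// allocated mapping covering the full request takes the synchronous
// fallback and completes synthetically via the shared eventfd.
enum Mapping {
    Allocated { host_offset: u64, len: u64 },
    Unallocated { len: u64 },
    Compressed { len: u64 },
}

#[derive(Debug, PartialEq)]
enum ReadPath {
    IoUring,      // fully async data read
    SyncFallback, // per-cluster preadv loop / decompression / backing file
}

fn pick_read_path(mappings: &[Mapping], request_len: u64) -> ReadPath {
    match mappings {
        // Exactly one allocated mapping covering the whole request: async.
        [Mapping::Allocated { len, .. }] if *len >= request_len => ReadPath::IoUring,
        _ => ReadPath::SyncFallback,
    }
}
```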

The planned progression is described in the block refactoring RFC linked from #7694. Cycle 3 covers async writes with COW, compression handling, and batched multi mapping submissions.

Performance (async vs main, MiB/s)

Uncompressed QCOW2:

| Test | main | async | Delta |
|------|------|-------|-------|
| read | 1579 | 1812 | +15% |
| random_read | 38 | 98 | +156% |
| read_warm | 1799 | 2239 | +24% |
| mq_read | 4559 | 4867 | +7% |
| mq_random_read | 104 | 133 | +28% |
| mq_read_warm | 5634 | 6174 | +10% |

Overlay with backing files:

| Test | main | async | Delta |
|------|------|-------|-------|
| backing_qcow2_read | 1333 | 1485 | +11% |
| backing_qcow2_read_warm | 1793 | 2198 | +23% |
| mq_backing_qcow2_read | 4383 | 5595 | +28% |
| mq_backing_qcow2_read_warm | 5764 | 7513 | +30% |
| backing_raw_read | 1384 | 1628 | +18% |
| backing_raw_read_warm | 1869 | 2181 | +17% |
| mq_backing_raw_read | 4703 | 5956 | +27% |
| mq_backing_raw_read_warm | 6207 | 7580 | +22% |

Overlay random reads (backing files):

| Test | main | async | Delta |
|------|------|-------|-------|
| backing_qcow2_random_read | 36 | 35 | -3% |
| backing_raw_random_read | 43 | 30 | -29% |
| mq_backing_qcow2_random_read | 95 | 87 | -9% |
| mq_backing_raw_random_read | 114 | 88 | -23% |

These are low absolute values (30 to 114 MiB/s) in the same regime as the uncompressed random_read numbers. The random_read regression exists on main as well and is unrelated to the async backend.

Compressed (zlib/zstd): Single queue random reads at parity. Single queue sequential reads 9 to 16% slower. Multi queue 11 to 24% slower. Compressed reads are CPU bound with decompression dominating, and the io_uring submission path adds overhead without benefit. Moving decompression out of the metadata lock is planned as follow up.

Full benchmark logs attached (30 tests each, 5 iterations per test).

perf_main.log
perf_qcow_async.log

Member

@phip1611 phip1611 left a comment


Thanks for working on this! Code wise, mostly looking good. My io_uring knowledge is very limited. I've consulted Claude Code with Sonnet 4.6 model for review. So please take the following remarks with a grain of salt - I can't assess if these are valid or not

Show Claude Code (Sonnet 4.6) Remarks

io_uring-related issues

Confirmed bugs

  • next_completed_request missing cq.sync() — without it the kernel-side CQ head is
    not refreshed before reading, so completions can be silently missed on rings not using
    IORING_SETUP_SQPOLL
  • Fast-path short reads — io_uring returns the raw readv result (which may be less than
    total_len), while the slow path always reports total_len as the result; upper layers see
    inconsistent completion values

Partial submit on error

  • If submit_batch_requests errors mid-batch (e.g. a metadata failure on an Out request),
    SQEs already pushed to the SQ but not yet sync()'d are abandoned — their user_data
    completions never arrive and the virtio queue stalls waiting for them

Asymmetric eventfd signaling

  • The slow path explicitly calls eventfd.write(1) to signal completions; the fast path relies
    on the kernel writing to the eventfd via register_eventfd — this is correct but
    undocumented, making the two paths harder to reason about together

split() lifetime coupling

  • io_uring.split() is called in both read_vectored and submit_batch_requests; holding two
    split guards simultaneously would cause a borrow conflict — currently safe because they are in
    separate call frames, but fragile if the call structure changes

submit_batch_requests misattributes submit errors

  • The submitter.submit() error is wrapped in AsyncIoError::ReadVectored even when the batch
    contained only write requests — there is no appropriate variant for a generic submission
    failure, so errors are misattributed


fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
Member

I think this could be written as

// Drain io_uring completions first, then synthetic ones.
self.io_uring
    .completion()
    .next()
    .map(|entry| (entry.user_data(), entry.result()))
    .or_else(|| self.completion_list.pop_front())

General question: What are synthetic ones in this context?

Member Author

Indeed, I rewrote this. Also just saw raw_async.rs follows the same pattern (except it doesn't need the synthetic queue), so it's more consistent.

Synthetic means completions for operations that were handled synchronously, such as writes, multi mapping reads, and compressed or backing file reads. They simulate a completion but were never actually processed through io_uring. Some will become async in the follow up patches, though; writes are planned for cycle 3 as a separate effort due to complexity.
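A minimal model of the two completion sources (the `Completions` struct and the VecDeque stand-in for the CQ are illustrative, not the PR's code):

```rust
use std::collections::VecDeque;

// Toy model: real io_uring CQEs and synthetic completions (from
// synchronously handled operations) drain through one method, so the
// caller cannot tell which path served a request.
struct Completions {
    cq: VecDeque<(u64, i32)>,        // stand-in for the io_uring completion queue
    synthetic: VecDeque<(u64, i32)>, // queued by the synchronous slow path
}

impl Completions {
    fn next_completed_request(&mut self) -> Option<(u64, i32)> {
        // Drain io_uring completions first, then synthetic ones.
        self.cq.pop_front().or_else(|| self.synthetic.pop_front())
    }
}
```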

Thanks


fn write_zeroes(&mut self, offset: u64, length: u64, user_data: u64) -> AsyncIoResult<()> {
    self.punch_hole(offset, length, user_data)
Member

Maybe add a comment here?

// unallocated QCOW2 clusters inherently read as zero

Member Author

Added a comment here with an explanation, matching what qcow_sync.rs has, too. While unallocated clusters reading as zero is a spec guarantee, here it is primarily deallocate_bytes that does the zeroing.

Thanks

{
info!("Using asynchronous QCOW2 disk file (io_uring)");

#[cfg(not(feature = "io_uring"))]
Member

this looks a little weird - any idea how we could rewrite it in a way that reads more naturally?

Member Author

This follows the existing pattern used by VHD and RAW io_uring backends in the same function.

Restructuring would affect all the involved image types. One option could be hoisting this same #[cfg] just once to the top into a single global if branch, so the compiler would still see the dead branch when the feature is off.
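A hedged sketch of that idea; `pick_backend` and the returned strings are purely illustrative, not device_manager.rs code:

```rust
// Hypothetical shape of hoisting the feature check once, so each image
// type no longer needs its own #[cfg] block. With the feature disabled
// the first branch compiles away and everything falls back to sync.
fn pick_backend(io_uring_supported: bool) -> &'static str {
    #[cfg(feature = "io_uring")]
    {
        if io_uring_supported {
            return "QcowDiskAsync";
        }
    }
    let _ = io_uring_supported; // silence the unused warning when the feature is off
    "QcowDiskSync"
}
```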

Thanks

for req in batch_request {
    match req.request_type {
        RequestType::In => {
            let address = req.offset as u64;
Member

The inner code of these match arms is kinda huge. Combined with the embedded continue, it is difficult to read. Do you see potential to move some parts into helper functions?

Member Author

Now that you mention it - yep, it became big. I went for extracting some reusable common parts and splitting into helpers. It's still a fair amount of code, but hopefully it's now easier to handle. I'm also using #[inline] to curb the function call overhead.

Thanks

host_offset,
length,
} => {
let _ = self.data_file.file_mut().punch_hole(host_offset, length);
Member

we're silently ignoring a potential error here. Might be worth logging an error?

host_offset,
length,
} => {
let _ = self
Member

we're silently ignoring a potential error here. Might be worth logging an error?

Member Author

These let _ = are intentional, same as the other case, as the underlying filesystem may not support FALLOC_FL_PUNCH_HOLE or FALLOC_FL_ZERO_RANGE. The error is not fatal. The QCOW2 metadata has already been updated at this point, so future reads will return zeros regardless. The data file operations are purely an optimization to reclaim physical space if sparse=on.

The same pattern is used in the sync part. Also please consider the comment in the metadata layer itself.

Wrt logging - in cases where it's not supported, every DISCARD or WRITE_ZEROES would then emit a warning, which would probably be quite noisy.

Thanks

}

if needs_submit {
sq.sync();
Member

sq.sync() is called after the loop only when needs_submit is true. If SQEs were pushed but the loop also hit an error mid-batch (e.g. a metadata error on an Out request returning ?), the SQEs already pushed to sq but not yet sync'd will be abandoned.

Is this a problem? I know very little about io_uring

Member Author

A very good question. I went and checked the io-uring crate code again. The SubmissionQueue Drop impl does store the tail, so pushed SQEs become visible in the ring. However, without a submit() call they won't be processed until the next submission. To my understanding, at that point they'd produce CQEs for already failed requests.

That said, this is a pre-existing pattern: raw_async.rs behaves the same way if sq.push() fails mid batch. This edge case is shared.

A proper fix would need to either roll back the tail on error or submit and drain the orphaned SQEs before returning. I'll look into this as a follow up.

Thanks

}
sync_completions.push((req.user_data, total_len as i32));
}
_ => {}
Member

are we sure that we do not need sq.sync() here? What if the request is Flush or Discard - doesn't the backend need to perform actions for that?

Member Author

Good catch! Today anything other than In and Out is filtered out, see block.rs. But it should fail hard here. I've added an unreachable! call to uncover that, just in case.

Thanks


fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
Member

When the ring was constructed without IORING_SETUP_SQPOLL, it might be necessary to call cq.sync() beforehand 🤔

But I could also be wrong - not an expert with io_uring

Member Author

I've double checked the io-uring crate code and it shows that completion() creates a fresh CompletionQueue on each call. For that, it also loads the tail pointer from the kernel, which means it operates on the same data anyway. The existing RAW io_uring backend raw_async.rs uses the same pattern.

Thanks

@russell-islam
Contributor

The majority of disk I/O at small block sizes spans more than one cluster mapping. This PR actually does almost 90% of the reads synchronously. What benefit can we get from this partially implemented async I/O?

@weltling
Member Author

Thanks for working on this! Code wise, mostly looking good. My io_uring knowledge is very limited. I've consulted Claude Code with Sonnet 4.6 model for review. So please take the following remarks with a grain of salt - I can't assess if these are valid or not.

Thanks for the review and for being transparent about using Claude Code! I use AI tools too, they're genuinely helpful for spotting patterns and sanity checking assumptions, especially in unfamiliar territory like io_uring. I'm learning io_uring as I go myself, same as you, so I'm happy to double check on any questions or do further investigation together to make sure we get this right.

Show Claude Code (Sonnet 4.6) Remarks

io_uring-related issues

Confirmed bugs

* **`next_completed_request` missing `cq.sync()`** — without it the kernel-side CQ head is
  not refreshed before reading, so completions can be silently missed on rings not using
  `IORING_SETUP_SQPOLL`

Addressed inline, false positive. completion() creates a brand new object, so there is no stale state to refresh.

* **Fast-path short reads** — io_uring returns the raw `readv` result (which may be less than
  `total_len`), while the slow path always reports `total_len` as the result; upper layers see
  inconsistent completion values

A QCOW2 file is always a regular file. The fast path operates on clusters fully allocated and contiguous on disk. The same logic exists in raw_async.rs. We're working on a regular file descriptor, not a socket or the like.

Partial submit on error

* If `submit_batch_requests` errors mid-batch (e.g. a metadata failure on an `Out` request),
  SQEs already pushed to the SQ but not yet `sync()`'d are abandoned — their `user_data`
  completions never arrive and the virtio queue stalls waiting for them

Addressed inline. It's valid, and it's a pre-existing pattern.

Asymmetric eventfd signaling

* The slow path explicitly calls `eventfd.write(1)` to signal completions; the fast path relies
  on the kernel writing to the eventfd via `register_eventfd` — this is correct but
  undocumented, making the two paths harder to reason about together

It's actually the same eventfd. It's registered with io_uring for kernel driven signaling, and manually written to for synthetic completions. That's what unifies both paths under a single notification mechanism. The epoll loop doesn't need to know where the completion came from. See QcowAsync::new().

split() lifetime coupling

* `io_uring.split()` is called in both `read_vectored` and `submit_batch_requests`; holding two
  split guards simultaneously would cause a borrow conflict — currently safe because they are in
  separate call frames, but fragile if the call structure changes

It's unclear why Claude calls this fragile. If holding two split guards through &mut self could cause a conflict, it would fail to compile. So it's not fragile; should the call structure change, the code will need to be restructured.

submit_batch_requests misattributes submit errors

* The `submitter.submit()` error is wrapped in `AsyncIoError::ReadVectored` even when the batch
  contained only write requests — there is no appropriate variant for a generic submission
  failure, so errors are misattributed

Valid. I just fixed this, using the more appropriate AsyncIoError::SubmitBatchRequests now.

Overall - a nice demonstration of AI assisted review catching real patterns (the partial submit issue, the misattributed error variant) alongside false positives (the cq.sync() one). Good to scrutinize both together. Thanks for taking the time to go through these and happy to dig deeper into any of the points if needed.

Thanks!

@weltling
Member Author

The majority of disk I/O at small block sizes spans more than one cluster mapping. This PR actually does almost 90% of the reads synchronously. What benefit can we get from this partially implemented async I/O?

Thanks for looking at this.

Could you share how you arrived at the 90% sync figure? I would like to understand the reasoning, because the math I get points in the opposite direction.

The fast path condition is a single allocated cluster mapping covering the full request. The default QCOW2 cluster size is 64K. With 4K block I/O, a read crosses a cluster boundary only when its starting offset falls in the last 4095 bytes of a cluster, that is 4095/65536 = ~6.25% of random offsets. So about 94% of random 4K reads land entirely within one cluster and, if allocated, take the async io_uring path. With smaller block sizes it gets even better: 512 byte I/O only crosses a boundary at 511/65536 = ~0.78% of offsets, so over 99% of those reads hit the fast path. Smaller block sizes make the async path more likely, not less. For aligned sequential reads, 4K divides evenly into 64K clusters, so 100% take the async path.
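For the record, the arithmetic can be checked mechanically; `crossing_probability` is just an illustration of the formula, not code from the PR:

```rust
// Probability that a read of `block` bytes at a uniformly random,
// byte-granular offset crosses a cluster boundary: the start must fall
// in the last `block - 1` bytes of a cluster.
fn crossing_probability(block: u64, cluster: u64) -> f64 {
    (block - 1) as f64 / cluster as f64
}
```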

The sync fallback is not about block size. It fires for unallocated clusters, compressed clusters, backing file reads, and writes. Those are image structure properties. On a warmed up uncompressed QCOW2 without backing, virtually all clusters are allocated, so virtually all reads go through io_uring.

Even with only async reads landed so far, the performance results in the PR description already show measurable gains for uncompressed QCOW2 workloads. Those gains will only grow as the remaining sync paths move to io_uring in follow up cycles.

Regarding the "partially implemented" aspect - this is Cycle 1 of the refactoring plan tracked in #7560. The follow up cycles cover async writes, async backing file reads, and async compressed cluster reads. Each of those adds substantial complexity and is scoped separately intentionally, not because it was overlooked.

Thanks

@russell-islam
Contributor

I misunderstood your code, thanks for the clarification. What about a block size of 1MB?

Ok(())
}

// Writes are synchronous. Async writes require a multi step
Contributor

If you are going to support async write later please put a TODO here.

Member Author

Done, thanks.

@weltling
Member Author

I misunderstood your code, thanks for the clarification. What about a block size of 1MB?

With 1MB I/O and 64K clusters, a single request spans 16 clusters, so it produces multiple mappings and takes the sync fallback. That is correct.

A couple of things to consider:

  • Cloud Hypervisor supports all cluster sizes from 9 to 21 bit as defined by the QCOW2 spec, with default 16 bit. With larger clusters, more I/O lands in the fast path. For example with 1MB clusters, a 1MB aligned read is a single mapping and goes through io_uring.
  • The tradeoff of larger clusters is increased internal fragmentation and COW amplification on partial writes, so the default 64K is a reasonable middle ground for most workloads.
  • The current batching batches across virtio requests, not across cluster mappings within a single request. When one read spans 16 clusters, those 16 preadv ops are still synchronous. Submitting them to io_uring as a batch can be done as a follow up.
  • Adjacent allocated mappings could also be coalesced into a single preadv spanning multiple clusters, avoiding the per cluster overhead entirely. The metadata layer already returns mappings in guest address order, so detecting contiguous host offsets should be straightforward.
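The coalescing idea in the last bullet could look roughly like this. A sketch only, with hypothetical names (`HostMapping`, `coalesce`); it assumes the metadata layer hands back mappings in guest address order, as stated above:

```rust
// Merge mappings whose host ranges are contiguous on disk, so one
// preadv can cover several clusters instead of one syscall per cluster.
#[derive(Debug, PartialEq, Clone, Copy)]
struct HostMapping {
    host_offset: u64,
    len: u64,
}

fn coalesce(mappings: &[HostMapping]) -> Vec<HostMapping> {
    let mut out: Vec<HostMapping> = Vec::new();
    for m in mappings {
        match out.last_mut() {
            // Previous run ends exactly where this mapping starts: extend it.
            Some(prev) if prev.host_offset + prev.len == m.host_offset => prev.len += m.len,
            _ => out.push(*m),
        }
    }
    out
}
```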

This is the initial async implementation. Not all possible optimizations are in place yet and I am sure this will be a returning point for multiple rounds of improvements. The performance data in the description was measured with the default 64K cluster size and standard fio workloads. The gains reflect the common case where the majority of reads hit the fast path.

As a side note - a QCOW2 performance tuning guide would also be a good topic to address in the future. I'll make a note to do some writing when time permits.

Thanks

@DemiMarie
Contributor

Would it make sense to use something more scalable than RwLock?

@weltling
Member Author

Would it make sense to use something more scalable than RwLock?

Good question. RwLock is fundamental to the current QCOW metadata design. std::sync::RwLock is the conservative choice here, providing poisoning tracking for free across several complex multistep mutation paths in metadata.

parking_lot::RwLock would have lower overhead but drops poisoning and adds an external dependency. Could be revisited if contention shows up at higher queue counts.

If you have a specific alternative in mind, I'd be interested to hear it.

Thanks

@DemiMarie
Contributor

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

@phip1611
Member

phip1611 commented Mar 24, 2026

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

Please note that we at Cyberus Technology have comprehensive benchmarking infrastructure (fio throughput, IOPS) including visualizations in Grafana - we might be able to help investigate such optimizations

@weltling
Member Author

If you have a specific alternative in mind, I'd be interested to hear it.

I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.

The suggestions are ambitious but certainly interesting.

RCU is a natural fit for the read path conceptually, the L2 lookup is a pure shared read. The difficulty is on the write side. Today the metadata mutations are already multistep (allocate cluster, update L2, adjust refcounts, evict cache, grow L1), and some are recursive. Beyond that, QCOW2 features like snapshots, dirty bitmaps, and extended L2 entries would add more structures that must stay consistent with each other under writes.

Adapting those to RCU grace periods or hazard pointer retire semantics means the "copy" side needs to capture a consistent view across multiple interdependent structures, not just swap one pointer. That is substantial additional complexity. That's also why I deliberately chose the simplicity of one coarse lock over multiple fine grained locks/mutexes. As in: which object should be protected when a single operation touches 5 structures? :)

On the 2M IOPS ceiling, the lock protects only the metadata lookup (two array indexes + a cache map get), not the data I/O itself. The actual read goes through io_uring after the lock is released. Per op hold time is very short. Whether the lock becomes the bottleneck at higher queue counts is worth profiling before committing to a more complex scheme.

@phip1611 that benchmarking infrastructure sounds useful for validating whether metadata locking is the limiter at scale. Happy to collaborate on that.

Worth investigating as a separate effort once the async refactoring is complete and there is a stable baseline to measure against.

Thanks

@weltling weltling force-pushed the qcow-async branch 2 times, most recently from 5034690 to a8cac2c on March 28, 2026 at 12:33
@phip1611
Member

I was thinking of a read-optimized design, like RCU,

I think this is out of scope here. Is there a follow-up ticket?

Member

@phip1611 phip1611 left a comment

Code-wise, looking good. Thanks for your work on that! It is hard for me to verify the io_uring/async specific parts. Could you add a unit test or an integration test? This would give me more confidence for an approval.

Thanks for your great work!


match result {
Ok(actions) => {
for action in actions {
Member

Could the for loop, or at least its body, be moved to a helper function? The long code block doesn't read that nicely.

Member Author

Yep, moved the deallocation action body into the apply_dealloc_action() helper, which made the punch_hole() shorter.

Thanks

@phip1611 phip1611 mentioned this pull request Mar 30, 2026
@weltling
Member Author

I was thinking of a read-optimized design, like RCU,

I think this out of scope here. Is there a follow-up ticket?

Yep, it's out of scope here, as it would be another broad refactoring on top of this one once complete. On my side there's also no plan to change the locking model; the focus is rather on the features made possible by the separated metadata I/O and on further optimizations on top of the initial implementation.

Thanks

@weltling
Member Author

Code-wise, looking good. Thanks for your work on that! It is hard for me to verify the io_uring/async-specific parts. Could you add a unit test or an integration test? This would give me more confidence for an approval.

Thanks for your great work!

Thanks for the thorough reviews!

I've added async-implementation-specific unit tests covering the most important situations: read/write roundtrip, cross-cluster-boundary reads, mixed batch requests, etc. More unit tests can be added as the work progresses, and more will certainly come in the form of micro-benchmarks.

For the integration tests: the existing integration tests already cover QcowAsync, since io_uring is available in that environment. That's the same pattern used for RAW - if io_uring is available, it's preferred. That's also why I haven't added more integration tests. Otherwise, there are some ideas; I'm going to add more during the further optimizations (async writes, decompression rework, and much more).

Thanks!

@russell-islam
Contributor

This PR adds partial async support; shouldn't the PR description/subject mention that?

@weltling
Member Author

This PR adds partial async support; shouldn't the PR description/subject mention that?

I've added an "Async scope" section to the description that breaks down which paths are async and which remain synchronous, with the rationale for each.

That said, "partial" is nuanced here. The backend implements the full AsyncIo trait and every operation goes through the async interface. What varies is the internal path - not all I/O can go through io_uring, so it will never be 100% io_uring. What can move to io_uring in follow-up cycles is planned through the RFC.

Thanks

weltling added 22 commits April 7, 2026 21:21
QCOW2 images support both sparse operations and the zero flag.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Delegates to QcowMetadata::resize. Rejects resize when a backing
file is present, same as the sync backend.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Marker supertrait combining all composable capability traits.
QcowDiskAsync now satisfies the full DiskFile contract.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Per queue I/O worker that uses io_uring for asynchronous reads
against fully allocated clusters. The struct holds the shared
metadata, data file, optional backing reader, the io_uring
instance and a synthetic completion list.

Feature gated on io_uring in lib.rs.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add the AsyncIo trait impl with notifier and next_completed_request
filled in. The remaining methods are stubbed with unimplemented
and will be filled in by subsequent commits.

next_completed_request drains io_uring completions first, then
falls back to the synthetic completion list.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Single allocated cluster reads are submitted to io_uring for true
async completion. Mixed mapping reads (zero, compressed, backing,
multi cluster) fall back to synchronous pread64 with synthetic
completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Synchronous per cluster write path - gather guest data from iovecs,
map each cluster through QcowMetadata, and pwrite to the allocated
host offset. Partial cluster writes with a backing file read the
backing data first so map_cluster_for_write can perform COW.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Flush dirty metadata caches and sync the underlying file via
QcowMetadata::flush, then signal synthetic completion.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Deallocate clusters through QcowMetadata::deallocate_bytes, then
apply the resulting DeallocAction list (punch hole or write zeroes
at host offsets). write_zeroes delegates to punch_hole since
unallocated QCOW2 clusters inherently read as zero.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
try_clone shares the Arc wrapped metadata and backing file.
new_async_io creates a QcowAsync worker with its own io_uring
instance for the given ring depth.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When io_uring is available and not disabled, open QCOW2 images
with QcowDiskAsync for asynchronous reads. Falls back to
QcowDiskSync otherwise.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement batch_requests_enabled() and submit_batch_requests() for
QcowAsync. Without batching, each read_vectored call performs its own
io_uring submit() syscall. With batching, the virtio queue handler
collects all pending requests and submits them in a single call,
pushing multiple SQEs before one submit() syscall.

Each request in the batch is classified through the metadata layer.
Requests that hit the fast path (single allocated cluster mapping)
are pushed to the io_uring submission queue. Requests that require
the slow path (compressed, backing, zero fill, or mixed mappings)
are completed synchronously and queued as synthetic completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for punch hole completion. The test
verifies that a punch hole request reports successful completion
and that the deallocated range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for write zeroes completion. The test
verifies that a write zeroes request reports successful completion
and that the zeroed range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes a byte pattern through
write_vectored, reads it back through read_vectored, and verifies
the data matches. This exercises the core async write and read
paths end to end.

Also add an async_write helper for future tests.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes distinct patterns into two
adjacent clusters, then issues a single read spanning the cluster
boundary. Verifies that multi mapping read resolution returns the
correct data from both clusters.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that submits a batch of interleaved
write and read requests via submit_batch_requests. Verifies that
all completions arrive with the correct user_data and that the
read back data matches the written data.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that reads from a range that was never
written. Verifies the fundamental QCOW contract that unallocated
clusters return zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes 4K into the middle of a
cluster, then reads the entire cluster back. Verifies that the
written region matches and surrounding bytes remain zero. This
exercises the COW path where unwritten parts of a newly allocated
cluster must be zero filled.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write data, punch hole to deallocate, then rewrite the same
range and verify the new contents read back correctly.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write a distinct byte pattern into each of eight consecutive
clusters in a single operation, then read the full range back
and verify per cluster contents.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The unit test container runs with Docker default seccomp
profile which blocks io_uring_setup, io_uring_enter and
io_uring_register. This causes all qcow_async unit tests to
fail with EPERM when creating an io_uring instance.

Add --security-opt seccomp=unconfined to the unit test docker
run invocation. The container already has --device access and
cap_net_admin, so this does not materially change the security
posture.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
@weltling
Member Author

weltling commented Apr 7, 2026

I have followed this PR through multiple review rounds. At this point, though, given the substantial qcow2 and io_uring involvement, I do not feel confident enough to give it a safe approval. That said, the code has improved a lot since the initial version.

Thanks for the thorough reviews across all the rounds, the feedback has genuinely improved the code. Completely understand the hesitation given the qcow2 and io_uring depth.

The async backend already shows measurable gains on the read side, particularly for random reads, multi queue workloads, and backing file reads. The existing RAW async backend using the same io_uring patterns has been running in production for a while now. The QCOW2 side follows those established patterns closely, and the unit tests cover the key paths. The integration tests also exercise this backend transparently on any io_uring capable host.

That said, if there are specific areas that feel uncertain, happy to add more targeted tests or documentation to build confidence. Let me know what would be most useful.

Thanks

@DemiMarie DemiMarie left a comment

If possible, please do not merge this PR until #7971 is in. #7971 contains important memory safety fixes to the existing async I/O backends. This code also uses io_uring without proper lifetime management for guest memory and so has the same problems.

If this PR needs to be merged before #7971, please wrap the existing RawFileAsync rather than using io_uring directly. #7971 will turn RawFileAsync into a safe wrapper around io_uring.

Comment on lines +229 to +234
fn read_vectored(
&mut self,
offset: libc::off_t,
iovecs: &[libc::iovec],
user_data: u64,
) -> AsyncIoResult<()> {
Contributor

A safe function containing libc::iovec is almost always a bad idea. libc::iovec contains a raw pointer, which safe code can set to anything.

Member Author

Agreed, this is a concern with the AsyncIo trait definition itself. The read_vectored and write_vectored signatures taking &[libc::iovec] in a safe function predate this PR. Every backend implementing the trait has the same exposure. Fixing it at the trait level is the right approach, which is what #7971 targets.

Thanks

Comment on lines +248 to +259
// SAFETY: fd is valid and iovecs point to valid guest memory.
unsafe {
sq.push(
&opcode::Readv::new(types::Fd(fd), iovecs.as_ptr(), iovecs.len() as u32)
.offset(host_offset)
.build()
.user_data(user_data),
)
.map_err(|_| {
AsyncIoError::ReadVectored(Error::other("Submission queue is full"))
})?;
};
Contributor

This safety justification isn’t sufficient. First, safe Rust code can set the pointers and lengths in the iovecs to anything. Second, there is no guarantee that the guest memory will not be freed before the operation completes. See #7971 and #7955.

#7971 fixes the RawFileAsync and RawFileAsyncAio backends, but not this one because it isn’t merged yet. Please use the APIs introduced in #7971 instead of adding new unsafe uses of io_uring.

As per #7955, I believe this is not strictly theoretical. Right now, triggering this bug requires hot-unplugging a virtio-pmem device, but in the future it could be triggered by any guest with a virtio-GPU device (implemented via the generic vhost-user frontend). @alyssais is going to be sending a PR for this.

Member Author

Yep, the same pattern and the same gap exist in RawFileAsync::read_vectored today.

#7971 fixes that by tracking in-flight operations and holding guest memory until the CQE is reaped, which is the correct approach. I'm happy to adapt to those APIs once #7971 lands; or, if this PR goes first, happy to contribute the adaptation as part of the #7971 integration.

Thanks

@DemiMarie
Contributor

@weltling I don’t think this PR makes the existing situation worse. However, it does make fixing it more difficult. In particular, it’s much better to have a single, safe abstraction around io_uring (and possibly AIO), rather than trying to reimplement them.

If any of the APIs in #7971 aren’t suitable for your needs, please let me know. My goal is that all disk I/O in Cloud Hypervisor should go through them.

@weltling
Member Author

weltling commented Apr 8, 2026

@weltling I don’t think this PR makes the existing situation worse. However, it does make fixing it more difficult. In particular, it’s much better to have a single, safe abstraction around io_uring (and possibly AIO), rather than trying to reimplement them.

If any of the APIs in #7971 aren’t suitable for your needs, please let me know. My goal is that all disk I/O in Cloud Hypervisor should go through them.

@DemiMarie thanks for the detailed review and for linking #7955 and #7971.

Both points are valid. The AsyncIo trait taking &[libc::iovec] in a safe function is a design issue in the trait itself, not introduced by this PR, but that doesn't make it less of a concern. And the guest memory lifetime concern is real - if io_uring holds references to guest memory that gets unmapped, that's unsound regardless of how unlikely the trigger is today.

I've looked at #7971, and the approach of tracking in-flight operations by user_data and holding a guest memory snapshot until the CQE is reaped makes sense. Those APIs will also be helpful when implementing async writes for QCOW2. I'm going to take a more detailed look at those PRs.

This PR currently follows the same pattern as the existing RawFileAsync backend, which #7971 also fixes. The sequencing between the two PRs needs to be coordinated - whichever lands last will need to adapt and rebase on top of the other. Happy to rebase this one on top of #7971 once it lands. If this one happens to go first, the adaptation would be on the #7971 side.

Thanks

@weltling
Member Author

weltling commented Apr 8, 2026

@DemiMarie to note here also: wrapping RawFileAsync isn't directly applicable here, because QCOW2 reads involve address translation before the io_uring read. QcowAsync does its own sq.push() at the resolved host offset after the metadata lookup; it's not a pass-through to a raw file. That said, if #7971 exposes safe primitives for io_uring submission with guest memory lifetime tracking, QcowAsync can adopt those instead of calling io_uring::IoUring directly.

Thanks

@DemiMarie
Contributor

@weltling What if BatchRequestInner allowed setting its offset and decreasing its size? That would allow RawFileAsync to be used.
