block: Add async QCOW2 backend with io_uring #7882
weltling wants to merge 32 commits into cloud-hypervisor:main from
Conversation
Thanks for working on this! Code-wise, mostly looking good. My io_uring knowledge is very limited, so I've consulted Claude Code with the Sonnet 4.6 model for review. Please take the following remarks with a grain of salt - I can't assess whether they are valid or not.
Show Claude Code (Sonnet 4.6) Remarks
io_uring-related issues
Confirmed bugs
- next_completed_request missing cq.sync(): without it the kernel-side CQ head is not refreshed before reading, so completions can be silently missed on rings not using IORING_SETUP_SQPOLL
- Fast-path short reads: io_uring returns the raw readv result (which may be less than total_len), while the slow path always reports total_len as the result; upper layers see inconsistent completion values
- Partial submit on error: if submit_batch_requests errors mid-batch (e.g. a metadata failure on an Out request), SQEs already pushed to the SQ but not yet sync()'d are abandoned; their user_data completions never arrive and the virtio queue stalls waiting for them
- Asymmetric eventfd signaling: the slow path explicitly calls eventfd.write(1) to signal completions; the fast path relies on the kernel writing to the eventfd via register_eventfd. This is correct but undocumented, making the two paths harder to reason about together
- split() lifetime coupling: io_uring.split() is called in both read_vectored and submit_batch_requests; holding two split guards simultaneously would cause a borrow conflict. Currently safe because they are in separate call frames, but fragile if the call structure changes
- submit_batch_requests misattributes submit errors: the submitter.submit() error is wrapped in AsyncIoError::ReadVectored even when the batch contained only write requests; there is no appropriate variant for a generic submission failure, so errors are misattributed
block/src/qcow_async.rs
Outdated
fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
I think this could be written as
// Drain io_uring completions first, then synthetic ones.
self.io_uring
    .completion()
    .next()
    .map(|entry| (entry.user_data(), entry.result()))
    .or_else(|| self.completion_list.pop_front())

General question: What are synthetic ones in this context?
Indeed, I rewrote this. Also just saw raw_async.rs follows the same pattern (except it doesn't need the synthetic queue), so it's more consistent.
Synthetic means completions from operations that were handled synchronously, like writes, multi-mapping reads, compressed reads, backing file reads, or any other. They just simulate a completion but have not actually been processed through io_uring. Some will become async in the follow-up patches, though; writes are planned for cycle 3 as a separate effort due to complexity.
Thanks
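The drain order discussed above (real io_uring completions first, then synthetic ones) can be sketched as a self-contained model. The struct and field names here are illustrative stand-ins, not the PR's actual types; in the real code the first queue is the io_uring CQ rather than a VecDeque:

```rust
use std::collections::VecDeque;

// Simplified worker model: `ring_completions` stands in for the io_uring
// CQ; `synthetic` holds completions from synchronously handled requests
// (writes, multi-mapping reads, compressed or backing file reads).
struct Worker {
    ring_completions: VecDeque<(u64, i32)>,
    synthetic: VecDeque<(u64, i32)>,
}

impl Worker {
    // Drain real completions first, then fall back to synthetic ones,
    // mirroring the or_else pattern suggested in the review.
    fn next_completed_request(&mut self) -> Option<(u64, i32)> {
        self.ring_completions
            .pop_front()
            .or_else(|| self.synthetic.pop_front())
    }
}
```

Both queues feed the same caller, which is what lets the virtio handler treat async and sync results uniformly.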
fn write_zeroes(&mut self, offset: u64, length: u64, user_data: u64) -> AsyncIoResult<()> {
-    unimplemented!()
+    self.punch_hole(offset, length, user_data)
Maybe add a comment here?
// unallocated QCOW2 clusters inherently read as zero
Added a comment here with an explanation, matching what qcow_sync.rs has, too. While unallocated clusters reading as zero is a spec guarantee, here it is primarily about deallocate_bytes doing the zeroing.
Thanks
{
    info!("Using asynchronous QCOW2 disk file (io_uring)");

#[cfg(not(feature = "io_uring"))]
this looks a little weird - any idea how we could rewrite it in a way that reads more naturally?
This follows the existing pattern used by VHD and RAW io_uring backends in the same function.
Restructuring would affect all the involved image types. Maybe an option could be moving this same #[cfg] just once to the top and having a global if branch, so the compiler would still see the dead branch when the feature is off.
Thanks
block/src/qcow_async.rs
Outdated
for req in batch_request {
    match req.request_type {
        RequestType::In => {
            let address = req.offset as u64;
The inner code of these match arms is kinda huge. Combined with the embedded continue, it is difficult to read. Do you see potential to move some parts into helper functions?
Now that you mention it - yep, it became big. I went for extracting some reusable common parts and splitting into helpers. It's still a fair amount of code there, but hopefully it's now easier to handle. I'm also using #[inline] to curb the context switch effects.
Thanks
block/src/qcow_async.rs
Outdated
    host_offset,
    length,
} => {
    let _ = self.data_file.file_mut().punch_hole(host_offset, length);
We're silently ignoring a potential error here. Might be worth logging an error?
block/src/qcow_async.rs
Outdated
    host_offset,
    length,
} => {
    let _ = self
We're silently ignoring a potential error here. Might be worth logging an error?
These let _ = are intentional, same as the other case, as the underlying filesystem may not support FALLOC_FL_PUNCH_HOLE or FALLOC_FL_ZERO_RANGE. The error is not fatal: the QCOW2 metadata has already been updated at this point, so future reads will return zeros regardless. The data file operations are purely an optimization to reclaim physical space if sparse=on.
The same pattern is used in the sync part. Also please consider the comment in the metadata layer itself.
Wrt logging - in cases where it's not supported, every DISCARD or WRITE_ZEROES would emit a warning, which would probably be quite noisy.
Thanks
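If logging is ever wanted without the noise concern raised above, a warn-once guard would be one middle ground. This is a hedged sketch of that alternative, not part of the PR:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Process-wide flag: the first punch-hole failure flips it, so an
// unsupported filesystem produces one warning rather than one per
// DISCARD/WRITE_ZEROES request.
static PUNCH_HOLE_WARNED: AtomicBool = AtomicBool::new(false);

// Returns true only on the first call, i.e. when the warning should
// actually be emitted by the caller.
fn should_warn_punch_hole() -> bool {
    !PUNCH_HOLE_WARNED.swap(true, Ordering::Relaxed)
}
```

The caller would pair this with its existing `let _ =` sites, keeping the operation non-fatal while preserving one diagnostic breadcrumb.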
}

if needs_submit {
    sq.sync();
sq.sync() is called after the loop only when needs_submit is true. If SQEs were pushed but the loop also hit an error mid-batch (e.g. a metadata error on an Out request returning ?), the SQEs already pushed to sq but not yet sync'd will be abandoned.
Is this a problem? I know very little about io_uring
A very good question. I went checking the io-uring code again. SubmissionQueue::Drop does store the tail, so pushed SQEs become visible in the ring. However, without a submit() call, they won't be processed until the next submission. To my understanding, at that point they'd produce CQEs for already failed requests.
That said, this is a pre-existing pattern: raw_async.rs has the same issue if sq.push() fails mid-batch. This edge case is shared.
A proper fix would need to either roll back the tail on error or submit and drain the orphaned SQEs before returning. I'll look into this as a followup.
Thanks
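One possible shape for that followup, sketched as pure bookkeeping (the names and the strategy are assumptions, not the PR's code): remember the user_data of SQEs pushed in the failing batch and mark them stale, so their eventual CQEs can be discarded instead of leaving the virtio queue waiting on completions that will never be forwarded:

```rust
use std::collections::HashSet;

// Bookkeeping-only model: pushed SQEs are tracked by user_data; if the
// batch fails mid-way, everything pushed but not yet completed is marked
// stale so its eventual CQE can be dropped by the completion drain.
struct BatchGuard {
    pushed: Vec<u64>,
    stale: HashSet<u64>,
}

impl BatchGuard {
    fn new() -> Self {
        Self { pushed: Vec::new(), stale: HashSet::new() }
    }

    fn on_push(&mut self, user_data: u64) {
        self.pushed.push(user_data);
    }

    // Called when submit_batch_requests errors mid-batch.
    fn on_error(&mut self) {
        let orphaned: Vec<u64> = self.pushed.drain(..).collect();
        self.stale.extend(orphaned);
    }

    // The completion drain can skip stale CQEs instead of forwarding them.
    fn is_stale(&self, user_data: u64) -> bool {
        self.stale.contains(&user_data)
    }
}
```

The alternative mentioned in the reply, rolling back the SQ tail before it is published, avoids the bookkeeping entirely but depends on the io-uring crate exposing that operation.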
block/src/qcow_async.rs
Outdated
    }
    sync_completions.push((req.user_data, total_len as i32));
}
_ => {}
Are we sure that we do not need sq.sync() here? What if the request is Flush or Discard - doesn't the backend need to perform actions for that?
Good catch! Today anything other than In and Out is filtered out, see block.rs. But it should fail hard here. I've added an unreachable! call to uncover that, just in case.
Thanks
block/src/qcow_async.rs
Outdated
fn next_completed_request(&mut self) -> Option<(u64, i32)> {
    // Drain io_uring completions first, then synthetic ones.
    if let Some(entry) = self.io_uring.completion().next() {
When the ring was constructed without IORING_SETUP_SQPOLL, it might be necessary to call cq.sync() beforehand 🤔
But I could also be wrong - not an expert with io_uring
I've double checked in the io-uring crate code, and it shows that completion() creates a fresh CompletionQueue on each call. For that, it also loads the tail pointer from the kernel, meaning it operates on the same data anyway. The existing RAW io_uring backend raw_async.rs uses the same pattern.
Thanks
The majority of the disk I/O for small block sizes is more than one cluster mapping. This PR actually does almost 90% of the reads synchronously. What benefit can we get from this partially implemented async I/O?
Thanks for the review and for being transparent about using Claude Code! I use AI tools too; they're genuinely helpful for spotting patterns and sanity checking assumptions, especially in unfamiliar territory like
Addressed inline, false positive.
A QCOW2 file is always a regular file. The fast path operates on clusters fully allocated and contiguous on disk. There is the same logic in
Addressed inline. It's valid and it's a pre-existing pattern.
It's actually the same eventfd. It's registered with io_uring for kernel driven signaling, and manually written to for synthetic completions. That's what unifies both paths under a single notification mechanism. The epoll loop doesn't need to know where the completion came from. See
It's unclear why Claude says it's fragile. If
Valid. I just fixed this, using the more appropriate

Overall - a nice demonstration of AI-assisted review catching real issues (the partial submit issue, the misattributed error variant) alongside false positives (the

Thanks!
Thanks for looking at this. Could you share how you arrived at the 90% sync figure? I would like to understand the reasoning, because the math I get points in the opposite direction.

The fast path condition is a single allocated cluster mapping covering the full request. The default QCOW2 cluster size is 64K. With 4K block I/O, a read crosses a cluster boundary only when its starting offset falls in the last 4095 bytes of a cluster, that is

The sync fallback is not about block size. It fires for unallocated clusters, compressed clusters, backing file reads, and writes. Those are image structure properties. On a warmed-up uncompressed QCOW2 without backing, virtually all clusters are allocated, so virtually all reads go through io_uring.

Even with only async reads landed so far, the performance results in the PR description already show measurable gains for uncompressed QCOW2 workloads. Those gains will only grow as the remaining sync paths move to io_uring in follow-up cycles.

Regarding the "partially implemented" aspect - this is Cycle 1 of the refactoring plan tracked in #7560. The follow-up cycles cover async writes, async backing file reads, and async compressed cluster reads. Each of those adds substantial complexity and is scoped separately intentionally, not because it was overlooked.

Thanks
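The boundary-crossing argument can be checked numerically. This is a hedged sketch of the arithmetic using the numbers stated in the reply (64 KiB clusters, 4 KiB reads), not code from the PR:

```rust
// With 64 KiB clusters, a 4 KiB read crosses a cluster boundary only
// when its starting offset falls in the last 4095 bytes of a cluster.
fn crossing_fraction(cluster_size: u64, request_size: u64) -> f64 {
    // Offsets in [cluster_size - (request_size - 1), cluster_size)
    // cross; everything else fits in one cluster.
    (request_size - 1) as f64 / cluster_size as f64
}

// crossing_fraction(65536, 4096) = 4095 / 65536, roughly 6.25% of
// uniformly distributed starting offsets - far from a 90% sync rate.
```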
I misunderstood your code, thanks for the clarification. How about when the block size is 1MB?
    Ok(())
}

// Writes are synchronous. Async writes require a multi step
If you are going to support async write later, please put a TODO here.
With 1MB I/O and 64K clusters, a single request spans 16 clusters, so it produces multiple mappings and takes the sync fallback. That is correct. A couple of things to consider:
This is the initial async implementation. Not all possible optimizations are in place yet, and I am sure this will be a recurring area over multiple rounds of improvements.

The performance data in the description was measured with the default 64K cluster size and standard fio workloads. The gains reflect the common case where the majority of reads hit the fast path.

As a side note - a QCOW2 performance tuning guide would also be a good topic to address in the future. I'll make a note to do some writing when time permits.

Thanks
Would it make sense to use something more scalable than
Good question.
If you have a specific alternative in mind, I'd be interested to hear it. Thanks
I was thinking of a read-optimized design, like RCU, hazard pointers, or sharded locking. These avoid needing atomic operations in the read path (no metadata change) at the expense of slowing down the write path. I believe that RwLock bottlenecks at less than 2 million IOPS due to cache line ping-pong.
Please note that we at Cyberus Technology have a comprehensive benchmarking infrastructure (fio throughput, IOPS) including visualizations in Grafana - we might be able to help investigating such optimizations.
The suggestions are ambitious but for sure interesting. RCU is a natural fit for the read path conceptually; the L2 lookup is a pure shared read. The difficulty is on the write side. Today the metadata mutations are already multistep (allocate cluster, update L2, adjust refcounts, evict cache, grow L1), and some are recursive. Beyond that, QCOW2 features like snapshots, dirty bitmaps, and extended L2 entries would add more structures that must stay consistent with each other under writes. Adapting those to RCU grace periods or hazard pointer retire semantics means the "copy" side needs to capture a consistent view across multiple interdependent structures, not just swap one pointer. That is substantial additional complexity.

That's also why I chose the clear simplicity of one coarse lock over multiple fine-grained locks/mutexes. As in, which object should be protected, when a single operation touches 5 structures? :)

On the 2M IOPS ceiling, the lock protects only the metadata lookup (two array indexes plus a cache map get), not the data I/O itself. The actual read goes through io_uring after the lock is released. Per op hold time is very short. Whether the lock becomes the bottleneck at higher queue counts is worth profiling before committing to a more complex scheme.

@phip1611 that benchmarking infrastructure sounds useful for validating whether metadata locking is the limiter at scale. Happy to collaborate on that. Worth investigating as a separate effort once the async refactoring is complete and there is a stable baseline to measure against.

Thanks
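A minimal sketch of the coarse-lock pattern described here, assuming illustrative type names (the real QcowMetadata/QcowState hold L1/L2 tables, refcounts, and a cluster cache): the read guard covers only the short lookup and is dropped before the data I/O would be submitted:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the shared metadata state.
struct Metadata {}

impl Metadata {
    // Returns the host offset for a fully allocated cluster, if any.
    fn map_read(&self, _guest_offset: u64) -> Option<u64> {
        Some(0)
    }
}

// The read lock covers only the metadata lookup; it is released at the
// end of the `let` statement, before any io_uring submission happens.
fn read_fast_path(meta: &Arc<RwLock<Metadata>>, guest_offset: u64) -> Option<u64> {
    let host_offset = meta.read().unwrap().map_read(guest_offset)?;
    // Guard already dropped here; data I/O proceeds lock-free.
    Some(host_offset)
}
```

Because the guard lives only for the lookup expression, contention is limited to the nanosecond-scale critical section rather than the full I/O latency, which is the argument made above for profiling before adopting RCU-style schemes.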
I think this is out of scope here. Is there a follow-up ticket?
phip1611
left a comment
Code-wise, looking good. Thanks for your work on that! It is hard for me to verify the io_uring/async specific parts. Could you add a unit test or an integration test? This would give me more confidence for an approval.
thanks for your great work!
block/src/qcow_async.rs
Outdated
match result {
    Ok(actions) => {
        for action in actions {
Could the for loop, or at least its body, be moved to a helper function? The long code block doesn't read that nicely.
Yep, moved the deallocation action body into the apply_dealloc_action() helper, which made the punch_hole() shorter.
Thanks
Yep, it's out of scope here, as it's another broad refactoring on top of this one once complete. On my side, there's also no plan targeting a change in the locking model; the focus is rather on the features made possible by the separated metadata I/O and on further optimizations upon the initial implementation. Thanks
Thanks for the thorough reviews! I've added async implementation specific unit tests, covering the most important situations like read/write roundtrip, cross cluster boundary reads, mixed batch requests, etc. Probably more unit tests can be written as it goes, and certainly more to come in the form of micro benchmarks.

For the integration tests - the existing integration tests already cover

Thanks!
This PR adds partial async support; shouldn't the PR description/subject mention that?
I've added an "Async scope" section to the description that breaks down which paths are async and which remain synchronous, with the rationale for each. That said, "partial" is nuanced here. The backend implements the full

Thanks
QCOW2 images support both sparse operations and the zero flag. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Delegates to QcowMetadata::resize. Rejects resize when a backing file is present, same as the sync backend. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Marker supertrait combining all composable capability traits. QcowDiskAsync now satisfies the full DiskFile contract. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Per queue I/O worker that uses io_uring for asynchronous reads against fully allocated clusters. The struct holds the shared metadata, data file, optional backing reader, the io_uring instance and a synthetic completion list. Feature gated on io_uring in lib.rs. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add the AsyncIo trait impl with notifier and next_completed_request filled in. The remaining methods are stubbed with unimplemented and will be filled in by subsequent commits. next_completed_request drains io_uring completions first, then falls back to the synthetic completion list. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Single allocated cluster reads are submitted to io_uring for true async completion. Mixed mapping reads (zero, compressed, backing, multi cluster) fall back to synchronous pread64 with synthetic completions. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Synchronous per cluster write path - gather guest data from iovecs, map each cluster through QcowMetadata, and pwrite to the allocated host offset. Partial cluster writes with a backing file read the backing data first so map_cluster_for_write can perform COW. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Flush dirty metadata caches and sync the underlying file via QcowMetadata::flush, then signal synthetic completion. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Deallocate clusters through QcowMetadata::deallocate_bytes, then apply the resulting DeallocAction list (punch hole or write zeroes at host offsets). write_zeroes delegates to punch_hole since unallocated QCOW2 clusters inherently read as zero. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
try_clone shares the Arc wrapped metadata and backing file. new_async_io creates a QcowAsync worker with its own io_uring instance for the given ring depth. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When io_uring is available and not disabled, open QCOW2 images with QcowDiskAsync for asynchronous reads. Falls back to QcowDiskSync otherwise. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement batch_requests_enabled() and submit_batch_requests() for QcowAsync. Without batching, each read_vectored call performs its own io_uring submit() syscall. With batching, the virtio queue handler collects all pending requests and submits them in a single call, pushing multiple SQEs before one submit() syscall. Each request in the batch is classified through the metadata layer. Requests that hit the fast path (single allocated cluster mapping) are pushed to the io_uring submission queue. Requests that require the slow path (compressed, backing, zero fill, or mixed mappings) are completed synchronously and queued as synthetic completions. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for punch hole completion. The test verifies that a punch hole request reports successful completion and that the deallocated range reads back as zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for write zeroes completion. The test verifies that a write zeroes request reports successful completion and that the zeroed range reads back as zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes a byte pattern through write_vectored, reads it back through read_vectored, and verifies the data matches. This exercises the core async write and read paths end to end. Also add an async_write helper for future tests. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes distinct patterns into two adjacent clusters, then issues a single read spanning the cluster boundary. Verifies that multi mapping read resolution returns the correct data from both clusters. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that submits a batch of interleaved write and read requests via submit_batch_requests. Verifies that all completions arrive with the correct user_data and that the read back data matches the written data. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that reads from a range that was never written. Verifies the fundamental QCOW contract that unallocated clusters return zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes 4K into the middle of a cluster, then reads the entire cluster back. Verifies that the written region matches and surrounding bytes remain zero. This exercises the COW path where unwritten parts of a newly allocated cluster must be zero filled. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write data, punch hole to deallocate, then rewrite the same range and verify the new contents read back correctly. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write a distinct byte pattern into each of eight consecutive clusters in a single operation, then read the full range back and verify per cluster contents. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The unit test container runs with Docker default seccomp profile which blocks io_uring_setup, io_uring_enter and io_uring_register. This causes all qcow_async unit tests to fail with EPERM when creating an io_uring instance. Add --security-opt seccomp=unconfined to the unit test docker run invocation. The container already has --device access and cap_net_admin, so this does not materially change the security posture. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Thanks for the thorough reviews across all the rounds; the feedback has genuinely improved the code. Completely understand the hesitation given the qcow2 and io_uring depth.

The async backend already shows measurable gains on the read side, particularly for random reads, multi queue workloads, and backing file reads. The existing RAW async backend using the same io_uring patterns has been running in production for a while now. The QCOW2 side follows those established patterns closely, and the unit tests cover the key paths. The integration tests also exercise this backend transparently on any io_uring capable host.

That said, if there are specific areas that feel uncertain, happy to add more targeted tests or documentation to build confidence. Let me know what would be most useful.

Thanks
DemiMarie
left a comment
If possible, please do not merge this PR until #7971 is in. #7971 contains important memory safety fixes to the existing async I/O backends. This code also uses io_uring without proper lifetime management for guest memory and so has the same problems.
If this PR needs to be merged before #7971, please wrap the existing RawFileAsync rather than using io_uring directly. #7971 will turn RawFileAsync into a safe wrapper around io_uring.
fn read_vectored(
    &mut self,
    offset: libc::off_t,
    iovecs: &[libc::iovec],
    user_data: u64,
) -> AsyncIoResult<()> {
A safe function containing libc::iovec is almost always a bad idea. libc::iovec contains a raw pointer, which safe code can set to anything.
Agreed, this is a concern with the AsyncIo trait definition itself. The read_vectored and write_vectored signatures taking &[libc::iovec] in a safe function predate this PR. Every backend implementing the trait has the same exposure. Fixing it at the trait level is the right approach, which is what #7971 targets.
Thanks
// SAFETY: fd is valid and iovecs point to valid guest memory.
unsafe {
    sq.push(
        &opcode::Readv::new(types::Fd(fd), iovecs.as_ptr(), iovecs.len() as u32)
            .offset(host_offset)
            .build()
            .user_data(user_data),
    )
    .map_err(|_| {
        AsyncIoError::ReadVectored(Error::other("Submission queue is full"))
    })?;
};
This safety justification isn’t sufficient. First, safe Rust code can set the pointers and lengths in the iovecs to anything. Second, there is no guarantee that the guest memory will not be freed before the operation completes. See #7971 and #7955.
#7971 fixes the RawFileAsync and RawFileAsyncAio backends, but not this one because it isn’t merged yet. Please use the APIs introduced in #7971 instead of adding new unsafe uses of io_uring.
As per #7955, I believe this is not strictly theoretical. Right now, triggering this bug requires hot-unplugging a virtio-pmem device, but in the future it could be triggered by any guest with a virtio-GPU device (implemented via the generic vhost-user frontend). @alyssais is going to be sending a PR for this.
Yep, the same pattern and the same gap exist in RawFileAsync::read_vectored today.
#7971 fixes that by tracking in-flight operations and holding guest memory until the CQE is reaped, which is the correct approach. I'm happy to adapt to use those APIs once #7971 lands, or if this PR goes first, happy to contribute the adaptation as part of the #7971 integration.
Thanks
@weltling I don’t think this PR makes the existing situation worse. However, it does make fixing it more difficult. In particular, it’s much better to have a single, safe abstraction around io_uring (and possibly AIO), rather than trying to reimplement them. If any of the APIs in #7971 aren’t suitable for your needs, please let me know. My goal is that all disk I/O in Cloud Hypervisor should go through them.
@DemiMarie thanks for the detailed review and for linking #7955 and #7971. Both points are valid. The

I've looked at #7971, and the approach of tracking in-flight operations by

This PR currently follows the same pattern as the existing

Thanks
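The in-flight tracking idea attributed to #7971 can be sketched roughly like this. The types and API here are assumptions for illustration, not #7971's actual interface: each submitted operation keeps a strong reference to its guest memory, released only when the matching completion is reaped:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for pinned guest memory; a real implementation would hold a
// guest memory mapping guard rather than an owned Vec.
struct GuestBuffer(Vec<u8>);

struct InFlight {
    ops: HashMap<u64, Arc<GuestBuffer>>,
}

impl InFlight {
    fn new() -> Self {
        Self { ops: HashMap::new() }
    }

    // Record the buffer when the SQE is pushed; the strong reference
    // keeps the memory alive even if the guest unplugs the device
    // while the kernel still owns the iovecs.
    fn submit(&mut self, user_data: u64, buf: Arc<GuestBuffer>) {
        self.ops.insert(user_data, buf);
    }

    // Release the buffer only once the matching CQE has been reaped.
    fn complete(&mut self, user_data: u64) -> Option<Arc<GuestBuffer>> {
        self.ops.remove(&user_data)
    }
}
```

This is exactly the lifetime gap the review flags: without such a map, nothing prevents the guest memory behind a submitted iovec from being freed before the kernel finishes the operation.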
@DemiMarie to note here also - wrapping

Thanks
@weltling What if
Adds QcowDiskAsync, an async QCOW2 backend using io_uring for data I/O. Automatically selected when io_uring is available; falls back to QcowDiskSync otherwise.

Design
Each virtio queue gets its own QcowAsync instance with an independent file descriptor (via dup) and io_uring ring, enabling parallel data I/O across queues. Metadata (QcowMetadata) is shared via Arc<RwLock<QcowState>>, allowing concurrent read access. Pending requests are batched into a single io_uring::enter() call to amortize syscall overhead.

The implementation uses DiskBackend::Next, implementing AsyncFullDiskFile rather than the legacy async_io::DiskFile trait.

Changes
Refactoring: Extract shared code from qcow_sync.rs into reusable modules:
- qcow_common.rs: positional I/O helpers (pread64/pwrite64) and iovec scatter/gather
- qcow/backing.rs: backing file construction (RawBacking, Qcow2MetadataBacking, shared_backing_from)

Async backend:
- QcowDiskAsync: DiskFile + AsyncDiskFile (size, geometry, sparse, resize, per queue fd creation)
- QcowAsync: AsyncIo (read, write, fsync, punch_hole, write_zeroes via io_uring)

Batching: Group pending I/O submissions before io_uring::enter()

Integration: Route QCOW2 to QcowDiskAsync in device_manager.rs when io_uring is supported

Async scope
This implementation makes single cluster allocated reads fully async via io_uring. The remaining paths use synchronous I/O with synthetic completions through the same eventfd mechanism, so callers see a uniform async interface regardless:

The planned progression is described in the block refactoring RFC linked from #7694. Cycle 3 covers async writes with COW, compression handling, and batched multi mapping submissions.
Performance (async vs main, MiB/s)
Uncompressed QCOW2:
Overlay with backing files:
Overlay random reads (backing files):
These are low absolute values (30 to 114 MiB/s) in the same regime as the uncompressed random_read numbers. The random_read regression exists on main as well and is unrelated to the async backend.
Compressed (zlib/zstd): Single queue random reads at parity. Single queue sequential reads 9 to 16% slower. Multi queue 11 to 24% slower. Compressed reads are CPU bound with decompression dominating, and the io_uring submission path adds overhead without benefit. Moving decompression out of the metadata lock is planned as follow up.
Full benchmark logs attached (30 tests each, 5 iterations per test).
perf_main.log
perf_qcow_async.log