Fix block device resize disk #7948

Open
vincent-thomas wants to merge 2 commits into cloud-hypervisor:main from vincent-thomas:fix-block-device-resize-disk

@vincent-thomas vincent-thomas commented Apr 1, 2026

As it currently stands, CH doesn't support resizing the disks of VMs whose storage is backed by a host block device. This PR adds that support and prevents (but doesn't fix) a deadlock of the CH process, which in turn freezes the VM's vCPUs, when doing so.

Previous discussion and further details: #7923

Fixes #7923

@vincent-thomas vincent-thomas requested a review from a team as a code owner April 1, 2026 19:08
Block devices (LVM volumes, loop devices, etc.) cannot be resized via
ftruncate - they are resized externally. When vm.resize-disk is called
for a block device backend, skip the ftruncate call and verify the
device is accessible via BLKGETSIZE64 ioctl.

Re-query the actual size after resize to ensure the guest receives the
correct capacity for externally resized block devices.

Signed-off-by: Vincent Thomas <vincent@v-thomas.com>
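
For readers following along, the flow described in this commit message could look roughly like the sketch below. It is not the code from this PR: the function names are hypothetical, and to stay within the Rust standard library it reads the size by seeking to the end of the file (which reports the device size on Linux) instead of issuing the BLKGETSIZE64 ioctl mentioned above.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use std::os::unix::fs::FileTypeExt;

/// Returns true when `file` refers to a block device (Unix only).
/// Hypothetical helper, not a cloud-hypervisor function.
pub fn is_block_device(file: &File) -> io::Result<bool> {
    Ok(file.metadata()?.file_type().is_block_device())
}

/// Reads the current size in bytes by seeking to the end, then restores
/// the previous file position. Assumption: this stands in for the
/// BLKGETSIZE64 ioctl so the sketch needs no external crates.
pub fn current_size(file: &mut File) -> io::Result<u64> {
    let previous = file.stream_position()?;
    let size = file.seek(SeekFrom::End(0))?;
    file.seek(SeekFrom::Start(previous))?;
    Ok(size)
}

/// Regular files are resized with `set_len` (ftruncate underneath).
/// Block devices are resized externally, so the truncate step is skipped.
/// In both cases the actual size is re-queried and returned, so the caller
/// can report the real capacity to the guest.
pub fn resize_backing(file: &mut File, new_size: u64) -> io::Result<u64> {
    if !is_block_device(file)? {
        file.set_len(new_size)?;
    }
    current_size(file)
}
```

Calling `resize_backing` on a regular image file opened for writing truncates or extends it to `new_size`; on a block device it only reports whatever size the host currently exposes.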
@vincent-thomas vincent-thomas force-pushed the fix-block-device-resize-disk branch from ecbc12e to 51a33af Compare April 1, 2026 19:22

let nsectors = new_size / SECTOR_SIZE;

self.common.pause().map_err(Error::PauseVcpus)?;
Member

don't we need this for proper synchronization during disk-resize?

Member

Also, PauseVcpus is wrong here (a mistake I made). We should correct this and make it PauseVirtioThreads or similar.

Author

@vincent-thomas vincent-thomas Apr 2, 2026

pause and resume actually caused a deadlock of the cloud-hypervisor process and froze the whole VM indefinitely.

Member

this should not happen. Does it only deadlock when you use block devices? For me, using a raw file with the original code works perfectly fine.

Author

It does not deadlock when using a host block device with "fast" I/O (for example losetup).
It does deadlock when using a host block device with "slow", network-bound I/O (for example Ceph; an rbd-mapped block device). Both outcomes are reproducible 100% of the time in manual testing.

The removal of the pause/resume came from experimentation, which turned out to fix this issue and work well.

This comment was marked as outdated.

Member

If the VM was paused, and we can ensure that it stays paused during the resize call, I would suggest that we only then skip pausing/resuming the virtio-queues. @phip1611 @vincent-thomas

Member

@phip1611 you mix up vcpu pausing and virtqueue pausing ;)

Author

@vincent-thomas vincent-thomas Apr 8, 2026

@tpressure @phip1611 Have a look at my recently updated diff; it has been changed to fix the root cause.

Member

> only then skip pausing/resuming the virtio-queues.

I agree - looks like @vincent-thomas did the right thing now. LGTM at first glance

@vincent-thomas vincent-thomas force-pushed the fix-block-device-resize-disk branch from 51a33af to 26e3ab4 Compare April 2, 2026 10:11
vincent-thomas added a commit to vincent-thomas/cloud-hypervisor that referenced this pull request Apr 8, 2026
Previously, running pause when already paused would wait for threads
that were already parked, which caused a deadlock, see:
cloud-hypervisor#7948 (comment).
The pause endpoint now checks whether it is already paused. If so, it returns
success instead of continuing, which prevents the deadlock.

This commit adds correct host block device querying in '/vm.resize-disk'.
Because the content and mutability of a block device are essentially
unknown, CH cannot resize it itself. When a VM's disk is backed
by a host block device, the resize-disk endpoint only succeeds (and sends a config interrupt)
if the host block device size matches the "new_size" given to the endpoint.

Signed-off-by: Vincent Thomas <vincent@v-thomas.com>
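
The two behaviours this commit message describes can be illustrated with the minimal sketch below. It is only an illustration under stated assumptions, not the code in this PR: `PauseState`, `ResizeError`, and `check_block_device_resize` are hypothetical names, and the real pause path parks worker threads rather than incrementing a counter.

```rust
/// Hypothetical stand-in for a device's pause bookkeeping.
#[derive(Debug, Default)]
pub struct PauseState {
    paused: bool,
    /// Counts how often worker threads would actually have been asked to park.
    park_requests: u32,
}

impl PauseState {
    /// Idempotent pause: if already paused, report success without waiting
    /// on threads that are parked already. Returns true only when this call
    /// actually changed the state.
    pub fn pause(&mut self) -> bool {
        if self.paused {
            return false;
        }
        self.paused = true;
        self.park_requests += 1;
        true
    }

    /// Idempotent resume, mirroring `pause`.
    pub fn resume(&mut self) -> bool {
        if !self.paused {
            return false;
        }
        self.paused = false;
        true
    }

    pub fn park_requests(&self) -> u32 {
        self.park_requests
    }
}

/// Hypothetical error for a resize request that does not match the device.
#[derive(Debug, PartialEq, Eq)]
pub enum ResizeError {
    SizeMismatch { requested: u64, actual: u64 },
}

/// For a host block device, the resize request only succeeds when the size
/// the host reports already equals the requested size, because the device
/// has to be resized externally.
pub fn check_block_device_resize(requested: u64, actual: u64) -> Result<u64, ResizeError> {
    if requested == actual {
        Ok(actual)
    } else {
        Err(ResizeError::SizeMismatch { requested, actual })
    }
}
```

With this shape, a second `pause()` on an already paused device is a no-op, and a resize request whose `new_size` differs from the size reported by the host is rejected before any config interrupt would be sent.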
@vincent-thomas vincent-thomas force-pushed the fix-block-device-resize-disk branch from 26e3ab4 to bbc9d98 Compare April 8, 2026 10:37
@vincent-thomas vincent-thomas force-pushed the fix-block-device-resize-disk branch from bbc9d98 to 603b73b Compare April 8, 2026 10:40
Development

Successfully merging this pull request may close these issues.

/vm.resize-disk missbehaves when VMs disk is backed by a host block device.

4 participants