vm-migration: improve downtime observability #7979

phip1611 wants to merge 4 commits into cloud-hypervisor:main
Conversation
Force-pushed from b78bd51 to f8ebc2c
```rust
#[cfg(test)]
mod unit_tests {
    use std::time::{Duration, Instant};
```
The diff doesn't look that nice 🤔. I just moved everything into the submodule memory_migration_ctx_tests.
This helps to better separate the unit tests from the new ones in the following commit.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Extend vm-migration/context to track overall migration metrics beyond just the memory phase. This enables measuring the effective VM downtime (time between the final pause() on the source and resume() on the destination) and lays the groundwork for a future migration statistics endpoint.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
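The idea of tracking both the overall migration and the downtime window can be sketched roughly as below. All names (`OngoingMigrationContext` aside, which the PR does use) and the exact fields/methods are illustrative, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

/// Sketch of a context tracking overall migration metrics.
/// Fields and methods are hypothetical, for illustration only.
struct OngoingMigrationContext {
    /// When the migration as a whole started.
    migration_begin: Instant,
    /// When the source VM was finally paused (start of the downtime window).
    downtime_begin: Option<Instant>,
}

impl OngoingMigrationContext {
    fn new() -> Self {
        Self {
            migration_begin: Instant::now(),
            downtime_begin: None,
        }
    }

    /// Called right before the final pause() on the source.
    fn mark_downtime_begin(&mut self) {
        self.downtime_begin = Some(Instant::now());
    }

    /// Effective downtime so far: time elapsed since the final pause.
    fn downtime(&self) -> Option<Duration> {
        self.downtime_begin.map(|t| t.elapsed())
    }

    /// Total migration duration so far.
    fn migration_duration(&self) -> Duration {
        self.migration_begin.elapsed()
    }
}
```

The downtime window is necessarily a sub-interval of the overall migration, so `downtime() <= migration_duration()` always holds.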
Force-pushed from f8ebc2c to 2f0ff71
Use OngoingMigrationContext to measure and log the effective VM downtime (pause to remote resume) and the cost of each non-trivial step in the downtime window: snapshotting, sending the snapshot, and awaiting completion. This makes it straightforward to identify and reduce downtime as live migration matures.

Example:

```
cloud-hypervisor: 23.076773s: <vmm> DEBUG:vmm/src/lib.rs:1453 -- Migration downtime breakdown: final_memory_delta=55ms snapshot=2ms sending_snapshot=7ms completing=1ms
cloud-hypervisor: 23.076797s: <vmm> INFO:vmm/src/lib.rs:1461 -- Migration complete: actual downtime was 68ms and migration took 0.96s
```

Note: downtime is measured on the source only; cross-host clock skew may cause unreliable results.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
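A minimal sketch of how such a breakdown could be represented and rendered into the log line above. The struct and its methods are hypothetical; the field names merely mirror the log fields from the example output:

```rust
use std::time::Duration;

/// Illustrative breakdown of the steps inside the downtime window.
/// Not the PR's actual type, just a sketch.
struct DowntimeBreakdown {
    final_memory_delta: Duration,
    snapshot: Duration,
    sending_snapshot: Duration,
    completing: Duration,
}

impl DowntimeBreakdown {
    /// Sum of the individually measured steps; the logged total downtime
    /// can be slightly larger because of unmeasured overhead in between.
    fn measured_total(&self) -> Duration {
        self.final_memory_delta + self.snapshot + self.sending_snapshot + self.completing
    }

    /// Render the breakdown in the same shape as the debug log line.
    fn log_line(&self) -> String {
        format!(
            "Migration downtime breakdown: final_memory_delta={}ms snapshot={}ms sending_snapshot={}ms completing={}ms",
            self.final_memory_delta.as_millis(),
            self.snapshot.as_millis(),
            self.sending_snapshot.as_millis(),
            self.completing.as_millis(),
        )
    }
}
```

With the example values (55 + 2 + 7 + 1 = 65 ms measured) the logged actual downtime of 68 ms is a few milliseconds larger, which matches the point that not every bit of overhead in the window is individually measured.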
Instrument the two main downtime-phase operations on the destination side, receiving state and resuming the VM, so their costs are visible in logs and can be iterated on.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
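Instrumenting individual operations like this boils down to timing a closure and keeping the elapsed duration next to the result. A generic helper along these lines is one way to do it (this wrapper is a sketch, not the PR's actual code):

```rust
use std::time::{Duration, Instant};

/// Run an operation and return its result together with how long it took.
/// Hypothetical helper for timing downtime-phase steps.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}
```

Usage would look like `let (state, recv_duration) = timed(|| receive_state());` (with `receive_state` standing in for whatever the destination-side operation actually is), after which `recv_duration` can go straight into a log line.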
Force-pushed from 2f0ff71 to 6e4e25b
```rust
// This is very helpful to minimize the downtime as migration becomes
// production-ready in Cloud Hypervisor.
debug!(
    "Migration downtime breakdown: final_memory_delta={}ms snapshot={}ms sending_snapshot={}ms completing={}ms",
```
PSA: this is missing the overhead from the last iteration. Fetching the dirty log is quite expensive (I have many optimizations in mind/in the pipeline). Partially related to #7816.

I think, however, that this is okay: the overhead is already logged at debug level in do_memory_migration().
```rust
    send_snapshot_duration: Duration,
    completing_duration: Duration,
) -> CompletedMigrationContext {
    let (migration_begin, downtime_begin, finalized_memory_ctx) = match self {
```
Note to self: for the whole type, I forgot the unix socket/local migration path. I'll fix that tomorrow.
```rust
/// Total duration of the migration.
migration_duration: Duration,
/// Duration of creating the final VM snapshot.
snapshot_duration: Duration,
```
Note to self: use a dedicated struct DowntimeContext. This better groups the functionality and will also improve the code if we add more observability in the future.
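The suggested refactoring could look roughly like this: pull the downtime-window durations out of CompletedMigrationContext into their own type. DowntimeContext and its fields are only sketched here, following the note above rather than actual PR code:

```rust
use std::time::Duration;

/// Sketch of the suggested dedicated type grouping everything that
/// happens inside the downtime window. Field names are hypothetical.
struct DowntimeContext {
    snapshot_duration: Duration,
    send_snapshot_duration: Duration,
    completing_duration: Duration,
}

impl DowntimeContext {
    /// Sum of the measured downtime-window steps.
    fn total(&self) -> Duration {
        self.snapshot_duration + self.send_snapshot_duration + self.completing_duration
    }
}

/// CompletedMigrationContext then only keeps the overall duration plus
/// the grouped downtime data, which scales better if more metrics are
/// added later.
struct CompletedMigrationContext {
    /// Total duration of the migration.
    migration_duration: Duration,
    /// Everything measured inside the downtime window.
    downtime: DowntimeContext,
}
```

Grouping the durations this way keeps the completed-context struct flat and makes adding future observability fields a change local to DowntimeContext.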
This PR touches a few areas described in #7111:
Please review this commit by commit.
New Log Messages
I performed a local TCP migration of a small VM with a workload.
New log messages on the sender:

New log messages on the destination:
Learnings
For very small downtimes, we might have to optimize the snapshot path. Future work.