
vm-migration: improve downtime observability#7979

Open
phip1611 wants to merge 4 commits into cloud-hypervisor:main from phip1611:upstream-migration-effective-downtime

Conversation


@phip1611 phip1611 commented Apr 8, 2026

This PR touches a few areas described in #7111:

  • it further prepares live migration statistics (queryable via a dedicated API endpoint)
    • here, we just gather more data that we might eventually export
  • log the actual downtime of the VM (on the source side)
  • log the expensive tasks during the downtime (breaking the downtime down into its components)
    • this helps with analyzing the downtime in further refactorings and developments
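The kind of statistics gathering described above could be sketched roughly as follows. Note that `MigrationStats` and its fields are illustrative names, not the actual cloud-hypervisor types:

```rust
use std::time::Duration;

/// Hypothetical per-migration statistics, gathered on the source side and
/// potentially exported via a future API endpoint.
#[derive(Debug, Default)]
pub struct MigrationStats {
    /// Number of memory-dirty-page iterations performed.
    pub iterations: u32,
    /// Total bytes transferred across all iterations.
    pub bytes_sent: u64,
    /// Effective VM downtime, if the migration completed.
    pub downtime: Option<Duration>,
}

impl MigrationStats {
    /// Record one completed memory-copy iteration.
    pub fn record_iteration(&mut self, bytes: u64) {
        self.iterations += 1;
        self.bytes_sent += bytes;
    }
}
```

This just accumulates counters; the actual PR threads timing state through the existing migration context instead of a standalone struct.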

Please review this commit-by-commit.

New Log Messages

I performed a local TCP migration of a small VM with a workload.

New log messages on sender:

cloud-hypervisor:  23.076773s: <vmm> DEBUG:vmm/src/lib.rs:1453 -- Migration downtime breakdown: final_memory_delta=55ms snapshot=2ms sending_snapshot=7ms completing=1ms
cloud-hypervisor:  23.076797s: <vmm> INFO:vmm/src/lib.rs:1461 -- Migration complete: actual downtime was 68ms and migration took 0.96s

New log messages on destination:

cloud-hypervisor:  24.694016s: <vmm> DEBUG:vmm/src/lib.rs:945 -- Migration (incoming): receiving state took 9ms
cloud-hypervisor:  24.695341s: <vmm> INFO:vmm/src/lib.rs:959 -- Migration (incoming): resuming took 1ms

Learnings

For very small downtimes, we might have to optimize the snapshot path. Future work.

@phip1611 phip1611 requested a review from a team as a code owner April 8, 2026 15:04
@phip1611 phip1611 self-assigned this Apr 8, 2026
@phip1611 phip1611 force-pushed the upstream-migration-effective-downtime branch from b78bd51 to f8ebc2c Compare April 8, 2026 15:11
@phip1611 phip1611 requested a review from likebreath April 8, 2026 15:13

#[cfg(test)]
mod unit_tests {
use std::time::{Duration, Instant};
Member Author
The diff doesn't look that nice 🤔 I just moved everything into the sub-module `memory_migration_ctx_tests`.

phip1611 added 2 commits April 8, 2026 17:16
This helps to better separate the unit tests from the new ones in the
following commit.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Extend vm-migration/context to track overall migration metrics beyond
just the memory phase. This enables measuring the effective VM downtime
(time between the final pause() on the source and resume() on the
destination) and lays the groundwork for a future migration statistics
endpoint.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
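The idea of an ongoing context that finalizes into a completed one, capturing both the total migration duration and the downtime window (final pause to resume), might look roughly like this. Names such as `OngoingCtx` and `CompletedCtx` are illustrative, not the real types from the PR:

```rust
use std::time::{Duration, Instant};

/// Illustrative ongoing-migration context: created when migration starts,
/// with the downtime window opened when the VM is finally paused.
pub struct OngoingCtx {
    migration_begin: Instant,
    downtime_begin: Option<Instant>,
}

/// Illustrative completed-migration context holding the final metrics.
pub struct CompletedCtx {
    pub migration_duration: Duration,
    pub downtime: Duration,
}

impl OngoingCtx {
    pub fn new() -> Self {
        Self {
            migration_begin: Instant::now(),
            downtime_begin: None,
        }
    }

    /// Call when the VM is paused for the final time on the source.
    pub fn mark_paused(&mut self) {
        self.downtime_begin = Some(Instant::now());
    }

    /// Finalize once the destination has resumed (as observed by the source).
    pub fn complete(self) -> CompletedCtx {
        let now = Instant::now();
        CompletedCtx {
            migration_duration: now - self.migration_begin,
            downtime: now - self.downtime_begin.expect("pause was recorded"),
        }
    }
}
```

As the commit message notes, both timestamps are taken on the source, so no cross-host clock comparison is involved.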
@phip1611 phip1611 force-pushed the upstream-migration-effective-downtime branch from f8ebc2c to 2f0ff71 Compare April 8, 2026 15:17
phip1611 added 2 commits April 8, 2026 17:18
Use OngoingMigrationContext to measure and log the effective VM downtime
(pause to remote resume) and the cost of each non-trivial step in the
downtime window: snapshotting, sending the snapshot, and awaiting
completion. This makes it straightforward to identify and reduce
downtime as live migration matures.

Example:

```
cloud-hypervisor:  23.076773s: <vmm> DEBUG:vmm/src/lib.rs:1453 -- Migration downtime breakdown: final_memory_delta=55ms snapshot=2ms sending_snapshot=7ms completing=1ms
cloud-hypervisor:  23.076797s: <vmm> INFO:vmm/src/lib.rs:1461 -- Migration complete: actual downtime was 68ms and migration took 0.96s
```

Note: downtime is measured on the source only; cross-host clock skew
may cause unreliable results.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Instrument the two main downtime-phase operations on the destination
side - receiving state and resuming the VM - so their costs are visible
in logs and can be iterated on.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
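Instrumenting the two destination-side operations boils down to wrapping each in a timer, a pattern that can be sketched with a small helper (the helper name `timed` is illustrative, not from the PR):

```rust
use std::time::{Duration, Instant};

/// Run an operation and return its result together with the elapsed time,
/// so the cost of phases like "receiving state" or "resuming" can be logged.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

On the destination, the result feeds the two new log lines, e.g. `"Migration (incoming): receiving state took {}ms"`.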
@phip1611 phip1611 force-pushed the upstream-migration-effective-downtime branch from 2f0ff71 to 6e4e25b Compare April 8, 2026 15:18
// This is very helpful to minimize the downtime as migration becomes
// production-ready in Cloud Hypervisor.
debug!(
"Migration downtime breakdown: final_memory_delta={}ms snapshot={}ms sending_snapshot={}ms completing={}ms",
Member Author

@phip1611 phip1611 Apr 8, 2026

PSA: this is missing the overhead from the last iteration. Fetching the dirty log is quite expensive (I have many optimizations in my mind/pipeline). Partially related to #7816.

I think, however, that this is okay: the overhead is already logged at debug level in do_memory_migration().

send_snapshot_duration: Duration,
completing_duration: Duration,
) -> CompletedMigrationContext {
let (migration_begin, downtime_begin, finalized_memory_ctx) = match self {
Member Author
note to self: for the whole type, I forgot the unix socket/local migration path:

https://github.com/cloud-hypervisor/cloud-hypervisor/actions/runs/24143270699/job/70450282236#step:7:7205

I'll fix that tomorrow.

/// Total duration of the migration.
migration_duration: Duration,
/// Duration of creating the final VM snapshot.
snapshot_duration: Duration,
Member Author

@phip1611 phip1611 Apr 8, 2026

Note to self: use a dedicated struct DowntimeContext. This groups the functionality better and will also improve the code if we add more observability in the future.
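Such a grouping could be sketched as below; the field set mirrors the four phases from the breakdown log line, but the struct itself is hypothetical (the note-to-self, not the PR, proposes it):

```rust
use std::time::Duration;

/// Hypothetical DowntimeContext grouping the per-phase durations that make
/// up the downtime window on the source side.
#[derive(Debug)]
pub struct DowntimeContext {
    /// Transferring the final dirty-memory delta after the last pause.
    pub final_memory_delta: Duration,
    /// Creating the final VM snapshot.
    pub snapshot: Duration,
    /// Sending the snapshot to the destination.
    pub sending_snapshot: Duration,
    /// Awaiting completion/acknowledgement from the destination.
    pub completing: Duration,
}

impl DowntimeContext {
    /// Sum of the instrumented phases (excludes any unmeasured overhead).
    pub fn total(&self) -> Duration {
        self.final_memory_delta + self.snapshot + self.sending_snapshot + self.completing
    }
}
```

With the values from the example log line (55 + 2 + 7 + 1 ms) the instrumented phases sum to 65 ms, slightly under the 68 ms actual downtime, consistent with some overhead falling outside the measured phases.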
