vm-migration: improve downtime observability #7979

phip1611 wants to merge 4 commits into cloud-hypervisor:main
Conversation
Force-pushed from b78bd51 to f8ebc2c
```rust
#[cfg(test)]
mod unit_tests {
    use std::time::{Duration, Instant};
```
The diff doesn't look that nice 🤔. I just moved everything into the submodule memory_migration_ctx_tests.
This helps to better separate the unit tests from the new ones in the following commit.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Extend vm-migration/context to track overall migration metrics beyond just the memory phase. This enables measuring the effective VM downtime (time between the final pause() on the source and resume() on the destination) and lays the groundwork for a future migration statistics endpoint.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
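The idea of tracking both the overall migration and the downtime window can be sketched roughly as below. All names (`OngoingMigrationContext` aside, which the PR does use) and the exact fields/methods are illustrative, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

/// Sketch of a context tracking overall migration metrics.
/// Fields and methods are hypothetical, for illustration only.
struct OngoingMigrationContext {
    /// When the migration as a whole started.
    migration_begin: Instant,
    /// When the source VM was finally paused (start of the downtime window).
    downtime_begin: Option<Instant>,
}

impl OngoingMigrationContext {
    fn new() -> Self {
        Self {
            migration_begin: Instant::now(),
            downtime_begin: None,
        }
    }

    /// Called right before the final pause() on the source.
    fn mark_downtime_begin(&mut self) {
        self.downtime_begin = Some(Instant::now());
    }

    /// Effective downtime so far: time elapsed since the final pause.
    fn downtime(&self) -> Option<Duration> {
        self.downtime_begin.map(|t| t.elapsed())
    }

    /// Total migration duration so far.
    fn migration_duration(&self) -> Duration {
        self.migration_begin.elapsed()
    }
}
```

The downtime window is necessarily a sub-interval of the overall migration, so `downtime() <= migration_duration()` always holds.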
Force-pushed from f8ebc2c to 2f0ff71
Use OngoingMigrationContext to measure and log the effective VM downtime (pause to remote resume) and the cost of each non-trivial step in the downtime window: snapshotting, sending the snapshot, and awaiting completion. This makes it straightforward to identify and reduce downtime as live migration matures.

Example:

```
cloud-hypervisor: 23.076773s: <vmm> DEBUG:vmm/src/lib.rs:1453 -- Migration downtime breakdown: final_memory_delta=55ms snapshot=2ms sending_snapshot=7ms completing=1ms
cloud-hypervisor: 23.076797s: <vmm> INFO:vmm/src/lib.rs:1461 -- Migration complete: actual downtime was 68ms and migration took 0.96s
```

Note: downtime is measured on the source only; cross-host clock skew may cause unreliable results.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
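A minimal sketch of how such a breakdown could be represented and rendered into the log line above. The struct and its methods are hypothetical; the field names merely mirror the log fields from the example output:

```rust
use std::time::Duration;

/// Illustrative breakdown of the steps inside the downtime window.
/// Not the PR's actual type, just a sketch.
struct DowntimeBreakdown {
    final_memory_delta: Duration,
    snapshot: Duration,
    sending_snapshot: Duration,
    completing: Duration,
}

impl DowntimeBreakdown {
    /// Sum of the individually measured steps; the logged total downtime
    /// can be slightly larger because of unmeasured overhead in between.
    fn measured_total(&self) -> Duration {
        self.final_memory_delta + self.snapshot + self.sending_snapshot + self.completing
    }

    /// Render the breakdown in the same shape as the debug log line.
    fn log_line(&self) -> String {
        format!(
            "Migration downtime breakdown: final_memory_delta={}ms snapshot={}ms sending_snapshot={}ms completing={}ms",
            self.final_memory_delta.as_millis(),
            self.snapshot.as_millis(),
            self.sending_snapshot.as_millis(),
            self.completing.as_millis(),
        )
    }
}
```

With the example values (55 + 2 + 7 + 1 = 65 ms measured) the logged actual downtime of 68 ms is a few milliseconds larger, which matches the point that not every bit of overhead in the window is individually measured.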
Instrument the two main downtime-phase operations on the destination side, receiving state and resuming the VM, so their costs are visible in logs and can be iterated on.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
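Instrumenting individual operations like this boils down to timing a closure and keeping the elapsed duration next to the result. A generic helper along these lines is one way to do it (this wrapper is a sketch, not the PR's actual code):

```rust
use std::time::{Duration, Instant};

/// Run an operation and return its result together with how long it took.
/// Hypothetical helper for timing downtime-phase steps.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}
```

Usage would look like `let (state, recv_duration) = timed(|| receive_state());` (with `receive_state` standing in for whatever the destination-side operation actually is), after which `recv_duration` can go straight into a log line.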
Force-pushed from 2f0ff71 to 6e4e25b
```rust
// This is very helpful to minimize the downtime as migration becomes
// production-ready in Cloud Hypervisor.
debug!(
    "Migration downtime breakdown: final_memory_delta={}ms snapshot={}ms sending_snapshot={}ms completing={}ms",
```
PSA: this is missing the overhead from the last iteration. Fetching the dirty log is quite expensive (I have many optimizations in mind/in the pipeline). Partially related to #7816.

I think, however, that this is okay: the overhead is already logged at debug level in do_memory_migration().
```rust
    send_snapshot_duration: Duration,
    completing_duration: Duration,
) -> CompletedMigrationContext {
    let (migration_begin, downtime_begin, finalized_memory_ctx) = match self {
```
Note to self: for the whole type, I forgot the unix socket/local migration path. I'll fix that tomorrow.
```rust
/// Total duration of the migration.
migration_duration: Duration,
/// Duration of creating the final VM snapshot.
snapshot_duration: Duration,
```
Note to self: use a dedicated struct DowntimeContext. This better groups the functionality and will also improve the code if we add more observability in the future.
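The suggested refactoring could look roughly like this: pull the downtime-window durations out of CompletedMigrationContext into their own type. DowntimeContext and its fields are only sketched here, following the note above rather than actual PR code:

```rust
use std::time::Duration;

/// Sketch of the suggested dedicated type grouping everything that
/// happens inside the downtime window. Field names are hypothetical.
struct DowntimeContext {
    snapshot_duration: Duration,
    send_snapshot_duration: Duration,
    completing_duration: Duration,
}

impl DowntimeContext {
    /// Sum of the measured downtime-window steps.
    fn total(&self) -> Duration {
        self.snapshot_duration + self.send_snapshot_duration + self.completing_duration
    }
}

/// CompletedMigrationContext then only keeps the overall duration plus
/// the grouped downtime data, which scales better if more metrics are
/// added later.
struct CompletedMigrationContext {
    /// Total duration of the migration.
    migration_duration: Duration,
    /// Everything measured inside the downtime window.
    downtime: DowntimeContext,
}
```

Grouping the durations this way keeps the completed-context struct flat and makes adding future observability fields a change local to DowntimeContext.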
This PR touches a few areas described in #7111:
Please review this commit by commit.
New Log Messages
I performed a local TCP migration of a small VM with a workload.
New log messages on the sender:

New log messages on the destination:
Learnings
For very small downtimes, we might have to optimize the snapshot path. Future work.