KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off) #13377
Open
andrijapanicsb wants to merge 1 commit into
KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off command reported success. Against an already-off chassis the BMC rejects the power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:

- if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
- otherwise issue a best-effort power-off and confirm via OOBM STATUS;
- only a confirmed Off state counts as success; if the state cannot be confirmed (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing an unresponsive host (SOFT remains the graceful ACPI shutdown).

Fixes apache#13376
Codecov Report
❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| Metric | 4.22 | #13377 | +/- |
|---|---|---|---|
| Coverage | 17.67% | 17.68% | |
| Complexity | 15792 | 15798 | +6 |
| Files | 5922 | 5922 | |
| Lines | 533165 | 533184 | +19 |
| Branches | 65208 | 65211 | +3 |
| Hits | 94242 | 94273 | +31 |
| Misses | 428276 | 428264 | -12 |
| Partials | 10647 | 10647 | |
@blueorangutan package KVM

@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM SystemVM template(s). I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18194

@blueorangutan test ol9 kvm-ol9

@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests
Description
When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to `Down` and therefore never restarts its VMs on other hosts; the host stays in `Alert`/`Disconnected` indefinitely.

Root cause: the host-HA state machine declares a host dead (`HAState.Fenced` → investigator `Status.Down`) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps `OFF` to `GracefulShutdown`, which returns HTTP 409 when the system is already off), so `KVMHAProvider.fence()` reports failure and the host stays stuck in the `Fencing` state, which `HAManagerImpl.getHostStatusFromHAConfig()` maps to `Status.Disconnected`, not `Status.Down`. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.
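To make the root cause concrete, here is a minimal sketch of the mapping described above. The enums and method names are simplified stand-ins invented for illustration (they are not CloudStack's real `HAState`/`Status` types, which have more values), and the only facts taken from this PR are that `Fencing` maps to `Disconnected`, `Fenced` maps to `Down`, and VM-HA is not invoked unless the host is `Down`.

```java
/** Illustrative stand-ins only; CloudStack's real HA and host status enums have more values. */
final class HaStatusMappingSketch {
    enum HaState { FENCING, FENCED }
    enum HostStatus { DISCONNECTED, DOWN }

    /** Mirrors the two mappings named in the description; everything else is out of scope. */
    static HostStatus hostStatusFor(HaState state) {
        switch (state) {
            case FENCED:
                return HostStatus.DOWN;          // fence confirmed: host is declared dead
            case FENCING:
                return HostStatus.DISCONNECTED;  // fence still pending (or repeatedly failing)
            default:
                throw new IllegalArgumentException("unmapped state: " + state);
        }
    }

    /** Assumption for this sketch: VM-HA restarts VMs only for a host reported as DOWN. */
    static boolean vmHaCanRestartVms(HaState state) {
        return hostStatusFor(state) == HostStatus.DOWN;
    }

    public static void main(String[] args) {
        for (HaState s : HaState.values()) {
            System.out.println(s + " -> " + hostStatusFor(s)
                    + ", VM-HA restart possible: " + vmHaCanRestartVms(s));
        }
    }
}
```

A host whose fence never reports success stays in the `FENCING` row of this mapping, which is why its VMs were not restarted.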
Fix
Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:

- if the host is already powered off (OOBM `STATUS` == `Off`) → treat it as fenced (no power-off issued);
- otherwise issue a best-effort power-off and confirm via OOBM `STATUS`;
- only a confirmed `Off` state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).
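The decision logic above can be sketched roughly as follows. This is not the code in `KVMHAProvider`: the `Oobm` interface, `FakeBmc` and all method names are hypothetical, chosen only to show the status-first, confirm-after flow. The fake BMC rejects a power-off when the chassis is not on, loosely imitating the HTTP 409 behaviour reported for Redfish.

```java
import java.util.Optional;

final class FenceSketch {
    enum PowerState { ON, OFF }

    /** Hypothetical OOBM abstraction, not CloudStack's driver interface. */
    interface Oobm {
        Optional<PowerState> status(); // empty when the BMC cannot be reached
        boolean powerOff();            // may report failure, e.g. when already off
    }

    /** True only when the BMC positively reports Off. */
    static boolean confirmedOff(Oobm oobm) {
        try {
            return oobm.status().filter(s -> s == PowerState.OFF).isPresent();
        } catch (RuntimeException e) {
            return false; // unknown state is never treated as fenced
        }
    }

    static boolean fence(Oobm oobm) {
        if (confirmedOff(oobm)) {
            return true; // already off: fenced, no power-off issued
        }
        try {
            oobm.powerOff(); // best effort; the return value is deliberately ignored
        } catch (RuntimeException e) {
            // ignored: only the confirmed power state decides the outcome
        }
        return confirmedOff(oobm);
    }

    /** Test double: rejects power-off unless the chassis is on; null state = unreachable BMC. */
    static final class FakeBmc implements Oobm {
        private PowerState state;
        int powerOffCalls = 0;

        FakeBmc(PowerState initial) { this.state = initial; }

        @Override public Optional<PowerState> status() { return Optional.ofNullable(state); }

        @Override public boolean powerOff() {
            powerOffCalls++;
            if (state != PowerState.ON) {
                return false; // already off (409-like rejection) or unreachable
            }
            state = PowerState.OFF;
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("already off: " + fence(new FakeBmc(PowerState.OFF)));
        System.out.println("powered on:  " + fence(new FakeBmc(PowerState.ON)));
        System.out.println("unreachable: " + fence(new FakeBmc(null)));
    }
}
```

Because the outcome depends only on the reported power state, the same flow applies to any driver that can answer a status query. An unreachable BMC yields a failed fence that the caller is expected to retry.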
Additionally, the Redfish driver now maps `PowerOperation.OFF` to `ForceOff` (a hard power-off) instead of `GracefulShutdown`, consistent with the ipmitool driver and appropriate for fencing an unresponsive host; `SOFT` remains the graceful ACPI shutdown. Also fixes a latent `String.format` argument-count bug on the Redfish `STATUS` branch.

Fixes: #13376
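For readers unfamiliar with Redfish, the mapping change described above amounts to something like the sketch below. The class, the two-value enum and the helper names are hypothetical and only cover the two operations this PR mentions; `ForceOff` and `GracefulShutdown` are standard Redfish `ResetType` values, and the JSON body shape follows the usual `ComputerSystem.Reset` action convention rather than anything quoted from the driver.

```java
final class RedfishResetMappingSketch {
    /** Only the two operations discussed in this PR; the real enum has more. */
    enum PowerOperation { OFF, SOFT }

    static String resetTypeFor(PowerOperation op) {
        switch (op) {
            case OFF:
                return "ForceOff";          // hard power-off, suitable for fencing
            case SOFT:
                return "GracefulShutdown";  // ACPI shutdown, needs a responsive OS
            default:
                throw new IllegalArgumentException("unmapped operation: " + op);
        }
    }

    /** Builds the JSON payload for a reset action (assumed body shape). */
    static String resetRequestBody(PowerOperation op) {
        return String.format("{\"ResetType\": \"%s\"}", resetTypeFor(op));
    }

    public static void main(String[] args) {
        System.out.println(resetRequestBody(PowerOperation.OFF));
        System.out.println(resetRequestBody(PowerOperation.SOFT));
    }
}
```

A hard power-off does not depend on the guest OS cooperating, which matters when the host being fenced is already unresponsive.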
Types of changes
Bug Severity
How Has This Been Tested?
Unit tests added to `KVMHostHATest` (all green) covering the fence behaviour:

- … `Off` → fenced;
- … `Off` → still fenced (the regression for this issue);

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's `stopVirtualMachine`, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.