Host-HA never marks a powered-off KVM host Down because the OOBM fence (power-off) fails against an already-off chassis — VMs only recover after the dead host is powered back on #13376

@andrijapanicsb

ISSUE TYPE

  • Bug Report

COMPONENT NAME

HA (host-HA framework), Out-of-band Management (Redfish/IPMI), KVM

CLOUDSTACK VERSION

Confirmed present with identical (or functionally identical) logic on:
  - tag    4.22.1.0  (analyzed in detail)
  - branch 4.22      (origin/4.22 @ 21b2025c) — all key files byte-identical to 4.22.1.0
  - branch main      (origin/main @ 6bc83a3c) — all key files byte-identical to 4.22.1.0
  - branch 4.20      (origin/4.20 @ a3970bb1) — same logic; differences are cosmetic only
                       (method rename getHostStatus() -> getHostStatusFromHAConfig(); logger formatting)

The host-HA + OOBM fence design predates 4.20, so earlier 4.x releases are very likely affected too.

Per-branch verification of the relevant elements:
  - KVMHAProvider.fence() = OOBM PowerOperation.OFF, returns resp.getSuccess(): same on 4.20 / 4.22 / main
  - FenceTask: only transitions to Fenced on success; retries Fencing otherwise: byte-identical on 4.20 / 4.22 / main
  - HAManagerImpl host-status mapping (Fenced->Down, Fencing->Disconnected): same on 4.20 (getHostStatus) and 4.22/main (getHostStatusFromHAConfig)
  - RedfishWrapper: PowerOperation.OFF -> RedfishResetCmd.GracefulShutdown: byte-identical on 4.20 / 4.22 / main
  - RedfishClient: throws unless HTTP status in 2XX (SC_OK..SC_MULTIPLE_CHOICES): byte-identical on 4.20 / 4.22 / main

CONFIGURATION

  • KVM cluster with host-HA enabled on the hosts.
  • Out-of-band Management enabled per host (reproduced with the Redfish driver against Dell iDRAC; the same logic applies to the ipmitool driver).
  • VM-HA enabled (VmHaEnabled).
  • Primary storage: Linstor (not material — isStorageSupportHA() == true, so the legacy investigator is not the bottleneck here).

OS / ENVIRONMENT

  • Management servers: Ubuntu 24.04, OpenJDK 21.
  • Hypervisors: KVM.
  • BMC: Dell iDRAC via Redfish (/redfish/v1/Systems/System.Embedded.1).

SUMMARY

When a KVM host that has host-HA + OOBM enabled is hard powered off (e.g. forced chassis-off from the BMC console, or a real power/cable failure), CloudStack never transitions the host to Down and therefore never restarts its VMs on other hosts. The host stays in Alert/Disconnected indefinitely.

Root cause: the host-HA state machine only declares a host dead (HAState.Fenced → investigator Status.Down) after a successful fence, and the fence is implemented as an active OOBM power-off. Against an already-off chassis that power-off cannot succeed (the BMC rejects it), so the host is pinned in the Fencing state and retried forever. The investigator maps Fencing to Status.Disconnected, not Status.Down, so VM-HA is never invoked.

The perverse result: the VMs are recovered only once the original (dead) host is powered back on, even while it is still in the BIOS boot stage. At that point the pending power-off finally succeeds, the host transitions to Fenced/Down, and HA restarts the VMs elsewhere. This defeats the purpose of HA.

All three current branches are affected by the identical issue: the relevant code is byte-identical on 4.22 and main, and functionally identical on 4.20 (only a method rename and logger formatting differ). There is no 4.21 branch upstream. Per-element diff verification is in the CLOUDSTACK VERSION section above.

STEPS TO REPRODUCE

  1. KVM cluster, host-HA enabled, OOBM (Redfish or ipmitool) configured and enabled on the hosts, VM-HA enabled. Place some HA-enabled VMs (incl. system VMs) on hostA.
  2. Forcefully power off hostA at the BMC (chassis power off / simulate power loss). The BMC itself stays reachable.
  3. Observe hostA in CloudStack over the next 20+ minutes.

EXPECTED RESULTS

  • Health check fails → activity check fails → host is fenced → host marked Down → VM-HA restarts hostA's VMs on other hosts within a few minutes.

ACTUAL RESULTS

  • hostA remains in Alert (host status) with the host-HA state stuck in Fencing.
  • The OOBM STATUS poll correctly reports the chassis as Off the entire time, but that knowledge is never used to declare the host down.
  • The agent investigator repeatedly reports the host as Up (while HA state is Suspect) and then Disconnected (while HA state is Fencing) — never Down.
  • VMs are not restarted; the scheduler keeps preferring the VM's last host (the dead hostA).
  • The instant hostA is powered back on, the fence power-off finally succeeds → host goes Down → VM-HA restarts the VMs on other hosts.

ROOT CAUSE ANALYSIS

Decision chain (only Fenced yields Down)

  1. For an HA-eligible KVM host, the legacy investigator delegates to the host-HA framework:
    • KVMInvestigator.getHostAgentStatus() → haManager.getHostStatusFromHAConfig(host)
      (plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java:81)
  2. HAManagerImpl.getHostStatusFromHAConfig() maps HA state → host status
    (server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:315):
    • Fenced → Status.Down
    • Degraded / Recovering / Fencing → Status.Disconnected
    • everything else (Available/Suspect/Checking/Recovered) → Status.Up
  3. AgentManagerImpl only fires the HostDown event and scheduleRestartForVmsOnHost(...) when the investigator returns Status.Down
    (engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:1147, :1200).

So VM-HA for an HA-eligible KVM host requires the host-HA state machine to reach Fenced.
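
The decision chain above can be condensed into a small sketch. This is an illustration under stated assumptions, not the CloudStack code: the class and the two enums are hypothetical stand-ins for HAConfig.HAState and the host Status, and only the states named in this report are listed.

```java
/** Sketch only: hypothetical stand-ins, not the CloudStack classes. */
final class HaStatusMappingSketch {
    /** Subset of host-HA states named in this report. */
    enum HaState { AVAILABLE, SUSPECT, CHECKING, DEGRADED, RECOVERING, RECOVERED, FENCING, FENCED }

    /** Host statuses the investigator reports in this scenario. */
    enum HostStatus { UP, DISCONNECTED, DOWN }

    /** HA state -> host status, following the mapping described in step 2. */
    static HostStatus toHostStatus(HaState state) {
        switch (state) {
            case FENCED:
                return HostStatus.DOWN;
            case DEGRADED:
            case RECOVERING:
            case FENCING:
                return HostStatus.DISCONNECTED;
            default:
                return HostStatus.UP;
        }
    }

    /** VM restarts are scheduled only when the reported status is DOWN (step 3). */
    static boolean triggersVmHa(HaState state) {
        return toHostStatus(state) == HostStatus.DOWN;
    }

    public static void main(String[] args) {
        for (HaState state : HaState.values()) {
            System.out.println(state + " -> " + toHostStatus(state)
                    + ", VM-HA triggered: " + triggersVmHa(state));
        }
    }
}
```

In this sketch FENCED is the only state for which triggersVmHa returns true, which is why a host stuck in Fencing never has its VMs restarted.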

Reaching Fenced requires a successful power-off

  • The state machine only goes Fencing → Fenced on Event.Fenced
    (api/src/main/java/org/apache/cloudstack/ha/HAConfig.java:139).
  • FenceTask.processResult() only fires Event.Fenced when the fence returned true; otherwise it does nothing and the poll loop retries Fencing forever via RetryFencing
    (server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:45; retry at server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:724).
  • The fence is an active OOBM power-off:
    KVMHAProvider.fence() → outOfBandManagementService.executePowerOperation(host, PowerOperation.OFF, null) and returns resp.getSuccess()
    (plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:87).
  • executePowerOperation() throws CloudRuntimeException whenever the driver response is not successful — it never returns success=false
    (server/src/main/java/org/apache/cloudstack/outofbandmanagement/OutOfBandManagementServiceImpl.java:432).
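
The retry behaviour described above can be modelled in a few lines. This is a simplified sketch, not FenceTask or HAManagerImpl: the Fencer interface, the runFenceRounds helper and the bounded attempt count are all hypothetical (the real poll loop has no such bound in the scenario reported here).

```java
/** Sketch only: a bounded model of the Fencing retry loop. */
final class FenceRetrySketch {
    enum HaState { FENCING, FENCED }

    /** Hypothetical stand-in for the HA provider's fence call. */
    interface Fencer {
        boolean fence() throws Exception;
    }

    /** Runs up to maxAttempts fence rounds and returns the resulting HA state. */
    static HaState runFenceRounds(Fencer fencer, int maxAttempts) {
        HaState state = HaState.FENCING;
        for (int attempt = 0; attempt < maxAttempts && state == HaState.FENCING; attempt++) {
            boolean fenced;
            try {
                fenced = fencer.fence();
            } catch (Exception e) {
                fenced = false; // a thrown fence error counts as "not fenced"
            }
            if (fenced) {
                state = HaState.FENCED; // the only path out of FENCING
            }
            // otherwise no event is fired and the next round retries
        }
        return state;
    }

    public static void main(String[] args) {
        // A fence that always throws, as against an already-off chassis.
        Fencer alreadyOff = () -> { throw new IllegalStateException("HTTP 409"); };
        System.out.println("always failing: " + runFenceRounds(alreadyOff, 300));
        System.out.println("succeeding:     " + runFenceRounds(() -> true, 300));
    }
}
```

However many rounds are allowed, a fence call that always throws leaves the state at FENCING; only a call that returns true moves it to FENCED.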

Why the power-off fails against an already-off host (Redfish)

  • The Redfish driver maps PowerOperation.OFF → RedfishResetCmd.GracefulShutdown
    (plugins/outofbandmanagement-drivers/redfish/src/main/java/org/apache/cloudstack/outofbandmanagement/driver/redfish/RedfishWrapper.java:34).
  • RedfishClient.executeComputerSystemReset() POSTs to .../Actions/ComputerSystem.Reset and throws RedfishException if the HTTP status is not 2XX
    (utils/src/main/java/org/apache/cloudstack/utils/redfish/RedfishClient.java:300-312).
  • An already-off system returns HTTP 409 (Conflict): a GracefulShutdown is invalid because there is no running OS to shut down. 409 ∉ 2XX → RedfishException → CloudRuntimeException → HAFenceException → FenceTask sees result=false → no Fenced transition → stuck in Fencing.
  • (The ipmitool driver has the analogous failure mode: chassis power off against an already-off / unreachable BMC returns a non-zero exit code, judged purely by process exit status with no "already in target state" handling; see IpmitoolWrapper.executeCommands() → result.isSuccess().)
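
The Redfish failure path above comes down to a status-code check whose exception is collapsed into a false fence result. A minimal sketch follows, assuming a strict 200..299 success range (the exact bounds used by RedfishClient are not reproduced here); all names are hypothetical, and the 204 in the demo is only an illustrative 2XX value.

```java
/** Sketch only: how a non-2XX reset reply becomes a failed fence. */
final class RedfishResetCheckSketch {
    /** True for HTTP 200..299; the bounds are an assumption of this sketch. */
    static boolean isSuccess(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }

    /** Mimics a client that throws on any non-2XX reply to the reset POST. */
    static void checkResetResponse(int httpStatus) {
        if (!isSuccess(httpStatus)) {
            throw new IllegalStateException(
                    "The expected HTTP status code is '2XX' but it got '" + httpStatus + "'.");
        }
    }

    /** Fence result as the HA task sees it: any exception collapses to false. */
    static boolean fenceResult(int httpStatusFromBmc) {
        try {
            checkResetResponse(httpStatusFromBmc);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("2XX reply -> fenced: " + fenceResult(204));
        System.out.println("409 reply -> fenced: " + fenceResult(409));
    }
}
```

The 409 carries useful information (the system is already in the target state), but a pure 2XX check discards it and reports only failure.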

Net effect

The fence requires confirming an active power-off transition, but a host that is already off (precisely the case where restarting its VMs is safe) cannot be "powered off successfully." The safety mechanism deadlocks in exactly the scenario it exists to handle. VMs recover only when the dead host returns.

LOG EVIDENCE (two-MS cluster; host kvm-host01, id:1, uuid aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee; hostnames/IPs/VM names below are anonymised examples)

OOBM STATUS poll knew the chassis was off the whole time (MS #1 log):

15:10:47  OutOfBandManagementServiceImpl  Transitioned out-of-band management power state from On to Off
          due to event: Off(Chassis Power is Off) for Host {id:1, kvm-host01}

Investigator never returns Down — Up while Suspect, then Disconnected while Fencing (MS #1 log):

15:07:51  KVMInvestigator was able to determine host {id:1} is in Up
          ... is considered Up (...). State: Suspect, Most recent health check failed.
15:14:51  HAManagerImpl  HA: Agent [{id:1}] is disconnected. State: Fencing, The resource is undergoing fence operation.

The fence itself, on the MS node that owns the HA config (MS #2 log) — repeated every ~4s for ~20 min:

15:14:20  (first) ... it got '409'
15:14:28  KVMHAProvider  OOBM service is not configured or enabled for this host {id:1} error is
          Failed to execute System power command ... 'POST' ...
          '.../Actions/ComputerSystem.Reset' ... The expected HTTP status code is '2XX' but it got '409'.
15:14:28  FenceTask  Exception occurred while running FenceTask ...
          org.apache.cloudstack.ha.provider.HAFenceException ... at KVMHAProvider.fence(KVMHAProvider.java:99)

Counts over the outage: ~618 × 409, ~308 × HAFenceException, 930 × Fencing state lines, Starting HA on ... = 1 (only at the very end).

VM-HA only fires after the host is powered back on (MS #2 log):

15:35:03  HighAvailabilityManagerExtImpl  Scheduling restart for VMs on host {id:1, kvm-host01}
15:35:03  Host [kvm-host01 (id:1) ...] is down.  Starting HA on the following VMs: vm-app01 vm-app02

(chassis Off→On detected ~15:35:05 in MS #1 log.)

SECONDARY BUGS surfaced by this incident

  1. Misleading error message. Every fence failure logs OOBM service is not configured or enabled for this host ..., but OOBM is configured and working. The catch-all in KVMHAProvider.fence() (plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:97-100) assumes any exception means "OOBM not configured," hiding the real cause (HTTP 409 / already off). This actively misdirects troubleshooting.

  2. Misleading "fencing performed" alerts. Each failed fence attempt emits alertType=30 — "HA Fencing of host id=1 ... performed" because FenceTask.processResult() calls sendAlert(resource, HAState.Fencing) unconditionally regardless of result (server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:54). Admins receive a flood of "fencing performed" alerts while fencing is in fact failing continuously.

SUGGESTED FIX (direction)

Make fencing treat "host is already off" as a successful fence, and stop hiding the real error:

  1. In KVMHAProvider.fence(), query OOBM power STATUS first; if the chassis is already Off, return true (host is effectively fenced) instead of issuing a power-off that 409s. (A confirmed-off host is safe to declare fenced.)
  2. Redfish driver: treat an idempotent power-off (target state already reached, HTTP 409 on GracefulShutdown/ForceOff when already off) as success; and/or prefer ForceOff over GracefulShutdown for the HA fence path.
  3. Fix the fence() catch block to surface the actual driver error rather than "OOBM not configured."
  4. Make FenceTask alerts reflect actual success/failure of the fence.
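
Items 1, 3 and 4 above could look roughly like the following. This is a sketch of the direction only, not a patch: the Oobm interface, FenceOutcome, alertText and fakeBmc are hypothetical names invented for illustration and do not correspond to the OutOfBandManagementService API.

```java
/** Sketch only: status-first fencing with honest error and alert text. */
final class StatusFirstFenceSketch {
    enum PowerState { ON, OFF, UNKNOWN }

    /** Hypothetical OOBM facade used only by this sketch. */
    interface Oobm {
        PowerState powerStatus() throws Exception;
        void powerOff() throws Exception; // throws when the BMC rejects the request
    }

    /** Fence result plus the real reason, so callers can log and alert accurately. */
    static final class FenceOutcome {
        final boolean fenced;
        final String detail;

        FenceOutcome(boolean fenced, String detail) {
            this.fenced = fenced;
            this.detail = detail;
        }
    }

    /** Item 1: check power status first; item 3: keep the actual driver error. */
    static FenceOutcome fence(Oobm oobm) {
        try {
            if (oobm.powerStatus() == PowerState.OFF) {
                return new FenceOutcome(true, "chassis already off");
            }
            oobm.powerOff();
            return new FenceOutcome(true, "power-off succeeded");
        } catch (Exception e) {
            return new FenceOutcome(false, "fence failed: " + e.getMessage());
        }
    }

    /** Item 4: the alert text depends on the outcome instead of being unconditional. */
    static String alertText(FenceOutcome outcome) {
        return (outcome.fenced ? "HA fencing performed: " : "HA fencing FAILED: ") + outcome.detail;
    }

    /** Test double: power-off succeeds only if the chassis is currently ON. */
    static Oobm fakeBmc(PowerState state) {
        return new Oobm() {
            @Override
            public PowerState powerStatus() {
                return state;
            }

            @Override
            public void powerOff() {
                if (state != PowerState.ON) {
                    throw new IllegalStateException("HTTP 409");
                }
            }
        };
    }

    public static void main(String[] args) {
        for (PowerState state : PowerState.values()) {
            System.out.println(state + ": " + alertText(fence(fakeBmc(state))));
        }
    }
}
```

An unknown power state still results in a failed fence in this sketch, so the host is declared fenced only when the chassis is confirmed off or the power-off actually succeeds.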

NOTES

  • Analysed against git tag 4.22.1.0.
  • Storage (Linstor) is not the bottleneck: LinstorPrimaryDataStoreDriverImpl.isStorageSupportHA() returns true, so the legacy KVM investigator does not short-circuit; the host-HA framework path (above) is in effect.
