Host-HA never marks a powered-off KVM host Down because the OOBM fence (power-off) fails against an already-off chassis — VMs only recover after the dead host is powered back on

# Host-HA never marks a powered-off KVM host `Down` because the fence (OOBM power-off) can't succeed against an already-off chassis — VM-HA only triggers once the dead host is powered back on

## ISSUE TYPE
- Bug Report

## COMPONENT NAME
~~~
HA (host-HA framework), Out-of-band Management (Redfish/IPMI), KVM
~~~

## CLOUDSTACK VERSION
~~~
Confirmed present with identical (or functionally identical) logic on:
  - tag    4.22.1.0  (analyzed in detail)
  - branch 4.22      (origin/4.22 @ 21b2025c) — all key files byte-identical to 4.22.1.0
  - branch main      (origin/main @ 6bc83a3c) — all key files byte-identical to 4.22.1.0
  - branch 4.20      (origin/4.20 @ a3970bb1) — same logic; differences are cosmetic only
                       (method rename getHostStatus() -> getHostStatusFromHAConfig(); logger formatting)

The host-HA + OOBM fence design predates 4.20, so earlier 4.x releases are very likely affected too.

Per-branch verification of the relevant elements:
  - KVMHAProvider.fence() = OOBM PowerOperation.OFF, returns resp.getSuccess(): same on 4.20 / 4.22 / main
  - FenceTask: only transitions to Fenced on success; retries Fencing otherwise: byte-identical on 4.20 / 4.22 / main
  - HAManagerImpl host-status mapping (Fenced->Down, Fencing->Disconnected): same on 4.20 (getHostStatus) and 4.22/main (getHostStatusFromHAConfig)
  - RedfishWrapper: PowerOperation.OFF -> RedfishResetCmd.GracefulShutdown: byte-identical on 4.20 / 4.22 / main
  - RedfishClient: throws unless HTTP status in 2XX (SC_OK..SC_MULTIPLE_CHOICES): byte-identical on 4.20 / 4.22 / main
~~~

## CONFIGURATION
- KVM cluster with **host-HA enabled** on the hosts.
- **Out-of-band Management enabled** per host (reproduced with the **Redfish** driver against Dell iDRAC; the same logic applies to the **ipmitool** driver).
- VM-HA enabled (`VmHaEnabled`).
- Primary storage: Linstor (not material — `isStorageSupportHA() == true`, so the legacy investigator is not the bottleneck here).

## OS / ENVIRONMENT
- Management servers: Ubuntu 24.04, OpenJDK 21.
- Hypervisors: KVM.
- BMC: Dell iDRAC via Redfish (`/redfish/v1/Systems/System.Embedded.1`).

## SUMMARY

When a KVM host that has host-HA + OOBM enabled is **hard powered off** (e.g. forced chassis-off from the BMC console, or a real power/cable failure), CloudStack **never transitions the host to `Down`** and therefore **never restarts its VMs on other hosts**. The host stays in `Alert`/`Disconnected` indefinitely.

Root cause: the host-HA state machine only declares a host dead (`HAState.Fenced` → investigator `Status.Down`) **after a successful fence**, and the fence is implemented as an **active OOBM power-off**. Against an already-off chassis that power-off cannot succeed (the BMC rejects it), so the host is pinned in the `Fencing` state and retried forever. The investigator maps `Fencing` to `Status.Disconnected`, not `Status.Down`, so VM-HA is never invoked.

The perverse result: **the VMs are only recovered once the original (dead) host is powered back on, even while it is still in the BIOS boot stage**. At that point the pending power-off finally succeeds, the host transitions to `Fenced`/`Down`, and HA restarts the VMs elsewhere. This defeats the purpose of HA.

**All three current branches are affected identically:** the relevant code is byte-identical on `4.22` and `main`, and functionally identical on `4.20` (only a method rename and logger formatting differ). There is no `4.21` branch upstream. Per-element diff verification is in the CLOUDSTACK VERSION section above.

## STEPS TO REPRODUCE

1. KVM cluster, host-HA enabled, OOBM (Redfish or ipmitool) configured and enabled on the hosts, VM-HA enabled. Place some HA-enabled VMs (incl. system VMs) on `hostA`.
2. Forcefully power off `hostA` at the BMC (chassis power off / simulate power loss). The BMC itself stays reachable.
3. Observe `hostA` in CloudStack over the next 20+ minutes.

### EXPECTED RESULTS
- Health check fails → activity check fails → host is fenced → host marked `Down` → VM-HA restarts `hostA`'s VMs on other hosts within a few minutes.

### ACTUAL RESULTS
- `hostA` remains in `Alert` (host status) with the host-HA state stuck in `Fencing`.
- The OOBM **STATUS** poll correctly reports the chassis as `Off` the entire time, but that knowledge is never used to declare the host down.
- The agent investigator repeatedly reports the host as `Up` (while HA state is `Suspect`) and then `Disconnected` (while HA state is `Fencing`) — **never `Down`**.
- VMs are **not** restarted; the scheduler keeps preferring the VM's last host (the dead `hostA`).
- The instant `hostA` is powered back **on**, the fence power-off finally succeeds → host goes `Down` → VM-HA restarts the VMs on other hosts.

## ROOT CAUSE ANALYSIS

### Decision chain (only `Fenced` yields `Down`)

1. For an HA-eligible KVM host, the legacy investigator delegates to the host-HA framework:
   - `KVMInvestigator.getHostAgentStatus()` → `haManager.getHostStatusFromHAConfig(host)`
     (`plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java:81`)
2. `HAManagerImpl.getHostStatusFromHAConfig()` maps HA state → host status
   (`server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:315`):
   - `Fenced` → `Status.Down`
   - `Degraded` / `Recovering` / `Fencing` → `Status.Disconnected`
   - everything else (`Available`/`Suspect`/`Checking`/`Recovered`) → `Status.Up`
3. `AgentManagerImpl` only fires the `HostDown` event and `scheduleRestartForVmsOnHost(...)` when the investigator returns `Status.Down`
   (`engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:1147`, `:1200`).

So VM-HA for an HA-eligible KVM host requires the host-HA state machine to reach **`Fenced`**.
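
The decision chain above can be condensed into a small, self-contained sketch. This is not the CloudStack source: `HaStatusMappingSketch`, its nested enums, and the `toHostStatus`/`triggersVmHa` helpers are hypothetical names that only mirror the mapping described in step 2 and the `Status.Down` condition in step 3. Only the HA states named in this report are listed.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: mirrors the mapping described above, not the actual HAManagerImpl code. */
final class HaStatusMappingSketch {
    // Only the HA states mentioned in this report; the real enum may contain more.
    enum HAState { Available, Suspect, Checking, Degraded, Recovering, Recovered, Fencing, Fenced }

    enum Status { Up, Disconnected, Down }

    private static final Set<HAState> DISCONNECTED_STATES =
            EnumSet.of(HAState.Degraded, HAState.Recovering, HAState.Fencing);

    /** Maps an HA state to the host status the investigator would report. */
    static Status toHostStatus(HAState state) {
        if (state == HAState.Fenced) {
            return Status.Down;
        }
        if (DISCONNECTED_STATES.contains(state)) {
            return Status.Disconnected;
        }
        return Status.Up;
    }

    /** VM-HA is only scheduled when the investigator reports Down. */
    static boolean triggersVmHa(HAState state) {
        return toHostStatus(state) == Status.Down;
    }

    public static void main(String[] args) {
        for (HAState state : HAState.values()) {
            System.out.println(state + " -> " + toHostStatus(state)
                    + " (VM-HA: " + triggersVmHa(state) + ")");
        }
    }
}
```

In this sketch `Fencing` maps to `Disconnected`, so a host that never leaves `Fencing` never satisfies the VM-HA condition.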

### Reaching `Fenced` requires a *successful* power-off

- The state machine only goes `Fencing → Fenced` on `Event.Fenced`
  (`api/src/main/java/org/apache/cloudstack/ha/HAConfig.java:139`).
- `FenceTask.processResult()` only fires `Event.Fenced` when the fence returned `true`; otherwise it does nothing and the poll loop retries `Fencing` forever via `RetryFencing`
  (`server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:45`; retry at `server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:724`).
- The fence is an active OOBM power-off:
  `KVMHAProvider.fence()` → `outOfBandManagementService.executePowerOperation(host, PowerOperation.OFF, null)` and returns `resp.getSuccess()`
  (`plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:87`).
- `executePowerOperation()` **throws** `CloudRuntimeException` whenever the driver response is not successful — it never returns `success=false`
  (`server/src/main/java/org/apache/cloudstack/outofbandmanagement/OutOfBandManagementServiceImpl.java:432`).
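
A minimal model of the retry behaviour described in the bullets above, assuming a fence call that either returns `true` or throws. `FenceRetrySketch`, `pollOnce`, and `pollRepeatedly` are hypothetical names for illustration; the real `FenceTask` and poll loop are structured differently.

```java
import java.util.function.BooleanSupplier;

/** Illustrative only: models "Fencing -> Fenced happens only on a successful fence". */
final class FenceRetrySketch {
    enum HAState { Fencing, Fenced }

    /** One poll round: the state advances only when the fence reports success. */
    static HAState pollOnce(HAState current, BooleanSupplier fence) {
        if (current != HAState.Fencing) {
            return current;
        }
        boolean result;
        try {
            result = fence.getAsBoolean();
        } catch (RuntimeException e) {
            // Stand-in for the exception raised when the power-off is rejected.
            result = false;
        }
        return result ? HAState.Fenced : HAState.Fencing;
    }

    /** Repeats the poll a fixed number of times and returns the final state. */
    static HAState pollRepeatedly(HAState start, BooleanSupplier fence, int rounds) {
        HAState state = start;
        for (int i = 0; i < rounds; i++) {
            state = pollOnce(state, fence);
        }
        return state;
    }

    public static void main(String[] args) {
        BooleanSupplier rejected = () -> {
            throw new IllegalStateException("power-off rejected: chassis already off");
        };
        System.out.println("always rejected: " + pollRepeatedly(HAState.Fencing, rejected, 300));
        System.out.println("accepted once:   " + pollRepeatedly(HAState.Fencing, () -> true, 1));
    }
}
```

However many rounds are run, a fence that always throws leaves the modelled state at `Fencing`; a single successful fence is enough to reach `Fenced`.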

### Why the power-off fails against an already-off host (Redfish)

- The Redfish driver maps `PowerOperation.OFF` → `RedfishResetCmd.GracefulShutdown`
  (`plugins/outofbandmanagement-drivers/redfish/src/main/java/org/apache/cloudstack/outofbandmanagement/driver/redfish/RedfishWrapper.java:34`).
- `RedfishClient.executeComputerSystemReset()` POSTs to `.../Actions/ComputerSystem.Reset` and throws `RedfishException` if the HTTP status is not 2XX
  (`utils/src/main/java/org/apache/cloudstack/utils/redfish/RedfishClient.java:300-312`).
- An already-off system returns **HTTP 409 (Conflict)** — a `GracefulShutdown` is invalid because there is no running OS to shut down. 409 ∉ 2XX → `RedfishException` → `CloudRuntimeException` → `HAFenceException` → `FenceTask` sees `result=false` → **no `Fenced` transition** → stuck in `Fencing`.
- (The ipmitool driver has the analogous failure mode: `chassis power off` against an already-off / unreachable BMC returns a non-zero exit code, judged purely by process exit status with no "already in target state" handling — `IpmitoolWrapper.executeCommands()` → `result.isSuccess()`.)
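
The Redfish failure path above reduces to a status-code range check. The sketch below assumes the range is 200 inclusive to 300 exclusive (matching the `SC_OK..SC_MULTIPLE_CHOICES` description in the version section) and collapses the `RedfishException` → `CloudRuntimeException` → `HAFenceException` chain into a single hypothetical `ResetRejectedException`; none of these names are the CloudStack classes.

```java
/** Illustrative only: why an HTTP 409 from the BMC ends up as a failed fence. */
final class RedfishResetSketch {
    /** Hypothetical stand-in for the driver-level exception chain. */
    static final class ResetRejectedException extends RuntimeException {
        ResetRejectedException(String message) {
            super(message);
        }
    }

    /** 2XX check as described above: 200 inclusive to 300 exclusive (assumed bounds). */
    static boolean isSuccess(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }

    /** Throws when the BMC answers the reset request with a non-2XX status. */
    static void checkResetResponse(int httpStatus) {
        if (!isSuccess(httpStatus)) {
            throw new ResetRejectedException(
                    "The expected HTTP status code is '2XX' but it got '" + httpStatus + "'.");
        }
    }

    /** Collapses the exception chain into the boolean the fence task effectively sees. */
    static boolean fenceResult(int httpStatus) {
        try {
            checkResetResponse(httpStatus);
            return true;
        } catch (ResetRejectedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("204 -> fenced=" + fenceResult(204));
        System.out.println("409 -> fenced=" + fenceResult(409));
    }
}
```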

### Net effect

The fence requires confirming an **active power-off transition**, but a host that is already off (precisely the case where restarting its VMs is safe) cannot be "powered off successfully." The safety mechanism deadlocks in exactly the scenario it exists to handle. VMs recover only when the dead host returns.

## LOG EVIDENCE (two-MS cluster; host `kvm-host01`, id:1, uuid `aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee`; hostnames/IPs/VM names below are anonymised examples)

OOBM STATUS poll knew the chassis was off the whole time (MS #1 log):
~~~
15:10:47  OutOfBandManagementServiceImpl  Transitioned out-of-band management power state from On to Off
          due to event: Off(Chassis Power is Off) for Host {id:1, kvm-host01}
~~~

Investigator never returns Down — `Up` while `Suspect`, then `Disconnected` while `Fencing` (MS #1 log):
~~~
15:07:51  KVMInvestigator was able to determine host {id:1} is in Up
          ... is considered Up (...). State: Suspect, Most recent health check failed.
15:14:51  HAManagerImpl  HA: Agent [{id:1}] is disconnected. State: Fencing, The resource is undergoing fence operation.
~~~

The fence itself, on the MS node that owns the HA config (MS #2 log) — repeated every ~4s for ~20 min:
~~~
15:14:20  (first) ... it got '409'
15:14:28  KVMHAProvider  OOBM service is not configured or enabled for this host {id:1} error is
          Failed to execute System power command ... 'POST' ...
          '.../Actions/ComputerSystem.Reset' ... The expected HTTP status code is '2XX' but it got '409'.
15:14:28  FenceTask  Exception occurred while running FenceTask ...
          org.apache.cloudstack.ha.provider.HAFenceException ... at KVMHAProvider.fence(KVMHAProvider.java:99)
~~~
Counts over the outage: ~618 × `409`, ~308 × `HAFenceException`, 930 × `Fencing` state lines, `Starting HA on ... = 1` (only at the very end).
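
As a rough consistency check on these counts, the sketch below divides the window between the first logged `409` (15:14:20) and the VM-HA restart (15:35:03) by the ~4 s retry interval. The class and method names are hypothetical, and reading the ratio as roughly two `409` lines per fence attempt is an inference from the counts, not something confirmed in the code.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Illustrative only: back-of-the-envelope check of the log counts quoted above. */
final class FenceAttemptEstimate {
    /** Whole number of attempts that fit between the two timestamps at the given interval. */
    static long estimateAttempts(LocalTime firstFailure, LocalTime recovery, long secondsPerAttempt) {
        return Duration.between(firstFailure, recovery).getSeconds() / secondsPerAttempt;
    }

    public static void main(String[] args) {
        long attempts = estimateAttempts(LocalTime.of(15, 14, 20), LocalTime.of(15, 35, 3), 4);
        System.out.println("estimated fence attempts: " + attempts);
        System.out.printf("409 lines per HAFenceException: %.2f%n", 618.0 / 308.0);
    }
}
```

The window is 1243 s, which gives about 310 attempts at a 4 s interval, in line with the ~308 `HAFenceException` entries.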

VM-HA only fires after the host is powered back on (MS #2 log):
~~~
15:35:03  HighAvailabilityManagerExtImpl  Scheduling restart for VMs on host {id:1, kvm-host01}
15:35:03  Host [kvm-host01 (id:1) ...] is down.  Starting HA on the following VMs: vm-app01 vm-app02
~~~
(chassis Off→On detected ~15:35:05 in MS #1 log.)

## SECONDARY BUGS surfaced by this incident

1. **Misleading error message.** Every fence failure logs `OOBM service is not configured or enabled for this host ...`, but OOBM *is* configured and working. The catch-all in `KVMHAProvider.fence()` (`plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:97-100`) assumes any exception means "OOBM not configured," hiding the real cause (HTTP 409 / already off). This actively misdirects troubleshooting.

2. **Misleading "fencing performed" alerts.** Each *failed* fence attempt emits `alertType=30 — "HA Fencing of host id=1 ... performed"` because `FenceTask.processResult()` calls `sendAlert(resource, HAState.Fencing)` unconditionally regardless of `result` (`server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:54`). Admins receive a flood of "fencing performed" alerts while fencing is in fact failing continuously.
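
Both reporting problems come down to ignoring information that is already available at the call site: the caught exception and the fence result. A minimal sketch of messages derived from that information follows; `FenceReportingSketch` and its methods are hypothetical, and the exact wording is only an example.

```java
/** Illustrative only: derive log and alert text from the actual failure and outcome. */
final class FenceReportingSketch {
    /** Builds the log line from the real cause instead of assuming a configuration problem. */
    static String describeFailure(String hostName, Exception cause) {
        return "Fence (OOBM power-off) failed for host " + hostName + ": " + cause.getMessage();
    }

    /** Chooses the alert subject from the actual fence outcome. */
    static String alertSubject(String hostName, boolean fenceSucceeded) {
        return fenceSucceeded
                ? "HA Fencing of host " + hostName + " performed"
                : "HA Fencing of host " + hostName + " failed";
    }

    public static void main(String[] args) {
        Exception cause = new IllegalStateException(
                "The expected HTTP status code is '2XX' but it got '409'.");
        System.out.println(describeFailure("kvm-host01", cause));
        System.out.println(alertSubject("kvm-host01", false));
    }
}
```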

## SUGGESTED FIX (direction)

Make fencing treat "host is already off" as a successful fence, and stop hiding the real error:

1. In `KVMHAProvider.fence()`, query OOBM power **STATUS** first; if the chassis is already `Off`, return `true` (host is effectively fenced) instead of issuing a power-off that 409s. (A confirmed-off host is safe to declare fenced.)
2. Redfish driver: treat an idempotent power-off (target state already reached, HTTP 409 on `GracefulShutdown`/`ForceOff` when already off) as success; and/or prefer `ForceOff` over `GracefulShutdown` for the HA fence path.
3. Fix the `fence()` catch block to surface the actual driver error rather than "OOBM not configured."
4. Make `FenceTask` alerts reflect actual success/failure of the fence.
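
Fix direction 1 could look roughly like the sketch below. It is a sketch under stated assumptions, not a patch: `Oobm`, `PowerState`, and `FakeBmc` are hypothetical stand-ins rather than the CloudStack OOBM interfaces, and the fake BMC rejects a power-off whenever the system is not on, mimicking the HTTP 409 seen in the logs.

```java
/** Illustrative only: treat a confirmed-off chassis as fenced before issuing a power-off. */
final class StatusFirstFenceSketch {
    enum PowerState { On, Off, Unknown }

    /** Hypothetical stand-in for an OOBM driver. */
    interface Oobm {
        PowerState status();
        void powerOff(); // throws RuntimeException when the BMC rejects the command
    }

    /** Fake BMC that rejects power-off unless the system is currently on. */
    static final class FakeBmc implements Oobm {
        private PowerState state;
        int powerOffCalls = 0;

        FakeBmc(PowerState initial) {
            this.state = initial;
        }

        @Override
        public PowerState status() {
            return state;
        }

        @Override
        public void powerOff() {
            powerOffCalls++;
            if (state != PowerState.On) {
                throw new IllegalStateException("409 Conflict: system is not powered on");
            }
            state = PowerState.Off;
        }
    }

    /** Returns true only when the chassis is confirmed Off, before or after the power-off. */
    static boolean fence(Oobm oobm) {
        if (oobm.status() == PowerState.Off) {
            return true; // already off: nothing to power off, host is effectively fenced
        }
        try {
            oobm.powerOff();
        } catch (RuntimeException e) {
            // Fall through: the re-check below decides whether the host is really off.
        }
        return oobm.status() == PowerState.Off;
    }

    public static void main(String[] args) {
        System.out.println("already off: " + fence(new FakeBmc(PowerState.Off)));
        System.out.println("running:     " + fence(new FakeBmc(PowerState.On)));
        System.out.println("unknown:     " + fence(new FakeBmc(PowerState.Unknown)));
    }
}
```

The `Unknown` case deliberately stays unfenced: only a confirmed `Off` reading is treated as safe. A real implementation would also need to allow for a graceful shutdown that takes time to complete, which this sketch ignores.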

## NOTES
- Analysed against git tag `4.22.1.0`.
- Storage (Linstor) is not the bottleneck: `LinstorPrimaryDataStoreDriverImpl.isStorageSupportHA()` returns `true`, so the legacy KVM investigator does not short-circuit; the host-HA framework path (above) is in effect.


Host-HA never marks a powered-off KVM host Down because the OOBM fence (power-off) fails against an already-off chassis — VMs only recover after the dead host is powered back on #13376
