Drain per-host reservation when a VM starts on a different host#13363
Open
Kukunin wants to merge 1 commit into
Open
Drain per-host reservation when a VM starts on a different host#13363Kukunin wants to merge 1 commit into
Kukunin wants to merge 1 commit into
Conversation
When a VM is stopped via the API, postStateTransitionEvent moves its used capacity into reserved_capacity on its last host (so a quick restart can reclaim it cheaply). Until now, this reservation was only drained when the VM started again on the same lastHostId. Starting on any other host left an orphan reservation that lingered for up to one hour (capacity.skipcounting.hours) and was summed into the cluster used+reserved aggregate by FirstFitPlanner.removeClustersCrossingThreshold, spuriously tripping the disable-threshold and blocking later starts. Always drain the VM's reservation on its previous host before allocating on the target host — same host or not. This removes the fromLastHost branch (now dead, the only other caller already passes false), the matching boolean parameter on allocateVmCapacity, and the misspelt moveToReservered parameter everywhere it appears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's happening
When a VM is stopped via the API, CloudStack moves its CPU/RAM from
usedtoreservedon itslast_host_id(see theStopping → Stopped + OperationSucceededbranch inCapacityManagerImpl.postStateTransitionEvent). The idea is that a quick restart on the same host can reclaim its earmarked slot cheaply.The asymmetry: when the VM later starts on the same host, the
fromLastHost=truebranch inallocateVmCapacitydrains that reservation. When it starts on a different host, the old reservation just sits there. It only clears when:capacity.skipcounting.hours(default 1h) elapses andupdateCapacityForHostrecycles it, orUntil then, the orphan reservation gets summed into the cluster's
used + reservedaggregate byFirstFitPlanner.removeClustersCrossingThreshold. On a cluster that's already nearcluster.memory.allocated.capacity.disablethreshold(default 0.85), the phantom can trip the threshold and block subsequent VM starts in the whole cluster — even though the VM in question isn't actually consuming anything on its old host.We hit this on a fairly full cluster where stop-then-start cycles started failing intermittently with
InsufficientServerCapacityException: No destination found. The "ghost" capacity was the released-but-not-drained reservation from the last stop.The fix
postStateTransitionEventnow drains the VM's reservation on its previous host before allocating on the target host — regardless of whether the target is the same host or a different one. Treating both cases identically removes thefromLastHostasymmetry.Side effects of unifying the path:
fromLastHost=truebranch inallocateVmCapacityis now unreachable frompostStateTransitionEvent. The only other caller (VirtualMachineManagerImpl#reconfiguringOnExistingHost) already passesfalse, so the parameter is removed entirely.moveToReserveredtypo (3 e's) — renamed tomoveToReservedthroughout the interface, the impl, and the debug logs.logger.debug(String.format(...))inpostStateTransitionEventswitched to SLF4J{}placeholders to match the rest of the file.Tests
Two new tests in
CapacityManagerImplTestcover both transitions:testPostStateTransitionReleasesStaleReservationWhenStartingOnDifferentHost— fails onmain, passes with the fix. VerifiesreleaseVmCapacity(vm, true, false, oldHostId)is invoked.testPostStateTransitionReleasesReservationWhenStartingOnSameHost— guards the unified contract so both paths stay in lockstep.Manual validation
Reproduced on a small test cluster:
reserved=128Mappears on A inop_host_capacity).hostid.release mem from host: A, old reserved: 128MB → new reserved: 0.Runningon B,last_host_idupdates to B andupdateCapacityForHostagrees the reservation is gone.Same-host stop/start round-trip continues to behave the same way as before.