fix(tests): increase memory limits for race-detector builds#19951
Conversation
|
Skipping CI for Draft Pull Request. |
🚀 Build Images Ready: images are ready for commit 498c169. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-652-g498c169ccb |
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #19951 +/- ##
==========================================
- Coverage 49.56% 49.56% -0.01%
==========================================
Files 2764 2764
Lines 208442 208442
==========================================
- Hits 103319 103317 -2
- Misses 97467 97469 +2
Partials 7656 7656
|
|
/test gke-qa-e2e-tests |
|
/retest-required |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
9328fc6 to
1ee3f01
Compare
Testing whether the [Operator] Upgrade multi-namespace installation test fails on current master. This test has failed 7+ consecutive times on PR #19951 (which only changes race-build gated code in tests/e2e/lib.sh). If it fails here too, the root cause is in master, not in #19951. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e509be6 to
7f22af2
Compare
|
/retest |
The busybox-style binary consolidation (ROX-33958) runs init() for all components at startup. Under the race detector's ~5-10x memory multiplier this causes OOMKills for admission-control (500Mi limit) and config-controller (128Mi limit). Override memory limits post-deploy when IS_RACE_BUILD is set, following the same pattern used for OpenShift CPU overrides (ROX-5334). - config-controller: 128Mi → 512Mi (race builds only) - admission-control: 500Mi → 2Gi (race builds only) Default Helm chart values are unchanged. Generated with assistance from AI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
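The override described in this commit could be sketched roughly as below. This is an illustration only, not the code in tests/e2e/lib.sh: the helper names race_memory_limit and build_override_cmd, the stackrox namespace, and the check for the literal value "true" are assumptions. The helper only prints the kubectl command so it can be inspected without a cluster.

```shell
#!/bin/sh
# Hypothetical helpers; function names and the namespace are assumptions.

# Print the race-build limit ($2) when IS_RACE_BUILD is "true",
# otherwise print the default limit ($1).
race_memory_limit() {
    if [ "${IS_RACE_BUILD:-}" = "true" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# Print (do not run) a kubectl command that sets the memory limit.
# $1 = deployment, $2 = container, $3 = memory limit
build_override_cmd() {
    echo "kubectl -n stackrox set resources deployment $1 -c=$2 --limits=memory=$3"
}
```

For example, with IS_RACE_BUILD=true, `build_override_cmd config-controller manager "$(race_memory_limit 128Mi 512Mi)"` prints a command that raises the limit to 512Mi; without the variable, the same call keeps 128Mi.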
…yment The config-controller deployment names its container 'manager', not 'config-controller'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… resources The operator reconciles and reverts kubectl resource overrides. For operator-managed deployments (OCP jobs), patch the Central/SecuredCluster CRs directly. For helm/kubectl deployments (GKE jobs), keep using kubectl set resources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The race detector's 5-10x memory multiplier causes scanner-v4 pods to thrash GC or OOM under their default limits. This starves the operator reconciler (config-controller), preventing it from processing CR changes like disabling Scanner V4 — causing the "Upgrade multi-namespace installation" test to consistently time out waiting for deployment deletion. Increase memory limits for scanner-v4-indexer (3Gi->6Gi), scanner-v4-matcher (3Gi->6Gi), and scanner-v4-db (8Gi->16Gi) when IS_RACE_BUILD is set. Values are intentionally generous; can be tuned down once we have passing runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ost-deploy patches The operator reconciles deployments back to CR spec values, so post-deploy kubectl patches and CR patches were being overwritten. Move memory limit overrides into the CR YAML templates as envsubst variables, set from deploy_central_via_operator/deploy_sensor_via_operator based on IS_RACE_BUILD. Non-operator (Helm/roxctl) path keeps using kubectl set resources as a post-deploy fallback since Helm manages its own values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
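A minimal sketch of the envsubst approach from this commit, under stated assumptions: the variable name SCANNER_V4_MATCHER_MEMORY_LIMIT and the helpers export_limit and render_cr are hypothetical, and the real templates may use different names. The 3Gi and 6Gi values in the usage note come from the scanner-v4 commit message above.

```shell
#!/bin/sh
# Hypothetical helpers; variable and function names are assumptions.

# Export variable $1 with the default value $2, or the race value $3
# when IS_RACE_BUILD is "true".
export_limit() {
    if [ "${IS_RACE_BUILD:-}" = "true" ]; then
        export "$1=$3"
    else
        export "$1=$2"
    fi
}

# Render a CR template ($1) by substituting exported variables.
# Requires envsubst (GNU gettext) on the PATH.
render_cr() {
    envsubst < "$1"
}
```

Usage would look like `export_limit SCANNER_V4_MATCHER_MEMORY_LIMIT 3Gi 6Gi` followed by `render_cr central-cr.envsubst.yaml | kubectl apply -f -`, where the template references `${SCANNER_V4_MATCHER_MEMORY_LIMIT}`. Because the limit is part of the CR the operator reconciles from, the operator has nothing to revert.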
The "[Operator] Upgrade multi-namespace installation" test deploys the old operator (4.10.1) first, which doesn't support all resource fields in the CR. After upgrading to the new operator, the CR still has the old default memory limits. Under race-detector builds, this causes OOMKills that prevent the operator from reconciling scanner-v4 disable requests. Patch the Central and SecuredCluster CRs with increased memory limits immediately after the operator upgrade, so the new operator reconciles the deployments with sufficient memory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
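The post-upgrade CR patch could look something like the sketch below. The field path spec.scannerV4.<component>.resources.limits.memory is an assumption about the CR layout, and memory_patch_json is a hypothetical helper that only builds the merge-patch body.

```shell
#!/bin/sh
# Hypothetical helper; the CR field layout is an assumption.

# Print a JSON merge patch that sets the memory limit for one Scanner V4
# component. $1 = component key (e.g. matcher), $2 = memory limit
memory_patch_json() {
    printf '{"spec":{"scannerV4":{"%s":{"resources":{"limits":{"memory":"%s"}}}}}}' "$1" "$2"
}
```

The body could then be passed to `kubectl patch` with `--type=merge -p "$(memory_patch_json matcher 6Gi)"` against the Central CR (the resource name and namespace depend on the test setup). A merge patch only touches the listed fields, so the rest of the CR written by the old operator stays as it was.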
…grade The configAsCode CR field doesn't control the config-controller deployment's resource limits. Patch the deployment directly as a fallback to prevent OOMKills under race-detector builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ilds The operator's config-controller runs at 128Mi and OOMKills under the race detector. The operator continuously reconciles it back to 128Mi, so we can't increase the limit via CR or deployment patches. Instead, increase the deployment deletion timeout from 4m to 10m for race builds, giving the crashlooping config-controller enough cycles to eventually process the scanner-v4 disable reconciliation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
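The timeout change in this commit can be pictured with the sketch below. It is not the body of verify_deployment_deletion_with_timeout; deletion_timeout_seconds and wait_until_gone are hypothetical names, and the 5-second poll interval is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers; names and the poll interval are assumptions.

# Print the deletion timeout in seconds: 10m for race builds, 4m otherwise.
deletion_timeout_seconds() {
    if [ "${IS_RACE_BUILD:-}" = "true" ]; then
        echo 600
    else
        echo 240
    fi
}

# Poll until the given command fails (the resource is gone) or the timeout
# in seconds ($1) has elapsed. Returns 0 if gone, 1 on timeout.
wait_until_gone() {
    timeout_s="$1"
    shift
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 0
}
```

A call such as `wait_until_gone "$(deletion_timeout_seconds)" kubectl -n stackrox get deployment scanner-v4-matcher` would keep polling while `kubectl get` still finds the deployment (the namespace and deployment name here are illustrative). The longer window does not fix the OOMKills; it only gives a crashlooping reconciler more attempts to finish.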
cd8c64f to
498c169
Compare
|
@janisz I'll hold on this. If/when we need to increase the memory for these components in -race testing, we can revisit this? And possibly it will not be needed 🤞 |
|
@davdhacs: The following test failed, say
Description
The busybox-style consolidated binary (ROX-33958) runs init() for all
components at startup. Under the race detector's ~5-10x memory multiplier
(triggered by the ci-race-tests label setting IS_RACE_BUILD=true),
this causes OOMKills for components with tight memory limits.
The most critical victim is config-controller (the operator reconciler,
128Mi default): when it OOMKills, the operator cannot reconcile CR changes
such as disabling Scanner V4, causing the
[Operator] Upgrade multi-namespace installation test to time out waiting for deployment deletion.
Changes
CR template resource overrides (operator fresh installs):
central-cr.envsubst.yaml and secured-cluster-cr-with-scanner-v4.envsubst.yaml as envsubst variables
deploy_central_via_operator/deploy_sensor_via_operator set higher values when IS_RACE_BUILD is set
scanner-v4-matcher 2Gi→6Gi, scanner-v4-db 1Gi→4Gi, admission-control 500Mi→2Gi
Post-upgrade CR + deployment patches (operator upgrade test):
CRs with its own defaults — our CR template overrides don't apply
AND directly patch the config-controller deployment
Extended deletion timeout (operator upgrade test):
overridden via CR), so it continues to OOMKill under the race detector
verify_deployment_deletion_with_timeout from 4m to 10m for race builds, giving the crashlooping config-controller enough cycles to
process the scanner-v4 disable reconciliation
Non-operator fallback (Helm/roxctl tests):
kubectl set resources post-deploy patches for the same components
hardcoded values)
User-facing documentation
Testing and quality
Automated testing
How I validated my change
IS_RACE_BUILD is set by the ci-race-tests label via CI build logs
scanner-v4-matcher (3Gi) in failing runs
(tests 6, 7 pass)
test (test 8 — previously failed 8+ consecutive times, now passes)
previously hardcoded values)