fix(tests): increase memory limits for race-detector builds #19951

Closed
davdhacs wants to merge 8 commits into master from davdhacs/fix-rcd-oom-race-overrides

Conversation

@davdhacs davdhacs commented Apr 10, 2026

Description

The busybox-style consolidated binary (ROX-33958) runs init() for all
components at startup. Under the race detector's ~5-10x memory multiplier
(triggered by the ci-race-tests label setting IS_RACE_BUILD=true),
this causes OOMKills for components with tight memory limits.

The most critical victim is config-controller (the operator reconciler,
128Mi default) — when it OOMKills, the operator cannot reconcile CR changes
such as disabling Scanner V4, causing the [Operator] Upgrade multi-namespace installation test to time out waiting for deployment deletion.

Changes

CR template resource overrides (operator fresh installs):

  • Parameterized memory limits in central-cr.envsubst.yaml and
    secured-cluster-cr-with-scanner-v4.envsubst.yaml as envsubst variables
  • deploy_central_via_operator / deploy_sensor_via_operator set higher
    values when IS_RACE_BUILD is set
  • Bumped: config-controller 128Mi→512Mi, scanner-v4-indexer 2Gi→6Gi,
    scanner-v4-matcher 2Gi→6Gi, scanner-v4-db 1Gi→4Gi, admission-control
    500Mi→2Gi
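
A minimal sketch of how the race-build values above could be selected before rendering the templates. The function name and the exported variable names are hypothetical (the actual envsubst variable names are not shown here); only the limits themselves come from the list above.

```shell
# Echo the memory limit for a component: the default, or the race-build
# value when IS_RACE_BUILD=true. Each entry is "default:race".
memory_limit_for() {
    case "$1" in
        config-controller)  pair="128Mi:512Mi" ;;
        scanner-v4-indexer) pair="2Gi:6Gi" ;;
        scanner-v4-matcher) pair="2Gi:6Gi" ;;
        scanner-v4-db)      pair="1Gi:4Gi" ;;
        admission-control)  pair="500Mi:2Gi" ;;
        *) echo "unknown component: $1" >&2; return 1 ;;
    esac
    if [ "${IS_RACE_BUILD:-false}" = "true" ]; then
        echo "${pair#*:}"     # part after the colon: race-build limit
    else
        echo "${pair%%:*}"    # part before the colon: default limit
    fi
}

# Hypothetical variable names that a template could reference via envsubst.
export SCANNER_V4_INDEXER_MEMORY_LIMIT="$(memory_limit_for scanner-v4-indexer)"
export SCANNER_V4_MATCHER_MEMORY_LIMIT="$(memory_limit_for scanner-v4-matcher)"
export SCANNER_V4_DB_MEMORY_LIMIT="$(memory_limit_for scanner-v4-db)"

# The template would then be rendered and applied, for example:
#   envsubst < central-cr.envsubst.yaml | kubectl apply -f -
```

Keeping the default next to the race value in one place makes it easy to check that non-race runs still get the previously hardcoded limits.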

Post-upgrade CR + deployment patches (operator upgrade test):

  • The upgrade test deploys the old operator (4.10.1) first, which creates
    CRs with its own defaults — our CR template overrides don't apply
  • After upgrading to the new operator, patch both CRs with increased limits
    AND directly patch the config-controller deployment
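
The CR patch step could look roughly like the sketch below. The field path (spec.scannerV4.<component>.resources.limits.memory), the CR name, and the namespace are assumptions for illustration and should be checked against the Central CRD. The function names are hypothetical.

```shell
# Build a JSON merge patch that raises the Scanner V4 memory limits.
# Arguments: indexer limit, matcher limit, db limit.
# The field path is an assumption about the Central CR schema.
scanner_v4_memory_patch() {
    printf '{"spec":{"scannerV4":{'
    printf '"indexer":{"resources":{"limits":{"memory":"%s"}}},' "$1"
    printf '"matcher":{"resources":{"limits":{"memory":"%s"}}},' "$2"
    printf '"db":{"resources":{"limits":{"memory":"%s"}}}' "$3"
    printf '}}}\n'
}

# Apply the patch to the Central CR after the operator upgrade.
# "stackrox-central-services" and "stackrox" are assumed names.
patch_central_cr() {
    kubectl -n "${NAMESPACE:-stackrox}" patch central \
        "${CENTRAL_CR_NAME:-stackrox-central-services}" \
        --type merge -p "$(scanner_v4_memory_patch 6Gi 6Gi 4Gi)"
}
```

A merge patch only touches the listed fields, so the rest of the CR created by the old operator is left as is.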

Extended deletion timeout (operator upgrade test):

  • The operator reconciles config-controller back to 128Mi (can't be
    overridden via CR), so it continues to OOMKill under the race detector
  • Increased verify_deployment_deletion_with_timeout from 4m to 10m for
    race builds, giving the crashlooping config-controller enough cycles to
    process the scanner-v4 disable reconciliation
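
The timeout selection can be sketched as follows. Both helpers are hypothetical stand-ins; the real signature of verify_deployment_deletion_with_timeout is not shown here.

```shell
# Pick the deletion timeout in seconds: 10m for race builds, 4m otherwise.
deletion_timeout_seconds() {
    if [ "${IS_RACE_BUILD:-false}" = "true" ]; then
        echo 600
    else
        echo 240
    fi
}

# Poll until the given "exists" command fails (resource gone) or the
# timeout elapses. Returns 0 when gone, 1 on timeout.
wait_for_deletion() {
    timeout_seconds="$1"; shift
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_seconds" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example (namespace and deployment name are assumptions):
#   wait_for_deletion "$(deletion_timeout_seconds)" \
#       kubectl -n stackrox get deployment scanner-v4-indexer
```

A longer timeout does not stop the OOMKills. It only gives the crashlooping config-controller more restarts in which to finish the reconciliation.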

Non-operator fallback (Helm/roxctl tests):

  • kubectl set resources post-deploy patches for the same components
  • Non-CI deployments are unaffected (defaults match the previously
    hardcoded values)
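
A sketch of the post-deploy fallback for the Helm/roxctl path. The namespace and the admission-control container name are assumptions. The config-controller container is named manager, as noted in a commit message on this PR. KUBECTL can be overridden to preview the commands without touching a cluster.

```shell
# Override with KUBECTL="echo kubectl" to print commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-stackrox}"   # assumed namespace

# Set the memory limit on one container of a deployment.
set_memory_limit() {
    deployment="$1"; container="$2"; limit="$3"
    $KUBECTL -n "$NAMESPACE" set resources "deployment/$deployment" \
        -c "$container" --limits="memory=$limit"
}

# Apply the overrides only for race builds; other runs keep their defaults.
apply_race_memory_overrides() {
    [ "${IS_RACE_BUILD:-false}" = "true" ] || return 0
    set_memory_limit config-controller manager 512Mi
    set_memory_limit admission-control admission-control 2Gi
}
```

This only holds for deployments that no operator reconciles. On operator-managed clusters the same patch would be reverted on the next reconcile.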

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • modified existing tests

How I validated my change

  • Confirmed IS_RACE_BUILD is set by the ci-race-tests label via CI build logs
  • Verified OOMKills on config-controller (128Mi), scanner-v4-indexer (3Gi),
    scanner-v4-matcher (3Gi) in failing runs
  • Confirmed CR template overrides take effect for operator fresh install tests
    (tests 6, 7 pass)
  • Confirmed post-upgrade patches + extended timeout fix the operator upgrade
    test (test 8 — previously failed 8+ consecutive times, now passes)
  • Verified non-CI deployments are unaffected (envsubst defaults match
    previously hardcoded values)
  • All 10 scanner-v4-install tests pass with this change

openshift-ci Bot commented Apr 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@davdhacs davdhacs marked this pull request as ready for review April 10, 2026 22:09
@davdhacs davdhacs requested review from dashrews78 and janisz April 10, 2026 22:09
@davdhacs davdhacs added the ci-race-tests Uses a `-race` build for all e2e tests label Apr 10, 2026

github-actions Bot commented Apr 10, 2026

🚀 Build Images Ready

Images are ready for commit 498c169. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-652-g498c169ccb

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.56%. Comparing base (91eb730) to head (498c169).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19951      +/-   ##
==========================================
- Coverage   49.56%   49.56%   -0.01%     
==========================================
  Files        2764     2764              
  Lines      208442   208442              
==========================================
- Hits       103319   103317       -2     
- Misses      97467    97469       +2     
  Partials     7656     7656              
| Flag | Coverage Δ |
|---|---|
| go-unit-tests | 49.56% <ø> (-0.01%) ⬇️ |

@davdhacs

/test gke-qa-e2e-tests

@davdhacs davdhacs added the auto-retest PRs with this label will be automatically retested if prow checks fails label Apr 12, 2026

davdhacs commented Apr 12, 2026

/retest-required

@rhacs-bot

/retest

3 similar comments

@davdhacs davdhacs removed the auto-retest PRs with this label will be automatically retested if prow checks fails label Apr 12, 2026
@davdhacs davdhacs force-pushed the davdhacs/fix-rcd-oom-race-overrides branch from 9328fc6 to 1ee3f01 Compare April 12, 2026 14:46
davdhacs added a commit that referenced this pull request Apr 12, 2026
Testing whether the [Operator] Upgrade multi-namespace installation test
fails on current master. This test has failed 7+ consecutive times on
PR #19951 (which only changes race-build gated code in tests/e2e/lib.sh).

If it fails here too, the root cause is in master, not in #19951.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs davdhacs force-pushed the davdhacs/fix-rcd-oom-race-overrides branch from e509be6 to 7f22af2 Compare April 13, 2026 03:57
@davdhacs

/retest

davdhacs and others added 6 commits April 14, 2026 07:10
The busybox-style binary consolidation (ROX-33958) runs init() for all
components at startup. Under the race detector's ~5-10x memory multiplier
this causes OOMKills for admission-control (500Mi limit) and
config-controller (128Mi limit).

Override memory limits post-deploy when IS_RACE_BUILD is set, following
the same pattern used for OpenShift CPU overrides (ROX-5334).

- config-controller: 128Mi → 512Mi (race builds only)
- admission-control: 500Mi → 2Gi (race builds only)

Default Helm chart values are unchanged.

Generated with assistance from AI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…yment

The config-controller deployment names its container 'manager', not
'config-controller'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… resources

The operator reconciles and reverts kubectl resource overrides. For
operator-managed deployments (OCP jobs), patch the Central/SecuredCluster
CRs directly. For helm/kubectl deployments (GKE jobs), keep using
kubectl set resources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The race detector's 5-10x memory multiplier causes scanner-v4 pods to
thrash GC or OOM under their default limits. This starves the operator
reconciler (config-controller), preventing it from processing CR changes
like disabling Scanner V4 — causing the "Upgrade multi-namespace
installation" test to consistently time out waiting for deployment deletion.

Increase memory limits for scanner-v4-indexer (3Gi->6Gi),
scanner-v4-matcher (3Gi->6Gi), and scanner-v4-db (8Gi->16Gi) when
IS_RACE_BUILD is set. Values are intentionally generous; can be tuned
down once we have passing runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ost-deploy patches

The operator reconciles deployments back to CR spec values, so
post-deploy kubectl patches and CR patches were being overwritten.
Move memory limit overrides into the CR YAML templates as envsubst
variables, set from deploy_central_via_operator/deploy_sensor_via_operator
based on IS_RACE_BUILD. Non-operator (Helm/roxctl) path keeps using
kubectl set resources as a post-deploy fallback since Helm manages its
own values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The "[Operator] Upgrade multi-namespace installation" test deploys the
old operator (4.10.1) first, which doesn't support all resource fields
in the CR. After upgrading to the new operator, the CR still has the old
default memory limits. Under race-detector builds, this causes OOMKills
that prevent the operator from reconciling scanner-v4 disable requests.

Patch the Central and SecuredCluster CRs with increased memory limits
immediately after the operator upgrade, so the new operator reconciles
the deployments with sufficient memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
davdhacs and others added 2 commits April 14, 2026 07:11
…grade

The configAsCode CR field doesn't control the config-controller
deployment's resource limits. Patch the deployment directly as a
fallback to prevent OOMKills under race-detector builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ilds

The operator's config-controller runs at 128Mi and OOMKills under the
race detector. The operator continuously reconciles it back to 128Mi,
so we can't increase the limit via CR or deployment patches. Instead,
increase the deployment deletion timeout from 4m to 10m for race builds,
giving the crashlooping config-controller enough cycles to eventually
process the scanner-v4 disable reconciliation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs davdhacs force-pushed the davdhacs/fix-rcd-oom-race-overrides branch from cd8c64f to 498c169 Compare April 14, 2026 13:11
@davdhacs

@janisz I'll hold off on this. If and when we need to increase the memory for these components in -race testing, we can revisit it. It may turn out not to be needed 🤞

openshift-ci Bot commented Apr 14, 2026

@davdhacs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/gke-ui-e2e-tests | 498c169 | link | true | /test gke-ui-e2e-tests |

@davdhacs davdhacs closed this Apr 14, 2026

Labels

ai-assisted ci-race-tests Uses a `-race` build for all e2e tests
