fix: Harden informer cache with label selectors and memory optimizations #6242
jyejare wants to merge 2 commits into feast-dev:master
Conversation
jyejare force-pushed from eab7bf4 to aa69c5b
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
jyejare force-pushed from aa69c5b to 6a2995e
Devin Review found 2 new potential issues.
⚠️ 1 issue in files not directly in the diff
⚠️ Status.Selector reports getLabels() instead of getSelectorLabels(), mismatching the actual Deployment selector (infra/feast-operator/internal/controller/services/scaling.go:252-253)
At scaling.go:252-253, cr.Status.Selector is computed from feast.getLabels() which now includes the ManagedByLabelKey. However, the Deployment's actual spec.selector uses getSelectorLabels() (only the name label). The Status.Selector field should reflect the actual selector used to identify the pods belonging to this FeatureStore's deployment. Any tooling or user inspecting .status.selector will see feast.dev/name=X,app.kubernetes.io/managed-by=feast-operator while the real Deployment selector is just feast.dev/name=X.
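A minimal sketch of the suggested fix, assuming getSelectorLabels() returns only the name label (the FeatureStore name below is illustrative, not from the PR):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Assumed output of getSelectorLabels(): only the name label, which
	// matches the Deployment's actual spec.selector.
	selectorLabels := labels.Set{"feast.dev/name": "my-feature-store"}

	// Status.Selector should be the string form of the same selector, so
	// tooling that reads .status.selector matches the real pods.
	fmt.Println(labels.SelectorFromSet(selectorLabels).String())
	// Output: feast.dev/name=my-feature-store
}
```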
View 9 additional findings in Devin Review.
```go
ByObject: map[client.Object]cache.ByObject{
	&corev1.ConfigMap{}:                      managedByFilter,
	&appsv1.Deployment{}:                     managedByFilter,
	&corev1.Service{}:                        managedByFilter,
	&corev1.ServiceAccount{}:                 managedByFilter,
	&corev1.PersistentVolumeClaim{}:          managedByFilter,
	&rbacv1.RoleBinding{}:                    managedByFilter,
	&rbacv1.Role{}:                           managedByFilter,
	&batchv1.CronJob{}:                       managedByFilter,
	&autoscalingv2.HorizontalPodAutoscaler{}: managedByFilter,
	&policyv1.PodDisruptionBudget{}:          managedByFilter,
},
}
```
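For reference, a hedged sketch of how a managedByFilter like the one above can be built with controller-runtime's label-filtered cache; the label key and value come from this PR, but the exact construction is an assumption:

```go
package cacheopts

import (
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// managedByFilter restricts an informer to objects that carry the
// operator's managed-by label; everything else never enters the cache.
var managedByFilter = cache.ByObject{
	Label: labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "feast-operator",
	}),
}
```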
🔴 Cache label filter breaks operator for pre-existing resources on upgrade
The new cache ByObject configuration at infra/feast-operator/cmd/main.go:77-88 restricts the informer caches to watching only resources with the app.kubernetes.io/managed-by: feast-operator label. However, only ConfigMap and Secret are in the DisableFor list (which bypasses the cache). All other types (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) are read through the label-filtered cache.
On upgrade from a prior version, existing operator-managed resources won't have the managed-by label. When the operator reconciles an existing FeatureStore:
- `controllerutil.CreateOrUpdate()` calls `Get()` → goes through the cache → the cache returns NotFound (the object exists in the API server but lacks the label, so the informer never cached it)
- `CreateOrUpdate` then calls `Create()` → fails with AlreadyExists
- The error propagates up; reconciliation fails and retries forever
This is a permanent deadlock: the operator can never update the resource to add the label because it can't find it through the cache. This affects all existing FeatureStore instances after an operator upgrade.
Prompt for agents
The cache ByObject label-selector filtering causes all cached Get/List calls to miss pre-existing resources that lack the managed-by label. On upgrade, this creates a permanent deadlock where the operator cannot reconcile existing FeatureStore instances.
Possible approaches to fix:
1. Add the resource types that might pre-exist (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) to the Client.CacheOptions.DisableFor list, similar to ConfigMap and Secret. This bypasses the cache for reads but still benefits from the label-filtered watch for triggering reconciliation events.
2. Implement a startup migration that labels all existing operator-owned resources (identified by owner references to FeatureStore CRs) with the managed-by label before starting the informers.
3. Use a cache.TransformFunc that adds the label to objects as they are fetched, though this doesn't help with the initial list.
Option 1 is simplest but loses cache benefits for reads. Option 2 is more correct but requires additional startup logic. The key files involved are cmd/main.go (cache config), and all resource creation paths in internal/controller/services/services.go, scaling.go, client.go, cronjob.go, namespace_registry.go, and internal/controller/authz/authz.go.
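A sketch of option 1 under those assumptions, using controller-runtime's client.CacheOptions.DisableFor (the managerOptions function name is illustrative):

```go
package cacheopts

import (
	appsv1 "k8s.io/api/apps/v1"
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// managerOptions routes reads for types that may pre-exist without the
// managed-by label straight to the API server, while the label-filtered
// watches still drive reconcile events.
func managerOptions() ctrl.Options {
	return ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{
					&corev1.ConfigMap{}, &corev1.Secret{},
					&appsv1.Deployment{}, &corev1.Service{},
					&corev1.ServiceAccount{}, &corev1.PersistentVolumeClaim{},
					&rbacv1.RoleBinding{}, &rbacv1.Role{},
					&batchv1.CronJob{},
					&autoscalingv2.HorizontalPodAutoscaler{},
					&policyv1.PodDisruptionBudget{},
				},
			},
		},
	}
}
```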
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
🔴 Namespace registry Role Get from filtered cache fails after tolerated AlreadyExists on Create
In setNamespaceRegistryRoleBinding, the code creates a Role via Client.Create and tolerates AlreadyExists (lines 187-189). It then immediately calls Client.Get on the same Role (lines 193-198). Client.Get goes through the label-filtered cache. If the Role already existed without the managed-by label (pre-upgrade), or even if it was just created (the cache hasn't synced the watch event yet), Get returns NotFound and the function returns an error. This breaks namespace registry reconciliation on upgrade and introduces a race condition even on fresh installs.
(Refers to lines 182-207)
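For illustration, a hedged reconstruction of the pattern the finding describes; ensureRole and its signature are placeholders, not the operator's exact code:

```go
package registry

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureRole mirrors the failure mode: Create tolerating AlreadyExists,
// followed by a Get that can report NotFound for an unlabeled or
// not-yet-synced Role even though it exists in the API server.
func ensureRole(ctx context.Context, cl client.Client, role *rbacv1.Role) error {
	if err := cl.Create(ctx, role); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	// Served by the label-filtered informer cache, not the API server.
	return cl.Get(ctx, client.ObjectKeyFromObject(role), role)
}
```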
Prompt for agents
In setNamespaceRegistryRoleBinding (namespace_registry.go), the Role is created with Client.Create (AlreadyExists tolerated), then immediately re-fetched with Client.Get through the filtered cache. This fails if (a) the Role existed before the managed-by label was introduced, or (b) the cache informer hasn't processed the watch event yet for a newly created Role.
The fix should switch from the manual Create-then-Get pattern to using controllerutil.CreateOrUpdate for the Role (similar to how RoleBindings are handled), or use a direct API server read (bypassing cache) for the re-fetch. If using CreateOrUpdate, note that this is also affected by BUG-0001 for the upgrade case. A robust approach would be to use Server-Side Apply (Patch with ApplyPatchType) for the Role, similar to how HPA and PDB are handled in scaling.go, which avoids both the cache lookup and AlreadyExists issues entirely.
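A hedged sketch of the Server-Side Apply route; the Role's name, namespace, labels, and rules below are placeholders for whatever setNamespaceRegistryRoleBinding actually builds:

```go
package registry

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyRole upserts the Role via Server-Side Apply: no cached Get and
// no AlreadyExists handling are needed.
func applyRole(ctx context.Context, cl client.Client) error {
	role := &rbacv1.Role{
		// TypeMeta must be populated for Apply patches.
		TypeMeta: metav1.TypeMeta{
			APIVersion: "rbac.authorization.k8s.io/v1",
			Kind:       "Role",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "feast-namespace-registry", // placeholder
			Namespace: "feast",                    // placeholder
			Labels: map[string]string{
				"app.kubernetes.io/managed-by": "feast-operator",
			},
		},
		Rules: []rbacv1.PolicyRule{{ // placeholder rules
			APIGroups: []string{""},
			Resources: []string{"configmaps"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	// SSA talks to the API server directly; an existing Role is updated
	// in place rather than tripping AlreadyExists.
	return cl.Patch(ctx, role, client.Apply,
		client.FieldOwner("feast-operator"), client.ForceOwnership)
}
```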
Summary
The feast-operator's `Owns()` calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.
Changes

- `ByObject` label selectors for all owned resource types — restrict informer caches to only objects with `app.kubernetes.io/managed-by: feast-operator`. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into `newCacheOptions()` for clarity (see the sketch after this list).
- `DefaultTransform: cache.TransformStripManagedFields()` — strip `managedFields` from all cached objects, reducing per-object memory footprint by ~30-50%.
- `GOMEMLIMIT=230MiB` — set the Go runtime's soft memory limit (90% of the 256Mi container limit). Triggers GC pressure before a hard OOMKill, as defense-in-depth.
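A hedged sketch of what the extracted newCacheOptions() might look like, reusing the filter sketched earlier; the exact wiring in the PR may differ:

```go
package cacheopts

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newCacheOptions wires the label filter together with managedFields
// stripping; managedByFilter is the ByObject filter sketched earlier.
func newCacheOptions(managedByFilter cache.ByObject) cache.Options {
	return cache.Options{
		// Drop managedFields before objects enter the cache.
		DefaultTransform: cache.TransformStripManagedFields(),
		ByObject: map[client.Object]cache.ByObject{
			&corev1.ConfigMap{}:  managedByFilter,
			&appsv1.Deployment{}: managedByFilter,
			// ...the remaining owned types, as in the diff above
		},
	}
}
```

GOMEMLIMIT itself is read by the Go runtime from the environment, so it would typically be set on the operator Deployment's container env rather than in code.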
Additional changes
- Add the `app.kubernetes.io/managed-by: feast-operator` label to `getLabels()` so all FeatureStore-managed resources carry it
- Introduce `getSelectorLabels()` for immutable selectors (Deployment `spec.selector`, Service `spec.selector`, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade (see the sketch after this list)
- Use constants for `app.kubernetes.io/managed-by` (`services.ManagedByLabelKey`/`Value`) throughout
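A hypothetical sketch of the getLabels()/getSelectorLabels() split; the receiver type and field are placeholders, and only the label keys come from this PR:

```go
package services

// feastServices is a placeholder receiver; the real helpers live on the
// operator's services handler type.
type feastServices struct{ name string }

// getLabels returns the full label set stamped on every managed
// resource, now including the managed-by key the cache filter selects on.
func (f *feastServices) getLabels() map[string]string {
	return map[string]string{
		"feast.dev/name":               f.name,
		"app.kubernetes.io/managed-by": "feast-operator",
	}
}

// getSelectorLabels returns only the immutable subset used in Deployment
// and Service spec.selector, so upgrades don't hit "field is immutable"
// errors on existing objects.
func (f *feastServices) getSelectorLabels() map[string]string {
	return map[string]string{"feast.dev/name": f.name}
}
```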
Test Results

Verified on a cluster with a large number of ConfigMaps pre-loaded:
Test plan
- Unit tests (`make test`) — all pass
- `getSelectorLabels()` prevents immutable selector breakage on upgrade