fix: Harden informer cache with label selectors and memory optimizations #6242
jyejare wants to merge 2 commits into feast-dev:master
Conversation
jyejare force-pushed from eab7bf4 to aa69c5b
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
jyejare force-pushed from aa69c5b to 6a2995e
Devin Review found 2 new potential issues.
⚠️ 1 issue in files not directly in the diff
⚠️ Status.Selector reports getLabels() instead of getSelectorLabels(), mismatching the actual Deployment selector (infra/feast-operator/internal/controller/services/scaling.go:252-253)
At scaling.go:252-253, cr.Status.Selector is computed from feast.getLabels() which now includes the ManagedByLabelKey. However, the Deployment's actual spec.selector uses getSelectorLabels() (only the name label). The Status.Selector field should reflect the actual selector used to identify the pods belonging to this FeatureStore's deployment. Any tooling or user inspecting .status.selector will see feast.dev/name=X,app.kubernetes.io/managed-by=feast-operator while the real Deployment selector is just feast.dev/name=X.
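A minimal sketch of the suggested fix, assuming getSelectorLabels() returns only the name label (the FeatureStore name below is illustrative, not from the PR):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Assumed output of getSelectorLabels(): only the name label, which
	// matches the Deployment's actual spec.selector.
	selectorLabels := labels.Set{"feast.dev/name": "my-feature-store"}

	// Status.Selector should be the string form of the same selector, so
	// tooling that reads .status.selector matches the real pods.
	fmt.Println(labels.SelectorFromSet(selectorLabels).String())
	// Output: feast.dev/name=my-feature-store
}
```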
View 9 additional findings in Devin Review.
```go
ByObject: map[client.Object]cache.ByObject{
	&corev1.ConfigMap{}:                      managedByFilter,
	&appsv1.Deployment{}:                     managedByFilter,
	&corev1.Service{}:                        managedByFilter,
	&corev1.ServiceAccount{}:                 managedByFilter,
	&corev1.PersistentVolumeClaim{}:          managedByFilter,
	&rbacv1.RoleBinding{}:                    managedByFilter,
	&rbacv1.Role{}:                           managedByFilter,
	&batchv1.CronJob{}:                       managedByFilter,
	&autoscalingv2.HorizontalPodAutoscaler{}: managedByFilter,
	&policyv1.PodDisruptionBudget{}:          managedByFilter,
},
}
```
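For reference, a hedged sketch of how a managedByFilter like the one above can be built with controller-runtime's label-filtered cache; the label key and value come from this PR, but the exact construction is an assumption:

```go
package cacheopts

import (
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// managedByFilter restricts an informer to objects that carry the
// operator's managed-by label; everything else never enters the cache.
var managedByFilter = cache.ByObject{
	Label: labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "feast-operator",
	}),
}
```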
🔴 Cache label filter breaks operator for pre-existing resources on upgrade
The new cache ByObject configuration at infra/feast-operator/cmd/main.go:77-88 restricts the informer caches to watching only resources with the app.kubernetes.io/managed-by: feast-operator label. However, only ConfigMap and Secret are in the DisableFor list (which bypasses the cache). All other types (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) are read through the label-filtered cache.
On upgrade from a prior version, existing operator-managed resources won't have the managed-by label. When the operator reconciles an existing FeatureStore:
- `controllerutil.CreateOrUpdate()` calls `Get()` → goes through the cache → the cache returns NotFound (the object exists in the API server but lacks the label, so the informer never cached it)
- `CreateOrUpdate` then calls `Create()` → fails with AlreadyExists
- The error propagates up; reconciliation fails and retries forever
This is a permanent deadlock: the operator can never update the resource to add the label because it can't find it through the cache. This affects all existing FeatureStore instances after an operator upgrade.
Prompt for agents
The cache ByObject label-selector filtering causes all cached Get/List calls to miss pre-existing resources that lack the managed-by label. On upgrade, this creates a permanent deadlock where the operator cannot reconcile existing FeatureStore instances.
Possible approaches to fix:
1. Add the resource types that might pre-exist (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) to the Client.CacheOptions.DisableFor list, similar to ConfigMap and Secret. This bypasses the cache for reads but still benefits from the label-filtered watch for triggering reconciliation events.
2. Implement a startup migration that labels all existing operator-owned resources (identified by owner references to FeatureStore CRs) with the managed-by label before starting the informers.
3. Use a cache.TransformFunc that adds the label to objects as they are fetched, though this doesn't help with the initial list.
Option 1 is simplest but loses cache benefits for reads. Option 2 is more correct but requires additional startup logic. The key files involved are cmd/main.go (cache config), and all resource creation paths in internal/controller/services/services.go, scaling.go, client.go, cronjob.go, namespace_registry.go, and internal/controller/authz/authz.go.
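A sketch of option 1 under those assumptions, using controller-runtime's client.CacheOptions.DisableFor (the managerOptions function name is illustrative):

```go
package cacheopts

import (
	appsv1 "k8s.io/api/apps/v1"
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// managerOptions routes reads for types that may pre-exist without the
// managed-by label straight to the API server, while the label-filtered
// watches still drive reconcile events.
func managerOptions() ctrl.Options {
	return ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{
					&corev1.ConfigMap{}, &corev1.Secret{},
					&appsv1.Deployment{}, &corev1.Service{},
					&corev1.ServiceAccount{}, &corev1.PersistentVolumeClaim{},
					&rbacv1.RoleBinding{}, &rbacv1.Role{},
					&batchv1.CronJob{},
					&autoscalingv2.HorizontalPodAutoscaler{},
					&policyv1.PodDisruptionBudget{},
				},
			},
		},
	}
}
```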
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
🔴 Namespace registry Role Get from filtered cache fails after tolerated AlreadyExists on Create
In setNamespaceRegistryRoleBinding, the code creates a Role via Client.Create and tolerates AlreadyExists (lines 187-189). It then immediately calls Client.Get on the same Role (lines 193-198). Client.Get goes through the label-filtered cache. If the Role already existed without the managed-by label (pre-upgrade), or even if it was just created (the cache hasn't synced the watch event yet), Get returns NotFound and the function returns an error. This breaks namespace registry reconciliation on upgrade and introduces a race condition even on fresh installs.
(Refers to lines 182-207)
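For illustration, a hedged reconstruction of the pattern the finding describes; ensureRole and its signature are placeholders, not the operator's exact code:

```go
package registry

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureRole mirrors the failure mode: Create tolerating AlreadyExists,
// followed by a Get that can report NotFound for an unlabeled or
// not-yet-synced Role even though it exists in the API server.
func ensureRole(ctx context.Context, cl client.Client, role *rbacv1.Role) error {
	if err := cl.Create(ctx, role); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	// Served by the label-filtered informer cache, not the API server.
	return cl.Get(ctx, client.ObjectKeyFromObject(role), role)
}
```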
Prompt for agents
In setNamespaceRegistryRoleBinding (namespace_registry.go), the Role is created with Client.Create (AlreadyExists tolerated), then immediately re-fetched with Client.Get through the filtered cache. This fails if (a) the Role existed before the managed-by label was introduced, or (b) the cache informer hasn't processed the watch event yet for a newly created Role.
The fix should switch from the manual Create-then-Get pattern to using controllerutil.CreateOrUpdate for the Role (similar to how RoleBindings are handled), or use a direct API server read (bypassing cache) for the re-fetch. If using CreateOrUpdate, note that this is also affected by BUG-0001 for the upgrade case. A robust approach would be to use Server-Side Apply (Patch with ApplyPatchType) for the Role, similar to how HPA and PDB are handled in scaling.go, which avoids both the cache lookup and AlreadyExists issues entirely.
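A hedged sketch of the Server-Side Apply route; the Role's name, namespace, labels, and rules below are placeholders for whatever setNamespaceRegistryRoleBinding actually builds:

```go
package registry

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyRole upserts the Role via Server-Side Apply: no cached Get and
// no AlreadyExists handling are needed.
func applyRole(ctx context.Context, cl client.Client) error {
	role := &rbacv1.Role{
		// TypeMeta must be populated for Apply patches.
		TypeMeta: metav1.TypeMeta{
			APIVersion: "rbac.authorization.k8s.io/v1",
			Kind:       "Role",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "feast-namespace-registry", // placeholder
			Namespace: "feast",                    // placeholder
			Labels: map[string]string{
				"app.kubernetes.io/managed-by": "feast-operator",
			},
		},
		Rules: []rbacv1.PolicyRule{{ // placeholder rules
			APIGroups: []string{""},
			Resources: []string{"configmaps"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	// SSA talks to the API server directly; an existing Role is updated
	// in place rather than tripping AlreadyExists.
	return cl.Patch(ctx, role, client.Apply,
		client.FieldOwner("feast-operator"), client.ForceOwnership)
}
```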
Summary
The feast-operator's `Owns()` calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.
Changes

- `ByObject` label selectors for all owned resource types — restrict informer caches to only objects with `app.kubernetes.io/managed-by: feast-operator`. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into `newCacheOptions()` for clarity (see the sketch after this list).
- `DefaultTransform: cache.TransformStripManagedFields()` — strip `managedFields` from all cached objects, reducing per-object memory footprint by ~30-50%.
- `GOMEMLIMIT=230MiB` — set the Go runtime's soft memory limit (90% of the 256Mi container limit). Triggers GC pressure before a hard OOMKill, as defense-in-depth.
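A hedged sketch of what the extracted newCacheOptions() might look like, reusing the filter sketched earlier; the exact wiring in the PR may differ:

```go
package cacheopts

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newCacheOptions wires the label filter together with managedFields
// stripping; managedByFilter is the ByObject filter sketched earlier.
func newCacheOptions(managedByFilter cache.ByObject) cache.Options {
	return cache.Options{
		// Drop managedFields before objects enter the cache.
		DefaultTransform: cache.TransformStripManagedFields(),
		ByObject: map[client.Object]cache.ByObject{
			&corev1.ConfigMap{}:  managedByFilter,
			&appsv1.Deployment{}: managedByFilter,
			// ...the remaining owned types, as in the diff above
		},
	}
}
```

GOMEMLIMIT itself is read by the Go runtime from the environment, so it would typically be set on the operator Deployment's container env rather than in code.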
Additional changes
- Add the `app.kubernetes.io/managed-by: feast-operator` label to `getLabels()` so all FeatureStore-managed resources carry it
- Introduce `getSelectorLabels()` for immutable selectors (Deployment `spec.selector`, Service `spec.selector`, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade (see the sketch after this list)
- Use constants for `app.kubernetes.io/managed-by` (`services.ManagedByLabelKey`/`Value`) throughout
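A hypothetical sketch of the getLabels()/getSelectorLabels() split; the receiver type and field are placeholders, and only the label keys come from this PR:

```go
package services

// feastServices is a placeholder receiver; the real helpers live on the
// operator's services handler type.
type feastServices struct{ name string }

// getLabels returns the full label set stamped on every managed
// resource, now including the managed-by key the cache filter selects on.
func (f *feastServices) getLabels() map[string]string {
	return map[string]string{
		"feast.dev/name":               f.name,
		"app.kubernetes.io/managed-by": "feast-operator",
	}
}

// getSelectorLabels returns only the immutable subset used in Deployment
// and Service spec.selector, so upgrades don't hit "field is immutable"
// errors on existing objects.
func (f *feastServices) getSelectorLabels() map[string]string {
	return map[string]string{"feast.dev/name": f.name}
}
```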
Test Results

Verified on a cluster with a large number of ConfigMaps pre-loaded:
Test plan
- Unit tests (`make test`) — all pass
- `getSelectorLabels()` prevents immutable selector breakage on upgrade