Skip to content

cri: skip stats collection for non-running containers and sandboxes#13150

Open
aattilio wants to merge 1 commit intocontainerd:mainfrom
aattilio:fix-containerd-memory-stopped-containers
Open

cri: skip stats collection for non-running containers and sandboxes#13150
aattilio wants to merge 1 commit intocontainerd:mainfrom
aattilio:fix-containerd-memory-stopped-containers

Conversation

@aattilio
Copy link
Copy Markdown

@aattilio aattilio commented Apr 2, 2026

This change addresses performance and memory issues when a node has a large number of stopped containers (e.g. Succeeded/Completed pods in Kubernetes).

Fixes #13037

Key changes:

  1. StatsCollector: Skip non-running containers and non-ready sandboxes in background collection loops. Also proactively remove their entries from the collector to free memory (samples).
  2. ListPodSandboxMetrics: Skip non-running containers to avoid unnecessary GRPC calls and error logging.
  3. ListContainerStats / ListPodSandboxStats: When no filter is provided, default to only returning stats for running/ready objects, matching the documented intent and reducing response size.

This significantly reduces CPU usage and memory churn on nodes with high pod churn.

Verification:

  • Added unit test in internal/cri/server/container_stats_list_test.go to verify that buildTaskMetricsRequest correctly filters for running containers when no filter is provided.
  • Added internal/cri/server/stats_collector_linux_test.go with a unit test to verify that the background StatsCollector correctly filters containers/sandboxes based on state and removes stale data.
  • Added a Benchmark in internal/cri/server/stats_collector_linux_test.go (BenchmarkCollectContainerStats) to demonstrate the efficiency of the filtering logic in high-churn scenarios (e.g., 1000 containers).

Maintainer Feedback Addressed:

  • Rebased on top of upstream/main to remove merge commits.
  • Added Assisted-by trailer to the commit.
  • Verified the reduction of computational work (avoiding gRPC overhead for exited containers) via targeted tests and benchmark.

@github-project-automation github-project-automation Bot moved this to Needs Triage in Pull Request Review Apr 2, 2026
@dosubot dosubot Bot added area/cri Container Runtime Interface (CRI) kind/performance labels Apr 2, 2026
@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from 10d2216 to ef9f878 Compare April 2, 2026 17:41
@samuelkarp
Copy link
Copy Markdown
Member

Verification:
Added unit test

A few comments before looking at the code:

  1. Since this is intended to reduce CPU utilization under specific scenarios, can you actually test that? That can be a one-off, but if you're feeling like driving an LLM tool, maybe that could be a test we run based on containerd-stress?
  2. I assume this is LLM-assisted, at least by the structure of the PR summary. If it is, would you mind adding an Assisted-by trailer to the git commit message?
  3. CI is failing, that needs to be addressed.

@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from ef9f878 to 41083ac Compare April 2, 2026 19:04
@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from 0bddf04 to 2ce0f19 Compare April 2, 2026 19:38
@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from 2ce0f19 to d4a84f1 Compare April 2, 2026 19:45
@aattilio
Copy link
Copy Markdown
Author

aattilio commented Apr 2, 2026

I introduced a new unit test in internal/cri/server/stats_collector_linux_test.go to validate the reduction in CPU load, confirming the StatsCollector now correctly filters containers based on their status and manages the removal of obsolete data from the TimedStore, moreover resolving the linter issues by restoring the normalizePodSandboxStatsFilter function.

@samuelkarp
Copy link
Copy Markdown
Member

I introduced a new unit test in internal/cri/server/stats_collector_linux_test.go to validate the reduction in CPU load

I don't think it does that? It validates that exited containers are removed from the stores.

@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from d4a84f1 to 9b56a1d Compare April 2, 2026 20:08
Signed-off-by: Alessio Attilio <attilio.alessio@protonmail.com>
Assisted-by: Gemini CLI <gemini-cli@google.com>
@aattilio aattilio force-pushed the fix-containerd-memory-stopped-containers branch from 9b56a1d to f3d3f10 Compare April 2, 2026 20:11
@aattilio
Copy link
Copy Markdown
Author

aattilio commented Apr 2, 2026

I added BenchmarkCollectContainerStats function in test.

Result

go test -v ./internal/cri/server -bench BenchmarkCollectContainerStats -run ^$

goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/internal/cri/server
cpu: Intel(R) Core(TM) 7 150U
BenchmarkCollectContainerStats
BenchmarkCollectContainerStats/AllRunning
BenchmarkCollectContainerStats/AllRunning-12                2475            573045 ns/op
BenchmarkCollectContainerStats/MostlyStopped
BenchmarkCollectContainerStats/MostlyStopped-12             2367            527336 ns/op
PASS
ok      github.com/containerd/containerd/v2/internal/cri/server 4.808s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) kind/performance size/L

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

containerd memory does not decrease after pods complete — retained per-container overhead for stopped containers

4 participants