Fix calltrace_storage counters accumulating unboundedly across rotations#427

Merged
jbachorik merged 8 commits into main from jb/fix_calltrace_counters2
Apr 20, 2026

Conversation

@jbachorik
Collaborator

What does this PR do?:
Fixes calltrace_storage_bytes and calltrace_storage_traces counters that grew monotonically (cumulative total ever allocated) instead of reflecting the current live memory footprint.

Motivation:
clearTableOnly() frees the allocator chunks but never decrements the global counters. During each processTraces() rotation:

  • Preserved traces are re-inserted via putWithExistingId() → counters incremented again
  • Old standby and active tables are cleared → counters NOT decremented

Over time (~2 rotations/min × 20 hours, i.e. roughly 2,400 rotations) the counters summed every trace byte ever allocated and reported 3.35 GB of live memory when the actual footprint was tens of MB.

Additional Notes:
clearTableOnly() is called from both the direct clear() path and the deferred-free path in processTraces(), so a single fix covers both cases.

The decrement pass uses the same slot iteration as collect() and runs after waitForAllRefCountsToClear(), which guarantees no concurrent writers — so the byte/trace counts are stable and exact.
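The decrement pass described above can be sketched as a minimal stand-alone model. Note that Table, Slot, and decrementCounters below are simplified stand-ins for the real LongHashTable/CallTrace code, not the actual ddprof-lib implementation; the pointer deduplication reflects the fact that put() may reuse the same CallTrace* across chained tables:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Simplified stand-ins for the real CallTrace / LongHashTable types.
struct CallTrace { size_t size_bytes; };
struct Slot { CallTrace* trace = nullptr; };
struct Table {
  std::vector<Slot> slots;
  Table* prev = nullptr;  // chained tables, newest first
};

std::atomic<long long> CALLTRACE_STORAGE_BYTES{0};
std::atomic<long long> CALLTRACE_STORAGE_TRACES{0};

// Walk every slot of every table in the chain, dedup CallTrace* so a
// pointer shared across chained tables is only counted once, then
// subtract the freed bytes/traces from the global counters. Safe to run
// single-threaded after all ref counts have cleared (no concurrent writers).
void decrementCounters(Table* head) {
  size_t estimated_entries = 0;
  for (Table* t = head; t != nullptr; t = t->prev) {
    estimated_entries += t->slots.size();
  }
  std::unordered_set<CallTrace*> seen;
  seen.reserve(estimated_entries);  // avoid rehashing on large tables

  long long freed_bytes = 0;
  long long freed_traces = 0;
  for (Table* t = head; t != nullptr; t = t->prev) {
    for (const Slot& s : t->slots) {
      if (s.trace != nullptr && seen.insert(s.trace).second) {
        freed_bytes += static_cast<long long>(s.trace->size_bytes);
        ++freed_traces;
      }
    }
  }
  CALLTRACE_STORAGE_BYTES -= freed_bytes;
  CALLTRACE_STORAGE_TRACES -= freed_traces;
}
```

Because the pass runs with no concurrent writers, plain loads and subtractions are exact: after clearing, the counters reflect only traces still live elsewhere.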

How to test the change?:
Run the profiler for several upload cycles and verify calltrace_storage_bytes and calltrace_storage_traces stay bounded rather than growing indefinitely. The values should reflect the actual live traces in the three hash tables at any point.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

Unsure? Have a question? Request a review!

clearTableOnly() was freeing memory without decrementing the global
CALLTRACE_STORAGE_BYTES/TRACES counters, causing them to grow
monotonically (cumulative total ever allocated, not current live size).

Add a decrement pass in clearTableOnly() using the same table
iteration as collect(), computed after waitForAllRefCountsToClear()
ensures no concurrent writers. Both the direct clear() path and
the deferred-free processTraces() path go through clearTableOnly(),
so a single fix covers both.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik added the AI label Mar 20, 2026
@dd-octo-sts
Contributor

dd-octo-sts bot commented Mar 20, 2026

CI Test Results

Run: #24658333136 | Commit: 72a70ab | Duration: 24m 28s (longest job)

All 32 test jobs passed

Status Overview

All 32 executed jobs passed. The matrix covers JDKs 8 through 25 (including ibm, j9, librca, orcl, and graal variants) on glibc and musl, aarch64 and amd64 debug builds.

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-04-20 09:42:13 UTC

@jbachorik jbachorik changed the title Fix calltrace_storage counters accumulating unboundedly across rotations [WIP] Fix calltrace_storage counters accumulating unboundedly across rotations Mar 22, 2026
jbachorik and others added 6 commits March 22, 2026 20:37
- Deduplicate trace pointers in clearTableOnly() to avoid over-decrement
  when put() reuses the same CallTrace* across chained tables
- Update test assertions: counters are 0 after stop (correct for leak
  detection), not in range [1,100]

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Live counters are 0 after stop because processTraces() decrements them
during finishChunk(). The recording already captures pre-cleanup values
(writeCounters runs before writeCpool/processTraces). Read those instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Fixes calltrace_storage_bytes / calltrace_storage_traces so they reflect current live calltrace storage instead of monotonically accumulating across processTraces() rotations, and updates tests to validate counter values from the JFR recording (pre-final-cleanup).

Changes:

  • Decrement calltrace_storage_* counters when clearing calltrace hash tables to prevent unbounded accumulation across rotations.
  • Gate unwind timing measurements in profiler.cpp behind #ifdef COUNTERS to avoid overhead when counters are disabled.
  • Update CPU/wallclock context tests to assert recorded (JFR) counter values via a new AbstractProfilerTest#getRecordedCounterValue() helper.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ddprof-test/src/test/java/com/datadoghq/profiler/wallclock/BaseContextWallClockTest.java Switches counter assertions to JFR-recorded values (pre-cleanup).
ddprof-test/src/test/java/com/datadoghq/profiler/cpu/ContextCpuTest.java Switches counter assertions to JFR-recorded values (pre-cleanup).
ddprof-test/src/test/java/com/datadoghq/profiler/AbstractProfilerTest.java Adds helper to read datadog.ProfilerCounter values from the recording.
ddprof-lib/src/main/cpp/profiler.cpp Wraps unwind timing measurements with #ifdef COUNTERS.
ddprof-lib/src/main/cpp/callTraceHashTable.h / .cpp Adds counter-decrement pass during table clears to prevent counter accumulation.


Comment on lines +115 to +116
long long freed_traces = 0;
std::unordered_set<CallTrace*> seen;

Copilot AI Apr 20, 2026


decrementCounters() builds an unordered_set of seen CallTrace* but never reserves capacity. On large tables this will cause repeated rehashing while iterating every slot. Since LongHashTable::size() is available, consider summing sizes across the chain and calling seen.reserve(estimatedDistinct) before insertion to keep the clear/rotation overhead bounded.

Suggested change:

      long long freed_traces = 0;
    + size_t estimated_entries = 0;
    + for (LongHashTable *t = _table; t != nullptr; t = t->prev()) {
    +   estimated_entries += t->size();
    + }
      std::unordered_set<CallTrace*> seen;
    + seen.reserve(estimated_entries);

Collaborator Author


Addressed — pre-computed the sum of sizes across the chain and called seen.reserve(estimated_entries) to avoid repeated rehashing on large tables.

assertInRange(test.getRecordedCounterValue("calltrace_storage_traces"), 1, 100);
assertInRange(test.getRecordedCounterValue("calltrace_storage_bytes"), 1024, 8 * 1024 * 1024);
// live counters are 0 after stop (all traces freed - correct, non-leaking behaviour)
Map<String, Long> debugCounters = profiler.getDebugCounters();

Copilot AI Apr 20, 2026


The comment says live counters are 0 after stopProfiler(), but the test no longer asserts anything about calltrace_storage_traces/bytes in debugCounters. Since this PR changes counter semantics, it would be good to assert they are actually 0 here (and thus verify the post-cleanup state matches the intended non-leaking behavior).

Suggested change:

      Map<String, Long> debugCounters = profiler.getDebugCounters();
    + assertEquals(0, debugCounters.get("calltrace_storage_traces"));
    + assertEquals(0, debugCounters.get("calltrace_storage_bytes"));

Collaborator Author


Addressed — added assertions for calltrace_storage_traces and calltrace_storage_bytes being 0 after getDebugCounters() to verify the non-leaking post-cleanup state.

assertInRange(getRecordedCounterValue("calltrace_storage_traces"), 1, 100);
assertInRange(getRecordedCounterValue("calltrace_storage_bytes"), 1024, 8 * 1024 * 1024);
// live counters are 0 after stop (all traces freed - correct, non-leaking behaviour)
Map<String, Long> debugCounters = profiler.getDebugCounters();

Copilot AI Apr 20, 2026


The test now validates the recorded (pre-cleanup) counter values, but it no longer asserts the live calltrace_storage_* counters after stopProfiler(). Given the fix is about counters not accumulating/leaking, consider asserting debugCounters.get("calltrace_storage_traces") and debugCounters.get("calltrace_storage_bytes") are 0 here (consistent with the nearby comment and linear_allocator_* == 0).

Suggested change:

      Map<String, Long> debugCounters = profiler.getDebugCounters();
    + assertEquals(0, debugCounters.get("calltrace_storage_traces"));
    + assertEquals(0, debugCounters.get("calltrace_storage_bytes"));

Collaborator Author


Addressed — same fix: added assertions for calltrace_storage_traces and calltrace_storage_bytes being 0 after stopProfiler(), consistent with the nearby comment and linear_allocator checks.

@jbachorik jbachorik marked this pull request as ready for review April 20, 2026 07:50
@jbachorik jbachorik requested a review from a team as a code owner April 20, 2026 07:50
- callTraceHashTable.cpp:116 — pre-reserve unordered_set to avoid rehashing
- BaseContextWallClockTest.java:204 — assert calltrace_storage_* counters are 0 after stop
- ContextCpuTest.java:114 — assert calltrace_storage_* counters are 0 after stop

Co-Authored-By: muse <muse@noreply>
@jbachorik jbachorik changed the title [WIP] Fix calltrace_storage counters accumulating unboundedly across rotations Fix calltrace_storage counters accumulating unboundedly across rotations Apr 20, 2026
@jbachorik jbachorik requested a review from Copilot April 20, 2026 08:27
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

ddprof-lib/src/main/cpp/callTraceHashTable.cpp:154

  • In clearTableOnly(), the loop that “clears previous chain pointers” doesn’t actually traverse the whole chain: once table->setPrev(nullptr) runs, the for loop’s increment (table = table->prev()) becomes nullptr and exits after the first iteration. prev_table is computed but never used. Consider iterating via the saved prev_table (or explicitly documenting/simplifying to only break the head link) so the code matches the intent and avoids confusion.
  // Clear previous chain pointers to prevent traversal during deallocation
  for (LongHashTable *table = _table; table != nullptr; table = table->prev()) {
    LongHashTable *prev_table = table->prev();
    if (prev_table != nullptr) {
      table->setPrev(nullptr);  // Clear link before deallocation
    }
  }
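A corrected version of that loop can be sketched with a minimal stand-in Table type (hypothetical, not the real LongHashTable): capture prev() into a local before breaking the link, then advance via the saved copy.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for LongHashTable's chain linkage.
struct Table {
  Table* prev_ = nullptr;
  Table* prev() const { return prev_; }
  void setPrev(Table* p) { prev_ = p; }
};

// Save the prev pointer BEFORE clearing the link, then advance via the
// saved copy, so every table in the chain is visited rather than only
// the head (whose cleared prev() would otherwise end the loop early).
void unlinkChain(Table* head) {
  for (Table* t = head; t != nullptr;) {
    Table* next = t->prev();  // captured before the link is broken
    t->setPrev(nullptr);
    t = next;
  }
}
```

With this shape, the unused prev_table variable from the flagged snippet disappears, and the loop's termination no longer depends on the link it just cleared.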


@jbachorik jbachorik merged commit f9ea8f2 into main Apr 20, 2026
146 checks passed
@jbachorik jbachorik deleted the jb/fix_calltrace_counters2 branch April 20, 2026 09:15
@github-actions github-actions bot added this to the 1.41.0 milestone Apr 20, 2026