Fix calltrace_storage counters accumulating unboundedly across rotations #427
Conversation
clearTableOnly() was freeing memory without decrementing the global CALLTRACE_STORAGE_BYTES/TRACES counters, causing them to grow monotonically (cumulative total ever allocated, not current live size). Add a decrement pass in clearTableOnly() using the same table iteration as collect(), computed after waitForAllRefCountsToClear() ensures no concurrent writers. Both the direct clear() path and the deferred-free processTraces() path go through clearTableOnly(), so a single fix covers both. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
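To make the fix concrete, here is a minimal, self-contained sketch of such a decrement pass. The types below (`CallTrace`, `LongHashTable` with `prev()`/`size()`, `traceBytes()`) are toy stand-ins that only mirror the names used in this PR, not the actual ddprof-lib classes; the real iteration reuses `collect()`'s slot walk.

```cpp
#include <atomic>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy stand-ins for the real types (names are assumptions, not the actual API).
struct CallTrace { size_t num_frames; };

static std::atomic<long long> CALLTRACE_STORAGE_BYTES{0};
static std::atomic<long long> CALLTRACE_STORAGE_TRACES{0};

// Each table in the chain holds trace pointers; put() may reuse the same
// CallTrace* across chained tables, hence the dedup set below.
struct LongHashTable {
    std::vector<CallTrace*> slots;
    LongHashTable* prev_ = nullptr;
    LongHashTable* prev() const { return prev_; }
    size_t size() const { return slots.size(); }
};

static size_t traceBytes(const CallTrace* t) {
    return sizeof(CallTrace) + t->num_frames * sizeof(void*);
}

// Decrement pass: walk every slot of every table in the chain, deduplicate
// trace pointers, then subtract the freed bytes/traces from the globals.
// Only safe once waitForAllRefCountsToClear() guarantees no concurrent writers.
void decrementCounters(LongHashTable* head) {
    size_t estimated = 0;
    for (LongHashTable* t = head; t != nullptr; t = t->prev()) {
        estimated += t->size();
    }
    std::unordered_set<CallTrace*> seen;
    seen.reserve(estimated);  // bound rehashing on large tables

    long long freed_bytes = 0, freed_traces = 0;
    for (LongHashTable* t = head; t != nullptr; t = t->prev()) {
        for (CallTrace* tr : t->slots) {
            if (tr != nullptr && seen.insert(tr).second) {  // count each trace once
                freed_bytes += (long long) traceBytes(tr);
                freed_traces++;
            }
        }
    }
    CALLTRACE_STORAGE_BYTES -= freed_bytes;
    CALLTRACE_STORAGE_TRACES -= freed_traces;
}
```

After this pass the globals reflect only traces still live elsewhere, so repeated rotations no longer accumulate.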
CI Test Results: Run #24658333136 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Summary: Total: 32 | Passed: 32 | Failed: 0
Updated: 2026-04-20 09:42:13 UTC
- Deduplicate trace pointers in `clearTableOnly()` to avoid over-decrement when `put()` reuses the same `CallTrace*` across chained tables
- Update test assertions: counters are 0 after stop (correct for leak detection), not in range [1, 100]

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Live counters are 0 after stop because `processTraces()` decrements them during `finishChunk()`. The recording already captures pre-cleanup values (`writeCounters` runs before `writeCpool`/`processTraces`). Read those instead. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes `calltrace_storage_bytes` / `calltrace_storage_traces` so they reflect current live calltrace storage instead of monotonically accumulating across `processTraces()` rotations, and updates tests to validate counter values from the JFR recording (pre-final-cleanup).
Changes:
- Decrement `calltrace_storage_*` counters when clearing calltrace hash tables to prevent unbounded accumulation across rotations.
- Gate unwind timing measurements in `profiler.cpp` behind `#ifdef COUNTERS` to avoid overhead when counters are disabled.
- Update CPU/wallclock context tests to assert recorded (JFR) counter values via a new `AbstractProfilerTest#getRecordedCounterValue()` helper.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ddprof-test/src/test/java/com/datadoghq/profiler/wallclock/BaseContextWallClockTest.java | Switches counter assertions to JFR-recorded values (pre-cleanup). |
| ddprof-test/src/test/java/com/datadoghq/profiler/cpu/ContextCpuTest.java | Switches counter assertions to JFR-recorded values (pre-cleanup). |
| ddprof-test/src/test/java/com/datadoghq/profiler/AbstractProfilerTest.java | Adds helper to read datadog.ProfilerCounter values from the recording. |
| ddprof-lib/src/main/cpp/profiler.cpp | Wraps unwind timing measurements with #ifdef COUNTERS. |
| ddprof-lib/src/main/cpp/callTraceHashTable.h / .cpp | Adds counter-decrement pass during table clears to prevent counter accumulation. |
```cpp
long long freed_traces = 0;
std::unordered_set<CallTrace*> seen;
```
`decrementCounters()` builds an `unordered_set` of seen `CallTrace*` but never reserves capacity. On large tables this will cause repeated rehashing while iterating every slot. Since `LongHashTable::size()` is available, consider summing sizes across the chain and calling `seen.reserve(estimatedDistinct)` before insertion to keep the clear/rotation overhead bounded.
Suggested change:

```cpp
long long freed_traces = 0;
size_t estimated_entries = 0;
for (LongHashTable *t = _table; t != nullptr; t = t->prev()) {
  estimated_entries += t->size();
}
std::unordered_set<CallTrace*> seen;
seen.reserve(estimated_entries);
```
Addressed — pre-computed the sum of sizes across the chain and called seen.reserve(estimated_entries) to avoid repeated rehashing on large tables.
```java
assertInRange(test.getRecordedCounterValue("calltrace_storage_traces"), 1, 100);
assertInRange(test.getRecordedCounterValue("calltrace_storage_bytes"), 1024, 8 * 1024 * 1024);
// live counters are 0 after stop (all traces freed - correct, non-leaking behaviour)
Map<String, Long> debugCounters = profiler.getDebugCounters();
```
The comment says live counters are 0 after stopProfiler(), but the test no longer asserts anything about calltrace_storage_traces/bytes in debugCounters. Since this PR changes counter semantics, it would be good to assert they are actually 0 here (and thus verify the post-cleanup state matches the intended non-leaking behavior).
Suggested change:

```java
Map<String, Long> debugCounters = profiler.getDebugCounters();
assertEquals(0, debugCounters.get("calltrace_storage_traces"));
assertEquals(0, debugCounters.get("calltrace_storage_bytes"));
```
Addressed — added assertions for calltrace_storage_traces and calltrace_storage_bytes being 0 after getDebugCounters() to verify the non-leaking post-cleanup state.
```java
assertInRange(getRecordedCounterValue("calltrace_storage_traces"), 1, 100);
assertInRange(getRecordedCounterValue("calltrace_storage_bytes"), 1024, 8 * 1024 * 1024);
// live counters are 0 after stop (all traces freed - correct, non-leaking behaviour)
Map<String, Long> debugCounters = profiler.getDebugCounters();
```
The test now validates the recorded (pre-cleanup) counter values, but it no longer asserts the live calltrace_storage_* counters after stopProfiler(). Given the fix is about counters not accumulating/leaking, consider asserting debugCounters.get("calltrace_storage_traces") and debugCounters.get("calltrace_storage_bytes") are 0 here (consistent with the nearby comment and linear_allocator_* == 0).
Suggested change:

```java
Map<String, Long> debugCounters = profiler.getDebugCounters();
assertEquals(0, debugCounters.get("calltrace_storage_traces"));
assertEquals(0, debugCounters.get("calltrace_storage_bytes"));
```
Addressed — same fix: added assertions for calltrace_storage_traces and calltrace_storage_bytes being 0 after stopProfiler(), consistent with the nearby comment and linear_allocator checks.
- callTraceHashTable.cpp:116 — pre-reserve unordered_set to avoid rehashing
- BaseContextWallClockTest.java:204 — assert calltrace_storage_* counters are 0 after stop
- ContextCpuTest.java:114 — assert calltrace_storage_* counters are 0 after stop

Co-Authored-By: muse <muse@noreply>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
ddprof-lib/src/main/cpp/callTraceHashTable.cpp:154
- In `clearTableOnly()`, the loop that "clears previous chain pointers" doesn't actually traverse the whole chain: once `table->setPrev(nullptr)` runs, the `for` loop's increment (`table = table->prev()`) becomes `nullptr` and the loop exits after the first iteration. `prev_table` is computed but never used. Consider iterating via the saved `prev_table` (or explicitly documenting/simplifying to only break the head link) so the code matches the intent and avoids confusion.
```cpp
// Clear previous chain pointers to prevent traversal during deallocation
for (LongHashTable *table = _table; table != nullptr; table = table->prev()) {
  LongHashTable *prev_table = table->prev();
  if (prev_table != nullptr) {
    table->setPrev(nullptr); // Clear link before deallocation
  }
}
```
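A minimal sketch of how the traversal could be fixed, using a toy `LongHashTable` that carries only the chain link (the real class has much more; `clearChainLinks` is a hypothetical name): save the next link before clearing it, then step through the saved pointer so every table in the chain is visited.

```cpp
#include <cassert>

// Toy stand-in: only the chain links, not the real hash table.
struct LongHashTable {
    LongHashTable* prev_ = nullptr;
    LongHashTable* prev() const { return prev_; }
    void setPrev(LongHashTable* p) { prev_ = p; }
};

// Save prev() BEFORE nulling it, so the walk covers the whole chain instead
// of stopping after the head (the flagged bug: the loop increment re-read
// prev() after it had already been cleared).
void clearChainLinks(LongHashTable* head) {
    for (LongHashTable* table = head; table != nullptr; ) {
        LongHashTable* prev_table = table->prev();
        table->setPrev(nullptr);   // break the link
        table = prev_table;        // advance via the saved pointer
    }
}
```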
What does this PR do?:
Fixes `calltrace_storage_bytes` and `calltrace_storage_traces` counters that grew monotonically (cumulative total ever allocated) instead of reflecting the current live memory footprint.

Motivation:
`clearTableOnly()` frees the allocator chunks but never decremented the global counters. During each `processTraces()` rotation, `putWithExistingId()` incremented the counters again. Over time (e.g. ~2 rotations/min × 20 hours) the counters accumulated the cumulative total of all trace bytes ever allocated, which was misreported as 3.35 GB of live memory when the actual footprint was tens of MB.
Additional Notes:
`clearTableOnly()` is called from both the direct `clear()` path and the deferred-free path in `processTraces()`, so a single fix covers both cases. The decrement pass uses the same slot iteration as `collect()` and runs after `waitForAllRefCountsToClear()`, which guarantees no concurrent writers, so the byte/trace counts are stable and exact.

How to test the change?:
Run the profiler for several upload cycles and verify `calltrace_storage_bytes` and `calltrace_storage_traces` stay bounded rather than growing indefinitely. The values should reflect the actual live traces in the three hash tables at any point.

For Datadog employees:
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.

Unsure? Have a question? Request a review!