
Defer MetricKey construction to the aggregator thread#11381

Draft
dougqh wants to merge 2 commits into dougqh/conflating-metrics-producer-wins from dougqh/conflating-metrics-background-work

Conversation

dougqh (Contributor) commented May 15, 2026

Summary

Stacked on top of #11380 -- review that first; the merge base of this PR is dougqh/conflating-metrics-producer-wins, not master. The diff shown here is only the work that's new beyond that PR.

Moves the per-span MetricKey construction, cache lookups, and aggregation off the producer thread into the existing aggregator thread, replacing the Batch-based conflation pipeline with a thin per-span SpanSnapshot posted to the inbox.

What the producer does now (per span)

  • filter (shouldComputeMetric, resource-ignored, longRunning)
  • collect tag values into a SpanSnapshot (one allocation per span)
  • inbox.offer(snapshot) + return error flag for forceKeep (see the sketch below)

What moved off the producer

  • MetricKey construction and its hash computation
  • SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  • SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  • PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  • pending / keys ConcurrentHashMap operations
  • Batch pooling, atomic ops, contributeTo -- see the aggregator-side sketch below for where this work lands now

Removed entirely

  • Batch.java -- the aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the conflation point now
  • pending ConcurrentHashMap<MetricKey, Batch>
  • keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  • batchPool MessagePassingQueue<Batch>
  • CommonKeyCleaner's keys.keySet() tracking; AggregateExpiry now just reports LRU drops to health metrics

Added

  • SpanSnapshot: immutable value carrying the raw MetricKey inputs + a tagAndDuration long (duration OR-ed with ERROR_TAG / TOP_LEVEL_TAG).
  • AggregateMetric.recordOneDuration(long) -- single-hit equivalent of the existing recordDurations(int, AtomicLongArray); see the sketch after this list
  • Peer-tag values flow through the snapshot as a flattened String[] of [name0, value0, name1, value1, ...]; the aggregator encodes them through PEER_TAGS_CACHE on its own thread.
  • HealthMetrics.onStatsInboxFull() + a TracerHealthMetrics counter reported as stats.dropped_aggregates{reason:inbox_full} -- parallel to the existing reason:lru_eviction. Without conflation the producer can lose snapshots when the bounded MPSC queue is full; this counter makes those drops visible instead of silent.

Benchmark results (2 forks × 5 iter × 15s)

ConflatingMetricsAggregatorDDSpanBenchmark:

                                 avgt (µs/op)    CI (99.9%)
    prior commit (stacked base)  6.343 ± 0.115   [6.228, 6.458]
    this PR                      2.506 ± 0.044   [2.462, 2.550]

About 60% faster on the production DDSpan path; the SimpleSpan bench improves by ~53% as well.

Caveat on the bench numbers

Without conflation, the producer pushes one inbox item per span instead of ~1 per 64. At the JMH bench's synthetic rate (effectively ~20M snapshots/sec from the producer) the consumer can't keep up and inbox.offer silently drops -- the new onStatsInboxFull counter would fire constantly. For scale: at that rate a 2048-slot inbox holds only ~0.1 ms of producer output, so any consumer stall longer than that forces drops. The headline numbers measure producer publish() latency only; consumer throughput at realistic span rates is a follow-up to validate, and tuning maxPending matters more in this design.

Real fixes for capacity (out of scope for this PR):

  • Bump the maxPending default; the conflating design had 2048 slots × ~64 conflation ≈ 131K effective capacity, while the new design has 2048 slots flat.
  • Producer-side mini-batching (TLAB-style accumulator per thread) to recover some compression -- sketched below.

Test plan

  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes
  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.core.monitor.*' passes
  • ./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
  • ./gradlew spotlessCheck comes back clean
  • CI muzzle / integration suites
  • Validate stats.dropped_aggregates{reason:inbox_full} reports as expected under a synthetic high-load run (not in the JMH bench)

🤖 Generated with Claude Code

dougqh and others added 2 commits May 15, 2026 13:50
Replace the producer-side conflation pipeline with a thin per-span SpanSnapshot
posted to the existing aggregator thread. The aggregator now builds the
MetricKey, does the SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups, and
updates the AggregateMetric directly -- all off the producer's hot path.

What the producer does now, per span:

  - filter (shouldComputeMetric, resource-ignored, longRunning)
  - collect tag values into a SpanSnapshot (1 allocation per span)
  - inbox.offer(snapshot) + return error flag for forceKeep

What moved off the producer:

  - MetricKey construction and its hash computation
  - SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  - SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  - PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  - pending/keys ConcurrentHashMap operations
  - Batch pooling, batch atomic ops, batch contributeTo

Removed entirely:

  - Batch.java -- the conflation primitive is no longer needed; the
    aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the
    conflation point now.
  - pending ConcurrentHashMap<MetricKey, Batch>
  - keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  - batchPool MessagePassingQueue<Batch>
  - The CommonKeyCleaner role of tracking keys.keySet() on LRU eviction --
    AggregateExpiry now just reports drops to healthMetrics.

Added:

  - SpanSnapshot: immutable value carrying the raw MetricKey inputs + a
    tagAndDuration long (duration | ERROR_TAG | TOP_LEVEL_TAG).
  - AggregateMetric.recordOneDuration(long tagAndDuration) -- the single-hit
    equivalent of the existing recordDurations(int, AtomicLongArray).
  - Peer-tag values flow through the snapshot as a flattened String[] of
    [name0, value0, name1, value1, ...]; the aggregator encodes them through
    PEER_TAGS_CACHE on its own thread.

Benchmark results (2 forks x 5 iter x 15s):

  ConflatingMetricsAggregatorDDSpanBenchmark
    prior commit  6.343 +- 0.115 us/op
    this commit   2.506 +- 0.044 us/op  (~60% faster)

  ConflatingMetricsAggregatorBenchmark (SimpleSpan)
    prior commit  6.585 +- 0.049 us/op
    this commit   3.116 +- 0.032 us/op  (~53% faster)

Caveat on the benchmark: without conflation, the producer pushes 1 inbox
item per span instead of ~1 per 64. At the benchmark's synthetic rate the
consumer can't keep up and inbox.offer silently drops. The numbers measure
producer publish() latency only; consumer throughput at realistic span rates
is a follow-up to validate. Tuning maxPending matters more in this design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the per-span SpanSnapshot inbox path, the producer can lose snapshots
when the bounded MPSC queue is full -- silently, since inbox.offer() returns
a boolean we previously ignored. The conflating-Batch design used to absorb
~64x more producer pressure per inbox slot, so this is a new failure mode
worth surfacing.

Wire it through the existing HealthMetrics path:

- HealthMetrics.onStatsInboxFull() (no-op default).
- TracerHealthMetrics gets a statsInboxFull LongAdder and a new reason tag
  reason:inbox_full reported under the same stats.dropped_aggregates metric
  used for LRU evictions. Two LongAdders, two tagged time series.
- ConflatingMetricsAggregator.publish increments the counter when
  inbox.offer(snapshot) returns false.

This doesn't fix the drop -- tuning maxPending and/or building producer-side
batching are the actual fixes. But it makes the failure visible in the same
place ops already watches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh added the type: enhancement, comp: core, comp: metrics, tag: performance, tag: no release notes, and tag: ai generated labels on May 15, 2026