Avoid leaking continuations on google pubsub instrumentation by amarziali · Pull Request #11605 · DataDog/dd-trace-java

amarziali · 2026-06-09T09:57:20Z

What Does This Do

Fix Pub/Sub strict trace writes by avoiding async continuation capture for internal GAX/PubSub callbacks that do not represent application work. The Pub/Sub tests no longer need useStrictTraceWrites() == false: strict mode was failing because internal retry/listener callbacks could capture continuations from the publish span and resolve too late, or never resolve, leaving the writer waiting on orphaned work.

This has been detected with an experimental (unreleased) diagnostic tool to track continuation leakage. The diagnostic message was

- span: `google-pubsub.produce`
  - state: `captured-never-activated-or-canceled`
  - capture thread: `Gax-*`
  - capture stack included:
    - `State.captureAndSetContinuation`
    - `ExecutorInstrumentationUtils.setupState`
    - `AbstractFuture.addListener`
    - `CallbackChainRetryingFuture.setAttemptFuture`
    - `AttemptCallable.call`
    - `Publisher.publishOutstandingBatch`

That showed the continuation was captured by GAX retry/listener machinery, not by user-visible async work.

Motivation

Additional Notes

Contributor Checklist

Format the title according to the contribution guidelines
Assign the type: and (comp: or inst:) labels in addition to any other useful labels
Avoid using close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Update the CODEOWNERS file on source file addition, migration, or deletion
Update public documentation with any new configuration flags or behaviors
Add your completed PR to the merge queue by commenting /merge. You can also:
- Customize the commit message associated with the merge with /merge --commit-message "..."
- Remove your PR from the merge queue with /merge -c
- Skip all merge queue checks with /merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)
- Get more information in this doc

Jira ticket: [PROJ-IDENT]

dd-octo-sts · 2026-06-09T10:15:16Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Scenario	Candidate	master	Δ (95% CI of mean)
startup:insecure-bank:iast:Agent	13.95 s	13.95 s	[-0.8%; +0.8%] (no difference)
startup:insecure-bank:tracing:Agent	12.96 s	13.01 s	[-1.4%; +0.7%] (no difference)
startup:petclinic:appsec:Agent	16.92 s	16.66 s	[-0.1%; +3.2%] (no difference)
startup:petclinic:iast:Agent	16.84 s	16.96 s	[-1.6%; +0.2%] (no difference)
startup:petclinic:profiling:Agent	16.74 s	16.89 s	[-2.2%; +0.4%] (no difference)
startup:petclinic:sca:Agent	16.83 s	16.86 s	[-1.6%; +1.2%] (no difference)
startup:petclinic:tracing:Agent	16.12 s	16.02 s	[-0.4%; +1.6%] (no difference)

Commit: b8d6556b · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

dougqh · 2026-06-09T16:40:31Z

[Claude] The EXECUTOR exclusion scope is worth double-checking. Excluding a class from RUNNABLE prevents it from being wrapped with context when submitted; excluding it from EXECUTOR prevents context propagation when the class acts as an executor scheduling other work. The diagnostic stack in the PR description only shows the continuation being captured (the RUNNABLE path), not these classes acting as executors. If BasicRetryingFuture$CompletionListener and CallbackChainRetryingFuture$AttemptCompletionListener are not used as executors elsewhere in the GAX stack, the EXECUTOR entry is harmless but unnecessary. If they are, dropping it from EXECUTOR while keeping RUNNABLE would give a narrower fix with less risk of suppressing legitimate context propagation.

dougqh · 2026-06-09T16:41:04Z

[Claude] PubSubTest.groovy is modified in this PR (deletion of useStrictTraceWrites()). Per project conventions, Groovy/Spock tests should be migrated to JUnit 5 Java rather than modified in place. The change here is a small deletion so it's low risk as-is, but this file is a migration candidate — worth doing either as part of this PR or as a follow-up.

dougqh

Claude raised one minor concern. I'll leave that to your discretion.

amarziali · 2026-06-09T22:08:01Z

[Claude] PubSubTest.groovy is modified in this PR (deletion of useStrictTraceWrites()). Per project conventions, Groovy/Spock tests should be migrated to JUnit 5 Java rather than modified in place. The change here is a small deletion so it's low risk as-is, but this file is a migration candidate — worth doing either as part of this PR or as a follow-up.

it's a project convention to do new tests perhaps in java but I won't engage a migration just for the fact that I removed a method override

amarziali · 2026-06-09T22:15:50Z

[Claude] The EXECUTOR exclusion scope is worth double-checking. Excluding a class from RUNNABLE prevents it from being wrapped with context when submitted; excluding it from EXECUTOR prevents context propagation when the class acts as an executor scheduling other work. The diagnostic stack in the PR description only shows the continuation being captured (the RUNNABLE path), not these classes acting as executors. If BasicRetryingFuture$CompletionListener and CallbackChainRetryingFuture$AttemptCompletionListener are not used as executors elsewhere in the GAX stack, the EXECUTOR entry is harmless but unnecessary. If they are, dropping it from EXECUTOR while keeping RUNNABLE would give a narrower fix with less risk of suppressing legitimate context propagation.

Well I did an exercise before opening this PR. I removed the exclusions of type EXECUTOR and the tests were failing. This is because they apply at the executor-submission instrumentation boundary, where the agent decides whether to attach continuation state for work submitted through an executor. So it's not a matter because those classes are or not a instanceof Executor but because they are submitted through executors.
For the GAX scope, it's true that those classes are general purpose and not only used for pubsub (i.e. they can also be used for grpc). I think that overall it's a good choice to limit the boundaries like suggested in this PR. We also test with grpc instrumentation here (because behind the scene pubsub uses grpc) and there is a mode where we also test grpc spans. so all in all this change should be harmless for grpc

amarziali requested review from a team as code owners June 9, 2026 09:57

amarziali added type: bug Bug report and fix inst: others All other instrumentations labels Jun 9, 2026

amarziali requested review from ValentinZakharov and mcculls and removed request for a team June 9, 2026 09:57

This comment has been minimized.

Sign in to view

Avoid leaking continuations on google pubsub instrumentation

b8d6556

amarziali force-pushed the andrea.marziali/pubsub branch from 2d25102 to b8d6556 Compare June 9, 2026 13:01

dougqh approved these changes Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid leaking continuations on google pubsub instrumentation#11605

Avoid leaking continuations on google pubsub instrumentation#11605
amarziali wants to merge 1 commit into
masterfrom
andrea.marziali/pubsub

amarziali commented Jun 9, 2026

Uh oh!

This comment has been minimized.

dd-octo-sts Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

dougqh commented Jun 9, 2026

Uh oh!

dougqh commented Jun 9, 2026

Uh oh!

dougqh left a comment

Uh oh!

amarziali commented Jun 9, 2026

Uh oh!

amarziali commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

amarziali commented Jun 9, 2026

What Does This Do

Motivation

Additional Notes

Contributor Checklist

Uh oh!

This comment has been minimized.

dd-octo-sts Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟢 Java Benchmark SLOs — All performance SLOs passed

Uh oh!

dougqh commented Jun 9, 2026

Uh oh!

dougqh commented Jun 9, 2026

Uh oh!

dougqh left a comment

Choose a reason for hiding this comment

Uh oh!

amarziali commented Jun 9, 2026

Uh oh!

amarziali commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dd-octo-sts Bot commented Jun 9, 2026 •

edited

Loading