Skip to content

Avoid leaking continuations on google pubsub instrumentation#11605

Open
amarziali wants to merge 1 commit into
masterfrom
andrea.marziali/pubsub
Open

Avoid leaking continuations on google pubsub instrumentation#11605
amarziali wants to merge 1 commit into
masterfrom
andrea.marziali/pubsub

Conversation

@amarziali

Copy link
Copy Markdown
Contributor

What Does This Do

Fix Pub/Sub strict trace writes by avoiding async continuation capture for internal GAX/PubSub callbacks that do not represent application work. The Pub/Sub tests no longer need useStrictTraceWrites() == false: strict mode was failing because internal retry/listener callbacks could capture continuations from the publish span and resolve too late, or never resolve, leaving the writer waiting on orphaned work.

This has been detected with an experimental (unreleased) diagnostic tool to track continuation leakage. The diagnostic message was

- span: `google-pubsub.produce`
  - state: `captured-never-activated-or-canceled`
  - capture thread: `Gax-*`
  - capture stack included:
    - `State.captureAndSetContinuation`
    - `ExecutorInstrumentationUtils.setupState`
    - `AbstractFuture.addListener`
    - `CallbackChainRetryingFuture.setAttemptFuture`
    - `AttemptCallable.call`
    - `Publisher.publishOutstandingBatch`

That showed the continuation was captured by GAX retry/listener machinery, not by user-visible async work.

Motivation

Additional Notes

Contributor Checklist

  • Format the title according to the contribution guidelines
  • Assign the type: and (comp: or inst:) labels in addition to any other useful labels
  • Avoid using close, fix, or any linking keywords when referencing an issue
    Use solves instead, and assign the PR milestone to the issue
  • Update the CODEOWNERS file on source file addition, migration, or deletion
  • Update public documentation with any new configuration flags or behaviors
  • Add your completed PR to the merge queue by commenting /merge. You can also:
    • Customize the commit message associated with the merge with /merge --commit-message "..."
    • Remove your PR from the merge queue with /merge -c
    • Skip all merge queue checks with /merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)
    • Get more information in this doc

Jira ticket: [PROJ-IDENT]

@amarziali amarziali requested review from a team as code owners June 9, 2026 09:57
@amarziali amarziali added type: bug Bug report and fix inst: others All other instrumentations labels Jun 9, 2026
@amarziali amarziali requested review from ValentinZakharov and mcculls and removed request for a team June 9, 2026 09:57
@datadog-official

This comment has been minimized.

@dd-octo-sts

dd-octo-sts Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite Status
Startup 🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
Scenario Candidate master Δ (95% CI of mean)
startup:insecure-bank:iast:Agent 13.95 s 13.95 s [-0.8%; +0.8%] (no difference)
startup:insecure-bank:tracing:Agent 12.96 s 13.01 s [-1.4%; +0.7%] (no difference)
startup:petclinic:appsec:Agent 16.92 s 16.66 s [-0.1%; +3.2%] (no difference)
startup:petclinic:iast:Agent 16.84 s 16.96 s [-1.6%; +0.2%] (no difference)
startup:petclinic:profiling:Agent 16.74 s 16.89 s [-2.2%; +0.4%] (no difference)
startup:petclinic:sca:Agent 16.83 s 16.86 s [-1.6%; +1.2%] (no difference)
startup:petclinic:tracing:Agent 16.12 s 16.02 s [-0.4%; +1.6%] (no difference)

Commit: b8d6556b · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@amarziali amarziali force-pushed the andrea.marziali/pubsub branch from 2d25102 to b8d6556 Compare June 9, 2026 13:01
@dougqh

dougqh commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[Claude] The EXECUTOR exclusion scope is worth double-checking. Excluding a class from RUNNABLE prevents it from being wrapped with context when submitted; excluding it from EXECUTOR prevents context propagation when the class acts as an executor scheduling other work. The diagnostic stack in the PR description only shows the continuation being captured (the RUNNABLE path), not these classes acting as executors. If BasicRetryingFuture$CompletionListener and CallbackChainRetryingFuture$AttemptCompletionListener are not used as executors elsewhere in the GAX stack, the EXECUTOR entry is harmless but unnecessary. If they are, dropping it from EXECUTOR while keeping RUNNABLE would give a narrower fix with less risk of suppressing legitimate context propagation.

@dougqh

dougqh commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[Claude] PubSubTest.groovy is modified in this PR (deletion of useStrictTraceWrites()). Per project conventions, Groovy/Spock tests should be migrated to JUnit 5 Java rather than modified in place. The change here is a small deletion so it's low risk as-is, but this file is a migration candidate — worth doing either as part of this PR or as a follow-up.

@dougqh dougqh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude raised one minor concern. I'll leave that to your discretion.

@amarziali

Copy link
Copy Markdown
Contributor Author

[Claude] PubSubTest.groovy is modified in this PR (deletion of useStrictTraceWrites()). Per project conventions, Groovy/Spock tests should be migrated to JUnit 5 Java rather than modified in place. The change here is a small deletion so it's low risk as-is, but this file is a migration candidate — worth doing either as part of this PR or as a follow-up.

it's a project convention to do new tests perhaps in java but I won't engage a migration just for the fact that I removed a method override

@amarziali

Copy link
Copy Markdown
Contributor Author

[Claude] The EXECUTOR exclusion scope is worth double-checking. Excluding a class from RUNNABLE prevents it from being wrapped with context when submitted; excluding it from EXECUTOR prevents context propagation when the class acts as an executor scheduling other work. The diagnostic stack in the PR description only shows the continuation being captured (the RUNNABLE path), not these classes acting as executors. If BasicRetryingFuture$CompletionListener and CallbackChainRetryingFuture$AttemptCompletionListener are not used as executors elsewhere in the GAX stack, the EXECUTOR entry is harmless but unnecessary. If they are, dropping it from EXECUTOR while keeping RUNNABLE would give a narrower fix with less risk of suppressing legitimate context propagation.

Well I did an exercise before opening this PR. I removed the exclusions of type EXECUTOR and the tests were failing. This is because they apply at the executor-submission instrumentation boundary, where the agent decides whether to attach continuation state for work submitted through an executor. So it's not a matter because those classes are or not a instanceof Executor but because they are submitted through executors.
For the GAX scope, it's true that those classes are general purpose and not only used for pubsub (i.e. they can also be used for grpc). I think that overall it's a good choice to limit the boundaries like suggested in this PR. We also test with grpc instrumentation here (because behind the scene pubsub uses grpc) and there is a mode where we also test grpc spans. so all in all this change should be harmless for grpc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst: others All other instrumentations type: bug Bug report and fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants