Skip to content

feat(supervisor): add opt-in dequeue backpressure#3836

Open
nicktrn wants to merge 9 commits into
mainfrom
feat/supervisor-dequeue-backpressure-tri-5354
Open

feat(supervisor): add opt-in dequeue backpressure#3836
nicktrn wants to merge 9 commits into
mainfrom
feat/supervisor-dequeue-backpressure-tri-5354

Conversation

@nicktrn
Copy link
Copy Markdown
Collaborator

@nicktrn nicktrn commented Jun 4, 2026

The supervisor can now pause dequeuing - and freeze consumer-pool scale-up - when a backpressure signal says the cluster can't place more work, then ramp dequeuing back up gradually once it clears. The signal is a verdict published to a Redis key by a cluster-side component; the supervisor reads it on a short refresh and gates preDequeue on it.

Off by default (TRIGGER_DEQUEUE_BACKPRESSURE_ENABLED). Everything fails open: a missing, stale, or unreadable verdict never pins the brake, and the hot-path read is a synchronous cached lookup with no I/O. The scale-up freeze leaves scale-down untouched, and on release the resume is ramped so a deep queue isn't hammered all at once.

Dry-run is on by default (TRIGGER_DEQUEUE_BACKPRESSURE_DRY_RUN): even once enabled it only logs what it would have done, and surfaces the computed state through metrics, until explicitly set to act. Prometheus: supervisor_backpressure_engaged, _dry_run, _skipped_dequeues_total.

Refs TRI-5354

nicktrn added 7 commits June 4, 2026 17:35
Cached, fail-open monitor that decides whether to skip dequeues based on a
pluggable signal source. Disabled is a total no-op (no refresh, no reads).
The hot-path read is synchronous and never performs I/O; every failure mode
(source throws, returns null, or verdict goes stale) fails open.
Reads the backpressure verdict from a Redis key (written by the cluster-side
aggregator). Malformed or wrong-shaped values are treated as unknown so the
monitor fails open. Adds @internal/redis + @internal/testcontainers deps.
Gate dequeues on the backpressure verdict via the existing preDequeue hook,
on all paths including k8s (where the resource monitor is a no-op). Construct
the Redis-backed monitor only when TRIGGER_DEQUEUE_BACKPRESSURE_ENABLED is set;
require a Redis host when enabled. Off by default - no Redis client, no effect.
isEngaged() exposes the hard backpressure state (drives scale-up freeze),
while shouldSkipDequeue() additionally ramps after release - skipping a
linearly-decaying fraction of attempts over rampMs so the aggregate dequeue
rate climbs back to full instead of snapping and re-flooding the cluster.
Add optional shouldPauseScaling to ScalingOptions; when it returns true the
pool stops scaling up (scale-down still allowed), so it won't add consumers to
drain a queue backpressure is deliberately holding.
Pass shouldPauseScaling (monitor.isEngaged) into the consumer pool so scale-up
freezes while hard-engaged, and feed TRIGGER_DEQUEUE_BACKPRESSURE_RAMP_MS into
the monitor's post-release ramp. Off by default.
Dry-run (default on via env) keeps the gates inert while computeEngaged still
reflects the real signal and verdict transitions are logged. Adds
BackpressureMetrics (engaged/dry_run gauges, skipped-dequeues counter).
@nicktrn nicktrn self-assigned this Jun 4, 2026
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Jun 4, 2026

🦋 Changeset detected

Latest commit: 44842f9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 25 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/plugins Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@trigger.dev/rbac Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e543f166-0716-4fa8-ab25-1eeacd0d75b4

📥 Commits

Reviewing files that changed from the base of the PR and between e6b5a67 and 44842f9.

📒 Files selected for processing (1)
  • apps/supervisor/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/supervisor/src/index.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (49)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: audit
  • GitHub Check: Analyze (javascript-typescript)

Walkthrough

This pull request implements a Redis-backed dequeue backpressure system for the supervisor. When enabled, the supervisor periodically reads a backpressure verdict from Redis and uses it to pause consumer scale-up and probabilistically skip dequeue attempts. The system includes verdicts with optional timestamps to handle staleness, a post-release ramp that gradually resumes dequeueing after backpressure clears, and a dry-run mode for testing. Configuration is managed through environment variables, metrics are exported via Prometheus, and the consumer pool receives a new shouldPauseScaling callback to freeze scale-up while allowing scale-down. The feature is opt-in and disabled by default.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is substantially complete, explaining the feature, configuration, defaults, and failure modes. However, it lacks required template sections including Testing and Changelog sections, and does not include the issue reference in the Closes format. Add 'Closes #' at the top, include Testing section describing test steps, and add a Changelog section summarizing the change.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding an opt-in dequeue backpressure feature to the supervisor. It is concise and clearly conveys the primary purpose of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/supervisor-dequeue-backpressure-tri-5354

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

Guard against overlapping refresh ticks when a read hangs; use the
staleness-aware computeEngaged() for transition/ramp/gauge bookkeeping; close
the backpressure Redis client on supervisor shutdown.
@nicktrn nicktrn marked this pull request as ready for review June 4, 2026 18:20
devin-ai-integration[bot]

This comment was marked as resolved.

Add TRIGGER_DEQUEUE_BACKPRESSURE_REDIS_PASSWORD to the secret strip-list so it
never lands in the DEBUG startup log, with a comment to keep new secrets out.
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.

Open in Devin Review

Comment on lines +99 to +111
computeEngaged(): boolean {
const verdict = this.verdict;
if (verdict?.engaged !== true) {
return false;
}

const maxAge = this.opts.maxVerdictAgeMs;
if (maxAge !== undefined && verdict.ts !== undefined && Date.now() - verdict.ts > maxAge) {
return false;
}

return true;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Staleness check bypassed when verdict lacks a ts field

The staleness fail-open in computeEngaged() at apps/supervisor/src/backpressure/backpressureMonitor.ts:106 requires both maxVerdictAgeMs and verdict.ts to be defined. If the cluster-side aggregator writes a verdict like {engaged: true} without a ts field and then dies, the supervisor will never age out that verdict — it stays engaged until the next successful refresh (which might read the same stale key forever if it has no Redis TTL). This is by design (ts is optional in the schema), but operators should be aware: if maxVerdictAgeMs is relied upon for safety, the aggregator must include ts in every verdict it writes. Worth documenting this operational requirement.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant