fix(webapp,core): retry run resume through transient database outages#4161

Open
matt-aitken wants to merge 5 commits into main from fix/resume-retriable-on-transient-db-errors

Conversation

@matt-aitken

@matt-aitken matt-aitken commented Jul 5, 2026

Member

Summary

When the platform database is briefly unreachable while a run is resuming from a wait, the run no longer fails with TASK_EXECUTION_ABORTED. The worker now retries the resume through the outage instead of aborting on the first blip.

Root cause

Resuming a run calls the engine's continue worker-action endpoint. That route caught every error and returned a 422, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example P1001 "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with TASK_EXECUTION_ABORTED.
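A minimal sketch of why the status code matters on the worker side. It assumes a client that retries only 5xx responses; the function names and the scripted status list are hypothetical and are not the worker's actual HTTP client.

```typescript
// Hypothetical classification: 5xx is treated as transient, everything else as permanent.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

type Outcome =
  | { kind: "ok"; attempts: number }
  | { kind: "gave_up"; attempts: number; lastStatus: number };

// Walks a scripted list of response statuses as a stand-in for real HTTP calls.
function simulateResume(statuses: number[], maxAttempts: number): Outcome {
  let attempts = 0;
  let lastStatus = 0;
  for (const status of statuses.slice(0, maxAttempts)) {
    attempts += 1;
    lastStatus = status;
    if (status >= 200 && status < 300) {
      return { kind: "ok", attempts };
    }
    if (!isRetryableStatus(status)) {
      break; // a 422 ends the loop on the first attempt
    }
  }
  return { kind: "gave_up", attempts, lastStatus };
}

// Old behaviour: the outage is reported as 422, so the later success is never reached.
const flattened = simulateResume([422, 200], 8);
// New behaviour: the outage is reported as 500, so the client keeps trying.
const retried = simulateResume([500, 500, 200], 8);
```

Under these assumptions `flattened` gives up after one attempt, while `retried` succeeds on the third.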

Fix

  • The continue route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via isInfrastructureError. Genuine validation errors (snapshot mismatch, invalid state) still return 422, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
  • The worker's continueRunExecution calls (both the runner-to-supervisor and supervisor-to-engine hops) retry with a longer, jittered backoff so they can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.
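The first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the route code from the diff: `looksLikeInfrastructureError` is a hypothetical stand-in for `isInfrastructureError` that recognises only P1001, and `continueRoute` and `handle` are invented names.

```typescript
// Hypothetical names throughout; this is not the code from the PR's diff.
type RouteResponse = { status: number; body: { ok: true } | { error: string } };

// Stand-in predicate: only recognises Prisma's P1001 code, which may be exposed
// as either `code` or `errorCode` depending on the error class.
function looksLikeInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { code?: unknown; errorCode?: unknown };
  return e.code === "P1001" || e.errorCode === "P1001";
}

// Continue-route body: infrastructure errors are re-thrown, the rest become 422.
function continueRoute(resume: () => void): RouteResponse {
  try {
    resume();
    return { status: 200, body: { ok: true } };
  } catch (error) {
    if (looksLikeInfrastructureError(error)) {
      throw error; // let the generic handler answer with a retryable 500
    }
    const message = error instanceof Error ? error.message : "Unknown error";
    return { status: 422, body: { error: message } };
  }
}

// Generic outer handler: anything that escapes the route becomes a scrubbed 500.
function handle(resume: () => void): RouteResponse {
  try {
    return continueRoute(resume);
  } catch {
    return { status: 500, body: { error: "Internal Server Error" } };
  }
}

const dbDown = Object.assign(new Error("Can't reach database server at db:5432"), {
  code: "P1001",
});
const outage = handle(() => {
  throw dbDown;
});
const stale = handle(() => {
  throw new Error("snapshot mismatch");
});
```

In this sketch `outage` is a 500 whose body does not echo the database host, and `stale` stays a 422, so a retry against an outdated snapshot is still rejected permanently.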

Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.

No changeset: this is a server-side behaviour fix recorded via .server-changes. The @trigger.dev/core edits are internal run-engine worker plumbing, not a public API change.

Resuming a run after a wait calls the engine's continue endpoint. When the
database was briefly unreachable, that route caught the Prisma infrastructure
error and returned a non-retryable 422, so the worker aborted the run with
TASK_EXECUTION_ABORTED over a transient blip.

The continue route now lets infrastructure errors propagate to the generic 500
handler (scrubbed and retryable), matching how the trigger path already treats
them. The worker's continue call also retries with a longer, jittered backoff
so it can ride out an outage lasting tens of seconds without stampeding the
database on recovery. Genuine validation errors still return 422.
@changeset-bot

changeset-bot Bot commented Jul 5, 2026


⚠️ No Changeset found

Latest commit: 988f53d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@coderabbitai

coderabbitai Bot commented Jul 5, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The continue-run route now re-throws Prisma infrastructure errors instead of mapping them to 422 responses, so generic 500 handling can take over. The worker HTTP clients also add retry policies to the continue-run request path with jittered exponential backoff and capped attempts. A changeset entry records the resume retry fix for transient database unavailability.

Changes

| Area | Changes |
|---|---|
| Continue route error handling | Detects Prisma infrastructure errors via isInfrastructureError; re-throws them for 500 handling instead of returning 422; updates warning log message |
| Worker HTTP retry policy | Adds retry settings to continueRunExecution in supervisor and workload HTTP clients |
| Documentation | Adds changeset entry describing the resume retry fix |

Sequence Diagram(s)

sequenceDiagram
  participant WorkerClient
  participant ContinueRoute
  participant PrismaDB

  WorkerClient->>ContinueRoute: POST continue run execution
  ContinueRoute->>PrismaDB: perform continuation logic
  PrismaDB-->>ContinueRoute: infrastructure error
  ContinueRoute->>ContinueRoute: isInfrastructureError check
  alt infrastructure error
    ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
  else other error
    ContinueRoute-->>WorkerClient: 422 response
  end

Related issues: None specified

Related PRs: None specified

Suggested labels: bug, webapp

Suggested reviewers: None specified

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description explains the fix well, but it does not follow the required template and is missing Closes #, checklist items, Testing, Changelog, and Screenshots. | Add the required template sections: Closes #issue, checklist, testing steps, changelog, and screenshots, or mark non-applicable items clearly. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly describes the main change: retrying run resume through transient database outages. |
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/resume-retriable-on-transient-db-errors

Comment @coderabbitai help to get the list of available commands.

devin-ai-integration[bot]

This comment was marked as resolved.

The supervisor-to-engine hop is the one that reaches the continue endpoint,
so it is where a transient database outage surfaces as a retryable 5xx. Give
its continueRunExecution the same longer, jittered retry budget as the
workload client so it can ride out the outage.
@matt-aitken matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from 7b8d0a0 to 8334820 on July 5, 2026 14:33
@pkg-pr-new

pkg-pr-new Bot commented Jul 5, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@988f53d

trigger.dev

npm i https://pkg.pr.new/trigger.dev@988f53d

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@988f53d

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@988f53d

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@988f53d

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@988f53d

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@988f53d

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@988f53d

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@988f53d

commit: 988f53d

devin-ai-integration[bot]

This comment was marked as resolved.

The database-outage retry lives on the supervisor-to-engine hop; the workload
client only reaches the supervisor's workload server, so its retry rides out
supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.

Drop the worker HTTP-client retry tuning and keep only the continue route
change, so the fix is server-only. Swap the package changeset for a
.server-changes entry.
@matt-aitken matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from f73646b to 6c615be on July 5, 2026 17:03
@matt-aitken matt-aitken changed the title from "fix(webapp,core): retry run resume through transient database outages" to "fix(webapp): retry run resume through transient database outages" on Jul 5, 2026
Restore the extended, jittered retry on the workload and supervisor
continueRunExecution calls so the resume can ride out a transient database
outage. Recorded via .server-changes; no package changeset.
@matt-aitken matt-aitken changed the title from "fix(webapp): retry run resume through transient database outages" to "fix(webapp,core): retry run resume through transient database outages" on Jul 5, 2026

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 new potential issue.

Comment on lines +1 to +6
---
area: webapp
type: fix
---

Runs resuming after a wait no longer fail with TASK_EXECUTION_ABORTED when the database is briefly unreachable; the resume endpoint returns a retryable response for transient infrastructure errors instead of a permanent one.

Contributor

🟡 Missing changeset means the retry-on-resume fix won't ship to users in the next package release

The retry configuration is added to a published package, but no changeset is included (see .server-changes/resume-retry-transient-db.md:1-6), so the version of @trigger.dev/core won't be bumped and the behavioral change won't be released.

Impact: Users running the published SDK won't get the retry-on-resume fix until a changeset is added and the package is versioned.

Repository rules require a changeset for any change under packages/

CONTRIBUTING.md states: "If you are contributing a change to any packages in this monorepo (anything in either the /packages or /integrations directories), then you will need to add a changeset to your Pull Requests before they can be merged."

The table in CONTRIBUTING.md also says: "Both packages and server → Just the changeset" (no .server-changes/ file needed).

CLAUDE.md echoes: "When modifying any public package (packages/* or integrations/*), add a changeset."

This PR modifies packages/core/src/v3/runEngineWorker/supervisor/http.ts and packages/core/src/v3/runEngineWorker/workload/http.ts, both under packages/core which is the published @trigger.dev/core package. A changeset (via pnpm run changeset:add) selecting @trigger.dev/core as a patch is required.

Prompt for agents
This PR modifies packages/core (a published npm package) in addition to apps/webapp. Per CONTRIBUTING.md and CLAUDE.md, changes to packages/* require a changeset, not a .server-changes/ file. Run `pnpm run changeset:add` from the repo root, select `@trigger.dev/core`, choose `patch`, and describe the retry behavior change for the continue-run-execution endpoint. The .server-changes/resume-retry-transient-db.md file should be removed since mixed PRs (both packages and server) only need the changeset.
@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (2)
packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)

249-262: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Extract the retry policy into a shared constant.

The exact same retry object (minTimeoutInMs: 500, maxTimeoutInMs: 10_000, maxAttempts: 8, factor: 2, randomize: true) is duplicated verbatim in workload/http.ts's continueRunExecution. Extracting a shared constant (e.g. RESUME_HOP_RETRY_POLICY) would keep the two hops' policies in sync if they're ever re-tuned.

♻️ Proposed shared constant
+// e.g. in a shared file such as packages/core/src/v3/runEngineWorker/retry.ts
+export const RESUME_HOP_RETRY_POLICY = {
+  minTimeoutInMs: 500,
+  maxTimeoutInMs: 10_000,
+  maxAttempts: 8,
+  factor: 2,
+  randomize: true,
+} as const;
-        retry: {
-          minTimeoutInMs: 500,
-          maxTimeoutInMs: 10_000,
-          maxAttempts: 8,
-          factor: 2,
-          randomize: true,
-        },
+        retry: RESUME_HOP_RETRY_POLICY,
packages/core/src/v3/runEngineWorker/workload/http.ts (1)

125-152: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Same DRY concern as the supervisor client.

Identical retry object to SupervisorHttpClient.continueRunExecution; see the sibling comment in packages/core/src/v3/runEngineWorker/supervisor/http.ts for the extraction suggestion.
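For a sense of how long the quoted retry object can wait, here is a rough sketch of the delay schedule. The formula is an assumption (capped exponential backoff, with `randomize` multiplying each delay by a factor between 1 and 2); the worker's actual delay calculation is not shown in this thread, and `delayForAttempt` and `totalWaitMs` are hypothetical helpers.

```typescript
type RetryPolicy = {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
};

// Values copied from the review comment above.
const resumePolicy: RetryPolicy = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

// Assumed formula: min(max, jitter * min * factor^attemptIndex),
// where jitter is 1 + random() when randomize is set, else 1.
function delayForAttempt(
  policy: RetryPolicy,
  attemptIndex: number,
  random: () => number = Math.random
): number {
  const jitter = policy.randomize ? 1 + random() : 1;
  const raw = jitter * policy.minTimeoutInMs * Math.pow(policy.factor, attemptIndex);
  return Math.min(policy.maxTimeoutInMs, Math.round(raw));
}

// Sums the waits between attempts: maxAttempts attempts means maxAttempts - 1 waits.
function totalWaitMs(policy: RetryPolicy, random: () => number = Math.random): number {
  let total = 0;
  for (let i = 0; i < policy.maxAttempts - 1; i++) {
    total += delayForAttempt(policy, i, random);
  }
  return total;
}

const lowerBoundMs = totalWaitMs(resumePolicy, () => 0); // no jitter applied
const upperBoundMs = totalWaitMs(resumePolicy, () => 1); // jitter at its limit
```

Under this assumed formula the total wait lands between 35,500 ms and 45,000 ms, which is consistent with the description's "tens of seconds". The random factor spreads individual clients across that window rather than aligning their retries.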


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0752937d-02ab-4e38-a4fa-13fe9e6414e4

📥 Commits

Reviewing files that changed from the base of the PR and between 6c615be and 988f53d.

📒 Files selected for processing (2)
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
  • packages/core/src/v3/runEngineWorker/workload/http.ts
📜 Review details
⏰ Context from checks skipped due to timeout. (32)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: sdk-compat / Node.js 24.18 (blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: sdk-compat / Node.js 26.4 (blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-ubuntu-2404 - pnpm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: packages / 🧪 Unit Tests: Packages (2, 3)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-ubuntu-2404 - npm)
  • GitHub Check: packages / 🧪 Unit Tests: Packages (1, 3)
  • GitHub Check: packages / 🧪 Unit Tests: Packages (3, 3)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
⚠️ CI failures not shown inline (4)

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

ild-legacy-run-engine.fix3
  * [new tag]             build-manual-checkpoints.rc1 -> build-manual-checkpoints.rc1
  * [new tag]             build-metadata-upgrade-logging.rc1 -> build-metadata-upgrade-logging.rc1
  * [new tag]             build-metadata-upgrade-logging.rc2 -> build-metadata-upgrade-logging.rc2
  * [new tag]             build-metadata-upgrade-logging.rc3 -> build-metadata-upgrade-logging.rc3
  * [new tag]             build-new-build-system.rc.1 -> build-new-build-system.rc.1
  * [new tag]             build-otel-upgrade-rc.0     -> build-otel-upgrade-rc.0
  * [new tag]             build-otel-upgrade-rc.1     -> build-otel-upgrade-rc.1
  * [new tag]             build-pre-pull-deployments-rc.1 -> build-pre-pull-deployments-rc.1
  * [new tag]             build-prod-rescue-rc.1      -> build-prod-rescue-rc.1
  * [new tag]             build-rate-limiter-fix-rc.1 -> build-rate-limiter-fix-rc.1
  * [new tag]             build-re2.rc0               -> build-re2.rc0
  * [new tag]             build-realtime-v2-stream-fix -> build-realtime-v2-stream-fix
  * [new tag]             build-realtime-v2-stream-fix-2 -> build-realtime-v2-stream-fix-2
  * [new tag]             build-realtime-v2-stream-fix-3 -> build-realtime-v2-stream-fix-3
  * [new tag]             build-realtime-v2-stream-fix-4 -> build-realtime-v2-stream-fix-4
  * [new tag]             build-realtime-v2-stream-fix-5 -> build-realtime-v2-stream-fix-5
  * [new tag]             build-realtimestreams-dedupe -> build-realtimestreams-dedupe
  * [new tag]             build-registry-maintenance-rc.1 -> build-registry-maintenance-rc.1
  * [new tag]             build-registry-maintenance-rc.2 -> build-registry-maintenance-rc.2
  * [new tag]             build-remote-ecr-rc.0       -> build-remote-ecr-rc.0
  * [new tag]             build-reschedule-hotfix.rc1 -> build-reschedule-hotfix.rc1
  * [new tag]             build-resume-fixes.rc1      -> build-resume-fixes.rc1
  * [new tag]             build-resume-fix...

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

-rc.2         -> build-batching-rc.2
  * [new tag]             build-billing-0.0.1         -> build-billing-0.0.1
  * [new tag]             build-billing-0.0.2         -> build-billing-0.0.2
  * [new tag]             build-billing-0.0.3         -> build-billing-0.0.3
  * [new tag]             build-buildinfo-rc.0        -> build-buildinfo-rc.0
  * [new tag]             build-buildinfo-rc.1        -> build-buildinfo-rc.1
  * [new tag]             build-checkpoint-failover-rc.1 -> build-checkpoint-failover-rc.1
  * [new tag]             build-checkpoint-race-condition-1 -> build-checkpoint-race-condition-1
  * [new tag]             build-checkpoint-race-condition-2 -> build-checkpoint-race-condition-2
  * [new tag]             build-checkpoint-race-condition-3 -> build-checkpoint-race-condition-3
  * [new tag]             build-chris-test-blacksmith -> build-chris-test-blacksmith
  * [new tag]             build-chris-test-blacksmith-2 -> build-chris-test-blacksmith-2
  * [new tag]             build-cli-build-upgrade-rc.1 -> build-cli-build-upgrade-rc.1
  * [new tag]             build-clickhouse-reads-rc0  -> build-clickhouse-reads-rc0
  * [new tag]             build-clickhouse-reads-rc1  -> build-clickhouse-reads-rc1
  * [new tag]             build-compute.rc0           -> build-compute.rc0
  * [new tag]             build-compute.rc1           -> build-compute.rc1
  * [new tag]             build-compute.rc2           -> build-compute.rc2
  * [new tag]             build-compute.rc3           -> build-compute.rc3
  * [new tag]             build-compute.rc4           -> build-compute.rc4
  * [new tag]             build-compute.rc5           -> build-compute.rc5
  * [new tag]             build-compute.rc6           -> build-compute.rc6
  * [new tag]             build-corepack-offline-rc.0 -> build-corepack-offline-rc.0
  * [new tag]             build-current-deployment-rc.0 -> build-current-deployment-rc.0
  * [new tag]             build-dependabot-q2.rc0     -> build-...

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic import(); only use dynamic imports when resolving circular dependencies, enabling real code splitting, or conditionally loading a module at runtime.
Always import from @trigger.dev/sdk; never import from @trigger.dev/sdk/v3 or use deprecated client.defineJob.
In code that imports @trigger.dev/core, use subpath imports only and never import from the package root.

Files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
packages/core/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (packages/core/CLAUDE.md)

Never import the root package (@trigger.dev/core). Always use subpath imports such as @trigger.dev/core/v3, @trigger.dev/core/v3/utils, @trigger.dev/core/logger, or @trigger.dev/core/schemas

Files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
🧠 Learnings (9)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-06-13T19:53:13.759Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3937
File: packages/trigger-sdk/skills/realtime-and-frontend/SKILL.md:258-260
Timestamp: 2026-06-13T19:53:13.759Z
Learning: When reviewing code that uses `trigger.dev/react-hooks`’s `useRealtimeRun`, preserve the call signature where the first argument is the full realtime handle object (not `handle.id`). This is intentional to maintain type-safety and is consistent with the official docs; do not suggest changing the first argument from the handle object to `handle.id`.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-06-17T17:13:49.929Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3948
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.bulk-actions.$bulkActionParam/route.tsx:48-62
Timestamp: 2026-06-17T17:13:49.929Z
Learning: In triggerdotdev/trigger.dev, within `dashboardLoader`/`dashboardAction` (or similar context resolver code) whenever you resolve an organization ID from an organization slug for RBAC/enterprise authorization scope, always read from the primary Prisma client (`prisma`), not `$replica`. Using `$replica` can hit replica-lag and cause the RBAC lookup/authorization to run without the correct org scope (bypassing intended role enforcement). Implement the slug→org lookup with `prisma.organization.findFirst(...)` (or equivalent primary-client query) and add an inline comment documenting why the primary client is required (replica lag could lead to unscoped RBAC checks).

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-06-23T13:04:21.413Z
Learnt from: carderne
Repo: triggerdotdev/trigger.dev PR: 4023
File: apps/webapp/app/services/upsertBranch.server.ts:14-18
Timestamp: 2026-06-23T13:04:21.413Z
Learning: In TypeScript, it’s valid to `import { type X }` and then use `typeof X` in a type-only position, e.g. `type Alias = z.infer<typeof X>`. The `type` modifier suppresses the runtime import, but the type checker still has the full exported type so `z.infer<typeof X>` can resolve correctly. In code reviews, don’t flag this as a TypeScript compile error as long as `typeof X` is used in a type context (e.g., with `z.infer`, `type` aliases, generics), not as a runtime value.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
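The alias-versus-interface distinction in the learning above might look like the following sketch. All names here (`BackpressureReading`, `ReadingSource`, `FixedReadingSource`, `isOverloaded`) are hypothetical and do not reflect the actual shapes in `backpressureMonitor.ts`.

```typescript
// Data shape: modelled as a type alias, per the repo convention described above.
type BackpressureReading = {
  queueDepth: number;
  observedAt: Date;
};

// Behavioural contract: kept as an interface because collaborators implement it.
interface ReadingSource {
  read(): BackpressureReading;
}

// One possible collaborator that fulfils the contract with a fixed value.
class FixedReadingSource implements ReadingSource {
  private readonly queueDepth: number;

  constructor(queueDepth: number) {
    this.queueDepth = queueDepth;
  }

  read(): BackpressureReading {
    return { queueDepth: this.queueDepth, observedAt: new Date() };
  }
}

// Consumers depend on the contract, not on any concrete implementation.
function isOverloaded(source: ReadingSource, threshold: number): boolean {
  return source.read().queueDepth > threshold;
}
```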
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about 300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

  • packages/core/src/v3/runEngineWorker/workload/http.ts
  • packages/core/src/v3/runEngineWorker/supervisor/http.ts
🔇 Additional comments (2)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)

137-141: 🩺 Stability & Availability

Compounding retries across hops could stretch worst-case resume latency significantly.

Per the comment, this hop (worker→supervisor) now retries up to 8 times (worst-case ~45s+ of backoff alone) on top of the supervisor's own retry against the engine (also up to 8 attempts / ~45s+, per supervisor/http.ts). If the worker→supervisor leg itself fails or times out while the supervisor is mid-retry internally, the two policies stack rather than share a budget, so a single resume attempt could take several minutes in a bad-case outage. Unlike heartbeatRun/getSnapshotsSince in this same file, continueRunExecution also has no AbortSignal.timeout(...), so there's no explicit upper bound on a single attempt either. Worth confirming this compounded worst-case latency is acceptable for the resume flow (e.g., does anything downstream have its own timeout that could fire mid-retry and force-kill the run anyway, defeating the purpose of this fix?).
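To make the stacking concern concrete, here is a rough worked calculation. The policy values below (8 attempts, 1s minimum, 10s cap, factor 2) are illustrative assumptions chosen for the arithmetic, not values confirmed from the PR; jitter and per-request time are ignored, and the stacked figure assumes every outer attempt blocks for the inner hop's full backoff budget before failing.

```typescript
// Hypothetical policy shape; field names are assumptions for this sketch.
type RetryPolicy = {
  maxAttempts: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  factor: number;
};

// Sum of the waits between attempts for capped exponential backoff (no jitter).
function totalBackoffMs(policy: RetryPolicy): number {
  let total = 0;
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const delay = policy.minTimeoutInMs * Math.pow(policy.factor, retry);
    total += Math.min(delay, policy.maxTimeoutInMs);
  }
  return total;
}

// Upper bound when each outer attempt waits out the inner hop's whole budget,
// plus the outer hop's own waits between attempts.
function stackedWorstCaseMs(outer: RetryPolicy, inner: RetryPolicy): number {
  return outer.maxAttempts * totalBackoffMs(inner) + totalBackoffMs(outer);
}

// Assumed values, for illustration only.
const assumedPolicy: RetryPolicy = {
  maxAttempts: 8,
  minTimeoutInMs: 1000,
  maxTimeoutInMs: 10000,
  factor: 2,
};

const singleHopMs = totalBackoffMs(assumedPolicy); // 45000 (45s)
const stackedMs = stackedWorstCaseMs(assumedPolicy, assumedPolicy); // 405000 (6m45s)
```

Under these assumptions a single hop waits 45s in total, while two independent policies stacked this way could approach seven minutes, which is why a shared budget or an explicit per-attempt timeout may be worth considering.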

packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)

238-265: 🎯 Functional Correctness

wrapZodFetch already accepts the 4th options argument and forwards retry; 5xx responses are retried by shouldRetry.

> Likely an incorrect or invalid review comment.
