fix(webapp,core): retry run resume through transient database outages by matt-aitken · Pull Request #4161 · triggerdotdev/trigger.dev

matt-aitken · 2026-07-05T14:26:53Z

Summary

When the platform database is briefly unreachable while a run is resuming from a wait, the run no longer fails with TASK_EXECUTION_ABORTED. The worker now retries the resume through the outage instead of aborting on the first blip.

Root cause

Resuming a run calls the engine's continue worker-action endpoint. That route caught every error and returned a 422, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example P1001 "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with TASK_EXECUTION_ABORTED.

Fix

The continue route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via isInfrastructureError. Genuine validation errors (snapshot mismatch, invalid state) still return 422, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
The worker's continueRunExecution calls (both the runner-to-supervisor and supervisor-to-engine hops) retry with a longer, jittered backoff so they can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.
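The route-side half of the fix can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual route code: `handleContinueError` and `UnprocessableResponse` are hypothetical names, the response body text is invented, and the infrastructure check is passed in as a parameter because the real `isInfrastructureError` helper's logic is not shown in this PR description.

```typescript
// Hypothetical shape of the non-retryable response the route returns.
type UnprocessableResponse = { status: 422; body: { error: string } };

// Sketch of what the continue route's catch block decides for an error.
// `isInfra` stands in for the codebase's isInfrastructureError helper.
function handleContinueError(
  error: unknown,
  isInfra: (error: unknown) => boolean
): UnprocessableResponse {
  if (isInfra(error)) {
    // Propagate: the generic handler answers with a scrubbed 500,
    // which the worker's HTTP client treats as retryable.
    throw error;
  }
  // Validation failures (snapshot mismatch, invalid state) stay permanent,
  // so a stale retry is rejected rather than retried again.
  return { status: 422, body: { error: "Failed to continue run execution" } };
}
```

The point of the split is that only the caller-visible status changes: infrastructure errors become retryable, while the 422 path is untouched.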

Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.

No changeset: this is a server-side behaviour fix recorded via .server-changes. The @trigger.dev/core edits are internal run-engine worker plumbing, not a public API change.

Resuming a run after a wait calls the engine's continue endpoint. When the database was briefly unreachable, that route caught the Prisma infrastructure error and returned a non-retryable 422, so the worker aborted the run with TASK_EXECUTION_ABORTED over a transient blip. The continue route now lets infrastructure errors propagate to the generic 500 handler (scrubbed and retryable), matching how the trigger path already treats them. The worker's continue call also retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds without stampeding the database on recovery. Genuine validation errors still return 422.

changeset-bot · 2026-07-05T14:26:58Z

⚠️ No Changeset found

Latest commit: 988f53d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

coderabbitai · 2026-07-05T14:29:38Z

Walkthrough

The continue-run route now re-throws Prisma infrastructure errors instead of mapping them to 422 responses, so generic 500 handling can take over. The worker HTTP clients also add retry policies to the continue-run request path with jittered exponential backoff and capped attempts. A changeset entry records the resume retry fix for transient database unavailability.

Changes

| Area | Changes |
| --- | --- |
| Continue route error handling | Detects Prisma infrastructure errors via `isInfrastructureError`; re-throws them for 500 handling instead of returning 422; updates warning log message |
| Worker HTTP retry policy | Adds retry settings to `continueRunExecution` in supervisor and workload HTTP clients |
| Documentation | Adds changeset entry describing the resume retry fix |

Sequence Diagram(s)

sequenceDiagram
  participant WorkerClient
  participant ContinueRoute
  participant PrismaDB

  WorkerClient->>ContinueRoute: POST continue run execution
  ContinueRoute->>PrismaDB: perform continuation logic
  PrismaDB-->>ContinueRoute: infrastructure error
  ContinueRoute->>ContinueRoute: isInfrastructureError check
  alt infrastructure error
    ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
  else other error
    ContinueRoute-->>WorkerClient: 422 response
  end

Related issues: None specified

Related PRs: None specified

Suggested labels: bug, webapp

Suggested reviewers: None specified

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description explains the fix well, but it does not follow the required template and is missing Closes #, checklist items, Testing, Changelog, and Screenshots. | Add the required template sections: Closes `#issue`, checklist, testing steps, changelog, and screenshots, or mark non-applicable items clearly. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly describes the main change: retrying run resume through transient database outages. |

The supervisor-to-engine hop is the one that reaches the continue endpoint, so it is where a transient database outage surfaces as a retryable 5xx. Give its continueRunExecution the same longer, jittered retry budget as the workload client so it can ride out the outage.

pkg-pr-new · 2026-07-05T14:35:27Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@988f53d

trigger.dev

npm i https://pkg.pr.new/trigger.dev@988f53d

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@988f53d

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@988f53d

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@988f53d

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@988f53d

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@988f53d

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@988f53d

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@988f53d

commit: 988f53d

The database-outage retry lives on the supervisor-to-engine hop; the workload client only reaches the supervisor's workload server, so its retry rides out supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.

Drop the worker HTTP-client retry tuning and keep only the continue route change, so the fix is server-only. Swap the package changeset for a .server-changes entry.

Restore the extended, jittered retry on the workload and supervisor continueRunExecution calls so the resume can ride out a transient database outage. Recorded via .server-changes; no package changeset.

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-07-05T18:50:05Z

+---
+area: webapp
+type: fix
+---
+
+Runs resuming after a wait no longer fail with TASK_EXECUTION_ABORTED when the database is briefly unreachable; the resume endpoint returns a retryable response for transient infrastructure errors instead of a permanent one.


🟡 Missing changeset means the retry-on-resume fix won't ship to users in the next package release

The retry configuration is added to a published package, but the PR ships only a .server-changes entry (.server-changes/resume-retry-transient-db.md:1-6) and no changeset, so the version of @trigger.dev/core won't be bumped and the behavioral change won't be released.

Impact: Users running the published SDK won't get the retry-on-resume fix until a changeset is added and the package is versioned.

Repository rules require a changeset for any change under packages/

CONTRIBUTING.md states: "If you are contributing a change to any packages in this monorepo (anything in either the /packages or /integrations directories), then you will need to add a changeset to your Pull Requests before they can be merged."

The table in CONTRIBUTING.md also says: "Both packages and server → Just the changeset" (no .server-changes/ file needed).

CLAUDE.md echoes: "When modifying any public package (packages/* or integrations/*), add a changeset."

This PR modifies packages/core/src/v3/runEngineWorker/supervisor/http.ts and packages/core/src/v3/runEngineWorker/workload/http.ts, both under packages/core which is the published @trigger.dev/core package. A changeset (via pnpm run changeset:add) selecting @trigger.dev/core as a patch is required.

Prompt for agents

This PR modifies packages/core (a published npm package) in addition to apps/webapp. Per CONTRIBUTING.md and CLAUDE.md, changes to packages/* require a changeset, not a .server-changes/ file. Run `pnpm run changeset:add` from the repo root, select `@trigger.dev/core`, choose `patch`, and describe the retry behavior change for the continue-run-execution endpoint. The .server-changes/resume-retry-transient-db.md file should be removed since mixed PRs (both packages and server) only need the changeset.

coderabbitai

🧹 Nitpick comments (2)

packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)
249-262: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Extract the retry policy into a shared constant.

The exact same retry object (minTimeoutInMs: 500, maxTimeoutInMs: 10_000, maxAttempts: 8, factor: 2, randomize: true) is duplicated verbatim in workload/http.ts's continueRunExecution. Extracting a shared constant (e.g. RESUME_HOP_RETRY_POLICY) would keep the two hops' policies in sync if they're ever re-tuned.
♻️ Proposed shared constant
+// e.g. in a shared file such as packages/core/src/v3/runEngineWorker/retry.ts
+export const RESUME_HOP_RETRY_POLICY = {
+  minTimeoutInMs: 500,
+  maxTimeoutInMs: 10_000,
+  maxAttempts: 8,
+  factor: 2,
+  randomize: true,
+} as const;
-        retry: {
-          minTimeoutInMs: 500,
-          maxTimeoutInMs: 10_000,
-          maxAttempts: 8,
-          factor: 2,
-          randomize: true,
-        },
+        retry: RESUME_HOP_RETRY_POLICY,
packages/core/src/v3/runEngineWorker/workload/http.ts (1)

125-152: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Same DRY concern as the supervisor client.

Identical retry object to SupervisorHttpClient.continueRunExecution; see the sibling comment in packages/core/src/v3/runEngineWorker/supervisor/http.ts for the extraction suggestion.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0752937d-02ab-4e38-a4fa-13fe9e6414e4

📥 Commits

Reviewing files that changed from the base of the PR and between 6c615be and 988f53d.

📒 Files selected for processing (2)

packages/core/src/v3/runEngineWorker/supervisor/http.ts
packages/core/src/v3/runEngineWorker/workload/http.ts

📜 Review details

⏰ Context from checks skipped due to timeout. (32)

GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
GitHub Check: sdk-compat / Node.js 24.18 (blacksmith-4vcpu-ubuntu-2404)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
GitHub Check: sdk-compat / Node.js 26.4 (blacksmith-4vcpu-ubuntu-2404)
GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-ubuntu-2404 - pnpm)
GitHub Check: sdk-compat / Cloudflare Workers
GitHub Check: packages / 🧪 Unit Tests: Packages (2, 3)
GitHub Check: typecheck / typecheck
GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-ubuntu-2404 - npm)
GitHub Check: packages / 🧪 Unit Tests: Packages (1, 3)
GitHub Check: packages / 🧪 Unit Tests: Packages (3, 3)
GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp

⚠️ CI failures not shown inline (4)

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...

🧰 Additional context used

📓 Path-based instructions (5)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic import(); only use dynamic imports when resolving circular dependencies, enabling real code splitting, or conditionally loading a module at runtime.
Always import from @trigger.dev/sdk; never import from @trigger.dev/sdk/v3 or use deprecated client.defineJob.
In code that imports @trigger.dev/core, use subpath imports only and never import from the package root.

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

packages/core/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (packages/core/CLAUDE.md)

Never import the root package (@trigger.dev/core). Always use subpath imports such as @trigger.dev/core/v3, @trigger.dev/core/v3/utils, @trigger.dev/core/logger, or @trigger.dev/core/schemas

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

🧠 Learnings (9)

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts
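
The two P1001 learnings above boil down to one predicate. The sketch below is illustrative only: `isDatabaseUnreachable` and `PrismaLikeError` are hypothetical names, and the check is structural (it inspects properties instead of importing Prisma's error classes) so that it stays self-contained.

```typescript
// Structural type covering both shapes described in the learnings:
// PrismaClientKnownRequestError exposes `code`,
// PrismaClientInitializationError exposes `errorCode`.
type PrismaLikeError = { code?: unknown; errorCode?: unknown };

// Returns true when the error carries P1001 ("Can't reach database server")
// under either property name.
function isDatabaseUnreachable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as PrismaLikeError;
  return e.code === "P1001" || e.errorCode === "P1001";
}
```

Checking both properties is what keeps a mid-query connection drop and a client startup failure from being classified differently.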

📚 Learning: 2026-06-13T19:53:13.759Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3937
File: packages/trigger-sdk/skills/realtime-and-frontend/SKILL.md:258-260
Timestamp: 2026-06-13T19:53:13.759Z
Learning: When reviewing code that uses `trigger.dev/react-hooks`’s `useRealtimeRun`, preserve the call signature where the first argument is the full realtime handle object (not `handle.id`). This is intentional to maintain type-safety and is consistent with the official docs; do not suggest changing the first argument from the handle object to `handle.id`.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-06-17T17:13:49.929Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3948
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.bulk-actions.$bulkActionParam/route.tsx:48-62
Timestamp: 2026-06-17T17:13:49.929Z
Learning: In triggerdotdev/trigger.dev, within `dashboardLoader`/`dashboardAction` (or similar context resolver code) whenever you resolve an organization ID from an organization slug for RBAC/enterprise authorization scope, always read from the primary Prisma client (`prisma`), not `$replica`. Using `$replica` can hit replica-lag and cause the RBAC lookup/authorization to run without the correct org scope (bypassing intended role enforcement). Implement the slug→org lookup with `prisma.organization.findFirst(...)` (or equivalent primary-client query) and add an inline comment documenting why the primary client is required (replica lag could lead to unscoped RBAC checks).

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-06-23T13:04:21.413Z

Learnt from: carderne
Repo: triggerdotdev/trigger.dev PR: 4023
File: apps/webapp/app/services/upsertBranch.server.ts:14-18
Timestamp: 2026-06-23T13:04:21.413Z
Learning: In TypeScript, it’s valid to `import { type X }` and then use `typeof X` in a type-only position, e.g. `type Alias = z.infer<typeof X>`. The `type` modifier suppresses the runtime import, but the type checker still has the full exported type so `z.infer<typeof X>` can resolve correctly. In code reviews, don’t flag this as a TypeScript compile error as long as `typeof X` is used in a type context (e.g., with `z.infer`, `type` aliases, generics), not as a runtime value.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-06-04T18:16:35.386Z

Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

📚 Learning: 2026-06-09T17:58:04.699Z

Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about 300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

packages/core/src/v3/runEngineWorker/workload/http.ts
packages/core/src/v3/runEngineWorker/supervisor/http.ts

🔇 Additional comments (2)

packages/core/src/v3/runEngineWorker/workload/http.ts (1)

137-141: 🩺 Stability & Availability

Compounding retries across hops could stretch worst-case resume latency significantly.

Per the comment, this hop (worker→supervisor) now retries up to 8 times (worst-case ~45s+ of backoff alone) on top of the supervisor's own retry against the engine (also up to 8 attempts / ~45s+, per supervisor/http.ts). If the worker→supervisor leg itself fails or times out while the supervisor is mid-retry internally, the two policies stack rather than share a budget, so a single resume attempt could take several minutes in a bad-case outage. Unlike heartbeatRun/getSnapshotsSince in this same file, continueRunExecution also has no AbortSignal.timeout(...), so there's no explicit upper bound on a single attempt either. Worth confirming this compounded worst-case latency is acceptable for the resume flow (e.g., does anything downstream have its own timeout that could fire mid-retry and force-kill the run anyway, defeating the purpose of this fix?).
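
To make the stacking concern concrete, here is a rough sketch of the arithmetic. The retry settings are assumptions, not the PR's actual values: `assumedPolicy` (8 attempts, 1s initial delay, factor 2, 10s cap) was picked only because its backoff sums to the ~45s the comment quotes. `RetryPolicy`, `worstCaseBackoffMs`, and `stackedWorstCaseMs` are hypothetical helper names, and request time itself is ignored.

```typescript
// Hypothetical retry policy shape; field names are illustrative only.
type RetryPolicy = {
  maxAttempts: number;
  minTimeoutMs: number;
  maxTimeoutMs: number;
  factor: number;
};

// Upper bound on total sleep time for one hop: one capped exponential
// delay between each pair of consecutive attempts.
function worstCaseBackoffMs(policy: RetryPolicy): number {
  let total = 0;
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const uncapped = policy.minTimeoutMs * Math.pow(policy.factor, retry);
    total += Math.min(policy.maxTimeoutMs, uncapped);
  }
  return total;
}

// Bad case for two stacked hops: every outer attempt waits for the inner
// hop to exhaust its own retries, and the outer hop also sleeps between
// its own attempts.
function stackedWorstCaseMs(outer: RetryPolicy, inner: RetryPolicy): number {
  return outer.maxAttempts * worstCaseBackoffMs(inner) + worstCaseBackoffMs(outer);
}

// Assumed values, chosen to land near the ~45s figure in the review.
const assumedPolicy: RetryPolicy = {
  maxAttempts: 8,
  minTimeoutMs: 1_000,
  maxTimeoutMs: 10_000,
  factor: 2,
};

console.log(worstCaseBackoffMs(assumedPolicy)); // 45000
console.log(stackedWorstCaseMs(assumedPolicy, assumedPolicy)); // 405000
```

Under these assumed numbers a single hop sleeps for at most 45s, but two stacked hops could spend up to 405s (about 6.75 minutes) in backoff alone, which matches the "several minutes" estimate. Jitter that draws each delay from below the cap would shorten the typical case without changing this upper bound.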

packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)

238-265: 🎯 Functional Correctness

wrapZodFetch already accepts the 4th options argument and forwards retry; 5xx responses are retried by shouldRetry.

> Likely an incorrect or invalid review comment.

matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from 7b8d0a0 to 8334820 Compare July 5, 2026 14:33

ericallam approved these changes Jul 5, 2026

refactor(webapp,core): scope resume-retry fix to the server route

6c615be

Drop the worker HTTP-client retry tuning and keep only the continue route change, so the fix is server-only. Swap the package changeset for a .server-changes entry.

matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from f73646b to 6c615be Compare July 5, 2026 17:03

matt-aitken changed the title ~~fix(webapp,core): retry run resume through transient database outages~~ fix(webapp): retry run resume through transient database outages Jul 5, 2026

fix(core): jittered retry on the resume hops for transient DB outages

988f53d

Restore the extended, jittered retry on the workload and supervisor continueRunExecution calls so the resume can ride out a transient database outage. Recorded via .server-changes; no package changeset.

matt-aitken changed the title ~~fix(webapp): retry run resume through transient database outages~~ fix(webapp,core): retry run resume through transient database outages Jul 5, 2026

devin-ai-integration Bot reviewed Jul 5, 2026

coderabbitai Bot reviewed Jul 5, 2026

d-cs approved these changes Jul 5, 2026

matt-aitken wants to merge 5 commits into main from fix/resume-retriable-on-transient-db-errors

changeset-bot Bot commented Jul 5, 2026

⚠️ No Changeset found
