fix(webapp,core): retry run resume through transient database outages #4161
matt-aitken wants to merge 5 commits into
Resuming a run after a wait calls the engine's continue endpoint. When the database was briefly unreachable, that route caught the Prisma infrastructure error and returned a non-retryable 422, so the worker aborted the run with TASK_EXECUTION_ABORTED over a transient blip. The continue route now lets infrastructure errors propagate to the generic 500 handler (scrubbed and retryable), matching how the trigger path already treats them. The worker's continue call also retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds without stampeding the database on recovery. Genuine validation errors still return 422.
Walkthrough
The continue-run route now re-throws Prisma infrastructure errors instead of mapping them to 422 responses, so generic 500 handling can take over. The worker HTTP clients also add retry policies to the continue-run request path with jittered exponential backoff and capped attempts. A changeset entry records the resume retry fix for transient database unavailability.
Sequence Diagram(s)
sequenceDiagram
    participant WorkerClient
    participant ContinueRoute
    participant PrismaDB
    WorkerClient->>ContinueRoute: POST continue run execution
    ContinueRoute->>PrismaDB: perform continuation logic
    PrismaDB-->>ContinueRoute: infrastructure error
    ContinueRoute->>ContinueRoute: isInfrastructureError check
    alt infrastructure error
        ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
    else other error
        ContinueRoute-->>WorkerClient: 422 response
    end
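The branch in the diagram can be sketched roughly as follows. This is a minimal illustration, not the webapp's actual route: `handleContinue` and `RouteResult` are hypothetical names, and the body of `isInfrastructureError` is an assumption (the PR only names the predicate). Detection here is assumed to key off Prisma's `P1001` code on either `code` or `errorCode`.

```typescript
// Hypothetical result shape for the sketch; the real route returns framework responses.
type RouteResult = { status: number; body: Record<string, unknown> };

// Assumption: infrastructure errors are recognised by Prisma error code.
// P1001 is Prisma's "Can't reach database server".
const INFRASTRUCTURE_CODES = new Set(["P1001"]);

function isInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const candidate = error as { code?: unknown; errorCode?: unknown };
  const codes = [candidate.code, candidate.errorCode];
  return codes.some((c) => typeof c === "string" && INFRASTRUCTURE_CODES.has(c));
}

async function handleContinue(continueRun: () => Promise<void>): Promise<RouteResult> {
  try {
    await continueRun();
    return { status: 200, body: { ok: true } };
  } catch (error) {
    if (isInfrastructureError(error)) {
      // Re-throw so a generic handler can answer with a scrubbed, retryable 500.
      throw error;
    }
    // Validation failures (e.g. snapshot mismatch) stay permanent.
    const message = error instanceof Error ? error.message : "Unknown error";
    return { status: 422, body: { error: message } };
  }
}
```

The point of the split is that the caller only ever sees a 422 for failures that retrying cannot fix.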
The supervisor-to-engine hop is the one that reaches the continue endpoint, so it is where a transient database outage surfaces as a retryable 5xx. Give its continueRunExecution the same longer, jittered retry budget as the workload client so it can ride out the outage.
7b8d0a0 to 8334820
The database-outage retry lives on the supervisor-to-engine hop; the workload client only reaches the supervisor's workload server, so its retry rides out supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.
Drop the worker HTTP-client retry tuning and keep only the continue route change, so the fix is server-only. Swap the package changeset for a .server-changes entry.
f73646b to 6c615be
Restore the extended, jittered retry on the workload and supervisor continueRunExecution calls so the resume can ride out a transient database outage. Recorded via .server-changes; no package changeset.
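A minimal sketch of what an extended, jittered retry on a continue call might look like. The policy values are the ones quoted in the review of this PR; everything else is an assumption for illustration: `postWithRetry`, `backoffDelayMs`, and `HopResponse` are hypothetical names, jitter is assumed to scale the base delay by a factor in [1, 2), and only 5xx responses are assumed to be retried.

```typescript
type RetryPolicy = {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
};

// Values as quoted in the review of this PR.
const resumeRetry: RetryPolicy = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

// Delay to wait after the given failed attempt (1-based).
// Assumption: jitter multiplies the exponential base by a factor in [1, 2).
function backoffDelayMs(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random
): number {
  const jitter = policy.randomize ? 1 + random() : 1;
  const raw = jitter * policy.minTimeoutInMs * Math.pow(policy.factor, attempt - 1);
  return Math.min(policy.maxTimeoutInMs, Math.round(raw));
}

// Hypothetical response shape; a real client would carry a parsed body too.
type HopResponse = { status: number };

async function postWithRetry(
  policy: RetryPolicy,
  send: () => Promise<HopResponse>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<HopResponse> {
  let attempt = 1;
  while (true) {
    const response = await send();
    // Assumption: 5xx is transient; 4xx such as 422 is permanent.
    const retryable = response.status >= 500;
    if (!retryable || attempt >= policy.maxAttempts) return response;
    await sleep(backoffDelayMs(policy, attempt));
    attempt++;
  }
}
```

Because each client draws its own jitter, runs that failed at the same moment spread their retries out instead of hitting the recovering database together.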
---
area: webapp
type: fix
---

Runs resuming after a wait no longer fail with TASK_EXECUTION_ABORTED when the database is briefly unreachable; the resume endpoint returns a retryable response for transient infrastructure errors instead of a permanent one.
🟡 Missing changeset means the retry-on-resume fix won't ship to users in the next package release
The retry configuration is added to a published package but no changeset is included (only .server-changes/resume-retry-transient-db.md:1-6), so the version of @trigger.dev/core won't be bumped and the behavioral change won't be released.
Impact: Users running the published SDK won't get the retry-on-resume fix until a changeset is added and the package is versioned.
Repository rules require a changeset for any change under packages/
CONTRIBUTING.md states: "If you are contributing a change to any packages in this monorepo (anything in either the /packages or /integrations directories), then you will need to add a changeset to your Pull Requests before they can be merged."
The table in CONTRIBUTING.md also says: "Both packages and server → Just the changeset" (no .server-changes/ file needed).
CLAUDE.md echoes: "When modifying any public package (packages/* or integrations/*), add a changeset."
This PR modifies packages/core/src/v3/runEngineWorker/supervisor/http.ts and packages/core/src/v3/runEngineWorker/workload/http.ts, both under packages/core which is the published @trigger.dev/core package. A changeset (via pnpm run changeset:add) selecting @trigger.dev/core as a patch is required.
Prompt for agents
This PR modifies packages/core (a published npm package) in addition to apps/webapp. Per CONTRIBUTING.md and CLAUDE.md, changes to packages/* require a changeset, not a .server-changes/ file. Run `pnpm run changeset:add` from the repo root, select `@trigger.dev/core`, choose `patch`, and describe the retry behavior change for the continue-run-execution endpoint. The .server-changes/resume-retry-transient-db.md file should be removed since mixed PRs (both packages and server) only need the changeset.
🧹 Nitpick comments (2)
packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)
249-262: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Extract the retry policy into a shared constant.
The exact same retry object (`minTimeoutInMs: 500, maxTimeoutInMs: 10_000, maxAttempts: 8, factor: 2, randomize: true`) is duplicated verbatim in `workload/http.ts`'s `continueRunExecution`. Extracting a shared constant (e.g. `RESUME_HOP_RETRY_POLICY`) would keep the two hops' policies in sync if they're ever re-tuned.
♻️ Proposed shared constant
    +// e.g. in a shared file such as packages/core/src/v3/runEngineWorker/retry.ts
    +export const RESUME_HOP_RETRY_POLICY = {
    +  minTimeoutInMs: 500,
    +  maxTimeoutInMs: 10_000,
    +  maxAttempts: 8,
    +  factor: 2,
    +  randomize: true,
    +} as const;

    -  retry: {
    -    minTimeoutInMs: 500,
    -    maxTimeoutInMs: 10_000,
    -    maxAttempts: 8,
    -    factor: 2,
    -    randomize: true,
    -  },
    +  retry: RESUME_HOP_RETRY_POLICY,
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
125-152: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Same DRY concern as the supervisor client.
Identical retry object to `SupervisorHttpClient.continueRunExecution`; see the sibling comment in `packages/core/src/v3/runEngineWorker/supervisor/http.ts` for the extraction suggestion.
⚠️ CI failures not shown inline (4)
GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(webapp,core): retry run resume through transient database outages
Conclusion: failure
GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages
Conclusion: failure
GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages
Conclusion: failure
GitHub Actions: 📝 Agent Instructions Audit / audit: fix(webapp,core): retry run resume through transient database outages
Conclusion: failure
🔇 Additional comments (2)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
137-141: 🩺 Stability & Availability
Compounding retries across hops could stretch worst-case resume latency significantly.
Per the comment, this hop (worker→supervisor) now retries up to 8 times (worst-case ~45s+ of backoff alone) on top of the supervisor's own retry against the engine (also up to 8 attempts / ~45s+, per `supervisor/http.ts`). If the worker→supervisor leg itself fails or times out while the supervisor is mid-retry internally, the two policies stack rather than share a budget, so a single resume attempt could take several minutes in a bad-case outage. Unlike `heartbeatRun`/`getSnapshotsSince` in this same file, `continueRunExecution` also has no `AbortSignal.timeout(...)`, so there's no explicit upper bound on a single attempt either. Worth confirming this compounded worst-case latency is acceptable for the resume flow (e.g., does anything downstream have its own timeout that could fire mid-retry and force-kill the run anyway, defeating the purpose of this fix?).
packages/core/src/v3/runEngineWorker/supervisor/http.ts (1)
238-265: 🎯 Functional Correctness
`wrapZodFetch` already accepts the 4th options argument and forwards `retry`; 5xx responses are retried by `shouldRetry`.
> Likely an incorrect or invalid review comment.
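The "~45s of backoff" figure in the compounding-retry comment can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not the client's actual delay code: `totalBackoffMs` is a hypothetical helper, and jitter is assumed to multiply the base delay by a factor in [1, 2) before the cap is applied. It counts only the waits between attempts, not the time each request takes.

```typescript
type RetryPolicy = {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
};

// Policy values as quoted in the review.
const policy: RetryPolicy = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
};

// Sum of all waits for one hop at a fixed jitter multiplier.
// maxAttempts requests means maxAttempts - 1 waits between them.
function totalBackoffMs(p: RetryPolicy, jitterMultiplier: number): number {
  let total = 0;
  for (let attempt = 1; attempt < p.maxAttempts; attempt++) {
    const raw = jitterMultiplier * p.minTimeoutInMs * Math.pow(p.factor, attempt - 1);
    total += Math.min(p.maxTimeoutInMs, raw);
  }
  return total;
}

const noJitterMs = totalBackoffMs(policy, 1); // 500+1000+2000+4000+8000+10000+10000 = 35500
const upperBoundMs = totalBackoffMs(policy, 2); // 1000+2000+4000+8000+10000+10000+10000 = 45000
console.log({ noJitterMs, upperBoundMs });
```

Under these assumptions a single hop waits between 35.5 s and just under 45 s in total, which matches the reviewer's estimate. How the two hops combine depends on whether the outer call fails while the inner one is still retrying, so the several-minute figure is a bad-case bound rather than something this arithmetic proves.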
Summary
When the platform database is briefly unreachable while a run is resuming from a wait, the run no longer fails with `TASK_EXECUTION_ABORTED`. The worker now retries the resume through the outage instead of aborting on the first blip.
Root cause
Resuming a run calls the engine's `continue` worker-action endpoint. That route caught every error and returned a `422`, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example `P1001` "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with `TASK_EXECUTION_ABORTED`.
Fix
- The `continue` route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via `isInfrastructureError`. Genuine validation errors (snapshot mismatch, invalid state) still return `422`, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
- The `continueRunExecution` calls (both the runner-to-supervisor and supervisor-to-engine hops) retry with a longer, jittered backoff so they can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.
Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.
No changeset: this is a server-side behaviour fix recorded via `.server-changes`. The `@trigger.dev/core` edits are internal run-engine worker plumbing, not a public API change.