6 changes: 6 additions & 0 deletions .server-changes/resume-retry-transient-db.md
@@ -0,0 +1,6 @@
---
area: webapp
type: fix
---

Runs resuming after a wait no longer fail with TASK_EXECUTION_ABORTED when the database is briefly unreachable; the resume endpoint returns a retryable response for transient infrastructure errors instead of a permanent one.
Comment on lines +1 to +6

🟡 Missing changeset means the retry-on-resume fix won't ship to users in the next package release

The retry configuration is added to a published package, but the PR includes only a server-changes note (.server-changes/resume-retry-transient-db.md:1-6) and no changeset, so the version of @trigger.dev/core won't be bumped and the behavioral change won't be released.

Impact: Users running the published SDK won't get the retry-on-resume fix until a changeset is added and the package is versioned.

Repository rules require a changeset for any change under packages/

CONTRIBUTING.md states: "If you are contributing a change to any packages in this monorepo (anything in either the /packages or /integrations directories), then you will need to add a changeset to your Pull Requests before they can be merged."

The table in CONTRIBUTING.md also says: "Both packages and server → Just the changeset" (no .server-changes/ file needed).

CLAUDE.md echoes: "When modifying any public package (packages/* or integrations/*), add a changeset."

This PR modifies packages/core/src/v3/runEngineWorker/supervisor/http.ts and packages/core/src/v3/runEngineWorker/workload/http.ts, both under packages/core, which is the published @trigger.dev/core package. A changeset (via pnpm run changeset:add) selecting @trigger.dev/core as a patch is required.

Prompt for agents
This PR modifies packages/core (a published npm package) in addition to apps/webapp. Per CONTRIBUTING.md and CLAUDE.md, changes to packages/* require a changeset, not a .server-changes/ file. Run `pnpm run changeset:add` from the repo root, select `@trigger.dev/core`, choose `patch`, and describe the retry behavior change for the continue-run-execution endpoint. The .server-changes/resume-retry-transient-db.md file should be removed since mixed PRs (both packages and server) only need the changeset.
@@ -4,7 +4,7 @@ import type { WorkerApiContinueRunExecutionRequestBody } from "@trigger.dev/core
import { z } from "zod";
import { logger } from "~/services/logger.server";
import { createLoaderWorkerApiRoute } from "~/services/routeBuilders/apiBuilder.server";
- import { clientSafeErrorMessage } from "~/utils/prismaErrors";
+ import { clientSafeErrorMessage, isInfrastructureError } from "~/utils/prismaErrors";

export const loader = createLoaderWorkerApiRoute(
{
@@ -31,7 +31,21 @@ export const loader = createLoaderWorkerApiRoute(

return json(continuationResult);
} catch (error) {
- logger.warn("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
+ logger.warn("Failed to continue run execution", {
+ runFriendlyId,
+ snapshotFriendlyId,
+ error,
+ });

// A Prisma infrastructure error (e.g. P1001 "Can't reach database
// server") means the DB was transiently unreachable while resuming. A 422
// is non-retryable, so the worker would permanently abort a run over a
// blip. Let it propagate to the generic 500 handler, which scrubs the
// message and is retried by the worker's HTTP client.
if (isInfrastructureError(error)) {
throw error;
}

if (error instanceof Error) {
throw json({ error: clientSafeErrorMessage(error) }, { status: 422 });
}
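
The branching in the catch block above can be sketched in isolation. This is a minimal illustration, not the repository's implementation: `looksLikeInfrastructureError` and `classifyResumeFailure` are hypothetical names, and the set of Prisma codes treated as transient is an assumption (the real `isInfrastructureError` in `~/utils/prismaErrors` may check different things).

```typescript
// ASSUMPTION: a small set of Prisma connection-level error codes stands in
// for whatever the real isInfrastructureError helper checks.
const TRANSIENT_PRISMA_CODES = new Set(["P1001", "P1002", "P1017"]);

// Hypothetical stand-in for isInfrastructureError.
function looksLikeInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const code = (error as { code?: unknown }).code;
  return typeof code === "string" && TRANSIENT_PRISMA_CODES.has(code);
}

type ResumeFailureKind = "retryable" | "permanent" | "unknown";

// Mirrors the order of checks in the catch block: infrastructure errors are
// rethrown (ending up as a retryable 5xx), other Error instances become a
// non-retryable 422, and anything else is left unclassified here.
function classifyResumeFailure(error: unknown): ResumeFailureKind {
  if (looksLikeInfrastructureError(error)) return "retryable";
  if (error instanceof Error) return "permanent";
  return "unknown";
}

const dbDown = Object.assign(new Error("Can't reach database server"), { code: "P1001" });
console.log(classifyResumeFailure(dbDown));
console.log(classifyResumeFailure(new Error("snapshot mismatch")));
```

The order matters: a Prisma connection error is also an `Error` instance, so checking `instanceof Error` first would turn the transient case into a permanent 422.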
15 changes: 15 additions & 0 deletions packages/core/src/v3/runEngineWorker/supervisor/http.ts
@@ -245,6 +245,21 @@ export class SupervisorHttpClient {
...this.defaultHeaders,
...this.runnerIdHeader(runnerId),
},
},
{
// This is the hop that reaches the engine, so it's where a transient
// database outage during resume surfaces (as a retryable 5xx). Resuming
// is idempotent server-side (guarded by the snapshot id), so retry
// generously to ride out the outage rather than aborting the run.
// `randomize` jitters the delay so a fleet of runs resuming at once
// doesn't stampede the DB the moment it recovers.
retry: {
minTimeoutInMs: 500,
maxTimeoutInMs: 10_000,
maxAttempts: 8,
factor: 2,
randomize: true,
},
}
);
}
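
To get a feel for how long the retry settings above could ride out an outage, the schedule can be sketched as follows. This is an illustration under stated assumptions, not the retry logic in `@trigger.dev/core`: it assumes plain exponential backoff (`minTimeoutInMs * factor^(attempt - 1)`, capped at `maxTimeoutInMs`), that `maxAttempts` counts total tries, and that `randomize` multiplies the delay by a value in [1, 2). The names `RetrySketchOptions`, `sketchDelayMs`, and `sketchSchedule` are hypothetical.

```typescript
// Hypothetical option shape mirroring the fields used in the diff above.
interface RetrySketchOptions {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
}

// Delay before retry number `attempt` (1-based).
// ASSUMPTION: jitter scales the base delay by a factor in [1, 2).
function sketchDelayMs(
  opts: RetrySketchOptions,
  attempt: number,
  random: () => number = Math.random
): number {
  const jitter = opts.randomize ? 1 + random() : 1;
  const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
  return Math.round(Math.min(opts.maxTimeoutInMs, raw));
}

// ASSUMPTION: maxAttempts total tries means maxAttempts - 1 waits between them.
function sketchSchedule(opts: RetrySketchOptions, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    delays.push(sketchDelayMs(opts, attempt, random));
  }
  return delays;
}

const resumeRetry: RetrySketchOptions = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

// Jitter disabled to show the deterministic baseline.
const baseline = sketchSchedule({ ...resumeRetry, randomize: false });
console.log(baseline); // [500, 1000, 2000, 4000, 8000, 10000, 10000]
console.log(baseline.reduce((sum, d) => sum + d, 0)); // 35500
```

Under these assumptions the unjittered waits add up to 35.5 seconds, and jitter can only lengthen individual waits up to the 10-second cap.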
14 changes: 14 additions & 0 deletions packages/core/src/v3/runEngineWorker/workload/http.ts
@@ -132,6 +132,20 @@ export class WorkloadHttpClient {
headers: {
...this.defaultHeaders(),
},
},
{
// This hop only reaches the supervisor's workload server, so retry
// generously with jittered backoff to ride out a transient blip
// talking to the supervisor (e.g. a restart) rather than aborting the
// run. Database outages surface one hop further in, on the
// supervisor-to-engine call, which carries its own retry for them.
retry: {
minTimeoutInMs: 500,
maxTimeoutInMs: 10_000,
maxAttempts: 8,
factor: 2,
randomize: true,
},
}
)
);