fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy by matt-aitken · Pull Request #3552 · triggerdotdev/trigger.dev

matt-aitken · 2026-05-11T16:11:03Z

Summary

TASK_PROCESS_SIGSEGV was hard-classified as non-retriable in shouldRetryError (packages/core/src/v3/errors.ts), failing the run on the first segfault regardless of the user's retry policy.
That assumed SIGSEGV is always a deterministic native crash. For Node tasks that's not reliably true — many production SIGSEGVs are flaky.
This flips SIGSEGV into the return true branch of shouldRetryError. The existing shouldLookupRetrySettings + lockedRetryConfig + maxAttempts chain in internal-packages/run-engine/src/engine/retrying.ts then gates the retry — same path SIGTERM and uncaught-exception already use. Tasks without a retry policy still fail fast.
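
The classification change described above can be sketched as follows. This is a simplified, hypothetical model rather than the actual body of shouldRetryError: the function name shouldRetryErrorSketch is invented here, only the codes mentioned in this PR are listed, and the TASK_PROCESS_ prefix on SIGTERM and SIGKILL_TIMEOUT is assumed to follow the same naming as the SIGSEGV and OOM codes.

```typescript
// Hypothetical, simplified model of the gate in packages/core/src/v3/errors.ts.
// Only the error codes discussed in this PR are included.
type ProcessErrorCode =
  | "TASK_PROCESS_SIGSEGV"
  | "TASK_PROCESS_SIGTERM" // assumed name for the SIGTERM code
  | "TASK_PROCESS_SIGKILL_TIMEOUT" // assumed name for the SIGKILL_TIMEOUT code
  | "TASK_PROCESS_OOM_KILLED"
  | "TASK_PROCESS_MAYBE_OOM_KILLED"
  | "TASK_RUN_UNCAUGHT_EXCEPTION";

function shouldRetryErrorSketch(code: ProcessErrorCode): boolean {
  switch (code) {
    // "true" only means the error may be retried; the user's retry
    // policy still decides whether another attempt actually happens.
    case "TASK_PROCESS_SIGSEGV": // moved into this branch by the PR
    case "TASK_PROCESS_SIGTERM":
    case "TASK_RUN_UNCAUGHT_EXCEPTION":
      return true;

    // Still rejected at this gate.
    case "TASK_PROCESS_SIGKILL_TIMEOUT":
    case "TASK_PROCESS_OOM_KILLED":
    case "TASK_PROCESS_MAYBE_OOM_KILLED":
      return false;
  }
}
```

Returning true here does not force a retry. It only lets the run reach the retry-policy check, which is why tasks without a policy keep failing fast.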

Why retry

Common Node SIGSEGV causes that are non-deterministic across processes:

Native addon races — sharp, canvas, better-sqlite3, node-rdkafka, bcrypt, etc. libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process → retry often succeeds.
JIT / GC interaction — V8 turbofan deopt or GC during a native callback. Timing-dependent.
Near-OOM in native code — when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of clean OOM-kill.
Host / hardware issues — bit flips, kernel quirks. Retry lands on a different host.

The genuinely deterministic case (a bad pointer in user code that always trips the same addon) is real, but it's a subset — and maxAttempts already bounds the damage.

Pre-existing inconsistency this resolves

shouldRetryError returned false for TASK_PROCESS_SIGSEGV → fail_run.
shouldLookupRetrySettings already lists TASK_PROCESS_SIGSEGV as retry-config-aware — but that branch was unreachable because shouldRetryError short-circuited first in retrying.ts:86-90.
We already retry TASK_RUN_UNCAUGHT_EXCEPTION (clearly a user-code bug) under the user's retry policy. Refusing to retry SIGSEGV was the odd one out.

shouldLookupRetrySettings reads like the intended behaviour; shouldRetryError looks like a stale gate.
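
To illustrate why the retry-config branch was unreachable, here is a minimal sketch of the ordering described above. It is not the code in retrying.ts: decideOutcome, RETRIABLE_BEFORE and RETRIABLE_AFTER are hypothetical names, the policy step stands in for the whole shouldLookupRetrySettings + lockedRetryConfig + maxAttempts chain, and the comparison of attemptNumber against maxAttempts is an assumption about how the bound is applied.

```typescript
type RetryConfig = { maxAttempts: number };
type Outcome = "retry" | "fail_run";

// Hypothetical snapshots of what the first gate accepts, before and after the PR.
const RETRIABLE_BEFORE: ReadonlySet<string> = new Set([
  "TASK_PROCESS_SIGTERM",
  "TASK_RUN_UNCAUGHT_EXCEPTION",
]);
const RETRIABLE_AFTER: ReadonlySet<string> = new Set([
  "TASK_PROCESS_SIGTERM",
  "TASK_RUN_UNCAUGHT_EXCEPTION",
  "TASK_PROCESS_SIGSEGV",
]);

function decideOutcome(
  code: string,
  attemptNumber: number,
  lockedRetryConfig: RetryConfig | undefined,
  retriable: ReadonlySet<string>
): Outcome {
  // Gate 1: the shouldRetryError-style check runs first and short-circuits.
  if (!retriable.has(code)) return "fail_run";

  // Gate 2: the retry policy. A task with no policy fails fast.
  if (!lockedRetryConfig) return "fail_run";

  // Assumed bound: retry only while attempts remain.
  return attemptNumber < lockedRetryConfig.maxAttempts ? "retry" : "fail_run";
}

const policy: RetryConfig = { maxAttempts: 3 };

// Before: the policy is never consulted for SIGSEGV.
const before = decideOutcome("TASK_PROCESS_SIGSEGV", 1, policy, RETRIABLE_BEFORE);
// After: the same run reaches the policy and is retried.
const after = decideOutcome("TASK_PROCESS_SIGSEGV", 1, policy, RETRIABLE_AFTER);
console.log(before, after);
```

With the old gate, the presence of a retry policy makes no difference for SIGSEGV. With the new one, the policy and maxAttempts are what bound the retries.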

What this doesn't touch

OOM-killed (TASK_PROCESS_OOM_KILLED, TASK_PROCESS_MAYBE_OOM_KILLED) — still false here because OOM has its own retry path in retrying.ts that bumps the machine size before reaching shouldRetryError. Tests assert this stays the case.
SIGKILL_TIMEOUT — still false. Process didn't respond to SIGTERM, so retrying without diagnosing why is more likely to mask a problem.
Routing SIGSEGV through the OOM machine-bump path is a plausible follow-up — would want data on SIGSEGV-near-OOM frequency before shipping.
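
The reason the OOM codes can stay false in shouldRetryError is the ordering: the OOM path runs first. A rough sketch of that ordering, under stated assumptions, looks like this. The names decideWithOomPath, oomMachine and the machine strings are hypothetical, and the retry-policy check is deliberately left out to keep the focus on the order of the two steps.

```typescript
type Decision =
  | { kind: "retry"; machine: string }
  | { kind: "fail_run" };

const OOM_CODES: ReadonlySet<string> = new Set([
  "TASK_PROCESS_OOM_KILLED",
  "TASK_PROCESS_MAYBE_OOM_KILLED",
]);

function decideWithOomPath(
  code: string,
  currentMachine: string,
  oomMachine: string | undefined, // hypothetical: larger machine to retry on after OOM
  isRetriable: (code: string) => boolean // stands in for shouldRetryError
): Decision {
  // Step 1: the OOM path. If a larger machine is configured and the run is
  // not already on it, retry there without consulting isRetriable at all.
  if (OOM_CODES.has(code) && oomMachine !== undefined && oomMachine !== currentMachine) {
    return { kind: "retry", machine: oomMachine };
  }

  // Step 2: everything else goes through the general gate.
  // (The retry-policy check that follows in the real pipeline is omitted.)
  return isRetriable(code)
    ? { kind: "retry", machine: currentMachine }
    : { kind: "fail_run" };
}

// The general gate rejects OOM codes and accepts SIGSEGV, as in this PR.
const gate = (code: string): boolean => code === "TASK_PROCESS_SIGSEGV";

const oom = decideWithOomPath("TASK_PROCESS_OOM_KILLED", "small", "large", gate);
const segv = decideWithOomPath("TASK_PROCESS_SIGSEGV", "small", "large", gate);
console.log(oom, segv);
```

In this sketch SIGSEGV is retried on the same machine. Routing it through step 1 instead is the follow-up mentioned above.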

Test plan

pnpm exec vitest run test/errors.test.ts in packages/core — 26/26 pass (4 new)
pnpm run build --filter @trigger.dev/core
CI green on PR

🤖 Generated with Claude Code

SIGSEGV was hard-classified as non-retriable in shouldRetryError on the assumption that it's always a deterministic native crash. For Node tasks that's not reliably true; many production SIGSEGVs are flaky:

- Native addon races (sharp, canvas, better-sqlite3, node-rdkafka, bcrypt, etc.): libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process, so a retry often succeeds.
- JIT / GC interaction: V8 turbofan deopt or GC during a native callback. Timing-dependent.
- Near-OOM in native code: when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of a clean OOM-kill. A fresh process with cleaner memory often succeeds.
- Host / hardware issues: bit flips, kernel quirks. Retry lands on a different host.

The codebase was already inconsistent here: shouldLookupRetrySettings listed SIGSEGV as retry-config-aware, but the shouldRetryError gate short-circuited to fail_run before that branch could be reached. And we already retry TASK_RUN_UNCAUGHT_EXCEPTION (clearly a user-code bug) under the user's retry policy, so refusing to retry SIGSEGV was the odd one out.

Flip TASK_PROCESS_SIGSEGV from the false branch to the true branch in shouldRetryError. The existing retrying.ts pipeline then gates the retry on lockedRetryConfig + maxAttempts, the same path SIGTERM and uncaught-exception already use. No new code paths; tasks without a retry policy still fail fast.

Tests added in packages/core/test/errors.test.ts lock down the new classification alongside SIGTERM, SIGKILL_TIMEOUT, and the OOM codes (still non-retriable here because OOM has its own machine-bump retry path in retrying.ts that runs before shouldRetryError).

Closes TRI-9234.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

changeset-bot · 2026-05-11T16:11:11Z

🦋 Changeset detected

Latest commit: cdfe334

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages

Name	Type
@trigger.dev/core	Patch
@trigger.dev/build	Patch
trigger.dev	Patch
@trigger.dev/python	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
references-realtime-hooks-test	Patch
references-realtime-streams	Patch
references-telemetry	Patch
@internal/sdk-compat-tests	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/rsc	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch


coderabbitai · 2026-05-11T16:11:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9401a995-a47a-449d-9563-b21838084501

📥 Commits

Reviewing files that changed from the base of the PR and between 2b84545 and cdfe334.

📒 Files selected for processing (3)

.changeset/retry-sigsegv.md
packages/core/src/v3/errors.ts
packages/core/test/errors.test.ts

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)

GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: typecheck / typecheck
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: sdk-compat / Cloudflare Workers
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: sdk-compat / Deno Runtime
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: sdk-compat / Bun Runtime
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
GitHub Check: Analyze (javascript-typescript)

🧰 Additional context used

📓 Path-based instructions (13)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

packages/core/test/errors.test.ts
packages/core/src/v3/errors.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

packages/core/test/errors.test.ts
packages/core/src/v3/errors.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

packages/core/test/errors.test.ts
packages/core/src/v3/errors.ts

**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

packages/core/test/errors.test.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

packages/core/test/errors.test.ts
packages/core/src/v3/errors.ts

packages/core/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (packages/core/CLAUDE.md)

Never import the root package (@trigger.dev/core). Always use subpath imports such as @trigger.dev/core/v3, @trigger.dev/core/v3/utils, @trigger.dev/core/logger, or @trigger.dev/core/schemas

Files:

packages/core/test/errors.test.ts
packages/core/src/v3/errors.ts

packages/**/*.{ts,tsx,js}