fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552
fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552matt-aitken wants to merge 1 commit into
Conversation
SIGSEGV was hard-classified as non-retriable in shouldRetryError on the assumption that it's always a deterministic native crash. For Node tasks that's not reliably true — many production SIGSEGVs are flaky: - Native addon races (sharp, canvas, better-sqlite3, node-rdkafka, bcrypt, etc.) — libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process, retry often succeeds. - JIT / GC interaction — V8 turbofan deopt or GC during a native callback. Timing-dependent. - Near-OOM in native code — when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of a clean OOM-kill. A fresh process with cleaner memory often succeeds. - Host / hardware issues — bit flips, kernel quirks. Retry lands on a different host. The codebase was already inconsistent here: shouldLookupRetrySettings listed SIGSEGV as retry-config-aware, but the shouldRetryError gate short-circuited fail_run before that branch could be reached. And we already retry TASK_RUN_UNCAUGHT_EXCEPTION — clearly a user-code bug — under the user's retry policy, so refusing to retry SIGSEGV was the odd one out. Flip TASK_PROCESS_SIGSEGV from the false branch to the true branch in shouldRetryError. The existing retrying.ts pipeline then gates the retry on lockedRetryConfig + maxAttempts — same path SIGTERM and uncaught-exception already use. No new code paths; tasks without a retry policy still fail fast. Tests added in packages/core/test/errors.test.ts lock down the new classification alongside SIGTERM, SIGKILL_TIMEOUT, and the OOM codes (still non-retriable here because OOM has its own machine-bump retry path in retrying.ts that runs before shouldRetryError). Closes TRI-9234. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: cdfe334 The changes in this PR will be included in the next version bump. This PR includes changesets to release 29 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🧰 Additional context used📓 Path-based instructions (13)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
{packages/core,apps/webapp}/**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{test,spec}.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
packages/core/**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (packages/core/CLAUDE.md)
Files:
packages/**/*.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.test.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.test.{ts,tsx}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
{packages,integrations}/**/*.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
{package.json,**/*.{ts,tsx,js}}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.{ts,tsx,js,jsx,json,md,css,scss}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (2)📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
🔇 Additional comments (3)
WalkthroughThis PR makes segmentation fault (SIGSEGV) errors retryable when a task has a retry policy configured. The Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Closes TRI-9234
Summary
TASK_PROCESS_SIGSEGVwas hard-classified as non-retriable inshouldRetryError(packages/core/src/v3/errors.ts), failing the run on the first segfault regardless of the user's retry policy.return truebranch ofshouldRetryError. The existingshouldLookupRetrySettings+lockedRetryConfig+maxAttemptschain ininternal-packages/run-engine/src/engine/retrying.tsthen gates the retry — same path SIGTERM and uncaught-exception already use. Tasks without a retry policy still fail fast.Why retry
Common Node SIGSEGV causes that are non-deterministic across processes:
sharp,canvas,better-sqlite3,node-rdkafka,bcrypt, etc. libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process → retry often succeeds.The genuinely deterministic case (a bad pointer in user code that always trips the same addon) is real, but it's a subset — and
maxAttemptsalready bounds the damage.Pre-existing inconsistency this resolves
shouldRetryErrorreturnedfalseforTASK_PROCESS_SIGSEGV→fail_run.shouldLookupRetrySettingsalready listsTASK_PROCESS_SIGSEGVas retry-config-aware — but that branch was unreachable becauseshouldRetryErrorshort-circuited first inretrying.ts:86-90.TASK_RUN_UNCAUGHT_EXCEPTION(clearly a user-code bug) under the user's retry policy. Refusing to retry SIGSEGV was the odd one out.shouldLookupRetrySettingsreads like the intended behaviour;shouldRetryErrorlooks like a stale gate.What this doesn't touch
TASK_PROCESS_OOM_KILLED,TASK_PROCESS_MAYBE_OOM_KILLED) — stillfalsehere because OOM has its own retry path inretrying.tsthat bumps the machine size before reachingshouldRetryError. Tests assert this stays the case.false. Process didn't respond to SIGTERM, so retrying without diagnosing why is more likely to mask a problem.Test plan
pnpm exec vitest run test/errors.test.tsinpackages/core— 26/26 pass (4 new)pnpm run build --filter @trigger.dev/core🤖 Generated with Claude Code