perf(run-engine): merge dequeue snapshot creation into taskRun.update transaction [TRI-8450] by devin-ai-integration[bot] · Pull Request #3395 · triggerdotdev/trigger.dev

devin-ai-integration · 2026-04-16T10:30:58Z

Summary

Nests the TaskRunExecutionSnapshot creation inside the taskRun.update() Prisma call in the dequeue flow, reducing 2 DB commits → 1 per dequeue operation. This is the highest-volume of the five unmerged flows identified in TRI-8450 (~9,200 commits/sec on the engine service).

Pattern: Follows the same nested-write approach already used in the completion path (runAttemptSystem.ts:735) and trigger path (engine/index.ts:674).

Changes:

dequeueSystem.ts: Moved snapshot creation into executionSnapshots: { create: {...} } within the existing taskRun.update(). Pre-generates the snapshot ID via generateInternalId() (plain cuid, matching what Prisma's @default(cuid()) produces) so the event emission, heartbeat enqueue, and return value can all be constructed from data already in scope — no extra DB read needed after the merged write. SnapshotId.toFriendlyId() is used only for the return value's friendlyId field, matching the original createExecutionSnapshot behavior.
executionSnapshotSystem.ts: Added public enqueueHeartbeatIfNeeded() method that exposes the heartbeat scheduling logic (previously only available internally via createExecutionSnapshot). This is needed because PENDING_EXECUTING requires a heartbeat, unlike the FINISHED status in the completion reference pattern. This method is reusable by future merge targets (retry-immediate, checkpoint, cancel, requeue).
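The heartbeat distinction described above (PENDING_EXECUTING needs a heartbeat, FINISHED does not) could be sketched roughly as follows. This is a minimal illustration, not the method added in this PR: the function name `enqueueHeartbeatIfNeededSketch`, the `HeartbeatJob` shape, the in-memory queue, and the 60-second interval are all assumptions made for the example.

```typescript
// Only the two statuses named in the PR description are modelled here.
type ExecutionStatus = "PENDING_EXECUTING" | "FINISHED";

// Hypothetical job shape; the real heartbeat payload is not shown in the PR text.
interface HeartbeatJob {
  snapshotId: string;
  runId: string;
  availableAt: Date;
}

// Assumed mapping: a status with no entry needs no heartbeat.
// The 60s value is a placeholder, not the engine's configured interval.
const HEARTBEAT_INTERVAL_MS: Partial<Record<ExecutionStatus, number>> = {
  PENDING_EXECUTING: 60_000,
};

// Pushes a heartbeat job only when the snapshot's status calls for one.
// Returns true if a job was enqueued.
function enqueueHeartbeatIfNeededSketch(
  queue: HeartbeatJob[],
  snapshot: { id: string; runId: string; executionStatus: ExecutionStatus },
  now: Date = new Date()
): boolean {
  const intervalMs = HEARTBEAT_INTERVAL_MS[snapshot.executionStatus];
  if (intervalMs === undefined) return false;
  queue.push({
    snapshotId: snapshot.id,
    runId: snapshot.runId,
    availableAt: new Date(now.getTime() + intervalMs),
  });
  return true;
}
```

Keeping the decision in one helper means each future merged flow can call it after its own nested write instead of duplicating the status check.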

Net DB change per dequeue: eliminates 1 write transaction (the separate TaskRunExecutionSnapshot.create). No extra reads added — the snapshot ID is pre-generated and the executionSnapshotCreated event payload is constructed inline from values already available in the closure.
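To make the 2 → 1 commit reduction concrete, here is a toy model of the two flows. It assumes an in-memory `FakeDb` that merely counts commits; it is not Prisma and not the run engine, and `makeSnapshotId` is a hypothetical stand-in for the ID helper the PR uses.

```typescript
// In-memory stand-in for a database. It only counts commits so the two
// flows can be compared.
interface SnapshotRow {
  id: string;
  runId: string;
  executionStatus: string;
  runStatus: string;
}

interface RunRow {
  id: string;
  status: string;
  snapshots: SnapshotRow[];
}

class FakeDb {
  commits = 0;
  private runs = new Map<string, RunRow>();

  // Seeds a run in an illustrative pre-dequeue state.
  seedRun(id: string): void {
    this.runs.set(id, { id, status: "PENDING", snapshots: [] });
  }

  getRun(id: string): RunRow {
    const run = this.runs.get(id);
    if (!run) throw new Error(`run ${id} not found`);
    return run;
  }

  // One statement, one implicit transaction, one commit. An optional nested
  // snapshot create rides along in that same commit.
  updateRun(id: string, data: { status: string; createSnapshot?: SnapshotRow }): RunRow {
    const run = this.getRun(id);
    run.status = data.status;
    if (data.createSnapshot) run.snapshots.push(data.createSnapshot);
    this.commits += 1;
    return run;
  }

  // A standalone insert costs its own commit.
  createSnapshot(row: SnapshotRow): SnapshotRow {
    this.getRun(row.runId).snapshots.push(row);
    this.commits += 1;
    return row;
  }
}

let nextId = 0;
// Hypothetical ID generator for the sketch.
function makeSnapshotId(): string {
  nextId += 1;
  return `snapshot_${nextId}`;
}

// Before: update the run, then create the snapshot separately (2 commits).
function dequeueSeparate(db: FakeDb, runId: string): SnapshotRow {
  db.updateRun(runId, { status: "DEQUEUED" });
  return db.createSnapshot({
    id: makeSnapshotId(),
    runId,
    executionStatus: "PENDING_EXECUTING",
    runStatus: "PENDING",
  });
}

// After: pre-generate the ID and nest the create in the update (1 commit).
function dequeueMerged(db: FakeDb, runId: string): SnapshotRow {
  const snapshot: SnapshotRow = {
    id: makeSnapshotId(),
    runId,
    executionStatus: "PENDING_EXECUTING",
    runStatus: "PENDING",
  };
  db.updateRun(runId, { status: "DEQUEUED", createSnapshot: snapshot });
  // The caller already holds every field it needs, so no read-back is required.
  return snapshot;
}
```

Because the ID is chosen before the write, the event payload and return value can be built from `snapshot` directly, which is why the merged flow adds no read.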

Review & Testing Checklist for Human

Verify manually-constructed event payload matches DB state: The executionSnapshotCreated event is now built inline (not read back from DB). Confirm the field values (runStatus: "PENDING", attemptNumber, checkpointId, workerId, runnerId, completedWaitpointIds) match what Prisma actually writes. A mismatch here would be silent — event consumers would get stale/wrong data.
Verify attemptNumber source is equivalent: Old code used lockedTaskRun.attemptNumber (post-update result). New code uses result.run.attemptNumber (pre-update). The taskRun.update() data payload does NOT include attemptNumber, so they should be identical — but confirm this assumption holds for all dequeue scenarios (e.g. retried runs).
Verify isValid defaults to true in schema: The old createExecutionSnapshot explicitly set isValid: error ? false : true. The nested create omits isValid (no error in the dequeue happy path). Confirm the Prisma schema default for TaskRunExecutionSnapshot.isValid is true.
Verify runStatus: "PENDING" hardcoding matches the mapping: The old code passed lockedTaskRun.status ("DEQUEUED") to createExecutionSnapshot, which mapped it to "PENDING" via run.status === "DEQUEUED" ? "PENDING" : run.status. The new code hardcodes "PENDING" directly. This is correct but brittle if status ever changes from "DEQUEUED" to something else upstream.
Spot-check completedWaitpoints connect + order logic: The nested create replicates the connect/order logic from createExecutionSnapshot (lines 387-393). Verify the snapshot.completedWaitpoints type provides id and index fields compatible with this usage.
Verify checkpoint in return value: The return now uses snapshot.checkpoint (from the previous snapshot) instead of reading the newly-created snapshot's checkpoint relation. Since checkpointId is passed through unchanged, they should be identical — but worth a sanity check.
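The runStatus item in the checklist above can be illustrated with a small sketch. The mapping expression is the one quoted from the old code path; the helper names `snapshotRunStatus` and `assertHardcodedStatusMatches` are hypothetical, and the guard is just one possible way to make the hardcoded value less brittle.

```typescript
// Mirrors the mapping quoted in the checklist:
// run.status === "DEQUEUED" ? "PENDING" : run.status
function snapshotRunStatus(runStatus: string): string {
  return runStatus === "DEQUEUED" ? "PENDING" : runStatus;
}

// Hypothetical guard: fails loudly if the status written by the update no
// longer maps to the value hardcoded in the nested snapshot create.
function assertHardcodedStatusMatches(updatedRunStatus: string, hardcoded: string): void {
  const expected = snapshotRunStatus(updatedRunStatus);
  if (expected !== hardcoded) {
    throw new Error(
      `snapshot runStatus "${hardcoded}" does not match mapped status "${expected}"`
    );
  }
}
```

A check like this could live in a unit test, so that a future change to the status written during dequeue surfaces as a test failure rather than as silently wrong event data.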

Recommended test plan: deploy to staging, run the sample_pg_activity.py sampler for a 5-minute window, and verify the COMMIT count drop on the engine service + proportional IO:XactSync reduction.
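For comparing commit rates before and after the deploy, a rate can be derived from two readings of a cumulative commit counter (for example `xact_commit` in PostgreSQL's `pg_stat_database` view). The sketch below is not the `sample_pg_activity.py` sampler mentioned above; the `CommitSample` shape and function name are assumptions for illustration.

```typescript
// One reading of a cumulative commit counter at a point in time.
interface CommitSample {
  takenAtMs: number;   // wall-clock time of the reading, in milliseconds
  xactCommit: number;  // cumulative committed-transaction count
}

// Average commits per second between two readings.
function commitsPerSecond(before: CommitSample, after: CommitSample): number {
  const elapsedSec = (after.takenAtMs - before.takenAtMs) / 1000;
  if (elapsedSec <= 0) throw new Error("samples must be in chronological order");
  return (after.xactCommit - before.xactCommit) / elapsedSec;
}
```

Taking one pair of readings over a 5-minute window before the change and another after it gives two rates whose difference can be compared against the expected one-commit-per-dequeue saving.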

Notes

This only covers the dequeue flow (flow #1 from TRI-8450). The remaining four flows (retry-immediate, checkpoint, requeue, cancel) are separate follow-ups.
The new enqueueHeartbeatIfNeeded method is deliberately designed for reuse by those follow-up PRs.
CI note: the priority.test.ts failure in shard 7 is a flaky ordering assertion unrelated to this change (it compares friendlyId values in dequeue order). The audit check is also pre-existing/unrelated.

Link to Devin session: https://app.devin.ai/sessions/034fe0e7224f49278a2de260203e1377
Requested by: @ericallam

devin-ai-integration · 2026-04-16T10:31:01Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


changeset-bot · 2026-04-16T10:31:04Z

⚠️ No Changeset found

Latest commit: 2a87406

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


github-actions · 2026-04-16T10:31:22Z

Thanks for your contribution! We require all external PRs to be opened in draft status first so you can address CodeRabbit review comments and ensure CI passes before requesting a review. Please re-open this PR as a draft. See CONTRIBUTING.md for details.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

… transaction

Nest the TaskRunExecutionSnapshot create inside the preceding taskRun.update() call in the dequeue flow, reducing 2 explicit BEGIN/COMMIT transactions to 1 per dequeue operation. This follows the same pattern already used in the completion path (runAttemptSystem.ts:735) and trigger path (engine/index.ts:674-686).

Side effects (heartbeat enqueue, executionSnapshotCreated event) are kept outside the transaction and fed the result of the merged write.

Also adds a public enqueueHeartbeatIfNeeded() method to ExecutionSnapshotSystem for reuse by other flows that will adopt the same merged pattern.

Refs: TRI-8450
Co-Authored-By: Eric Allam <eallam@icloud.com>

Pre-generate the snapshot ID with SnapshotId.generate() and construct the event/return data from values already available in scope. This removes the extra DB read that was added in the initial merge commit. Co-Authored-By: Eric Allam <eallam@icloud.com>

….generate() Co-Authored-By: Eric Allam <eallam@icloud.com>

## Summary

8 new features, 18 improvements, 11 bug fixes.

## Breaking changes

- Add server-side deprecation gate for deploys from v3 CLI versions (gated by `DEPRECATE_V3_CLI_DEPLOYS_ENABLED`). v4 CLI deploys are unaffected. ([#3415](#3415))

## Improvements

- Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483))
- Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454))
- Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. ([#3490](#3490))
- Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417))
- Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405))

## Server changes

These changes affect the self-hosted Docker image and Trigger.dev Cloud:

- Add a "Back office" tab to `/admin` and a per-organization detail page at `/admin/back-office/orgs/:orgId`. The first action available on that page is editing the org's API rate limit: admins can save a `tokenBucket` override (refill rate, interval, max tokens) and see a plain-English preview of the resulting sustained rate and burst allowance. Writes are audit-logged via the server logger. ([#3434](#3434))
- Optional `DEPLOY_REGISTRY_ECR_DEFAULT_REPOSITORY_POLICY` env var to apply a default repository policy when the webapp creates new ECR repos ([#3467](#3467))
- Ship the Errors page to all users, with a polish + bug-fix pass: pinned "No channel" item in the Slack alert channel picker, viewer-timezone alert timestamps via Slack's `<!date^>` token, Activity sparkline peak tooltip, centered loading spinner and bug-icon empty state on the error detail page, ellipsis on the Configure alerts trigger. ([#3477](#3477))
- Configure the set of machine presets to build boot snapshots for at deploy time via `COMPUTE_TEMPLATE_MACHINE_PRESETS` (CSV of preset names, default `small-1x`). Use `COMPUTE_TEMPLATE_MACHINE_PRESETS_REQUIRED` (CSV, default = full PRESETS list) to scope which preset failures fail a required-mode deploy. Optional preset failures are logged and don't block the deploy. ([#3492](#3492))
- Regenerating a RuntimeEnvironment API key no longer invalidates the previous key immediately. The old key is recorded in a new `RevokedApiKey` table with a 24 hour grace window, and `findEnvironmentByApiKey` falls back to it when the submitted key doesn't match any live environment. The grace window can be ended early (or extended) by updating `expiresAt` on the row. ([#3420](#3420))
- Add the `Session` primitive: a durable, task-bound, bidirectional I/O channel that outlives a single run and acts as the run manager for `chat.agent`. Ships the Postgres `Session` + `SessionRun` tables, ClickHouse `sessions_v1` + replication service, the `sessions` JWT scope, and the public CRUD + realtime routes (`/api/v1/sessions`, `/realtime/v1/sessions/:session/:io`) including `end-and-continue` for server-orchestrated run handoffs and session-stream waitpoints. ([#3417](#3417))
- Add `KUBERNETES_POD_DNS_NDOTS_OVERRIDE_ENABLED` flag (off by default) that overrides the cluster default and sets `dnsConfig.options.ndots` on runner pods (defaulting to 2, configurable via `KUBERNETES_POD_DNS_NDOTS`). Kubernetes defaults pods to `ndots: 5`, so any name with fewer than 5 dots (including typical external domains like `api.example.com`) is first walked through every entry in the cluster search list (`<ns>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`) before being tried as-is, turning one resolution into 4+ CoreDNS queries (×2 with A+AAAA). Using a lower `ndots` value reduces DNS query amplification in the `cluster.local` zone. Note: before enabling, make sure no code path relies on search-list expansion for names with dots ≥ the configured value; those names will hit their as-is form first and could resolve externally before falling back to the cluster search path. ([#3441](#3441))
- Vercel integration option to disable auto promotions ([#3376](#3376))
- Make it clear in the admin that feature flags are global and should rarely be changed. ([#3408](#3408))
- Admin worker groups API: add GET loader and expose more fields on POST. ([#3390](#3390))
- Add 60s fresh / 60s stale SWR cache to `getEntitlement` in `platform.v3.server.ts`. Eliminates a synchronous billing-service HTTP round trip on every trigger. Reuses the existing `platformCache` (LRU memory + Redis) pattern already used for `limits` and `usage`. Cache key is `${orgId}`. Errors return a permissive `{ hasAccess: true }` fallback (existing behavior) and are also cached to prevent thundering-herd on billing outages. ([#3388](#3388))
- Show a `MicroVM` badge next to the region name on the regions page. ([#3407](#3407))
- Increase default maximum project count per organization from 10 to 25 ([#3409](#3409))
- Merge execution snapshot creation into the dequeue taskRun.update transaction, reducing 2 DB commits to 1 per dequeue operation ([#3395](#3395))
- Add per-worker Node.js heap metrics to the OTel meter: `nodejs.memory.heap.used`, `nodejs.memory.heap.total`, `nodejs.memory.heap.limit`, `nodejs.memory.external`, `nodejs.memory.array_buffers`, `nodejs.memory.rss`. Host-metrics only publishes RSS, which overstates V8 heap by the external + native footprint; these give direct heap visibility per cluster worker so `NODE_MAX_OLD_SPACE_SIZE` can be sized against observed heap peaks rather than RSS. ([#3437](#3437))
- Tag Prisma spans with `db.datasource: "writer" | "replica"` so monitors and trace queries can distinguish the writer pool from the replica pool. Applies to all `prisma:engine:*` spans (including `prisma:engine:connection` used by the connection-pool monitors) and the outer `prisma:client:operation` span. ([#3422](#3422))
- Clarify the cross-region intent in the Terraform and AI-prompt helpers on the Add Private Connection page. Both already default `supported_regions` to `["us-east-1", "eu-central-1"]`; added an inline comment / parenthetical so the user understands why both regions are listed (Trigger.dev runs in both, so the service must be consumable from either). ([#3465](#3465))
- Add `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` flag (default off) to route the Prisma reads inside `RunEngine.getSnapshotsSince` through the read-only replica client. Offloads the snapshot polling queries (fired by every running task runner) from the primary. When disabled, behavior is unchanged. ([#3423](#3423))
- Stop creating TaskRunTag records and _TaskRunToTaskRunTag join table entries during task triggering. The denormalized runTags string array on TaskRun already stores tag names, making the M2M relation redundant write overhead. ([#3369](#3369))
- Stop writing per-tick state (`lastScheduledTimestamp`, `nextScheduledTimestamp`, `lastRunTriggeredAt`) on `TaskSchedule` and `TaskScheduleInstance`. The schedule engine now carries the previous fire time forward via the worker queue payload, eliminating ~270K dead-tuple-driven autovacuums per year on these hot tables and the associated `IO:XactSync` mini-spikes on the writer. Customer-facing `payload.lastTimestamp` semantics are unchanged. ([#3476](#3476))
- Replace the expensive DISTINCT query for task filter dropdowns with a dedicated TaskIdentifier registry table backed by Redis. Environments migrate automatically on their next deploy, with a transparent fallback to the legacy query for unmigrated environments. Also fixes duplicate dropdown entries when a task changes trigger source, and adds active/archived grouping for removed tasks. Moves BackgroundWorkerTask reads in the trigger hot path to the read replica. ([#3368](#3368))
- Public Access Tokens (PATs) minted before an API key rotation now keep working during the 24h grace window. `validatePublicJwtKey` falls back to any non-expired `RevokedApiKey` rows for the signing environment when the primary signature check against the env's current `apiKey` fails. The fallback query only runs on the failure path, so the hot success path is unchanged. ([#3464](#3464))
- Batch items that hit the environment queue size limit now fast-fail without retries and without creating pre-failed TaskRuns. ([#3352](#3352))
- Show the cancel button in the runs list for runs in `DEQUEUED` status. `DEQUEUED` was missing from `NON_FINAL_RUN_STATUSES` so the list hid the button even though the single run page allowed it. ([#3421](#3421))
- Reduce 5xx feedback loops on hot debounce keys by quantizing `delayUntil`, adding an unlocked fast-path skip, and gracefully handling redlock contention in `handleDebounce` so the SDK no longer retries into a herd. ([#3453](#3453))
- Fix RSS memory leak in the realtime proxy routes. `/realtime/v1/runs`, `/realtime/v1/runs/:id`, and `/realtime/v1/batches/:id` called `fetch()` into Electric with no abort signal, so when a client disconnected mid long-poll, undici kept the upstream socket open and buffered response chunks that would never be consumed (retained only in RSS, invisible to V8 heap tooling). Thread `getRequestAbortSignal()` through `RealtimeClient.streamRun/streamRuns/streamBatch` to `longPollingFetch` and cancel the upstream body in the error path. Isolated reproducer showed ~44 KB retained per leaked request; signal propagation releases it cleanly. ([#3442](#3442))
- Fix memory leak where every aborted SSE connection pinned the full request/response graph on Node 20, caused by `AbortSignal.any()` in `sse.ts` retaining its source signals indefinitely (see nodejs/node#54614, nodejs/node#55351). Also clear the `setTimeout(abort)` timer in `entry.server.tsx` so successful HTML renders don't pin the React tree for 30s per request. ([#3430](#3430))
- Preserve filters on the queues page when submitting modal actions. ([#3471](#3471))
- Fix Redis connection leak in realtime streams and broken abort signal propagation. **Redis connections**: Non-blocking methods (ingestData, appendPart, getLastChunkIndex) now share a single Redis connection instead of creating one per request. streamResponse still uses dedicated connections (required for XREAD BLOCK) but now tears them down immediately via disconnect() instead of graceful quit(), with a 15s inactivity fallback. **Abort signal**: request.signal is broken in Remix/Express due to a Node.js undici GC bug (nodejs/node#55428) that severs the signal chain when Remix clones the Request internally. Added getRequestAbortSignal() wired to Express res.on("close") via httpAsyncStorage, which fires reliably on client disconnect. All SSE/streaming routes updated to use it. ([#3399](#3399))
- Prevent dashboard crash (React error #31) when span accessory item text is not a string. Filters out malformed accessory items in SpanCodePathAccessory instead of passing objects to React as children. ([#3400](#3400))
- Upgrade Remix packages from 2.1.0 to 2.17.4 to address security vulnerabilities in React Router ([#3372](#3372))
- Fix Vercel integration settings page (remove redundant section toggles) and improve the Vercel onboarding flow so the modal closes after connecting a GitHub repo and the marketplace `next` URL is preserved across the GitHub app install redirect. ([#3424](#3424))

<details>
<summary>Raw changeset output</summary>

# Releases

## @trigger.dev/build@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

## trigger.dev@4.4.5

### Patch Changes

- Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483))
- Updated dependencies:
  - `@trigger.dev/core@4.4.5`
  - `@trigger.dev/build@4.4.5`
  - `@trigger.dev/schema-to-json@4.4.5`

## @trigger.dev/core@4.4.5

### Patch Changes

- Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454))
- Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. ([#3490](#3490))
- Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417))
- Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405))

## @trigger.dev/python@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`
  - `@trigger.dev/build@4.4.5`
  - `@trigger.dev/sdk@4.4.5`

## @trigger.dev/react-hooks@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

## @trigger.dev/redis-worker@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

## @trigger.dev/rsc@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

## @trigger.dev/schema-to-json@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

## @trigger.dev/sdk@4.4.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.5`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
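The stack-truncation rule quoted in the release notes above (50-frame cap, keep top 5 + bottom 45, 1024-char lines) could look roughly like the sketch below. This is an illustration under assumptions, not the code from #3405: it treats every line of the stack as a frame, and the function name, option names, and wording of the omission notice are invented for the example.

```typescript
interface TruncateOptions {
  maxFrames: number;      // stacks at or under this many lines are left intact
  keepTop: number;        // lines kept from the start
  keepBottom: number;     // lines kept from the end
  maxLineLength: number;  // per-line character cap
}

const DEFAULTS: TruncateOptions = {
  maxFrames: 50,
  keepTop: 5,
  keepBottom: 45,
  maxLineLength: 1024,
};

// Caps each line's length, then drops the middle of an over-long stack and
// replaces it with a single omission notice.
function truncateStackSketch(stack: string, opts: TruncateOptions = DEFAULTS): string {
  const lines = stack
    .split("\n")
    .map((line) => (line.length > opts.maxLineLength ? line.slice(0, opts.maxLineLength) : line));

  if (lines.length <= opts.maxFrames) return lines.join("\n");

  const omitted = lines.length - opts.keepTop - opts.keepBottom;
  return [
    ...lines.slice(0, opts.keepTop),
    `    ... ${omitted} frames omitted ...`,
    ...lines.slice(lines.length - opts.keepBottom),
  ].join("\n");
}
```

Keeping more of the bottom than the top favors the frames closest to the entry point, which matches the 5/45 split stated in the notes.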

devin-ai-integration Bot assigned ericallam Apr 16, 2026

github-actions Bot closed this Apr 16, 2026

ericallam reopened this Apr 16, 2026

devin-ai-integration Bot commented Apr 16, 2026


devin-ai-integration Bot and others added 3 commits April 16, 2026 15:03

refactor: use plain cuid via generateInternalId instead of SnapshotId…

2a87406


ericallam force-pushed the devin/TRI-8450-1776335318 branch from 813a973 to 2a87406 Compare April 16, 2026 14:12

myftija approved these changes Apr 16, 2026


matt-aitken approved these changes Apr 16, 2026


matt-aitken merged commit ff290df into main Apr 16, 2026
39 checks passed

matt-aitken deleted the devin/TRI-8450-1776335318 branch April 16, 2026 14:43

github-actions Bot mentioned this pull request Apr 17, 2026

chore: release v4.4.5 #3406

Merged

github-actions Bot mentioned this pull request May 1, 2026

chore: release v4.4.6 #3501

Merged

matt-aitken merged 3 commits into main from devin/TRI-8450-1776335318
