Skip to content

perf(run-engine): merge dequeue snapshot creation into taskRun.update transaction [TRI-8450]#3395

Open
devin-ai-integration[bot] wants to merge 3 commits intomainfrom
devin/TRI-8450-1776335318
Open

perf(run-engine): merge dequeue snapshot creation into taskRun.update transaction [TRI-8450]#3395
devin-ai-integration[bot] wants to merge 3 commits intomainfrom
devin/TRI-8450-1776335318

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot commented Apr 16, 2026

Summary

Nests the TaskRunExecutionSnapshot creation inside the taskRun.update() Prisma call in the dequeue flow, reducing 2 DB commits → 1 per dequeue operation. This is the highest-volume of the five unmerged flows identified in TRI-8450 (~9,200 commits/sec on the engine service).

Pattern: Follows the same nested-write approach already used in the completion path (runAttemptSystem.ts:735) and trigger path (engine/index.ts:674).

Changes:

  • dequeueSystem.ts: Moved snapshot creation into executionSnapshots: { create: {...} } within the existing taskRun.update(). Pre-generates the snapshot ID via generateInternalId() (plain cuid, matching what Prisma's @default(cuid()) produces) so the event emission, heartbeat enqueue, and return value can all be constructed from data already in scope — no extra DB read needed after the merged write. SnapshotId.toFriendlyId() is used only for the return value's friendlyId field, matching the original createExecutionSnapshot behavior.
  • executionSnapshotSystem.ts: Added public enqueueHeartbeatIfNeeded() method that exposes the heartbeat scheduling logic (previously only available internally via createExecutionSnapshot). This is needed because PENDING_EXECUTING requires a heartbeat, unlike the FINISHED status in the completion reference pattern. This method is reusable by future merge targets (retry-immediate, checkpoint, cancel, requeue).

Net DB change per dequeue: eliminates 1 write transaction (the separate TaskRunExecutionSnapshot.create). No extra reads added — the snapshot ID is pre-generated and the executionSnapshotCreated event payload is constructed inline from values already available in the closure.

Review & Testing Checklist for Human

  • Verify manually-constructed event payload matches DB state: The executionSnapshotCreated event is now built inline (not read back from DB). Confirm the field values (runStatus: "PENDING", attemptNumber, checkpointId, workerId, runnerId, completedWaitpointIds) match what Prisma actually writes. A mismatch here would be silent — event consumers would get stale/wrong data.
  • Verify attemptNumber source is equivalent: Old code used lockedTaskRun.attemptNumber (post-update result). New code uses result.run.attemptNumber (pre-update). The taskRun.update() data payload does NOT include attemptNumber, so they should be identical — but confirm this assumption holds for all dequeue scenarios (e.g. retried runs).
  • Verify isValid defaults to true in schema: The old createExecutionSnapshot explicitly set isValid: error ? false : true. The nested create omits isValid (no error in the dequeue happy path). Confirm the Prisma schema default for TaskRunExecutionSnapshot.isValid is true.
  • Verify runStatus: "PENDING" hardcoding matches the mapping: The old code passed lockedTaskRun.status ("DEQUEUED") to createExecutionSnapshot, which mapped it to "PENDING" via run.status === "DEQUEUED" ? "PENDING" : run.status. The new code hardcodes "PENDING" directly. This is correct but brittle if status ever changes from "DEQUEUED" to something else upstream.
  • Spot-check completedWaitpoints connect + order logic: The nested create replicates the connect/order logic from createExecutionSnapshot (lines 387-393). Verify the snapshot.completedWaitpoints type provides id and index fields compatible with this usage.
  • Verify checkpoint in return value: The return now uses snapshot.checkpoint (from the previous snapshot) instead of reading the newly-created snapshot's checkpoint relation. Since checkpointId is passed through unchanged, they should be identical — but worth a sanity check.

Recommended test plan: deploy to staging, run the sample_pg_activity.py sampler for a 5-minute window, and verify the COMMIT count drop on the engine service + proportional IO:XactSync reduction.

Notes

  • This only covers the dequeue flow (flow Testing creating a new issue #1 from TRI-8450). The remaining four flows (retry-immediate, checkpoint, requeue, cancel) are separate follow-ups.
  • The new enqueueHeartbeatIfNeeded method is deliberately designed for reuse by those follow-up PRs.
  • CI note: the priority.test.ts failure in shard 7 is a flaky ordering assertion unrelated to this change (it compares friendlyId values in dequeue order). The audit check is also pre-existing/unrelated.

Link to Devin session: https://app.devin.ai/sessions/034fe0e7224f49278a2de260203e1377
Requested by: @ericallam

… transaction

Nest the TaskRunExecutionSnapshot create inside the preceding
taskRun.update() call in the dequeue flow, reducing 2 explicit
BEGIN/COMMIT transactions to 1 per dequeue operation.

This follows the same pattern already used in the completion path
(runAttemptSystem.ts:735) and trigger path (engine/index.ts:674-686).

Side effects (heartbeat enqueue, executionSnapshotCreated event) are
kept outside the transaction and fed the result of the merged write.

Also adds a public enqueueHeartbeatIfNeeded() method to
ExecutionSnapshotSystem for reuse by other flows that will adopt
the same merged pattern.

Refs: TRI-8450
Co-Authored-By: Eric Allam <eallam@icloud.com>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Apr 16, 2026

⚠️ No Changeset found

Latest commit: 813a973

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for your contribution! We require all external PRs to be opened in draft status first so you can address CodeRabbit review comments and ensure CI passes before requesting a review. Please re-open this PR as a draft. See CONTRIBUTING.md for details.

@github-actions github-actions bot closed this Apr 16, 2026
@ericallam ericallam reopened this Apr 16, 2026
Copy link
Copy Markdown
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

devin-ai-integration bot and others added 2 commits April 16, 2026 10:45
Pre-generate the snapshot ID with SnapshotId.generate() and construct
the event/return data from values already available in scope. This
removes the extra DB read that was added in the initial merge commit.

Co-Authored-By: Eric Allam <eallam@icloud.com>
….generate()

Co-Authored-By: Eric Allam <eallam@icloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant