Skip to content

.NET: Fix intermittent checkpoint-restore race in in-process workflow runs#5134

Merged
peibekwe merged 7 commits intomainfrom
peibekwe/workflow-unit-tests
Apr 16, 2026
Merged

.NET: Fix intermittent checkpoint-restore race in in-process workflow runs#5134
peibekwe merged 7 commits intomainfrom
peibekwe/workflow-unit-tests

Conversation

@peibekwe
Copy link
Copy Markdown
Contributor

@peibekwe peibekwe commented Apr 7, 2026

Description

During live RestoreCheckpointAsync, queued external deliveries from the superseded timeline could survive restore and be applied after checkpoint state was imported. This caused flaky replay behavior in unit test sample execution, including incorrect prompt/order after restore.
The change clears queued external deliveries during checkpoint import and adds a regression test to verify restored runs remain pending until a fresh post-restore response is sent.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@moonbox3 moonbox3 added .NET workflows Related to Workflows in agent-framework labels Apr 7, 2026
@github-actions github-actions bot changed the title Fix intermittent checkpoint-restore race in in-process workflow runs .NET: Fix intermittent checkpoint-restore race in in-process workflow runs Apr 7, 2026
@peibekwe peibekwe marked this pull request as ready for review April 7, 2026 15:00
Copilot AI review requested due to automatic review settings April 7, 2026 15:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an intermittent in-process checkpoint restore race where pre-restore queued external deliveries could be applied after checkpoint state import, leading to flaky replay behavior.

Changes:

  • Clear queued external deliveries during ImportStateAsync so stale responses/messages from a superseded timeline can’t be applied post-restore.
  • Add a regression unit test to ensure a restored run remains PendingRequests until a fresh post-restore response is provided.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunnerContext.cs Clears queued external deliveries during checkpoint state import to prevent stale delivery application after restore.
dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointResumeTests.cs Adds a regression test validating queued responses from the superseded timeline don’t complete the restored run.

@peibekwe peibekwe added this pull request to the merge queue Apr 15, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Apr 15, 2026
@peibekwe peibekwe added this pull request to the merge queue Apr 15, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 15, 2026
@peibekwe peibekwe added this pull request to the merge queue Apr 16, 2026
Merged via the queue into main with commit 87a8fa2 Apr 16, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

.NET workflows Related to Workflows in agent-framework

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants