Skip to content

fix(mship): add tool watchdog#4991

Merged
Sg312 merged 5 commits into
stagingfrom
improvement/mship-tool-watchdog
Jun 12, 2026
Merged

fix(mship): add tool watchdog#4991
Sg312 merged 5 commits into
stagingfrom
improvement/mship-tool-watchdog

Conversation

@Sg312

@Sg312 Sg312 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Add tool timeout for vfs reads

Type of Change

  • Bug fix

Testing

Manual

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 12, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 12, 2026 3:48am

Request Review

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a two-layer tool watchdog to prevent hung async tool executions from permanently blocking the copilot checkpoint loop. Previously, a VFS read (or any tool) that neither resolved nor rejected would park the loop's Promise.allSettled forever, wedging the chat's pending-stream lock.

  • Per-tool watchdog (executeToolWithWatchdog): wraps every tool execution in a Promise.race against a setTimeout. Short-lived tools (reads, metadata ops) get 60 s; legitimately long-running tools (workflow runs, media generation, research, E2B file ops) get the full orchestration budget so their inner timeouts can fire first.
  • Structural backstop (run.ts checkpoint loop): replaces the bare Promise.allSettled with a raced wait-budget (max watchdog across pending tools + 30 s grace). If the budget expires, forceFailHungToolCall settles each hung tool into a terminal error state — persisting a failed async row and publishing a terminal confirmation — then the loop resumes Go with an error result instead of hanging.

Confidence Score: 4/5

Safe to merge — the watchdog correctly unblocks hung tool calls without interfering with tools that complete normally, and state transitions are verified to satisfy the downstream result builder.

The two-layer design is sound: per-tool watchdog in executeToolWithWatchdog and a structural backstop in the checkpoint loop. The forceFailHungToolCall path correctly sets tool.result before the resume body builder checks it, preventing the missing result exception. The unawaited publishTerminalToolConfirmation in forceFailHungToolCall means a client could be left with a spinning tool indicator if that call fails silently, but this is a minor edge case in an already-error recovery path.

executor.ts — the forceFailHungToolCall function's unawaited terminal confirmation publish is the only call site that lacks error handling; all other code paths in the file handle errors consistently.

Important Files Changed

Filename Overview
apps/sim/lib/copilot/constants.ts Adds three new timeout constants: TOOL_WATCHDOG_DEFAULT_MS (60s), TOOL_WATCHDOG_LONG_RUNNING_MS (alias of ORCHESTRATION_TIMEOUT_MS), and TOOL_WATCHDOG_RESUME_GRACE_MS (30s). Values are well-reasoned and well-commented.
apps/sim/lib/copilot/request/tools/executor.ts Adds LONG_RUNNING_TOOL_IDS set, toolWatchdogTimeoutMs classifier, executeToolWithWatchdog (per-tool race against a hard timer), and forceFailHungToolCall (last-resort state settlement). Core logic is correct; publishTerminalToolConfirmation is fire-and-forget without error handling, consistent with existing usage patterns.
apps/sim/lib/copilot/request/lifecycle/run.ts Replaces bare Promise.allSettled with a raced wait-budget (max tool watchdog + grace), then iterates hung tools sequentially calling forceFailHungToolCall. State transitions verified to satisfy the downstream result builder's !tool.result guard.
apps/sim/lib/copilot/request/lifecycle/run.test.ts New test uses vi.useFakeTimers, stubs forceFailHungToolCall with the correct state shape (matching setTerminalToolCallState output), and validates that forceFailHungToolCall is called, the resume URL is used, and the error result is correctly serialized.

Sequence Diagram

sequenceDiagram
    participant CL as Checkpoint Loop
    participant WD as executeToolWithWatchdog
    participant TH as Tool Handler
    participant FF as forceFailHungToolCall
    participant Go as Go Resume Endpoint

    CL->>WD: executeToolAndReport(toolCallId)
    WD->>TH: executeTool(name, params)

    alt Tool completes within watchdog cap
        TH-->>WD: result
        WD-->>CL: pendingPromise resolves
        CL->>Go: POST /api/tools/resume with results
    else Per-tool watchdog fires first
        WD-->>CL: rejects with ToolExecutionTimeoutError
        CL->>Go: POST /api/tools/resume with error result
    else Tool hangs past waitBudget (watchdog + grace)
        Note over CL: sleep(maxWatchdog + 30s) wins race
        CL->>FF: forceFailHungToolCall(toolCallId)
        FF->>FF: setTerminalToolCallState error
        FF->>FF: completeAsyncToolCall failed
        FF->>FF: publishTerminalToolConfirmation
        FF-->>CL: done
        CL->>Go: POST /api/tools/resume with error result
    end
Loading

Reviews (1): Last reviewed commit: "Add tool watchdog" | Re-trigger Greptile

Comment thread apps/sim/lib/copilot/request/lifecycle/run.ts
Comment thread apps/sim/lib/copilot/request/tools/executor.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant