feat: improve sub-agent orchestration tools #26673
Draft
mafredri wants to merge 1 commit into
The orchestration tools work, but they tell the orchestrator the wrong story, and it acts on that misframing. wait_agent says "final response" and returns a bare error on timeout, which reads as failure rather than "still working." spawn_agent frames a spawn-wait-done lifecycle, so the orchestrator never learns that agents persist and can be reused. close_agent sounds like it destroys the agent when it only interrupts it, message_agent hides its queue and interrupt behavior, and there is no way to list spawned agents to recover the fleet after a compaction.

This is measured, not hypothesized: in the personal-agent chat snapshots, ~23% of wait_agent calls time out and ~10% of delegated agents are never cleanly collected. A retry-disciplined harness orphans ~0% with the same descriptions, so the decision to give up is driven by the tool response and the missing hygiene guidance, not by the description.

The fixes land where the decision is made. wait_agent returns an informational timeout payload and a recoverable-aware error payload instead of a bare error. A new list_agents tool recovers the fleet. close_agent becomes interrupt_agent, with a hidden alias so old histories still dispatch. The descriptions and a <subagent-orchestration> section in the system prompt teach persistence, queuing, and not abandoning a working agent.

Implements CODAGT-512.
The orchestration tools (`spawn_agent`, `wait_agent`, `message_agent`, `close_agent`) work, but they communicate badly, so the orchestrator acts on the wrong story and abandons work that is still running. This is measured, not hypothesized: in the personal-agent chat snapshots, ~23% of `wait_agent` calls time out and ~10% of delegated agents are never cleanly collected. A retry-disciplined harness orphans ~0% with the same descriptions, which puts the fix at the tool response and the missing guidance, not the description alone.

What this does:
- `list_agents` (new, root-only): paginated (`limit`/`offset`, default 10, `total`/`has_more`), most-recently-active first, archived excluded. Lets the orchestrator recover its fleet after a compaction drops the spawned `chat_id`s. Available in plan mode since it is read-only.
- `wait_agent` payloads: on timeout it returns an informational (non-error) payload with `status`, `timed_out`, and retry guidance instead of a bare error; on error status it returns a structured, recoverable-aware payload (`last_error`, `report`, guidance) so transient failures get resumed via `message_agent` rather than read as terminal. The recording-on-timeout behavior is unchanged.
- `close_agent` to `interrupt_agent`: matches the codebase vocabulary (`InterruptChat`, `ErrInterrupted`, `StatusInterrupted`) and stops implying destruction; the response returns `"interrupted"` instead of `"terminated"`. A hidden `close_agent` alias (`ToolNameAliases` on `ExecuteLocalToolsOptions`, resolved once in `executeSingleTool`) keeps old histories dispatching without advertising the old name.
- `message_agent` now explains queue-by-default and `interrupt: true`; `wait_agent` and `spawn_agent` explain that agents persist and can be reused.
- A `<subagent-orchestration>` section in the system prompt and a sentence on `spawn_agent`, so spawned agents are not abandoned in a working state.
- Frontend rendering for `interrupt_agent` and `list_agents` tool calls; `close_agent` is kept for rendering existing history.

The backend `chatd` package and the touched frontend tests pass; `gofmt`, `go vet`, and the full pre-commit (gen/fmt/lint/build) are green. One pre-existing Storybook story (MCP Tool Completed) fails independently of this change and touches no files here.

Implementation plan and key decisions
Five slices: (1) rename + hidden alias + response field, (2) description rewrites and the `wait_agent` timeout/error payloads, (3) `list_agents` backend, (4) frontend descriptor, (5) hygiene guidance.

Notable decisions:
- Error payload: surface `last_error` and let the orchestrator judge recoverability rather than auto-classifying.
- `list_agents` shape: cap with `limit`/`offset` like `read_file`, fixed `updated_at DESC` sort with an `id` tiebreak, no `order_by` (no built-in tool exposes a sort). Sorting and paging happen in the handler so the shared `GetChildChatsByParentIDs` query (used by the chats sidebar) is untouched.
- `close_agent` alias: going without it would only cost one self-correcting step, but the mechanism is a single localized field, so old histories dispatch cleanly.

Anchors in the source plan drifted against `coder/coder` HEAD; line numbers were re-confirmed before editing. Two siblings the plan did not name were also updated: `chatprompt.isSubagentLifecycleToolName` and the plan-mode help text in `subagent_catalog.go`. `TestWaitAgentTimeoutLeavesRecordingRunning` encoded the old timeout-as-error behavior and was updated to the new contract.

The full deep-plan, decision log, and product analysis live in a personal, gitignored repo and are not linked here.
Implements CODAGT-512.