feat: improve sub-agent orchestration tools #26673

Draft
mafredri wants to merge 1 commit into main from mathias/codagt-512-improve-sub-agent-orchestration-tools-list_agents-rename

@mafredri

Member

The orchestration tools (spawn_agent, wait_agent, message_agent, close_agent) work, but they communicate badly, so the orchestrator acts on the wrong story and abandons work that is still running. This is measured, not hypothesized: in the personal-agent chat snapshots, ~23% of wait_agent calls time out and ~10% of delegated agents are never cleanly collected. A retry-disciplined harness orphans ~0% with the same descriptions, which puts the fix at the tool response and the missing guidance, not at the description alone.

What this does:

  • list_agents (new, root-only): paginated (limit/offset, default 10, total/has_more), most-recently-active first, archived excluded. Lets the orchestrator recover its fleet after a compaction drops the spawned chat_ids. Available in plan mode since it is read-only.
  • wait_agent payloads: on timeout it returns an informational (non-error) payload with status, timed_out, and retry guidance instead of a bare error; on error status it returns a structured, recoverable-aware payload (last_error, report, guidance) so transient failures get resumed via message_agent rather than read as terminal. The recording-on-timeout behavior is unchanged.
  • close_agent renamed to interrupt_agent: matches the codebase vocabulary (InterruptChat, ErrInterrupted, StatusInterrupted) and stops implying destruction; the response returns "interrupted" instead of "terminated". A hidden close_agent alias (ToolNameAliases on ExecuteLocalToolsOptions, resolved once in executeSingleTool) keeps old histories dispatching without advertising the old name.
  • Descriptions: message_agent now explains queue-by-default and interrupt: true; wait_agent and spawn_agent explain that agents persist and can be reused.
  • Hygiene: a <subagent-orchestration> section in the system prompt and a sentence on spawn_agent so spawned agents are not abandoned in a working state.
  • Frontend: renders interrupt_agent and list_agents tool calls; close_agent is kept for rendering existing history.
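The hidden alias described above can be pictured as a single lookup before dispatch. This is a minimal sketch, not the code in this PR: `toolNameAliases` and `resolveToolName` are hypothetical names standing in for the ToolNameAliases field and its resolution in executeSingleTool, and the map contents are assumed from the rename.

```go
package main

import "fmt"

// toolNameAliases maps retired tool names to their current names.
// Hypothetical stand-in for the ToolNameAliases option; the real field
// and its type may differ.
var toolNameAliases = map[string]string{
	"close_agent": "interrupt_agent",
}

// resolveToolName returns the canonical name for a requested tool.
// Names without an alias pass through unchanged, so the lookup is safe
// to run once on every tool call.
func resolveToolName(aliases map[string]string, name string) string {
	if canonical, ok := aliases[name]; ok {
		return canonical
	}
	return name
}

func main() {
	for _, name := range []string{"close_agent", "interrupt_agent", "wait_agent"} {
		fmt.Printf("%s -> %s\n", name, resolveToolName(toolNameAliases, name))
	}
}
```

Because the alias lives only in the lookup table, the old name never needs to appear in the advertised tool list.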

The backend chatd package and the touched frontend tests pass; gofmt, go vet, and the full pre-commit (gen/fmt/lint/build) are green. One pre-existing Storybook story (MCP Tool Completed) fails independently of this change; it involves none of the files touched here.

Implementation plan and key decisions

Five slices: (1) rename + hidden alias + response field, (2) description rewrites and the wait_agent timeout/error payloads, (3) list_agents backend, (4) frontend descriptor, (5) hygiene guidance.

Notable decisions:

  • D9 (timeout payload): the orchestrator decides to give up based on the tool response, not the description, so the timeout returns a status-carrying informational payload, not an error.
  • D12 (error payload): an errored, non-archived agent resumes when messaged, so surface last_error and let the orchestrator judge recoverability rather than auto-classifying.
  • D11 (list_agents shape): cap with limit/offset like read_file, fixed updated_at DESC sort with an id tiebreak, no order_by (no built-in tool exposes a sort). Sorting and paging happen in the handler so the shared GetChildChatsByParentIDs query (used by the chats sidebar) is untouched.
  • D7 (alias): included. Without it a stray close_agent would only cost one self-correcting step, but the mechanism is a single localized field, so it is cheap to include and lets old histories dispatch cleanly.

Anchors in the source plan drifted against coder/coder HEAD; line numbers were re-confirmed before editing. Two siblings the plan did not name were also updated: chatprompt.isSubagentLifecycleToolName and the plan-mode help text in subagent_catalog.go. TestWaitAgentTimeoutLeavesRecordingRunning encoded the old timeout-as-error behavior and was updated to the new contract.

The full deep-plan, decision log, and product analysis live in a personal, gitignored repo and are not linked here.

Implements CODAGT-512.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

The orchestration tools work, but they tell the orchestrator the wrong
story, so it acts on the misframing. wait_agent says "final response"
and returns a bare error on timeout, which reads as failure rather than
"still working." spawn_agent frames a spawn-wait-done lifecycle, so the
orchestrator never learns agents persist and can be reused. close_agent
sounds like it destroys when it only interrupts, message_agent hides its
queue and interrupt behavior, and there is no way to list spawned agents
to recover the fleet after a compaction.

This is measured, not hypothesized: in the personal-agent chat snapshots
~23% of wait_agent calls time out and ~10% of delegated agents are never
cleanly collected. A retry-disciplined harness orphans ~0% with the same
descriptions, so the give-up is decided against the tool response and the
missing hygiene guidance, not against the description.

The fixes land where the decision is made. wait_agent returns an
informational timeout payload and a recoverable-aware error payload
instead of a bare error. A new list_agents tool recovers the fleet.
close_agent becomes interrupt_agent, with a hidden alias so old histories
still dispatch. The descriptions and a <subagent-orchestration> section
in the system prompt teach persistence, queuing, and not abandoning a
working agent.

Implements CODAGT-512.
@linear-code

linear-code Bot commented Jun 24, 2026

CODAGT-512
