feat: add agent-side workspace context sources and Agent API v2.10 PushContextState#25983
Conversation
|
/coder-agents-review |
|
Chat: Spend limit reached | View chat Review history
deep-review v0.6.1 | Round 5 | Last posted: Round 5, 42 findings (7 P2, 20 P3, 1 P4, 11 Nit, 3 Note), COMMENT. Review Finding inventoryFindings
Contested and acknowledgedCRF-6 (Note, agent.go:2556) - startAgentAPI210 is a full copy of startAgentAPI
Law analysisEffective LOC: 4194. Head SHA: 4b3eda5. Verdict: Don't split. Enforcement: N/A. Round logRound 1Netero-only. 1 P2, 2 P3, 2 Nit, 1 Note. Law: Don't split. Reviewed against eea427f..4b3eda5. Round 2Churn guard: PROCEED. 5 addressed, 1 acknowledged. Panel: 19 reviewers. Netero R2: no new findings. 5 P2, 14 P3, 1 P4, 4 Nit, 2 Note new. 3 dropped, 1 dropped. Reviewed against eea427f..8389a1e. Round 3Churn guard: PROCEED. 29 addressed. Re-review panel (8): Netero, Bisky, Mafuuu, Meruem, Hisoka, Pariston, Mafu-san, Knov. Netero: no findings. Panel: 4 P3 new. 0 P2. Reviewed against eea427f..9da0cfe. Round 4Churn guard: PROCEED. 4 addressed (ef1e235). Re-review panel (5): Netero, Mafuuu, Meruem, Hisoka, Razor. Netero: no findings. Panel: 3 P3 new. 0 P2. Reviewed against eea427f..ef1e235. Round 5Churn guard: PROCEED. 3 addressed (eb613a4). Re-review panel (4): Netero, Mafuuu, Hisoka, Pariston. 1 P2, 1 P3, 2 Nit new. Reviewed against eea427f..eb613a4. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
First-pass mechanical review (Netero). These are automated findings from the mechanical first-pass reviewer. The full review panel has not yet reviewed this PR. The panel will review after these findings are addressed.
The feature design is clean: a linear dependency chain from proto through types/resolver/watcher/manager/push to agent wiring, with a well-contained coderd stub. Law evaluated the PR for splitting and concluded it should stay unified (single risk domain, no independent vertical slices, generated protobuf accounts for 762 of 4956 raw lines).
Findings: 1 P2, 2 P3, 2 Nit, 1 Note.
As Netero once said: "The resolve loop is a tidy chain. The quartz clock, however, has a gap in its coverage."
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
Full panel review (19 reviewers). R1 findings all addressed; the quartz clock fix, dead-data cleanup, and time.Sleep elimination are well done.
The overall design is sound: a clean linear dependency chain, well-scoped boundaries, and good test density (34.7%). Takumi found no concurrency issues after tracing every lock, channel, and shutdown path. Pariston evaluated four alternative framings and concluded the author's premise holds. Kite confirmed the risk is uniform and low. Melody walked every serialization pairing and found them correct.
Four areas need attention before merge:
- AllowedRoots is frozen at init before the manifest loads, locking devcontainer workspaces out of runtime AddSource. Five reviewers converged on this independently.
- The walkDir error handler hardcodes KindInstructionFile for all recognized files, including .mcp.json and SKILL.md. This creates phantom resource ID changes when permissions are fixed.
- Several doc comments make false claims about the encoding format, reference nonexistent fields, or describe functionality that doesn't match the implementation.
- resolveLocked holds the mutex across filesystem I/O. Three reviewers flagged this; the impact scales with filesystem latency.
Findings: 5 P2, 14 P3, 1 P4, 4 Nit, 2 Note.
Ryosuke: "Both P2s stem from the same design choice: the Manager computes everything eagerly under the mutex and at construction time."
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
Re-review (8 reviewers). All 29 R2 findings addressed in 9da0cfe. The CRF-10 mutex refactor, CRF-7 AllowedRoots fix, CRF-8 kindFromFilename extraction, and CRF-15 ctx plumbing are all verified correct. The fix commit is clean: 396 insertions, 122 deletions across 13 files, no scope drift, every CRF number traceable to a specific code change.
Netero found no mechanical issues. Takumi-equivalent concurrency traces by Meruem and Hisoka found the same structural pattern (I/O under mutex) in two sibling call sites that the CRF-10 fix did not address. Hisoka also found that watcher degraded state is sticky after ENOSPC recovery. These are bounded-impact issues on infrequent code paths.
New findings: 4 P3, 0 P2. No ship-blockers. The prior REQUEST_CHANGES is dismissed.
Mafu-san: "The 9da0cfe commit addresses all 29 R2 findings. Every finding maps to a specific code change, and the commit message itemizes each one by CRF number. No findings were silently dropped or substituted."
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
Verification round (5 reviewers). All 4 R3 findings verified fixed. The Resync mutex refactor (CRF-36), Watcher.Sync lock split (CRF-37), degraded-state clearing (CRF-38), and kebab-case test (CRF-39) are all confirmed correct by every reviewer.
Three new P3 findings surfaced, all edge cases with bounded impact. The mutex refactors introduced a theoretical version-ordering race (self-healing in one resolve cycle) and exposed an unreachable deadlock path in Run/Close sequencing. A behavioral inconsistency in the oversize skill hash path was also identified.
No P2 or above. The PR is in good shape.
New findings: 3 P3.
Hisoka, on the version race: "Self-healing exists: the watcher fires another trigger, the next resolve produces version N+2 with F2. Window is one resolve cycle (sub-second). But nothing in the protocol tells coderd that version N+1 was a regression."
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
Final verification round (4 reviewers). All 3 R4 fixes verified correct. The epoch counter (CRF-41), defer placement (CRF-40), and skill hash (CRF-42) are all confirmed by every reviewer. Hisoka probed seven additional edge cases and found nothing. Pariston re-evaluated the premise from four framings and confirmed the solution is proportional.
One new P2: Resync and resolveAndBroadcast can publish an empty snapshot when the context is canceled mid-walk. The consequence is deferred (coderd returns Unimplemented today), but the pattern is a data-loss path once the chatd follow-up accepts pushes. One P3 (exported test-only method) and two Nits (comment inaccuracies).
New findings: 1 P2, 1 P3, 2 Nit.
Cumulative: 46 findings across 5 rounds. 42 fixed, 1 accepted, 3 dropped, 4 open.
Hisoka: "Boring code gets silence. This earned it."
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
ibetitsmike
left a comment
There was a problem hiding this comment.
Solid decomposition of the resolver, watcher, manager, and DRPC adapter, and the cancellation and epoch regression coverage is useful. This review found 2 P1 and 5 P2 findings across 7 inline comments. This review contains findings that may need attention before merge.
Mux working on behalf of Mike.
| return res | ||
| } | ||
| data, err := os.ReadFile(path) | ||
| if err != nil { |
There was a problem hiding this comment.
P1 Reject or re-validate symlinked context files before reading payloads (Security Reviewer P1, Edge Case Analyst P1)
every workspace gets the manifest working directory as a scan root in
agent/agentcontext/manager.go:497-500; during recursive walks, any non-directory entry whose basename matchesAGENTS.md,.mcp.json, orSKILL.mdis classified atagent/agentcontext/resolve.go:315-320andagent/agentcontext/resolve.go:331-355; the reader then callsos.ReadFile(path)atagent/agentcontext/resolve.go:419-425without rejectingfs.ModeSymlink, resolving the symlink target, or checking that the opened target remains under an allowed root. A malicious repo can include a symlink such asnested/AGENTS.md -> ~/.ssh/id_rsa; the agent will treat it as an instruction file and send the target bytes in the context snapshot.
Edge Case Analyst adds: So a workspace containing
AGENTS.mdas a symlink to~/.ssh/id_rsa,/etc/passwd, or a very large file is classified as an in-root instruction file, then the target bytes are loaded and pushed even though the target is outside the scan root and may be larger thanDefaultMaxResourceBytes.
This is worth fixing before merge because default working-directory scans are automatic. Skip symlink resources, or resolve each file target and re-validate it against the scan root before opening it with a capped reader.
There was a problem hiding this comment.
Fixed in 496f157. The resolver now follows symlinks (codex follows them unconditionally and developers legitimately symlink AGENTS.md into shared monorepo docs), but resolveReadTarget validates the resolved target is inside the contributing scan root before opening with the capped reader.
- Inside the scan root → follow normally.
- Escapes the scan root →
StatusInvalidwith"symlink target %q escapes scan root", no payload. - Broken or non-regular target →
StatusUnreadable.
emitSkillsFromContainer switched from os.Stat to os.Lstat so symlinked SKILL.md files take the same path. Tests: TestResolver_SymlinkInsideScanRootAllowed, TestResolver_SymlinkOutsideScanRootRejected, TestResolver_BrokenSymlink.
🤖 Reply by Coder Agents.
| // resolver only enforces the per-resource size cap. Future | ||
| // work: detect malformed JSON and surface StatusInvalid. | ||
| func (r *Resolver) readMCPConfig(path string, info fs.FileInfo, userSource string) Resource { | ||
| return r.readFileResource(KindMCPConfig, path, info, userSource) |
There was a problem hiding this comment.
P1 Do not push raw .mcp.json payloads without redacting secret-bearing fields (Security Reviewer P1)
readMCPConfigdelegates directly to the generic file reader atagent/agentcontext/resolve.go:387-392,readFileResourcereads the full file bytes and stores them inres.Payloadatagent/agentcontext/resolve.go:419-425, and the DRPC adapter copies that payload onto the wire atagent/agentcontext/drpc.go:63-69. The on-disk MCP schema includes secret-bearing fields, withEnv map[string]stringandHeaders map[string]stringinagent/x/agentmcp/config.go:12-20, and the user docs show token and Authorization header examples in.mcp.jsonatdocs/ai-coder/agents/extending-agents.md:124-135.
Even with the current coderd stub, the request payload is still transmitted before coderd returns Unimplemented. Unless a downstream consumer has a demonstrated need for raw config bytes, send path/hash metadata only, or redact known secret-bearing fields before building the snapshot.
There was a problem hiding this comment.
Fixed in 496f157. readMCPConfig is no longer a thin wrapper around readFileResource: it reads the bytes only to compute SizeBytes and ContentHash, then drops the payload. Resources of kind KindMCPConfig carry path + hash + size on the wire (change detection still works) but never the bytes, so embedded Env and Headers values stay on disk. Live tool data continues to flow through the MCPProvider as KindMCPServer resources, which is what consumers actually want.
Test TestResolver_MCPConfigEmitted was extended to embed a GITHUB_TOKEN secret and assert Payload is empty while ContentHash is populated.
🤖 Reply by Coder Agents.
| // loop translates this into a clean shutdown rather than a | ||
| // retry storm. | ||
| func (*API) PushContextState(_ context.Context, _ *agentproto.PushContextStateRequest) (*agentproto.PushContextStateResponse, error) { | ||
| return nil, drpcerr.WithCode(errPushContextStateUnimplemented, drpcerr.Unimplemented) |
There was a problem hiding this comment.
P2 The published v2.10 contract is ahead of the implemented server and agent behavior (Go Architect P1, Product Reviewer P2)
agent/proto/version.go:73-81declaresAPI v2.10as addingPushContextStateand setsCurrentMinor = 10.
coderd/workspaceagentsrpc.go:46-61accepts RPC connections whoseversionvalidates againstproto.CurrentVersion, so a caller requesting2.10now passes version negotiation.
coderd/agentapi/context.go:11-21saysCoderd does not yet persist context snapshots; the chatd integration that consumes pushes lives in a follow-up changeand returnsdrpcerr.UnimplementedfromPushContextState.
Go Architect adds:
agent/proto/agent.proto:541-556definesContextResourceas coveringlive MCP server tool listresources and includes theMCP_SERVERkind.
This is a P2 rather than P1 because the current agent handles Unimplemented without retrying forever, but shipping CurrentMinor = 10 makes a nominally supported API whose only new method always fails and whose promised MCP server resources are not wired in production. Could we either keep this behind an internal rollout flag and leave the public version at v2.9, or include the coderd persistence plus MCP provider adapter in the same slice that advertises v2.10?
There was a problem hiding this comment.
Done in 496f157. Reverted tailnet/proto.CurrentMinor to 9 with a comment explaining the bump is deferred to the change that wires coderd persistence + MCP provider. The agent now connects via ConnectRPC29WithRole; apiConnRoutineManager.aAPI is DRPCAgentClient29 again; the startAgentAPI210("push context state", ...) block in run() is removed (with a TODO at the same location for the follow-up). DRPCAgentClient210, agentcontext.NewDRPCPusher, and the rest of the push plumbing remain compiled and tested. TestAgent_ContextStatePushed is skipped with a pointer back to the version bump.
🤖 Reply by Coder Agents.
| Clock: a.clock, | ||
| WorkingDir: workingDirFn, | ||
| InitialSources: initialContextSources(a.contextConfig, workingDirFn), | ||
| }) |
There was a problem hiding this comment.
P2 Resolve boot-time context env sources after the manifest working directory is known (Contract Auditor P2, Structural Analyst P2)
However
initialContextSourcesresolves the config once duringinit, whenworkingDirFncan still return an empty string because the manifest has not been stored yet:
The resolver used here explicitly drops relative paths when the base directory is empty, while the existing context-config contract resolves these same fields against the workspace directory lazily:
The fallback working-directory scan does not fully preserve that contract because the new resolver only recognizes its hard-coded filenames, for example
.mcp.jsonandSKILL.md, rather than the configured file names:
A template that uses relative CODER_AGENT_EXP_* paths or custom CODER_AGENT_EXP_INSTRUCTIONS_FILE or CODER_AGENT_EXP_SKILL_META_FILE values can lose that configuration for the lifetime of the agent. Re-seed after manifest load, or store the raw config in the manager and resolve it lazily with the working directory and configured file names.
There was a problem hiding this comment.
Fixed in 496f157. Added Manager.SeedSources([]Source), a trusted late-binding equivalent of InitialSources that canonicalizes + deduplicates + bypasses AllowedRoots (the paths come from template env vars, not HTTP). The agent's handleManifest calls it right before Trigger() with initialContextSources(a.contextConfig, func() string { return manifest.Directory }), so any relative CODER_AGENT_EXP_*_DIRS that was dropped at boot (when workingDirFn() returned "") now resolves against the actual workspace directory and gets added before the snapshot is rebroadcast.
The narrower sub-finding about custom CODER_AGENT_EXP_INSTRUCTIONS_FILE / SKILL_META_FILE names not being honored is intentionally deferred to a follow-up: it requires plumbing the agentcontextconfig.Config into the resolver's filename classification, which touches more surface than this PR should grow. Filed as TODO in the handleManifest comment block.
Test TestManager_SeedSourcesLateBindsAfterManifest covers the late-binding path with AllowedRoots deliberately excluding the seeded path.
🤖 Reply by Coder Agents.
| return | ||
| } | ||
| if w.watcher == nil { | ||
| // Degraded mode: nothing to wire up; fire the callback |
There was a problem hiding this comment.
P2 Degraded watcher mode re-arms itself into a full-resolve loop (Edge Case Analyst P2, Concurrency Reviewer P2)
NewWatcherreturns a degraded watcher with no fsnotify instance when initialization fails:
Syncturns every degraded sync into a debounced callback:
That callback is
m.signal, and the Run loop handles the signal by callingwatcher.Syncagain before resolving, which schedules the next signal and creates an endless 250ms scan loop on hosts where inotify cannot initialize:
When fsnotify startup fails, the fallback should surface degraded state without continuously walking and pushing the workspace while idle. Make degraded Sync a no-op for manager-triggered syncs, or switch to an explicit low-frequency poller that cannot re-arm itself through OnChange.
There was a problem hiding this comment.
Fixed in 496f157. Degraded Sync is now a true no-op: it returns immediately without calling schedule(), breaking the Sync → m.signal → Run loop → Sync cycle. Manual Resync, AddSource, and RemoveSource still drive re-resolves; the only thing lost in degraded mode is auto-update on file edits, which is the expected degradation when fsnotify cannot initialize.
🤖 Reply by Coder Agents.
| published := m.snapshot | ||
| m.mu.Unlock() | ||
| if watcher != nil { | ||
| watcher.Sync(ctx, roots) |
There was a problem hiding this comment.
P2 Do not call Watcher.Sync with stale roots after the epoch guard rejects a Resync (Concurrency Reviewer P2)
Resynccapturesroots := m.scanRootsLocked()before dropping the mutex and starting its filesystem walk:agent/agentcontext/manager.go:395-406. If another resolve starts while that walk is running, the stale pass detectsm.resolveEpoch != myEpoch, skips publishing, but still callswatcher.Sync(ctx, roots)with its older roots atagent/agentcontext/manager.go:432-443.
The concrete interleaving is: Resync A captures roots without a newly added source, AddSource signals Run B, B captures the new roots and publishes them, then A reaches the stale branch and Syncs the old roots, removing the new source watches even though the snapshot and source list are current. Later edits under that source are not observed until another explicit trigger re-arms the watcher.
The epoch guard already knows this pass is stale, so it should not perform watcher side effects with stale inputs. Return the published snapshot without syncing, or re-read the current roots before touching the watcher.
There was a problem hiding this comment.
Fixed in 496f157. The stale-epoch branch in Resync no longer calls watcher.Sync(ctx, roots) before returning. The winning pass already armed the watcher with the current root set; replaying the stale roots from a losing pass would drop watches on sources that only the newer pass knows about. The branch returns the currently published snapshot and no longer mutates the watcher.
🤖 Reply by Coder Agents.
| mcp := r.MCP.MCPResources() | ||
| resources = append(resources, mcp...) | ||
| // MCP resources may push the aggregate over the cap. | ||
| // Re-apply count and size limits to MCP entries only. |
There was a problem hiding this comment.
P2 Apply byte caps to MCP resources before adding them to the snapshot (Contract Auditor P2, Structural Analyst P2)
DefaultMaxSnapshotBytesis documented as the aggregate payload cap and says resources past it are emitted asStatusExcluded:
ResolveContextthen acknowledges that MCP resources are appended after filesystem caps and says it will re-apply size limits:
But
applyMCPCapsonly checks the resource count and returns immediately when the count is withinMaxResources, so a provider can return one largeKindMCPServerpayload and the snapshot will exceedMaxSnapshotByteswithStatusOK:
This is latent until the production MCP provider is wired, but the contract and comments introduced here already say the cap applies. Route MCP resources through the same per-resource and remaining aggregate budget checks before they can affect PayloadBytes or the wire payload.
There was a problem hiding this comment.
Fixed in 496f157. applyCaps now returns the running aggregate byte total so applyMCPCaps can pick up where it left off. applyMCPCaps walks the appended MCP tail and stamps StatusExcluded (with a payload-clear and a SnapshotError) on any resource that would push the running sum past MaxSnapshotBytes, in addition to the existing count cap. A provider returning a single oversized KindMCPServer payload now ends up excluded instead of breaking the aggregate cap contract.
Test TestResolver_MCPProviderRespectsAggregateByteCap uses a 1 KiB MCP payload with MaxSnapshotBytes: 512 and asserts the resource becomes StatusExcluded with empty Payload and a populated SnapshotError.
🤖 Reply by Coder Agents.
| const ( | ||
| CurrentMajor = 2 | ||
| CurrentMinor = 9 | ||
| CurrentMinor = 10 |
There was a problem hiding this comment.
I am not sure we can bump this in the context of just this PR
There was a problem hiding this comment.
Agreed and reverted. CurrentMinor is back to 9 in 496f157 with a docstring explaining the v2.10 bump is deferred to the change that adds coderd persistence and the MCP provider adapter. See the CRF-52 reply on the parallel thread for the full agent-side wiring change.
🤖 Reply by Coder Agents.
There was a problem hiding this comment.
After looking at the historical precedent (notably #21293 which bumped to v2.7 for ReportBoundaryLogs with an explicit stub server, on the rationale of landing the feature across multiple PRs), I think the right call is actually to ship the bump with this PR after all and keep coderd's handler as a stub returning Unimplemented.
Reasons:
- Every prior minor bump on this API (v2.5 through v2.9) has shipped the version constant in the same PR that adds the proto change. Splitting them is a new pattern.
- The previous state (proto wire scaffolded,
CurrentMinorheld back) shipped ~500 lines of dead code into production binaries that no caller could reach because version negotiation hid them. DRPCPushertranslatesdrpcerr.UnimplementedintoErrPushUnimplemented, so the agent push loop exits cleanly against older coderd deployments. Both directions remain interoperable.
a39871fae8 re-applies the bump:
tailnet/proto.CurrentMinor = 10run()dialsConnectRPC210WithRoleapiConnRoutineManager.aAPIisDRPCAgentClient210;startAgentAPI210helper restored- The
push context stategoroutine is live TestAgent_ContextStatePushedis un-skipped and passescoderd/workspaceagents_test.gopostStartupusesConnectRPC210again
CODAGT-569 is now scoped to just the real coderd handler (persist to workspace_agent_context, fan dirty events to chatd, add the KindMCPServer provider) and the coder exp chat context CLI.
Happy to revert again if you'd rather follow a different convention.
🤖 Reply by Coder Agents.
There was a problem hiding this comment.
I'm not against this approach 👍
…cher, caps) Six concrete fixes from @ibetitsmike's review on PR #25983: P1 symlink arbitrary read: the resolver previously called os.ReadFile on any AGENTS.md/.mcp.json/SKILL.md it walked into, including symlinks, so a workspace with `nested/AGENTS.md -> ~/.ssh/id_rsa` would have shipped the target. Symlinks are still followed (codex follows them too, and they are the standard way to compose monorepo agent docs), but the resolved target must now stay inside the contributing scan root or the resource is emitted as StatusInvalid. Broken symlinks and non-regular targets are emitted as StatusUnreadable. P1 .mcp.json secret exfiltration: readMCPConfig used to ship the raw file bytes, which embed Env tokens and Authorization headers. The resolver now produces metadata only (path, SizeBytes, ContentHash). Change detection still works; the live tool list flows through MCPProvider as a KindMCPServer resource, which is what consumers actually need. P2 v2.10 contract ahead of behavior: revert tailnet/proto.CurrentMinor to 9, switch the agent's connect site back to ConnectRPC29WithRole, and disable the PushContextState push goroutine. The DRPCAgentClient210 interface, the agentcontext.Manager, and the push code paths remain compiled and tested so the v2.10 bump in the follow-up that adds coderd persistence is a small, focused change. A TestAgent_ContextStatePushed skip and a comment at the connect site call out the re-enable steps. P2 boot-time env sources lost: relative CODER_AGENT_EXP_*_DIRS paths used to be dropped during agent.init when the manifest's working directory was still unknown. Add Manager.SeedSources, a trusted late-binding equivalent of InitialSources that bypasses AllowedRoots, and call it from handleManifest with the now-resolved working directory before Trigger. P2 degraded watcher re-arm loop: when fsnotify init fails, Sync used to schedule its OnChange callback, the Manager's signal handler turned it into another Sync, and the cycle repeated every 250ms forever. Degraded Sync is now a true no-op; manual Resync, AddSource, and RemoveSource still drive re-resolves. P2 stale-epoch Resync mutating the watcher: a Resync pass that loses the epoch race already skipped publishing its stale snapshot, but still called watcher.Sync with stale roots. That removed watches on sources only the winning pass knew about. The stale branch now returns the published snapshot without touching the watcher. P2 MCP byte cap not enforced: applyMCPCaps previously only checked the resource-count cap, so a provider returning one big KindMCPServer payload could exceed MaxSnapshotBytes with StatusOK. The cap helper now picks up where applyCaps left off and stamps StatusExcluded plus a snapshot error on MCP entries that breach the count or byte budget. Adds regression tests for symlink-inside, symlink-outside, broken symlinks, MCP secret omission, MCP byte cap, and the manifest-late SeedSources path.
Add the ContextResource, PushContextStateRequest, and PushContextStateResponse messages plus a new PushContextState RPC on the Agent API. Bumps the tailnet+agent shared CurrentVersion to 2.10 and adds the DRPCAgentClient210 interface, plus ConnectRPC210/ConnectRPC210WithRole on codersdk/agentsdk's Client. Reserved kinds (PLUGIN, HOOK, SUBAGENT, COMMAND) ship in the enum but are not emitted by v1 agents. The wire shape matches the workspace context sources RFC. Coderd's server-side handler arrives separately.
Add an Unimplemented stub for PushContextState so the binary satisfies the v2.10 DRPCAgentServer interface. The real handler that persists snapshots and fans dirty events out to chats lands with the chatd integration. Until then, agents receive Unimplemented and stop pushing for the connection (agentcontext.RunPush handles this gracefully).
DRPCPusher adapts the generated DRPCAgentClient210 to the agentcontext.Pusher interface so RunPush can ship snapshots without coupling the package to a specific proto version. drpcerr.Unimplemented is translated to ErrPushUnimplemented so the agent stops cleanly when talking to an older coderd. Resync now re-arms the recursive watcher in addition to re-resolving. This is required so a Resync triggered by a late-arriving working directory (manifest fetch) starts watching the workspace tree.
Boot the agentcontext.Manager during agent init, seed its sources from the existing CODER_AGENT_EXP_*_DIRS env vars, run its watcher/resolver loop in the graceful context, and tear it down in agent.Close. Mount the source CRUD and resync HTTP API at /api/v0/context. Connect to coderd via the new ConnectRPC210WithRole and push each snapshot via DRPCPusher. The push goroutine is registered through a new startAgentAPI210 helper that exposes the v2.10 client. When the manifest lands, trigger a Resync so the snapshot reflects the workspace directory immediately instead of waiting for the next filesystem event. Add agenttest.Client.ContextStatePushes and a FakeAgentAPI PushContextState implementation so tests can assert on the push traffic, plus an end-to-end test that boots a real agent against the fake and verifies AGENTS.md appears in a snapshot.
- CRF-1 (P2): thread quartz.Clock through PushOptions so pushWithRetry uses clock.NewTimer, making the retry test deterministic via a quartz trap instead of real sleeps. - CRF-2 (P3): remove dead skillsParentNames map; simplify isSkillsContainer to a base-name check that already covered every reachable case. - CRF-3 (P3): remove unused Snapshot.AggregateHashHex (api.go inlines hex encoding for HTTP responses). - CRF-4 (Nit): replace time.Sleep timing waits in watch_test.go and manager_test.go with Eventually-driven writes and a new Manager.Started signal channel. - CRF-5 (Nit): drop unused parameters from defaultContextRoots.
P2: - CRF-7: effectiveAllowedRoots now appends the current working directory lazily so the manager picks up the resolved path after the manifest loads. - CRF-8: walkDir error path computes Kind from the basename (kindFromFilename) so .mcp.json and SKILL.md keep a stable resource ID across permission flips. - CRF-9: ComputeAggregateHash / writeLengthPrefixed docs now describe the Netstring-style encoding accurately. - CRF-10: resolveAndBroadcast snapshots inputs under the lock, releases it for the filesystem walk, then re-acquires to swap. - CRF-11: api.go SnapshotResource comment no longer references a nonexistent PayloadBase64 field. P3: - CRF-12: WatcherOptions.MaxDepth; the watcher now mirrors the resolver's depth instead of hardcoding DefaultMaxScanDepth. - CRF-13: drop obsolete "follow-up" sentence in doc.go. - CRF-14: Pusher doc says Agent API v2.10, not proto v30. - CRF-15: Resolver.ResolveContext threads ctx through walk so resolveAndBroadcast can cancel between roots. - CRF-16: new tests cover StatusUnreadable for instruction and MCP-config files (skipped on Windows and root). - CRF-17: TestRunPush_ClosesOnManagerClose covers closedCh path. - CRF-18: drop dead !ok branch on the subscriber channel. - CRF-19: extract readFileResource; readInstructionFile and readMCPConfig share the plumbing. - CRF-24: DefaultPushInitialBackoff / DefaultPushMaxBackoff. - CRF-25: subsequent-push test asserts AggregateHash and Version advance, not just the count. P4: - CRF-26: TestRunPush_RejectedResponseProceeds covers the Accepted=false fast-path. Nit: - CRF-27: slices.SortFunc/SortStableFunc replace sort.Slice. - CRF-28: firstLine uses strings.SplitSeq. - CRF-29: sync.OnceFunc replaces the sync.Once+closure wrapper. - CRF-30: atomic.Int32 replaces bare int32 in watch_test. - CRF-31: redundant sort in resolveLocked already gone with CRF-10. - CRF-32: drop redundant "agentcontext:" prefix from log messages. - CRF-33: resolveAndBroadcast doc no longer claims conditional. Note: - CRF-34: ResourceStatus doc explains the proto-enum offset. - CRF-35: readInstructionFile doc flags the sanitization gap delegated to the chatd follow-up.
CRF-36: Resync now mirrors the resolveAndBroadcast pattern. Filesystem walks happen outside m.mu, so concurrent Sources, AddSource, RemoveSource, and Snapshot calls no longer block on resolver I/O. resolveLocked is only invoked from NewManager's initial resolve, matching the comment. CRF-37: Watcher.Sync drops w.mu around collectDirs (the recursive directory walk) and re-acquires it only for the diff-and-apply phase. Close, schedule, and the run goroutine no longer block on a slow filesystem. CRF-38: Watcher.degraded is cleared at the end of a Sync pass whose Adds all succeeded. A user who bumps the inotify watch limit and re-syncs now sees a clean snapshot instead of a permanent SnapshotError. CRF-39: new TestResolver_SkillNameNonKebabInvalid covers the kebab-case validation branch in readSkillMeta (name matches parent directory but fails SkillNamePattern).
The handleManifest Resync call was synchronous; on slow filesystems (CI macOS and Windows runners walking a real $HOME) the resolver walk could outlive the agent's graceful shutdown window, leaking a goroutine inside filepath.WalkDir and tripping goleak in the agent test suite. Two changes break the leak: - New Manager.Trigger() queues an asynchronous re-resolve on the trigger channel. handleManifest now calls Trigger instead of Resync, so manifest handling returns immediately and the background walk is owned by the Manager.Run goroutine that already respects gracefulCtx cancellation. - Resolver.walkDir threads ctx into the filepath.WalkDir callback, so when ctx cancels the walk aborts after the current directory read instead of running to completion.
CRF-40: ensure runDoneCh is closed even if NewWatcher fails so Close does not deadlock waiting on a goroutine that already exited. CRF-41: gate snapshot publishes with a monotonic resolveEpoch counter. Resync and resolveAndBroadcast drop m.mu around the filesystem walk; under concurrency a stale walk could overwrite a fresher one at a higher version number. Each pass now captures its epoch under the lock and skips the publish if a newer pass has started. Resync returns the currently published Snapshot in the stale case, which is guaranteed to be at least as fresh as the discarded result. CRF-42: hash the (capped) prefix of oversize SKILL.md files so an edit that keeps the file oversize still shifts the aggregate hash, matching readFileResource and preserving the change-detection contract.
CRF-44: bail without publishing when ctx is canceled mid-walk in both Resync and resolveAndBroadcast. ResolveContext returns an empty Snapshot when its ctx is canceled; publishing it would replace the live Snapshot with empty resources until the next trigger. Resync now returns the existing Snapshot and ctx.Err() instead. Added regression test TestManager_ResyncCanceledKeepsLiveSnapshot. CRF-43: unexport Manager.Started to started; the method has no production callers. Tests reach it through ManagerStarted in export_test.go. CRF-45: drop the inaccurate Unix newline conversion clause from readMCPConfig's doc comment. CRF-46: AllowedRoots doc clarified that an empty slice falls back to the home directory rather than disabling validation.
agent.NewManager never returns an error; it had a single return m, nil path. Drop the error from the signature so the agent.go call site does not pretend to handle a failure that cannot happen. Remove the five dead 'if a.contextManager != nil' guards that followed. Move the hardcoded default scan-root and allow-root lists out of agent/agentcontext_seed.go and into the agentcontext package itself as DefaultBuiltinRoots() and DefaultAllowedRoots(). The package owns the policy now; the agent passes them in explicitly, matching the pattern other defaults (Clock, Debounce, SchemaVersion) follow. agentcontext_seed.go is gone. The one helper that legitimately lives in package agent, initialContextSources (which bridges the legacy CODER_AGENT_EXP_*_DIRS env vars), moves into agent.go next to its sole caller.
…watch to watcher Move types.go contents next to their producers and consumers: - Source moves into manager.go (Manager input). - Resource, ResourceKind, ResourceStatus, their String() methods, Snapshot, ComputeAggregateHash, and writeLengthPrefixed move into resolve.go (the resolver emits and hashes these). - types_test.go is folded into resolve_test.go. Rename watch.go and watch_test.go to watcher.go and watcher_test.go; the file describes the Watcher type. Inline switchHomeEnv into paths_test.go (its only caller) and delete helpers_test.go. export_test.go stays. It is the Go-conventional bridge for exposing the unexported started() channel to external test packages and is recognized by the toolchain for this purpose.
DefaultBuiltinRoots and DefaultAllowedRoots were one-shot exports whose only caller was the agent passing them straight back to the package as ManagerOptions. The round-trip is dead API surface; the package owns its scan and allow-list policy and should not require callers to ship it on every construction. - Drop the BuiltinRoots field from ManagerOptions. The package always layers ~/.coder, ~/.coder/skills, and ~/.claude/plugins/cache via the unexported defaultBuiltinRoots, consulted on every scanRootsLocked call. - Keep AllowedRoots as an option because the package's tests use it as a control surface for the validation logic. An empty AllowedRoots now falls back to the unexported defaultAllowedRoots ([~, ~/.coder, ~/.claude]) instead of just [~]. - Agent no longer passes either field; it constructs the Manager with just Logger/Clock/WorkingDir/InitialSources. - Add a package TestMain that sets HOME and USERPROFILE to a binary-wide temp dir so the now-mandatory built-in walks resolve to nothing on developer hosts, keeping Len(Resources, N) assertions deterministic.
…cher, caps) Six concrete fixes from @ibetitsmike's review on PR #25983: P1 symlink arbitrary read: the resolver previously called os.ReadFile on any AGENTS.md/.mcp.json/SKILL.md it walked into, including symlinks, so a workspace with `nested/AGENTS.md -> ~/.ssh/id_rsa` would have shipped the target. Symlinks are still followed (codex follows them too, and they are the standard way to compose monorepo agent docs), but the resolved target must now stay inside the contributing scan root or the resource is emitted as StatusInvalid. Broken symlinks and non-regular targets are emitted as StatusUnreadable. P1 .mcp.json secret exfiltration: readMCPConfig used to ship the raw file bytes, which embed Env tokens and Authorization headers. The resolver now produces metadata only (path, SizeBytes, ContentHash). Change detection still works; the live tool list flows through MCPProvider as a KindMCPServer resource, which is what consumers actually need. P2 v2.10 contract ahead of behavior: revert tailnet/proto.CurrentMinor to 9, switch the agent's connect site back to ConnectRPC29WithRole, and disable the PushContextState push goroutine. The DRPCAgentClient210 interface, the agentcontext.Manager, and the push code paths remain compiled and tested so the v2.10 bump in the follow-up that adds coderd persistence is a small, focused change. A TestAgent_ContextStatePushed skip and a comment at the connect site call out the re-enable steps. P2 boot-time env sources lost: relative CODER_AGENT_EXP_*_DIRS paths used to be dropped during agent.init when the manifest's working directory was still unknown. Add Manager.SeedSources, a trusted late-binding equivalent of InitialSources that bypasses AllowedRoots, and call it from handleManifest with the now-resolved working directory before Trigger. P2 degraded watcher re-arm loop: when fsnotify init fails, Sync used to schedule its OnChange callback, the Manager's signal handler turned it into another Sync, and the cycle repeated every 250ms forever. Degraded Sync is now a true no-op; manual Resync, AddSource, and RemoveSource still drive re-resolves. P2 stale-epoch Resync mutating the watcher: a Resync pass that loses the epoch race already skipped publishing its stale snapshot, but still called watcher.Sync with stale roots. That removed watches on sources only the winning pass knew about. The stale branch now returns the published snapshot without touching the watcher. P2 MCP byte cap not enforced: applyMCPCaps previously only checked the resource-count cap, so a provider returning one big KindMCPServer payload could exceed MaxSnapshotBytes with StatusOK. The cap helper now picks up where applyCaps left off and stamps StatusExcluded plus a snapshot error on MCP entries that breach the count or byte budget. Adds regression tests for symlink-inside, symlink-outside, broken symlinks, MCP secret omission, MCP byte cap, and the manifest-late SeedSources path.
496f157 to
5576dc8
Compare
| // | ||
| // API v2.10 (proposed, NOT advertised): | ||
| // - PushContextState RPC on the Agent API for pushing | ||
| // resolved workspace context snapshots (instruction files, | ||
| // skills, MCP configs, MCP server tool lists) from the | ||
| // agent to coderd. Adds ContextResource, PushContextState | ||
| // Request, and PushContextStateResponse messages. | ||
| // - CurrentMinor stays at 9 until coderd persistence and the | ||
| // KindMCPServer provider land. The RPC is defined and the | ||
| // agent-side plumbing exists so we can iterate behind a | ||
| // flag, but advertising v2.10 today would expose an API | ||
| // whose only new method returns Unimplemented from every | ||
| // deployed server. Bump CurrentMinor in the same change | ||
| // that wires coderd PushContextState persistence. |
There was a problem hiding this comment.
Please add and link a follow-up issue to track this work.
There was a problem hiding this comment.
Tracked in CODAGT-569 under the Coder Agents project, assigned to @kylecarbs. The issue captures the agent-side bump (CurrentMinor to 10, switch run() back to ConnectRPC210WithRole, re-enable startAgentAPI210 for push, un-skip TestAgent_ContextStatePushed), the coderd PushContextState handler that persists to workspace_agent_context and emits dirty events, the chatd consumer of those snapshots, and the coder exp chat context CLI.
🤖 Reply by Coder Agents.
Cian asked on PR review (#25983) for a follow-up issue tracking the deferred v2.10 PushContextState enablement. CODAGT-569 covers the CurrentMinor bump, the agent push goroutine, the coderd handler, and the chatd/CLI consumers. Reference it from the comments where the deferral is documented so the issue is discoverable from the source. 🤖 Filed by Coder Agents.
Match the historical pattern for adding new RPCs (see #21293 for v2.7 ReportBoundaryLogs): ship the proto change with the version bump and a stub handler in the same PR, then land real persistence in a follow-up. Changes: - tailnet/proto.CurrentMinor goes from 9 to 10. - agent.run() dials with ConnectRPC210WithRole. - apiConnRoutineManager holds DRPCAgentClient210; existing v2.8 routines still bind to DRPCAgentClient28 via interface embedding. - New startAgentAPI210 helper hands v2.10 routines the upgraded client. - The push context state routine is live. DRPCPusher translates the coderd Unimplemented stub to ErrPushUnimplemented so RunPush exits cleanly against older coderd deployments. - TestAgent_ContextStatePushed is un-skipped and passes against agenttest.FakeAgentAPI. - coderd/workspaceagents_test.go postStartup uses ConnectRPC210 again. Real persistence (workspace_agent_context rows, chatd hydration, KindMCPServer provider) tracked by CODAGT-569. 🤖 Reply by Coder Agents.
Replace the flat `bytes payload` / `Kind enum` / `id` / `description`
fields on ContextResource with a proto3 oneof whose variant tells coderd
the resource kind and carries kind-specific structured fields. Driven by
review feedback that the prior shape:
- shipped arbitrary JSON for mcp_server tools with no schema, so two
agent versions could silently disagree on the inner JSON shape and
produce conflicting aggregate_hash values.
- duplicated information already implied by the variant (kind, id).
- forced coderd to switch on Kind and then json.Unmarshal the payload to
reach the structure.
New proto shape on `ContextResource`:
- source, optional source_path, content_hash, size_bytes, status, error.
- oneof body { InstructionFileBody, SkillMetaBody, MCPConfigBody,
MCPServerBody }.
- Reserved tags 7-9 and 14-16 to prevent reintroducing the legacy
id/kind/payload/description slots or accidentally renumbering the
reserved plugin/hook/subagent/command kinds.
Body messages:
- InstructionFileBody { content }.
- SkillMetaBody { meta, name, description }.
- MCPConfigBody {} (intentionally empty so env-block secrets cannot
leave the agent; content_hash + size_bytes on the outer resource
still detect changes).
- MCPServerBody { server_name, description, tools[] } where MCPTool is
{ name, description, input_schema (google.protobuf.Struct) }.
Go side:
- Resource gains Name (skill front-matter, MCP server name) and
Tools []MCPTool; the flat struct stays the internal representation
and drpc.go is the only place that maps to the wire.
- MCPProvider interface doc updated to require Tools and Name.
- readSkillMeta populates Resource.Name from front-matter.
- Wire mapping always sets a body variant, even for resources with
Status != OK, so coderd can attribute failures to a known kind.
The body variant is always set; an unset body indicates an unrecognized
kind. ID is no longer transmitted; coderd derives the per-resource key
from (oneof variant, source). description is dropped from the
ContextResource flat field and lives only on the body types that
naturally carry it (skill, mcp_server).
🤖 Reply by Coder Agents.
|
Following up on Kyle's call to do this right rather than ship
Wire-mapping in Could you take another look at the proto change in df47cf39aa when you have a moment? The diff is intentionally a single commit so the proto change is reviewable on its own. 🤖 Reply by Coder Agents. |
johnstcn
left a comment
There was a problem hiding this comment.
I like the oneof way better!
Adds the agent half of the workspace context sources RFC. The agent now resolves instruction files, skills, and MCP configs into a typed
Snapshot, watches the relevant paths recursively, exposes the source list over a workspace-agent HTTP API, and pushes eachSnapshotto coderd over a newPushContextStateRPC on Agent API v2.10.The coderd-side handler is a stub returning
Unimplementedfor now. Real persistence toworkspace_agent_context, chatd hydration on dirty events, and theKindMCPServerMCP provider are tracked by CODAGT-569. This matches the pattern used for v2.7ReportBoundaryLogsin #21293, which bumped the version and shipped a stub server so the wire and client could iterate before the persistence layer landed.What ships
agent/agentcontext (new package)
Source,Resource(kindsinstruction_file,skill,mcp_config,mcp_serverplus reservedplugin/hook/subagent/command),ResourceStatus,Snapshot,ComputeAggregateHash.Managerowns the in-memory source list, performs the initial resolve synchronously inNewManager, runs a re-resolve/watcher loop inRun, exposesAddSource/RemoveSource/Sources/HasSource/Snapshot/SubscribeChanges/Resync/SeedSources/Close.Resolverwalks scan roots, classifies recognized files, enforces 64 KiB per-resource, 2 MiB aggregate, and 500-resource caps withStatusOversize/StatusExcluded/StatusUnreadable/StatusInvalidoutcomes, skipsnode_modules/vendor/etc., validates symlink targets stay inside the scan root, stampsSourcePathon user-derived resources, and optionally pulls MCP server tool lists via anMCPProviderinterface. MCP config resources ship metadata only (size, hash) so secrets in env blocks never leave the agent.Watcheris a recursivefsnotifywrapper with a 250 ms debounce, dynamic arming of newly created directories, and an ENOSPC-tolerant degraded mode that no-ops further syncs until the manager resyncs explicitly.GET/POST /sources,GET/DELETE /sources/{path},POST /resyncmounted at/api/v0/context.Pusherinterface plusRunPushgoroutine with exponential backoff capped at 30 s.DRPCPusheradapts the generatedDRPCAgentClient210toPusherand translatesdrpcerr.UnimplementedtoErrPushUnimplementedso the push loop exits cleanly when talking to coderd deployments that have not enabled the real handler.agent/proto (v2.10)
ContextResource,PushContextStateRequest,PushContextStateResponseand thePushContextStateRPC onservice Agent.DRPCAgentClient210interface andcodersdk/agentsdk.Client.ConnectRPC210/ConnectRPC210WithRole.tailnet/proto.CurrentMinorbumped from9to10.Agent wiring
agent.Options.Clientdeclares both v2.9 and v2.10 connectors;run()dials withConnectRPC210WithRole.apiConnRoutineManagerholds aDRPCAgentClient210. Existing v2.8 routines keep their narrowerDRPCAgentClient28signature thanks to interface embedding.startAgentAPI210is the v2.10 counterpart tostartAgentAPIfor routines that need the new client. The push context state routine uses it.contextManageris constructed inagent.init(), seeded from the existingCODER_AGENT_EXP_*_DIRSenv vars, started in its own goroutine undergracefulCtx, and closed inagent.Close.handleManifestcallsManager.SeedSourcesfor sources rooted at the manifest directory, thenResyncaftermanifest.Swap, so the snapshot reflects the workspace working directory immediately instead of waiting for the next filesystem event./api/v0/contextwhen the manager is up.Coderd stub
coderd/agentapi/context.goreturnsdrpcerr.UnimplementedforPushContextState. The real handler that persistsworkspace_agent_contextrows, hydrates chats, and emits dirty events lives in CODAGT-569.Tests
24 tests across
agent/agentcontextcover types, paths, resolver behavior with file caps, skill containers, MCP secret omission, symlink target validation, the recursive watcher firing on real fsnotify events, manager source CRUD /Resync/SeedSources/Runlifetime, the HTTP API, the DRPC adapter, and the push retry / initial-flag / unimplemented paths. Passesgo test -race -count=2.TestAgent_ContextStatePushedboots a full agent againstagenttest.FakeAgentAPI(which now recordsPushContextStatetraffic) and asserts the seededAGENTS.mdappears in a snapshot push withschema_version = 1.Notes for reviewers
StatusExcludedand an empty payload so the aggregate hash still detects content edits. MCP-emitted resources enforce both a per-provider count cap and the aggregate byte cap.StatusExcludedso credentials reachable via a stray symlink stay off the wire.lifecycle = readyin the eventual full design. For this PR theSeedSourcesplushandleManifest-drivenResynckeeps the snapshot fresh; the live push loop ships now and DRPCPusher translates the coderdUnimplementedstub into a clean exit.PLUGIN/HOOK/SUBAGENT/COMMANDkinds are reserved in proto and Go enums but unused; the Claude Code plugin resolver ships in a follow-up that does not need a schema migration.coder exp chat contextCLI command set forlist/show/add/remove/refresh.This PR was authored by Coder Agents on Kyle Carberry's behalf.