Skip to content

feat: add agent-side workspace context sources and Agent API v2.10 PushContextState#25983

Merged
kylecarbs merged 21 commits into
mainfrom
feat/agent-context-package
Jun 8, 2026
Merged

feat: add agent-side workspace context sources and Agent API v2.10 PushContextState#25983
kylecarbs merged 21 commits into
mainfrom
feat/agent-context-package

Conversation

@kylecarbs

@kylecarbs kylecarbs commented Jun 2, 2026

Copy link
Copy Markdown
Member

Adds the agent half of the workspace context sources RFC. The agent now resolves instruction files, skills, and MCP configs into a typed Snapshot, watches the relevant paths recursively, exposes the source list over a workspace-agent HTTP API, and pushes each Snapshot to coderd over a new PushContextState RPC on Agent API v2.10.

The coderd-side handler is a stub returning Unimplemented for now. Real persistence to workspace_agent_context, chatd hydration on dirty events, and the KindMCPServer MCP provider are tracked by CODAGT-569. This matches the pattern used for v2.7 ReportBoundaryLogs in #21293, which bumped the version and shipped a stub server so the wire and client could iterate before the persistence layer landed.

What ships

agent/agentcontext (new package)

  • Source, Resource (kinds instruction_file, skill, mcp_config, mcp_server plus reserved plugin/hook/subagent/command), ResourceStatus, Snapshot, ComputeAggregateHash.
  • Manager owns the in-memory source list, performs the initial resolve synchronously in NewManager, runs a re-resolve/watcher loop in Run, exposes AddSource/RemoveSource/Sources/HasSource/Snapshot/SubscribeChanges/Resync/SeedSources/Close.
  • Resolver walks scan roots, classifies recognized files, enforces 64 KiB per-resource, 2 MiB aggregate, and 500-resource caps with StatusOversize/StatusExcluded/StatusUnreadable/StatusInvalid outcomes, skips node_modules/vendor/etc., validates symlink targets stay inside the scan root, stamps SourcePath on user-derived resources, and optionally pulls MCP server tool lists via an MCPProvider interface. MCP config resources ship metadata only (size, hash) so secrets in env blocks never leave the agent.
  • Watcher is a recursive fsnotify wrapper with a 250 ms debounce, dynamic arming of newly created directories, and an ENOSPC-tolerant degraded mode that no-ops further syncs until the manager resyncs explicitly.
  • HTTP API for GET/POST /sources, GET/DELETE /sources/{path}, POST /resync mounted at /api/v0/context.
  • Pusher interface plus RunPush goroutine with exponential backoff capped at 30 s. DRPCPusher adapts the generated DRPCAgentClient210 to Pusher and translates drpcerr.Unimplemented to ErrPushUnimplemented so the push loop exits cleanly when talking to coderd deployments that have not enabled the real handler.

agent/proto (v2.10)

  • New messages ContextResource, PushContextStateRequest, PushContextStateResponse and the PushContextState RPC on service Agent.
  • Generated DRPCAgentClient210 interface and codersdk/agentsdk.Client.ConnectRPC210 / ConnectRPC210WithRole.
  • tailnet/proto.CurrentMinor bumped from 9 to 10.

Agent wiring

  • agent.Options.Client declares both v2.9 and v2.10 connectors; run() dials with ConnectRPC210WithRole.
  • apiConnRoutineManager holds a DRPCAgentClient210. Existing v2.8 routines keep their narrower DRPCAgentClient28 signature thanks to interface embedding.
  • startAgentAPI210 is the v2.10 counterpart to startAgentAPI for routines that need the new client. The push context state routine uses it.
  • A contextManager is constructed in agent.init(), seeded from the existing CODER_AGENT_EXP_*_DIRS env vars, started in its own goroutine under gracefulCtx, and closed in agent.Close.
  • handleManifest calls Manager.SeedSources for sources rooted at the manifest directory, then Resync after manifest.Swap, so the snapshot reflects the workspace working directory immediately instead of waiting for the next filesystem event.
  • HTTP routes mounted at /api/v0/context when the manager is up.

Coderd stub

coderd/agentapi/context.go returns drpcerr.Unimplemented for PushContextState. The real handler that persists workspace_agent_context rows, hydrates chats, and emits dirty events lives in CODAGT-569.

Tests

24 tests across agent/agentcontext cover types, paths, resolver behavior with file caps, skill containers, MCP secret omission, symlink target validation, the recursive watcher firing on real fsnotify events, manager source CRUD / Resync / SeedSources / Run lifetime, the HTTP API, the DRPC adapter, and the push retry / initial-flag / unimplemented paths. Passes go test -race -count=2.

TestAgent_ContextStatePushed boots a full agent against agenttest.FakeAgentAPI (which now records PushContextState traffic) and asserts the seeded AGENTS.md appears in a snapshot push with schema_version = 1.

Notes for reviewers
  • Source CRUD is workspace-agent-token only; coderd is not in the path for source mutation.
  • Per-resource cap 64 KiB, aggregate 2 MiB, count cap 500; resources past the cap ship with StatusExcluded and an empty payload so the aggregate hash still detects content edits. MCP-emitted resources enforce both a per-provider count cap and the aggregate byte cap.
  • Symlinks inside the scan root are followed; symlinks pointing outside (or broken) are rejected with StatusExcluded so credentials reachable via a stray symlink stay off the wire.
  • The initial push gates lifecycle = ready in the eventual full design. For this PR the SeedSources plus handleManifest-driven Resync keeps the snapshot fresh; the live push loop ships now and DRPCPusher translates the coderd Unimplemented stub into a clean exit.
  • The PLUGIN/HOOK/SUBAGENT/COMMAND kinds are reserved in proto and Go enums but unused; the Claude Code plugin resolver ships in a follow-up that does not need a schema migration.
  • Two follow-ups remain, both tracked by CODAGT-569: (1) the chatd-side handler that persists snapshots and dirties chats; (2) the coder exp chat context CLI command set for list/show/add/remove/refresh.

This PR was authored by Coder Agents on Kyle Carberry's behalf.

@kylecarbs kylecarbs changed the title feat(agent/agentcontext): agent-side workspace context resolver and v2.10 PushContextState feat: workspace context resolver and v2.10 PushContextState RPC Jun 2, 2026

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review

coder-agents-review Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Chat: Spend limit reached | View chat
Requested: 2026-06-02 18:56 UTC by @kylecarbs
Spend: $114.93 / $100.00 (limit reached)

Review history
  • R1 (2026-06-02): 1 reviewers, 2 Nit, 1 Note, 1 P2, 2 P3, COMMENT. Review
  • R2 (2026-06-02): 19 reviewers, 9 Nit, 3 Note, 6 P2, 12 P3, 1 P4, REQUEST_CHANGES. Review
  • R3 (2026-06-02): 7 reviewers, 9 Nit, 3 Note, 6 P2, 16 P3, 1 P4, COMMENT. Review
  • R4 (2026-06-02): 4 reviewers, 9 Nit, 3 Note, 6 P2, 19 P3, 1 P4, COMMENT. Review
  • R5 (2026-06-02): 3 reviewers, 11 Nit, 3 Note, 7 P2, 20 P3, 1 P4, COMMENT. Review

deep-review v0.6.1 | Round 5 | eea427f..eb613a4

Last posted: Round 5, 42 findings (7 P2, 20 P3, 1 P4, 11 Nit, 3 Note), COMMENT. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P2 Author fixed (8389a1e) push.go:161 time.After(backoff) bypasses quartz testability R1 Netero Yes
CRF-2 P3 Author fixed (8389a1e) resolve.go:76 7 of 8 skillsParentNames entries are unreachable dead data R1 Netero Yes
CRF-3 P3 Author fixed (8389a1e) types.go:212 AggregateHashHex() has no production caller R1 Netero Yes
CRF-4 Nit Author fixed (8389a1e) watch_test.go:37 time.Sleep(50ms) in tests violates project convention R1 Netero Yes
CRF-5 Nit Author fixed (8389a1e) agentcontext_seed.go:16 defaultContextRoots accepts two unused parameters R1 Netero Yes
CRF-6 Note Author accepted R2 (Go type-system limitation) agent.go:2556 startAgentAPI210 is a full copy of startAgentAPI R1 Netero Yes
CRF-7 P2 Author fixed (9da0cfe) agentcontext_seed.go:77 AllowedRoots frozen at init; working directory missing R2 Hisoka P2, Meruem P2, Ryosuke P2, Knov P2, Kite P3 Yes
CRF-8 P2 Author fixed (9da0cfe) resolve.go:248 walkDir error handler hardcodes KindInstructionFile R2 Chopper Yes
CRF-9 P2 Author fixed (9da0cfe) types.go:209 writeLengthPrefixed comment says "uvarint" R2 Mafuuu P2, Gon P2, Leorio P3, Ryosuke P3, Kite Nit, Zoro P3, Razor P3 Yes
CRF-10 P2 Author fixed (9da0cfe) manager.go:459 resolveLocked holds mutex across filesystem I/O R2 Ryosuke P2, Meruem P3, Killua P3 Yes
CRF-11 P2 Author fixed (9da0cfe) api.go:22 SnapshotResource comment references nonexistent PayloadBase64 R2 Gon Yes
CRF-12 P3 Author fixed (9da0cfe) watch.go:317 Watcher depth hardcoded to DefaultMaxScanDepth R2 Hisoka P3, Mafuuu P3, Zoro P3, Melody P3, Chopper Note Yes
CRF-13 P3 Author fixed (9da0cfe) doc.go:24 Package doc says wiring is follow-up R2 Mafu-san P3, Gon P2, Mafuuu Nit, Leorio Nit, Razor Nit Yes
CRF-14 P3 Author fixed (9da0cfe) push.go:38 Pusher doc references phantom "proto v30" R2 Gon P3, Leorio P3, Mafuuu Nit, Razor Nit Yes
CRF-15 P3 Author fixed (9da0cfe) manager.go:458 resolveAndBroadcast accepts ctx but discards it R2 Mafuuu P3, Meruem Nit, Chopper Nit, Ryosuke Nit, Zoro Nit, Pariston Nit Yes
CRF-16 P3 Author fixed (9da0cfe) resolve_test.go No test for StatusUnreadable R2 Bisky Yes
CRF-17 P3 Author fixed (9da0cfe) push_test.go No test for Manager.Close/RunPush via closedCh R2 Bisky Yes
CRF-18 P3 Author fixed (9da0cfe) push.go:123 Dead !ok branch on subscriber channel R2 Meruem P3, Hisoka Note Yes
CRF-19 P3 Author fixed (9da0cfe) resolve.go:331 readInstructionFile/readMCPConfig structurally identical R2 Zoro Yes
CRF-20 P4 Dropped by orchestrator (scope creep) watch.go:27 DefaultWatchDebounce duplicates MCP constant R2 Robin No
CRF-21 P4 Dropped by orchestrator (scope creep) resolve.go:75 skipDirNames overlaps filefinder's skipDirs R2 Robin No
CRF-22 P4 Dropped by orchestrator (scope creep) paths.go:18 CanonicalizePath reimplements MCP resolvePath R2 Robin No
CRF-23 Note Dropped by orchestrator (non-code artifact) PR description PR description claims 23 tests but actual is 58 R2 Mafu-san No
CRF-24 P3 Author fixed (9da0cfe) push.go:83 Push retry defaults are inline magic numbers R2 Gon Yes
CRF-25 P3 Author fixed (9da0cfe) push_test.go:115 TestRunPush doesn't assert updated content R2 Chopper Yes
CRF-26 P4 Author fixed (9da0cfe) push.go:149 Push Accepted=false path untested R2 Bisky Yes
CRF-27 Nit Author fixed (9da0cfe) resolve.go:139 sort.Slice instead of slices.SortFunc R2 Ging-Go Yes
CRF-28 Nit Author fixed (9da0cfe) resolve.go:558 strings.Split instead of SplitSeq R2 Ging-Go Yes
CRF-29 Nit Author fixed (9da0cfe) manager.go:355 sync.Once+wrapper instead of OnceFunc R2 Ging-Go Yes
CRF-30 Nit Author fixed (9da0cfe) watch_test.go:23 atomic.AddInt32 instead of atomic.Int32 R2 Ging-Go Yes
CRF-31 Nit Author fixed (9da0cfe) manager.go:497 Redundant sort in resolveLocked R2 Ryosuke, Zoro Yes
CRF-32 Nit Author fixed (9da0cfe) push.go:108 Log messages carry redundant "agentcontext:" prefix R2 Leorio Yes
CRF-33 Nit Author fixed (9da0cfe) manager.go:457 resolveAndBroadcast docstring contradicts implementation R2 Pariston Yes
CRF-34 Note Author fixed (9da0cfe) types.go:78 ResourceStatus ordinals diverge from proto R2 Melody Yes
CRF-35 Note Author fixed (9da0cfe) resolve.go:331 Resolver omits sanitization from agentcontextconfig R2 Zoro Yes
CRF-36 P3 Author fixed (ef1e235) manager.go:383 Resync holds mutex during I/O + ignores ctx; resolveLocked comment is wrong R3 Mafuuu P3, Pariston P3, Knov P3, Mafu-san P2, Meruem Nit Yes
CRF-37 P3 Author fixed (ef1e235) watch.go:148 Watcher.Sync holds mu across collectDirs filesystem walk R3 Meruem P3, Hisoka P3 Yes
CRF-38 P3 Author fixed (ef1e235) watch.go:180 Watcher.degraded is sticky; never clears after recovery R3 Hisoka Yes
CRF-39 P3 Author fixed (ef1e235) resolve_test.go No test for kebab-case skill name validation branch R3 Bisky Yes
CRF-40 P3 Author fixed (eb613a4) manager.go:190 Manager.Run sets running=true before NewWatcher; if NewWatcher errors, runDoneCh never closed, Close deadlocks R4 Mafuuu Yes
CRF-41 P3 Author fixed (eb613a4) manager.go:528 Concurrent Resync + resolveAndBroadcast can publish stale snapshot at higher version (one-cycle self-healing) R4 Hisoka Yes
CRF-42 P3 Author fixed (eb613a4) resolve.go:443 readSkillMeta oversize path omits ContentHash; breaks change-detection contract readFileResource establishes R4 Razor Yes
CRF-43 P3 Open manager.go:239 Manager.Started() exported with zero production callers R5 Netero Yes
CRF-44 P2 Open manager.go:407 Resync/resolveAndBroadcast publish empty snapshot when ctx canceled during ResolveContext R5 Mafuuu Yes
CRF-45 Nit Open resolve.go:388 readMCPConfig comment claims "Unix newline conversion" that does not happen R5 Mafuuu Yes
CRF-46 Nit Open manager.go:43 AllowedRoots doc says "Empty disables validation" but empty triggers home-dir-only R5 Mafuuu Yes

Contested and acknowledged

CRF-6 (Note, agent.go:2556) - startAgentAPI210 is a full copy of startAgentAPI

  • Finding: startAgentAPI210 duplicates startAgentAPI, differing only in callback type parameter.
  • Author defense: Go cannot generalize over the typed callback. Structural fix planned if duplication grows past two versions.
  • Author accepted (R2): Defense is valid. Go lacks generic method dispatch for interface-typed callbacks.

Law analysis

Effective LOC: 4194. Head SHA: 4b3eda5. Verdict: Don't split. Enforcement: N/A.

Round log

Round 1

Netero-only. 1 P2, 2 P3, 2 Nit, 1 Note. Law: Don't split. Reviewed against eea427f..4b3eda5.

Round 2

Churn guard: PROCEED. 5 addressed, 1 acknowledged. Panel: 19 reviewers. Netero R2: no new findings. 5 P2, 14 P3, 1 P4, 4 Nit, 2 Note new. 3 dropped, 1 dropped. Reviewed against eea427f..8389a1e.

Round 3

Churn guard: PROCEED. 29 addressed. Re-review panel (8): Netero, Bisky, Mafuuu, Meruem, Hisoka, Pariston, Mafu-san, Knov. Netero: no findings. Panel: 4 P3 new. 0 P2. Reviewed against eea427f..9da0cfe.

Round 4

Churn guard: PROCEED. 4 addressed (ef1e235). Re-review panel (5): Netero, Mafuuu, Meruem, Hisoka, Razor. Netero: no findings. Panel: 3 P3 new. 0 P2. Reviewed against eea427f..ef1e235.

Round 5

Churn guard: PROCEED. 3 addressed (eb613a4). Re-review panel (4): Netero, Mafuuu, Hisoka, Pariston. 1 P2, 1 P3, 2 Nit new. Reviewed against eea427f..eb613a4.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Law decomposition
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass mechanical review (Netero). These are automated findings from the mechanical first-pass reviewer. The full review panel has not yet reviewed this PR. The panel will review after these findings are addressed.

The feature design is clean: a linear dependency chain from proto through types/resolver/watcher/manager/push to agent wiring, with a well-contained coderd stub. Law evaluated the PR for splitting and concluded it should stay unified (single risk domain, no independent vertical slices, generated protobuf accounts for 762 of 4956 raw lines).

Findings: 1 P2, 2 P3, 2 Nit, 1 Note.

As Netero once said: "The resolve loop is a tidy chain. The quartz clock, however, has a gap in its coverage."

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agentcontext/push.go Outdated
Comment thread agent/agentcontext/resolve.go Outdated
Comment thread agent/agentcontext/types.go Outdated
Comment thread agent/agentcontext/watch_test.go Outdated
Comment thread agent/agentcontext_seed.go Outdated
Comment thread agent/agent.go Outdated

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full panel review (19 reviewers). R1 findings all addressed; the quartz clock fix, dead-data cleanup, and time.Sleep elimination are well done.

The overall design is sound: a clean linear dependency chain, well-scoped boundaries, and good test density (34.7%). Takumi found no concurrency issues after tracing every lock, channel, and shutdown path. Pariston evaluated four alternative framings and concluded the author's premise holds. Kite confirmed the risk is uniform and low. Melody walked every serialization pairing and found them correct.

Four areas need attention before merge:

  1. AllowedRoots is frozen at init before the manifest loads, locking devcontainer workspaces out of runtime AddSource. Five reviewers converged on this independently.
  2. The walkDir error handler hardcodes KindInstructionFile for all recognized files, including .mcp.json and SKILL.md. This creates phantom resource ID changes when permissions are fixed.
  3. Several doc comments make false claims about the encoding format, reference nonexistent fields, or describe functionality that doesn't match the implementation.
  4. resolveLocked holds the mutex across filesystem I/O. Three reviewers flagged this; the impact scales with filesystem latency.

Findings: 5 P2, 14 P3, 1 P4, 4 Nit, 2 Note.

Ryosuke: "Both P2s stem from the same design choice: the Manager computes everything eagerly under the mutex and at construction time."

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agentcontext_seed.go Outdated
Comment thread agent/agentcontext/resolve.go
Comment thread agent/agentcontext/types.go Outdated
Comment thread agent/agentcontext/manager.go
Comment thread agent/agentcontext/api.go
Comment thread agent/agentcontext/manager.go Outdated
Comment thread agent/agentcontext/push.go Outdated
Comment thread agent/agentcontext/manager.go Outdated
Comment thread agent/agentcontext/types.go Outdated
Comment thread agent/agentcontext/resolve.go Outdated

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (8 reviewers). All 29 R2 findings addressed in 9da0cfe. The CRF-10 mutex refactor, CRF-7 AllowedRoots fix, CRF-8 kindFromFilename extraction, and CRF-15 ctx plumbing are all verified correct. The fix commit is clean: 396 insertions, 122 deletions across 13 files, no scope drift, every CRF number traceable to a specific code change.

Netero found no mechanical issues. Takumi-equivalent concurrency traces by Meruem and Hisoka found the same structural pattern (I/O under mutex) in two sibling call sites that the CRF-10 fix did not address. Hisoka also found that watcher degraded state is sticky after ENOSPC recovery. These are bounded-impact issues on infrequent code paths.

New findings: 4 P3, 0 P2. No ship-blockers. The prior REQUEST_CHANGES is dismissed.

Mafu-san: "The 9da0cfe commit addresses all 29 R2 findings. Every finding maps to a specific code change, and the commit message itemizes each one by CRF number. No findings were silently dropped or substituted."

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agentcontext/manager.go Outdated
Comment thread agent/agentcontext/watcher.go
Comment thread agent/agentcontext/watcher.go
Comment thread agent/agentcontext/resolve.go

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verification round (5 reviewers). All 4 R3 findings verified fixed. The Resync mutex refactor (CRF-36), Watcher.Sync lock split (CRF-37), degraded-state clearing (CRF-38), and kebab-case test (CRF-39) are all confirmed correct by every reviewer.

Three new P3 findings surfaced, all edge cases with bounded impact. The mutex refactors introduced a theoretical version-ordering race (self-healing in one resolve cycle) and exposed an unreachable deadlock path in Run/Close sequencing. A behavioral inconsistency in the oversize skill hash path was also identified.

No P2 or above. The PR is in good shape.

New findings: 3 P3.

Hisoka, on the version race: "Self-healing exists: the watcher fires another trigger, the next resolve produces version N+2 with F2. Window is one resolve cycle (sub-second). But nothing in the protocol tells coderd that version N+1 was a regression."

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agentcontext/manager.go
Comment thread agent/agentcontext/manager.go
Comment thread agent/agentcontext/resolve.go Outdated

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final verification round (4 reviewers). All 3 R4 fixes verified correct. The epoch counter (CRF-41), defer placement (CRF-40), and skill hash (CRF-42) are all confirmed by every reviewer. Hisoka probed seven additional edge cases and found nothing. Pariston re-evaluated the premise from four framings and confirmed the solution is proportional.

One new P2: Resync and resolveAndBroadcast can publish an empty snapshot when the context is canceled mid-walk. The consequence is deferred (coderd returns Unimplemented today), but the pattern is a data-loss path once the chatd follow-up accepts pushes. One P3 (exported test-only method) and two Nits (comment inaccuracies).

New findings: 1 P2, 1 P3, 2 Nit.

Cumulative: 46 findings across 5 rounds. 42 fixed, 1 accepted, 3 dropped, 4 open.

Hisoka: "Boring code gets silence. This earned it."

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agentcontext/manager.go
Comment thread agent/agentcontext/manager.go Outdated
Comment thread agent/agentcontext/resolve.go Outdated
Comment thread agent/agentcontext/manager.go Outdated

Copy link
Copy Markdown
Member Author

/coder-agents-review

@kylecarbs kylecarbs requested a review from ibetitsmike June 3, 2026 14:26

@ibetitsmike ibetitsmike left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid decomposition of the resolver, watcher, manager, and DRPC adapter, and the cancellation and epoch regression coverage is useful. This review found 2 P1 and 5 P2 findings across 7 inline comments. This review contains findings that may need attention before merge.

Mux working on behalf of Mike.

return res
}
data, err := os.ReadFile(path)
if err != nil {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Reject or re-validate symlinked context files before reading payloads (Security Reviewer P1, Edge Case Analyst P1)

every workspace gets the manifest working directory as a scan root in agent/agentcontext/manager.go:497-500; during recursive walks, any non-directory entry whose basename matches AGENTS.md, .mcp.json, or SKILL.md is classified at agent/agentcontext/resolve.go:315-320 and agent/agentcontext/resolve.go:331-355; the reader then calls os.ReadFile(path) at agent/agentcontext/resolve.go:419-425 without rejecting fs.ModeSymlink, resolving the symlink target, or checking that the opened target remains under an allowed root. A malicious repo can include a symlink such as nested/AGENTS.md -> ~/.ssh/id_rsa; the agent will treat it as an instruction file and send the target bytes in the context snapshot.

Edge Case Analyst adds: So a workspace containing AGENTS.md as a symlink to ~/.ssh/id_rsa, /etc/passwd, or a very large file is classified as an in-root instruction file, then the target bytes are loaded and pushed even though the target is outside the scan root and may be larger than DefaultMaxResourceBytes.

This is worth fixing before merge because default working-directory scans are automatic. Skip symlink resources, or resolve each file target and re-validate it against the scan root before opening it with a capped reader.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. The resolver now follows symlinks (codex follows them unconditionally and developers legitimately symlink AGENTS.md into shared monorepo docs), but resolveReadTarget validates the resolved target is inside the contributing scan root before opening with the capped reader.

  • Inside the scan root → follow normally.
  • Escapes the scan root → StatusInvalid with "symlink target %q escapes scan root", no payload.
  • Broken or non-regular target → StatusUnreadable.

emitSkillsFromContainer switched from os.Stat to os.Lstat so symlinked SKILL.md files take the same path. Tests: TestResolver_SymlinkInsideScanRootAllowed, TestResolver_SymlinkOutsideScanRootRejected, TestResolver_BrokenSymlink.

🤖 Reply by Coder Agents.

Comment thread agent/agentcontext/resolve.go Outdated
// resolver only enforces the per-resource size cap. Future
// work: detect malformed JSON and surface StatusInvalid.
func (r *Resolver) readMCPConfig(path string, info fs.FileInfo, userSource string) Resource {
return r.readFileResource(KindMCPConfig, path, info, userSource)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Do not push raw .mcp.json payloads without redacting secret-bearing fields (Security Reviewer P1)

readMCPConfig delegates directly to the generic file reader at agent/agentcontext/resolve.go:387-392, readFileResource reads the full file bytes and stores them in res.Payload at agent/agentcontext/resolve.go:419-425, and the DRPC adapter copies that payload onto the wire at agent/agentcontext/drpc.go:63-69. The on-disk MCP schema includes secret-bearing fields, with Env map[string]string and Headers map[string]string in agent/x/agentmcp/config.go:12-20, and the user docs show token and Authorization header examples in .mcp.json at docs/ai-coder/agents/extending-agents.md:124-135.

Even with the current coderd stub, the request payload is still transmitted before coderd returns Unimplemented. Unless a downstream consumer has a demonstrated need for raw config bytes, send path/hash metadata only, or redact known secret-bearing fields before building the snapshot.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. readMCPConfig is no longer a thin wrapper around readFileResource: it reads the bytes only to compute SizeBytes and ContentHash, then drops the payload. Resources of kind KindMCPConfig carry path + hash + size on the wire (change detection still works) but never the bytes, so embedded Env and Headers values stay on disk. Live tool data continues to flow through the MCPProvider as KindMCPServer resources, which is what consumers actually want.

Test TestResolver_MCPConfigEmitted was extended to embed a GITHUB_TOKEN secret and assert Payload is empty while ContentHash is populated.

🤖 Reply by Coder Agents.

// loop translates this into a clean shutdown rather than a
// retry storm.
func (*API) PushContextState(_ context.Context, _ *agentproto.PushContextStateRequest) (*agentproto.PushContextStateResponse, error) {
return nil, drpcerr.WithCode(errPushContextStateUnimplemented, drpcerr.Unimplemented)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The published v2.10 contract is ahead of the implemented server and agent behavior (Go Architect P1, Product Reviewer P2)

agent/proto/version.go:73-81 declares API v2.10 as adding PushContextState and sets CurrentMinor = 10.

coderd/workspaceagentsrpc.go:46-61 accepts RPC connections whose version validates against proto.CurrentVersion, so a caller requesting 2.10 now passes version negotiation.

coderd/agentapi/context.go:11-21 says Coderd does not yet persist context snapshots; the chatd integration that consumes pushes lives in a follow-up change and returns drpcerr.Unimplemented from PushContextState.

Go Architect adds: agent/proto/agent.proto:541-556 defines ContextResource as covering live MCP server tool list resources and includes the MCP_SERVER kind.

This is a P2 rather than P1 because the current agent handles Unimplemented without retrying forever, but shipping CurrentMinor = 10 makes a nominally supported API whose only new method always fails and whose promised MCP server resources are not wired in production. Could we either keep this behind an internal rollout flag and leave the public version at v2.9, or include the coderd persistence plus MCP provider adapter in the same slice that advertises v2.10?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 496f157. Reverted tailnet/proto.CurrentMinor to 9 with a comment explaining the bump is deferred to the change that wires coderd persistence + MCP provider. The agent now connects via ConnectRPC29WithRole; apiConnRoutineManager.aAPI is DRPCAgentClient29 again; the startAgentAPI210("push context state", ...) block in run() is removed (with a TODO at the same location for the follow-up). DRPCAgentClient210, agentcontext.NewDRPCPusher, and the rest of the push plumbing remain compiled and tested. TestAgent_ContextStatePushed is skipped with a pointer back to the version bump.

🤖 Reply by Coder Agents.

Comment thread agent/agent.go
Clock: a.clock,
WorkingDir: workingDirFn,
InitialSources: initialContextSources(a.contextConfig, workingDirFn),
})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Resolve boot-time context env sources after the manifest working directory is known (Contract Auditor P2, Structural Analyst P2)

However initialContextSources resolves the config once during init, when workingDirFn can still return an empty string because the manifest has not been stored yet:

The resolver used here explicitly drops relative paths when the base directory is empty, while the existing context-config contract resolves these same fields against the workspace directory lazily:

The fallback working-directory scan does not fully preserve that contract because the new resolver only recognizes its hard-coded filenames, for example .mcp.json and SKILL.md, rather than the configured file names:

A template that uses relative CODER_AGENT_EXP_* paths or custom CODER_AGENT_EXP_INSTRUCTIONS_FILE or CODER_AGENT_EXP_SKILL_META_FILE values can lose that configuration for the lifetime of the agent. Re-seed after manifest load, or store the raw config in the manager and resolve it lazily with the working directory and configured file names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. Added Manager.SeedSources([]Source), a trusted late-binding equivalent of InitialSources that canonicalizes + deduplicates + bypasses AllowedRoots (the paths come from template env vars, not HTTP). The agent's handleManifest calls it right before Trigger() with initialContextSources(a.contextConfig, func() string { return manifest.Directory }), so any relative CODER_AGENT_EXP_*_DIRS that was dropped at boot (when workingDirFn() returned "") now resolves against the actual workspace directory and gets added before the snapshot is rebroadcast.

The narrower sub-finding about custom CODER_AGENT_EXP_INSTRUCTIONS_FILE / SKILL_META_FILE names not being honored is intentionally deferred to a follow-up: it requires plumbing the agentcontextconfig.Config into the resolver's filename classification, which touches more surface than this PR should grow. Filed as TODO in the handleManifest comment block.

Test TestManager_SeedSourcesLateBindsAfterManifest covers the late-binding path with AllowedRoots deliberately excluding the seeded path.

🤖 Reply by Coder Agents.

Comment thread agent/agentcontext/watcher.go Outdated
return
}
if w.watcher == nil {
// Degraded mode: nothing to wire up; fire the callback

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Degraded watcher mode re-arms itself into a full-resolve loop (Edge Case Analyst P2, Concurrency Reviewer P2)

NewWatcher returns a degraded watcher with no fsnotify instance when initialization fails:

Sync turns every degraded sync into a debounced callback:

That callback is m.signal, and the Run loop handles the signal by calling watcher.Sync again before resolving, which schedules the next signal and creates an endless 250ms scan loop on hosts where inotify cannot initialize:

When fsnotify startup fails, the fallback should surface degraded state without continuously walking and pushing the workspace while idle. Make degraded Sync a no-op for manager-triggered syncs, or switch to an explicit low-frequency poller that cannot re-arm itself through OnChange.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. Degraded Sync is now a true no-op: it returns immediately without calling schedule(), breaking the Sync → m.signal → Run loop → Sync cycle. Manual Resync, AddSource, and RemoveSource still drive re-resolves; the only thing lost in degraded mode is auto-update on file edits, which is the expected degradation when fsnotify cannot initialize.

🤖 Reply by Coder Agents.

Comment thread agent/agentcontext/manager.go Outdated
published := m.snapshot
m.mu.Unlock()
if watcher != nil {
watcher.Sync(ctx, roots)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Do not call Watcher.Sync with stale roots after the epoch guard rejects a Resync (Concurrency Reviewer P2)

Resync captures roots := m.scanRootsLocked() before dropping the mutex and starting its filesystem walk: agent/agentcontext/manager.go:395-406. If another resolve starts while that walk is running, the stale pass detects m.resolveEpoch != myEpoch, skips publishing, but still calls watcher.Sync(ctx, roots) with its older roots at agent/agentcontext/manager.go:432-443.

The concrete interleaving is: Resync A captures roots without a newly added source, AddSource signals Run B, B captures the new roots and publishes them, then A reaches the stale branch and Syncs the old roots, removing the new source watches even though the snapshot and source list are current. Later edits under that source are not observed until another explicit trigger re-arms the watcher.

The epoch guard already knows this pass is stale, so it should not perform watcher side effects with stale inputs. Return the published snapshot without syncing, or re-read the current roots before touching the watcher.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. The stale-epoch branch in Resync no longer calls watcher.Sync(ctx, roots) before returning. The winning pass already armed the watcher with the current root set; replaying the stale roots from a losing pass would drop watches on sources that only the newer pass knows about. The branch returns the currently published snapshot and no longer mutates the watcher.

🤖 Reply by Coder Agents.

Comment thread agent/agentcontext/resolve.go Outdated
mcp := r.MCP.MCPResources()
resources = append(resources, mcp...)
// MCP resources may push the aggregate over the cap.
// Re-apply count and size limits to MCP entries only.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Apply byte caps to MCP resources before adding them to the snapshot (Contract Auditor P2, Structural Analyst P2)

DefaultMaxSnapshotBytes is documented as the aggregate payload cap and says resources past it are emitted as StatusExcluded:

ResolveContext then acknowledges that MCP resources are appended after filesystem caps and says it will re-apply size limits:

But applyMCPCaps only checks the resource count and returns immediately when the count is within MaxResources, so a provider can return one large KindMCPServer payload and the snapshot will exceed MaxSnapshotBytes with StatusOK:

This is latent until the production MCP provider is wired, but the contract and comments introduced here already say the cap applies. Route MCP resources through the same per-resource and remaining aggregate budget checks before they can affect PayloadBytes or the wire payload.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 496f157. applyCaps now returns the running aggregate byte total so applyMCPCaps can pick up where it left off. applyMCPCaps walks the appended MCP tail and stamps StatusExcluded (with a payload-clear and a SnapshotError) on any resource that would push the running sum past MaxSnapshotBytes, in addition to the existing count cap. A provider returning a single oversized KindMCPServer payload now ends up excluded instead of breaking the aggregate cap contract.

Test TestResolver_MCPProviderRespectsAggregateByteCap uses a 1 KiB MCP payload with MaxSnapshotBytes: 512 and asserts the resource becomes StatusExcluded with empty Payload and a populated SnapshotError.

🤖 Reply by Coder Agents.

Comment thread tailnet/proto/version.go
const (
CurrentMajor = 2
CurrentMinor = 9
CurrentMinor = 10

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we can bump this in the context of just this PR

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and reverted. CurrentMinor is back to 9 in 496f157 with a docstring explaining the v2.10 bump is deferred to the change that adds coderd persistence and the MCP provider adapter. See the CRF-52 reply on the parallel thread for the full agent-side wiring change.

🤖 Reply by Coder Agents.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the historical precedent (notably #21293 which bumped to v2.7 for ReportBoundaryLogs with an explicit stub server, on the rationale of landing the feature across multiple PRs), I think the right call is actually to ship the bump with this PR after all and keep coderd's handler as a stub returning Unimplemented.

Reasons:

  1. Every prior minor bump on this API (v2.5 through v2.9) has shipped the version constant in the same PR that adds the proto change. Splitting them is a new pattern.
  2. The previous state (proto wire scaffolded, CurrentMinor held back) shipped ~500 lines of dead code into production binaries that no caller could reach because version negotiation hid them.
  3. DRPCPusher translates drpcerr.Unimplemented into ErrPushUnimplemented, so the agent push loop exits cleanly against older coderd deployments. Both directions remain interoperable.

a39871fae8 re-applies the bump:

  • tailnet/proto.CurrentMinor = 10
  • run() dials ConnectRPC210WithRole
  • apiConnRoutineManager.aAPI is DRPCAgentClient210; startAgentAPI210 helper restored
  • The push context state goroutine is live
  • TestAgent_ContextStatePushed is un-skipped and passes
  • coderd/workspaceagents_test.go postStartup uses ConnectRPC210 again

CODAGT-569 is now scoped to just the real coderd handler (persist to workspace_agent_context, fan dirty events to chatd, add the KindMCPServer provider) and the coder exp chat context CLI.

Happy to revert again if you'd rather follow a different convention.

🤖 Reply by Coder Agents.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against this approach 👍

kylecarbs added a commit that referenced this pull request Jun 4, 2026
…cher, caps)

Six concrete fixes from @ibetitsmike's review on PR #25983:

P1 symlink arbitrary read: the resolver previously called
os.ReadFile on any AGENTS.md/.mcp.json/SKILL.md it walked into,
including symlinks, so a workspace with
`nested/AGENTS.md -> ~/.ssh/id_rsa` would have shipped the
target. Symlinks are still followed (codex follows them too,
and they are the standard way to compose monorepo agent docs),
but the resolved target must now stay inside the contributing
scan root or the resource is emitted as StatusInvalid. Broken
symlinks and non-regular targets are emitted as StatusUnreadable.

P1 .mcp.json secret exfiltration: readMCPConfig used to ship
the raw file bytes, which embed Env tokens and Authorization
headers. The resolver now produces metadata only (path,
SizeBytes, ContentHash). Change detection still works; the
live tool list flows through MCPProvider as a KindMCPServer
resource, which is what consumers actually need.

P2 v2.10 contract ahead of behavior: revert
tailnet/proto.CurrentMinor to 9, switch the agent's connect
site back to ConnectRPC29WithRole, and disable the
PushContextState push goroutine. The DRPCAgentClient210
interface, the agentcontext.Manager, and the push code paths
remain compiled and tested so the v2.10 bump in the follow-up
that adds coderd persistence is a small, focused change. A
TestAgent_ContextStatePushed skip and a comment at the connect
site call out the re-enable steps.

P2 boot-time env sources lost: relative CODER_AGENT_EXP_*_DIRS
paths used to be dropped during agent.init when the manifest's
working directory was still unknown. Add Manager.SeedSources,
a trusted late-binding equivalent of InitialSources that
bypasses AllowedRoots, and call it from handleManifest with
the now-resolved working directory before Trigger.

P2 degraded watcher re-arm loop: when fsnotify init fails, Sync
used to schedule its OnChange callback, the Manager's signal
handler turned it into another Sync, and the cycle repeated
every 250ms forever. Degraded Sync is now a true no-op; manual
Resync, AddSource, and RemoveSource still drive re-resolves.

P2 stale-epoch Resync mutating the watcher: a Resync pass that
loses the epoch race already skipped publishing its stale
snapshot, but still called watcher.Sync with stale roots. That
removed watches on sources only the winning pass knew about.
The stale branch now returns the published snapshot without
touching the watcher.

P2 MCP byte cap not enforced: applyMCPCaps previously only
checked the resource-count cap, so a provider returning one
big KindMCPServer payload could exceed MaxSnapshotBytes with
StatusOK. The cap helper now picks up where applyCaps left off
and stamps StatusExcluded plus a snapshot error on MCP entries
that breach the count or byte budget.

Adds regression tests for symlink-inside, symlink-outside,
broken symlinks, MCP secret omission, MCP byte cap, and the
manifest-late SeedSources path.
@kylecarbs kylecarbs requested a review from ibetitsmike June 4, 2026 21:24
kylecarbs added 11 commits June 5, 2026 16:02
Add the ContextResource, PushContextStateRequest, and
PushContextStateResponse messages plus a new PushContextState
RPC on the Agent API. Bumps the tailnet+agent shared
CurrentVersion to 2.10 and adds the DRPCAgentClient210
interface, plus ConnectRPC210/ConnectRPC210WithRole on
codersdk/agentsdk's Client. Reserved kinds (PLUGIN, HOOK,
SUBAGENT, COMMAND) ship in the enum but are not emitted by
v1 agents.

The wire shape matches the workspace context sources RFC.
Coderd's server-side handler arrives separately.
Add an Unimplemented stub for PushContextState so the binary
satisfies the v2.10 DRPCAgentServer interface. The real
handler that persists snapshots and fans dirty events out to
chats lands with the chatd integration.

Until then, agents receive Unimplemented and stop pushing for
the connection (agentcontext.RunPush handles this gracefully).
DRPCPusher adapts the generated DRPCAgentClient210 to the
agentcontext.Pusher interface so RunPush can ship snapshots
without coupling the package to a specific proto version.
drpcerr.Unimplemented is translated to ErrPushUnimplemented so
the agent stops cleanly when talking to an older coderd.

Resync now re-arms the recursive watcher in addition to
re-resolving. This is required so a Resync triggered by a
late-arriving working directory (manifest fetch) starts
watching the workspace tree.
Boot the agentcontext.Manager during agent init, seed its
sources from the existing CODER_AGENT_EXP_*_DIRS env vars, run
its watcher/resolver loop in the graceful context, and tear it
down in agent.Close.

Mount the source CRUD and resync HTTP API at /api/v0/context.

Connect to coderd via the new ConnectRPC210WithRole and push
each snapshot via DRPCPusher. The push goroutine is registered
through a new startAgentAPI210 helper that exposes the v2.10
client. When the manifest lands, trigger a Resync so the
snapshot reflects the workspace directory immediately instead
of waiting for the next filesystem event.

Add agenttest.Client.ContextStatePushes and a FakeAgentAPI
PushContextState implementation so tests can assert on the
push traffic, plus an end-to-end test that boots a real agent
against the fake and verifies AGENTS.md appears in a snapshot.
- CRF-1 (P2): thread quartz.Clock through PushOptions so pushWithRetry
  uses clock.NewTimer, making the retry test deterministic via a
  quartz trap instead of real sleeps.
- CRF-2 (P3): remove dead skillsParentNames map; simplify
  isSkillsContainer to a base-name check that already covered every
  reachable case.
- CRF-3 (P3): remove unused Snapshot.AggregateHashHex (api.go inlines
  hex encoding for HTTP responses).
- CRF-4 (Nit): replace time.Sleep timing waits in watch_test.go and
  manager_test.go with Eventually-driven writes and a new
  Manager.Started signal channel.
- CRF-5 (Nit): drop unused parameters from defaultContextRoots.
P2:
- CRF-7: effectiveAllowedRoots now appends the current working
  directory lazily so the manager picks up the resolved path
  after the manifest loads.
- CRF-8: walkDir error path computes Kind from the basename
  (kindFromFilename) so .mcp.json and SKILL.md keep a stable
  resource ID across permission flips.
- CRF-9: ComputeAggregateHash / writeLengthPrefixed docs now
  describe the Netstring-style encoding accurately.
- CRF-10: resolveAndBroadcast snapshots inputs under the lock,
  releases it for the filesystem walk, then re-acquires to swap.
- CRF-11: api.go SnapshotResource comment no longer references
  a nonexistent PayloadBase64 field.

P3:
- CRF-12: WatcherOptions.MaxDepth; the watcher now mirrors the
  resolver's depth instead of hardcoding DefaultMaxScanDepth.
- CRF-13: drop obsolete "follow-up" sentence in doc.go.
- CRF-14: Pusher doc says Agent API v2.10, not proto v30.
- CRF-15: Resolver.ResolveContext threads ctx through walk so
  resolveAndBroadcast can cancel between roots.
- CRF-16: new tests cover StatusUnreadable for instruction and
  MCP-config files (skipped on Windows and root).
- CRF-17: TestRunPush_ClosesOnManagerClose covers closedCh path.
- CRF-18: drop dead !ok branch on the subscriber channel.
- CRF-19: extract readFileResource; readInstructionFile and
  readMCPConfig share the plumbing.
- CRF-24: DefaultPushInitialBackoff / DefaultPushMaxBackoff.
- CRF-25: subsequent-push test asserts AggregateHash and Version
  advance, not just the count.

P4:
- CRF-26: TestRunPush_RejectedResponseProceeds covers the
  Accepted=false fast-path.

Nit:
- CRF-27: slices.SortFunc/SortStableFunc replace sort.Slice.
- CRF-28: firstLine uses strings.SplitSeq.
- CRF-29: sync.OnceFunc replaces the sync.Once+closure wrapper.
- CRF-30: atomic.Int32 replaces bare int32 in watch_test.
- CRF-31: redundant sort in resolveLocked already gone with CRF-10.
- CRF-32: drop redundant "agentcontext:" prefix from log messages.
- CRF-33: resolveAndBroadcast doc no longer claims conditional.

Note:
- CRF-34: ResourceStatus doc explains the proto-enum offset.
- CRF-35: readInstructionFile doc flags the sanitization gap
  delegated to the chatd follow-up.
CRF-36: Resync now mirrors the resolveAndBroadcast pattern.
  Filesystem walks happen outside m.mu, so concurrent Sources,
  AddSource, RemoveSource, and Snapshot calls no longer block on
  resolver I/O. resolveLocked is only invoked from NewManager's
  initial resolve, matching the comment.

CRF-37: Watcher.Sync drops w.mu around collectDirs (the
  recursive directory walk) and re-acquires it only for the
  diff-and-apply phase. Close, schedule, and the run goroutine
  no longer block on a slow filesystem.

CRF-38: Watcher.degraded is cleared at the end of a Sync pass
  whose Adds all succeeded. A user who bumps the inotify watch
  limit and re-syncs now sees a clean snapshot instead of a
  permanent SnapshotError.

CRF-39: new TestResolver_SkillNameNonKebabInvalid covers the
  kebab-case validation branch in readSkillMeta (name matches
  parent directory but fails SkillNamePattern).
kylecarbs added 7 commits June 5, 2026 16:02
The handleManifest Resync call was synchronous; on slow filesystems
(CI macOS and Windows runners walking a real $HOME) the resolver
walk could outlive the agent's graceful shutdown window, leaking a
goroutine inside filepath.WalkDir and tripping goleak in the agent
test suite.

Two changes break the leak:

- New Manager.Trigger() queues an asynchronous re-resolve on the
  trigger channel. handleManifest now calls Trigger instead of
  Resync, so manifest handling returns immediately and the
  background walk is owned by the Manager.Run goroutine that
  already respects gracefulCtx cancellation.
- Resolver.walkDir threads ctx into the filepath.WalkDir callback,
  so when ctx cancels the walk aborts after the current directory
  read instead of running to completion.
CRF-40: ensure runDoneCh is closed even if NewWatcher fails so Close
does not deadlock waiting on a goroutine that already exited.

CRF-41: gate snapshot publishes with a monotonic resolveEpoch counter.
Resync and resolveAndBroadcast drop m.mu around the filesystem walk;
under concurrency a stale walk could overwrite a fresher one at a
higher version number. Each pass now captures its epoch under the
lock and skips the publish if a newer pass has started. Resync
returns the currently published Snapshot in the stale case, which is
guaranteed to be at least as fresh as the discarded result.

CRF-42: hash the (capped) prefix of oversize SKILL.md files so an
edit that keeps the file oversize still shifts the aggregate hash,
matching readFileResource and preserving the change-detection
contract.
CRF-44: bail without publishing when ctx is canceled mid-walk in
both Resync and resolveAndBroadcast. ResolveContext returns an
empty Snapshot when its ctx is canceled; publishing it would
replace the live Snapshot with empty resources until the next
trigger. Resync now returns the existing Snapshot and ctx.Err()
instead. Added regression test
TestManager_ResyncCanceledKeepsLiveSnapshot.

CRF-43: unexport Manager.Started to started; the method has no
production callers. Tests reach it through ManagerStarted in
export_test.go.

CRF-45: drop the inaccurate Unix newline conversion clause from
readMCPConfig's doc comment.

CRF-46: AllowedRoots doc clarified that an empty slice falls back
to the home directory rather than disabling validation.
agent.NewManager never returns an error; it had a single
return m, nil path. Drop the error from the signature so the
agent.go call site does not pretend to handle a failure that
cannot happen. Remove the five dead 'if a.contextManager != nil'
guards that followed.

Move the hardcoded default scan-root and allow-root lists out
of agent/agentcontext_seed.go and into the agentcontext package
itself as DefaultBuiltinRoots() and DefaultAllowedRoots(). The
package owns the policy now; the agent passes them in
explicitly, matching the pattern other defaults (Clock,
Debounce, SchemaVersion) follow.

agentcontext_seed.go is gone. The one helper that legitimately
lives in package agent, initialContextSources (which bridges
the legacy CODER_AGENT_EXP_*_DIRS env vars), moves into
agent.go next to its sole caller.
…watch to watcher

Move types.go contents next to their producers and consumers:

- Source moves into manager.go (Manager input).
- Resource, ResourceKind, ResourceStatus, their String() methods,
  Snapshot, ComputeAggregateHash, and writeLengthPrefixed move
  into resolve.go (the resolver emits and hashes these).
- types_test.go is folded into resolve_test.go.

Rename watch.go and watch_test.go to watcher.go and watcher_test.go;
the file describes the Watcher type.

Inline switchHomeEnv into paths_test.go (its only caller) and
delete helpers_test.go.

export_test.go stays. It is the Go-conventional bridge for
exposing the unexported started() channel to external test
packages and is recognized by the toolchain for this purpose.
DefaultBuiltinRoots and DefaultAllowedRoots were one-shot exports
whose only caller was the agent passing them straight back to the
package as ManagerOptions. The round-trip is dead API surface; the
package owns its scan and allow-list policy and should not require
callers to ship it on every construction.

- Drop the BuiltinRoots field from ManagerOptions. The package
  always layers ~/.coder, ~/.coder/skills, and
  ~/.claude/plugins/cache via the unexported defaultBuiltinRoots,
  consulted on every scanRootsLocked call.
- Keep AllowedRoots as an option because the package's tests use
  it as a control surface for the validation logic. An empty
  AllowedRoots now falls back to the unexported defaultAllowedRoots
  ([~, ~/.coder, ~/.claude]) instead of just [~].
- Agent no longer passes either field; it constructs the Manager
  with just Logger/Clock/WorkingDir/InitialSources.
- Add a package TestMain that sets HOME and USERPROFILE to a
  binary-wide temp dir so the now-mandatory built-in walks
  resolve to nothing on developer hosts, keeping
  Len(Resources, N) assertions deterministic.
…cher, caps)

Six concrete fixes from @ibetitsmike's review on PR #25983:

P1 symlink arbitrary read: the resolver previously called
os.ReadFile on any AGENTS.md/.mcp.json/SKILL.md it walked into,
including symlinks, so a workspace with
`nested/AGENTS.md -> ~/.ssh/id_rsa` would have shipped the
target. Symlinks are still followed (codex follows them too,
and they are the standard way to compose monorepo agent docs),
but the resolved target must now stay inside the contributing
scan root or the resource is emitted as StatusInvalid. Broken
symlinks and non-regular targets are emitted as StatusUnreadable.

P1 .mcp.json secret exfiltration: readMCPConfig used to ship
the raw file bytes, which embed Env tokens and Authorization
headers. The resolver now produces metadata only (path,
SizeBytes, ContentHash). Change detection still works; the
live tool list flows through MCPProvider as a KindMCPServer
resource, which is what consumers actually need.

P2 v2.10 contract ahead of behavior: revert
tailnet/proto.CurrentMinor to 9, switch the agent's connect
site back to ConnectRPC29WithRole, and disable the
PushContextState push goroutine. The DRPCAgentClient210
interface, the agentcontext.Manager, and the push code paths
remain compiled and tested so the v2.10 bump in the follow-up
that adds coderd persistence is a small, focused change. A
TestAgent_ContextStatePushed skip and a comment at the connect
site call out the re-enable steps.

P2 boot-time env sources lost: relative CODER_AGENT_EXP_*_DIRS
paths used to be dropped during agent.init when the manifest's
working directory was still unknown. Add Manager.SeedSources,
a trusted late-binding equivalent of InitialSources that
bypasses AllowedRoots, and call it from handleManifest with
the now-resolved working directory before Trigger.

P2 degraded watcher re-arm loop: when fsnotify init fails, Sync
used to schedule its OnChange callback, the Manager's signal
handler turned it into another Sync, and the cycle repeated
every 250ms forever. Degraded Sync is now a true no-op; manual
Resync, AddSource, and RemoveSource still drive re-resolves.

P2 stale-epoch Resync mutating the watcher: a Resync pass that
loses the epoch race already skipped publishing its stale
snapshot, but still called watcher.Sync with stale roots. That
removed watches on sources only the winning pass knew about.
The stale branch now returns the published snapshot without
touching the watcher.

P2 MCP byte cap not enforced: applyMCPCaps previously only
checked the resource-count cap, so a provider returning one
big KindMCPServer payload could exceed MaxSnapshotBytes with
StatusOK. The cap helper now picks up where applyCaps left off
and stamps StatusExcluded plus a snapshot error on MCP entries
that breach the count or byte budget.

Adds regression tests for symlink-inside, symlink-outside,
broken symlinks, MCP secret omission, MCP byte cap, and the
manifest-late SeedSources path.
@kylecarbs kylecarbs force-pushed the feat/agent-context-package branch from 496f157 to 5576dc8 Compare June 5, 2026 16:04
@kylecarbs kylecarbs marked this pull request as ready for review June 5, 2026 18:24
@kylecarbs kylecarbs changed the title feat: workspace context resolver and v2.10 PushContextState RPC feat(agent/agentcontext): add workspace context sources resolver Jun 5, 2026
@kylecarbs kylecarbs changed the title feat(agent/agentcontext): add workspace context sources resolver feat: add agent-side workspace context sources resolver Jun 5, 2026
Comment thread tailnet/proto/version.go Outdated
Comment on lines +72 to +85
//
// API v2.10 (proposed, NOT advertised):
// - PushContextState RPC on the Agent API for pushing
// resolved workspace context snapshots (instruction files,
// skills, MCP configs, MCP server tool lists) from the
// agent to coderd. Adds ContextResource, PushContextState
// Request, and PushContextStateResponse messages.
// - CurrentMinor stays at 9 until coderd persistence and the
// KindMCPServer provider land. The RPC is defined and the
// agent-side plumbing exists so we can iterate behind a
// flag, but advertising v2.10 today would expose an API
// whose only new method returns Unimplemented from every
// deployed server. Bump CurrentMinor in the same change
// that wires coderd PushContextState persistence.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add and link a follow-up issue to track this work.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in CODAGT-569 under the Coder Agents project, assigned to @kylecarbs. The issue captures the agent-side bump (CurrentMinor to 10, switch run() back to ConnectRPC210WithRole, re-enable startAgentAPI210 for push, un-skip TestAgent_ContextStatePushed), the coderd PushContextState handler that persists to workspace_agent_context and emits dirty events, the chatd consumer of those snapshots, and the coder exp chat context CLI.

🤖 Reply by Coder Agents.

kylecarbs added 2 commits June 8, 2026 16:23
Cian asked on PR review (#25983) for a follow-up issue tracking the
deferred v2.10 PushContextState enablement. CODAGT-569 covers the
CurrentMinor bump, the agent push goroutine, the coderd handler, and
the chatd/CLI consumers. Reference it from the comments where the
deferral is documented so the issue is discoverable from the source.

🤖 Filed by Coder Agents.
Match the historical pattern for adding new RPCs (see #21293 for v2.7
ReportBoundaryLogs): ship the proto change with the version bump and a
stub handler in the same PR, then land real persistence in a follow-up.

Changes:
- tailnet/proto.CurrentMinor goes from 9 to 10.
- agent.run() dials with ConnectRPC210WithRole.
- apiConnRoutineManager holds DRPCAgentClient210; existing v2.8 routines
  still bind to DRPCAgentClient28 via interface embedding.
- New startAgentAPI210 helper hands v2.10 routines the upgraded client.
- The push context state routine is live. DRPCPusher translates the
  coderd Unimplemented stub to ErrPushUnimplemented so RunPush exits
  cleanly against older coderd deployments.
- TestAgent_ContextStatePushed is un-skipped and passes against
  agenttest.FakeAgentAPI.
- coderd/workspaceagents_test.go postStartup uses ConnectRPC210 again.

Real persistence (workspace_agent_context rows, chatd hydration,
KindMCPServer provider) tracked by CODAGT-569.

🤖 Reply by Coder Agents.
@kylecarbs kylecarbs changed the title feat: add agent-side workspace context sources resolver feat: add agent-side workspace context sources and Agent API v2.10 PushContextState Jun 8, 2026
Replace the flat `bytes payload` / `Kind enum` / `id` / `description`
fields on ContextResource with a proto3 oneof whose variant tells coderd
the resource kind and carries kind-specific structured fields. Driven by
review feedback that the prior shape:

- shipped arbitrary JSON for mcp_server tools with no schema, so two
  agent versions could silently disagree on the inner JSON shape and
  produce conflicting aggregate_hash values.
- duplicated information already implied by the variant (kind, id).
- forced coderd to switch on Kind and then json.Unmarshal the payload to
  reach the structure.

New proto shape on `ContextResource`:

- source, optional source_path, content_hash, size_bytes, status, error.
- oneof body { InstructionFileBody, SkillMetaBody, MCPConfigBody,
  MCPServerBody }.
- Reserved tags 7-9 and 14-16 to prevent reintroducing the legacy
  id/kind/payload/description slots or accidentally renumbering the
  reserved plugin/hook/subagent/command kinds.

Body messages:

- InstructionFileBody { content }.
- SkillMetaBody { meta, name, description }.
- MCPConfigBody {} (intentionally empty so env-block secrets cannot
  leave the agent; content_hash + size_bytes on the outer resource
  still detect changes).
- MCPServerBody { server_name, description, tools[] } where MCPTool is
  { name, description, input_schema (google.protobuf.Struct) }.

Go side:

- Resource gains Name (skill front-matter, MCP server name) and
  Tools []MCPTool; the flat struct stays the internal representation
  and drpc.go is the only place that maps to the wire.
- MCPProvider interface doc updated to require Tools and Name.
- readSkillMeta populates Resource.Name from front-matter.
- Wire mapping always sets a body variant, even for resources with
  Status != OK, so coderd can attribute failures to a known kind.

The body variant is always set; an unset body indicates an unrecognized
kind. ID is no longer transmitted; coderd derives the per-resource key
from (oneof variant, source). description is dropped from the
ContextResource flat field and lives only on the body types that
naturally carry it (skill, mcp_server).

🤖 Reply by Coder Agents.

Copy link
Copy Markdown
Member Author

Following up on Kyle's call to do this right rather than ship bytes payload. df47cf39aa replaces ContextResource's flat id / Kind enum / bytes payload / description with a proto3 oneof body whose variant doubles as the kind discriminator. Bodies are typed per kind:

  • InstructionFileBody { content }
  • SkillMetaBody { meta, name, description }
  • MCPConfigBody {} (intentionally empty; env-block secrets stay agent-local)
  • MCPServerBody { server_name, description, tools[] } with MCPTool { name, description, input_schema: google.protobuf.Struct }

source, source_path, content_hash, size_bytes, status, and error stay flat on ContextResource. ID is no longer on the wire; coderd derives the dedup key from (body_variant, source). Tags 7-9, 14-16 are reserved to prevent reintroducing the legacy slots.

Wire-mapping in agent/agentcontext/drpc.go always sets a body variant (even for Status != OK) so coderd can attribute failures to a known kind. Tests cover all four variants including an INVALID skill that still ships its body.

Could you take another look at the proto change in df47cf39aa when you have a moment? The diff is intentionally a single commit so the proto change is reviewable on its own.

🤖 Reply by Coder Agents.

@kylecarbs kylecarbs requested a review from ibetitsmike June 8, 2026 18:12

@johnstcn johnstcn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the oneof way better!

@kylecarbs kylecarbs merged commit cd3692c into main Jun 8, 2026
28 checks passed
@kylecarbs kylecarbs deleted the feat/agent-context-package branch June 8, 2026 19:08
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 8, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants