feat: rank chat workspace templates #25037

Merged: jaaydenh merged 24 commits into main from workspace-bdje on Jun 18, 2026

@jaaydenh jaaydenh commented May 7, 2026

closes CODAGT-203

Summary

list_templates now returns a ranked shortlist with a recommendation, so the chat agent can pick the right template the way a colleague would: prefer what matches the request, what the user already uses, and what the rest of the organization uses. Instead of teaching the model an enum protocol in prompts, every result carries a fixed next_step instruction telling the agent what to do.

How list_templates works

  1. Fetch: active, non-deprecated templates in the chat's organization, filtered by the admin template allowlist, authorized as the chat owner (no system escalation).

  2. Query relevance (optional query argument): each template receives the highest tier any of its fields matches, and a higher tier always outranks a lower one regardless of usage:

    | Tier | Match |
    |------|-------|
    | 4 | name or display name equals the query |
    | 3 | name or display name starts with the query |
    | 2 | name or display name contains the query |
    | 1 | description contains the query (checked only when no name field matched) |
    | 0 | no match; the template is excluded |

    Matching is case-insensitive and ignores spaces/hyphens/underscores (python gpu matches python-gpu).

  3. Usage signals: a new GetTemplateRankingSignalsByOwnerID query returns, per template, the owner's active and recently-deleted workspace counts within a 60-day window, the last in-window usage, and the count of distinct developers with an active workspace (unclaimed prebuilds excluded).

  4. Affinity score (computed in Go, per template, from that template's signals only):

    affinity = 10 x (active + 0.5 x deleted) x 0.5^(days_since_last_use / 14)
             + ln(1 + active_developers)
    

    active/deleted are the owner's in-window workspace counts, days_since_last_use is measured from the most recent in-window usage (the personal term is zero without in-window usage), and active_developers is the org-wide count. Personal usage carries 10x the weight of org popularity; the confidence floor is the score of two active developers (ln 3) and the required lead over the runner-up is ln 3 - ln 2.

  5. Rank: query tier first (when a query is present), then affinity score, then name/ID for determinism. Results paginate 10 per page with next_page present only when more exist.
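
The query-relevance tiers from step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `tmpl`, `compact`, and `queryTier` are hypothetical names, and only the tier table and the case-insensitive, separator-insensitive matching come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// tmpl is a hypothetical stand-in for a template's searchable fields.
type tmpl struct {
	Name, DisplayName, Description string
}

// compact lowercases s and drops spaces, hyphens, and underscores so that
// "python gpu" and "python-gpu" compare equal.
func compact(s string) string {
	return strings.NewReplacer(" ", "", "-", "", "_", "").Replace(strings.ToLower(s))
}

// queryTier returns the highest tier (4 down to 0) that any field of t
// reaches. The description is consulted only when no name field matched.
func queryTier(t tmpl, query string) int {
	q := compact(query)
	if q == "" {
		// Assumption: callers skip tiering entirely when no query is given.
		return 0
	}
	best := 0
	for _, field := range []string{t.Name, t.DisplayName} {
		f := compact(field)
		tier := 0
		switch {
		case f == q:
			tier = 4
		case strings.HasPrefix(f, q):
			tier = 3
		case strings.Contains(f, q):
			tier = 2
		}
		if tier > best {
			best = tier
		}
	}
	if best == 0 && strings.Contains(compact(t.Description), q) {
		return 1
	}
	return best
}

func main() {
	t := tmpl{Name: "python-gpu", DisplayName: "Python (GPU)", Description: "CUDA-ready image"}
	// Prints: 4 3 2 1 0
	fmt.Println(queryTier(t, "python gpu"), queryTier(t, "py"),
		queryTier(t, "gpu"), queryTier(t, "cuda"), queryTier(t, "rust"))
}
```

Because a higher tier always outranks a lower one, the tier can be used as the first sort key, with the affinity score only breaking ties inside a tier.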

Recommendation contract

The result tells the agent what to do next instead of describing confidence levels:

  • recommended_template_id is present only when the top template is a clear winner: the only available template, a decisive query match, or an affinity score that clears a floor and leads the runner-up by a derived margin.
  • next_step is always present and is one of four fixed sentences: use the recommendation, ask the user to choose, retry a query that matched nothing, or report that no templates are available.

Per-template items carry raw evidence (active_developers, your_workspace_count, last_used_by_you) rather than derived labels. When signals fail to load, the tool logs and degrades to asking the user unless the query alone is decisive.
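
A minimal sketch of how the contract above could map ranked results to a recommendation and a next_step. The floor (ln 3) and lead (ln 3 - ln 2) come from the affinity description; everything else is an assumption for illustration: the `ranked` type, the `recommend` function, the four placeholder sentences (the PR's exact wording is not quoted here), the reading of "decisive query match" as an exact match the runner-up lacks, and the use of `>=` at the thresholds.

```go
package main

import (
	"fmt"
	"math"
)

// Placeholder wording: the PR fixes four next_step sentences, but their exact
// text is not quoted in the description, so these strings are illustrative.
const (
	stepUse   = "Use recommended_template_id."
	stepAsk   = "Ask the user to choose a template."
	stepRetry = "The query matched nothing; retry with a different query."
	stepNone  = "No templates are available."
)

var (
	confidenceFloor = math.Log(3)               // score of two active developers
	requiredLead    = math.Log(3) - math.Log(2) // required lead over the runner-up
)

// ranked is a hypothetical row of the already-sorted result.
type ranked struct {
	ID       string
	Tier     int     // query tier; 0 when no query was given
	Affinity float64 // 0 when signals failed to load
}

// recommend returns the recommended template ID ("" for none) and next_step.
// available counts templates before query filtering; signalsOK reports
// whether the usage signals loaded.
func recommend(items []ranked, available int, signalsOK bool) (string, string) {
	if available == 0 {
		return "", stepNone
	}
	if len(items) == 0 {
		return "", stepRetry
	}
	top := items[0]
	if available == 1 {
		return top.ID, stepUse // no competition, so signals are irrelevant
	}
	var runnerUp ranked
	if len(items) > 1 {
		runnerUp = items[1]
	}
	// Assumption: "decisive" means an exact match that the runner-up lacks.
	decisive := top.Tier == 4 && runnerUp.Tier < 4
	confident := signalsOK &&
		top.Affinity >= confidenceFloor &&
		top.Affinity-runnerUp.Affinity >= requiredLead
	if decisive || confident {
		return top.ID, stepUse
	}
	return "", stepAsk
}

func main() {
	items := []ranked{{ID: "go-dev", Affinity: 5.0}, {ID: "python-gpu", Affinity: 1.0}}
	id, step := recommend(items, 2, true)
	fmt.Println(id, "|", step)
	id, step = recommend(items, 2, false) // signals failed to load
	fmt.Println(id, "|", step)
}
```

Checking the single-template case before the signal checks mirrors the review feedback that ranking signals are irrelevant when there is no competition.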

Prompts and the create_workspace/read_template descriptions reference the field through the chattool.NextStepField constant, so the instruction lives in one place and cannot drift. create_workspace remains idempotent and allowlist-enforced.

Authorization

The signals query runs with the chat owner's permissions: reading the owner's own workspaces plus a template-metadata read for the cross-user popularity count. dbauthz rejects the call if any requested template is not readable by the owner (covered by allow and deny method tests).

Docs

Adds docs/ai-coder/agents/tools/ explaining how agent tool calls work, with list_templates ranking and the next_step contract as the first documented tools.

jaaydenh commented May 7, 2026

/coder-agents-review
@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@coder-agents-review coder-agents-review Bot left a comment

Clean, well-decomposed PR. The ranking pipeline (fetch, score, enrich, rank, recommend, paginate, format) is easy to follow, and each stage is a pure or near-pure function. The dbauthz.AsSystemRestricted fix for the developer-count query corrects a latent bug where non-admin users always got zero active developer counts due to a silent RBAC failure. Six new tests cover the major ranking scenarios, and the runListTemplates/listTemplateItems helpers keep test noise down.

The main concern is the silent error-swallowing pattern in the two enrichment helpers. The old code had this for developer counts (and nobody noticed the broken auth because the error was hidden). The new code extends it to usage data and adds a confidence signal that triggers auto-selection, which amplifies the consequence of silent failures.

Severity breakdown: 1 P2, 8 P3, 3 Nit.

"One active developer versus zero is enough. The prompt tells the agent to treat recommended_template_id as the default and skip asking. For a first-time user with no personal history, the agent confidently creates a workspace from a template that one person happened to use." (Luffy)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chattool/listtemplates.go
Comment thread coderd/x/chatd/chattool/listtemplates.go
Comment thread coderd/x/chatd/chattool/listtemplates.go
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates_test.go
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/database/queries/workspaces.sql Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
@jaaydenh

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

All 13 R1 findings addressed cleanly. The fixes are structural, not point patches: DEREM-6 became a shared templateRankSignals struct, DEREM-7 became a confidence threshold system with named constants and a 90-day recency window, DEREM-4 got proper error logging and graceful degradation. Test additions are proportional (210 new test lines for 108 net production lines).

The DEREM-4 fix introduced a new ordering issue: the rankingSignalsErr check now preempts the only_available_template path, which means a transient DB error in a single-template org causes the agent to ask the user to choose from a list of one. The fix is a one-line reorder.

The DEREM-5 fix (compact search normalization) was applied to name/displayName but not to description matching, leaving the same class of gap in the weakest match tier.

Severity breakdown: 1 P2, 4 P3.

"The ranking signals are irrelevant when there is no competition; the recommendation is trivially correct." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
@jaaydenh

/coder-agents-review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8ae6390e8

Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated

@coder-agents-review coder-agents-review Bot left a comment

All 18 prior findings addressed across two fix commits. The R2 fixes are solid: quartz clock injection, reordered recommendation checks, description compact search, internal unit tests for the error degradation path, and workspace-count threshold coverage. Netero found zero mechanical issues on this revision. Three reviewers (Mafuuu, Pariston, Meruem) verified the R2 fixes and found no new issues.

Two remaining P3s and a Nit, none blocking. The PR is in good shape.

Severity breakdown: 2 P3, 1 Nit.

"Delete the t.DisplayName entries from all three loops in templateQueryScore and every test in this file still passes. The display name code path is unreachable under the current test fixtures." (Bisky)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chattool/listtemplates_test.go
Comment thread coderd/x/chatd/chattool/listtemplates.go Outdated
Comment thread coderd/x/chatd/chattool/listtemplates_test.go
@jaaydenh jaaydenh changed the title from "feat(coderd): rank chat workspace templates" to "feat: rank chat workspace templates" Jun 1, 2026
jaaydenh commented Jun 1, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉

@jaaydenh jaaydenh marked this pull request as ready for review June 1, 2026 14:13
@jaaydenh jaaydenh requested a review from mafredri June 1, 2026 14:14

@mafredri mafredri left a comment

The ranking/recommendation approach looks solid. One concern about stale guidance in workspaceDetachedAwareness that now contradicts the new template selection rules.

workspaceDetachedAwareness (prompt.go line 12, not in this diff) still says:

If a workspace is needed, use list_templates and read_template as needed before create_workspace.

This presents read_template as a standard step in the workflow. But the new <workspace-template-selection> block and the updated tool descriptions make it conditional (only when you need parameters or presets). The LLM sees both since workspaceDetachedAwareness is injected as a separate message in chatd.go:1617.

I think that line should be updated to match, something like:

If a workspace is needed, use list_templates before create_workspace. Call read_template only when you need parameter or preset details.

🤖 This review was generated with the help of Coder Agents.

Comment thread coderd/x/chatd/prompt.go Outdated
Comment thread coderd/x/chatd/chattool/createworkspace.go Outdated
@jaaydenh jaaydenh requested a review from mafredri June 2, 2026 08:53
Comment thread coderd/database/queries/workspaces.sql Outdated
Replace the lexicographic template-ranking comparator in list_templates
with a frecency score (frequency discounted by recency), per reviewer
feedback.

- Add GetTemplateRankingSignalsByOwnerID, returning the user's recent
  active and recently-deleted workspace counts, last usage, and the count
  of distinct active developers in the org. Recently-deleted workspaces
  now contribute (recovering history the deleted=false filter discarded),
  scoped to a lookback window, and the prebuilds system user is excluded
  from the org popularity count. Replaces
  GetWorkspaceUsageGroupedByTemplateIDByOwnerID.
- Compute the affinity score in Go
  (Wp*(active + Wd*deleted)*0.5^(age/half_life) + Wo*ln(1+org_devs))
  because sqlc cannot reliably compile the parameterized decay
  expression; the query returns the raw signals. Weights, half-life, and
  lookback are explicit constants.
- Recommendation confidence is now a single score comparison: a decisive
  query match recommends on its own, otherwise the top score must clear a
  floor derived from the active-developer minimum and lead the runner-up
  by a derived margin. Stale-but-frequent usage no longer recommends.
- Replace the AsSystemRestricted call for the cross-user org count with a
  narrow dbauthz wrapper checking workspace-owner read plus a
  template-metadata read.
- Clarify list_templates/read_template guidance in the detached prompt.

jaaydenh commented Jun 4, 2026

/coder-agents-review

coder-agents-review Bot commented Jun 4, 2026

Chat: Review in progress | View chat
Requested: 2026-06-11 12:40 UTC by @jaaydenh
Spend: $101.72 / $200.00

deep-review v0.7.1 | Round 7 | 01ec5e4..36a6be6

Last posted: Round 7, 29 findings (2 P2, 21 P3, 6 Nit), APPROVE.

Finding inventory

Findings

| # | Sev | Status | Location | Summary | Round | Reviewer | Posted |
|---|-----|--------|----------|---------|-------|----------|--------|
| DEREM-1 | P3 | Author fixed (34c7fac) | dbauthz_test.go:4767 | dbauthz test placed in wrong suite section (TestSystemFunctions instead of TestWorkspace) | R1 | Netero | Yes |
| DEREM-2 | P3 | Author fixed (34c7fac) | listtemplates.go:261 | Two selectTemplateRecommendation code paths (no_matching_templates, no_ranking_signal) lack test assertions | R1 | Netero | Yes |
| DEREM-3 | Nit | Author fixed (34c7fac) | workspaces.sql:500 | Query name uses ForOwner instead of ByOwnerID per naming convention | R1 | Netero | Yes |
| DEREM-4 | P2 | Author fixed (34c7fac) | listtemplates.go:192 | Silent error swallowing in loadTemplateActiveDeveloperCounts and loadTemplateUsage produces misleading confidence signals | R1 | Hisoka P2, Chopper P2, Knov P2, Mafuuu P3, Ryosuke P3, Leorio P3 | Yes |
| DEREM-5 | P3 | Author fixed (34c7fac) | listtemplates.go:375 | compactTemplateSearch strips spaces but not hyphens/underscores; multi-word queries miss hyphenated template names | R1 | Mafuuu P3, Kite P3, Razor P3 | Yes |
| DEREM-6 | P3 | Author fixed (34c7fac) | listtemplates.go:230 | Sort, ambiguity predicate, and reason classifier encode ranking model three times with no structural coupling | R1 | Meruem | Yes |
| DEREM-7 | P3 | Author fixed (34c7fac) | listtemplates.go:275 | high_confidence_recommendation fires on weak signal (one old workspace or one active developer) | R1 | Pariston P3, Luffy P3 | Yes |
| DEREM-8 | P3 | Author fixed (34c7fac) | listtemplates_test.go:223 | TestListTemplates_QueryRelevanceOutranksPersonalUsage tests different query scores, not usage-vs-query tiebreaking | R1 | Bisky | Yes |
| DEREM-9 | P3 | Author fixed (34c7fac) | listtemplates.go:316 | rank_reason labels create misleading signal inversion between recommended and non-recommended templates | R1 | Hisoka P3, Razor P3 | Yes |
| DEREM-10 | P3 | Author fixed (34c7fac) | listtemplates.go:151 | total_count is post-filter; agent cannot distinguish empty-org from no-query-match | R1 | Chopper P3 | Yes |
| DEREM-11 | Nit | Author fixed (34c7fac) | listtemplates.go:163 | candidateRankedTemplates returns unranked templates; name misleads about postcondition | R1 | Gon | Yes |
| DEREM-12 | P3 | Author fixed (34c7fac) | listtemplates.go:344 | Query score tiers 1-4 are undocumented magic numbers driving ranking | R1 | Gon | Yes |
| DEREM-13 | Nit | Author fixed (34c7fac) | listtemplates.go:246 | strings.Compare used where cmp.Compare would be uniform with rest of comparator | R1 | Ging-Go | Yes |
| DEREM-14 | P3 | Author fixed (b8ae639) | listtemplates.go:158 | time.Now() bypasses quartz clock; 90-day recency boundary untestable | R2 | Netero | Yes |
| DEREM-15 | P2 | Author fixed (b8ae639) | listtemplates.go:333 | rankingSignalsErr check preempts only_available_template; single-template path gated on irrelevant signals | R2 | Hisoka P2, Mafuuu P3 | Yes |
| DEREM-16 | P3 | Author fixed (b8ae639) | listtemplates.go:453 | Description matching omits compactTemplateSearch; incomplete sibling of DEREM-5 fix | R2 | Hisoka P3, Meruem P3, Chopper P3, Zoro Nit | Yes |
| DEREM-17 | P3 | Author fixed (b8ae639) | listtemplates.go:334 | ranking_signals_unavailable branch (DEREM-4 fix) has no test coverage | R2 | Bisky P3, Mafuuu P3, Mafu-san P3, Chopper Note | Yes |
| DEREM-18 | P3 | Author fixed (b8ae639) | listtemplates.go:365 | WorkspaceCount >= 2 threshold in templateHasConfidentRankingSignal has no test | R2 | Bisky | Yes |
| DEREM-19 | P3 | Author fixed (4d357de) | listtemplates_test.go:131 | Display name matching in templateQueryScore has no isolated test; compact search resolves name first | R3 | Bisky | Yes |
| DEREM-20 | P3 | Author fixed (4d357de) | listtemplates.go:164 | errors.Join fuses independent enrichment failures; surviving signal cannot contribute to confidence | R3 | Hisoka | Yes |
| DEREM-21 | Nit | Author fixed (4d357de) | listtemplates_test.go:255 | No test verifies page 2 retrieval; pagination offset arithmetic unverified | R3 | Bisky | Yes |
| CRF-22 | P3 | Author fixed (dfc293d) | dbauthz_test.go:3376 | Dbauthz test for GetTemplateRankingSignalsByOwnerID only exercises no-org path; org-scoped case silently overwritten | R4 | Netero | Yes |
| CRF-23 | P3 | Author fixed (dfc293d) | prompt.go:11 | workspaceDetachedAwareness contradicts new template selection guidance; deferred without ticket | R4 | Mafu-san | Yes |
| CRF-24 | P3 | Author fixed (dfc293d) | listtemplates.go:460 | relevance_signals claims used_by_you for deleted-only usage but item metadata absent (ActiveCount gate) | R4 | Mafuuu P3, Razor Note, Hisoka Note | Yes |
| CRF-25 | P3 | Author fixed (dfc293d) | listtemplates_test.go:943 | SQL test asserts LastUsedAt.Valid but never checks the actual value; MAX-to-MIN regression invisible | R4 | Bisky | Yes |
| CRF-26 | Nit | Author fixed (5a7641f) | listtemplates.go:159 | Error from asOwner has no operation context | R4 | Leorio | Yes |
| CRF-27 | Nit | Author fixed (dfc293d) | listtemplates.go:91 | ListTemplatesOptions doc says nothing about required fields or defaults | R4 | Leorio | Yes |
| CRF-28 | P3 | Open | listtemplates_test.go:628 | DeletedRecentPersonalUsageShowsEvidence omits next_step and recommended_template_id assertions | R7 | Bisky | Yes |
| CRF-29 | P3 | Open | listtemplates_internal_test.go:158 | No test exercises both-above-confidence-floor-with-large-gap branch of selectTemplateRecommendation | R7 | Bisky | Yes |

Law analysis

Effective LOC: 1635 (553 production, 1082 test, 151 generated). Head SHA: 4c20d3f. Verdict: Don't split. Enforcement: Advisory.

Round log

Round 1

Panel. 1 P2, 8 P3, 3 Nit. Reviewed against 3d03c39..f8882fc.

Round 2

All 13 R1 findings addressed in 34c7fac. Churn guard: PROCEED. Panel found 1 P2, 3 P3 new plus 1 P3 from Netero. Reviewed against 3d03c39..34c7fac.

Round 3

All 5 R2 findings addressed in b8ae639. Churn guard: PROCEED. Netero clean. Panel found 2 P3, 1 Nit. Reviewed against 3d03c39..b8ae639.

Round 4

All 3 R3 findings addressed in 4d357de. Major restructuring: new frecency SQL query, affinity scoring model. Churn guard: PROCEED. Netero 1 P3, Law don't-split. Reviewed against 6ecf804..4c20d3f.

Round 5

5 of 6 R4 findings addressed in dfc293d. CRF-26 (Nit, error context) silent. Churn guard: BLOCKED. Reviewed against 6ecf804..dfc293d.

Round 6

CRF-26 addressed in 5a7641f. Churn guard: PROCEED. Netero clean. No new findings. All 27 findings resolved. Reviewed against 6ecf804..5a7641f.

Round 7

No open findings from R6. Rebase + new commits (docs, next_step constants). Netero clean. Panel found 2 P3 (test gaps). Reviewed against 01ec5e4..36a6be6.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

| Reviewer | Focus |
|----------|-------|
| Bisky | tests |
| Chopper | ops/errors |
| Churn-guard | change verification |
| Ging | language modernization |
| Gon | naming |
| Hisoka | edge cases |
| Killua | perf |
| Kite | change integrity |
| Knov | contracts |
| Knuckle | SQL |
| Kurapika | security |
| Law | decomposition |
| Leorio | docs |
| Luffy | product |
| Mafu-san | process |
| Mafuuu | contracts |
| Melody | dispatch/pairing |
| Meruem | structural |
| Nami | frontend |
| Netero | mechanical checks |
| Pariston | premise testing |
| Pen-botter | product gaps |
| Razor | verification |
| Robin | duplication |
| Ryosuke | Go arch |
| Takumi | concurrency |
| Zoro | shape |

🤖 Managed by Coder Agents.

jaaydenh commented Jun 5, 2026

@codex review

@coder-agents-review

@jaaydenh ⛔ This review has reached its per-chat spend limit ($141.20 / $128.45). Further review rounds are paused.

To raise the limit and continue, comment:

/coder-agents-review set-spend-limit:179

This is a per-chat budget, separate from any account-level usage limit.

🤖 Managed by Coder Agents.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

Comment thread coderd/database/queries/workspaces.sql Outdated
// because sqlc cannot reliably compile the parameterized decay expression;
// see GetTemplateRankingSignalsByOwnerID. Keeping the score and the
// confidence thresholds in the same place also avoids Postgres-versus-Go
// floating-point differences at confidence boundaries.

Member commented:

I feel this introduces a lot of complexity vs doing it in SQL. And floating-point differences seem like a contrived reason? It's used for sorting and PostgreSQL can do that, no need to re-sort in Go. It also seems very unlikely to matter in practice.

Contributor Author replied:

I think the issue is more that this code comment is not very useful and can be improved. Below are the main reasons why it makes sense to keep the current split between Go and SQL.

  • The algorithm is much easier to unit test in Go without database setup, mock clocks, SQL generation, and type/cast issues.
  • The ranking combines Go-only query scoring with DB-derived usage signals, so SQL sorting would only own part of the ordering unless query scoring also moved into SQL.
  • The candidate list is already materialized in Go before pagination, so the extra in-memory sort is not a meaningful performance cost here.
  • SQL remains a stable, explainable aggregation query; Go owns the tunable recommendation policy.
  • Changing weights, half-life, threshold behavior, or future-age clamping does not require touching generated SQL.

Member replied:

> The algorithm is much easier to unit test in Go without database setup, mock clocks, SQL generation, and type/cast issues.

I think this is kind of a wash, you could use dbgen to generate the data and have an end-to-end test of the algorithm. That said, I won't argue with this reasoning, splitting the logic into two parts makes testing one part simpler.

> The ranking combines Go-only query scoring with DB-derived usage signals, so SQL sorting would only own part of the ordering unless query scoring also moved into SQL.

How do you mean?

> The candidate list is already materialized in Go before pagination, so the extra in-memory sort is not a meaningful performance cost here.

This is the kind of reasoning I'd expect an LLM to give me. 😂

> Changing weights, half-life, threshold behavior, or future-age clamping does not require touching generated SQL.

This is immaterial, changing Go requires rebuilding the binary, changing SQL requires make gen, this is not an argument worth making.


Either way, as long as we keep the Go implementation simple and not over-engineered, I won't push back on the split.

Contributor Author replied:

On "how do you mean": the final order sorts by query-relevance tier first, and that tier is computed in Go from the query against name/display name/description with compact matching, so an ORDER BY could only own the affinity part; Go would still re-sort by (query tier, affinity) anyway unless query scoring also moved into SQL.

Also tidied the contested justification in 19ce6aa: the comments no longer claim sqlc fragility, just that keeping the score in Go keeps the policy and thresholds in one place. The Go side also got simpler in 36a6be6 (selection enums and derived fields removed; ranking math unchanged).
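
The two-key ordering discussed in this thread can be illustrated with a small comparator. This is a sketch under assumptions, not the PR's implementation: `candidate` and `rankCandidates` are hypothetical names, and the tie-break order (name, then ID) follows the "name/ID for determinism" wording in the PR description. It uses `slices.SortFunc` and `cmp.Compare` from the standard library (Go 1.21 or later).

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// candidate is a hypothetical template row with both ranking keys attached.
type candidate struct {
	ID, Name string
	Tier     int     // query-relevance tier, computed in Go
	Affinity float64 // usage-derived score
}

// rankCandidates sorts in place: higher tier first, then higher affinity,
// then name and ID ascending so that the order is deterministic.
func rankCandidates(cs []candidate) {
	slices.SortFunc(cs, func(a, b candidate) int {
		if c := cmp.Compare(b.Tier, a.Tier); c != 0 {
			return c
		}
		if c := cmp.Compare(b.Affinity, a.Affinity); c != 0 {
			return c
		}
		if c := cmp.Compare(a.Name, b.Name); c != 0 {
			return c
		}
		return cmp.Compare(a.ID, b.ID)
	})
}

func main() {
	cs := []candidate{
		{ID: "a", Name: "zeta", Tier: 2, Affinity: 50},
		{ID: "b", Name: "alpha", Tier: 4, Affinity: 0},
		{ID: "d", Name: "beta", Tier: 4, Affinity: 3},
		{ID: "c", Name: "beta", Tier: 4, Affinity: 3},
	}
	rankCandidates(cs)
	for _, c := range cs {
		fmt.Println(c.ID, c.Name, c.Tier, c.Affinity)
	}
}
```

Note how the tier-2 template sorts last despite having the highest affinity: an SQL `ORDER BY` on affinity alone would not produce this order unless the tier were also available to the query.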

Contributor Author replied:

I wish LLMs would identify themselves when they respond to comments; the last comment was not me.

jaaydenh and others added 5 commits June 9, 2026 18:02
…truction

list_templates previously taught the model a four-value selection_hint
enum, an eleven-value recommendation_reason, per-item relevance_signals,
and a derived user_selection_required flag, re-explained in two tool
descriptions and two prompt sections. The ranked output now carries a
single fixed next_step instruction at the point of use plus an optional
recommended_template_id, and the per-item payload keeps only raw usage
evidence. Ranking, confidence thresholds, and authorization are
unchanged. Prompts reference the field through chattool.NextStepField so
they cannot drift.

Documents how agent tool calls work, including template selection and
workspace creation, in docs/ai-coder/agents/tools.
@github-actions

Docs preview

📖 View docs preview for docs/ai-coder/agents/tools/index.md

@jaaydenh

/coder-agents-review set-spend-limit:200

@coder-agents-review coder-agents-review Bot left a comment

All 27 prior findings remain resolved. CI green. The latest commits add next_step constants, docs, and a rebase. Netero found zero mechanical issues. Two reviewers (Mafuuu, Pariston) found no issues. Bisky found two test gaps in the new code, both P3.

Severity breakdown: 2 P3.

"Every scenario test asserts next_step and recommended_template_id. Except one." (Bisky)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chattool/listtemplates_test.go
Comment thread coderd/x/chatd/chattool/listtemplates_internal_test.go
Asserts the recommendation fires for recent deleted-only personal usage
(CRF-28), adds the both-above-floor-with-large-gap recommendation case
(CRF-29), and rewords the affinity-score comments to drop the inaccurate
sqlc fragility claim; the score lives in Go so the ranking policy and its
confidence thresholds stay in one place.
@jaaydenh jaaydenh requested a review from johnstcn June 11, 2026 13:17
jaaydenh added 3 commits June 12, 2026 09:59
Shows a concrete query and ranked two-template response in the agent
tools doc, and clarifies that templates are scored independently with
only raw evidence fields returned.
@jaaydenh jaaydenh requested a review from mafredri June 12, 2026 10:24
linear-code Bot commented Jun 12, 2026

CODAGT-203

@johnstcn johnstcn left a comment

Instead of computing an opaque score, could we instead try just surfacing the raw numbers in the list_templates response and letting the model decide for itself? This would also let us drop the recommended_template_id.

The NextStep field is definitely a good addition though.

s := templateRankingSignals{
	ActiveCount:        row.ActiveCount,
	DeletedRecentCount: row.DeletedRecentCount,
	OrgDevs:            row.OrgDevs,

This is already provided by GetWorkspaceUniqueOwnerCountByTemplateID.

@jaaydenh

> Instead of computing an opaque score, could we instead try just surfacing the raw numbers in the list_templates response and letting the model decide for itself? This would also let us drop the recommended_template_id.
>
> The NextStep field is definitely a good addition though.

@johnstcn

  1. The way the ranking response is designed, the score would not be that useful. The ranking is broken up into query tiers (see the PR description), and the affinity score is only used to rank templates within each tier. If there is no query, then there are no tiers. It would be necessary to return the tier along with the affinity score to avoid confusing the LLM, or to create a new normalized score when a query is involved. I think this potentially adds unnecessary complexity.

  2. The response as currently designed attempts to simplify the decision-making process: it is an ordered list of templates sorted by relevance, with next_step providing a clear answer so that the LLM doesn't need to do as much thinking/processing of the list_templates response.


If you have a strong argument that a score is necessary for each template in the response, I would have to rethink and rework the logic a bit so that it makes sense. I also think it may then make sense to remove next_step so that the LLM decides entirely based on the information in the response containing the list of templates.

@johnstcn

> If you have a strong argument that a score is necessary for each template in the response.

I have no strong arguments here; it's hard to make the determination without data and test results.

@johnstcn johnstcn left a comment

Approving to unblock.

@jaaydenh jaaydenh merged commit bc44cdd into main Jun 18, 2026
32 of 33 checks passed
@jaaydenh jaaydenh deleted the workspace-bdje branch June 18, 2026 05:41
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 18, 2026
@mafredri

  • Call read_template only when you need parameter or preset details before create_workspace.

This rule seems a bit suspect to me: the agent won't know whether something matters for the task without checking, so the only signal is when the user explicitly asks for it? I don't know what the solution should be, though, if the goal is to avoid the tool call.

Other than that, LGTM 👍🏻
