feat: rank chat workspace templates#25037
Conversation
|
/coder-agents-review |
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Clean, well-decomposed PR. The ranking pipeline (fetch, score, enrich, rank, recommend, paginate, format) is easy to follow, and each stage is a pure or near-pure function. The dbauthz.AsSystemRestricted fix for the developer-count query corrects a latent bug where non-admin users always got zero active developer counts due to a silent RBAC failure. Six new tests cover the major ranking scenarios, and the runListTemplates/listTemplateItems helpers keep test noise down.
The main concern is the silent error-swallowing pattern in the two enrichment helpers. The old code had this for developer counts (and nobody noticed the broken auth because the error was hidden). The new code extends it to usage data and adds a confidence signal that triggers auto-selection, which amplifies the consequence of silent failures.
Severity breakdown: 1 P2, 8 P3, 3 Nit.
"One active developer versus zero is enough. The prompt tells the agent to treat recommended_template_id as the default and skip asking. For a first-time user with no personal history, the agent confidently creates a workspace from a template that one person happened to use." (Luffy)
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
All 13 R1 findings addressed cleanly. The fixes are structural, not point patches: DEREM-6 became a shared templateRankSignals struct, DEREM-7 became a confidence threshold system with named constants and a 90-day recency window, DEREM-4 got proper error logging and graceful degradation. Test additions are proportional (210 new test lines for 108 net production lines).
The DEREM-4 fix introduced a new ordering issue: the rankingSignalsErr check now preempts the only_available_template path, which means a transient DB error in a single-template org causes the agent to ask the user to choose from a list of one. The fix is a one-line reorder.
The DEREM-5 fix (compact search normalization) was applied to name/displayName but not to description matching, leaving the same class of gap in the weakest match tier.
Severity breakdown: 1 P2, 4 P3.
"The ranking signals are irrelevant when there is no competition; the recommendation is trivially correct." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8ae6390e8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
All 18 prior findings addressed across two fix commits. The R2 fixes are solid: quartz clock injection, reordered recommendation checks, description compact search, internal unit tests for the error degradation path, and workspace-count threshold coverage. Netero found zero mechanical issues on this revision. Three reviewers (Mafuuu, Pariston, Meruem) verified the R2 fixes and found no new issues.
Two remaining P3s and a Nit, none blocking. The PR is in good shape.
Severity breakdown: 2 P3, 1 Nit.
"Delete the t.DisplayName entries from all three loops in templateQueryScore and every test in this file still passes. The display name code path is unreachable under the current test fixtures." (Bisky)
🤖 This review was automatically generated with Coder Agents.
…main # Conflicts: # coderd/x/chatd/chattool/createworkspace.go
|
@codex review |
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
mafredri
left a comment
There was a problem hiding this comment.
The ranking/recommendation approach looks solid. One concern about stale guidance in workspaceDetachedAwareness that now contradicts the new template selection rules.
workspaceDetachedAwareness (prompt.go line 12, not in this diff) still says:
If a workspace is needed, use list_templates and read_template as needed before create_workspace.
This presents read_template as a standard step in the workflow. But the new <workspace-template-selection> block and the updated tool descriptions make it conditional (only when you need parameters or presets). The LLM sees both since workspaceDetachedAwareness is injected as a separate message in chatd.go:1617.
I think that line should be updated to match, something like:
If a workspace is needed, use list_templates before create_workspace. Call read_template only when you need parameter or preset details.
🤖 This review was generated with the help of Coder Agents.
Replace the lexicographic template-ranking comparator in list_templates with a frecency score (frequency discounted by recency), per reviewer feedback. - Add GetTemplateRankingSignalsByOwnerID, returning the user's recent active and recently-deleted workspace counts, last usage, and the count of distinct active developers in the org. Recently-deleted workspaces now contribute (recovering history the deleted=false filter discarded), scoped to a lookback window, and the prebuilds system user is excluded from the org popularity count. Replaces GetWorkspaceUsageGroupedByTemplateIDByOwnerID. - Compute the affinity score in Go (Wp*(active + Wd*deleted)*0.5^(age/half_life) + Wo*ln(1+org_devs)) because sqlc cannot reliably compile the parameterized decay expression; the query returns the raw signals. Weights, half-life, and lookback are explicit constants. - Recommendation confidence is now a single score comparison: a decisive query match recommends on its own, otherwise the top score must clear a floor derived from the active-developer minimum and lead the runner-up by a derived margin. Stale-but-frequent usage no longer recommends. - Replace the AsSystemRestricted call for the cross-user org count with a narrow dbauthz wrapper checking workspace-owner read plus a template-metadata read. - Clarify list_templates/read_template guidance in the detached prompt.
|
/coder-agents-review |
|
Chat: Review in progress | View chat deep-review v0.7.1 | Round 7 | Last posted: Round 7, 29 findings (2 P2, 21 P3, 6 Nit), APPROVE. Review Finding inventoryFindings
Law analysisEffective LOC: 1635 (553 production, 1082 test, 151 generated). Head SHA: 4c20d3f. Verdict: Don't split. Enforcement: Advisory. Round logRound 1Panel. 1 P2, 8 P3, 3 Nit. Reviewed against 3d03c39..f8882fc. Round 2All 13 R1 findings addressed in 34c7fac. Churn guard: PROCEED. Panel found 1 P2, 3 P3 new plus 1 P3 from Netero. Reviewed against 3d03c39..34c7fac. Round 3All 5 R2 findings addressed in b8ae639. Churn guard: PROCEED. Netero clean. Panel found 2 P3, 1 Nit. Reviewed against 3d03c39..b8ae639. Round 4All 3 R3 findings addressed in 4d357de. Major restructuring: new frecency SQL query, affinity scoring model. Churn guard: PROCEED. Netero 1 P3, Law don't-split. Reviewed against 6ecf804..4c20d3f. Round 55 of 6 R4 findings addressed in dfc293d. CRF-26 (Nit, error context) silent. Churn guard: BLOCKED. Reviewed against 6ecf804..dfc293d. Round 6CRF-26 addressed in 5a7641f. Churn guard: PROCEED. Netero clean. No new findings. All 27 findings resolved. Reviewed against 6ecf804..5a7641f. Round 7No open findings from R6. Rebase + new commits (docs, next_step constants). Netero clean. Panel found 2 P3 (test gaps). Reviewed against 01ec5e4..36a6be6. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
|
@codex review |
|
@jaaydenh ⛔ This review has reached its per-chat spend limit ($141.20 / $128.45). Further review rounds are paused. To raise the limit and continue, comment: This is a per-chat budget, separate from any account-level usage limit.
|
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
| // because sqlc cannot reliably compile the parameterized decay expression; | ||
| // see GetTemplateRankingSignalsByOwnerID. Keeping the score and the | ||
| // confidence thresholds in the same place also avoids Postgres-versus-Go | ||
| // floating-point differences at confidence boundaries. |
There was a problem hiding this comment.
I feel this introduces a lot of complexity vs doing it in SQL. And floating-point differences seem like a contrived reason? It's used for sorting and PostgreSQL can do that, no need to re-sort in Go. It also seems very unlikely to matter in practice.
There was a problem hiding this comment.
I think the issue is more about the how this comment is really not that useful and can be improved. Below are the main points about why its makes sense to keep the current split between Go and SQL.
- The algorithm is much easier to unit test in Go without database setup, mock clocks, SQL generation, and type/cast issues.
- The ranking combines Go-only query scoring with DB-derived usage signals, so SQL sorting would only own part of the ordering unless query scoring also moved into SQL.
- The candidate list is already materialized in Go before pagination, so the extra in-memory sort is not a meaningful performance cost here.
- SQL remains a stable, explainable aggregation query; Go owns the tunable recommendation policy.
- Changing weights, half-life, threshold behavior, or future-age clamping does not require touching generated SQL.
There was a problem hiding this comment.
The algorithm is much easier to unit test in Go without database setup, mock clocks, SQL generation, and type/cast issues.
I think this is kind of a wash, you could use dbgen to generate the data and have an end-to-end test of the algorithm. That said, I won't argue with this reasoning, splitting the logic into two parts makes testing one part simpler.
The ranking combines Go-only query scoring with DB-derived usage signals, so SQL sorting would only own part of the ordering unless query scoring also moved into SQL.
How do you mean?
The candidate list is already materialized in Go before pagination, so the extra in-memory sort is not a meaningful performance cost here.
This is the kind of reasoning I'd expect an LLM to give me. 😂
Changing weights, half-life, threshold behavior, or future-age clamping does not require touching generated SQL.
This is immaterial, changing Go requires rebuilding the binary, changing SQL requires make gen, this is not an argument worth making.
Either way, as long as we keep the Go implementation simple and not over-engineered, I won't push back on the split.
There was a problem hiding this comment.
On "how do you mean": the final order sorts by query-relevance tier first, and that tier is computed in Go from the query against name/display name/description with compact matching, so an ORDER BY could only own the affinity part; Go would still re-sort by (query tier, affinity) anyway unless query scoring also moved into SQL.
Also tidied the contested justification in 19ce6aa: the comments no longer claim sqlc fragility, just that keeping the score in Go keeps the policy and thresholds in one place. The Go side also got simpler in 36a6be6 (selection enums and derived fields removed; ranking math unchanged).
There was a problem hiding this comment.
I wish llms would identify when they respond to comments, last comment was not me
…truction list_templates previously taught the model a four-value selection_hint enum, an eleven-value recommendation_reason, per-item relevance_signals, and a derived user_selection_required flag, re-explained in two tool descriptions and two prompt sections. The ranked output now carries a single fixed next_step instruction at the point of use plus an optional recommended_template_id, and the per-item payload keeps only raw usage evidence. Ranking, confidence thresholds, and authorization are unchanged. Prompts reference the field through chattool.NextStepField so they cannot drift. Documents how agent tool calls work, including template selection and workspace creation, in docs/ai-coder/agents/tools.
Docs preview📖 View docs preview for |
|
/coder-agents-review set-spend-limit:200 |
There was a problem hiding this comment.
All 27 prior findings remain resolved. CI green. The latest commits add next_step constants, docs, and a rebase. Netero found zero mechanical issues. Two reviewers (Mafuuu, Pariston) found no issues. Bisky found two test gaps in the new code, both P3.
Severity breakdown: 2 P3.
"Every scenario test asserts next_step and recommended_template_id. Except one." (Bisky)
🤖 This review was automatically generated with Coder Agents.
Asserts the recommendation fires for recent deleted-only personal usage (CRF-28), adds the both-above-floor-with-large-gap recommendation case (CRF-29), and rewords the affinity-score comments to drop the inaccurate sqlc fragility claim; the score lives in Go so the ranking policy and its confidence thresholds stay in one place.
Shows a concrete query and ranked two-template response in the agent tools doc, and clarifies that templates are scored independently with only raw evidence fields returned.
johnstcn
left a comment
There was a problem hiding this comment.
Instead of computing an opaque score, could we instead try just surfacing the raw numbers in the list_templates response and letting the model decide for itself? This would also let us drop the recommended_template_id.
The NextStep field is definitely a good addition though.
| s := templateRankingSignals{ | ||
| ActiveCount: row.ActiveCount, | ||
| DeletedRecentCount: row.DeletedRecentCount, | ||
| OrgDevs: row.OrgDevs, |
There was a problem hiding this comment.
This is already provided by GetWorkspaceUniqueOwnerCountByTemplateID.
If you have a strong argument that a score is necessary for each template in the response. I have to rethink and rework the logic a bit so it makes sense. I also think it may make sense to remove next_step so that the LLM then decides completely based on the information in the response containing the list of templates. |
I have no strong arguments here; it's hard to make the determination without data and test results. |
# Conflicts: # coderd/x/chatd/chatd_test.go
This rule seems a bit suspect to me, the agent won't know if something matters for the task without checking, so the only signal is when the user explicitly ask for it? Don't know what the solution should be, though, if the goal is to avoid the tool call. Other than that, LGTM 👍🏻 |
closes CODAGT-203
Summary
list_templatesnow returns a ranked shortlist with a recommendation, so the chat agent can pick the right template the way a colleague would: prefer what matches the request, what the user already uses, and what the rest of the organization uses. Instead of teaching the model an enum protocol in prompts, every result carries a fixednext_stepinstruction telling the agent what to do.How list_templates works
Fetch: active, non-deprecated templates in the chat's organization, filtered by the admin template allowlist, authorized as the chat owner (no system escalation).
Query relevance (optional
queryargument): each template receives the highest tier any of its fields matches, and a higher tier always outranks a lower one regardless of usage:Matching is case-insensitive and ignores spaces/hyphens/underscores (
python gpumatchespython-gpu).Usage signals: a new
GetTemplateRankingSignalsByOwnerIDquery returns, per template, the owner's active and recently-deleted workspace counts within a 60-day window, the last in-window usage, and the count of distinct developers with an active workspace (unclaimed prebuilds excluded).Affinity score (computed in Go, per template, from that template's signals only):
active/deletedare the owner's in-window workspace counts,days_since_last_useis measured from the most recent in-window usage (the personal term is zero without in-window usage), andactive_developersis the org-wide count. Personal usage carries 10x the weight of org popularity; the confidence floor is the score of two active developers (ln 3) and the required lead over the runner-up isln 3 - ln 2.Rank: query tier first (when a query is present), then affinity score, then name/ID for determinism. Results paginate 10 per page with
next_pagepresent only when more exist.Recommendation contract
The result tells the agent what to do next instead of describing confidence levels:
recommended_template_idis present only when the top template is a clear winner: the only available template, a decisive query match, or an affinity score that clears a floor and leads the runner-up by a derived margin.next_stepis always present and is one of four fixed sentences: use the recommendation, ask the user to choose, retry a query that matched nothing, or report that no templates are available.Per-template items carry raw evidence (
active_developers,your_workspace_count,last_used_by_you) rather than derived labels. When signals fail to load, the tool logs and degrades to asking the user unless the query alone is decisive.Prompts and the
create_workspace/read_templatedescriptions reference the field through thechattool.NextStepFieldconstant, so the instruction lives in one place and cannot drift.create_workspaceremains idempotent and allowlist-enforced.Authorization
The signals query runs with the chat owner's permissions: reading the owner's own workspaces plus a template-metadata read for the cross-user popularity count. dbauthz rejects the call if any requested template is not readable by the owner (covered by allow and deny method tests).
Docs
Adds
docs/ai-coder/agents/tools/explaining how agent tool calls work, withlist_templatesranking and thenext_stepcontract as the first documented tools.