feat: add list_templates ranking telemetry#26458
Conversation
…main # Conflicts: # coderd/x/chatd/chattool/createworkspace.go
Replace the lexicographic template-ranking comparator in list_templates with a frecency score (frequency discounted by recency), per reviewer feedback. - Add GetTemplateRankingSignalsByOwnerID, returning the user's recent active and recently-deleted workspace counts, last usage, and the count of distinct active developers in the org. Recently-deleted workspaces now contribute (recovering history the deleted=false filter discarded), scoped to a lookback window, and the prebuilds system user is excluded from the org popularity count. Replaces GetWorkspaceUsageGroupedByTemplateIDByOwnerID. - Compute the affinity score in Go (Wp*(active + Wd*deleted)*0.5^(age/half_life) + Wo*ln(1+org_devs)) because sqlc cannot reliably compile the parameterized decay expression; the query returns the raw signals. Weights, half-life, and lookback are explicit constants. - Recommendation confidence is now a single score comparison: a decisive query match recommends on its own, otherwise the top score must clear a floor derived from the active-developer minimum and lead the runner-up by a derived margin. Stale-but-frequent usage no longer recommends. - Replace the AsSystemRestricted call for the cross-user org count with a narrow dbauthz wrapper checking workspace-owner read plus a template-metadata read. - Clarify list_templates/read_template guidance in the detached prompt.
Address the latest PR review feedback for frecency-based template ranking: - Authorize template ranking signals against the exact candidate template IDs using the same prepared-filter semantics as GetTemplatesWithFilter, so users with template ACL access keep ranking signals without broad org-wide template read. - Split the dbauthz mock coverage for org-scoped and any-organization calls; normalize duplicate subtest suffixes in the dbauthz method-test harness. - Surface deleted-only personal usage evidence in list_templates output with a recently-deleted count and last-used timestamp. - Assert the raw SQL query returns the maximum last_used_at value. - Clarify detached prompt guidance for user_selection_required and conditional read_template usage. - Document ListTemplatesOptions requirements and defaults.
Address CRF-26 by adding list_templates operation context to the user-facing asOwner authorization error response.
Keep TestChatSystemPrompt's detached workspace awareness expectation in sync with the updated list_templates guidance.
…truction list_templates previously taught the model a four-value selection_hint enum, an eleven-value recommendation_reason, per-item relevance_signals, and a derived user_selection_required flag, re-explained in two tool descriptions and two prompt sections. The ranked output now carries a single fixed next_step instruction at the point of use plus an optional recommended_template_id, and the per-item payload keeps only raw usage evidence. Ranking, confidence thresholds, and authorization are unchanged. Prompts reference the field through chattool.NextStepField so they cannot drift. Documents how agent tool calls work, including template selection and workspace creation, in docs/ai-coder/agents/tools.
Asserts the recommendation fires for recent deleted-only personal usage (CRF-28), adds the both-above-floor-with-large-gap recommendation case (CRF-29), and rewords the affinity-score comments to drop the inaccurate sqlc fragility claim; the score lives in Go so the ranking policy and its confidence thresholds stay in one place.
Shows a concrete query and ranked two-template response in the agent tools doc, and clarifies that templates are scored independently with only raw evidence fields returned.
# Conflicts: # coderd/x/chatd/chatd_test.go
Add observability for the list_templates ranking decision and its downstream acceptance. Tier 1: a structured decision log per call with correlation IDs, ranking inputs (scores, gap, thresholds), outcome, and reason. The raw query text is never logged, only its presence and length. Tier 2: Prometheus metrics for recommendation outcome, signal-load failures, and the affinity gap between the top two candidates. Tier 3: an in-memory, best-effort RecommendationTracker correlates a recommendation with the template create_workspace later builds, recorded as template_recommendation_followup_total plus a log. The affinity score stays internal to the response; the telemetry surfaces it in logs and metrics without exposing it to the model. Regenerates the Prometheus metrics docs for the new metrics.
Docs preview📖 View docs preview for |
|
/coder-agents-review |
|
Chat: Review in progress | View chat Review historydeep-review v0.8.0 | Round 3 | Last posted: Round 3, 15 findings (4 P3, 9 Nit, 2 Note), COMMENT. Review Finding inventoryLaw analysisEffective LOC: 2797. Head SHA: 137ccf9. Verdict: Don't split. Enforcement: Advisory. The three core concerns (scoring, telemetry, tracker) form a linear dependency chain. 11 lines of unrelated cleanup are independently mergeable but trivial. Findings
Contested and acknowledgedCRF-9 (Note, createworkspace_internal_test.go:2114) - Only accepted_recommendation followup integration-tested
CRF-12 (Nit, dbauthz_test.go:3609) - Three subtests share one s.Run name
CRF-11 (Note, listtemplates.go:246) - Outcome counter increments on every page fetch
Round logRound 1Panel (14 reviewers). 3 P3, 6 Nit, 2 Note. Reviewed against 9a91632..3017d21. Round 2Panel (11 reviewers). CRF-1 through CRF-8, CRF-10 verified fixed. CRF-9, CRF-11 accepted. 3 new Nit (CRF-12, CRF-13, CRF-14). 0 new P0-P3. Reviewed against f1ce101..137ccf9. Round 3Panel (5 reviewers). CRF-12 contested, panel closed (5/5 accept scope defense). CRF-13, CRF-14 verified fixed. 1 new P3 (CRF-15). Reviewed against a5a4c49..ab4fb75. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3017d21d08
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Clean telemetry design: three-tier observability (structured decision log, Prometheus metrics, acceptance tracker) that is purely additive, does not alter tool behavior, and degrades gracefully when partial. The RecommendationTracker is well-bounded (TTL, max entries, consume-once), the interface split avoids the import cycle correctly, and the test suite covers every classification path with real DB integration tests and mock clocks for deterministic TTL. Netero (mechanical first pass) found no issues.
3 P3, 6 Nit, 2 Note.
The main convergent finding (5 of 14 reviewers flagged independently): Record overwrites the tracker's listed set on each page fetch instead of accumulating, so templates from earlier pages are misclassified as "unlisted" rather than "overrode_with_listed_template." The primary signal (accepted_recommendation) is unaffected. The fix is straightforward: union listed IDs across pages or gate Record on page 1.
"I tried to build a case that it doesn't, and I couldn't." (Pariston, on whether the in-memory tracker justifies its complexity over the durable source of truth)
🤖 This review was automatically generated with Coder Agents.
Resolve reviewer findings on the telemetry PR: - RecommendationTracker.Record accumulates listed template IDs across pages of the same result, so a follow-up build from a later page is classified as listed rather than unlisted (CRF-1). - list_templates records the affinity gap only when ranking signals loaded; a failed load leaves affinity scores at zero and would pollute the histogram (Codex P2). - listTemplatesOutcome maps NextStepAskUser explicitly and routes unknown next steps to an "unknown" bucket (CRF-4). - template_recommendation_followup_total help text enumerates its outcome values; metrics docs regenerated (CRF-5). - create_workspace follow-up log includes owner_id (CRF-6). - Rename recommendationFollowupListedNoRec to recommendationFollowupListedNoRecommendation (CRF-7). - Drop "Tier" labels from comments and tighten the tracker field comment (CRF-3, CRF-10). - Eviction test asserts a newer entry survives and uses range-over-int; add coverage for cross-page accumulation and the unknown outcome (CRF-2, CRF-8).
|
Pushed 137ccf9 addressing the round-1 review feedback. Summary: Codex P2 — CRF-1 (P3) — CRF-2 (P3) — eviction test now asserts a newer entry survives. CRF-3 (P3) — dropped the CRF-4,5,6,7,8,10 (Nit) — explicit CRF-9, CRF-11 (Note) — acknowledged with rationale on the threads; no code change (intentional per-call counter, classification paths unit-tested). Added tests: cross-page accumulation, eviction survivor, gap-skipped-on-signal-failure, and the outcome mapping incl. the unknown bucket. All inline threads addressed and resolved. |
|
/coder-agents-review @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 137ccf9457
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Round 2. All 3 P3 findings from R1 verified fixed in 137ccf9. All 6 Nits fixed. Both Notes acknowledged with rationale. Clean fixes, no regressions, no sibling issues introduced.
The base rebase brought the ranking and scoring logic into the diff (from the stacked PR). 11 reviewers covered the full scope: ranking formula, confidence thresholds, recommendation logic, telemetry, tracker, dbauthz authorization, docs. No P0-P3 issues found. 3 new Nits.
"I tried to build simpler alternatives... I could not build a simpler alternative that preserves the decision quality. The solution is proportional to the problem." (Pariston, stress-testing the design)
3 Nit.
🤖 This review was automatically generated with Coder Agents.
Resolve conflicts in listtemplates.go, listtemplates_internal_test.go, and listtemplates_test.go by taking the branch version: main's squashed "rank chat workspace templates" (#25037) is byte-identical to this branch's pre-telemetry base for those files, so the branch already contains all of main's changes there plus the telemetry additions.
- Reword recordListTemplatesTelemetry doc: the affinity score is "not included in the tool result shown to the model" rather than "stays internal", which misread as not logged (CRF-13). - Drop the redundant "list_templates ranking telemetry" section heading in the Metrics struct; the field names already convey it (CRF-14). - Document the intentional query-tie recommendation rationale: a query match is the baseline confidence signal there, so affinity only breaks the tie and need not clear the no-query affinity floor (Codex P2).
|
Pushed ab4fb75. Two changes since the last review: Merged latest Round 2 feedback:
All four review threads addressed and resolved. |
|
/coder-agents-review @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab4fb75581
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // their listed IDs instead of overwriting. Page 1, a missing entry, or a | ||
| // changed recommendation starts fresh. | ||
| if page > 1 { | ||
| if entry, ok := t.entries[chatID]; ok && entry.recommendedID == recommendedID { |
There was a problem hiding this comment.
Do not revive expired list_templates history
When a chat requests page > 1 after the 30-minute tracker TTL has elapsed, this branch updates recordedAt and keeps the old page-1 listed set instead of treating the entry as expired. That makes a later create_workspace classify templates shown only in the stale result as freshly listed, corrupting the follow-up telemetry the TTL is meant to bound.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Round 3. CRF-13 and CRF-14 verified fixed (ab4fb75). CRF-12 (contested): panel closed 5/5, the file has zero changed lines in this PR's diff and the code exists identically on main from #25037.
1 new P3.
"The contest is valid; the cleanup belongs in a standalone PR targeting the code that introduced it." (Mafu-san, on CRF-12)
🤖 This review was automatically generated with Coder Agents.
| // page-1 template is still "listed" rather than "unlisted". | ||
| tr.Record(chat, rec, []uuid.UUID{rec, page1}, 1) | ||
| tr.Record(chat, rec, []uuid.UUID{page2}, 2) | ||
| require.Equal(t, recommendationFollowupOverrodeListed, tr.Classify(chat, page1)) |
There was a problem hiding this comment.
P3 [CRF-15] AccumulatesListedAcrossPages proves page-1 IDs survived the page-2 Record, not that page-2 IDs were accumulated. (Bisky)
The test records page 1 with
{rec, page1}, then page 2 with{page2}, and classifies withpage1. That proves page-1 IDs survive the page-2 Record (the union-not-overwrite property), butpage1was already in the set from the first Record call. The accumulation loop body inRecordcould be deleted entirely (keeping only the early-return), and this assertion still passes.
Fix: classify with page2 instead of page1. That proves the loop added page-2 IDs to the listed set:
require.Equal(t, recommendationFollowupOverrodeListed, tr.Classify(chat, page2))🤖
Summary
Adds observability for the
list_templatesranking decision (from #25037) and its downstream acceptance, so the ranking thesis can be evaluated with data instead of speculation. The affinity score stays internal to the tool response; this telemetry surfaces it in logs and metrics, never to the model.Stacked on #25037 (base:
workspace-bdje).What's added
Tier 1 — decision log. One structured
Infolog (list_templates decision) per call: correlation IDs (chat_id,owner_id,tool_call_id),query_present/query_len(raw query text is never logged), candidate/visible/returned counts, top + runner-up query/affinity scores, the affinity gap, the confidence thresholds,ranking_version, theoutcome, arecommendation_reason, andsignals_load_failed. The reason comes straight fromselectTemplateRecommendationso the log cannot drift from the actual decision branch.Tier 2 — Prometheus metrics (prefix
coderd_chatd_):list_templates_outcome_totaloutcome= recommended / ask_user / no_matches / no_templateslist_templates_signals_failures_totallist_templates_affinity_gaprecommendedtemplate_recommendation_followup_totaloutcome(see Tier 3)The affinity-gap histogram is only observed when affinity is the deciding signal (no query, or the top two share a query tier), so the gap is never negative or misleading when query relevance dominates.
Tier 3 — closed loop (acceptance). An in-memory
RecommendationTracker(TTL + max-entries eviction, consume-once) records the recommendation + shown page per chat inlist_templates, andcreate_workspaceclassifies the template it builds intoaccepted_recommendation,overrode_with_listed_template,created_listed_without_recommendation,created_unlisted_template, orno_recent_list_templates. Classification runs only on genuine creation (after the idempotent early-return path).Downstream quality ("was the workspace used or deleted within N days") is intentionally left reconstructable from durable data (the persisted transcript +
chats.workspace_id) rather than built as an always-on background collector, which would be disproportionate for an experimental feature and would duplicate gauges across replicas.Where to view it
list_templates decisionandcreate_workspace recommendation follow-up(info level).--prometheus-enable, default127.0.0.1:2112).Design notes / limitations
chatloopimportschattool, so the metrics hooks are small interfaces declared inchattoolthat*chatloop.Metricssatisfies structurally, wired fromchatd.go.no_recent_list_templates. Treat the acceptance metric as directional; the durable source of truth is the transcript + DB.ranking_versionis a hand-maintained constant; bump it when the scoring formula or thresholds change.Testing
go test ./coderd/x/chatd/chattool/... ./coderd/x/chatd/chatloop/...(incl. new tracker classification/TTL/eviction tests, metric-gating tests, reason coverage, andlist_templates+create_workspaceintegration tests); race-clean.make genregenerated the Prometheus metrics docs for the four new metrics.