Commit 7fe64b3

feat(eval): add agent-eval harness and /audit + /publish Claude skills
Replaces the old interactive publish.js script with two Claude skills and a full agent-evaluation harness:

- `.claude/skills/audit/`: the `/audit` skill drives `scripts/agent-eval/audit.sh` to benchmark retrieval quality (with vs. without codegraph) on a chosen real-world repo from the new `corpus.json` (30 repos across 15 languages).
- `.claude/skills/publish/`: the `/publish` skill orchestrates the full release workflow (preflight → changelog → confirmation gate → bump/build → npm publish → GitHub release), replacing `publish.js`.
- `scripts/agent-eval/`: headless (`run-agent.sh`, `run-all.sh`) and interactive tmux (`itrun.sh`) harnesses with stream-json parsers (`parse-run.mjs`, `parse-session.mjs`) that report tool calls, token usage, and a VERDICT line summarising codegraph_explore vs. Read/Grep counts.
- `run-interactive-test.md`: documents the two harnesses, the idle-detection approach, and what "good" agent behavior looks like after explore-first guidance.
1 parent 1cbca5a commit 7fe64b3

11 files changed

Lines changed: 818 additions & 65 deletions

.claude/skills/audit/SKILL.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
1+
---
2+
name: audit
3+
description: Benchmark CodeGraph retrieval quality on a real codebase by comparing agent behavior with vs without CodeGraph. Use when the user runs /audit or asks to test, benchmark, audit, or validate a codegraph version (the local dev build or a published npm version) against a language's repo.
4+
---
5+
6+
# CodeGraph Quality Audit
7+
8+
Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen
9+
codegraph version on a chosen real-world repo. Drives the harness in
10+
`scripts/agent-eval/`.
11+
12+
## Prerequisites
13+
- `tmux` 3+, a logged-in `claude` CLI, `node`, `git` (macOS/Linux).
14+
- Run from the codegraph repo root.
15+
16+
## Workflow
17+
18+
Copy this checklist:
19+
```
20+
- [ ] 1. Pick version (local or npm)
21+
- [ ] 2. Pick language
22+
- [ ] 3. Pick repo by size
23+
- [ ] 4. Pick harness (headless / tmux / both)
24+
- [ ] 5. Run audit.sh in the background
25+
- [ ] 6. Report results
26+
```
27+
28+
**Step 1 — version.** Ask with `AskUserQuestion`: which codegraph version to test.
29+
Offer "Local dev build" and "Latest published"; the free-text "Other" lets the
30+
user type a specific version (e.g. `0.7.10`). Map the answer to a VERSION token:
31+
- "Local dev build" → `local`
32+
- "Latest published" → `latest`
33+
- a typed version → that string (e.g. `0.7.10`)
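
The mapping above is small enough to express as a lookup. The following is a minimal sketch, not part of the harness: `toVersionToken` is a hypothetical helper name, and the `MAJOR.MINOR.PATCH` check on typed versions is an assumption about what counts as a valid answer.

```javascript
// Hypothetical helper: maps the AskUserQuestion answer to the VERSION token
// passed to audit.sh. The label strings mirror the options listed above.
const VERSION_TOKENS = new Map([
  ["Local dev build", "local"],
  ["Latest published", "latest"],
]);

function toVersionToken(answer) {
  const trimmed = answer.trim();
  if (VERSION_TOKENS.has(trimmed)) return VERSION_TOKENS.get(trimmed);
  // Assumption: a typed version looks like MAJOR.MINOR.PATCH (e.g. 0.7.10).
  if (/^\d+\.\d+\.\d+$/.test(trimmed)) return trimmed;
  throw new Error(`Unrecognized version answer: ${answer}`);
}

console.log(toVersionToken("Latest published")); // latest
console.log(toVersionToken("0.7.10")); // 0.7.10
```

Rejecting anything that is neither a known label nor a version-shaped string keeps a stray free-text answer from reaching the script as a VERSION argument.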
34+
35+
**Step 2 — language.** Read `.claude/skills/audit/corpus.json`. Ask with
36+
`AskUserQuestion` which language to test, listing the languages that have entries.
37+
38+
**Step 3 — repo.** From the chosen language's entries, ask which repo. Label each
39+
option with its size and file count, e.g. `excalidraw — Medium (~600 files)`.
40+
Each entry carries the `repo` URL and a representative `question`.
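
As an illustration of the labelling in Step 3, here is a small sketch that builds option labels from corpus entries. `optionLabel` is a hypothetical name; the entries are copied from the TypeScript list in `corpus.json` with `repo` and `question` omitted for brevity.

```javascript
// Entries copied from corpus.json (TypeScript); repo/question left out here.
const entries = [
  { name: "ky", size: "Small", files: "~25" },
  { name: "excalidraw", size: "Medium", files: "~600" },
  { name: "vscode", size: "Large", files: "~10000" },
];

// Hypothetical helper: builds a label in the shape shown in Step 3.
// \u2014 is the dash used in the example label.
function optionLabel(entry) {
  return `${entry.name} \u2014 ${entry.size} (${entry.files} files)`;
}

for (const entry of entries) {
  console.log(optionLabel(entry));
}
```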
41+
42+
**Step 4 — harness.** Ask with `AskUserQuestion` which harness to run, and map
43+
the answer to a MODE token:
44+
- "Headless" → `headless` (`claude -p` with stream-json): exact tokens/cost and a
45+
clean tool sequence (2 runs, fast, no TTY).
46+
- "Interactive (tmux)" → `tmux` — drives the real Claude TUI in tmux: faithful
47+
Explore-subagent behavior, metrics from session logs (2 runs, slower).
48+
- "Both" → `all` — headless + interactive (4 runs).
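
The harness choice can be sketched the same way, pairing each MODE token with the run count stated in the list above. `HARNESS_MODES` and `toMode` are hypothetical names used only for illustration.

```javascript
// Hypothetical lookup: answer label -> MODE token plus the number of runs
// that the list above says each mode performs.
const HARNESS_MODES = new Map([
  ["Headless", { mode: "headless", runs: 2 }],
  ["Interactive (tmux)", { mode: "tmux", runs: 2 }],
  ["Both", { mode: "all", runs: 4 }],
]);

function toMode(answer) {
  const entry = HARNESS_MODES.get(answer.trim());
  if (!entry) throw new Error(`Unrecognized harness answer: ${answer}`);
  return entry;
}

console.log(toMode("Both")); // { mode: 'all', runs: 4 }
```

Carrying the run count alongside the token makes it easy to tell the user up front how many agent runs the audit will start.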
49+
50+
**Step 5 — run.** Launch in the background (sets the version, clones if missing,
51+
wipes + re-indexes, runs the chosen arms — several minutes):
52+
```bash
53+
scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
54+
```
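
Because the question contains spaces and punctuation, one cautious way to launch the script from Node is to pass the arguments as an array rather than a single shell string. The sketch below only assembles that array; `buildAuditArgs` is a hypothetical helper, and the entry is the `ky` record from `corpus.json`.

```javascript
// Hypothetical helper: assembles the audit.sh invocation in the argument
// order shown above: <VERSION> <repo-name> <repo-url> "<question>" <MODE>.
function buildAuditArgs(version, entry, mode) {
  return [
    "scripts/agent-eval/audit.sh",
    version,
    entry.name,
    entry.repo,
    entry.question, // kept as one array element, so no shell quoting is needed
    mode,
  ];
}

const ky = {
  name: "ky",
  repo: "https://github.com/sindresorhus/ky",
  question: "How does ky implement request retries and timeouts?",
};

console.log(buildAuditArgs("local", ky, "headless"));
```

An array of this shape can be split into a command and its arguments for `child_process.spawn(command, args)`, which hands each element to the process without shell interpretation.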
55+
56+
**Step 6 — report.** When the job finishes, read the log and report per arm:
57+
- Headless (`parse-run.mjs`): total tool calls, file `Read`s, Grep/Bash,
58+
codegraph-tool calls, duration, **total cost**.
59+
- Interactive (`parse-session.mjs`): the `VERDICT: codegraph_explore used Nx |
60+
Read N | Grep/Bash N` and `TOKENS:` lines.
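
When reading the interactive log, the VERDICT line can be pulled apart with a regular expression. This is a sketch under the assumption that the line appears exactly in the template quoted above with non-negative integers for each `N`; `parseVerdict` is a hypothetical name, and the sample log text is made up purely to exercise the parser.

```javascript
// Matches the template: VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N
const VERDICT_RE =
  /^VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)/;

// Hypothetical helper: returns the three counts, or null if no VERDICT line.
function parseVerdict(logText) {
  for (const line of logText.split("\n")) {
    const match = VERDICT_RE.exec(line.trim());
    if (match) {
      return {
        explore: Number(match[1]),
        read: Number(match[2]),
        grepBash: Number(match[3]),
      };
    }
  }
  return null;
}

// Illustrative input only; these numbers are not real results.
const sampleLog = [
  "some earlier output",
  "VERDICT: codegraph_explore used 3x | Read 2 | Grep/Bash 1",
].join("\n");

console.log(parseVerdict(sampleLog)); // { explore: 3, read: 2, grepBash: 1 }
```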
61+
62+
Lead with cost + tool/Read counts — they are the reliable signals; raw token
63+
in/out are confounded by subagent delegation and prompt caching. State whether
64+
codegraph reduced effort and whether both arms reached a correct answer.
65+
66+
## Notes
67+
- The index is rebuilt every run (`audit.sh` wipes `.codegraph`) — different
68+
versions extract differently, so an index must be served by the same binary
69+
that built it.
70+
- `audit.sh` temporarily mutates the global `codegraph` install for the test,
71+
then restores your dev link via `local-install.sh`.
72+
- Corpus repos are cloned to `/tmp/codegraph-corpus` (reused if already present).
73+
- Add or edit repos in `corpus.json` (fields: `name`, `repo`, `size`, `files`,
74+
`question`).

.claude/skills/audit/corpus.json

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
{
2+
"_comment": "Test corpus for /audit. Add entries freely. size: Small (<~150 files), Medium (~150-1500), Large (>~1500). 'question' is a representative architectural question that exercises cross-file understanding.",
3+
"TypeScript": [
4+
{ "name": "ky", "repo": "https://github.com/sindresorhus/ky", "size": "Small", "files": "~25", "question": "How does ky implement request retries and timeouts?" },
5+
{ "name": "excalidraw", "repo": "https://github.com/excalidraw/excalidraw", "size": "Medium", "files": "~600", "question": "How does Excalidraw render and update canvas elements?" },
6+
{ "name": "vscode", "repo": "https://github.com/microsoft/vscode", "size": "Large", "files": "~10000", "question": "How does the extension host communicate with the main process?" }
7+
],
8+
"JavaScript": [
9+
{ "name": "express", "repo": "https://github.com/expressjs/express", "size": "Small", "files": "~50", "question": "How does Express route a request through its middleware stack?" }
10+
],
11+
"Go": [
12+
{ "name": "cobra", "repo": "https://github.com/spf13/cobra", "size": "Small", "files": "~50", "question": "How does cobra parse commands and flags?" },
13+
{ "name": "gin", "repo": "https://github.com/gin-gonic/gin", "size": "Medium", "files": "~150", "question": "How does gin route requests through its middleware chain?" },
14+
{ "name": "terraform", "repo": "https://github.com/hashicorp/terraform", "size": "Large", "files": "~4000", "question": "How does Terraform build and walk the resource dependency graph?" }
15+
],
16+
"Python": [
17+
{ "name": "click", "repo": "https://github.com/pallets/click", "size": "Small", "files": "~60", "question": "How does click parse command-line arguments into commands?" },
18+
{ "name": "flask", "repo": "https://github.com/pallets/flask", "size": "Medium", "files": "~90", "question": "How does Flask dispatch a request to a view function?" },
19+
{ "name": "django", "repo": "https://github.com/django/django", "size": "Large", "files": "~2700", "question": "How does Django's ORM build and execute a query from a QuerySet?" }
20+
],
21+
"Rust": [
22+
{ "name": "clap", "repo": "https://github.com/clap-rs/clap", "size": "Medium", "files": "~200", "question": "How does clap parse arguments against a derived command definition?" },
23+
{ "name": "tokio", "repo": "https://github.com/tokio-rs/tokio", "size": "Large", "files": "~700", "question": "How does tokio schedule and run async tasks on its runtime?" },
24+
{ "name": "deno", "repo": "https://github.com/denoland/deno", "size": "Large", "files": "~1500", "question": "How does Deno load and execute a TypeScript module?" }
25+
],
26+
"Java": [
27+
{ "name": "gson", "repo": "https://github.com/google/gson", "size": "Medium", "files": "~200", "question": "How does Gson serialize an object to JSON?" },
28+
{ "name": "okhttp", "repo": "https://github.com/square/okhttp", "size": "Medium", "files": "~640", "question": "How does OkHttp process a request through its interceptor chain?" },
29+
{ "name": "guava", "repo": "https://github.com/google/guava", "size": "Large", "files": "~3000", "question": "How does Guava's CacheBuilder build and configure a cache?" }
30+
],
31+
"Kotlin": [
32+
{ "name": "koin", "repo": "https://github.com/InsertKoinIO/koin", "size": "Medium", "files": "~300", "question": "How does Koin resolve and inject dependencies?" },
33+
{ "name": "leakcanary", "repo": "https://github.com/square/leakcanary", "size": "Medium", "files": "~250", "question": "How does LeakCanary detect and analyze a memory leak?" }
34+
],
35+
"Swift": [
36+
{ "name": "alamofire", "repo": "https://github.com/Alamofire/Alamofire", "size": "Small", "files": "~100", "question": "How does Alamofire build, send, and validate a request?" }
37+
],
38+
"C#": [
39+
{ "name": "serilog", "repo": "https://github.com/serilog/serilog", "size": "Medium", "files": "~250", "question": "How does Serilog route a log event to its sinks?" },
40+
{ "name": "jellyfin", "repo": "https://github.com/jellyfin/jellyfin", "size": "Large", "files": "~2500", "question": "How does Jellyfin scan and identify items in a media library?" }
41+
],
42+
"Ruby": [
43+
{ "name": "sinatra", "repo": "https://github.com/sinatra/sinatra", "size": "Small", "files": "~60", "question": "How does Sinatra match a request to a route handler?" },
44+
{ "name": "discourse", "repo": "https://github.com/discourse/discourse", "size": "Large", "files": "~3000", "question": "How does Discourse create and render a new post?" }
45+
],
46+
"PHP": [
47+
{ "name": "slim", "repo": "https://github.com/slimphp/Slim", "size": "Small", "files": "~80", "question": "How does Slim handle a request through its middleware?" },
48+
{ "name": "laravel", "repo": "https://github.com/laravel/framework", "size": "Large", "files": "~3000", "question": "How does Laravel resolve and dispatch a route to a controller?" }
49+
],
50+
"C": [
51+
{ "name": "redis", "repo": "https://github.com/redis/redis", "size": "Large", "files": "~600", "question": "How does Redis parse and dispatch a client command?" }
52+
],
53+
"C++": [
54+
{ "name": "json", "repo": "https://github.com/nlohmann/json", "size": "Small", "files": "~100", "question": "How does nlohmann::json parse a JSON string into a value?" },
55+
{ "name": "grpc", "repo": "https://github.com/grpc/grpc", "size": "Large", "files": "~3000", "question": "How does gRPC dispatch an incoming RPC to its handler?" }
56+
],
57+
"Dart": [
58+
{ "name": "flutter", "repo": "https://github.com/flutter/flutter", "size": "Large", "files": "~6000", "question": "How does Flutter build and lay out a widget tree?" }
59+
],
60+
"Svelte": [
61+
{ "name": "shadcn-svelte", "repo": "https://github.com/huntabyte/shadcn-svelte", "size": "Medium", "files": "~600", "question": "How do shadcn-svelte components compose and apply their styling?" }
62+
]
63+
}
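
The `_comment` field describes approximate size bands (Small below about 150 files, Medium about 150 to 1500, Large above about 1500). A quick consistency check against those bands can be sketched as follows. The function names are hypothetical, the cut-offs are treated as exact only for the purpose of the sketch, and the sample is the Python list from the file with `repo` and `question` omitted. Since the bands are approximate, a flagged entry is a prompt to review the label, not necessarily an error.

```javascript
// Strips the "~" prefix and returns the numeric file count.
function parseFileCount(files) {
  return Number(files.replace(/[^\d]/g, ""));
}

// Assumption: the approximate bands from "_comment" used as hard cut-offs.
function expectedSize(count) {
  if (count < 150) return "Small";
  if (count <= 1500) return "Medium";
  return "Large";
}

// Hypothetical helper: lists entries whose size label differs from the band.
function sizeMismatches(corpus) {
  const mismatches = [];
  for (const [language, entries] of Object.entries(corpus)) {
    if (!Array.isArray(entries)) continue; // skips the "_comment" string
    for (const entry of entries) {
      const expected = expectedSize(parseFileCount(entry.files));
      if (expected !== entry.size) {
        mismatches.push({ language, name: entry.name, size: entry.size, expected });
      }
    }
  }
  return mismatches;
}

const sample = {
  _comment: "size bands are approximate",
  Python: [
    { name: "click", size: "Small", files: "~60" },
    { name: "flask", size: "Medium", files: "~90" },
    { name: "django", size: "Large", files: "~2700" },
  ],
};

console.log(sizeMismatches(sample));
```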

.claude/skills/publish/SKILL.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
---
2+
name: publish
3+
description: Publishes a new minor or major release of this npm package (codegraph). Reads the latest version from npm, generates a user-perspective CHANGELOG entry from commits since the last tag, bumps package.json, publishes to npm, and creates the matching GitHub release. Use when the user runs /publish or asks to cut, ship, or publish a release / new version.
4+
---
5+
6+
# Publish a release
7+
8+
Cut a **minor or major** release: generate the changelog, bump, publish to npm, and create the GitHub release. Patch releases are intentionally not offered here.
9+
10+
This skill performs the actual publish (npm publish, git push, GitHub release) — that is the whole point of invoking it, so the general "hand the user the commands" rule does **not** apply inside `/publish`. The **confirmation gate in Step 5 is the safeguard**: never run a step past it without explicit approval.
11+
12+
Run from the repo root.
13+
14+
## Workflow
15+
16+
Copy this checklist and work through it in order:
17+
18+
```
19+
- [ ] 1. Preflight: branch, sync, auth
20+
- [ ] 2. Read base version from npm, compute candidates
21+
- [ ] 3. Ask the user: minor or major
22+
- [ ] 4. Generate the CHANGELOG entry from commits since the last tag
23+
- [ ] 5. CONFIRMATION GATE — show changelog + plan, get explicit approval
24+
- [ ] 6. Write CHANGELOG.md, bump, build
25+
- [ ] 7. Commit + push
26+
- [ ] 8. npm publish
27+
- [ ] 9. scripts/release.sh (GitHub release)
28+
- [ ] 10. Verify on the npm registry
29+
```
30+
31+
### Step 1 — Preflight
32+
33+
```bash
34+
git rev-parse --abbrev-ref HEAD # expect: main
35+
git fetch origin
36+
git status --porcelain # working tree should be clean
37+
git rev-list --left-right --count origin/main...HEAD # "<behind> <ahead>"
38+
npm whoami # npm auth (publish will fail without it)
39+
gh auth status # gh auth (release.sh needs it)
40+
```
41+
42+
- If not on `main`, stop and ask the user to confirm releasing from this branch.
43+
- If behind origin, `git pull --ff-only` so the final push is a fast-forward.
44+
- If the tree has **unrelated** uncommitted changes, stop and ask — the release commit only stages 3 files, but a dirty tree usually means something's mid-flight.
45+
- If `npm whoami` or `gh auth status` fails, stop and tell the user to authenticate.
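
The branch, cleanliness, and sync checks above can be folded into one decision function once the command output has been captured. This is a sketch only: `parseBehindAhead` and `preflightIssues` are hypothetical names, and the inputs are assumed to be the raw stdout of the corresponding `git` commands.

```javascript
// Parses `git rev-list --left-right --count origin/main...HEAD`,
// whose output is "<behind> <ahead>" separated by whitespace.
function parseBehindAhead(output) {
  const [behind, ahead] = output.trim().split(/\s+/).map(Number);
  if (!Number.isInteger(behind) || !Number.isInteger(ahead)) {
    throw new Error(`Unexpected rev-list output: ${JSON.stringify(output)}`);
  }
  return { behind, ahead };
}

// Hypothetical helper: returns a list of human-readable issues to resolve
// before continuing; an empty list means these three checks passed.
function preflightIssues({ branch, porcelain, revList }) {
  const issues = [];
  if (branch.trim() !== "main") {
    issues.push("not on main: confirm releasing from this branch");
  }
  if (porcelain.trim() !== "") {
    issues.push("working tree is dirty: ask before continuing");
  }
  if (parseBehindAhead(revList).behind > 0) {
    issues.push("behind origin: run git pull --ff-only");
  }
  return issues;
}

console.log(preflightIssues({ branch: "main\n", porcelain: "", revList: "2\t0\n" }));
```

The auth checks (`npm whoami`, `gh auth status`) are left out because they are judged by exit status rather than by parsing output.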
46+
47+
### Step 2 — Base version + candidates
48+
49+
The latest **published** version is the source of truth, not local `package.json`.
50+
51+
```bash
52+
PKG=$(node -p "require('./package.json').name")
53+
BASE=$(npm view "$PKG" version)
54+
node -e "const [a,b]=process.argv[1].split('.').map(Number);console.log('minor ->',a+'.'+(b+1)+'.0');console.log('major ->',(a+1)+'.0.0')" "$BASE"
55+
```
56+
57+
Note if local `package.json` differs from `BASE` (an unpublished bump) — surface it, but still base the new version on npm.
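
For readers who find the inline `node -e` one-liner dense, the same candidate computation can be written as a small function. `releaseCandidates` is a hypothetical name; the logic mirrors the one-liner (bump minor and reset patch, or bump major and reset both).

```javascript
// Hypothetical helper: computes the two release candidates from the
// latest published version, e.g. the output of `npm view "$PKG" version`.
function releaseCandidates(base) {
  const [major, minor] = base.trim().split(".").map(Number);
  if (!Number.isInteger(major) || !Number.isInteger(minor)) {
    throw new Error(`Not a MAJOR.MINOR.PATCH version: ${base}`);
  }
  return {
    minor: `${major}.${minor + 1}.0`,
    major: `${major + 1}.0.0`,
  };
}

console.log(releaseCandidates("0.7.10")); // { minor: '0.8.0', major: '1.0.0' }
```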
58+
59+
### Step 3 — Ask minor or major
60+
61+
Use the **AskUserQuestion** tool with the two computed candidates as options (show the resulting version in each label, e.g. "minor → 0.8.0"). Set the new version from the answer.
62+
63+
### Step 4 — Generate the changelog entry
64+
65+
```bash
66+
LAST=$(git describe --tags --abbrev=0 --match 'v*' 2>/dev/null)
67+
git log --no-merges ${LAST:+"${LAST}.."}HEAD --pretty=format:'%h %s'   # with no v* tag, LAST is empty and this lists the full history
68+
```
69+
70+
Read the commit subjects; for any whose user impact is unclear, inspect the diff (`git show <hash>` or `git diff "${LAST}..HEAD" -- <path>`). Then **write the entry yourself** following the repo's conventions in `CLAUDE.md` → "Writing changelog entries":
71+
72+
- Header: `## [X.Y.Z] - YYYY-MM-DD` (get the date with `date +%F`).
73+
- Group under `### Added`, `### Changed`, `### Fixed`, `### Removed`, `### Deprecated`, `### Security`; **omit empty sections**.
74+
- Write from the **user's perspective** (observable capability/symptom), not the implementation. Collapse noisy commits ("fix typo", "address review") into the feature they belong to or drop them.
75+
- Plan the bottom link reference: `[X.Y.Z]: https://github.com/colbymchenry/codegraph/releases/tag/vX.Y.Z`.
76+
77+
Do not write to any file yet — draft it for review first.
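
The entry layout described above (dated header, fixed section order, empty sections omitted, link reference) can be sketched as a pair of functions. `renderEntry` and `linkReference` are hypothetical names, the blank lines around headings are an assumed spacing convention, and the date and bullets in the example are placeholders rather than a real release.

```javascript
// Section order as listed in the conventions above.
const SECTION_ORDER = ["Added", "Changed", "Fixed", "Removed", "Deprecated", "Security"];

// Hypothetical helper: renders one changelog entry as markdown text.
function renderEntry(version, date, sections) {
  const lines = [`## [${version}] - ${date}`];
  for (const title of SECTION_ORDER) {
    const items = sections[title] ?? [];
    if (items.length === 0) continue; // omit empty sections
    lines.push("", `### ${title}`, "", ...items.map((item) => `- ${item}`));
  }
  return lines.join("\n");
}

// Builds the bottom-of-file link reference in the format given above.
function linkReference(version) {
  return `[${version}]: https://github.com/colbymchenry/codegraph/releases/tag/v${version}`;
}

// Placeholder date and bullets, for illustration only.
const draft = renderEntry("0.8.0", "2025-01-01", {
  Added: ["`/audit` skill for benchmarking retrieval quality"],
  Fixed: [],
  Removed: ["Interactive `publish.js` script"],
});

console.log(draft);
console.log(linkReference("0.8.0"));
```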
78+
79+
### Step 5 — CONFIRMATION GATE
80+
81+
Show the user, in chat:
82+
1. The new version (`BASE` → `X.Y.Z`, minor/major).
83+
2. The full drafted changelog entry.
84+
3. The exact actions Steps 6–9 will take (commit + push + npm publish + GitHub release).
85+
86+
Then **STOP**. Proceed only on explicit approval ("yes" / "proceed"). If the user requests prose changes, revise the draft and re-show. Do not run any command below until approved.
87+
88+
### Step 6 — Write changelog, bump, build
89+
90+
1. Use the **Edit** tool to insert the drafted `## [X.Y.Z]` block at the **top** of `CHANGELOG.md` (under the intro, above the previous version), and add the link reference with the other `[x.y.z]:` links at the bottom.
91+
2. Bump (also updates `package-lock.json`; `--allow-same-version` keeps re-runs safe):
92+
```bash
93+
npm version X.Y.Z --no-git-tag-version --allow-same-version
94+
```
95+
3. Build (fail fast before any push/publish):
96+
```bash
97+
npm run build
98+
```
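
The changelog edit in item 1 amounts to two insertions: the new block above the first existing version heading, and the link reference among the existing links. The sketch below shows that on an in-memory string. `insertEntry` is a hypothetical helper, the sample changelog is invented, and placing the new link first among the existing links is an assumption; the skill itself uses the Edit tool rather than code like this.

```javascript
// Hypothetical helper: returns a new changelog string with the entry placed
// above the previous version and the link reference added to the link list.
function insertEntry(changelog, entry, linkRef) {
  const lines = changelog.split("\n");
  const firstVersion = lines.findIndex((line) => line.startsWith("## ["));
  if (firstVersion === -1) throw new Error("no existing version heading found");
  // Insert the new block, followed by a blank separator line.
  lines.splice(firstVersion, 0, ...entry.split("\n"), "");
  // Assumption: link references look like "[x.y.z]: <url>", newest first.
  const firstLink = lines.findIndex((line) => /^\[\d+\.\d+\.\d+\]: /.test(line));
  if (firstLink === -1) lines.push(linkRef);
  else lines.splice(firstLink, 0, linkRef);
  return lines.join("\n");
}

// Invented sample content, for illustration only.
const before = [
  "# Changelog",
  "",
  "## [0.7.10] - 2025-01-01",
  "",
  "### Fixed",
  "",
  "- Example fix",
  "",
  "[0.7.10]: https://example.com/v0.7.10",
].join("\n");

const after = insertEntry(
  before,
  "## [0.8.0] - 2025-02-01\n\n### Added\n\n- Example feature",
  "[0.8.0]: https://example.com/v0.8.0"
);

console.log(after);
```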
99+
100+
### Step 7 — Commit + push
101+
102+
`release.sh` tags HEAD, so the bump must be committed first.
103+
104+
```bash
105+
git add package.json package-lock.json CHANGELOG.md
106+
git commit -m "release: X.Y.Z"
107+
git push
108+
```
109+
110+
### Step 8 — Publish to npm
111+
112+
```bash
113+
npm publish --access public
114+
```
115+
116+
### Step 9 — GitHub release
117+
118+
`scripts/release.sh` reads the `## [X.Y.Z]` block from CHANGELOG.md, tags `vX.Y.Z`, pushes the tag, and creates the GitHub release. It is idempotent.
119+
120+
```bash
121+
./scripts/release.sh
122+
```
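
To make the "reads the `## [X.Y.Z]` block" step concrete, here is one way such an extraction could work. This is not the implementation in `scripts/release.sh` (which is not shown here); `extractEntry` is a hypothetical function, and it assumes a block ends at the next `## [` heading or at the first link-reference line.

```javascript
// Hypothetical helper: returns the body of the "## [version]" block,
// or null when the changelog has no such heading.
function extractEntry(changelog, version) {
  const lines = changelog.split("\n");
  const start = lines.findIndex((line) => line.startsWith(`## [${version}]`));
  if (start === -1) return null;
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    // Stop at the next version heading or at the link-reference list.
    if (lines[i].startsWith("## [") || /^\[[^\]]+\]: /.test(lines[i])) {
      end = i;
      break;
    }
  }
  return lines.slice(start + 1, end).join("\n").trim();
}

// Invented sample content, for illustration only.
const changelogText = [
  "# Changelog",
  "",
  "## [0.8.0] - 2025-02-01",
  "",
  "### Added",
  "",
  "- Example feature",
  "",
  "## [0.7.10] - 2025-01-01",
  "",
  "### Fixed",
  "",
  "- Example fix",
  "",
  "[0.8.0]: https://example.com/v0.8.0",
].join("\n");

console.log(extractEntry(changelogText, "0.8.0"));
```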
123+
124+
### Step 10 — Verify
125+
126+
Confirm against the **registry**, not the website (the website caches):
127+
128+
```bash
129+
npm view "$PKG" version # must equal X.Y.Z
130+
```
131+
132+
Report the release URL (`scripts/release.sh` prints it) and the published version.
133+
134+
## If something fails midway
135+
136+
Re-running is safe: `npm version --allow-same-version` no-ops if already bumped, `git push` no-ops if up to date, and `scripts/release.sh` skips tag/release steps already done. `git commit` exits non-zero when nothing is staged, so skip it in that case (`git diff --cached --quiet` exits 0 when there are no staged changes). Re-run from the failed step.

publish.js

Lines changed: 0 additions & 65 deletions
This file was deleted.
