Commit 9f1a951

New skill to add languages
1 parent 2fc0df7 commit 9f1a951

4 files changed

Lines changed: 426 additions & 0 deletions

.claude/skills/add-lang/SKILL.md

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
name: add-lang
description: Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang <language> or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph.
---

# Add a language to CodeGraph

Wire a new tree-sitter language into codegraph's extraction pipeline, prove it
extracts real symbols on popular repos, and prove it beats no-codegraph for an
agent. Runs **fully autonomously** — pick repos, benchmark, update docs, then
report. **Never commit, push, publish, or tag** (house rule); leave all changes
for the user to review.

The argument is the language token used throughout the `Language` union, e.g.
`lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase
single-token form everywhere (`csharp`, not `c#`).

## Prerequisites
- Run from the codegraph repo root. Requires `node`, `git`, `gh`, and a logged-in
  `claude` CLI (the benchmark spawns real `claude -p` runs).
- The benchmark uses the local dev build — Step 8 builds + links it on PATH.

## Workflow

Copy this checklist and work through it in order:
```
- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar (tree-sitter-wasms vs vendor a .wasm)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 source edits)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit
```

### Step 1 — Resolve + short-circuit

Check whether the language is already wired: look for the token in the
`LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map
(`src/extraction/languages/index.ts`). If it is already supported (e.g.
`typescript`, `rust`), **skip Steps 2–6** and go straight to benchmarking
(Steps 7–8) to validate/measure it — note in the report that no code changed.

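The Step 1 check amounts to two membership tests. A minimal sketch, assuming stand-in copies of the `LANGUAGES` const and the `EXTRACTORS` map (their contents here are illustrative, and the `supportStatus` helper is hypothetical, not part of codegraph):

```typescript
// Stand-ins for the real LANGUAGES const and EXTRACTORS map; contents are illustrative only.
const LANGUAGES = ['typescript', 'rust', 'unknown'] as const;
const EXTRACTORS: Record<string, object> = { typescript: {}, rust: {} };

type SupportStatus = 'supported' | 'partial' | 'missing';

// Hypothetical helper: "supported" only when the token appears in BOTH places.
function supportStatus(token: string): SupportStatus {
  const inLanguages = (LANGUAGES as readonly string[]).indexOf(token) !== -1;
  const inExtractors = Object.prototype.hasOwnProperty.call(EXTRACTORS, token);
  if (inLanguages && inExtractors) return 'supported';
  if (inLanguages || inExtractors) return 'partial';
  return 'missing';
}

console.log(supportStatus('rust')); // already wired: skip to Steps 7–8
console.log(supportStatus('lua'));  // not wired: continue with Step 2
```

A `partial` result (token present in only one of the two places) would suggest a half-finished wiring that Step 4 should complete rather than duplicate.
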
### Step 2 — Find a grammar

```bash
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
```
- **Present** → off-the-shelf. No vendoring; `grammars.ts` resolves it from
  `tree-sitter-wasms` automatically. (Most popular languages are here: lua,
  elixir, zig, ocaml, solidity, toml, yaml, …)
- **Absent** → you must vendor a `.wasm` into `src/extraction/wasm/` (like
  `pascal`/`scala`) and add the token to the vendored branch in Step 4. Get a
  wasm from the grammar's npm package (a prebuilt `*.wasm`) or by building one
  (`npx tree-sitter-cli build --wasm`, which needs emscripten/Docker — the
  `tree-sitter` CLI is usually not on PATH here). **If you cannot obtain a
  wasm, STOP and tell the user** — the language can't be added without it.

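The Present/Absent decision can be sketched as a lookup over two directory listings. This is only an illustration: `grammarFileName` and `grammarSource` are hypothetical helpers, and the `csharp` special case is the one noted in the `grep` comment above.

```typescript
// Tokens whose tree-sitter-wasms file name differs from the codegraph token.
const WASM_SPECIAL: Record<string, string> = { csharp: 'c_sharp' };

// Hypothetical helper: the .wasm file name to look for.
function grammarFileName(token: string): string {
  const key = token.toLowerCase();
  const base = WASM_SPECIAL[key] ?? key;
  return `tree-sitter-${base}.wasm`;
}

type GrammarSource = 'tree-sitter-wasms' | 'vendored' | 'none';

// offTheShelf: file names in node_modules/tree-sitter-wasms/out/
// vendored:    file names in src/extraction/wasm/
function grammarSource(token: string, offTheShelf: string[], vendored: string[]): GrammarSource {
  const file = grammarFileName(token);
  if (offTheShelf.indexOf(file) !== -1) return 'tree-sitter-wasms';
  if (vendored.indexOf(file) !== -1) return 'vendored';
  return 'none'; // nothing to load: stop and tell the user
}

console.log(grammarFileName('csharp'));
console.log(grammarSource('lua', ['tree-sitter-lua.wasm'], []));
```
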
### Step 3 — Discover AST node types

Get a representative source file (write a small sample covering functions,
classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
```bash
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
```
The frequency table + field names (`name:`, `parameters:`, `body:`,
`return_type:`) tell you what to map. Open the existing extractor closest to the
language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits),
`java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts`
(top-level methods + receivers).

### Step 4 — Wire the language (4 edits)

These are exact, fragile wiring — match the existing style precisely:

1. **`src/types.ts`** — add `'<lang>',` to the `LANGUAGES` const (before
   `'unknown'`).
2. **`src/extraction/grammars.ts`** — three maps:
   - `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
   - `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
   - `getLanguageDisplayName`: `<lang>: '<Display Name>',`
   - **vendored only**: add `<lang>` to the
     `(lang === 'pascal' || lang === 'scala')` wasm-path branch.
3. **`src/extraction/languages/<lang>.ts`** — new file exporting
   `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types
   from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`,
   `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`,
   `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`,
   `paramsField`. Add hooks as the grammar needs them (`getSignature`,
   `getVisibility`, `isExported`, `extractImport`, `getReceiverType`,
   `interfaceKind`, `enumMemberTypes`, etc. — see
   `src/extraction/tree-sitter-types.ts`).
4. **`src/extraction/languages/index.ts`** — `import { <lang>Extractor } from
   './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.

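As a rough picture of edit 3, here is a minimal sketch of an extractor for Lua. It assumes a `LanguageExtractor` shape reduced to the required fields listed above, typed as string arrays and strings; the real interface in `src/extraction/tree-sitter-types.ts` may differ. The node-type names are placeholders to be confirmed against the `dump-ast.mjs` output, and `unmappedCategories` is a hypothetical helper.

```typescript
// Assumed shape, reduced to the required fields named in Step 4. The real
// interface lives in src/extraction/tree-sitter-types.ts and may differ.
interface LanguageExtractor {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Node-type names are placeholders for a Lua grammar; replace them with what
// dump-ast.mjs actually reports. The real file would `export` this const.
const luaExtractor: LanguageExtractor = {
  functionTypes: ['function_declaration'],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [],
  callTypes: ['function_call'],
  variableTypes: ['variable_declaration'],
  nameField: 'name',
  bodyField: 'body',
  paramsField: 'parameters',
};

// Hypothetical sanity check: list the categories that still map to nothing.
function unmappedCategories(extractor: LanguageExtractor): string[] {
  const empty: string[] = [];
  for (const key of Object.keys(extractor)) {
    const value = (extractor as unknown as Record<string, unknown>)[key];
    if (Array.isArray(value) && value.length === 0) empty.push(key);
  }
  return empty;
}

console.log(unmappedCategories(luaExtractor));
```

Empty categories are not necessarily a mistake (a language may simply lack interfaces or enums), but listing them makes it easier to tell a deliberate omission from a node type that was overlooked.
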
### Step 5 — Build + verify loop

```bash
npm run build        # tsc + copy-assets (copies any vendored *.wasm into dist/)
```
Index a small sample repo and check extraction:
```bash
( cd <sample-repo> && codegraph init -i )
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
```
`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only
`file`/`import` nodes were produced — the classic symptom of wrong node-type
names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the
mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. **Repeat until
PASS.**

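The failure rule described above can be sketched as a small verdict function. This is not the logic of `verify-extraction.mjs` itself, which is not shown in this commit; the inputs, the `verdict` helper, and especially `THIN_THRESHOLD` are assumptions made for illustration.

```typescript
type Verdict = 'PASS' | 'WARN' | 'FAIL';

// Made-up cutoff for a "thin" result; the real script's WARN rule is not shown here.
const THIN_THRESHOLD = 10;

// filesDetected: files recognized as the language under test.
// nodeCounts:    node kind -> number of nodes extracted from those files.
function verdict(filesDetected: number, nodeCounts: Record<string, number>): Verdict {
  if (filesDetected === 0) return 'FAIL'; // language not detected at all
  let symbols = 0;
  for (const kind of Object.keys(nodeCounts)) {
    // file/import nodes appear even when every node-type name is wrong
    if (kind !== 'file' && kind !== 'import') symbols += nodeCounts[kind] ?? 0;
  }
  if (symbols === 0) return 'FAIL'; // only file/import nodes were produced
  return symbols < THIN_THRESHOLD ? 'WARN' : 'PASS';
}

console.log(verdict(5, { file: 5, import: 12 }));
console.log(verdict(5, { file: 5, function: 40, class: 2 }));
```
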
### Step 6 — Tests

Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
- a `detectLanguage` assertion in `describe('Language Detection')`
- a `describe('<Lang> Extraction')` block asserting functions/classes/imports
  are extracted from an inline source string.
```bash
npx vitest run __tests__/extraction.test.ts
```
Green before continuing.

### Step 7 — Auto-pick 3 repos + corpus

Pick **without asking**. Find candidates, then curate 3 that are genuinely
`<lang>`-dominant, one per size tier:
```bash
gh search repos --language=<lang> --sort=stars --limit 40 \
  --json fullName,stargazerCount,description
```
Tiers (match `corpus.json`): **Small** <~150 files · **Medium** ~150–1500 ·
**Large** >~1500. Skip repos that are tagged `<lang>` but mostly another
language. Write one cross-file architecture **question** per repo (the kind that
needs tracing across files). Add a `"<Language>"` block to
`.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`,
`files`, `question`) so `/agent-eval` can reuse them.

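The tiering and the corpus entry shape can be sketched as follows. The tier boundaries are approximate in the text above, so the exact cutoffs here are an assumption; the `Candidate` type, the helper names, the repo names, and the value formats of `repo` and `size` are likewise hypothetical (check the existing `corpus.json` for the real conventions).

```typescript
type SizeTier = 'Small' | 'Medium' | 'Large';

// Hypothetical candidate record built from the gh search output plus a file count.
interface Candidate { repo: string; stars: number; files: number; }

// Fields named in Step 7; the value formats are assumptions.
interface CorpusEntry { name: string; repo: string; size: SizeTier; files: number; question: string; }

// Assumed cutoffs for the approximate tiers (<~150, ~150–1500, >~1500).
function sizeTier(files: number): SizeTier {
  if (files < 150) return 'Small';
  if (files <= 1500) return 'Medium';
  return 'Large';
}

// Keep the highest-starred candidate in each tier.
function pickOnePerTier(candidates: Candidate[]): Partial<Record<SizeTier, Candidate>> {
  const picked: Partial<Record<SizeTier, Candidate>> = {};
  const byStars = candidates.slice().sort((a, b) => b.stars - a.stars);
  for (const c of byStars) {
    const tier = sizeTier(c.files);
    if (!picked[tier]) picked[tier] = c;
  }
  return picked;
}

function toCorpusEntry(c: Candidate, question: string): CorpusEntry {
  const name = c.repo.split('/').pop() ?? c.repo;
  return { name, repo: c.repo, size: sizeTier(c.files), files: c.files, question };
}

// Made-up candidates, for illustration only.
const picks = pickOnePerTier([
  { repo: 'example/tiny-lib', stars: 900, files: 40 },
  { repo: 'example/mid-app', stars: 5000, files: 600 },
  { repo: 'example/other-mid', stars: 7000, files: 800 },
  { repo: 'example/big-framework', stars: 3000, files: 4200 },
]);
console.log(picks);
```

Star count alone does not establish that a repo is `<lang>`-dominant, so the curation step described above still applies to whatever such a helper returns.
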
### Step 8 — Benchmark all 3 (extraction + A/B)

Make the dev build the codegraph on PATH **once**, then loop:
```bash
npm run build && ./scripts/local-install.sh
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3
```
`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs
`verify-extraction.mjs`, then the with/without retrieval A/B via
`scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken).
Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file
`Read`s, Grep/Bash, codegraph-tool calls, duration, and **cost** — for both the
`with` and `without` arms. After the loop, restore the dev link if needed:
`./scripts/local-install.sh`.

### Step 9 — Docs + CHANGELOG

- **README.md**: add `<Lang>` to the "19+ Languages" feature bullet, and add a
  row to the **Supported Languages** table:
  `| <Lang> | \`.ext\` | Full support (classes, methods, …) |`.
- **CHANGELOG.md**: add an `## [Unreleased]` section at the top (above the
  latest version) with `### Added` → a user-perspective bullet, e.g.
  *"CodeGraph now indexes **<Lang>** (`.ext`) — functions, classes, imports, and
  call edges."* If `## [Unreleased]` already exists, append under it. (`/publish`
  folds this into the next versioned block at release time.)

### Step 10 — Report (do NOT commit)

Summarize for review:
- **Files changed**: the 4 wiring edits (including the new extractor) + tests +
  README + CHANGELOG + corpus.json (+ any vendored `.wasm`).
- **Extraction** per repo: files / nodes / edges / `verify-extraction` result.
- **A/B** per repo: `with` vs `without` (tool calls, file Reads, cost) and a
  one-line verdict — did codegraph reduce effort, and did both arms reach a
  correct answer?
- **Gaps / follow-ups** (node types not yet mapped, resolution edges missing,
  framework routes, etc.).

Hand the changes to the user. **Do not** run `git commit`/`push`,
`npm publish`, or `scripts/release.sh`.

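For the A/B line of the report, the comparison between arms can be sketched as a percent change per metric. The `ArmMetrics` shape and both helpers are hypothetical, the numbers are made up, and the actual `parse-run.mjs` summary format is not shown in this commit.

```typescript
// Hypothetical per-arm summary, covering the three metrics named in the report.
interface ArmMetrics { toolCalls: number; fileReads: number; costUsd: number; }

// Percent change of the `with` arm relative to `without`; negative means codegraph used less.
function percentChange(withValue: number, withoutValue: number): number | null {
  if (withoutValue === 0) return null; // no baseline to compare against
  return ((withValue - withoutValue) / withoutValue) * 100;
}

function compareArms(withArm: ArmMetrics, withoutArm: ArmMetrics) {
  return {
    toolCalls: percentChange(withArm.toolCalls, withoutArm.toolCalls),
    fileReads: percentChange(withArm.fileReads, withoutArm.fileReads),
    costUsd: percentChange(withArm.costUsd, withoutArm.costUsd),
  };
}

// Made-up numbers, for illustration only.
const delta = compareArms(
  { toolCalls: 10, fileReads: 4, costUsd: 0.5 },
  { toolCalls: 20, fileReads: 8, costUsd: 1.0 },
);
console.log(delta);
```

The deltas only measure effort; whether each arm reached a correct answer still has to be judged separately, as the verdict bullet above asks.
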
## Notes
- The A/B spawns real **paid** `claude -p` runs (opus, `--max-budget-usd`),
  2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with
  `/agent-eval`, so clones are reused across runs.
- Any new `*.wasm` must live in `src/extraction/wasm/`, where `copy-assets` (run by
  `npm run build`) picks it up; otherwise it won't be in `dist/`.
- An index must be served by the **same** binary that built it. Step 8 builds +
  links the dev build first, so this holds.
- If a grammar can't be obtained, or extraction can't reach PASS, **STOP and
  report** — don't ship a half-wired language.

scripts/add-lang/bench.sh

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
# Add-lang benchmark for ONE repo:
#   clone -> wipe+index (with the codegraph on PATH) -> verify extraction ->
#   with/without retrieval A/B (reuses scripts/agent-eval/run-all.sh).
#
# Assumes the codegraph dev build is already built + linked on PATH — the skill
# runs `npm run build && ./scripts/local-install.sh` ONCE before looping repos.
# The A/B is skipped if extraction fails its critical checks (don't burn $ on a
# broken extractor); set FORCE_AB=1 to run it anyway.
#
# Usage: bench.sh <lang> <repo-name> <repo-url> "<question>" [headless|tmux|all]
# Env:   CORPUS  corpus dir (default /tmp/codegraph-corpus, shared with agent-eval)
set -uo pipefail

LANG_TOKEN="${1:?usage: bench.sh <lang> <repo-name> <repo-url> \"<question>\" [mode]}"
NAME="${2:?repo-name required}"
URL="${3:?repo-url required}"
Q="${4:?question required}"
MODE="${5:-headless}"

HARNESS="$(cd "$(dirname "$0")" && pwd)"
AGENT_EVAL="$(cd "$HARNESS/../agent-eval" && pwd)"
CORPUS="${CORPUS:-/tmp/codegraph-corpus}"
REPO="$CORPUS/$NAME"

command -v codegraph >/dev/null || { echo "no codegraph on PATH (build + ./scripts/local-install.sh first)"; exit 1; }

echo "==================== add-lang bench: $NAME ($LANG_TOKEN) ===================="
echo "codegraph: $(command -v codegraph) -> $(codegraph --version 2>/dev/null || echo '?')"

# 1. Ensure the repo (shallow clone, reuse if present).
mkdir -p "$CORPUS"
if [ -d "$REPO/.git" ]; then
  echo "→ reusing checkout: $REPO"
else
  echo "→ cloning $URL"
  git clone --depth 1 "$URL" "$REPO" || { echo "git clone failed"; exit 1; }
fi

# 2. Wipe + index with the binary under test.
echo "→ wiping .codegraph and indexing"
rm -rf "$REPO/.codegraph"
( cd "$REPO" && codegraph init -i ) || { echo "indexing failed"; exit 1; }

# 3. Verify extraction (cheap guard before the paid A/B).
echo "→ verifying extraction"
node "$HARNESS/verify-extraction.mjs" "$REPO" "$LANG_TOKEN"
VERIFY=$?

# 4. Retrieval A/B (skipped if extraction is broken, unless FORCE_AB=1).
if [ "$VERIFY" -ne 0 ] && [ "${FORCE_AB:-0}" != "1" ]; then
  echo "→ SKIPPING A/B — extraction failed critical checks (set FORCE_AB=1 to override)"
else
  echo "→ retrieval A/B (mode=$MODE)"
  bash "$AGENT_EVAL/run-all.sh" "$REPO" "$Q" "$MODE"
fi

echo "==================== bench complete: $NAME (verify exit=$VERIFY) ===================="
# Exit reflects extraction: 0 = pass/warn, 1 = critical fail, 2 = couldn't read status.
exit "$VERIFY"

scripts/add-lang/dump-ast.mjs

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
#!/usr/bin/env node
// Dump the tree-sitter AST for a sample file so you can write a LanguageExtractor
// mapping. Loads a grammar .wasm directly via web-tree-sitter (the same runtime
// codegraph uses) — you do NOT need to register the language first.
//
// Usage:
//   node scripts/add-lang/dump-ast.mjs <lang|wasm-path> <sample-file> [--depth=N] [--full]
// Examples:
//   node scripts/add-lang/dump-ast.mjs lua sample.lua
//   node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-zig.wasm a.zig --depth=4
//
// Output: an indented AST (named nodes, with field names) followed by a
// node-type FREQUENCY table. The frequency table is the payoff — it tells you
// which node types to map to functionTypes / classTypes / importTypes / etc.

import { readFileSync, existsSync } from 'node:fs';
import { createRequire } from 'node:module';
import { basename } from 'node:path';
import { Parser, Language } from 'web-tree-sitter';

const require = createRequire(import.meta.url);
const fail = (msg) => { console.error(`[dump-ast] ${msg}`); process.exit(1); };

const argv = process.argv.slice(2);
const positional = argv.filter((a) => !a.startsWith('--'));
const [langOrWasm, sampleFile] = positional;
const depthFlag = argv.find((a) => a.startsWith('--depth='));
const showAll = argv.includes('--full'); // also print anonymous (token) nodes
const maxDepth = depthFlag ? parseInt(depthFlag.split('=')[1], 10) : (showAll ? Infinity : 8);

if (!langOrWasm || !sampleFile) {
  fail('usage: dump-ast.mjs <lang|wasm-path> <sample-file> [--depth=N] [--full]');
}
// A non-numeric --depth parses to NaN, which would silently suppress the whole tree.
if (Number.isNaN(maxDepth)) fail(`invalid --depth value: ${depthFlag}`);
if (!existsSync(sampleFile)) fail(`sample file not found: ${sampleFile}`);

// Language tokens whose tree-sitter-wasms filename differs from the token.
const WASM_SPECIAL = { csharp: 'c_sharp', 'c#': 'c_sharp' };

// Resolve a language token (or explicit .wasm path) to a grammar file on disk.
function resolveWasm(token) {
  if (token.endsWith('.wasm')) {
    if (!existsSync(token)) fail(`wasm not found: ${token}`);
    return token;
  }
  const base = WASM_SPECIAL[token.toLowerCase()] ?? token.toLowerCase();
  try {
    return require.resolve(`tree-sitter-wasms/out/tree-sitter-${base}.wasm`);
  } catch {
    /* not in tree-sitter-wasms — try a vendored copy */
  }
  const vendored = `src/extraction/wasm/tree-sitter-${base}.wasm`;
  if (existsSync(vendored)) return vendored;
  fail(
    `no grammar for "${token}" — not in tree-sitter-wasms and not vendored at ` +
    `${vendored}. Pass an explicit .wasm path, or vendor one (see SKILL.md "Find a grammar").`
  );
}

const wasmPath = resolveWasm(langOrWasm);
const source = readFileSync(sampleFile, 'utf8');

// Initialize the runtime; fall back to an explicit path for its own .wasm.
try {
  await Parser.init();
} catch {
  await Parser.init({ locateFile: () => require.resolve('web-tree-sitter/tree-sitter.wasm') });
}

let language;
try {
  language = await Language.load(wasmPath);
} catch (e) {
  fail(`failed to load grammar ${wasmPath}: ${e.message}`);
}

const parser = new Parser();
parser.setLanguage(language);
const tree = parser.parse(source);

const freq = new Map();
// Collapse whitespace and truncate leaf text so each node fits on one line.
const snippet = (node) => {
  const t = node.text.replace(/\s+/g, ' ').trim();
  return t.length > 48 ? `${t.slice(0, 48)}…` : t;
};

// Count every named node, but only print nodes within the depth limit.
function walk(node, depth, fieldName) {
  if (node.isNamed) freq.set(node.type, (freq.get(node.type) || 0) + 1);
  if ((node.isNamed || showAll) && depth <= maxDepth) {
    const field = fieldName ? `${fieldName}: ` : '';
    const leaf = node.childCount === 0 ? ` "${snippet(node)}"` : '';
    console.log(`${' '.repeat(depth)}${field}${node.type} @${node.startPosition.row + 1}:${node.startPosition.column}${leaf}`);
  }
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (child) walk(child, depth + 1, node.fieldNameForChild(i));
  }
}

console.log(`\n# AST for ${sampleFile} (grammar: ${basename(wasmPath)})\n`);
walk(tree.rootNode, 0, null);

console.log('\n# Node-type frequency (named nodes) — map the relevant ones in your extractor:\n');
[...freq.entries()]
  .sort((a, b) => b[1] - a[1])
  .forEach(([type, n]) => console.log(` ${String(n).padStart(5)} ${type}`));
console.log();
