| 1 | +--- |
| 2 | +name: add-lang |
| 3 | +description: Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang <language> or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph. |
| 4 | +--- |
| 5 | + |
| 6 | +# Add a language to CodeGraph |
| 7 | + |
| 8 | +Wire a new tree-sitter language into codegraph's extraction pipeline, prove it |
| 9 | +extracts real symbols on popular repos, and prove it beats no-codegraph for an |
| 10 | +agent. Runs **fully autonomously** — pick repos, benchmark, update docs, then |
| 11 | +report. **Never commit, push, publish, or tag** (house rule); leave all changes |
| 12 | +for the user to review. |
| 13 | + |
| 14 | +The argument is the language token used throughout the `Language` union, e.g. |
| 15 | +`lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase |
| 16 | +single-token form everywhere (`csharp`, not `c#`). |
| 17 | + |
| 18 | +## Prerequisites |
| 19 | +- Run from the codegraph repo root. Requires `node`, `git`, `gh`, and a logged-in
| 20 | + `claude` CLI (the benchmark spawns real `claude -p` runs).
| 21 | +- The benchmark uses the local dev build — Step 8 builds + links it on PATH. |
| 22 | + |
| 23 | +## Workflow |
| 24 | + |
| 25 | +Copy this checklist and work through it in order: |
| 26 | +``` |
| 27 | +- [ ] 1. Resolve language; bail early if already supported (just benchmark) |
| 28 | +- [ ] 2. Find a grammar + health-check it (ABI / heap corruption) |
| 29 | +- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs) |
| 30 | +- [ ] 4. Wire the language (4 files; sometimes a 5th core touch) |
| 31 | +- [ ] 5. Build + verify-extraction loop until PASS |
| 32 | +- [ ] 6. Add extraction tests; make them green |
| 33 | +- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json |
| 34 | +- [ ] 8. Benchmark all 3: extraction + with/without A/B |
| 35 | +- [ ] 9. Update README + CHANGELOG |
| 36 | +- [ ] 10. Report; do NOT commit |
| 37 | +``` |
| 38 | + |
| 39 | +### Step 1 — Resolve + short-circuit |
| 40 | + |
| 41 | +Check whether the language is already wired: look for the token in the |
| 42 | +`LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map |
| 43 | +(`src/extraction/languages/index.ts`). If it is already supported (e.g. |
| 44 | +`typescript`, `rust`), **skip Steps 2–6** and go straight to benchmarking |
| 45 | +(Steps 7–8) to validate/measure it — note in the report that no code changed. |
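
The check above can be sketched as a small shell helper. This is only an illustration: the function name `is_language_wired` is hypothetical, and the grep patterns assume entries look like `'<lang>'` in `LANGUAGES` and `<lang>: …` in `EXTRACTORS`, which may not match the real formatting.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): succeed only if the token appears in both
# wiring files. The patterns are assumptions about how entries are written.
is_language_wired() {
  local lang="$1"
  local root="${2:-.}"   # repo root; defaults to the current directory
  # A quoted token such as 'lua' in the LANGUAGES const ...
  grep -qF "'${lang}'" "${root}/src/types.ts" &&
    # ... and a key such as `lua:` in the EXTRACTORS map.
    grep -qE "(^|[^[:alnum:]_])${lang}:" "${root}/src/extraction/languages/index.ts"
}

# Usage (from the repo root):
#   if is_language_wired lua; then echo "already wired: skip to Step 7"; fi
```

A match in only one of the two files suggests a half-wired language, so treat that case as "not supported" and inspect the files by hand.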
| 46 | + |
| 47 | +### Step 2 — Find a grammar, then health-check it |
| 48 | + |
| 49 | +```bash |
| 50 | +ls node_modules/tree-sitter-wasms/out/ | grep -i <lang> # csharp -> c_sharp |
| 51 | +``` |
| 52 | +- **Present** → likely off-the-shelf; `grammars.ts` resolves it from |
| 53 | + `tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml, |
| 54 | + solidity, toml, yaml, …) |
| 55 | +- **Absent** → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal` / |
| 56 | + `scala` / `lua`) and add the token to the vendored branch in Step 4. |
| 57 | + |
| 58 | +**Always health-check before writing an extractor — a *present* grammar can |
| 59 | +still be unusable:** |
| 60 | +```bash |
| 61 | +node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext> |
| 62 | +``` |
| 63 | +It prints the grammar's ABI version and parses a valid sample many times in a
| 64 | +multi-grammar runtime. A **FAIL** means ERROR trees on valid code: an old-ABI
| 65 | +grammar is corrupting the shared WASM heap, which silently drops nested
| 66 | +calls/imports on every file after the first (e.g. the tree-sitter-wasms **Lua**
| 67 | +grammar is ABI 13 and fails). Do NOT use that wasm. **Vendor a newer (ABI 14/15) build instead:**
| 68 | +```bash |
| 69 | +npm pack @tree-sitter-grammars/tree-sitter-<lang> # often ships a prebuilt *.wasm |
| 70 | +# or build one: npx tree-sitter build --wasm (needs Docker/emscripten) |
| 71 | +cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm |
| 72 | +``` |
| 73 | +then add the token to the vendored branch in Step 4 and re-run check-grammar on |
| 74 | +the vendored path until it PASSes. **If you cannot obtain a healthy wasm, STOP |
| 75 | +and tell the user.** |
| 76 | + |
| 77 | +### Step 3 — Discover AST node types |
| 78 | + |
| 79 | +Get a representative source file (write a small sample covering functions, |
| 80 | +classes/structs, imports, enums; or `curl` a raw file from a known repo), then: |
| 81 | +```bash |
| 82 | +node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext> |
| 83 | +# vendored grammar: pass the wasm path instead of the token |
| 84 | +node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext> |
| 85 | +``` |
| 86 | +The frequency table + field names (`name:`, `parameters:`, `body:`, |
| 87 | +`return_type:`) tell you what to map. Open the existing extractor closest to the |
| 88 | +language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits), |
| 89 | +`java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts` |
| 90 | +(top-level methods + receivers). |
| 91 | + |
| 92 | +### Step 4 — Wire the language (4 files) |
| 93 | + |
| 94 | +These are exact, fragile wiring — match the existing style precisely: |
| 95 | + |
| 96 | +1. **`src/types.ts`** — TWO edits: |
| 97 | + - add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`); |
| 98 | + - add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. **Don't skip this** — it's |
| 99 | + the file-scan allowlist; without the glob, `codegraph init` finds **0 |
| 100 | + files** even though detection/extraction are wired. |
| 101 | +2. **`src/extraction/grammars.ts`** — three maps: |
| 102 | + - `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',` |
| 103 | + - `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`) |
| 104 | + - `getLanguageDisplayName`: `<lang>: '<Display Name>',` |
| 105 | + - **vendored only**: add `<lang>` to the |
| 106 | + `(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch. |
| 107 | +3. **`src/extraction/languages/<lang>.ts`** — new file exporting |
| 108 | + `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types |
| 109 | + from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`, |
| 110 | + `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`, |
| 111 | + `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`, |
| 112 | + `paramsField`. Add hooks as the grammar needs them (`getSignature`, |
| 113 | + `getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`, |
| 114 | + `interfaceKind`, `enumMemberTypes`, etc. — see |
| 115 | + `src/extraction/tree-sitter-types.ts`). |
| 116 | +4. **`src/extraction/languages/index.ts`** — `import { <lang>Extractor } from |
| 117 | + './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`. |
| 118 | + |
| 119 | +**Sometimes a 5th, core touch in `src/extraction/tree-sitter.ts`** — variable |
| 120 | +extraction has per-language branches in `extractVariable` (the generic fallback |
| 121 | +only finds direct `identifier`/`variable_declarator` children). If the grammar |
| 122 | +nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a |
| 123 | +`} else if (this.language === '<lang>')` branch there, mirroring the existing |
| 124 | +ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby `require` |
| 125 | +is a *call*) are handled in the extractor's `visitNode` hook instead. |
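
Before building, a quick grep can catch the wiring pieces that are easiest to forget, in particular the `DEFAULT_CONFIG.include` glob. The sketch below is an assumption-laden illustration: `missing_wiring` is a hypothetical helper, and the fixed-string patterns assume the entries are written exactly as shown in the list above.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): print one line per wiring piece that looks
# missing. Empty output means all three checks matched.
missing_wiring() {
  local lang="$1" ext="$2" root="${3:-.}"
  # The file-scan allowlist glob, e.g. '**/*.lua'
  grep -qF "'**/*.${ext}'" "${root}/src/types.ts" ||
    echo "src/types.ts: DEFAULT_CONFIG.include lacks '**/*.${ext}'"
  # The extension mapping, e.g. '.lua': 'lua'
  grep -qF "'.${ext}': '${lang}'" "${root}/src/extraction/grammars.ts" ||
    echo "src/extraction/grammars.ts: EXTENSION_MAP lacks '.${ext}': '${lang}'"
  # The new extractor file itself
  [ -f "${root}/src/extraction/languages/${lang}.ts" ] ||
    echo "src/extraction/languages/${lang}.ts: extractor file not found"
}

# Usage (from the repo root):  missing_wiring lua lua
```

It does not replace the Step 5 verify loop; it only flags the cheap-to-detect omissions before a build.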
| 126 | + |
| 127 | +### Step 5 — Build + verify loop |
| 128 | + |
| 129 | +```bash |
| 130 | +npm run build # tsc + copy-assets (copies any vendored *.wasm into dist/) |
| 131 | +``` |
| 132 | +Index a small sample repo and check extraction: |
| 133 | +```bash |
| 134 | +( cd <sample-repo> && codegraph init -i ) |
| 135 | +node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang> |
| 136 | +``` |
| 137 | +`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only |
| 138 | +`file`/`import` nodes were produced — the classic symptom of wrong node-type |
| 139 | +names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the |
| 140 | +mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. **Repeat until |
| 141 | +PASS.** |
| 142 | + |
| 143 | +### Step 6 — Tests |
| 144 | + |
| 145 | +Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block: |
| 146 | +- a `detectLanguage` assertion in `describe('Language Detection')` |
| 147 | +- a `describe('<Lang> Extraction')` block asserting functions/classes/imports |
| 148 | + are extracted from an inline source string. |
| 149 | +```bash |
| 150 | +npx vitest run __tests__/extraction.test.ts |
| 151 | +``` |
| 152 | +The tests must be green before continuing.
| 153 | + |
| 154 | +### Step 7 — Auto-pick 3 repos + corpus |
| 155 | + |
| 156 | +Pick **without asking**. Find candidates, then curate 3 that are genuinely |
| 157 | +`<lang>`-dominant, one per size tier: |
| 158 | +```bash |
| 159 | +gh search repos --language=<lang> --sort=stars --limit 40 \ |
| 160 | + --json fullName,stargazerCount,description |
| 161 | +``` |
| 162 | +Tiers (match `corpus.json`): **Small** <~150 files · **Medium** ~150–1500 · |
| 163 | +**Large** >~1500. Skip repos that are tagged `<lang>` but mostly another |
| 164 | +language. Write one cross-file architecture **question** per repo (the kind that |
| 165 | +needs tracing across files). Add a `"<Language>"` block to |
| 166 | +`.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`, |
| 167 | +`files`, `question`) so `/agent-eval` can reuse them. |
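
To assign a tier, count the `<lang>` files in a candidate clone and bucket the result. The helpers below are a minimal sketch: both function names are hypothetical, and because the tier limits above are approximate, treating 150 and 1500 as hard cut-offs is an assumption.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): bucket a file count into the size tiers.
# The exact boundaries (150 and 1500) are an assumption; the tiers are fuzzy.
size_tier() {
  local files="$1"
  if (( files < 150 )); then
    echo "Small"
  elif (( files <= 1500 )); then
    echo "Medium"
  else
    echo "Large"
  fi
}

# Count tracked files with a given extension in a cloned repo.
count_lang_files() {
  local repo_dir="$1" ext="$2"
  git -C "$repo_dir" ls-files -- "*.${ext}" | wc -l | tr -d ' '
}

# Usage:  size_tier "$(count_lang_files /tmp/codegraph-corpus/<name> lua)"
```

Comparing the language file count against the repo's total tracked files is also a cheap way to spot repos that are tagged `<lang>` but mostly another language.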
| 168 | + |
| 169 | +### Step 8 — Benchmark all 3 (extraction + A/B) |
| 170 | + |
| 171 | +Put the dev build of codegraph on PATH **once**, then loop over the three repos:
| 172 | +```bash |
| 173 | +npm run build && ./scripts/local-install.sh |
| 174 | +scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless # ×3 |
| 175 | +``` |
| 176 | +`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs |
| 177 | +`verify-extraction.mjs`, then the with/without retrieval A/B via |
| 178 | +`scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken). |
| 179 | +Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file |
| 180 | +`Read`s, Grep/Bash, codegraph-tool calls, duration, and **cost** — for both the |
| 181 | +`with` and `without` arms. After the loop, restore the dev link if needed: |
| 182 | +`./scripts/local-install.sh`. |
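
The "×3" loop can be driven from a small tab-separated list, one repo per line (`name`, `url`, `question`). This is a sketch under assumptions: the list file, the `run_benchmarks` function, and the overridable `BENCH` variable are hypothetical conveniences, and stopping at the first failing repo is a design choice, not documented `bench.sh` behavior.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical wrapper): run bench.sh once per repo listed in a TSV file.
# BENCH is overridable only so the loop can be dry-run against a stub.
BENCH="${BENCH:-scripts/add-lang/bench.sh}"

run_benchmarks() {
  local lang="$1" list="$2"
  local name url question
  # Each line: <name><TAB><url><TAB><question>
  while IFS=$'\t' read -r name url question || [ -n "$name" ]; do
    [ -n "$name" ] || continue   # skip blank lines
    # stdin is redirected so the benchmark cannot consume the repo list.
    "$BENCH" "$lang" "$name" "$url" "$question" headless < /dev/null || return 1
  done < "$list"
}

# Usage:  run_benchmarks lua repos.tsv
```

Returning on the first failure keeps a broken extraction from being followed by further paid runs; drop the `|| return 1` if all three repos should be attempted regardless.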
| 183 | + |
| 184 | +### Step 9 — Docs + CHANGELOG |
| 185 | + |
| 186 | +- **README.md**: add `<Lang>` to the "19+ Languages" feature bullet, and add a |
| 187 | + row to the **Supported Languages** table: |
| 188 | + `| <Lang> | \`.ext\` | Full support (classes, methods, …) |`. |
| 189 | +- **CHANGELOG.md**: add an `## [Unreleased]` section at the top (above the |
| 190 | + latest version) with `### Added` → a user-perspective bullet, e.g. |
| 191 | + *"CodeGraph now indexes **<Lang>** (`.ext`) — functions, classes, imports, and |
| 192 | + call edges."* If `## [Unreleased]` already exists, append under it. (`/publish` |
| 193 | + folds this into the next versioned block at release time.) |
| 194 | + |
| 195 | +### Step 10 — Report (do NOT commit) |
| 196 | + |
| 197 | +Summarize for review: |
| 198 | +- **Files changed**: the 4 wiring files (3 edits + the new extractor) + tests +
| 199 | + README + CHANGELOG + corpus.json (+ any `tree-sitter.ts` branch or vendored `.wasm`).
| 200 | +- **Extraction** per repo: files / nodes / edges / `verify-extraction` result. |
| 201 | +- **A/B** per repo: `with` vs `without` (tool calls, file Reads, cost) and a |
| 202 | + one-line verdict — did codegraph reduce effort, and did both arms reach a |
| 203 | + correct answer? |
| 204 | +- **Gaps / follow-ups** (node types not yet mapped, resolution edges missing, |
| 205 | + framework routes, etc.). |
| 206 | + |
| 207 | +Hand the changes to the user. **Do not** run `git commit`/`push`, |
| 208 | +`npm publish`, or `scripts/release.sh`. |
| 209 | + |
| 210 | +## Notes |
| 211 | +- The A/B spawns real **paid** `claude -p` runs (opus, `--max-budget-usd`), |
| 212 | + 2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with |
| 213 | + `/agent-eval`, so clones are reused across runs. |
| 214 | +- Any new `*.wasm` must live in `src/extraction/wasm/` — `copy-assets` (run by |
| 215 | + `npm run build`) ships it; otherwise it won't be in `dist/`. |
| 216 | +- An index must be served by the **same** binary that built it. Step 8 builds + |
| 217 | + links the dev build first, so this holds. |
| 218 | +- If a grammar can't be obtained, or extraction can't reach PASS, **STOP and |
| 219 | + report** — don't ship a half-wired language. |