Commit 4329a52

feat: add Lua and Luau language support (colbymchenry#273)
Adds Lua (.lua) and Luau (.luau) extraction — functions, methods with receivers, type aliases (Luau), require imports (incl. Roblox instance-path), and call edges. Vendors the ABI-15 Lua and ABI-14 Luau tree-sitter grammars. Addresses colbymchenry#232.
1 parent 2fc0df7 commit 4329a52

17 files changed

Lines changed: 969 additions & 3 deletions

.claude/skills/add-lang/SKILL.md

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
1+
---
2+
name: add-lang
3+
description: Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang <language> or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph.
4+
---
5+
6+
# Add a language to CodeGraph
7+
8+
Wire a new tree-sitter language into codegraph's extraction pipeline, prove it
9+
extracts real symbols on popular repos, and prove it beats no-codegraph for an
10+
agent. Runs **fully autonomously** — pick repos, benchmark, update docs, then
11+
report. **Never commit, push, publish, or tag** (house rule); leave all changes
12+
for the user to review.
13+
14+
The argument is the language token used throughout the `Language` union, e.g.
15+
`lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase
16+
single-token form everywhere (`csharp`, not `c#`).
17+
18+
## Prerequisites
19+
- Run from the codegraph repo root. `node`, `git`, `gh`, and a logged-in
20+
`claude` CLI (the benchmark spawns real `claude -p` runs).
21+
- The benchmark uses the local dev build — Step 8 builds + links it on PATH.
22+
23+
## Workflow
24+
25+
Copy this checklist and work through it in order:
26+
```
27+
- [ ] 1. Resolve language; bail early if already supported (just benchmark)
28+
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
29+
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
30+
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
31+
- [ ] 5. Build + verify-extraction loop until PASS
32+
- [ ] 6. Add extraction tests; make them green
33+
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
34+
- [ ] 8. Benchmark all 3: extraction + with/without A/B
35+
- [ ] 9. Update README + CHANGELOG
36+
- [ ] 10. Report; do NOT commit
37+
```
38+
39+
### Step 1 — Resolve + short-circuit
40+
41+
Check whether the language is already wired: look for the token in the
42+
`LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map
43+
(`src/extraction/languages/index.ts`). If it is already supported (e.g.
44+
`typescript`, `rust`), **skip Steps 2–6** and go straight to benchmarking
45+
(Steps 7–8) to validate/measure it — note in the report that no code changed.
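The short-circuit in Step 1 can be sketched as a membership check against both registries. This is a minimal illustration, not the actual code: the `LANGUAGES` entries, the `EXTRACTORS` contents, and the helper names `isSupported` and `nextStep` are assumptions made for the example.

```typescript
// Hypothetical stand-in for the LANGUAGES const in src/types.ts (entries are illustrative).
const LANGUAGES = ['typescript', 'rust', 'lua', 'unknown'] as const;

// Deriving the union from the const keeps the two from drifting apart.
type Language = (typeof LANGUAGES)[number];

// Hypothetical stand-in for the EXTRACTORS map in src/extraction/languages/index.ts.
const EXTRACTORS: Partial<Record<Language, object>> = {
  typescript: {},
  rust: {},
};

// A language counts as wired only if it appears in BOTH places.
function isSupported(token: string): boolean {
  const listed = (LANGUAGES as readonly string[]).indexOf(token) !== -1;
  const hasExtractor = Object.prototype.hasOwnProperty.call(EXTRACTORS, token);
  return token !== 'unknown' && listed && hasExtractor;
}

// Steps 2-6 are skipped only when the token is already fully wired.
function nextStep(token: string): 'benchmark-only' | 'wire-language' {
  return isSupported(token) ? 'benchmark-only' : 'wire-language';
}
```

In this sketch `lua` is listed in `LANGUAGES` but has no extractor, so it is still treated as unwired and goes through the full workflow.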
46+
47+
### Step 2 — Find a grammar, then health-check it
48+
49+
```bash
50+
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang> # csharp -> c_sharp
51+
```
52+
- **Present** → likely off-the-shelf; `grammars.ts` resolves it from
53+
`tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml,
54+
solidity, toml, yaml, …)
55+
- **Absent** → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal` /
56+
`scala` / `lua`) and add the token to the vendored branch in Step 4.
57+
58+
**Always health-check before writing an extractor — a *present* grammar can
59+
still be unusable:**
60+
```bash
61+
node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>
62+
```
63+
It prints the grammar's ABI version and parses a valid sample many times in a
64+
multi-grammar runtime. If it **FAILs** (ERROR trees on valid code — an old ABI
65+
corrupting the shared WASM heap, which silently drops nested calls/imports on
66+
every file after the first; e.g. the tree-sitter-wasms **Lua** grammar is ABI 13
67+
and fails), do NOT use that wasm. **Vendor a newer (ABI 14/15) build instead:**
68+
```bash
69+
npm pack @tree-sitter-grammars/tree-sitter-<lang> # often ships a prebuilt *.wasm
70+
# or build one: npx tree-sitter build --wasm (needs Docker/emscripten)
71+
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm
72+
```
73+
then add the token to the vendored branch in Step 4 and re-run check-grammar on
74+
the vendored path until it PASSes. **If you cannot obtain a healthy wasm, STOP
75+
and tell the user.**
76+
77+
### Step 3 — Discover AST node types
78+
79+
Get a representative source file (write a small sample covering functions,
80+
classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
81+
```bash
82+
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
83+
# vendored grammar: pass the wasm path instead of the token
84+
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
85+
```
86+
The frequency table + field names (`name:`, `parameters:`, `body:`,
87+
`return_type:`) tell you what to map. Open the existing extractor closest to the
88+
language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits),
89+
`java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts`
90+
(top-level methods + receivers).
91+
92+
### Step 4 — Wire the language (4 files)
93+
94+
These are exact, fragile wiring — match the existing style precisely:
95+
96+
1. **`src/types.ts`** — TWO edits:
97+
- add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`);
98+
- add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. **Don't skip this** — it's
99+
the file-scan allowlist; without the glob, `codegraph init` finds **0
100+
files** even though detection/extraction are wired.
101+
2. **`src/extraction/grammars.ts`** — three maps:
102+
- `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
103+
- `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
104+
- `getLanguageDisplayName`: `<lang>: '<Display Name>',`
105+
- **vendored only**: add `<lang>` to the
106+
`(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch.
107+
3. **`src/extraction/languages/<lang>.ts`** — new file exporting
108+
`export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types
109+
from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`,
110+
`interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`,
111+
`importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`,
112+
`paramsField`. Add hooks as the grammar needs them (`getSignature`,
113+
`getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`,
114+
`interfaceKind`, `enumMemberTypes`, etc. — see
115+
`src/extraction/tree-sitter-types.ts`).
116+
4. **`src/extraction/languages/index.ts`** — `import { <lang>Extractor } from
117+
'./<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.
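As a rough illustration of item 3, an extractor for a Lua-like grammar might look like the sketch below. The `LanguageExtractor` interface here is a simplified, hypothetical copy that carries only the required fields named above (the real definition is in `src/extraction/tree-sitter-types.ts`), and the node-type and field names are assumptions that should be confirmed against `dump-ast.mjs` output.

```typescript
// Simplified, hypothetical shape: required fields only, all typed as plain strings.
interface LanguageExtractor {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Node-type names are assumptions for illustration; verify them with dump-ast.mjs.
const luaExtractor: LanguageExtractor = {
  functionTypes: ['function_declaration'],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // `require` is a call, so there is no dedicated import node
  callTypes: ['function_call'],
  variableTypes: ['variable_declaration'],
  nameField: 'name',
  bodyField: 'body',
  paramsField: 'parameters',
};

// Collect every node type the extractor maps, in declaration order.
function mappedNodeTypes(e: LanguageExtractor): string[] {
  return [
    ...e.functionTypes, ...e.classTypes, ...e.methodTypes, ...e.interfaceTypes,
    ...e.structTypes, ...e.enumTypes, ...e.typeAliasTypes, ...e.importTypes,
    ...e.callTypes, ...e.variableTypes,
  ];
}
```

A helper like `mappedNodeTypes` is only a convenience for eyeballing coverage; an empty or near-empty result suggests the mappings are too thin to produce symbols.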
118+
119+
**Sometimes a 5th, core touch in `src/extraction/tree-sitter.ts`** — variable
120+
extraction has per-language branches in `extractVariable` (the generic fallback
121+
only finds direct `identifier`/`variable_declarator` children). If the grammar
122+
nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a
123+
`} else if (this.language === '<lang>')` branch there, mirroring the existing
124+
ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby `require`
125+
is a *call*) are handled in the extractor's `visitNode` hook instead.
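The difference between the generic fallback and a per-language branch can be sketched with a toy syntax tree. `MiniNode` is a hypothetical stand-in for a tree-sitter node, and the tree built for `local a, b` is an assumed shape, not real parser output.

```typescript
// Minimal stand-in for a syntax node; the real tree-sitter node API is richer.
interface MiniNode {
  type: string;
  text: string;
  children: MiniNode[];
}

// Generic fallback as described above: only direct identifier children count.
function directNames(decl: MiniNode): string[] {
  return decl.children.filter((c) => c.type === 'identifier').map((c) => c.text);
}

// Per-language branch: descend through variable_list wrappers to reach the names.
function nestedNames(decl: MiniNode): string[] {
  const names: string[] = [];
  for (const child of decl.children) {
    if (child.type === 'identifier') {
      names.push(child.text);
    } else if (child.type === 'variable_list') {
      names.push(...nestedNames(child));
    }
  }
  return names;
}

const id = (text: string): MiniNode => ({ type: 'identifier', text, children: [] });

// Assumed tree for `local a, b`: the names sit one level down, inside variable_list.
const decl: MiniNode = {
  type: 'variable_declaration',
  text: 'local a, b',
  children: [{ type: 'variable_list', text: 'a, b', children: [id('a'), id('b')] }],
};
```

On this toy tree `directNames(decl)` finds nothing while `nestedNames(decl)` yields both names, which is the symptom that motivates adding a language-specific branch.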
126+
127+
### Step 5 — Build + verify loop
128+
129+
```bash
130+
npm run build # tsc + copy-assets (copies any vendored *.wasm into dist/)
131+
```
132+
Index a small sample repo and check extraction:
133+
```bash
134+
( cd <sample-repo> && codegraph init -i )
135+
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
136+
```
137+
`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only
138+
`file`/`import` nodes were produced — the classic symptom of wrong node-type
139+
names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the
140+
mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. **Repeat until
141+
PASS.**
142+
143+
### Step 6 — Tests
144+
145+
Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
146+
- a `detectLanguage` assertion in `describe('Language Detection')`
147+
- a `describe('<Lang> Extraction')` block asserting functions/classes/imports
148+
are extracted from an inline source string.
149+
```bash
150+
npx vitest run __tests__/extraction.test.ts
151+
```
152+
Green before continuing.
153+
154+
### Step 7 — Auto-pick 3 repos + corpus
155+
156+
Pick **without asking**. Find candidates, then curate 3 that are genuinely
157+
`<lang>`-dominant, one per size tier:
158+
```bash
159+
gh search repos --language=<lang> --sort=stars --limit 40 \
160+
--json fullName,stargazerCount,description
161+
```
162+
Tiers (match `corpus.json`): **Small** <~150 files · **Medium** ~150–1500 ·
163+
**Large** >~1500. Skip repos that are tagged `<lang>` but mostly another
164+
language. Write one cross-file architecture **question** per repo (the kind that
165+
needs tracing across files). Add a `"<Language>"` block to
166+
`.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`,
167+
`files`, `question`) so `/agent-eval` can reuse them.
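The tiers and the corpus entry fields can be sketched as follows. The cut-offs are approximate in the text (`~150`, `~1500`), so treating them as hard boundaries is an assumption, as are the helper names and the example repository URL; curation by hand may reasonably override the computed tier.

```typescript
type SizeTier = 'Small' | 'Medium' | 'Large';

// Fields listed for a corpus.json entry.
interface CorpusEntry {
  name: string;
  repo: string;
  size: SizeTier;
  files: string;
  question: string;
}

// Assumption: the approximate tier boundaries are applied as hard cut-offs.
function sizeTier(fileCount: number): SizeTier {
  if (fileCount < 150) return 'Small';
  if (fileCount <= 1500) return 'Medium';
  return 'Large';
}

// Build an entry; `files` uses the approximate "~N" string style of corpus.json.
function makeEntry(name: string, repo: string, fileCount: number, question: string): CorpusEntry {
  return { name, repo, size: sizeTier(fileCount), files: `~${fileCount}`, question };
}

// Hypothetical repository, used only to show the resulting shape.
const example = makeEntry(
  'small-lib',
  'https://github.com/example/small-lib',
  40,
  'How does small-lib load and apply its configuration?',
);
```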
168+
169+
### Step 8 — Benchmark all 3 (extraction + A/B)
170+
171+
Make the dev build the codegraph on PATH **once**, then loop:
172+
```bash
173+
npm run build && ./scripts/local-install.sh
174+
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless # ×3
175+
```
176+
`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs
177+
`verify-extraction.mjs`, then the with/without retrieval A/B via
178+
`scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken).
179+
Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file
180+
`Read`s, Grep/Bash, codegraph-tool calls, duration, and **cost** — for both the
181+
`with` and `without` arms. After the loop, restore the dev link if needed:
182+
`./scripts/local-install.sh`.
183+
184+
### Step 9 — Docs + CHANGELOG
185+
186+
- **README.md**: add `<Lang>` to the "19+ Languages" feature bullet, and add a
187+
row to the **Supported Languages** table:
188+
`| <Lang> | \`.ext\` | Full support (classes, methods, …) |`.
189+
- **CHANGELOG.md**: add an `## [Unreleased]` section at the top (above the
190+
latest version) with `### Added` → a user-perspective bullet, e.g.
191+
*"CodeGraph now indexes **<Lang>** (`.ext`) — functions, classes, imports, and
192+
call edges."* If `## [Unreleased]` already exists, append under it. (`/publish`
193+
folds this into the next versioned block at release time.)
194+
195+
### Step 10 — Report (do NOT commit)
196+
197+
Summarize for review:
198+
- **Files changed**: the 4 wiring edits + new extractor + tests + README +
199+
CHANGELOG + corpus.json (+ any vendored `.wasm`).
200+
- **Extraction** per repo: files / nodes / edges / `verify-extraction` result.
201+
- **A/B** per repo: `with` vs `without` (tool calls, file Reads, cost) and a
202+
one-line verdict — did codegraph reduce effort, and did both arms reach a
203+
correct answer?
204+
- **Gaps / follow-ups** (node types not yet mapped, resolution edges missing,
205+
framework routes, etc.).
206+
207+
Hand the changes to the user. **Do not** run `git commit`/`push`,
208+
`npm publish`, or `scripts/release.sh`.
209+
210+
## Notes
211+
- The A/B spawns real **paid** `claude -p` runs (opus, `--max-budget-usd`),
212+
2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with
213+
`/agent-eval`, so clones are reused across runs.
214+
- Any new `*.wasm` must live in `src/extraction/wasm/` so that `copy-assets` (run by
215+
`npm run build`) ships it; otherwise it won't be in `dist/`.
216+
- An index must be served by the **same** binary that built it. Step 8 builds +
217+
links the dev build first, so this holds.
218+
- If a grammar can't be obtained, or extraction can't reach PASS, **STOP and
219+
report** — don't ship a half-wired language.

.claude/skills/agent-eval/corpus.json

Lines changed: 10 additions & 0 deletions
@@ -59,5 +59,15 @@
59 59
],
60 60
"Svelte": [
61 61
{ "name": "shadcn-svelte", "repo": "https://github.com/huntabyte/shadcn-svelte", "size": "Medium", "files": "~600", "question": "How do shadcn-svelte components compose and apply their styling?" }
62+
],
63+
"Lua": [
64+
{ "name": "lualine.nvim", "repo": "https://github.com/nvim-lualine/lualine.nvim", "size": "Small", "files": "~120", "question": "How does lualine assemble and render its statusline sections and components?" },
65+
{ "name": "telescope.nvim", "repo": "https://github.com/nvim-telescope/telescope.nvim", "size": "Medium", "files": "~80", "question": "How does Telescope wire a picker to its finder, sorter, and previewer?" },
66+
{ "name": "kong", "repo": "https://github.com/Kong/kong", "size": "Large", "files": "~1330", "question": "How does Kong execute plugins across a request's lifecycle phases?" }
67+
],
68+
"Luau": [
69+
{ "name": "Knit", "repo": "https://github.com/Sleitnick/Knit", "size": "Small", "files": "~10", "question": "How does Knit register services and expose them to clients?" },
70+
{ "name": "vide", "repo": "https://github.com/centau/vide", "size": "Small", "files": "~40", "question": "How does vide track reactive sources and re-run effects when state changes?" },
71+
{ "name": "Fusion", "repo": "https://github.com/dphfox/Fusion", "size": "Medium", "files": "~115", "question": "How does Fusion build and update its reactive UI graph from state objects?" }
62 72
]
63 73
}

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -7,6 +7,20 @@ a [GitHub Release](https://github.com/colbymchenry/codegraph/releases) tagged
7 7
This project follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
8 8
and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
9 9

10+
## [Unreleased]
11+
12+
### Added
13+
- **Lua**: CodeGraph now indexes Lua (`.lua`) — functions, methods (table `t.f`
14+
and `t:m` definitions become methods with a `t::f` receiver-qualified name),
15+
local variables, `require(...)` imports, and the call edges between them.
16+
Querying a Lua project (Neovim plugins, Kong, OpenResty, game code) now
17+
surfaces its modules, methods, and call graph.
18+
- **Luau** ([#232](https://github.com/colbymchenry/codegraph/issues/232)):
19+
CodeGraph now indexes Luau (`.luau`), Roblox's typed superset of Lua —
20+
everything Lua extracts, plus `type` / `export type` aliases, typed function
21+
signatures, generics, and Roblox instance-path `require(script.Parent.X)`
22+
imports.
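The receiver-qualified naming mentioned in the Lua entry (`t.f` and `t:m` becoming `t::f`-style names) can be illustrated with a small string helper. This is a sketch under a stated assumption, namely that the receiver is everything before the last `.` or `:`; the function name is hypothetical and the actual extractor works on syntax nodes rather than strings.

```typescript
// Assumption: the receiver is everything before the last '.' or ':' separator.
function qualifyMethodName(definedName: string): string | null {
  const sep = Math.max(definedName.lastIndexOf('.'), definedName.lastIndexOf(':'));
  // No separator, or a separator at either end: treat as a plain function.
  if (sep <= 0 || sep === definedName.length - 1) return null;
  const receiver = definedName.slice(0, sep);
  const method = definedName.slice(sep + 1);
  return `${receiver}::${method}`;
}
```

Under this assumption both `function t.f()` and `function t:m()` definitions get the same `receiver::name` form, while a bare `function helper()` stays an ordinary function.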
23+
10 24
## [0.8.0] - 2026-05-20
11 25

12 26
### Added

README.md

Lines changed: 3 additions & 1 deletion
@@ -107,7 +107,7 @@ The gains scale with codebase size: on large repos the agent answers from the in
107 107
| **Full-Text Search** | Find code by name instantly across your entire codebase, powered by FTS5 |
108 108
| **Impact Analysis** | Trace callers, callees, and the full impact radius of any symbol before making changes |
109 109
| **Always Fresh** | File watcher uses native OS events (FSEvents/inotify/ReadDirectoryChangesW) with debounced auto-sync — the graph stays current as you code, zero config |
110-
| **19+ Languages** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Svelte, Liquid, Pascal/Delphi |
110+
| **19+ Languages** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Lua, Luau, Svelte, Liquid, Pascal/Delphi |
111 111
| **Framework-aware Routes** | Recognizes web-framework routing files and links URL patterns to their handlers across 13 frameworks |
112 112
| **100% Local** | No data leaves your machine. No API keys. No external services. SQLite database only |
113 113

@@ -447,6 +447,8 @@ The `.codegraph/config.json` file controls indexing:
447 447
| Vue | `.vue` | Full support (script + script-setup extraction, Nuxt page/API/middleware routes) |
448 448
| Liquid | `.liquid` | Full support |
449 449
| Pascal / Delphi | `.pas`, `.dpr`, `.dpk`, `.lpr` | Full support (classes, records, interfaces, enums, DFM/FMX form files) |
450+
| Lua | `.lua` | Full support (functions, methods with receivers, local variables, `require` imports, call edges) |
451+
| Luau | `.luau` | Full support (everything in Lua, plus `type`/`export type` aliases, typed signatures, and Roblox instance-path `require`) |
450 452

451 453
## Troubleshooting
452 454
