Date: 2026-05-24 · Branch: main · codegraph 0.9.4
A headless agent (Claude Opus, --permission-mode bypassPermissions) answers one
canonical flow question per repo — twice: with the codegraph MCP server, and
without any MCP (built-in Read/Grep/Glob/Bash only). Same model, same prompt; codegraph
is the only variable. Each cell was re-indexed fresh first (against a dist/ build of the
current main HEAD), so the "with" arm reflects the shipped 0.9.4 resolvers.
Across 37 cells, codegraph cut total file reads from 159 → 38 — 76% fewer. It never increased reads in any cell (0 regressions). The mechanism: a few sub-millisecond codegraph calls replace a read-and-grep exploration.
Cost stays roughly flat and is marginally higher on the with-arm here (summed across the 37
cells: with $15.4 vs without $13.8). On these short single-flow questions the without-arm
usually resolves in under 10 calls (321 calls over 37 cells, about 8.7 per cell) and rarely
balloons, so it doesn't reach the regime where codegraph's cost savings compound, while the
with-arm pays fixed MCP overhead (tool definitions in context + tool-loading) that short tasks
don't amortize. The win is **fewer tool calls (189 vs 321, −41%)
and lower wall-clock** (mean 38s vs 48s), which is the design target. On harder multi-turn
investigations cost flips to a net saving as the without-arm's accumulated context balloons;
see docs/benchmarks/call-sequence-analysis.md.
The gap widens with repo size and flow complexity: on medium/large repos the without-codegraph
arm often thrashes — many greps/globs, shell find/grep (Bash), and occasionally spawning
a sub-agent — while the with-codegraph arm answers in 2–8 calls. On tiny repos (a handful of
files) the two arms tie or codegraph is marginally slower (MCP/index overhead doesn't pay off
when the whole flow fits in one or two files) — but reads still drop.
- R / G / Gl / B / Ag = Read / Grep / Glob / Bash / sub-agent (Task) tool calls.
- cg-calls = codegraph MCP calls in the "with" arm (the trade for reads/greps).
- dur = wall-clock seconds. files = indexed file count (the size proxy).
- reads saved = without-reads − with-reads.
- One run per arm (a snapshot: run-to-run variance is real, so treat ±1–2 reads and ±10s as
noise and look at the pattern across cells). 2-runs/arm headline numbers for several of these flows
live in docs/design/dynamic-dispatch-coverage-playbook.md §7.
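The legend can be applied mechanically to any row. The sketch below is illustrative only and is not the project's parse-matrix.mjs; the function names `parse_cell` and `reads_saved` are hypothetical. It turns a cell string into per-tool counts and derives the reads-saved figure, using the aspnet-jellyfin row as input.

```python
import re

# Abbreviations from the legend. "Agent" is a label chosen here for sub-agent (Task) calls.
TOOL_NAMES = {"R": "Read", "G": "Grep", "Gl": "Glob", "B": "Bash", "Ag": "Agent"}


def parse_cell(cell: str) -> dict[str, int]:
    """Turn a cell such as '17R / 1G / 2Gl' into per-tool counts (absent tools count as 0)."""
    counts = {name: 0 for name in TOOL_NAMES.values()}
    for part in cell.split("/"):
        match = re.fullmatch(r"(\d+)(R|G|Gl|B|Ag)", part.strip())
        if match is None:
            raise ValueError(f"unrecognised cell fragment: {part!r}")
        counts[TOOL_NAMES[match.group(2)]] += int(match.group(1))
    return counts


def reads_saved(with_cell: str, without_cell: str) -> int:
    """reads saved = without-reads minus with-reads, as defined in the legend."""
    return parse_cell(without_cell)["Read"] - parse_cell(with_cell)["Read"]


# aspnet-jellyfin: the with-arm cell and the without-arm cell from the table.
without_arm = parse_cell("17R / 1G / 2Gl / 17B / 1Ag")
print(reads_saved("3R / 0G", "17R / 1G / 2Gl / 17B / 1Ag"))  # 14
print(sum(without_arm.values()))  # 38 tool calls in the without-arm
```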
| Language | Size | Repo | files | with R/G | cg-calls | with dur | without R/G | without dur | reads saved |
|---|---|---|---|---|---|---|---|---|---|
| C | L | c-redis | 884 | 0R / 2G | 4 | 42s | 5R / 6G | 51s | 5 |
| C# | S | aspnet-realworld | 78 | 0R / 0G | 2 | 27s | 5R / 3G / 2Gl | 54s | 5 |
| C# | M | aspnet-eshop | 262 | 0R / 1G | 5 | 39s | 9R / 2G / 5Gl | 58s | 9 |
| C# | L | aspnet-jellyfin | 2081 | 3R / 0G | 4 | 51s | 17R / 1G / 2Gl / 17B / 1Ag | 212s | 14 |
| C++ | M | cpp-leveldb | 134 | 0R / 0G | 3 | 26s | 4R / 2G | 37s | 4 |
| Dart | S | flutter_module_books | 6 | 1R / 0G | 2 | 24s | 2R / 0G / 1Gl | 29s | 1 |
| Dart | M | compass_app | 212 | 2R / 0G / 1Gl | 2 | 42s | 3R / 0G / 2Gl | 30s | 1 |
| Go | S | gin-realworld | 21 | 0R / 0G | 5 | 35s | 4R / 3G / 1Gl | 57s | 4 |
| Go | M | gin-vueadmin | 625 | 1R / 1G | 4 | 47s | 3R / 3G / 1Gl | 44s | 2 |
| Go | L | gin-gitness | 4438 | 4R / 3G | 4 | 64s | 8R / 7G / 2Gl | 57s | 4 |
| Java | S | spring-realworld | 117 | 2R / 0G | 3 | 35s | 8R / 1G / 5B | 57s | 6 |
| Java | M | spring-mall | 536 | 1R / 0G | 5 | 39s | 2R / 4G / 2Gl | 49s | 1 |
| Java | L | spring-halo | 2444 | 1R / 2G | 8 | 60s | 4R / 1G / 6B | 52s | 3 |
| Kotlin | S | kotlin-petclinic | 43 | 0R / 0G | 2 | 37s | 3R / 0G / 1Gl | 23s | 3 |
| Kotlin | M | Jetcaster | 166 | 1R / 0G | 3 | 36s | 1R / 0G / 2Gl | 46s | 0 |
| Lua | S | lualine.nvim | 123 | 1R / 1G | 4 | 48s | 4R / 0G / 2Gl | 49s | 3 |
| Lua | M | telescope.nvim | 84 | 0R / 0G | 1 | 15s | 1R / 0G / 1Gl | 20s | 1 |
| Luau | S | Knit | 11 | 0R / 0G | 2 | 30s | 5R / 0G / 2Gl | 37s | 5 |
| PHP | S | laravel-realworld | 114 | 1R / 0G | 6 | 40s | 5R / 1G / 3Gl | 39s | 4 |
| PHP | M | laravel-firefly | 2047 | 2R / 1G | 4 | 47s | 4R / 5G / 3Gl | 75s | 2 |
| PHP | L | laravel-bookstack | 2160 | 1R / 2G | 2 | 41s | 2R / 4G / 1Gl | 50s | 1 |
| Python | S | django-realworld | 44 | 2R / 1G | 2 | 47s | 9R / 0G / 1B | 38s | 7 |
| Python | M | django-wagtail | 1672 | 2R / 0G | 4 | 45s | 8R / 3G / 3Gl / 1B | 66s | 6 |
| Python | L | django-saleor | 4429 | 2R / 2G | 4 | 52s | 4R / 6G / 1Gl | 64s | 2 |
| Ruby | S | rails-realworld | 59 | 0R / 0G | 2 | 30s | 3R / 0G / 2B | 33s | 3 |
| Ruby | M | rails-spree | 2905 | 2R / 3G / 1Gl | 5 | 43s | 3R / 3G / 2Gl / 1B | 55s | 1 |
| Ruby | L | rails-forem | 4658 | 3R / 1G | 3 | 43s | 4R / 2G / 3Gl | 48s | 1 |
| Rust | S | rust-axum-realworld | 13 | 0R / 0G | 2 | 21s | 3R / 0G / 1Gl | 38s | 3 |
| Rust | M | rust-actix-examples | 176 | 0R / 1G | 3 | 42s | 3R / 0G / 3B | 36s | 3 |
| Rust | L | rust-cratesio | 1053 | 1R / 0G | 3 | 22s | 1R / 2G | 18s | 0 |
| Scala | S | computer-database | 10 | 1R / 0G | 2 | 27s | 3R / 0G / 1Gl | 25s | 2 |
| Swift | S | vapor-template | 14 | 0R / 0G | 2 | 21s | 2R / 0G / 2Gl | 22s | 2 |
| Swift | M | vapor-steampress | 100 | 0R / 0G | 5 | 49s | 3R / 1G / 2Gl | 39s | 3 |
| Swift | L | vapor-spi | 542 | 1R / 1G | 4 | 27s | 2R / 5G | 34s | 1 |
| TypeScript/JS | S | express-realworld | 39 | 1R / 0G | 1 | 25s | 2R / 2G | 19s | 1 |
| TypeScript/JS | M | excalidraw | 643 | 1R / 0G | 3 | 55s | 7R / 5G / 3Gl / 1B | 87s | 6 |
| TypeScript/JS | L | nest-immich | 2759 | 1R / 0G | 7 | 50s | 3R / 0G / 1Gl | 44s | 2 |
Totals (37 cells): with codegraph 38 reads / 22 greps, without 159 reads / 72 greps:
76% fewer reads, ~69% fewer greps. Codegraph never increased reads in any cell. The
without-arm additionally ran 52 globs (vs 2 in the with-arm), plus 37 shell find/grep (Bash)
calls and 1 sub-agent that the with-arm (0 Bash, 0 sub-agents) never needed. (74 agent runs, $29.18 total.)
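As a quick check on the headline percentages, the arithmetic can be reproduced from the totals quoted in this document. This is a minimal sketch: `pct_fewer` is a hypothetical helper name, and every number is copied from the text rather than recomputed from logs.

```python
def pct_fewer(without: int, with_: int) -> float:
    """Percentage reduction going from the without-arm to the with-arm."""
    return 100.0 * (without - with_) / without


# (without, with) totals across the 37 cells, as quoted in the text.
totals = {
    "reads": (159, 38),
    "greps": (72, 22),
    "tool calls": (321, 189),
}

for label, (without, with_) in totals.items():
    print(f"{label}: {without} -> {with_} ({pct_fewer(without, with_):.1f}% fewer)")
# reads: 159 -> 38 (76.1% fewer)
# greps: 72 -> 22 (69.4% fewer)
# tool calls: 321 -> 189 (41.1% fewer)

# The without-arm call total is the sum of its parts: reads + greps + globs + Bash + sub-agent.
assert 159 + 72 + 52 + 37 + 1 == 321
```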
- Biggest wins are medium/large backends with a real route→handler→service flow: aspnet-jellyfin (3R / 51s vs 17R + 17 Bash + a spawned sub-agent / 212s — the single most dramatic cell), aspnet-eshop (0R vs 9R), django-realworld (2R vs 9R), spring-realworld (2R vs 8R + 5 Bash), django-wagtail (2R vs 8R), excalidraw (1R / 55s vs 7R / 87s), Luau Knit (0R vs 5R), aspnet-realworld (0R vs 5R), c-redis (0R vs 5R).
- Without codegraph, large repos make the agent thrash: it falls back to shell find/grep (37 Bash calls across the matrix) and on jellyfin even spawned a sub-agent, exactly the behavior codegraph is meant to prevent. The with-arm answers those in 2–8 codegraph calls and used 0 Bash and 0 sub-agents anywhere.
- Tie zone = tiny repos and flows that fit in 1–2 files (Kotlin Jetcaster 1R/1R, Rust cratesio 1R/1R, express 1R/2R, Swift template 0R/2R): reading is already cheap there, so codegraph ties or nearly ties on reads and is sometimes slower (MCP + index overhead: Kotlin petclinic 37s vs 23s, cratesio 22s vs 18s). This matches the design note that codegraph's value scales with repo size and flow complexity.
- Duration tracks reads on the big repos (jellyfin 51s vs 212s, excalidraw 55s vs 87s, aspnet-eshop 39s vs 58s, django-wagtail 45s vs 66s) and is noise on small ones; mean wall-clock is 38s with vs 48s without.
- Some "with" cells still read 2–4 files (jellyfin, gitness, forem, saleor, django) — the residual is the documented frontier (anonymous handlers, deep service chains, dynamic finders); codegraph gets the agent to the right file, then it reads one to confirm a detail.
All 14 README frameworks and every flow-relevant language are validated (see the playbook). The sizes here are by indexed file count; several languages lack a full S/M/L set in the corpus (Dart/Kotlin/Lua = S/M, Scala/Luau = S only, C = L only, C++ = M only), and those cells are omitted rather than faked.
Canonical harness: scripts/agent-eval/run-all.sh <repo> "<question>" headless (with = codegraph-only
MCP, without = empty MCP), parsed from the stream-json logs. The throwaway matrix driver + parser used
for this table live in /tmp/ab-matrix/: run.sh (the lang|size|repo|question matrix — each cell does
rm -rf .codegraph && codegraph init -i then both arms), parse-matrix.mjs (cells → this table), and
compare.mjs (old-vs-new diff + aggregates). Build dist/ from the target commit first so the MCP
server loads the code under test (codegraph on PATH is npm linked to the dev dist/).
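For orientation, counting tool calls from a stream-json log could look roughly like the sketch below. It is not the project's parse-matrix.mjs. It assumes each log line is one JSON object, that assistant events carry `tool_use` blocks with a `name` field under `message.content`, and that codegraph MCP tools are named with an `mcp__codegraph` prefix. The sample log lines and the tool name `mcp__codegraph__search` are made up for illustration.

```python
import json
from collections import Counter


def count_tool_calls(lines):
    """Count tool_use blocks per tool name; codegraph MCP calls are pooled under 'codegraph'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise in the log
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message") or {}
        content = message.get("content") or []
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                name = block.get("name", "unknown")
                # Assumed naming convention for MCP tools: mcp__<server>__<tool>.
                key = "codegraph" if name.startswith("mcp__codegraph") else name
                counts[key] += 1
    return counts


def tool_event(name):
    """Build one hypothetical assistant event containing a single tool_use block."""
    return json.dumps(
        {"type": "assistant", "message": {"content": [{"type": "tool_use", "name": name}]}}
    )


# Hypothetical log: two codegraph calls, one Read, and a final result event.
sample_log = [
    tool_event("mcp__codegraph__search"),
    tool_event("mcp__codegraph__search"),
    tool_event("Read"),
    json.dumps({"type": "result", "duration_ms": 25000}),
]

print(dict(count_tool_calls(sample_log)))  # {'codegraph': 2, 'Read': 1}
```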