codegraph-ab-matrix.md

CodeGraph A/B benchmark — with vs without, every language × S/M/L

Date: 2026-05-23 · Branch: architectural-improvements

A headless agent (Claude Opus, --permission-mode bypassPermissions) answers one canonical flow question per repo — twice: with the codegraph MCP server, and without any MCP (built-in Read/Grep/Glob/Bash only). Same model, same prompt; codegraph is the only variable. Each cell was re-indexed fresh first, so the "with" arm reflects the current resolvers.

Headline

Across 37 cells, codegraph cut total file reads from 158 to 40 (75% fewer), and it never increased reads in any cell. The mechanism: a few sub-millisecond codegraph calls replace a read-and-grep exploration. Token cost stays roughly flat, because codegraph calls trade for reads; the win is fewer tool calls and lower wall-clock, which is the design target.

The gap widens with repo size and flow complexity: on medium/large repos the without-codegraph arm often thrashes (many greps/globs, shell find/grep via Bash, and occasionally a spawned sub-agent), while the with-codegraph arm typically answers in 2–6 codegraph calls (8 at most). On tiny repos (a handful of files) the two arms tie or codegraph is marginally slower, because MCP/index overhead doesn't pay off when the whole flow fits in one or two files. Even there, reads drop or at worst tie.

How to read the table

R / G / Gl / B / Ag = Read / Grep / Glob / Bash / sub-agent (Task) tool calls.
cg-calls = codegraph MCP calls in the "with" arm (the trade for reads/greps).
dur = wall-clock seconds. files = indexed file count (the size proxy).
reads saved = without-reads − with-reads.
One run per arm (a snapshot — run-to-run variance is real; treat ±1–2 reads and ±10s as noise, look at the pattern across cells). 2-runs/arm headline numbers for several of these flows live in docs/design/dynamic-dispatch-coverage-playbook.md §7.
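The legend above can be applied mechanically. The following is a minimal sketch, assuming cells are written exactly as in the Results table (for example `13R / 0G / 4Gl / 21B / 1Ag`); the function names `parse_cell` and `reads_saved` are illustrative and not part of the benchmark tooling.

```python
import re

# Suffix -> tool name, following the legend above.
TOOLS = {"R": "Read", "G": "Grep", "Gl": "Glob", "B": "Bash", "Ag": "Agent"}


def parse_cell(cell: str) -> dict[str, int]:
    """Turn a cell such as '13R / 0G / 4Gl / 21B / 1Ag' into per-tool counts."""
    counts = {name: 0 for name in TOOLS.values()}
    for part in cell.split("/"):
        match = re.fullmatch(r"\s*(\d+)(R|Gl|G|B|Ag)\s*", part)
        if match is None:
            raise ValueError(f"unrecognised tool count: {part!r}")
        counts[TOOLS[match.group(2)]] += int(match.group(1))
    return counts


def reads_saved(with_cell: str, without_cell: str) -> int:
    """reads saved = without-reads minus with-reads."""
    return parse_cell(without_cell)["Read"] - parse_cell(with_cell)["Read"]


# The aspnet-jellyfin row from the Results table.
print(reads_saved("4R / 0G", "13R / 0G / 4Gl / 21B / 1Ag"))  # 9
```

Tools missing from a cell are counted as zero, which matches how the table omits unused tools.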

Results

Language	Size	Repo	files	with (R/G/Gl)	cg-calls	with dur	without (R/G/Gl/B/Ag)	without dur	reads saved
C	L	`c-redis`	884	0R / 4G	4	48s	4R / 9G / 1Gl	50s	4
C#	S	`aspnet-realworld`	78	0R / 0G	2	40s	2R / 1G / 2Gl	31s	2
C#	M	`aspnet-eshop`	262	0R / 0G	5	39s	6R / 2G / 3Gl / 1B	61s	6
C#	L	`aspnet-jellyfin`	2081	4R / 0G	2	61s	13R / 0G / 4Gl / 21B / 1Ag	132s	9
C++	M	`cpp-leveldb`	134	0R / 0G	3	40s	2R / 3G	52s	2
Dart	S	`flutter_module_books`	6	1R / 0G	2	37s	1R / 0G / 1Gl	20s	0
Dart	M	`compass_app`	212	2R / 0G	2	31s	3R / 1G / 3Gl	47s	1
Go	S	`gin-realworld`	21	2R / 1G	3	31s	4R / 0G / 1B	44s	2
Go	M	`gin-vueadmin`	625	0R / 0G	2	31s	3R / 3G / 2Gl	47s	3
Go	L	`gin-gitness`	4438	3R / 3G	4	52s	7R / 4G / 3Gl	60s	4
Java	S	`spring-realworld`	117	0R / 0G	4	31s	8R / 1G / 1Gl	50s	8
Java	M	`spring-mall`	536	1R / 0G	5	51s	5R / 0G / 4Gl	64s	4
Java	L	`spring-halo`	2444	0R / 1G	8	75s	9R / 5G / 8B	148s	9
Kotlin	S	`kotlin-petclinic`	43	1R / 0G	1	23s	3R / 0G / 2Gl	26s	2
Kotlin	M	`Jetcaster`	166	1R / 0G	3	36s	1R / 0G / 2Gl	34s	0
Lua	S	`lualine.nvim`	123	1R / 0G	4	48s	4R / 0G / 1Gl	45s	3
Lua	M	`telescope.nvim`	84	0R / 0G	2	33s	2R / 0G / 1Gl	26s	2
Luau	S	`Knit`	11	0R / 0G	4	36s	5R / 0G / 2Gl	57s	5
PHP	S	`laravel-realworld`	114	3R / 0G / 1Gl	2	41s	6R / 2G / 3Gl	38s	3
PHP	M	`laravel-firefly`	2047	4R / 4G	5	79s	5R / 3G / 3Gl / 2B	70s	1
PHP	L	`laravel-bookstack`	2160	0R / 1G	5	42s	3R / 2G / 2Gl	46s	3
Python	S	`django-realworld`	44	1R / 1G	2	30s	8R / 0G / 1Gl	35s	7
Python	M	`django-wagtail`	1672	3R / 0G	5	73s	7R / 5G / 2Gl / 1B	63s	4
Python	L	`django-saleor`	4429	1R / 2G	3	59s	6R / 5G / 2Gl / 1B	72s	5
Ruby	S	`rails-realworld`	59	0R / 0G	2	34s	4R / 0G / 3Gl	40s	4
Ruby	M	`rails-spree`	2905	1R / 2G	8	60s	3R / 4G / 3Gl	56s	2
Ruby	L	`rails-forem`	4658	3R / 1G	3	54s	3R / 2G / 1Gl	49s	0
Rust	S	`rust-axum-realworld`	13	1R / 0G	4	28s	3R / 1G / 1Gl	49s	2
Rust	M	`rust-actix-examples`	176	1R / 0G	5	42s	4R / 1G / 2B	35s	3
Rust	L	`rust-cratesio`	1053	0R / 0G	3	20s	1R / 2G	15s	1
Scala	S	`computer-database`	10	1R / 0G	4	47s	2R / 0G / 1B	28s	1
Swift	S	`vapor-template`	14	0R / 0G	1	16s	2R / 0G / 1Gl	22s	2
Swift	M	`vapor-steampress`	100	1R / 0G	8	53s	3R / 3G / 2B	57s	2
Swift	L	`vapor-spi`	542	2R / 0G	5	49s	2R / 3G / 2Gl	36s	0
TypeScript/JS	S	`express-realworld`	39	1R / 0G	1	16s	2R / 1G / 1Gl	27s	1
TypeScript/JS	M	`excalidraw`	643	0R / 0G	4	53s	9R / 7G	98s	9
TypeScript/JS	L	`nest-immich`	2759	1R / 1G	6	50s	3R / 1G / 2Gl	57s	2

Totals (37 cells): with codegraph 40 reads / 21 greps, without 158 reads / 71 greps — 75% fewer reads, ~70% fewer greps. Codegraph never increased reads in any cell, and the without-arm additionally ran shell find/grep (Bash) and a sub-agent that the with-arm never needed. (74 agent runs, ~$29 total.)
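The percentages follow directly from the totals above. A small sketch of the arithmetic, where `percent_fewer` is an illustrative helper name rather than anything in the benchmark scripts:

```python
def percent_fewer(without: int, with_cg: int) -> float:
    """Relative reduction of the with-codegraph arm against the without arm."""
    return 100.0 * (without - with_cg) / without


reads = percent_fewer(158, 40)  # total reads: without vs with
greps = percent_fewer(71, 21)   # total greps: without vs with
print(f"reads: {reads:.1f}% fewer, greps: {greps:.1f}% fewer")
# reads: 74.7% fewer, greps: 70.4% fewer
```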

Observations

Biggest wins by reads saved are mostly medium/large repos with a real route→handler→service flow: excalidraw (0R vs 9R/7G), spring-halo (0R vs 9R + 8 Bash), aspnet-jellyfin (4R vs 13R + 21 Bash + a spawned sub-agent), aspnet-eshop (0R vs 6R). Two small repos also rank here: spring-realworld (0R vs 8R) and django-realworld (1R vs 8R).
Without codegraph, large repos make the agent thrash: it falls back to shell find/grep (Bash) and on jellyfin even spawned a sub-agent, which is exactly the behavior codegraph is meant to prevent. The with-arm answers the large-repo cells in 2–8 codegraph calls (jellyfin in 2, spring-halo in 8).
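Tie zone (0 reads saved): flutter_module_books (6 files), Jetcaster (166), rails-forem (4658), vapor-spi (542). When the whole flow fits in 1–2 files, as in the tiny Dart repo, reading is already cheap; codegraph ties on reads and is somewhat slower (2–17s in these four cells; MCP + index overhead). Only the first of these repos is tiny by file count, so the tie appears to depend on how few files the flow touches rather than on repo size alone. This is broadly in line with the design note that codegraph's value scales with repo size.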
Duration tracks reads where the read gap is largest (jellyfin 61s vs 132s, spring-halo 75s vs 148s, excalidraw 53s vs 98s) and is mostly noise elsewhere, particularly on small repos.
Some "with" cells still read 3–4 files (jellyfin, gitness, laravel-firefly, forem). The residual is the documented frontier (anonymous handlers, deep service chains, dynamic finders): codegraph gets the agent to the right file, then the agent reads it to confirm a detail.
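To check how reads saved vary with size, the table rows can be bucketed by their S/M/L label. This is a rough sketch: `ROWS` holds only the six Java and Swift rows copied from the Results table, and `mean_reads_saved_by_size` is an illustrative name, not part of the benchmark tooling.

```python
from collections import defaultdict
from statistics import mean

# (size, repo, indexed files, with-reads, without-reads)
# A sample of rows copied from the Results table; extend to all 37 as needed.
ROWS = [
    ("S", "spring-realworld", 117, 0, 8),
    ("M", "spring-mall", 536, 1, 5),
    ("L", "spring-halo", 2444, 0, 9),
    ("S", "vapor-template", 14, 0, 2),
    ("M", "vapor-steampress", 100, 1, 3),
    ("L", "vapor-spi", 542, 2, 2),
]


def mean_reads_saved_by_size(rows):
    """Average (without-reads minus with-reads) for each size label."""
    buckets = defaultdict(list)
    for size, _repo, _files, with_reads, without_reads in rows:
        buckets[size].append(without_reads - with_reads)
    return {size: mean(saved) for size, saved in buckets.items()}


print(mean_reads_saved_by_size(ROWS))
```

With one run per arm, per-bucket averages over a handful of rows are noisy, so any such breakdown is best read alongside the per-cell pattern.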

Coverage note

All 14 README frameworks and every flow-relevant language are validated (see the playbook). The sizes here are by indexed file count; several languages lack a clean third size in the corpus (Dart/Kotlin/Lua = S/M, Scala/Luau = S only, C = L only, C++ = M only), and those cells are omitted rather than faked. One labeling exception: for Lua, lualine.nvim (S) indexes 123 files while telescope.nvim (M) indexes 84.

Reproduce

Driver + parser: `/tmp/ab-matrix/run.sh` (matrix of `lang|size|repo|question`) and `/tmp/ab-matrix/parse-matrix.mjs`. Each cell: `rm -rf .codegraph && codegraph init -i`, then `scripts/agent-eval/run-all.sh <repo> "<question>" headless` (with = codegraph-only MCP, without = empty MCP), parsed from the stream-json logs.
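The per-cell steps above can be sketched as a small command builder. This is not the actual `run.sh`; it only assembles the two commands quoted above from one `lang|size|repo|question` matrix line. The sample line and its question text are hypothetical, and the sketch does not decide which directory the re-index step runs in.

```python
import shlex


def cell_commands(matrix_line: str) -> list[str]:
    """Build the re-index and eval commands for one matrix line."""
    # Split on the first three pipes only, so the question may contain '|'.
    _lang, _size, repo, question = matrix_line.strip().split("|", 3)
    return [
        "rm -rf .codegraph && codegraph init -i",
        "scripts/agent-eval/run-all.sh "
        f"{shlex.quote(repo)} {shlex.quote(question)} headless",
    ]


# Hypothetical matrix line; the real questions live in /tmp/ab-matrix/run.sh.
for command in cell_commands("Go|S|gin-realworld|How does login work?"):
    print(command)
```

Quoting the question with `shlex.quote` keeps spaces and punctuation intact when the command is passed to a shell.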
