Skip to content

Latest commit

 

History

History
111 lines (95 loc) · 7.35 KB

File metadata and controls

111 lines (95 loc) · 7.35 KB

CodeGraph A/B benchmark — with vs without, every language × S/M/L

Date: 2026-05-23 · Branch: architectural-improvements

A headless agent (Claude Opus, --permission-mode bypassPermissions) answers one canonical flow question per repo — twice: with the codegraph MCP server, and without any MCP (built-in Read/Grep/Glob/Bash only). Same model, same prompt; codegraph is the only variable. Each cell was re-indexed fresh first, so the "with" arm reflects the current resolvers.

Headline

Across 37 cells, codegraph cut total file reads from 158 → 40 — 75% fewer. It never increased reads in any cell. The mechanism: a few sub-millisecond codegraph calls replace a read-and-grep exploration. Token cost stays roughly flat (codegraph calls trade for reads) — the win is fewer tool calls + lower wall-clock, which is the design target.

The gap widens with repo size and flow complexity: on medium/large repos the without-codegraph arm often thrashes — many greps/globs, shell find/grep (Bash), and occasionally spawning a sub-agent — while the with-codegraph arm answers in 2–6 calls. On tiny repos (a handful of files) the two arms tie or codegraph is marginally slower (MCP/index overhead doesn't pay off when the whole flow fits in one or two files) — but reads still drop.

How to read the table

  • R / G / Gl / B / Ag = Read / Grep / Glob / Bash / sub-agent (Task) tool calls.
  • cg-calls = codegraph MCP calls in the "with" arm (the trade for reads/greps).
  • dur = wall-clock seconds. files = indexed file count (the size proxy).
  • reads saved = without-reads − with-reads.
  • One run per arm (a snapshot — run-to-run variance is real; treat ±1–2 reads and ±10s as noise, look at the pattern across cells). 2-runs/arm headline numbers for several of these flows live in docs/design/dynamic-dispatch-coverage-playbook.md §7.

Results

Language Size Repo files with R/G cg-calls dur without R/G dur reads saved
C L c-redis 884 0R / 4G 4 48s 4R / 9G / 1Gl 50s 4
C# S aspnet-realworld 78 0R / 0G 2 40s 2R / 1G / 2Gl 31s 2
C# M aspnet-eshop 262 0R / 0G 5 39s 6R / 2G / 3Gl / 1B 61s 6
C# L aspnet-jellyfin 2081 4R / 0G 2 61s 13R / 0G / 4Gl / 21B / 1Ag 132s 9
C++ M cpp-leveldb 134 0R / 0G 3 40s 2R / 3G 52s 2
Dart S flutter_module_books 6 1R / 0G 2 37s 1R / 0G / 1Gl 20s 0
Dart M compass_app 212 2R / 0G 2 31s 3R / 1G / 3Gl 47s 1
Go S gin-realworld 21 2R / 1G 3 31s 4R / 0G / 1B 44s 2
Go M gin-vueadmin 625 0R / 0G 2 31s 3R / 3G / 2Gl 47s 3
Go L gin-gitness 4438 3R / 3G 4 52s 7R / 4G / 3Gl 60s 4
Java S spring-realworld 117 0R / 0G 4 31s 8R / 1G / 1Gl 50s 8
Java M spring-mall 536 1R / 0G 5 51s 5R / 0G / 4Gl 64s 4
Java L spring-halo 2444 0R / 1G 8 75s 9R / 5G / 8B 148s 9
Kotlin S kotlin-petclinic 43 1R / 0G 1 23s 3R / 0G / 2Gl 26s 2
Kotlin M Jetcaster 166 1R / 0G 3 36s 1R / 0G / 2Gl 34s 0
Lua S lualine.nvim 123 1R / 0G 4 48s 4R / 0G / 1Gl 45s 3
Lua M telescope.nvim 84 0R / 0G 2 33s 2R / 0G / 1Gl 26s 2
Luau S Knit 11 0R / 0G 4 36s 5R / 0G / 2Gl 57s 5
PHP S laravel-realworld 114 3R / 0G / 1Gl 2 41s 6R / 2G / 3Gl 38s 3
PHP M laravel-firefly 2047 4R / 4G 5 79s 5R / 3G / 3Gl / 2B 70s 1
PHP L laravel-bookstack 2160 0R / 1G 5 42s 3R / 2G / 2Gl 46s 3
Python S django-realworld 44 1R / 1G 2 30s 8R / 0G / 1Gl 35s 7
Python M django-wagtail 1672 3R / 0G 5 73s 7R / 5G / 2Gl / 1B 63s 4
Python L django-saleor 4429 1R / 2G 3 59s 6R / 5G / 2Gl / 1B 72s 5
Ruby S rails-realworld 59 0R / 0G 2 34s 4R / 0G / 3Gl 40s 4
Ruby M rails-spree 2905 1R / 2G 8 60s 3R / 4G / 3Gl 56s 2
Ruby L rails-forem 4658 3R / 1G 3 54s 3R / 2G / 1Gl 49s 0
Rust S rust-axum-realworld 13 1R / 0G 4 28s 3R / 1G / 1Gl 49s 2
Rust M rust-actix-examples 176 1R / 0G 5 42s 4R / 1G / 2B 35s 3
Rust L rust-cratesio 1053 0R / 0G 3 20s 1R / 2G 15s 1
Scala S computer-database 10 1R / 0G 4 47s 2R / 0G / 1B 28s 1
Swift S vapor-template 14 0R / 0G 1 16s 2R / 0G / 1Gl 22s 2
Swift M vapor-steampress 100 1R / 0G 8 53s 3R / 3G / 2B 57s 2
Swift L vapor-spi 542 2R / 0G 5 49s 2R / 3G / 2Gl 36s 0
TypeScript/JS S express-realworld 39 1R / 0G 1 16s 2R / 1G / 1Gl 27s 1
TypeScript/JS M excalidraw 643 0R / 0G 4 53s 9R / 7G 98s 9
TypeScript/JS L nest-immich 2759 1R / 1G 6 50s 3R / 1G / 2Gl 57s 2

Totals (37 cells): with codegraph 40 reads / 21 greps, without 158 reads / 71 greps75% fewer reads, ~70% fewer greps. Codegraph never increased reads in any cell, and the without-arm additionally ran shell find/grep (Bash) and a sub-agent that the with-arm never needed. (74 agent runs, ~$29 total.)

Observations

  • Biggest wins are medium/large backends with a real route→handler→service flow: excalidraw (0R vs 9R/7G), spring-halo (0R vs 9R + 8 Bash), spring-realworld (0R vs 8R), django-realworld (1R vs 8R), aspnet-jellyfin (4R vs 13R + 21 Bash + a spawned sub-agent), aspnet-eshop (0R vs 6R).
  • Without codegraph, large repos make the agent thrash: it falls back to shell find/grep (Bash) and on jellyfin even spawned a sub-agent — exactly the behavior codegraph is meant to prevent. The with-arm answers those in 2–6 codegraph calls.
  • Tie zone = tiny repos (Dart books 6 files, Kotlin Jetcaster, Ruby forem, Swift spi): the whole flow fits in 1–2 files, so reading is already cheap; codegraph ties on reads and is sometimes a few seconds slower (MCP + index overhead). This matches the design note that codegraph's value scales with repo size.
  • Duration tracks reads on the big repos (jellyfin 61s vs 132s, spring-halo 75s vs 148s, excalidraw 53s vs 98s) and is noise on small ones.
  • Some "with" cells still read 2–4 files (jellyfin, gitness, laravel-firefly, forem) — the residual is the documented frontier (anonymous handlers, deep service chains, dynamic finders); codegraph gets the agent to the right file, then it reads one to confirm a detail.

Coverage note

All 14 README frameworks and every flow-relevant language are validated (see the playbook). The sizes here are by indexed file count; a few languages lack a clean third size in the corpus (Dart/Kotlin = S/M, Scala/Luau = S only, C = L only, C++ = M only) — those cells are omitted rather than faked.

Reproduce

Driver + parser: /tmp/ab-matrix/run.sh (matrix of lang|size|repo|question) and /tmp/ab-matrix/parse-matrix.mjs. Each cell: rm -rf .codegraph && codegraph init -i, then scripts/agent-eval/run-all.sh <repo> "<question>" headless (with = codegraph-only MCP, without = empty MCP), parsed from the stream-json logs.