src: improve compile cache performance and size #63861

Open
anonrig wants to merge 3 commits into nodejs:main from anonrig:compile-cache-perf

@anonrig

anonrig commented Jun 11, 2026

Member

Improves the on-disk compile cache (NODE_COMPILE_CACHE / module.enableCompileCache()):

  • Read path: read cache files with a single exactly-sized read (using the file size from fstat) instead of an exponentially growing buffer, which previously cost O(log N) syscalls/allocations and ~2N bytes of copying per file.
  • Size: compress the cache content on disk with zstd (level 1, prioritizing speed since persistence happens at shutdown), falling back to raw storage when not compressible. Shrinks cache directories ~2-4x and makes the crc32 integrity check cheaper since it now runs over the compressed bytes. The magic number is bumped so files in the old format are discarded as cache misses and overwritten in place.
  • Consume path: hand the cache to V8 through a non-owning CachedData wrapper (BufferNotOwned) instead of copying the entire buffer on every cache hit. The underlying buffer is owned by the cache entry, which outlives the synchronous compilation (same pattern as the vm cached-data path in node_contextify.cc).
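
The "fall back to raw storage when not compressible" policy from the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/compile_cache.cc: the `Encoding` tag, `StoredBlob`, and the run-length `ToyCompress`/`ToyDecompress` pair are hypothetical stand-ins (the PR uses zstd at level 1), chosen so the sketch builds without external libraries.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical tag recording how the payload was stored on disk.
enum class Encoding : uint8_t { kRaw = 0, kCompressed = 1 };

struct StoredBlob {
  Encoding encoding;
  Bytes payload;
};

// Stand-in compressor (NOT zstd): encodes runs as (count, byte) pairs.
Bytes ToyCompress(const Bytes& in) {
  Bytes out;
  for (size_t i = 0; i < in.size();) {
    size_t run = 1;
    while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
    out.push_back(static_cast<uint8_t>(run));
    out.push_back(in[i]);
    i += run;
  }
  return out;
}

// Inverse of ToyCompress: expands each (count, byte) pair.
Bytes ToyDecompress(const Bytes& in) {
  Bytes out;
  for (size_t i = 0; i + 1 < in.size(); i += 2) {
    out.insert(out.end(), static_cast<size_t>(in[i]), in[i + 1]);
  }
  return out;
}

// Keep the compressed form only when it is strictly smaller than the input;
// otherwise store the bytes unchanged.
StoredBlob Encode(const Bytes& raw) {
  Bytes packed = ToyCompress(raw);
  if (packed.size() < raw.size()) {
    return {Encoding::kCompressed, std::move(packed)};
  }
  return {Encoding::kRaw, raw};
}

// Reverses Encode() by dispatching on the stored tag.
Bytes Decode(const StoredBlob& blob) {
  if (blob.encoding == Encoding::kCompressed) {
    return ToyDecompress(blob.payload);
  }
  return blob.payload;
}
```

With this shape, incompressible input never grows on disk, and the reader needs only the tag to know which path to take.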

Corrupted cache files keep degrading to silent cache misses and are regenerated; a corrupted size header can no longer cause an oversized allocation since the zstd frame content size is cross-checked first. Added test/parallel/test-compile-cache-corrupted.js covering bad magic, truncation, content bit-flips, and header corruption.
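
As a rough sketch of how corruption can degrade to a cache miss, the checks below reject bad magic, truncation, a wrong size field, and payload bit-flips before any payload is copied. The 12-byte header layout, the `kMagic` value, and the helper names are assumptions made for illustration; the real format is private to src/compile_cache.cc, and the PR's cross-check against the zstd frame content size is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

using Bytes = std::vector<uint8_t>;

constexpr uint32_t kMagic = 0x43434832;  // hypothetical value
constexpr size_t kHeaderSize = 12;       // magic, payload size, crc32 (4 bytes each)

// Bitwise CRC-32 (reflected polynomial 0xEDB88320).
uint32_t Crc32(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Host byte order is used throughout; this sketch assumes a local-only cache.
uint32_t ReadU32(const uint8_t* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

void WriteU32(Bytes* out, uint32_t v) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
  out->insert(out->end(), p, p + sizeof(v));
}

// Serializes header + payload in the hypothetical layout.
Bytes BuildFile(const Bytes& payload) {
  Bytes file;
  WriteU32(&file, kMagic);
  WriteU32(&file, static_cast<uint32_t>(payload.size()));
  WriteU32(&file, Crc32(payload.data(), payload.size()));
  file.insert(file.end(), payload.begin(), payload.end());
  return file;
}

// Returns the payload, or nullopt (treated as a cache miss) on any mismatch.
// The size field is compared with the bytes actually present before anything
// is allocated, so a corrupted size cannot trigger a large allocation.
std::optional<Bytes> ParseFile(const Bytes& file) {
  if (file.size() < kHeaderSize) return std::nullopt;        // truncated header
  if (ReadU32(file.data()) != kMagic) return std::nullopt;   // old or foreign format
  const uint32_t size = ReadU32(file.data() + 4);
  if (size != file.size() - kHeaderSize) return std::nullopt;  // truncated or bad size
  const uint8_t* payload = file.data() + kHeaderSize;
  if (Crc32(payload, size) != ReadU32(file.data() + 8)) return std::nullopt;  // bit flip
  return Bytes(payload, payload + size);
}
```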

No public API or documented behavior changes; the file format is private to src/compile_cache.cc. Benchmark numbers (cache size and warm-startup timings) to follow in a comment.

This change was developed with AI assistance (see Co-authored-by trailer).

nodejs-github-bot added labels on Jun 11, 2026:

  • c++: Issues and PRs that require attention from people who are familiar with C++.
  • lib / src: Issues and PRs related to general changes in the lib or src directory.
  • needs-ci: PRs that need a full CI run.

@nodejs-github-bot

Collaborator

Review requested:

  • @nodejs/loaders
  • @nodejs/vm

@anonrig

anonrig commented Jun 11, 2026

Member Author

Verification results (macOS arm64, release build, vs. baseline at the merge-base):

Tests: all 23 parallel/test-compile-cache* pass (22 existing unchanged + the new corruption test).

On-disk size

| Scenario | Baseline | This PR | Reduction |
| --- | --- | --- | --- |
| 10 MB snapshot/typescript.js fixture (single big CJS file) | 1,818,380 B | 752,924 B | 2.42x |
| 300 small/medium CJS modules | 274,316 B | 179,915 B | 1.52x |

Warm-start, self-controlled (same binary, cache on vs. off, interleaved 20 runs, trimmed means — avoids binary-layout skew between builds):

| Scenario | Cache benefit (baseline) | Cache benefit (this PR) |
| --- | --- | --- |
| big-file | +26.4 ms | +25.0 ms |
| many-modules | +1.7 ms | +1.5 ms |
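
For readers unfamiliar with the metric, a "cache benefit" based on trimmed means could be computed along these lines. This is only a sketch: the number of samples dropped from each end (`drop_each_end`) and the function names are assumptions, since the comment does not state how the trimming was done.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of the samples after discarding `drop_each_end` values from both the
// low and the high end (hypothetical trimming rule).
double TrimmedMean(std::vector<double> samples, size_t drop_each_end) {
  if (samples.empty()) return 0.0;
  std::sort(samples.begin(), samples.end());
  // Never trim away everything: keep at least one sample.
  size_t drop = std::min(drop_each_end, (samples.size() - 1) / 2);
  auto first = samples.begin() + static_cast<std::ptrdiff_t>(drop);
  auto last = samples.end() - static_cast<std::ptrdiff_t>(drop);
  return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}

// Benefit in milliseconds: time without the cache minus time with the cache,
// each summarized by its trimmed mean. Positive means the cache helps.
double CacheBenefitMs(std::vector<double> cache_off_ms,
                      std::vector<double> cache_on_ms,
                      size_t drop_each_end) {
  return TrimmedMean(std::move(cache_off_ms), drop_each_end) -
         TrimmedMean(std::move(cache_on_ms), drop_each_end);
}
```

Trimming removes the occasional outlier run (for example, one slowed by unrelated system activity) so that a difference of a millisecond or two is not dominated by a single sample.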

Warm-start with a hot page cache is neutral within noise (~1 ms on the pathological single-10 MB-file case, which is dominated by one large zstd decompress; with cold page cache or slower storage the 2.4x smaller read wins). Cold-start adds one-time compression at persist (~35 ms for the 1.8 MB blob at level 1, proportionally less for typical files).

The second commit reuses zstd contexts (one ZSTD_DCtx on the handler, one ZSTD_CCtx across Persist()), which removed most of the per-file decompression overhead observed with one-shot contexts on the many-modules scenario.
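
The context-reuse idea can be illustrated with a lazily created, long-lived context. In this sketch `ScratchContext` is a hypothetical stand-in for an expensive-to-create codec context such as a ZSTD_DCtx, and `CacheHandler`, `ReadEntry`, and the creation counter exist only to make the pattern observable; none of these names come from the PR.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Counts how many contexts were constructed (for illustration only).
int g_contexts_created = 0;

// Stand-in for a codec context whose setup cost dominates for small inputs.
struct ScratchContext {
  std::vector<char> workspace;
  ScratchContext() : workspace(1 << 16) { ++g_contexts_created; }
};

class CacheHandler {
 public:
  // Handles one cache file. The context is created on first use and then
  // reused for every later call instead of being rebuilt per file.
  size_t ReadEntry(const std::string& bytes) {
    if (!ctx_) ctx_ = std::make_unique<ScratchContext>();
    // A real implementation would decompress `bytes` with ctx_ here.
    return bytes.size();
  }

 private:
  std::unique_ptr<ScratchContext> ctx_;  // lazily created, owned by the handler
};
```

A handler that never reads a cache file pays nothing, while one that reads hundreds of small files pays the setup cost once.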

Improve the compile cache by:

- Reading cache files with a single exactly-sized read using the file
  size from fstat instead of reading into an exponentially growing
  buffer, which previously cost O(log N) syscalls and allocations and
  about 2N bytes of copying per file.
- Compressing the cache content on disk with zstd at level 1, falling
  back to raw storage when the data is not compressible. This shrinks
  cache directories by about 2-4x. The magic number is bumped so that
  files in the old format are discarded as cache misses and then
  overwritten in place.
- Handing the cache to V8 through a non-owning CachedData wrapper
  instead of copying the whole buffer on every cache hit.

Corrupted cache files keep degrading to silent cache misses and are
regenerated, now covered by a regression test.

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
@anonrig anonrig force-pushed the compile-cache-perf branch from 816f4ef to 0cf8818 Compare June 11, 2026 22:38
Creating and freeing a zstd context for every cache file costs more
than the (de)compression itself for small caches. Lazily create one
decompression context on the handler and reuse it across reads, and
share one compression context across all entries in Persist().

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
@anonrig anonrig force-pushed the compile-cache-perf branch from 0cf8818 to f427fb3 Compare June 11, 2026 22:40
lemire added a commit to lemire/node that referenced this pull request Jun 12, 2026
… on nodejs#63861)

- Add src/compile_cache_zstd.dict (48 KiB trained on 5.7k objective V8
  code cache samples harvested via vm.Script from test/parallel + fixtures
  + benchmarks; the 300 small benchmark measurement set and the big TS
  fixture raw are held completely out of training).
- Generate and include src/compile_cache_zstd_dict.h (the embeddable form).
- Add tools/generate-compile-cache-dict.js (run after updating the .dict).
- Wire always-on use of the prepared CDict/DDict in Persist() (pick best of
  plain vs dict-assisted per entry) and ReadCacheFile() (decompress_usingDDict).
- Reuses the ctxs and "only if smaller than raw" policy from Yagiz's PR.
- Yes, the dictionary is joined/embedded in the node binary (tiny, must be
  available with no extra FS for portable/early/restricted cache use).
- Measurements on the benchmark scenarios (big TS fixture + 300 held-out
  small/medium objective samples) with the representative dict:
    small/medium: raw -> plain-zstd-l1 2.13x -> +dict 3.06x (1.44x further
                  win over the zstd already in nodejs#63861).
    big: ~2.69x plain (dict neutral/slightly worse, still >> raw; we take
         the min so big stays optimal).
- See the investigation notes in the branch for full details and reproduction.

Co-authored-by: Grok (investigation + prototype)
PR on top of nodejs#63861
@joyeecheung

joyeecheung commented Jun 12, 2026

Member

It seems the performance gains are within noise, and it's mostly the compression that changes the size? Can you split them into different PRs and measure them individually?

  1. I am not sure the read changes actually give any wins for small files: in the happy path where the cache is small, it's better to just read once and resize rather than stat and read (two file system calls instead of one). There is also a TOCTOU risk in doing stat, and fstat is not reliable across platforms, so the loop condition should not be gated on the fstat result; we must always read until EOF is reached in case the size is not accurate.
  2. It doesn't appear that compression alone does much for performance; it might actually hurt and just be compensated by the other changes. In that case it's better to make it configurable and let users choose.
  3. Avoiding the copy would be better, and there is precedent in the builtin caches where it actually helped. I suspect this is the only change that actually helps performance while the other two may not. Hence it's better to split and measure individually.
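
One way to reconcile the first point with an fstat-based read is to treat the reported size purely as a capacity hint and still loop until EOF. The sketch below uses plain POSIX calls for brevity, which is an assumption rather than how src/compile_cache.cc performs file I/O; the function name and the 4096-byte minimum are likewise hypothetical.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Reads a whole file. The fstat() size is only a capacity hint; the loop ends
// when read() reports EOF, so a stale or inaccurate size is still handled.
std::optional<std::vector<uint8_t>> ReadWholeFile(const char* path) {
  const int fd = open(path, O_RDONLY);
  if (fd < 0) return std::nullopt;

  size_t hint = 0;
  struct stat st;
  if (fstat(fd, &st) == 0 && st.st_size > 0) {
    hint = static_cast<size_t>(st.st_size);
  }
  // One spare byte lets an accurate hint reach EOF without regrowing.
  std::vector<uint8_t> buf(std::max<size_t>(hint + 1, 4096));
  size_t used = 0;
  for (;;) {
    if (used == buf.size()) buf.resize(buf.size() * 2);  // hint was too small
    const ssize_t n = read(fd, buf.data() + used, buf.size() - used);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry
      close(fd);
      return std::nullopt;
    }
    if (n == 0) break;  // EOF is the only normal exit
    used += static_cast<size_t>(n);
  }
  close(fd);
  buf.resize(used);
  return buf;
}
```

When the hint is accurate, the buffer is allocated once and the final zero-length read only confirms EOF; when it is wrong, the loop falls back to the growing-buffer behavior.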

lemire added a commit to lemire/node that referenced this pull request Jun 12, 2026
Builds on the zstd compression in nodejs#63861 by embedding a small zstd
dictionary trained on a diverse corpus of real modules, so each
small/medium compile-cache entry compresses better. Per entry we keep
the smaller of the plain and dictionary-assisted frame, so the
dictionary only ever helps.

- Add src/compile_cache_zstd.dict (16 KiB). It is trained on V8 code
  caches harvested (via vm.compileFunction, the same shape the CJS
  loader produces) from a diverse corpus: bundled npm packages, lib/,
  tools/ and a few deps.
- Add tools/generate_compile_cache_dict.py and a node.gyp action that
  generates compile_cache_zstd_dict.h into SHARED_INTERMEDIATE_DIR at
  build time; no generated header is checked in. libnode include_dirs
  updated to pick it up.
- Prepare the CDict/DDict once per process (shared across all handlers
  and Workers, matching the lazy-context approach from nodejs#63861) and use
  them in Persist() and ReadCacheFile(). Persist() compresses the plain
  and dict frames into separate buffers and selects the smaller, so the
  written bytes and recorded size always agree. The dictionary is only
  tried for entries up to 256 KiB; larger blobs never benefit, so the
  second compression is skipped to avoid wasted work. Falls back to
  plain zstd if dictionary preparation fails.
- The dictionary is embedded in the binary because the compile cache
  must be usable early, portably, and without extra filesystem state.
- No on-disk format change: dict-assisted frames carry the dictID, plain
  frames carry none, and a single DDict decompresses both.
- Size, measured on data held out from training (per-entry min policy):
  diverse modules go from ~1.87x (plain zstd) to ~2.44x with the
  dictionary (~24% smaller on disk); on test/parallel, which is not in
  the training corpus at all, ~1.74x -> ~2.22x (~22% smaller). A real
  end-to-end run (npm --version, ~70 modules) is ~15% smaller. Read
  time is unchanged and the extra write-time work is negligible.
- Add a multi-module write/read roundtrip test and a startup benchmark
  (standard createBenchmark harness) plus the many-modules fixture list.
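
The per-entry selection described above (keep the smaller of the plain and dictionary-assisted frames, and skip the dictionary attempt above 256 KiB) might look roughly like this. The compressors are passed in as callbacks so the sketch does not depend on zstd; `Compressor`, `Frame`, `SelectFrame`, and `kDictMaxInput` are hypothetical names, and only the 256 KiB threshold is taken from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;
using Compressor = std::function<Bytes(const Bytes&)>;

// Inputs larger than this skip the dictionary attempt entirely.
constexpr size_t kDictMaxInput = 256 * 1024;

struct Frame {
  bool used_dict;
  Bytes bytes;
};

// Compresses `input` with the plain compressor and, for small enough inputs,
// also with the dictionary-assisted one. Each candidate lives in its own
// buffer, so the returned bytes and their size always agree.
Frame SelectFrame(const Bytes& input,
                  const Compressor& plain,
                  const Compressor& with_dict) {
  Frame best{false, plain(input)};
  if (input.size() <= kDictMaxInput) {
    Bytes candidate = with_dict(input);
    if (candidate.size() < best.bytes.size()) {
      best = Frame{true, std::move(candidate)};
    }
  }
  return best;
}
```

Because the dictionary frame is kept only when it is strictly smaller, enabling the dictionary can never make an entry larger than plain compression would.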

4 participants