src: improve compile cache performance and size #63861
Verification results (macOS arm64, release build, vs. baseline at the merge-base): Tests: all 23 On-disk size
Warm-start, self-controlled (same binary, cache on vs. off, interleaved 20 runs, trimmed means — avoids binary-layout skew between builds):
Warm-start with a hot page cache is neutral within noise (~1 ms on the pathological single-10 MB-file case, which is dominated by one large zstd decompress; with cold page cache or slower storage the 2.4x smaller read wins). Cold-start adds one-time compression at persist (~35 ms for the 1.8 MB blob at level 1, proportionally less for typical files). The second commit reuses zstd contexts (one |
Improve the compile cache by:

- Reading cache files with a single exactly-sized read using the file size from fstat instead of reading into an exponentially growing buffer, which previously cost O(log N) syscalls and allocations and about 2N bytes of copying per file.
- Compressing the cache content on disk with zstd at level 1, falling back to raw storage when the data is not compressible. This shrinks cache directories by about 2-4x. The magic number is bumped so that files in the old format are discarded as cache misses and then overwritten in place.
- Handing the cache to V8 through a non-owning CachedData wrapper instead of copying the whole buffer on every cache hit.

Corrupted cache files keep degrading to silent cache misses and are regenerated, now covered by a regression test.

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
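The first item above (one exactly-sized read driven by fstat) can be sketched roughly as follows. This is a minimal illustration using plain POSIX calls; `ReadFileExact` is a hypothetical helper name and not the function in src/compile_cache.cc, which may go through different I/O calls.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (assumption, not Node's code): read a whole file into
// a buffer that is allocated once, sized from fstat().
std::optional<std::vector<uint8_t>> ReadFileExact(const std::string& path) {
  int fd = open(path.c_str(), O_RDONLY);
  if (fd < 0) return std::nullopt;

  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size < 0) {
    close(fd);
    return std::nullopt;
  }

  // One allocation of exactly the reported size, no geometric regrowth.
  std::vector<uint8_t> buffer(static_cast<size_t>(st.st_size));
  size_t total = 0;
  // In the common case the first read() returns everything; the loop only
  // covers short reads and EINTR.
  while (total < buffer.size()) {
    ssize_t n = read(fd, buffer.data() + total, buffer.size() - total);
    if (n < 0) {
      if (errno == EINTR) continue;
      close(fd);
      return std::nullopt;
    }
    if (n == 0) break;  // File shrank after fstat(); keep what was read.
    total += static_cast<size_t>(n);
  }
  close(fd);
  assert(total <= buffer.size());
  buffer.resize(total);
  return buffer;
}
```

Compared with doubling a buffer until the file fits, this avoids the repeated reallocation and copying that the commit message estimates at about 2N bytes per file.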
816f4ef to 0cf8818
Creating and freeing a zstd context for every cache file costs more than the (de)compression itself for small caches. Lazily create one decompression context on the handler and reuse it across reads, and share one compression context across all entries in Persist().

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
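The lazy-creation pattern described in this commit might look roughly like the sketch below. `FakeContext` is a stand-in for an expensive (de)compression context and `CacheHandler` is a hypothetical class; neither name comes from the PR, and no real zstd call is made here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in (assumption) for a costly-to-create decompression context.
struct FakeContext {
  static inline int created = 0;  // Counts constructions, for illustration.
  FakeContext() { ++created; }
  // Placeholder work: pretend the output is as large as the input.
  size_t Decompress(size_t input_size) { return input_size; }
};

// Hypothetical handler: the context is created on the first read and then
// reused for every later read instead of being rebuilt per file.
class CacheHandler {
 public:
  size_t ReadEntry(size_t input_size) {
    return context().Decompress(input_size);
  }

 private:
  FakeContext& context() {
    if (!context_) context_ = std::make_unique<FakeContext>();
    assert(context_ != nullptr);
    return *context_;
  }

  std::unique_ptr<FakeContext> context_;  // Null until the first read.
};
```

A handler that never reads a cache entry pays nothing, and one that reads many entries pays the setup cost once.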
0cf8818 to f427fb3
… on nodejs#63861)

- Add src/compile_cache_zstd.dict (48 KiB trained on 5.7k objective V8 code cache samples harvested via vm.Script from test/parallel + fixtures + benchmarks; the 300 small benchmark measurement set and the big TS fixture raw are held completely out of training).
- Generate and include src/compile_cache_zstd_dict.h (the embeddable form).
- Add tools/generate-compile-cache-dict.js (run after updating the .dict).
- Wire always-on use of the prepared CDict/DDict in Persist() (pick best of plain vs dict-assisted per entry) and ReadCacheFile() (decompress_usingDDict).
- Reuses the ctxs and "only if smaller than raw" policy from Yagiz's PR.
- Yes, the dictionary is joined/embedded in the node binary (tiny, must be available with no extra FS for portable/early/restricted cache use).
- Measurements on the benchmark scenarios (big TS fixture + 300 held-out small/medium objective samples) with the representative dict:
  - small/medium: raw -> plain-zstd-l1 2.13x -> +dict 3.06x (1.44x further win over the zstd already in nodejs#63861).
  - big: ~2.69x plain (dict neutral/slightly worse, still >> raw; we take the min so big stays optimal).
- See the investigation notes in the branch for full details and reproduction.

Co-authored-by: Grok (investigation + prototype)

PR on top of nodejs#63861
It seems the performance gains are within noise, and it's mostly only the compression that changes the size? Can you split these into separate PRs and measure them individually?
Builds on the zstd compression in nodejs#63861 by embedding a small zstd dictionary trained on a diverse corpus of real modules, so each small/medium compile-cache entry compresses better. Per entry we keep the smaller of the plain and dictionary-assisted frame, so the dictionary only ever helps.

- Add src/compile_cache_zstd.dict (16 KiB). It is trained on V8 code caches harvested (via vm.compileFunction, the same shape the CJS loader produces) from a diverse corpus: bundled npm packages, lib/, tools/ and a few deps.
- Add tools/generate_compile_cache_dict.py and a node.gyp action that generates compile_cache_zstd_dict.h into SHARED_INTERMEDIATE_DIR at build time; no generated header is checked in. libnode include_dirs updated to pick it up.
- Prepare the CDict/DDict once per process (shared across all handlers and Workers, matching the lazy-context approach from nodejs#63861) and use them in Persist() and ReadCacheFile(). Persist() compresses the plain and dict frames into separate buffers and selects the smaller, so the written bytes and recorded size always agree. The dictionary is only tried for entries up to 256 KiB; larger blobs never benefit, so the second compression is skipped to avoid wasted work. Falls back to plain zstd if dictionary preparation fails.
- The dictionary is embedded in the binary because the compile cache must be usable early, portably, and without extra filesystem state.
- No on-disk format change: dict-assisted frames carry the dictID, plain frames carry none, and a single DDict decompresses both.
- Size, measured on data held out from training (per-entry min policy): diverse modules go from ~1.87x (plain zstd) to ~2.44x with the dictionary (~24% smaller on disk); on test/parallel, which is not in the training corpus at all, ~1.74x -> ~2.22x (~22% smaller). A real end-to-end run (npm --version, ~70 modules) is ~15% smaller. Read time is unchanged and the extra write-time work is negligible.
- Add a multi-module write/read roundtrip test and a startup benchmark (standard createBenchmark harness) plus the many-modules fixture list.
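The per-entry "keep the smallest" policy, combined with the 256 KiB cutoff for the dictionary attempt and the raw fallback from nodejs#63861, can be sketched as below. The compressors are passed in as callbacks so the example stays independent of zstd; `ChooseSmallest`, `Stored`, and `Encoding` are hypothetical names. The `Encoding` tag exists only to make the choice visible here; per the commit message, the real format needs no extra field to tell plain frames from dictionary frames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;
using Compressor = std::function<Bytes(const Bytes&)>;

// Illustration-only tag (assumption); not part of the on-disk format.
enum class Encoding { kRaw, kPlain, kDict };

struct Stored {
  Encoding encoding;
  Bytes bytes;
};

// Cutoff stated in the commit message: larger entries skip the dictionary.
constexpr size_t kDictMaxInput = 256 * 1024;

// Hypothetical selection routine: start from raw, then keep whichever
// candidate is strictly smaller.
Stored ChooseSmallest(const Bytes& raw,
                      const Compressor& plain,
                      const Compressor& with_dict) {
  Stored best{Encoding::kRaw, raw};

  Bytes plain_frame = plain(raw);
  if (plain_frame.size() < best.bytes.size()) {
    best = Stored{Encoding::kPlain, std::move(plain_frame)};
  }

  // The second compression only runs for entries within the cutoff.
  if (raw.size() <= kDictMaxInput) {
    Bytes dict_frame = with_dict(raw);
    if (dict_frame.size() < best.bytes.size()) {
      best = Stored{Encoding::kDict, std::move(dict_frame)};
    }
  }

  assert(best.bytes.size() <= raw.size());
  return best;
}
```

Because each candidate lives in its own buffer and the winner is moved into the result, the bytes written and the size recorded for them come from the same object, which is the property the commit message calls out.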
Improves the on-disk compile cache (NODE_COMPILE_CACHE / module.enableCompileCache()):

- Reads cache files with a single exactly-sized read (using the file size from fstat) instead of an exponentially growing buffer, which previously cost O(log N) syscalls/allocations and ~2N bytes of copying per file.
- Compresses the cache content on disk with zstd at level 1, falling back to raw storage when the data is not compressible.
- Hands the cache to V8 through a non-owning CachedData wrapper (BufferNotOwned) instead of copying the entire buffer on every cache hit. The underlying buffer is owned by the cache entry, which outlives the synchronous compilation (same pattern as the vm cached-data path in node_contextify.cc).

Corrupted cache files keep degrading to silent cache misses and are regenerated; a corrupted size header can no longer cause an oversized allocation since the zstd frame content size is cross-checked first. Added test/parallel/test-compile-cache-corrupted.js covering bad magic, truncation, content bit-flips, and header corruption.

No public API or documented behavior changes; the file format is private to src/compile_cache.cc. Benchmark numbers (cache size and warm-startup timings) to follow in a comment.

This change was developed with AI assistance (see Co-authored-by trailer).
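The size cross-check mentioned above can be illustrated with the sketch below. The header layout, the magic value, and the name `AllocateOutput` are all assumptions made for the example, since the real format is private to src/compile_cache.cc. With zstd, the frame's declared size would typically be obtained from ZSTD_getFrameContentSize(); here it is passed in as a parameter so the example needs no zstd dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical header (assumption); not the layout Node.js uses.
struct Header {
  uint32_t magic;
  uint64_t uncompressed_size;  // Size recorded when the entry was written.
};

// Placeholder value, not Node's actual magic number.
constexpr uint32_t kExpectedMagic = 0xC0DECAFE;

// Returns an output buffer only if the header agrees with the size that the
// compressed frame itself declares. Any mismatch is treated as a cache miss,
// so a corrupted header alone cannot trigger a huge allocation.
std::optional<std::vector<uint8_t>> AllocateOutput(
    const Header& header, std::optional<uint64_t> frame_content_size) {
  if (header.magic != kExpectedMagic) return std::nullopt;   // Old or foreign format.
  if (!frame_content_size.has_value()) return std::nullopt;  // Frame size unknown.
  if (*frame_content_size != header.uncompressed_size) {
    return std::nullopt;  // Header and frame disagree: treat as corrupted.
  }
  std::vector<uint8_t> out(static_cast<size_t>(header.uncompressed_size));
  assert(out.size() == header.uncompressed_size);
  return out;
}
```

Returning an empty optional on every mismatch mirrors the behavior the description asks for: corruption degrades to a silent cache miss and the entry is regenerated.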