Optimize bit iteration and validity checking with SIMD popcount#8636
Optimize bit iteration and validity checking with SIMD popcount#8636robert3005 wants to merge 1 commit into
Conversation
Merging this PR will degrade performance by 14.66%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | slice_empty_vortex |
339.4 ns | 397.8 ns | -14.66% |
| 🆕 | Simulation | execute_mask_then_value[1024] |
N/A | 8.9 µs | N/A |
| 🆕 | Simulation | execute_mask_then_value[256] |
N/A | 12.8 µs | N/A |
| 🆕 | Simulation | is_valid_per_element[1024] |
N/A | 186.6 µs | N/A |
| 🆕 | Simulation | is_valid_per_element[256] |
N/A | 48.6 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=16384_run=16_density=0.1] |
N/A | 70.2 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=16384_run=16_density=0.5] |
N/A | 69.4 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=16384_run=16_density=0.9] |
N/A | 69 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=4096_run=16_density=0.1] |
N/A | 67.9 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=4096_run=16_density=0.5] |
N/A | 22.9 µs | N/A |
| 🆕 | Simulation | filter_run_end[len=4096_run=16_density=0.9] |
N/A | 22.9 µs | N/A |
| 🆕 | Simulation | many_indices[(16384, 0.1)] |
N/A | 16.3 µs | N/A |
| 🆕 | Simulation | many_indices[(16384, 0.9)] |
N/A | 16.3 µs | N/A |
| 🆕 | Simulation | slice_bounds[(16384, 0.1)] |
N/A | 3 µs | N/A |
| 🆕 | Simulation | slice_bounds[(16384, 0.9)] |
N/A | 3 µs | N/A |
| 🆕 | Simulation | bit_iterator[(16384, 0.01)] |
N/A | 64.3 µs | N/A |
| 🆕 | Simulation | bit_iterator[(16384, 0.1)] |
N/A | 73.3 µs | N/A |
| 🆕 | Simulation | bit_iterator[(16384, 0.5)] |
N/A | 113.7 µs | N/A |
| 🆕 | Simulation | bit_iterator[(16384, 0.9)] |
N/A | 153.9 µs | N/A |
| 🆕 | Simulation | bit_iterator[(16384, 0.99)] |
N/A | 160.8 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/bit-iterator-perf-0IMQ5 (7609b8f) with develop (890704f)
Footnotes
-
4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at a time via `BitIterator::next`, accumulating a running valid count. The pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the cost was a bit-by-bit scan of the entire prefix up to `slice_stop`. Replace the per-bit walk with an incremental SIMD popcount over each gap between consecutive indices. Total scanned bits are unchanged, but the work now goes through the vectorized `count_ones` path. Add `BitBuffer::count_range(start, end)`, which counts set bits in a range directly over the backing buffer without cloning a sliced `BitBuffer`, so the many-indices case has no per-gap allocation overhead. Benchmark (vortex-mask/benches/valid_counts.rs, median): slice_bounds 16384 8.7us -> 51ns (~170x) slice_bounds 262144 138us -> 310ns (~445x) slice_bounds 1M 554us -> 1.86us (~300x) many_indices 16384 10.1us -> 1.68us (~6x) many_indices 1M 613us -> 3.33us (~185x) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Robert Kruszewski <github@robertk.io>
839b365 to
7609b8f
Compare
recreation of #8261 which had real performance benefits for iterating valid indices and had guidelines for future agents performance work