Skip to content

Optimize bit iteration and validity checking with SIMD popcount#8636

Open
robert3005 wants to merge 1 commit into
developfrom
claude/bit-iterator-perf-0IMQ5
Open

Optimize bit iteration and validity checking with SIMD popcount#8636
robert3005 wants to merge 1 commit into
developfrom
claude/bit-iterator-perf-0IMQ5

Conversation

@robert3005

Copy link
Copy Markdown
Contributor

recreation of #8261 which had real performance benefits for iterating valid indices and had guidelines for future agents performance work

@robert3005 robert3005 requested a review from a team July 1, 2026 15:11
@codspeed-hq

codspeed-hq Bot commented Jul 1, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 14.66%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 1 regressed benchmark
✅ 1594 untouched benchmarks
🆕 48 new benchmarks
⏩ 4 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation slice_empty_vortex 339.4 ns 397.8 ns -14.66%
🆕 Simulation execute_mask_then_value[1024] N/A 8.9 µs N/A
🆕 Simulation execute_mask_then_value[256] N/A 12.8 µs N/A
🆕 Simulation is_valid_per_element[1024] N/A 186.6 µs N/A
🆕 Simulation is_valid_per_element[256] N/A 48.6 µs N/A
🆕 Simulation filter_run_end[len=16384_run=16_density=0.1] N/A 70.2 µs N/A
🆕 Simulation filter_run_end[len=16384_run=16_density=0.5] N/A 69.4 µs N/A
🆕 Simulation filter_run_end[len=16384_run=16_density=0.9] N/A 69 µs N/A
🆕 Simulation filter_run_end[len=4096_run=16_density=0.1] N/A 67.9 µs N/A
🆕 Simulation filter_run_end[len=4096_run=16_density=0.5] N/A 22.9 µs N/A
🆕 Simulation filter_run_end[len=4096_run=16_density=0.9] N/A 22.9 µs N/A
🆕 Simulation many_indices[(16384, 0.1)] N/A 16.3 µs N/A
🆕 Simulation many_indices[(16384, 0.9)] N/A 16.3 µs N/A
🆕 Simulation slice_bounds[(16384, 0.1)] N/A 3 µs N/A
🆕 Simulation slice_bounds[(16384, 0.9)] N/A 3 µs N/A
🆕 Simulation bit_iterator[(16384, 0.01)] N/A 64.3 µs N/A
🆕 Simulation bit_iterator[(16384, 0.1)] N/A 73.3 µs N/A
🆕 Simulation bit_iterator[(16384, 0.5)] N/A 113.7 µs N/A
🆕 Simulation bit_iterator[(16384, 0.9)] N/A 153.9 µs N/A
🆕 Simulation bit_iterator[(16384, 0.99)] N/A 160.8 µs N/A
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/bit-iterator-perf-0IMQ5 (7609b8f) with develop (890704f)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@robert3005 robert3005 added the changelog/performance A performance improvement label Jul 1, 2026
`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at
a time via `BitIterator::next`, accumulating a running valid count. The
pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the
cost was a bit-by-bit scan of the entire prefix up to `slice_stop`.

Replace the per-bit walk with an incremental SIMD popcount over each gap
between consecutive indices. Total scanned bits are unchanged, but the
work now goes through the vectorized `count_ones` path.

Add `BitBuffer::count_range(start, end)`, which counts set bits in a range
directly over the backing buffer without cloning a sliced `BitBuffer`, so
the many-indices case has no per-gap allocation overhead.

Benchmark (vortex-mask/benches/valid_counts.rs, median):

  slice_bounds 16384    8.7us  -> 51ns    (~170x)
  slice_bounds 262144   138us  -> 310ns   (~445x)
  slice_bounds 1M       554us  -> 1.86us  (~300x)
  many_indices 16384    10.1us -> 1.68us  (~6x)
  many_indices 1M       613us  -> 3.33us  (~185x)

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 force-pushed the claude/bit-iterator-perf-0IMQ5 branch from 839b365 to 7609b8f Compare July 2, 2026 10:32
@robert3005 robert3005 requested a review from 0ax1 July 2, 2026 13:52
@robert3005 robert3005 enabled auto-merge (squash) July 3, 2026 10:44
@robert3005 robert3005 requested a review from AdamGS July 3, 2026 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants