Optimize bit iteration and validity checking with SIMD popcount by robert3005 · Pull Request #8636 · vortex-data/vortex

robert3005 · 2026-07-01T15:11:16Z

recreation of #8261 which had real performance benefits for iterating valid indices and had guidelines for future agents performance work

codspeed-hq · 2026-07-01T15:29:09Z

Merging this PR will degrade performance by 14.66%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 1 regressed benchmark
✅ 1594 untouched benchmarks
🆕 48 new benchmarks
⏩ 4 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`slice_empty_vortex`	339.4 ns	397.8 ns	-14.66%
🆕	Simulation	`execute_mask_then_value[1024]`	N/A	8.9 µs	N/A
🆕	Simulation	`execute_mask_then_value[256]`	N/A	12.8 µs	N/A
🆕	Simulation	`is_valid_per_element[1024]`	N/A	186.6 µs	N/A
🆕	Simulation	`is_valid_per_element[256]`	N/A	48.6 µs	N/A
🆕	Simulation	`filter_run_end[len=16384_run=16_density=0.1]`	N/A	70.2 µs	N/A
🆕	Simulation	`filter_run_end[len=16384_run=16_density=0.5]`	N/A	69.4 µs	N/A
🆕	Simulation	`filter_run_end[len=16384_run=16_density=0.9]`	N/A	69 µs	N/A
🆕	Simulation	`filter_run_end[len=4096_run=16_density=0.1]`	N/A	67.9 µs	N/A
🆕	Simulation	`filter_run_end[len=4096_run=16_density=0.5]`	N/A	22.9 µs	N/A
🆕	Simulation	`filter_run_end[len=4096_run=16_density=0.9]`	N/A	22.9 µs	N/A
🆕	Simulation	`many_indices[(16384, 0.1)]`	N/A	16.3 µs	N/A
🆕	Simulation	`many_indices[(16384, 0.9)]`	N/A	16.3 µs	N/A
🆕	Simulation	`slice_bounds[(16384, 0.1)]`	N/A	3 µs	N/A
🆕	Simulation	`slice_bounds[(16384, 0.9)]`	N/A	3 µs	N/A
🆕	Simulation	`bit_iterator[(16384, 0.01)]`	N/A	64.3 µs	N/A
🆕	Simulation	`bit_iterator[(16384, 0.1)]`	N/A	73.3 µs	N/A
🆕	Simulation	`bit_iterator[(16384, 0.5)]`	N/A	113.7 µs	N/A
🆕	Simulation	`bit_iterator[(16384, 0.9)]`	N/A	153.9 µs	N/A
🆕	Simulation	`bit_iterator[(16384, 0.99)]`	N/A	160.8 µs	N/A
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/bit-iterator-perf-0IMQ5 (7609b8f) with develop (890704f)}

4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at a time via `BitIterator::next`, accumulating a running valid count. The pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the cost was a bit-by-bit scan of the entire prefix up to `slice_stop`. Replace the per-bit walk with an incremental SIMD popcount over each gap between consecutive indices. Total scanned bits are unchanged, but the work now goes through the vectorized `count_ones` path. Add `BitBuffer::count_range(start, end)`, which counts set bits in a range directly over the backing buffer without cloning a sliced `BitBuffer`, so the many-indices case has no per-gap allocation overhead. Benchmark (vortex-mask/benches/valid_counts.rs, median): slice_bounds 16384 8.7us -> 51ns (~170x) slice_bounds 262144 138us -> 310ns (~445x) slice_bounds 1M 554us -> 1.86us (~300x) many_indices 16384 10.1us -> 1.68us (~6x) many_indices 1M 613us -> 3.33us (~185x) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Robert Kruszewski <github@robertk.io>

robert3005 requested a review from a team July 1, 2026 15:11

robert3005 added the changelog/performance A performance improvement label Jul 1, 2026

robert3005 force-pushed the claude/bit-iterator-perf-0IMQ5 branch from 839b365 to 7609b8f Compare July 2, 2026 10:32

robert3005 requested a review from 0ax1 July 2, 2026 13:52

robert3005 enabled auto-merge (squash) July 3, 2026 10:44

robert3005 requested a review from AdamGS July 3, 2026 10:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize bit iteration and validity checking with SIMD popcount#8636

Optimize bit iteration and validity checking with SIMD popcount#8636
robert3005 wants to merge 1 commit into
developfrom
claude/bit-iterator-perf-0IMQ5

robert3005 commented Jul 1, 2026

Uh oh!

codspeed-hq Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

robert3005 commented Jul 1, 2026

Uh oh!

codspeed-hq Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will degrade performance by 14.66%

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq Bot commented Jul 1, 2026 •

edited

Loading