Skip to content

Commit 7609b8f

Browse files
joseph-isaacsrobert3005
authored andcommitted
perf(mask): SIMD popcount for valid_counts_for_indices
`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at a time via `BitIterator::next`, accumulating a running valid count. The pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the cost was a bit-by-bit scan of the entire prefix up to `slice_stop`. Replace the per-bit walk with an incremental SIMD popcount over each gap between consecutive indices. Total scanned bits are unchanged, but the work now goes through the vectorized `count_ones` path. Add `BitBuffer::count_range(start, end)`, which counts set bits in a range directly over the backing buffer without cloning a sliced `BitBuffer`, so the many-indices case has no per-gap allocation overhead. Benchmark (vortex-mask/benches/valid_counts.rs, median): slice_bounds 16384 8.7us -> 51ns (~170x) slice_bounds 262144 138us -> 310ns (~445x) slice_bounds 1M 554us -> 1.86us (~300x) many_indices 16384 10.1us -> 1.68us (~6x) many_indices 1M 613us -> 3.33us (~185x) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Robert Kruszewski <github@robertk.io>
1 parent 890704f commit 7609b8f

16 files changed

Lines changed: 662 additions & 26 deletions

File tree

AGENTS.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -151,6 +151,47 @@ Notes:
151151
self-explanatory code.
152152
- Keep public APIs small and consistent with neighboring crates.
153153

154+
## Performance: avoid hidden-cost accessors in hot loops
155+
156+
The most common performance trap in this codebase is calling a *per-element accessor that
157+
hides non-trivial work* inside an `O(n)` loop, instead of doing the work once over the whole
158+
chunk. The call site looks like a cheap getter, but each call re-pays a cost that is constant
159+
(or amortizable) across the loop, making the loop `O(n · k)`.
160+
161+
Watch for these accessors used inside `for i in 0..n { ... }`:
162+
163+
| Per-element accessor (in a loop) | Hidden cost | Bulk replacement |
164+
| --- | --- | --- |
165+
| `Validity::is_valid(i)` / `is_null(i)` | for array-backed validity, **allocates an `ExecutionCtx` and runs a scalar lookup per call** | `validity.execute_mask(len, ctx)?` once, then `Mask::value(i)` |
166+
| `array.scalar_at(i)` / `array.execute_scalar(i, ctx)` | per-element execution through the compute stack | canonicalize once (`execute::<PrimitiveArray>` / `as_slice`) then index |
167+
| `BitBuffer::value(i)` / `Mask::value(i)` accumulated into a count | recomputes the byte address each call; defeats popcount | `true_count()`, `BitBuffer::count_range(start, end)`, `set_indices()` |
168+
| `BitIterator::next()` to accumulate a running rank/prefix count | bit-at-a-time | `count_range` over each gap (SIMD popcount) |
169+
| re-deriving a value inside the loop (e.g. `self.validity()?` each iteration) | re-runs the derivation `n` times | hoist it above the loop |
170+
171+
Decide per site — bulk is not always the answer:
172+
173+
- **Sequential / contiguous access** over an accessor that hides amortizable work → hoist and
174+
go bulk (materialize once, then index or iterate the chunk).
175+
- **Gather over arbitrary indices** → you cannot amortize a per-element *decode*, but you can
176+
still materialize the backing buffer once (e.g. `execute_mask`) and then do cheap `O(1)`
177+
random reads, avoiding per-call context/allocation.
178+
- **The accessor is already genuinely `O(1)`** (e.g. reading an already-materialized `Mask`/
179+
slice, or a native bitmap) → leave it; bulk would not help.
180+
181+
Even after materializing into a `Mask`/`BitBuffer`, **do not loop with `value(i)` per
182+
element** to act on each set bit — the per-element branch dominates. Iterate a `u64` word
183+
at a time with all-set / all-unset fast paths: use [`BitBuffer::for_each_set_index`]. It
184+
beats `for i in 0..len { if buf.value(i) {..} }` by 2-45x (more at low density) and beats
185+
collecting `set_indices()` by ~2x at mid/high density, while self-adapting from sparse to
186+
dense — see `vortex-mask/benches/mask_iteration.rs`. Reach for the cached `indices()` /
187+
`slices()` representations when you need them more than once; otherwise `for_each_set_index`
188+
needs no materialization.
189+
190+
When you touch such a loop, back the change with a benchmark (see
191+
`vortex-array/benches/validity_is_valid.rs` for the `is_valid` case,
192+
`vortex-mask/benches/valid_counts.rs` for the popcount case, and
193+
`vortex-mask/benches/mask_iteration.rs` for the per-element-vs-word-vs-sparse comparison).
194+
154195
## Tests
155196

156197
- Strongly consider `rstest` cases when parameterizing repetitive test logic.
@@ -179,6 +220,9 @@ Check new and modified lines against this list before finishing:
179220
- Updating expected test output to match buggy behavior without independently verifying the
180221
intended semantics.
181222
- Silently reducing the scope of an approved plan when implementation is harder than expected.
223+
- Calling a hidden-cost per-element accessor (`Validity::is_valid`, `scalar_at`, `BitBuffer::
224+
value` accumulation) inside a hot loop instead of materializing once — see
225+
"Performance: avoid hidden-cost accessors in hot loops".
182226

183227
## Summaries
184228

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

encodings/runend/Cargo.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,3 +56,7 @@ harness = false
5656
[[bench]]
5757
name = "run_end_take"
5858
harness = false
59+
60+
[[bench]]
61+
name = "run_end_filter"
62+
harness = false
Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
// SPDX-License-Identifier: Apache-2.0
2+
// SPDX-FileCopyrightText: Copyright the Vortex contributors
3+
4+
//! Benchmarks for the run-end filter inner loop (`filter_run_end_primitive`).
5+
//!
6+
//! This measures the kernel directly rather than going through the lazy
7+
//! `ArrayRef::filter` (which only builds a `FilterArray` node and does not run
8+
//! the kernel). The hot work is a per-run popcount of the predicate mask, which
9+
//! now uses `BitBuffer::count_range` (SIMD) instead of a bit-by-bit walk.
10+
11+
#![expect(clippy::cast_possible_truncation)]
12+
#![expect(clippy::cast_precision_loss)]
13+
#![expect(clippy::cast_sign_loss)]
14+
#![expect(clippy::expect_used)]
15+
16+
use std::fmt;
17+
18+
use divan::Bencher;
19+
use rand::SeedableRng;
20+
use rand::rngs::StdRng;
21+
use rand::seq::SliceRandom;
22+
use vortex_buffer::BitBuffer;
23+
use vortex_runend::_benchmarking::filter_run_end_primitive;
24+
25+
fn main() {
26+
divan::main();
27+
}
28+
29+
#[derive(Clone, Copy)]
30+
struct FilterBenchArgs {
31+
/// Total logical length of the decoded array.
32+
length: usize,
33+
/// Average run length used when building the run-end array.
34+
run_length: usize,
35+
/// Fraction of mask bits that are set to `true`.
36+
density: f64,
37+
}
38+
39+
impl fmt::Display for FilterBenchArgs {
40+
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
41+
write!(
42+
f,
43+
"len={}_run={}_density={:.1}",
44+
self.length, self.run_length, self.density
45+
)
46+
}
47+
}
48+
49+
const FILTER_ARGS: &[FilterBenchArgs] = &[
50+
FilterBenchArgs {
51+
length: 4_096,
52+
run_length: 16,
53+
density: 0.1,
54+
},
55+
FilterBenchArgs {
56+
length: 4_096,
57+
run_length: 16,
58+
density: 0.5,
59+
},
60+
FilterBenchArgs {
61+
length: 4_096,
62+
run_length: 16,
63+
density: 0.9,
64+
},
65+
FilterBenchArgs {
66+
length: 16_384,
67+
run_length: 16,
68+
density: 0.1,
69+
},
70+
FilterBenchArgs {
71+
length: 16_384,
72+
run_length: 16,
73+
density: 0.5,
74+
},
75+
FilterBenchArgs {
76+
length: 16_384,
77+
run_length: 16,
78+
density: 0.9,
79+
},
80+
];
81+
82+
/// Build the run-end boundaries (cumulative run lengths) for `length` rows.
83+
fn build_run_ends(length: usize, run_length: usize) -> Vec<u32> {
84+
let n_runs = length.div_ceil(run_length);
85+
(0..n_runs)
86+
.map(|r| (((r + 1) * run_length).min(length)) as u32)
87+
.collect()
88+
}
89+
90+
/// Build a predicate mask of `length` bits with approximately `density` set bits,
91+
/// shuffled so the set bits are spread across runs.
92+
fn build_mask(length: usize, density: f64) -> BitBuffer {
93+
let n_true = (length as f64 * density).round() as usize;
94+
let mut bits = vec![false; length];
95+
for b in bits.iter_mut().take(n_true) {
96+
*b = true;
97+
}
98+
let mut rng = StdRng::seed_from_u64(0x5eed);
99+
bits.shuffle(&mut rng);
100+
BitBuffer::from(bits)
101+
}
102+
103+
#[divan::bench(args = FILTER_ARGS)]
104+
fn filter_run_end(bencher: Bencher, args: FilterBenchArgs) {
105+
let run_ends = build_run_ends(args.length, args.run_length);
106+
let mask = build_mask(args.length, args.density);
107+
let length = args.length as u64;
108+
bencher
109+
.with_inputs(|| (run_ends.clone(), mask.clone()))
110+
.bench_refs(|(run_ends, mask)| {
111+
filter_run_end_primitive::<u32>(run_ends, 0, length, mask).expect("filter")
112+
});
113+
}

encodings/runend/src/compute/filter.rs

Lines changed: 15 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ use std::cmp::min;
55
use std::ops::AddAssign;
66

77
use num_traits::AsPrimitive;
8+
use num_traits::NumCast;
89
use vortex_array::ArrayRef;
910
use vortex_array::ArrayView;
1011
use vortex_array::ExecutionCtx;
@@ -74,7 +75,7 @@ impl FilterKernel for RunEnd {
7475
}
7576

7677
// Code adapted from apache arrow-rs https://github.com/apache/arrow-rs/blob/b1f5c250ebb6c1252b4e7c51d15b8e77f4c361fa/arrow-select/src/filter.rs#L425
77-
fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
78+
pub fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
7879
run_ends: &[R],
7980
offset: u64,
8081
length: u64,
@@ -87,16 +88,21 @@ fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitiv
8788
let mut count = R::zero();
8889

8990
let new_mask: Mask = BitBuffer::collect_bool(run_ends.len(), |i| {
90-
let mut keep = false;
9191
let end = min(run_ends[i].as_() - offset, length);
9292

93-
// Safety: predicate must be the same length as the array the ends have been taken from
94-
for pred in (start..end).map(|i| unsafe {
95-
mask.value_unchecked(i.try_into().vortex_expect("index must fit in usize"))
96-
}) {
97-
count += <R as From<bool>>::from(pred);
98-
keep |= pred
99-
}
93+
// SIMD popcount of the predicate bits in this run. The range matches the
94+
// bit-by-bit `value_unchecked` read it replaces, so `end <= mask.len()`.
95+
let start_usize = start
96+
.try_into()
97+
.vortex_expect("run start index must fit in usize");
98+
let end_usize = end
99+
.try_into()
100+
.vortex_expect("run end index must fit in usize");
101+
let run_trues = mask.count_range(start_usize, end_usize);
102+
count += <R as NumCast>::from(run_trues)
103+
.vortex_expect("run popcount must fit in run-end native type");
104+
let keep = run_trues > 0;
105+
100106
// this is to avoid branching
101107
new_run_ends[j] = count;
102108
j += keep as usize;

encodings/runend/src/lib.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ mod rules;
2121

2222
#[doc(hidden)]
2323
pub mod _benchmarking {
24+
pub use compute::filter::filter_run_end_primitive;
2425
pub use compute::take::take_indices_unchecked;
2526

2627
use super::*;

vortex-array/Cargo.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -200,6 +200,10 @@ required-features = ["_test-harness"]
200200
name = "dict_mask"
201201
harness = false
202202

203+
[[bench]]
204+
name = "validity_is_valid"
205+
harness = false
206+
203207
[[bench]]
204208
name = "dict_unreferenced_mask"
205209
harness = false
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
// SPDX-License-Identifier: Apache-2.0
2+
// SPDX-FileCopyrightText: Copyright the Vortex contributors
3+
4+
//! Benchmarks the cost of checking array-backed [`Validity`] per element.
5+
//!
6+
//! For the `Validity::Array` variant, `Validity::execute_is_valid` runs a scalar lookup through
7+
//! the compute stack on *every* call, and the (now deprecated) `is_valid` additionally spins up a
8+
//! fresh `ExecutionCtx` per call. Either way, calling it in a `for i in 0..n` loop is
9+
//! pathologically slow. The fix used by callers is to materialize the validity into a `Mask` once
10+
//! (`execute_mask`) and then do cheap O(1) bit reads via `Mask::value`. This bench contrasts the two.
11+
//!
12+
//! Sizes are kept small because the per-element variant is intentionally the slow one we are
13+
//! measuring.
14+
15+
#![allow(clippy::unwrap_used)]
16+
17+
use std::sync::LazyLock;
18+
19+
use divan::Bencher;
20+
use vortex_array::VortexSessionExecute;
21+
use vortex_array::array_session;
22+
use vortex_array::validity::Validity;
23+
use vortex_buffer::BitBuffer;
24+
use vortex_session::VortexSession;
25+
26+
fn main() {
27+
divan::main();
28+
}
29+
30+
const SIZES: &[usize] = &[256, 1024];
31+
32+
/// Build an array-backed validity (~10% nulls so it is `Validity::Array`).
33+
fn array_validity(len: usize) -> Validity {
34+
Validity::from(BitBuffer::from_iter(
35+
(0..len).map(|i| !i.is_multiple_of(10)),
36+
))
37+
}
38+
39+
static SESSION: LazyLock<VortexSession> = LazyLock::new(array_session);
40+
41+
/// Per-element validity check over array-backed validity (the antipattern). This mirrors the
42+
/// deprecated `Validity::is_valid(i)`: a fresh `ExecutionCtx` plus a scalar lookup on every call.
43+
#[divan::bench(args = SIZES)]
44+
fn is_valid_per_element(bencher: Bencher, len: usize) {
45+
let validity = array_validity(len);
46+
bencher
47+
.with_inputs(|| (&validity, SESSION.create_execution_ctx()))
48+
.bench_refs(|(validity, ctx)| {
49+
let mut count = 0usize;
50+
for i in 0..len {
51+
count += validity.execute_is_valid(i, ctx).unwrap() as usize;
52+
}
53+
count
54+
});
55+
}
56+
57+
/// Materialize the validity into a `Mask` once, then read bits (the fix).
58+
#[divan::bench(args = SIZES)]
59+
fn execute_mask_then_value(bencher: Bencher, len: usize) {
60+
let validity = array_validity(len);
61+
bencher
62+
.with_inputs(|| (&validity, SESSION.create_execution_ctx()))
63+
.bench_refs(|(validity, ctx)| {
64+
let mask = validity.execute_mask(len, ctx).unwrap();
65+
let mut count = 0usize;
66+
for i in 0..len {
67+
count += mask.value(i) as usize;
68+
}
69+
count
70+
});
71+
}

0 commit comments

Comments
 (0)