You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
perf(mask): SIMD popcount for valid_counts_for_indices
`Mask::valid_counts_for_indices` walked the validity bit buffer one bit at
a time via `BitIterator::next`, accumulating a running valid count. The
pco/zstd slice decoders call it with `[slice_start, slice_stop]`, so the
cost was a bit-by-bit scan of the entire prefix up to `slice_stop`.
Replace the per-bit walk with an incremental SIMD popcount over each gap
between consecutive indices. Total scanned bits are unchanged, but the
work now goes through the vectorized `count_ones` path.
Add `BitBuffer::count_range(start, end)`, which counts set bits in a range
directly over the backing buffer without cloning a sliced `BitBuffer`, so
the many-indices case has no per-gap allocation overhead.
Benchmark (vortex-mask/benches/valid_counts.rs, median):
slice_bounds 16384 8.7us -> 51ns (~170x)
slice_bounds 262144 138us -> 310ns (~445x)
slice_bounds 1M 554us -> 1.86us (~300x)
many_indices 16384 10.1us -> 1.68us (~6x)
many_indices 1M 613us -> 3.33us (~185x)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Robert Kruszewski <github@robertk.io>
Copy file name to clipboardExpand all lines: AGENTS.md
+44Lines changed: 44 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -151,6 +151,47 @@ Notes:
151
151
self-explanatory code.
152
152
- Keep public APIs small and consistent with neighboring crates.
153
153
154
+
## Performance: avoid hidden-cost accessors in hot loops
155
+
156
+
The most common performance trap in this codebase is calling a *per-element accessor that
157
+
hides non-trivial work* inside an `O(n)` loop, instead of doing the work once over the whole
158
+
chunk. The call site looks like a cheap getter, but each call re-pays a cost that is constant
159
+
(or amortizable) across the loop, making the loop `O(n · k)`.
160
+
161
+
Watch for these accessors used inside `for i in 0..n { ... }`:
162
+
163
+
| Per-element accessor (in a loop) | Hidden cost | Bulk replacement |
164
+
| --- | --- | --- |
165
+
|`Validity::is_valid(i)` / `is_null(i)`| for array-backed validity, **allocates an `ExecutionCtx` and runs a scalar lookup per call**|`validity.execute_mask(len, ctx)?` once, then `Mask::value(i)`|
166
+
|`array.scalar_at(i)` / `array.execute_scalar(i, ctx)`| per-element execution through the compute stack | canonicalize once (`execute::<PrimitiveArray>` / `as_slice`) then index |
167
+
|`BitBuffer::value(i)` / `Mask::value(i)` accumulated into a count | recomputes the byte address each call; defeats popcount |`true_count()`, `BitBuffer::count_range(start, end)`, `set_indices()`|
168
+
|`BitIterator::next()` to accumulate a running rank/prefix count | bit-at-a-time |`count_range` over each gap (SIMD popcount) |
169
+
| re-deriving a value inside the loop (e.g. `self.validity()?` each iteration) | re-runs the derivation `n` times | hoist it above the loop |
170
+
171
+
Decide per site — bulk is not always the answer:
172
+
173
+
-**Sequential / contiguous access** over an accessor that hides amortizable work → hoist and
174
+
go bulk (materialize once, then index or iterate the chunk).
175
+
-**Gather over arbitrary indices** → you cannot amortize a per-element *decode*, but you can
176
+
still materialize the backing buffer once (e.g. `execute_mask`) and then do cheap `O(1)`
177
+
random reads, avoiding per-call context/allocation.
178
+
-**The accessor is already genuinely `O(1)`** (e.g. reading an already-materialized `Mask`/
179
+
slice, or a native bitmap) → leave it; bulk would not help.
180
+
181
+
Even after materializing into a `Mask`/`BitBuffer`, **do not loop with `value(i)` per
182
+
element** to act on each set bit — the per-element branch dominates. Iterate a `u64` word
183
+
at a time with all-set / all-unset fast paths: use [`BitBuffer::for_each_set_index`]. It
184
+
beats `for i in 0..len { if buf.value(i) {..} }` by 2-45x (more at low density) and beats
185
+
collecting `set_indices()` by ~2x at mid/high density, while self-adapting from sparse to
186
+
dense — see `vortex-mask/benches/mask_iteration.rs`. Reach for the cached `indices()` /
187
+
`slices()` representations when you need them more than once; otherwise `for_each_set_index`
188
+
needs no materialization.
189
+
190
+
When you touch such a loop, back the change with a benchmark (see
191
+
`vortex-array/benches/validity_is_valid.rs` for the `is_valid` case,
192
+
`vortex-mask/benches/valid_counts.rs` for the popcount case, and
193
+
`vortex-mask/benches/mask_iteration.rs` for the per-element-vs-word-vs-sparse comparison).
194
+
154
195
## Tests
155
196
156
197
- Strongly consider `rstest` cases when parameterizing repetitive test logic.
@@ -179,6 +220,9 @@ Check new and modified lines against this list before finishing:
179
220
- Updating expected test output to match buggy behavior without independently verifying the
180
221
intended semantics.
181
222
- Silently reducing the scope of an approved plan when implementation is harder than expected.
223
+
- Calling a hidden-cost per-element accessor (`Validity::is_valid`, `scalar_at`, `BitBuffer::
224
+
value` accumulation) inside a hot loop instead of materializing once — see
225
+
"Performance: avoid hidden-cost accessors in hot loops".
0 commit comments