vortex-data
diff --git a/‎AGENTS.md‎
Lines changed: 44 additions & 0 deletions b/‎AGENTS.md‎
Lines changed: 44 additions & 0 deletions
diff --git a/‎Cargo.lock‎
Lines changed: 1 addition & 0 deletions b/‎Cargo.lock‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎encodings/runend/Cargo.toml‎
Lines changed: 4 additions & 0 deletions b/‎encodings/runend/Cargo.toml‎
Lines changed: 4 additions & 0 deletions
diff --git a/‎encodings/runend/benches/run_end_filter.rs‎
Lines changed: 113 additions & 0 deletions b/‎encodings/runend/benches/run_end_filter.rs‎
Lines changed: 113 additions & 0 deletions
diff --git a/‎encodings/runend/src/compute/filter.rs‎
Lines changed: 15 additions & 9 deletions b/‎encodings/runend/src/compute/filter.rs‎
Lines changed: 15 additions & 9 deletions
diff --git a/‎encodings/runend/src/lib.rs‎
Lines changed: 1 addition & 0 deletions b/‎encodings/runend/src/lib.rs‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎vortex-array/Cargo.toml‎
Lines changed: 4 additions & 0 deletions b/‎vortex-array/Cargo.toml‎
Lines changed: 4 additions & 0 deletions
diff --git a/‎vortex-array/benches/validity_is_valid.rs‎
Lines changed: 71 additions & 0 deletions b/‎vortex-array/benches/validity_is_valid.rs‎
Lines changed: 71 additions & 0 deletions
@@ -151,6 +151,47 @@ Notes:
   self-explanatory code.
 - Keep public APIs small and consistent with neighboring crates.
 
+## Performance: avoid hidden-cost accessors in hot loops
+
+The most common performance trap in this codebase is calling a *per-element accessor that
+hides non-trivial work* inside an `O(n)` loop, instead of doing the work once over the whole
+chunk. The call site looks like a cheap getter, but each call re-pays a cost that is constant
+(or amortizable) across the loop, making the loop `O(n · k)`.
+
+Watch for these accessors used inside `for i in 0..n { ... }`:
+
+| Per-element accessor (in a loop) | Hidden cost | Bulk replacement |
+| --- | --- | --- |
+| `Validity::is_valid(i)` / `is_null(i)` | for array-backed validity, **allocates an `ExecutionCtx` and runs a scalar lookup per call** | `validity.execute_mask(len, ctx)?` once, then `Mask::value(i)` |
+| `array.scalar_at(i)` / `array.execute_scalar(i, ctx)` | per-element execution through the compute stack | canonicalize once (`execute::<PrimitiveArray>` / `as_slice`) then index |
+| `BitBuffer::value(i)` / `Mask::value(i)` accumulated into a count | recomputes the byte address each call; defeats popcount | `true_count()`, `BitBuffer::count_range(start, end)`, `set_indices()` |
+| `BitIterator::next()` to accumulate a running rank/prefix count | bit-at-a-time | `count_range` over each gap (SIMD popcount) |
+| re-deriving a value inside the loop (e.g. `self.validity()?` each iteration) | re-runs the derivation `n` times | hoist it above the loop |
+
+Decide per site — bulk is not always the answer:
+
+- **Sequential / contiguous access** over an accessor that hides amortizable work → hoist and
+  go bulk (materialize once, then index or iterate the chunk).
+- **Gather over arbitrary indices** → you cannot amortize a per-element *decode*, but you can
+  still materialize the backing buffer once (e.g. `execute_mask`) and then do cheap `O(1)`
+  random reads, avoiding per-call context/allocation.
+- **The accessor is already genuinely `O(1)`** (e.g. reading an already-materialized `Mask`/
+  slice, or a native bitmap) → leave it; bulk would not help.
+
+Even after materializing into a `Mask`/`BitBuffer`, **do not loop with `value(i)` per
+element** to act on each set bit — the per-element branch dominates. Iterate a `u64` word
+at a time with all-set / all-unset fast paths: use [`BitBuffer::for_each_set_index`]. It
+beats `for i in 0..len { if buf.value(i) {..} }` by 2-45x (more at low density) and beats
+collecting `set_indices()` by ~2x at mid/high density, while self-adapting from sparse to
+dense — see `vortex-mask/benches/mask_iteration.rs`. Reach for the cached `indices()` /
+`slices()` representations when you need them more than once; otherwise `for_each_set_index`
+needs no materialization.
+
+When you touch such a loop, back the change with a benchmark (see
+`vortex-array/benches/validity_is_valid.rs` for the `is_valid` case,
+`vortex-mask/benches/valid_counts.rs` for the popcount case, and
+`vortex-mask/benches/mask_iteration.rs` for the per-element-vs-word-vs-sparse comparison).
+
 ## Tests
 
 - Strongly consider `rstest` cases when parameterizing repetitive test logic.
@@ -179,6 +220,9 @@ Check new and modified lines against this list before finishing:
 - Updating expected test output to match buggy behavior without independently verifying the
   intended semantics.
 - Silently reducing the scope of an approved plan when implementation is harder than expected.
+- Calling a hidden-cost per-element accessor (`Validity::is_valid`, `scalar_at`, `BitBuffer::
+  value` accumulation) inside a hot loop instead of materializing once — see
+  "Performance: avoid hidden-cost accessors in hot loops".
 
 ## Summaries
 
 
@@ -56,3 +56,7 @@ harness = false
 [[bench]]
 name = "run_end_take"
 harness = false
+
+[[bench]]
+name = "run_end_filter"
+harness = false
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: Apache-2.0
+// SPDX-FileCopyrightText: Copyright the Vortex contributors
+
+//! Benchmarks for the run-end filter inner loop (`filter_run_end_primitive`).
+//!
+//! This measures the kernel directly rather than going through the lazy
+//! `ArrayRef::filter` (which only builds a `FilterArray` node and does not run
+//! the kernel). The hot work is a per-run popcount of the predicate mask, which
+//! now uses `BitBuffer::count_range` (SIMD) instead of a bit-by-bit walk.
+
+#![expect(clippy::cast_possible_truncation)]
+#![expect(clippy::cast_precision_loss)]
+#![expect(clippy::cast_sign_loss)]
+#![expect(clippy::expect_used)]
+
+use std::fmt;
+
+use divan::Bencher;
+use rand::SeedableRng;
+use rand::rngs::StdRng;
+use rand::seq::SliceRandom;
+use vortex_buffer::BitBuffer;
+use vortex_runend::_benchmarking::filter_run_end_primitive;
+
+fn main() {
+    divan::main();
+}
+
+#[derive(Clone, Copy)]
+struct FilterBenchArgs {
+    /// Total logical length of the decoded array.
+    length: usize,
+    /// Average run length used when building the run-end array.
+    run_length: usize,
+    /// Fraction of mask bits that are set to `true`.
+    density: f64,
+}
+
+impl fmt::Display for FilterBenchArgs {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(
+            f,
+            "len={}_run={}_density={:.1}",
+            self.length, self.run_length, self.density
+        )
+    }
+}
+
+const FILTER_ARGS: &[FilterBenchArgs] = &[
+    FilterBenchArgs {
+        length: 4_096,
+        run_length: 16,
+        density: 0.1,
+    },
+    FilterBenchArgs {
+        length: 4_096,
+        run_length: 16,
+        density: 0.5,
+    },
+    FilterBenchArgs {
+        length: 4_096,
+        run_length: 16,
+        density: 0.9,
+    },
+    FilterBenchArgs {
+        length: 16_384,
+        run_length: 16,
+        density: 0.1,
+    },
+    FilterBenchArgs {
+        length: 16_384,
+        run_length: 16,
+        density: 0.5,
+    },
+    FilterBenchArgs {
+        length: 16_384,
+        run_length: 16,
+        density: 0.9,
+    },
+];
+
+/// Build the run-end boundaries (cumulative run lengths) for `length` rows.
+fn build_run_ends(length: usize, run_length: usize) -> Vec<u32> {
+    let n_runs = length.div_ceil(run_length);
+    (0..n_runs)
+        .map(|r| (((r + 1) * run_length).min(length)) as u32)
+        .collect()
+}
+
+/// Build a predicate mask of `length` bits with approximately `density` set bits,
+/// shuffled so the set bits are spread across runs.
+fn build_mask(length: usize, density: f64) -> BitBuffer {
+    let n_true = (length as f64 * density).round() as usize;
+    let mut bits = vec![false; length];
+    for b in bits.iter_mut().take(n_true) {
+        *b = true;
+    }
+    let mut rng = StdRng::seed_from_u64(0x5eed);
+    bits.shuffle(&mut rng);
+    BitBuffer::from(bits)
+}
+
+#[divan::bench(args = FILTER_ARGS)]
+fn filter_run_end(bencher: Bencher, args: FilterBenchArgs) {
+    let run_ends = build_run_ends(args.length, args.run_length);
+    let mask = build_mask(args.length, args.density);
+    let length = args.length as u64;
+    bencher
+        .with_inputs(|| (run_ends.clone(), mask.clone()))
+        .bench_refs(|(run_ends, mask)| {
+            filter_run_end_primitive::<u32>(run_ends, 0, length, mask).expect("filter")
+        });
+}
@@ -5,6 +5,7 @@ use std::cmp::min;
 use std::ops::AddAssign;
 
 use num_traits::AsPrimitive;
+use num_traits::NumCast;
 use vortex_array::ArrayRef;
 use vortex_array::ArrayView;
 use vortex_array::ExecutionCtx;
@@ -74,7 +75,7 @@ impl FilterKernel for RunEnd {
 }
 
 // Code adapted from apache arrow-rs https://github.com/apache/arrow-rs/blob/b1f5c250ebb6c1252b4e7c51d15b8e77f4c361fa/arrow-select/src/filter.rs#L425
-fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
+pub fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
     run_ends: &[R],
     offset: u64,
     length: u64,
@@ -87,16 +88,21 @@ fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitiv
     let mut count = R::zero();
 
     let new_mask: Mask = BitBuffer::collect_bool(run_ends.len(), |i| {
-        let mut keep = false;
         let end = min(run_ends[i].as_() - offset, length);
 
-        // Safety: predicate must be the same length as the array the ends have been taken from
-        for pred in (start..end).map(|i| unsafe {
-            mask.value_unchecked(i.try_into().vortex_expect("index must fit in usize"))
-        }) {
-            count += <R as From<bool>>::from(pred);
-            keep |= pred
-        }
+        // SIMD popcount of the predicate bits in this run. The range matches the
+        // bit-by-bit `value_unchecked` read it replaces, so `end <= mask.len()`.
+        let start_usize = start
+            .try_into()
+            .vortex_expect("run start index must fit in usize");
+        let end_usize = end
+            .try_into()
+            .vortex_expect("run end index must fit in usize");
+        let run_trues = mask.count_range(start_usize, end_usize);
+        count += <R as NumCast>::from(run_trues)
+            .vortex_expect("run popcount must fit in run-end native type");
+        let keep = run_trues > 0;
+
         // this is to avoid branching
         new_run_ends[j] = count;
         j += keep as usize;
 
@@ -21,6 +21,7 @@ mod rules;
 
 #[doc(hidden)]
 pub mod _benchmarking {
+    pub use compute::filter::filter_run_end_primitive;
     pub use compute::take::take_indices_unchecked;
 
     use super::*;
 
@@ -200,6 +200,10 @@ required-features = ["_test-harness"]
 name = "dict_mask"
 harness = false
 
+[[bench]]
+name = "validity_is_valid"
+harness = false
+
 [[bench]]
 name = "dict_unreferenced_mask"
 harness = false
 
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: Apache-2.0
+// SPDX-FileCopyrightText: Copyright the Vortex contributors
+
+//! Benchmarks the cost of checking array-backed [`Validity`] per element.
+//!
+//! For the `Validity::Array` variant, `Validity::execute_is_valid` runs a scalar lookup through
+//! the compute stack on *every* call, and the (now deprecated) `is_valid` additionally spins up a
+//! fresh `ExecutionCtx` per call. Either way, calling it in a `for i in 0..n` loop is
+//! pathologically slow. The fix used by callers is to materialize the validity into a `Mask` once
+//! (`execute_mask`) and then do cheap O(1) bit reads via `Mask::value`. This bench contrasts the two.
+//!
+//! Sizes are kept small because the per-element variant is intentionally the slow one we are
+//! measuring.
+
+#![allow(clippy::unwrap_used)]
+
+use std::sync::LazyLock;
+
+use divan::Bencher;
+use vortex_array::VortexSessionExecute;
+use vortex_array::array_session;
+use vortex_array::validity::Validity;
+use vortex_buffer::BitBuffer;
+use vortex_session::VortexSession;
+
+fn main() {
+    divan::main();
+}
+
+const SIZES: &[usize] = &[256, 1024];
+
+/// Build an array-backed validity (~10% nulls so it is `Validity::Array`).
+fn array_validity(len: usize) -> Validity {
+    Validity::from(BitBuffer::from_iter(
+        (0..len).map(|i| !i.is_multiple_of(10)),
+    ))
+}
+
+static SESSION: LazyLock<VortexSession> = LazyLock::new(array_session);
+
+/// Per-element validity check over array-backed validity (the antipattern). This mirrors the
+/// deprecated `Validity::is_valid(i)`: a fresh `ExecutionCtx` plus a scalar lookup on every call.
+#[divan::bench(args = SIZES)]
+fn is_valid_per_element(bencher: Bencher, len: usize) {
+    let validity = array_validity(len);
+    bencher
+        .with_inputs(|| (&validity, SESSION.create_execution_ctx()))
+        .bench_refs(|(validity, ctx)| {
+            let mut count = 0usize;
+            for i in 0..len {
+                count += validity.execute_is_valid(i, ctx).unwrap() as usize;
+            }
+            count
+        });
+}
+
+/// Materialize the validity into a `Mask` once, then read bits (the fix).
+#[divan::bench(args = SIZES)]
+fn execute_mask_then_value(bencher: Bencher, len: usize) {
+    let validity = array_validity(len);
+    bencher
+        .with_inputs(|| (&validity, SESSION.create_execution_ctx()))
+        .bench_refs(|(validity, ctx)| {
+            let mask = validity.execute_mask(len, ctx).unwrap();
+            let mut count = 0usize;
+            for i in 0..len {
+                count += mask.value(i) as usize;
+            }
+            count
+        });
+}