Skip to content

feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253

Open
joseph-isaacs wants to merge 14 commits into
developfrom
claude/nice-archimedes-yjGyO
Open

feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253
joseph-isaacs wants to merge 14 commits into
developfrom
claude/nice-archimedes-yjGyO

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

Summary

Adds vortex-row, a new crate that encodes one or more columnar Vortex arrays into a single
ListView<u8> whose per-row byte slices are lexicographically comparable. The byte order
matches tuple ordering of the input values under per-column sort options, so the output works
directly as a sort key / row key — the Vortex analogue of arrow-row.

This is the base of the row-encoding work: the byte codec, the two-pass converter, the
public API, tests, and benches. Codec hot-path performance tuning and the per-encoding
(Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta) fast-path kernels land in
follow-up PRs and are intentionally out of scope here.

Single clean commit on top of current develop. The crate is marked publish = false, so —
following the same convention as vortex-python, vortex-bench, and xtask — no
public-api.lock is tracked while the API is still settling.

Design

Encoding runs as two scalar functions wired behind the RowEncoder API:

  1. Size pass — RowSize. Walks the N input columns once, classifies each column as
    fixed- or variable-width, accumulates the fixed-width prefix per row, and lazily collects
    per-row variable lengths. Returns Struct { fixed: u32, var: u32 } so callers read per-row
    widths without materializing the constant fixed slot as a per-row buffer.
  2. Encode pass — RowEncode. Uses those sizes to compute totals, allocate one contiguous
    elements buffer, build per-row absolute offsets, then writes each column left-to-right into
    its per-row slot via a write cursor that doubles as the ListView sizes array — so no
    separate finalize step is needed.

The converter is effectively 2 passes for the pure-fixed-width case and 3 when
variable-length columns require the prefix-sum offsets pass.

Per-column ordering is controlled by RowSortField { descending, nulls_first }: descending
reverses the encoded value bytes, and a leading sentinel byte (0x00 / 0x01 / 0x02) places
nulls before or after non-nulls independently of sort direction.

convert_columns and where the API lives

convert_columns(cols: &[ArrayRef], fields: &[RowSortField], ctx) -> VortexResult<ListViewArray>
is the one-shot entry point; RowEncoder is the reusable form (build once with options, encode
many). Each public item is defined in a single file:

Item File
RowEncoder, convert_columns(_with_options), compute_row_sizes(_with_options) src/encoder.rs
RowEncode scalar fn + the 5-phase encode driver src/encode.rs
RowSize scalar fn + the size/classify pass (compute_sizes) src/size.rs
RowEncodingOptions, RowSortField src/options.rs
per-dtype byte codec (field_size / field_encode) src/codec.rs
initialize(session) + re-exports src/lib.rs

Implementation. convert_columns validates the columns (non-empty, equal lengths, one
RowSortField per column) and delegates to a RowEncoder, which runs RowSize then
RowEncode: the size pass canonicalizes each column once and classifies it fixed- vs
variable-width, accumulating a constant fixed_per_row and — only when needed — per-row
var_lengths; the encode pass sums those into the total byte length, allocates one contiguous
elements buffer, computes per-row offsets (i * fixed_per_row plus a varlen prefix sum), then
writes each column left-to-right into its per-row slot using a cursor that becomes the ListView
sizes array, producing the final ListView<u8> with no separate finalize step.

Type coverage

Supported: nulls, booleans, integer/float primitives, decimals up to 128 bits, UTF-8 and
binary, structs, fixed-size lists, and extensions whose storage type is supported. Variant,
union, and variable-size list arrays are rejected — this crate does not define an ordering for
them.

Testing

  • cargo test -p vortex-row21 passed: sort-order round-trip tests (bool, i64 asc/desc,
    u32, f16/f32/f64, utf8, multi-column, nulls first/last, struct), the single-buffer ListView
    invariant, RowSize output shape, and dtype rejection.
  • cargo build -p vortex-row and cargo clippy -p vortex-row --all-targets — clean.

Generated by Claude Code

Adds `vortex-row`, which encodes N columnar arrays into a single
byte-comparable `ListView<u8>` (the Vortex analogue of arrow-row) for use as
sort/row keys. Encoding runs as two scalar functions behind the `RowEncoder`
API: a `RowSize` sizing/classification pass and a `RowEncode` pass that
allocates one contiguous buffer and writes each column left-to-right into its
per-row slot. Per-column ordering (`RowSortField`) controls ascending/
descending and null placement.

Includes the byte codec for fixed-width, varlen, and nested canonical types,
the `convert_columns`/`compute_row_sizes` helpers, round-trip + invariant
tests, and arrow-row-baselined throughput benches.

The crate is marked `publish = false` for now, so no public-api.lock is
tracked.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs added the changelog/feature A new feature label Jun 4, 2026 — with Claude
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 4, 2026

Merging this PR will degrade performance by 12.7%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1505 untouched benchmarks
🆕 10 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 31.7 µs 46.6 µs -31.87%
Simulation bitwise_not_vortex_buffer_mut[128] 275.3 ns 246.1 ns +11.85%
🆕 Simulation struct_mixed_vortex N/A 22.9 ms N/A
🆕 Simulation primitive_i64_vortex N/A 1.5 ms N/A
🆕 Simulation fsst_decompress_only N/A 16.4 ms N/A
🆕 Simulation plain_row_encode_only N/A 14.9 ms N/A
🆕 Simulation primitive_i64_arrow_row N/A 2.4 ms N/A
🆕 Simulation fsst_fast_fused N/A 29.5 ms N/A
🆕 Simulation fsst_unpack_then_convert N/A 30.6 ms N/A
🆕 Simulation struct_mixed_arrow_row N/A 18.7 ms N/A
🆕 Simulation utf8_arrow_row N/A 8.6 ms N/A
🆕 Simulation utf8_vortex N/A 9.2 ms N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/nice-archimedes-yjGyO (946e18c) with develop (e484daf)

Open in CodSpeed

Add a CodSpeed shard for `vortex-row` so the `row_encode` divan benchmarks
(vortex vs arrow-row) build and run in CI alongside the other crates.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs marked this pull request as ready for review June 4, 2026 15:35
@joseph-isaacs joseph-isaacs requested a review from robert3005 June 4, 2026 15:41
The row encoder builds the output `(elements, offsets, sizes)` triple itself, so
the invariants `ListViewArray::try_new` validates (monotone offsets, per-row
slices within bounds and disjoint) already hold by construction. Skip the
revalidation walk via `new_unchecked`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Introduce `ValidityKind`/`resolve_validity`: resolve a column's validity once,
materializing the per-row mask only when the column may actually contain nulls.
The size pass for varbinview and the bool and primitive encoders now branch once
on validity, so the all-valid path drops the per-row `mask.value(i)` check (and
mask allocation) entirely.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Every byte of the output range is written by some encoder: fixed-width null rows
write sentinel + explicit zero-fill, varlen encoders zero-pad their final partial
block, and struct/FSL null parent bodies are overwritten with the canonical null
encoding. The pre-zero-init memset is therefore redundant, so replace it with
`set_len`, saving a `total_len`-byte memset per call.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Materialize the listview offsets buffer with `set_len` + a slice write instead of
per-row `push`. For the pure-fixed path, `iter_mut().enumerate()` lets LLVM
auto-vectorize `offsets[i] = i * fixed_per_row` (no per-element bounds or capacity
checks). `nrows` is validated to fit u32 at function entry, so the cast is exact.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Write the mixed (fixed + varlen) offsets through `iter_mut().zip` with wrapping
arithmetic, mirroring the pure-fixed path: this elides per-element bounds checks so
the `i * fixed_per_row` multiply auto-vectorizes while the varlen prefix sum stays a
cheap sequential accumulator. The total is validated to fit u32 upstream, so the
wrapping operations never actually wrap.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ping

The varlen body writer was a per-byte XOR loop. Split it into an ascending fast
path (`copy_nonoverlapping` of each 32-byte block plus a single stamped continuation
byte, then a partial final block) and a descending path that XORs a u64 at a time via
`xor_copy_block` for a vectorizable inner loop. The emitted bytes are identical to the
previous implementation for every length and direction (full-block counts and final
length byte match exactly); only the write strategy changes.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Replace the `with_iterator` traversal in `encode_varbinview` with a direct walk over
the view array: cache the data-buffer slices once, then for each row read the bytes
straight from the inlined view slot or the referenced buffer at `offset..offset+len`.
This drops the iterator's per-row option/bounds machinery. Validity is resolved once
via `resolve_validity`, keeping the no-nulls path branch-free on validity.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The auto-vectorized offset loops and the varlen block writer used raw `as` casts
that trip this crate's `cast_possible_truncation` lint. Iterate a `u32` counter
instead of casting `usize` per element, and use `u8`/`u32` `try_from` for the
varlen final-block length byte and total byte count. No behavior change.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Classify each column in the size pass (`ColKind` + `first_varlen_idx`): a fixed-width
column with no varlen column before it has a constant within-row offset, so its write
position is pure arithmetic (`i * fixed_per_row + prefix + var_prefix[i]`) with no
per-row cursor. Route those columns through `field_encode_fixed_arithmetic`; the cursor
path is seeded to start at the first varlen column. Primitive columns in the pure-fixed
case use a `chunks_exact_mut` hot loop (matching arrow-row's not-null path); all other
fixed types reuse the cursor encoder at the computed offsets, so output is byte-identical.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
FSST is not order-preserving, so row keys must be the decompressed bytes; the only
strategy today is decompress to a canonical VarBinView then row-encode it. This bench
measures that path and its two phases (decompress-only, and row-encode of an
already-decompressed column) on compressible multi-block strings, to quantify the
opportunity for a future fused FSST row-encode kernel: the phases are additive
(decompress ~46%, row-encode ~54%), and the row-encode phase re-reads/re-writes the
decompressed bytes a fused kernel could emit once.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Apply nightly rustfmt formatting to the FSST benchmark added in the previous commit.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds `fsst_fast_fused`: bulk-decompresses the FSST code heap straight into a
contiguous buffer (no intermediate VarBinViewArray) and block-encodes rows directly
into the row-key ListView using the stored uncompressed_lengths (free size pass), with
the same no-zero-init / no-extra-copy techniques as the row encoder. Lets us compare
the fused path head-to-head against decode-then-convert.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant