feat(vortex-row): row-oriented byte encoder (size + encode passes) by joseph-isaacs · Pull Request #8253 · vortex-data/vortex

joseph-isaacs · 2026-06-04T14:28:46Z

Summary

Adds vortex-row, a new crate that encodes one or more columnar Vortex arrays into a single
ListView<u8> whose per-row byte slices are lexicographically comparable. The byte order
matches tuple ordering of the input values under per-column sort options, so the output works
directly as a sort key / row key — the Vortex analogue of arrow-row.

This is the base of the row-encoding work: the byte codec, the two-pass converter, the
public API, tests, and benches. Codec hot-path performance tuning and the per-encoding
(Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta) fast-path kernels land in
follow-up PRs and are intentionally out of scope here.

Single clean commit on top of current develop. The crate is marked publish = false, so —
following the same convention as vortex-python, vortex-bench, and xtask — no
public-api.lock is tracked while the API is still settling.

Design

Encoding runs as two scalar functions wired behind the RowEncoder API:

Size pass — RowSize. Walks the N input columns once, classifies each column as
fixed- or variable-width, accumulates the fixed-width prefix per row, and lazily collects
per-row variable lengths. Returns Struct { fixed: u32, var: u32 } so callers read per-row
widths without materializing the constant fixed slot as a per-row buffer.
Encode pass — RowEncode. Uses those sizes to compute totals, allocate one contiguous
elements buffer, build per-row absolute offsets, then writes each column left-to-right into
its per-row slot via a write cursor that doubles as the ListView sizes array — so no
separate finalize step is needed.

The converter is effectively 2 passes for the pure-fixed-width case and 3 when
variable-length columns require the prefix-sum offsets pass.

Per-column ordering is controlled by RowSortField { descending, nulls_first }: descending
reverses the encoded value bytes, and a leading sentinel byte (0x00 / 0x01 / 0x02) places
nulls before or after non-nulls independently of sort direction.

`convert_columns` and where the API lives

convert_columns(cols: &[ArrayRef], fields: &[RowSortField], ctx) -> VortexResult<ListViewArray>
is the one-shot entry point; RowEncoder is the reusable form (build once with options, encode
many). Each public item is defined in a single file:

Item	File
`RowEncoder`, `convert_columns(_with_options)`, `compute_row_sizes(_with_options)`	`src/encoder.rs`
`RowEncode` scalar fn + the 5-phase encode driver	`src/encode.rs`
`RowSize` scalar fn + the size/classify pass (`compute_sizes`)	`src/size.rs`
`RowEncodingOptions`, `RowSortField`	`src/options.rs`
per-dtype byte codec (`field_size` / `field_encode`)	`src/codec.rs`
`initialize(session)` + re-exports	`src/lib.rs`

Implementation. convert_columns validates the columns (non-empty, equal lengths, one
RowSortField per column) and delegates to a RowEncoder, which runs RowSize then
RowEncode: the size pass canonicalizes each column once and classifies it fixed- vs
variable-width, accumulating a constant fixed_per_row and — only when needed — per-row
var_lengths; the encode pass sums those into the total byte length, allocates one contiguous
elements buffer, computes per-row offsets (i * fixed_per_row plus a varlen prefix sum), then
writes each column left-to-right into its per-row slot using a cursor that becomes the ListView
sizes array, producing the final ListView<u8> with no separate finalize step.

Type coverage

Supported: nulls, booleans, integer/float primitives, decimals up to 128 bits, UTF-8 and
binary, structs, fixed-size lists, and extensions whose storage type is supported. Variant,
union, and variable-size list arrays are rejected — this crate does not define an ordering for
them.

Testing

cargo test -p vortex-row — 21 passed: sort-order round-trip tests (bool, i64 asc/desc,
u32, f16/f32/f64, utf8, multi-column, nulls first/last, struct), the single-buffer ListView
invariant, RowSize output shape, and dtype rejection.
cargo build -p vortex-row and cargo clippy -p vortex-row --all-targets — clean.

Generated by Claude Code

Adds `vortex-row`, which encodes N columnar arrays into a single byte-comparable `ListView<u8>` (the Vortex analogue of arrow-row) for use as sort/row keys. Encoding runs as two scalar functions behind the `RowEncoder` API: a `RowSize` sizing/classification pass and a `RowEncode` pass that allocates one contiguous buffer and writes each column left-to-right into its per-row slot. Per-column ordering (`RowSortField`) controls ascending/ descending and null placement. Includes the byte codec for fixed-width, varlen, and nested canonical types, the `convert_columns`/`compute_row_sizes` helpers, round-trip + invariant tests, and arrow-row-baselined throughput benches. The crate is marked `publish = false` for now, so no public-api.lock is tracked. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq · 2026-06-04T14:43:42Z

Merging this PR will degrade performance by 12.7%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1505 untouched benchmarks
🆕 10 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	31.7 µs	46.6 µs	-31.87%
⚡	Simulation	`bitwise_not_vortex_buffer_mut[128]`	275.3 ns	246.1 ns	+11.85%
🆕	Simulation	`struct_mixed_vortex`	N/A	22.9 ms	N/A
🆕	Simulation	`primitive_i64_vortex`	N/A	1.5 ms	N/A
🆕	Simulation	`fsst_decompress_only`	N/A	16.4 ms	N/A
🆕	Simulation	`plain_row_encode_only`	N/A	14.9 ms	N/A
🆕	Simulation	`primitive_i64_arrow_row`	N/A	2.4 ms	N/A
🆕	Simulation	`fsst_fast_fused`	N/A	29.5 ms	N/A
🆕	Simulation	`fsst_unpack_then_convert`	N/A	30.6 ms	N/A
🆕	Simulation	`struct_mixed_arrow_row`	N/A	18.7 ms	N/A
🆕	Simulation	`utf8_arrow_row`	N/A	8.6 ms	N/A
🆕	Simulation	`utf8_vortex`	N/A	9.2 ms	N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/nice-archimedes-yjGyO (946e18c) with develop (e484daf)}

Add a CodSpeed shard for `vortex-row` so the `row_encode` divan benchmarks (vortex vs arrow-row) build and run in CI alongside the other crates. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The row encoder builds the output `(elements, offsets, sizes)` triple itself, so the invariants `ListViewArray::try_new` validates (monotone offsets, per-row slices within bounds and disjoint) already hold by construction. Skip the revalidation walk via `new_unchecked`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Introduce `ValidityKind`/`resolve_validity`: resolve a column's validity once, materializing the per-row mask only when the column may actually contain nulls. The size pass for varbinview and the bool and primitive encoders now branch once on validity, so the all-valid path drops the per-row `mask.value(i)` check (and mask allocation) entirely. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Every byte of the output range is written by some encoder: fixed-width null rows write sentinel + explicit zero-fill, varlen encoders zero-pad their final partial block, and struct/FSL null parent bodies are overwritten with the canonical null encoding. The pre-zero-init memset is therefore redundant, so replace it with `set_len`, saving a `total_len`-byte memset per call. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Materialize the listview offsets buffer with `set_len` + a slice write instead of per-row `push`. For the pure-fixed path, `iter_mut().enumerate()` lets LLVM auto-vectorize `offsets[i] = i * fixed_per_row` (no per-element bounds or capacity checks). `nrows` is validated to fit u32 at function entry, so the cast is exact. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Write the mixed (fixed + varlen) offsets through `iter_mut().zip` with wrapping arithmetic, mirroring the pure-fixed path: this elides per-element bounds checks so the `i * fixed_per_row` multiply auto-vectorizes while the varlen prefix sum stays a cheap sequential accumulator. The total is validated to fit u32 upstream, so the wrapping operations never actually wrap. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…ping The varlen body writer was a per-byte XOR loop. Split it into an ascending fast path (`copy_nonoverlapping` of each 32-byte block plus a single stamped continuation byte, then a partial final block) and a descending path that XORs a u64 at a time via `xor_copy_block` for a vectorizable inner loop. The emitted bytes are identical to the previous implementation for every length and direction (full-block counts and final length byte match exactly); only the write strategy changes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Replace the `with_iterator` traversal in `encode_varbinview` with a direct walk over the view array: cache the data-buffer slices once, then for each row read the bytes straight from the inlined view slot or the referenced buffer at `offset..offset+len`. This drops the iterator's per-row option/bounds machinery. Validity is resolved once via `resolve_validity`, keeping the no-nulls path branch-free on validity. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The auto-vectorized offset loops and the varlen block writer used raw `as` casts that trip this crate's `cast_possible_truncation` lint. Iterate a `u32` counter instead of casting `usize` per element, and use `u8`/`u32` `try_from` for the varlen final-block length byte and total byte count. No behavior change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Classify each column in the size pass (`ColKind` + `first_varlen_idx`): a fixed-width column with no varlen column before it has a constant within-row offset, so its write position is pure arithmetic (`i * fixed_per_row + prefix + var_prefix[i]`) with no per-row cursor. Route those columns through `field_encode_fixed_arithmetic`; the cursor path is seeded to start at the first varlen column. Primitive columns in the pure-fixed case use a `chunks_exact_mut` hot loop (matching arrow-row's not-null path); all other fixed types reuse the cursor encoder at the computed offsets, so output is byte-identical. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

FSST is not order-preserving, so row keys must be the decompressed bytes; the only strategy today is decompress to a canonical VarBinView then row-encode it. This bench measures that path and its two phases (decompress-only, and row-encode of an already-decompressed column) on compressible multi-block strings, to quantify the opportunity for a future fused FSST row-encode kernel: the phases are additive (decompress ~46%, row-encode ~54%), and the row-encode phase re-reads/re-writes the decompressed bytes a fused kernel could emit once. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Apply nightly rustfmt formatting to the FSST benchmark added in the previous commit. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Adds `fsst_fast_fused`: bulk-decompresses the FSST code heap straight into a contiguous buffer (no intermediate VarBinViewArray) and block-encodes rows directly into the row-key ListView using the stored uncompressed_lengths (free size pass), with the same no-zero-init / no-extra-copy techniques as the row encoder. Lets us compare the fused path head-to-head against decode-then-convert. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs added the changelog/feature A new feature label Jun 4, 2026 — with Claude

ci(vortex-row): run row_encode benchmarks on CodSpeed

083c7f3

Add a CodSpeed shard for `vortex-row` so the `row_encode` divan benchmarks (vortex vs arrow-row) build and run in CI alongside the other crates. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs marked this pull request as ready for review June 4, 2026 15:35

joseph-isaacs requested a review from robert3005 June 4, 2026 15:41

joseph-isaacs added 12 commits June 4, 2026 17:24

vortex-row: rustfmt the fsst row-encode benchmark

4f155e2

Apply nightly rustfmt formatting to the FSST benchmark added in the previous commit. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253

feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253
joseph-isaacs wants to merge 14 commits into
developfrom
claude/nice-archimedes-yjGyO

joseph-isaacs commented Jun 4, 2026

Uh oh!

codspeed-hq Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joseph-isaacs commented Jun 4, 2026

Summary

Design

convert_columns and where the API lives

Type coverage

Testing

Uh oh!

codspeed-hq Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will degrade performance by 12.7%

Performance Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`convert_columns` and where the API lives

codspeed-hq Bot commented Jun 4, 2026 •

edited

Loading