feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253
Open
joseph-isaacs wants to merge 14 commits into
Open
feat(vortex-row): row-oriented byte encoder (size + encode passes)#8253joseph-isaacs wants to merge 14 commits into
joseph-isaacs wants to merge 14 commits into
Conversation
Adds `vortex-row`, which encodes N columnar arrays into a single byte-comparable `ListView<u8>` (the Vortex analogue of arrow-row) for use as sort/row keys. Encoding runs as two scalar functions behind the `RowEncoder` API: a `RowSize` sizing/classification pass and a `RowEncode` pass that allocates one contiguous buffer and writes each column left-to-right into its per-row slot. Per-column ordering (`RowSortField`) controls ascending/ descending and null placement. Includes the byte codec for fixed-width, varlen, and nested canonical types, the `convert_columns`/`compute_row_sizes` helpers, round-trip + invariant tests, and arrow-row-baselined throughput benches. The crate is marked `publish = false` for now, so no public-api.lock is tracked. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will degrade performance by 12.7%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
31.7 µs | 46.6 µs | -31.87% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] |
275.3 ns | 246.1 ns | +11.85% |
| 🆕 | Simulation | struct_mixed_vortex |
N/A | 22.9 ms | N/A |
| 🆕 | Simulation | primitive_i64_vortex |
N/A | 1.5 ms | N/A |
| 🆕 | Simulation | fsst_decompress_only |
N/A | 16.4 ms | N/A |
| 🆕 | Simulation | plain_row_encode_only |
N/A | 14.9 ms | N/A |
| 🆕 | Simulation | primitive_i64_arrow_row |
N/A | 2.4 ms | N/A |
| 🆕 | Simulation | fsst_fast_fused |
N/A | 29.5 ms | N/A |
| 🆕 | Simulation | fsst_unpack_then_convert |
N/A | 30.6 ms | N/A |
| 🆕 | Simulation | struct_mixed_arrow_row |
N/A | 18.7 ms | N/A |
| 🆕 | Simulation | utf8_arrow_row |
N/A | 8.6 ms | N/A |
| 🆕 | Simulation | utf8_vortex |
N/A | 9.2 ms | N/A |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/nice-archimedes-yjGyO (946e18c) with develop (e484daf)
Add a CodSpeed shard for `vortex-row` so the `row_encode` divan benchmarks (vortex vs arrow-row) build and run in CI alongside the other crates. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The row encoder builds the output `(elements, offsets, sizes)` triple itself, so the invariants `ListViewArray::try_new` validates (monotone offsets, per-row slices within bounds and disjoint) already hold by construction. Skip the revalidation walk via `new_unchecked`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Introduce `ValidityKind`/`resolve_validity`: resolve a column's validity once, materializing the per-row mask only when the column may actually contain nulls. The size pass for varbinview and the bool and primitive encoders now branch once on validity, so the all-valid path drops the per-row `mask.value(i)` check (and mask allocation) entirely. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Every byte of the output range is written by some encoder: fixed-width null rows write sentinel + explicit zero-fill, varlen encoders zero-pad their final partial block, and struct/FSL null parent bodies are overwritten with the canonical null encoding. The pre-zero-init memset is therefore redundant, so replace it with `set_len`, saving a `total_len`-byte memset per call. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Materialize the listview offsets buffer with `set_len` + a slice write instead of per-row `push`. For the pure-fixed path, `iter_mut().enumerate()` lets LLVM auto-vectorize `offsets[i] = i * fixed_per_row` (no per-element bounds or capacity checks). `nrows` is validated to fit u32 at function entry, so the cast is exact. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Write the mixed (fixed + varlen) offsets through `iter_mut().zip` with wrapping arithmetic, mirroring the pure-fixed path: this elides per-element bounds checks so the `i * fixed_per_row` multiply auto-vectorizes while the varlen prefix sum stays a cheap sequential accumulator. The total is validated to fit u32 upstream, so the wrapping operations never actually wrap. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ping The varlen body writer was a per-byte XOR loop. Split it into an ascending fast path (`copy_nonoverlapping` of each 32-byte block plus a single stamped continuation byte, then a partial final block) and a descending path that XORs a u64 at a time via `xor_copy_block` for a vectorizable inner loop. The emitted bytes are identical to the previous implementation for every length and direction (full-block counts and final length byte match exactly); only the write strategy changes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Replace the `with_iterator` traversal in `encode_varbinview` with a direct walk over the view array: cache the data-buffer slices once, then for each row read the bytes straight from the inlined view slot or the referenced buffer at `offset..offset+len`. This drops the iterator's per-row option/bounds machinery. Validity is resolved once via `resolve_validity`, keeping the no-nulls path branch-free on validity. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The auto-vectorized offset loops and the varlen block writer used raw `as` casts that trip this crate's `cast_possible_truncation` lint. Iterate a `u32` counter instead of casting `usize` per element, and use `u8`/`u32` `try_from` for the varlen final-block length byte and total byte count. No behavior change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Classify each column in the size pass (`ColKind` + `first_varlen_idx`): a fixed-width column with no varlen column before it has a constant within-row offset, so its write position is pure arithmetic (`i * fixed_per_row + prefix + var_prefix[i]`) with no per-row cursor. Route those columns through `field_encode_fixed_arithmetic`; the cursor path is seeded to start at the first varlen column. Primitive columns in the pure-fixed case use a `chunks_exact_mut` hot loop (matching arrow-row's not-null path); all other fixed types reuse the cursor encoder at the computed offsets, so output is byte-identical. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
FSST is not order-preserving, so row keys must be the decompressed bytes; the only strategy today is decompress to a canonical VarBinView then row-encode it. This bench measures that path and its two phases (decompress-only, and row-encode of an already-decompressed column) on compressible multi-block strings, to quantify the opportunity for a future fused FSST row-encode kernel: the phases are additive (decompress ~46%, row-encode ~54%), and the row-encode phase re-reads/re-writes the decompressed bytes a fused kernel could emit once. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Apply nightly rustfmt formatting to the FSST benchmark added in the previous commit. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds `fsst_fast_fused`: bulk-decompresses the FSST code heap straight into a contiguous buffer (no intermediate VarBinViewArray) and block-encodes rows directly into the row-key ListView using the stored uncompressed_lengths (free size pass), with the same no-zero-init / no-extra-copy techniques as the row encoder. Lets us compare the fused path head-to-head against decode-then-convert. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
vortex-row, a new crate that encodes one or more columnar Vortex arrays into a singleListView<u8>whose per-row byte slices are lexicographically comparable. The byte ordermatches tuple ordering of the input values under per-column sort options, so the output works
directly as a sort key / row key — the Vortex analogue of
arrow-row.This is the base of the row-encoding work: the byte codec, the two-pass converter, the
public API, tests, and benches. Codec hot-path performance tuning and the per-encoding
(Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta) fast-path kernels land in
follow-up PRs and are intentionally out of scope here.
Single clean commit on top of current
develop. The crate is markedpublish = false, so —following the same convention as
vortex-python,vortex-bench, andxtask— nopublic-api.lockis tracked while the API is still settling.Design
Encoding runs as two scalar functions wired behind the
RowEncoderAPI:RowSize. Walks the N input columns once, classifies each column asfixed- or variable-width, accumulates the fixed-width prefix per row, and lazily collects
per-row variable lengths. Returns
Struct { fixed: u32, var: u32 }so callers read per-rowwidths without materializing the constant
fixedslot as a per-row buffer.RowEncode. Uses those sizes to compute totals, allocate one contiguouselements buffer, build per-row absolute offsets, then writes each column left-to-right into
its per-row slot via a write cursor that doubles as the ListView
sizesarray — so noseparate finalize step is needed.
The converter is effectively 2 passes for the pure-fixed-width case and 3 when
variable-length columns require the prefix-sum offsets pass.
Per-column ordering is controlled by
RowSortField { descending, nulls_first }: descendingreverses the encoded value bytes, and a leading sentinel byte (
0x00/0x01/0x02) placesnulls before or after non-nulls independently of sort direction.
convert_columnsand where the API livesconvert_columns(cols: &[ArrayRef], fields: &[RowSortField], ctx) -> VortexResult<ListViewArray>is the one-shot entry point;
RowEncoderis the reusable form (build once with options, encodemany). Each public item is defined in a single file:
RowEncoder,convert_columns(_with_options),compute_row_sizes(_with_options)src/encoder.rsRowEncodescalar fn + the 5-phase encode driversrc/encode.rsRowSizescalar fn + the size/classify pass (compute_sizes)src/size.rsRowEncodingOptions,RowSortFieldsrc/options.rsfield_size/field_encode)src/codec.rsinitialize(session)+ re-exportssrc/lib.rsImplementation.
convert_columnsvalidates the columns (non-empty, equal lengths, oneRowSortFieldper column) and delegates to aRowEncoder, which runsRowSizethenRowEncode: the size pass canonicalizes each column once and classifies it fixed- vsvariable-width, accumulating a constant
fixed_per_rowand — only when needed — per-rowvar_lengths; the encode pass sums those into the total byte length, allocates one contiguouselements buffer, computes per-row offsets (
i * fixed_per_rowplus a varlen prefix sum), thenwrites each column left-to-right into its per-row slot using a cursor that becomes the ListView
sizesarray, producing the finalListView<u8>with no separate finalize step.Type coverage
Supported: nulls, booleans, integer/float primitives, decimals up to 128 bits, UTF-8 and
binary, structs, fixed-size lists, and extensions whose storage type is supported. Variant,
union, and variable-size list arrays are rejected — this crate does not define an ordering for
them.
Testing
cargo test -p vortex-row— 21 passed: sort-order round-trip tests (bool, i64 asc/desc,u32, f16/f32/f64, utf8, multi-column, nulls first/last, struct), the single-buffer ListView
invariant,
RowSizeoutput shape, and dtype rejection.cargo build -p vortex-rowandcargo clippy -p vortex-row --all-targets— clean.Generated by Claude Code