feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads#3461
Conversation
…icient reads
Columns that contain large or frequently repeated strings (e.g. JSON
blobs, low-cardinality categoricals) can exhaust memory when PyArrow
loads them as plain string arrays. PyArrow's Parquet reader supports
reading such columns as dictionary-encoded arrays, which deduplicates
values and can dramatically reduce memory usage.
Add a dictionary_columns: tuple[str, ...] parameter to Table.scan()
(and the underlying TableScan / ArrowScan classes) that is forwarded
to _get_file_format() as PyArrow's dictionary_columns kwarg. Only
applies to Parquet files; silently ignored for ORC.
Usage:
table.scan(dictionary_columns=("payload",)).to_arrow()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fokko
left a comment
There was a problem hiding this comment.
More of a meta question, but I don't think this helps for large JSON blobs since it would directly exceed the dictionary limit and fall back to plain encoding. I do think this helps a lot for low-cardinality strings.
Do we know how Arrow decodes the data? For example, if the Parquet column is dictionary encoded, would Arrow do something smart with the buffers to not repeat this value many times?
| An integer representing the number of rows to | ||
| return in the scan result. If None, fetches all | ||
| matching rows. | ||
| dictionary_columns: |
There was a problem hiding this comment.
I'm hesitant to add Arrow specific things to the public API
There was a problem hiding this comment.
Good point @Fokko I have moved dictionary_columns off the public scan() API and onto the Arrow-specific output methods instead:
table.scan(...).to_arrow(dictionary_columns=("payload",))
table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",))That way it doesn't pollute the general scan interface. ArrowScan still accepts it for lower-level use. Pushed in the latest commit. Let me know if you have any further suggestions.
…w_batch_reader()
Addresses reviewer feedback that dictionary_columns is an Arrow-specific
concern and should not be part of the general-purpose scan() public API.
The parameter is now accepted directly by the Arrow output methods:
table.scan(...).to_arrow(dictionary_columns=("payload",))
table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",))
ArrowScan still accepts dictionary_columns for lower-level use.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #3170
Rationale
Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its
dictionary_columnskwarg, which deduplicates values and can dramatically reduce peak memory usage.This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.
Changes
dictionary_columns: tuple[str, ...] = ()toTable.scan(),TableScan.__init__, andStagedTable.scan().DataScan.to_arrow()andto_arrow_batch_reader()→ArrowScan.__init__→_task_to_record_batches→_get_file_format().task.file.file_format == FileFormat.PARQUET; silently ignored for ORC (which does not support this kwarg).Usage
Verification
test_dictionary_columns_produces_dict_encoded_output— confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.make lint✓pytest tests/table/ tests/io/test_pyarrow.py✓