Commit d31644a

GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)
Closes #41863

### Rationale for this change

Other tools in the Parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.

* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
1 parent 49423f8 commit d31644a

4 files changed

Lines changed: 16 additions & 3 deletions

docs/source/python/parquet.rst

Lines changed: 3 additions & 0 deletions
@@ -437,6 +437,9 @@ also supported:
 Snappy generally results in better performance, while Gzip may yield smaller
 files.
 
+``'lz4_raw'`` is also accepted as an alias for ``'lz4'``. Both use the
+LZ4_RAW codec as defined in the Parquet specification.
+
 These settings can also be set on a per-column basis:
 
 .. code-block:: python

python/pyarrow/_parquet.pyx

Lines changed: 2 additions & 2 deletions
@@ -1524,7 +1524,7 @@ cdef compression_name_from_enum(ParquetCompression compression_):
 
 cdef int check_compression_name(name) except -1:
     if name.upper() not in {'NONE', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4',
-                            'ZSTD'}:
+                            'LZ4_RAW', 'ZSTD'}:
         raise ArrowException("Unsupported compression: " + name)
     return 0
 
@@ -1539,7 +1539,7 @@ cdef ParquetCompression compression_from_name(name):
         return ParquetCompression_LZO
     elif name == 'BROTLI':
         return ParquetCompression_BROTLI
-    elif name == 'LZ4':
+    elif name == 'LZ4' or name == 'LZ4_RAW':
         return ParquetCompression_LZ4
     elif name == 'ZSTD':
         return ParquetCompression_ZSTD
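
The two hunks above validate a codec name and then map it to an enum value, with `LZ4` and `LZ4_RAW` landing on the same member. The following is a minimal pure-Python sketch of that idea, not the Cython code itself. The `Codec` enum, the `_NAME_TO_CODEC` table and `codec_from_name` are hypothetical names chosen for illustration, and `ValueError` stands in for `ArrowException`.

```python
from enum import Enum, auto


class Codec(Enum):
    """Hypothetical stand-in for the ParquetCompression enum."""
    UNCOMPRESSED = auto()
    SNAPPY = auto()
    GZIP = auto()
    LZO = auto()
    BROTLI = auto()
    LZ4 = auto()
    ZSTD = auto()


# One lookup table doubles as the list of accepted names and the mapping.
_NAME_TO_CODEC = {
    "NONE": Codec.UNCOMPRESSED,
    "SNAPPY": Codec.SNAPPY,
    "GZIP": Codec.GZIP,
    "LZO": Codec.LZO,
    "BROTLI": Codec.BROTLI,
    "LZ4": Codec.LZ4,
    "LZ4_RAW": Codec.LZ4,  # alias: same codec as "LZ4"
    "ZSTD": Codec.ZSTD,
}


def codec_from_name(name: str) -> Codec:
    """Resolve a codec name case-insensitively, rejecting unknown names."""
    try:
        return _NAME_TO_CODEC[name.upper()]
    except KeyError:
        # ValueError is an assumption of this sketch; the diff raises ArrowException.
        raise ValueError("Unsupported compression: " + name) from None
```

Keeping the accepted names and the mapping in one table means an alias cannot be added to one and forgotten in the other, which is the pair of edits this commit had to make by hand.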

python/pyarrow/parquet/core.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -768,7 +768,9 @@ def _sanitize_table(table, new_schema, flavor):
     doesn't support dictionary encoding.
 compression : str or dict, default 'snappy'
     Specify the compression codec, either on a general basis or per-column.
-    Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
+    Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'LZ4_RAW', 'ZSTD'}.
+    'LZ4_RAW' is accepted as an alias for 'LZ4' (both use the LZ4_RAW
+    codec as defined in the Parquet specification).
 write_statistics : bool or list, default True
     Specify if we should write statistics in general (default is True) or only
     for some columns.
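
The docstring above allows `compression` to be either a single string or a per-column dict. As a rough illustration of how the alias interacts with both shapes, here is a small pure-Python sketch that canonicalizes either form. `normalize_compression`, `_canonical`, `_VALID` and `_ALIASES` are hypothetical helpers invented for this example and are not part of PyArrow; `ValueError` again stands in for `ArrowException`.

```python
# Names taken from the "Valid values" list in the docstring above.
_VALID = frozenset(
    {"NONE", "SNAPPY", "GZIP", "BROTLI", "LZ4", "LZ4_RAW", "ZSTD"}
)

# Alias table: 'LZ4_RAW' resolves to the same codec as 'LZ4'.
_ALIASES = {"LZ4_RAW": "LZ4"}


def _canonical(name):
    """Upper-case a codec name, validate it, and resolve aliases."""
    upper = name.upper()
    if upper not in _VALID:
        raise ValueError("Unsupported compression: " + name)
    return _ALIASES.get(upper, upper)


def normalize_compression(compression):
    """Canonicalize a str or a {column: codec} dict (hypothetical helper)."""
    if isinstance(compression, dict):
        return {col: _canonical(codec) for col, codec in compression.items()}
    return _canonical(compression)
```

For example, `normalize_compression({"a": "lz4_raw", "b": "snappy"})` yields `{"a": "LZ4", "b": "SNAPPY"}` in this sketch, so both spellings of the LZ4 codec end up indistinguishable downstream.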

python/pyarrow/tests/parquet/test_basic.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -612,6 +612,14 @@ def test_compression_level():
                          compression_level=level)
 
 
+def test_lz4_raw_compression_alias():
+    # GH-41863: lz4_raw should be accepted as a compression name alias
+    arr = pa.array(list(map(int, range(1000))))
+    table = pa.Table.from_arrays([arr, arr], names=['a', 'b'])
+    _check_roundtrip(table, expected=table, compression="lz4_raw")
+    _check_roundtrip(table, expected=table, compression="LZ4_RAW")
+
+
 def test_sanitized_spark_field_names():
     a0 = pa.array([0, 1, 2, 3, 4])
     name = 'prohib; ,\t{}'
