[SPARK-56350][SQL] Skip ColumnarToRow for Arrow-backed input to Python UDFs #55120
viirya wants to merge 32 commits into apache:master
Conversation
Usage:
  cd $SPARK_HOME
  python python/pyspark/sql/tests/pandas/bench_arrow_columnar_udf.py \
    [--rows N] [--iterations N] [--partitions N]
Due to some constraints on my local dev environment, I cannot build Spark locally to run this benchmark for now. I will update the result once the constraints are removed.
| Scenario | Rows | String Length | UDF | Sink | Arrow Columnar (ms) | Row-based (ms) | Speedup |
|---|---|---|---|---|---|---|---|
| string concat | 2M | 200 chars | name + data | noop | 2086 | 2401 | 1.15x |
| string identity | 1M | 1000 chars | return data | noop | 4506 | 4839 | 1.07x |
| string identity | 2M | 200 chars | return data | noop | 2118 | 2324 | 1.10x |
| string identity | 10M | 100 chars | return data | noop | 6067 | 7124 | 1.17x |
| string identity | 20M | 100 chars | return data | noop | 11917 | 14106 | 1.18x |
Arrow columnar path:
*(1) Project [id#0, name#1, value#2, data#3, pythonUDF0#6 AS identity_udf(data)#5]
+- ArrowEvalPython [identity_udf(data#3)#4], [pythonUDF0#6], 200
+- BatchScan ArrowBackedTestTable[id#0, name#1, value#2, data#3]
Row-based path:
*(2) Project [id#14, name#15, value#16, data#17, pythonUDF0#20 AS identity_udf(data)#19]
+- ArrowEvalPython [identity_udf(data#17)#18], [pythonUDF0#20], 200
+- *(1) ColumnarToRow
+- BatchScan ArrowBackedTestTable[id#14, name#15, value#16, data#17]
Please create and use a JIRA ID in the PR title before converting back from
Okay. Thank you @dongjoon-hyun
Force-pushed 20604d4 to b7586ec
Oh, I didn't notice this is still under development.
I was unable to run the benchmark due to some constraints on my local environment. I found a way to overcome them and ran the benchmark. These changes are adjustments to benchmark-related stuff only, plus a new config to control this new behavior. They are just to make the benchmark measure the difference correctly.
When a DSv2 connector produces an Arrow-backed ColumnarBatch (with ArrowColumnVector), ArrowEvalPythonExec now extracts the underlying FieldVectors directly and serializes them to Arrow IPC, bypassing the wasteful columnar -> row -> columnar round-trip.

Key changes:
- ArrowEvalPythonExec declares supportsColumnar = child.supportsColumnar so the transition rule no longer inserts ColumnarToRowExec.
- New ColumnarArrowPythonInput trait handles the Arrow-direct path (VectorSchemaRoot.of + VectorUnloader) with a row-by-row fallback for non-Arrow ColumnarBatch.
- New ColumnarArrowEvalPythonEvaluatorFactory resolves UDF input columns by index when all inputs are simple AttributeReferences, falling back to row-based evaluation for complex expressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redesign the columnar evaluator factory to stay fully in columnar format instead of buffering pass-through columns as rows:
- doExecuteColumnar() returns RDD[ColumnarBatch] with combined pass-through + UDF result columns.
- doExecute() delegates to doExecuteColumnar().flatMap(_.rowIterator) when the child supports columnar, mirroring DBR's approach.
- Pass-through columns are kept as ColumnVector references in a queue instead of being converted to rows via HybridRowQueue.
- combineResults() directly concatenates pass-through and result ColumnVector arrays into a new ColumnarBatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove internal references from comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A test-only DataSource V2 that produces ColumnarBatch with ArrowColumnVector columns (backed by Arrow IntVector, VarCharVector, Float8Vector). Used for testing and benchmarking the columnar Arrow Python UDF path.

Schema: (id INT, name STRING, value DOUBLE)
Configurable via options: numRows (default 10000), numPartitions (default 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PySpark benchmark script that compares end-to-end execution time of
scalar Arrow UDFs with two data sources:
- ArrowBackedDataSourceV2 (columnar path, direct FieldVector extraction)
- spark.range() (row-based path, ColumnarToRow + ArrowWriter)
Uses a minimal UDF (id + value) to isolate data transfer overhead.
Prints physical plans to verify the different execution paths.
Usage:
python python/pyspark/sql/tests/pandas/bench_arrow_columnar_udf.py \
[--rows N] [--iterations N] [--partitions N]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SupportsRead and InternalRow imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests using ArrowBackedDataSourceV2 and TestScalarPandasUDF:
- Plan test: no ColumnarToRowExec before ArrowEvalPythonExec when reading from an Arrow-backed source.
- Correctness: scalar pandas UDF produces correct results on Arrow-backed columnar input.
- Multiple UDF columns: pass-through and multiple UDF results.
- Multiple partitions: correct results across partition boundaries.
- Baseline: row-based source does not trigger the columnar path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Python worker does not preserve input batch boundaries for scalar UDFs -- it may combine multiple input batches into one output batch. The previous columnar pass-through queue (ArrayDeque of ColumnVector references) assumed 1:1 batch alignment, causing assertion failures on all columnar-capable sources (Parquet vectorized reader, InMemoryTableScan, etc.).

Fix: use HybridRowQueue to buffer pass-through rows (same as the existing row-based evaluator), which handles arbitrary output batch sizes correctly. The key optimization -- sending Arrow FieldVectors directly to Python instead of converting through ArrowWriter -- is preserved on the input side.

Also remove doExecuteColumnar() since the evaluator now returns Iterator[InternalRow] (required for row-based result joining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
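The batch-alignment problem described above can be shown without Spark. A toy pure-Python sketch (hypothetical helper name, not the PR's code): once a worker re-cuts rows at a max-records boundary, output batch sizes no longer line up with a queue that holds one pass-through entry per input batch.

```python
from itertools import islice

def rebatch(batches, max_records):
    """Flatten input batches and re-cut them every max_records rows,
    the way a maxRecordsPerBatch-style worker would."""
    flat = (row for batch in batches for row in batch)
    while True:
        out = list(islice(flat, max_records))
        if not out:
            return
        yield out

inputs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # three input batches
outputs = list(rebatch(inputs, max_records=4))

# Output sizes (4, 4, 1) no longer match input sizes (3, 2, 4), so a
# pass-through queue assuming 1:1 alignment cannot be zipped with them.
assert [len(b) for b in outputs] == [4, 4, 1]
assert [len(b) for b in inputs] != [len(b) for b in outputs]
```

This is exactly why the fix falls back to a row queue, which tolerates arbitrary output batch sizes.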
…wQueue)

The Python scalar UDF worker preserves 1:1 batch correspondence (one output batch per input batch), so the evalColumnar path can safely combine pass-through ColumnVector references with UDF result columns at the columnar level. However, the evalRowFallback path uses BatchedPythonArrowInput, which re-batches rows by maxRecordsPerBatch and breaks the original batch boundaries. This path must use HybridRowQueue for row-based joining.

evalColumnar: ColumnarBatch pass-through queue + columnar combining
evalRowFallback: HybridRowQueue + JoinedRow (same as the existing path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Specify returnType = Some(StringType) for TestScalarPandasUDF to avoid VOID type inference when called via the DataFrame API.
- Use selectExpr() with registered UDF names instead of direct UDF function calls for proper SQL function resolution.
- Fix the supportsColumnar test to check for columnar-capable scan nodes instead of the top-level plan (which is ColumnarToRowExec).
- Fix the plan assertion to check under the ArrowEvalPythonExec subtree only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When supportsColumnar = true, Spark may call executeColumnar() on this node (e.g., when a parent also supports columnar). Without doExecuteColumnar(), the default SparkPlan implementation throws "has column support mismatch".

Add doExecuteColumnar() that delegates to doExecute() and converts row output to ColumnarBatch via RowToColumnarEvaluatorFactory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both evalColumnar and evalRowFallback now maintain 1:1 batch correspondence, enabling columnar pass-through combining in both paths:
- evalColumnar: ColumnarArrowPythonInput sends one IPC RecordBatch per ColumnarBatch (1:1 by design).
- evalRowFallback: uses BasicPythonArrowInput (NOT BatchedPythonArrowInput), where each ColumnarBatch's projected rows form one inner iterator that becomes one IPC RecordBatch. This preserves 1:1 correspondence instead of re-batching by maxRecordsPerBatch.

The evaluator returns Iterator[ColumnarBatch] (pass-through columns combined with UDF result columns). doExecuteColumnar() delegates directly to the evaluator; doExecute() calls doExecuteColumnar() and flattens to rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Non-Arrow columnar readers (Parquet vectorized, InMemoryTableScan) reuse ColumnVector objects across batches. Storing references in the pass-through queue is unsafe because the writer may advance to the next batch before the reader consumes the previous result, causing stale data or an NPE (the nulls array is set to null on reset). Arrow-backed sources allocate independent vectors per batch, so references are safe.

Fix: peek at the first batch to check whether it is Arrow-backed:
- Arrow: columnar pass-through queue + direct FieldVector IPC (path 1)
- Non-Arrow: HybridRowQueue for pass-through buffering (path 2)
- Complex expressions: HybridRowQueue + MutableProjection (path 3)

The evaluator returns Iterator[InternalRow]. doExecuteColumnar() wraps via RowToColumnarEvaluatorFactory for upstream columnar consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
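The vector-reuse hazard above is an aliasing bug, and it reproduces in a few lines of plain Python (hypothetical class, not Spark code): a reader that refills one buffer in place hands out references that all end up pointing at the last batch.

```python
class ReusingReader:
    """Mimics a vectorized reader that reuses one buffer across batches,
    like Parquet's vectorized ColumnVectors."""
    def __init__(self, batches):
        self._batches = batches
        self._buf = []          # the single, reused "ColumnVector"

    def __iter__(self):
        for b in self._batches:
            self._buf[:] = b    # overwrite in place, like reset() + refill
            yield self._buf     # caller receives a reference, not a copy

queued = []
for vec in ReusingReader([[1, 2], [3, 4]]):
    queued.append(vec)          # unsafe: queue the reference

# Both queue entries alias the same buffer and now show the last batch.
assert queued[0] is queued[1]
assert queued[0] == [3, 4]
```

Arrow-backed sources that allocate a fresh vector per batch do not have this problem, which is why the reference-queue fast path is gated on the batch being Arrow-backed.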
ArrowEvalPythonExec does not define a numOutputRows metric. Create local metrics for RowToColumnarEvaluatorFactory instead of referencing non-existent plan metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When only UDF input columns are selected, the optimizer inserts a ProjectExec for column pruning. ProjectExec does not support columnar, so ArrowEvalPythonExec.child.supportsColumnar becomes false and the optimization does not apply.

Fix: select all source columns (id, name, value) plus the UDF result, so the scan output matches the child output and no Project is needed. Also change the assertion to check child.supportsColumnar directly instead of scanning for ColumnarToRowExec in the subtree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mnarBatch

Restore the correct design:
- Evaluator: PartitionEvaluatorFactory[ColumnarBatch, ColumnarBatch]
- doExecuteColumnar(): child.executeColumnar() -> evaluator -> ColumnarBatch
- doExecute(): doExecuteColumnar().flatMap(_.rowIterator()) when columnar, else super.doExecute()

Path 1 (Arrow): columnar combining produces ColumnarBatch directly. Paths 2 & 3 (non-Arrow/complex): HybridRowQueue + JoinedRow produce InternalRow, then RowToColumnConverter wraps back to ColumnarBatch inside the evaluator via rowsToColumnarBatches().

This removes the RowToColumnarEvaluatorFactory hack from doExecuteColumnar and the associated numOutputRows metric-not-found issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
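The re-columnarizing step on paths 2 and 3 is conceptually a transpose plus batching. A toy pure-Python stand-in (hypothetical function name echoing rowsToColumnarBatches; the real code uses RowToColumnConverter on InternalRows):

```python
def rows_to_columnar_batches(rows, batch_size):
    """Cut a row iterator into batches and transpose each batch into
    per-column lists -- a toy analogue of re-columnarizing buffered rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield [list(col) for col in zip(*batch)]
            batch = []
    if batch:                    # emit the final, possibly short batch
        yield [list(col) for col in zip(*batch)]

rows = [(1, "a"), (2, "b"), (3, "c")]
batches = list(rows_to_columnar_batches(rows, batch_size=2))
# Two rows become two column lists; the leftover row forms a final batch.
assert batches == [[[1, 2], ["a", "b"]], [[3], ["c"]]]
```

The point of the design above is that only the non-Arrow/complex paths pay this transpose cost; the Arrow path combines existing column vectors and never sees rows at all.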
ColumnarBatch.rowIterator() returns ColumnarBatchRow, which is NOT UnsafeRow. Downstream operators (e.g., an outer EvalPythonExec in nested UDF queries) cast rows to UnsafeRow for HybridRowQueue, causing a ClassCastException.

Fix: apply UnsafeProjection.create(schema) to convert each row to UnsafeRow before returning from doExecute().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArrowEvalPythonExec:
- Add the spark.sql.execution.arrow.pythonUDF.columnarInput.enabled config to toggle the columnar optimization, enabling fair A/B benchmarking with the same data source.

ArrowBackedDataSourceV2:
- Add a "columnar" option (default true) to control supportColumnarReads, allowing row-based output from the same source.
- Add a "data" column (StringType) with configurable-length random strings to stress ArrowWriter's per-element VarChar serialization.
- Fix the Arrow memory lifecycle: use a single allocator per partition reader instead of per-batch allocators that get closed while pass-through ColumnVector references are still held.

Benchmark:
- Use the noop sink instead of collect() to avoid driver-side I/O overhead.
- Use an identity UDF (return the input as-is) to minimize Python-side cost.
- Use the same data source for both paths, toggled by the config.
- Separate config lifecycle: run the arrow benchmark first (config=true), then the row benchmark (config=false).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ARROW_PYSPARK_UDF_COLUMNAR_INPUT_ENABLED to SQLConf with proper buildConf, doc, version, and accessor method. Replace the inline getConfString in ArrowEvalPythonExec with conf.arrowPySparkUDFColumnarInputEnabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArrowBackedDataSourceV2 now has 4 columns (added data). The plan test must select all columns to prevent column pruning from inserting a ProjectExec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 65e3b20 to 9882045
Restored the CIs broken by the change to ArrowBackedDataSourceV2, which is used both in the benchmark and in test code.
dongjoon-hyun left a comment
+1, it looks good to me again. Could you insert the performance result table into the PR description (Up to 20%?), @viirya ?
Thank you @dongjoon-hyun. I updated the PR description with the benchmark result table.
Thank you, @viirya. Merged to master.
Thank you @dongjoon-hyun |
What changes were proposed in this pull request?
This PR adds a columnar execution path to ArrowEvalPythonExec that allows Arrow-backed ColumnarBatch input (e.g., from DataSource V2 connectors that produce ArrowColumnVector columns) to be serialized directly to Arrow IPC for Python UDF evaluation, bypassing the existing ColumnarToRow → ArrowWriter round-trip.
Specifically:

- ArrowEvalPythonExec declares supportsColumnar = child.supportsColumnar, so the columnar transition rule no longer inserts ColumnarToRowExec.
- A new ColumnarArrowPythonInput handles the Arrow-direct path (VectorSchemaRoot.of + VectorUnloader), with a row-based fallback for non-Arrow ColumnarBatch and complex input expressions.
- A new config, spark.sql.execution.arrow.pythonUDF.columnarInput.enabled, toggles the optimization.
Benchmark:

| Scenario | Rows | String Length | UDF | Sink | Arrow Columnar (ms) | Row-based (ms) | Speedup |
|---|---|---|---|---|---|---|---|
| string concat | 2M | 200 chars | name + data | noop | 2086 | 2401 | 1.15x |
| string identity | 1M | 1000 chars | return data | noop | 4506 | 4839 | 1.07x |
| string identity | 2M | 200 chars | return data | noop | 2118 | 2324 | 1.10x |
| string identity | 10M | 100 chars | return data | noop | 6067 | 7124 | 1.17x |
| string identity | 20M | 100 chars | return data | noop | 11917 | 14106 | 1.18x |

Why are the changes needed?
When a Spark operator produces Arrow-backed ColumnarBatch (e.g., connectors that read columnar formats like Parquet into Arrow vectors), the current execution path for Arrow Python UDFs performs a wasteful columnar → row → columnar round-trip: ColumnarToRowExec converts the columnar data to InternalRow, then ArrowWriter converts each row back to Arrow columnar format for IPC serialization to the Python worker.
This round-trip is expensive due to per-row virtual dispatch, type conversion, null checking, and poor cache locality in ArrowWriter.write(). Since the data is already in Arrow format, the conversion is entirely unnecessary.
With this change, Arrow FieldVectors are extracted directly from ArrowColumnVector and serialized to IPC via VectorUnloader — skipping both ColumnarToRowExec and ArrowWriter. Pass-through columns are kept as ColumnVector references throughout, avoiding row materialization entirely in the fully columnar path.
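The overhead the PR removes is per-element materialization. A toy pure-Python contrast (hypothetical helper names, not Spark's code): the round-trip path visits every element twice, while the direct path simply passes the existing column buffers along.

```python
# Toy "columnar batch": a dict of column name -> list of values.
batch = {"id": list(range(5)), "data": ["x" * 4] * 5}

def columnar_to_rows(b):
    """Per-row materialization, analogous to ColumnarToRowExec."""
    n = len(next(iter(b.values())))
    return [tuple(b[name][i] for name in b) for i in range(n)]

def rows_to_columnar(rows, names):
    """Rebuild columns from rows, analogous to ArrowWriter.write()."""
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}

# Round-trip path: columnar -> rows -> columnar, touching every element.
roundtrip = rows_to_columnar(columnar_to_rows(batch), ["id", "data"])

# Direct path: the existing column buffers are reused untouched.
direct = batch

assert roundtrip == batch        # same values, but rebuilt element by element
assert direct["id"] is batch["id"]  # same buffers, zero copies
```

On the JVM the per-element cost is worse than in this sketch (virtual dispatch, null checks, UTF-8 re-encoding in ArrowWriter), which is where the measured 1.07x-1.18x speedups come from.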
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.6