feat: expose SessionContext.write_csv, write_json, write_parquet #1569

Open
timsaucer wants to merge 1 commit into apache:main from timsaucer:feat/df54-session-write-methods

@timsaucer
Member

Which issue does this PR close?

Closes #. No dedicated tracking issue; related to umbrella issue #462 (interface design / user stories).

Rationale for this change

DataFusion's SessionContext exposes write_csv, write_json, and write_parquet methods that take an already-built Arc<dyn ExecutionPlan> and a target path. These complement the existing per-DataFrame write methods and are the right entry point when a caller already holds a physical plan -- for example after running custom physical optimizer rules (recently exposed via PR #1557) or after constructing a plan directly. The Python bindings did not surface them.

What changes are included in this PR?

  • crates/core/src/context.rs: add write_csv, write_json, and write_parquet PyO3 methods on PySessionContext. Each accepts a PyExecutionPlan and a path, converts the plan to Arc<dyn ExecutionPlan>, and delegates to the matching upstream SessionContext method. write_parquet passes None for the WriterProperties argument; per-partition Parquet tuning remains on DataFrame.write_parquet.
  • python/datafusion/context.py: add Python wrappers with doctest examples that round-trip data through a temp directory. The docstrings flag DataFrame.write_* as the right entry point when callers need header control, compression, or other write options.

Are there any user-facing changes?

Yes. Three new public methods on datafusion.SessionContext:

  • write_csv(plan, path)
  • write_json(plan, path)
  • write_parquet(plan, path)

No breaking changes.

Adds three plan-level writers on SessionContext that mirror the
upstream datafusion::execution::context API. Each takes an
ExecutionPlan and an output directory path; the plan is executed and
its results are written one partition per file inside that directory.

These complement the existing DataFrame.write_* methods, which are
the right choice when callers need finer control (CSV header, Parquet
compression, write options). The new SessionContext methods are the
right choice when a caller already holds a physical ExecutionPlan
(for example after custom physical optimizer rules or hand-built
plans) and just wants the rows materialized.
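The "one partition per file inside that directory" layout mentioned above can be modeled with the standard library alone. This is only an illustration of the output contract, not DataFusion code. The helper name `write_partitions_as_csv`, the `part-{index}.csv` file naming, and the header row are assumptions made for the sketch.

```python
import csv
import pathlib
import tempfile


def write_partitions_as_csv(partitions, header, out_dir):
    """Write each partition (a list of rows) to its own CSV file in out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, rows in enumerate(partitions):
        # File naming is an assumption for illustration only.
        target = out / f"part-{index}.csv"
        with target.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Two partitions in, two files out.
    files = write_partitions_as_csv(
        [[(1, "a"), (2, "b")], [(3, "c")]], ["id", "label"], tmp
    )
    print(sorted(p.name for p in files))  # ['part-0.csv', 'part-1.csv']
```

The point for callers is that the path names a directory rather than a single file, so downstream readers should expect to scan the directory.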

Related to apache#462.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>