Skip to content

feat: Added DataFrameWriteOptions option when writing as csv, json, p…#857

Closed
allinux wants to merge 0 commit into
apache:mainfrom
allinux:main
Closed

feat: Added DataFrameWriteOptions option when writing as csv, json, p…#857
allinux wants to merge 0 commit into
apache:mainfrom
allinux:main

Conversation

@allinux

@allinux allinux commented Sep 6, 2024

Copy link
Copy Markdown

…arquet.

Which issue does this PR close?

N/A

Rationale for this change

Added DataFrameWriteOptions when using write_csv, write_json, write_parquet functions.

Are there any user-facing changes?

No

@timsaucer timsaucer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice addition! Thank you!

Comment thread python/datafusion/dataframe.py Outdated
Comment on lines +416 to +418
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],

@timsaucer timsaucer Sep 6, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to remove the write_options_ prefixes here.

        with_header: bool = False,
        overwrite: bool = False,
        single_file_output: bool = False,

Also for the partition by, I took a very quick look at the code and it looks like partition_by takes a list of strings, which I think our users would be surprised because all other uses of partition_by takes a list of expressions. So I think we want to add to the documentation a tiny bit about how to use that.

My understanding is that it's bad form in python to pass in a [] as a default, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None and make the appropriate change on the call in the lines below.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit has been completed. Thank you for the review.

Comment thread python/datafusion/dataframe.py Outdated
Comment on lines +436 to +438
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same recommendation on parameter names and partition_by as above

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit has been completed. Thank you for the review.

Comment thread python/datafusion/dataframe.py Outdated
with_header: If true, output the CSV header row.
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty vec![] is mixing rust and python terminology

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment는 rust 의 comment 를 복사한 것 입니다. vec![] 이 포함한 라인은 제거했습니다.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

Comment thread python/datafusion/dataframe.py Outdated
compression_level: Compression level to use.
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vec![] is rust not python

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

Comment thread python/datafusion/dataframe.py Outdated
Comment on lines +455 to +458
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],
) -> None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above on naming and partition_by

Comment thread python/datafusion/dataframe.py Outdated
Comment thread src/dataframe.rs Outdated
Comment on lines +405 to +411
#[pyo3(signature = (
path,
with_header=false,

write_options_overwrite=false,
write_options_single_file_output=false,
write_options_partition_by=vec![],
))]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you but can lead to duplicate effort and long term maintainability.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a note to myself, we need to include in our developer's documentation our best practice (and also decide as a group if we want these signatures in the rust code at all)

Comment thread src/dataframe.rs Outdated
path: &str,
compression: &str,
compression_level: Option<u32>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting: extra blank line

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit has been completed. Thank you for the review.

Comment thread src/dataframe.rs Outdated
fn write_json(
&self,
path: &str,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting: extra blank line

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit has been completed. Thank you for the review.

Comment thread python/datafusion/dataframe.py Outdated
Comment on lines +425 to +429
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name.
"""
self.df.write_csv(str(path), with_header)
self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've updated the argument names we need to update the documentation and the function call. We should add in unit tests so we can catch these errors in CI.

@allinux allinux force-pushed the main branch 2 times, most recently from 1278955 to ff45187 Compare September 7, 2024 15:11
@timsaucer

Copy link
Copy Markdown
Member

I was hoping we could add this in to DF42. Would you be willing to add unit tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants