Add reshard support for JSON Lines and CSV file formats #8084

Open
muyihao wants to merge 4 commits into huggingface:main from muyihao:feature/add-reshard-for-other-format

Conversation

@muyihao

@muyihao muyihao commented Mar 21, 2026

This PR extends the IterableDataset.reshard() functionality to support JSON Lines and CSV file formats, building on the foundation laid in #7992.

Summary

  • JSON Lines: resharding splits files at line boundaries (only when num_shards is specified)
  • CSV: resharding splits files at line boundaries (only when num_shards is specified)
  • Parquet: already supported via row groups (enhanced to support the num_shards parameter)

Usage

from datasets import load_dataset

# JSON Lines
ds = load_dataset("json", data_files=["data.jsonl"], streaming=True)
resharded_ds = ds['train'].reshard(num_shards=100)

# CSV
ds = load_dataset("csv", data_files=["data.csv"], streaming=True)
resharded_ds = ds['train'].reshard(num_shards=100)

# Parquet
ds = load_dataset("parquet", data_files=["data.parquet"], streaming=True)
resharded_ds = ds['train'].reshard(num_shards=100)
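The even line-boundary split described in the summary can be sketched as a ceiling-balanced partition of the total line count. This is an illustrative helper, not the PR's actual code:

```python
def shard_line_ranges(total_lines: int, num_shards: int) -> list[tuple[int, int]]:
    """Split total_lines into num_shards contiguous (start, end) line ranges."""
    base, extra = divmod(total_lines, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        # the first `extra` shards get one extra line each
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Each (start, end) pair would map to one shard's start/end line in the generated gen_kwargs.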

@muyihao
Author

muyihao commented Mar 21, 2026

Hi @lhoestq, could you please review this PR? Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment

Cool ! I left a few comments. Mostly on CSV but this also applies to JSON since the logic is almost the same. Let me know what you think :)

Comment on lines +202 to +211
with open(
    file,
    "r",
    encoding=self.config.encoding or "utf-8",
    errors=self.config.encoding_errors or "strict",
) as f:
    # Read the file to count lines
    line_count = 0
    while f.readline():
        line_count += 1
Member

this can be expensive on big files, maybe limit to the first 20MB to estimate the total number of lines ?

we could also assume that the files contain the same kind of data, so maybe there is no need to check every single file

Author

That's a great point. Iterating through massive files just for a line count is indeed a bottleneck.
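The 20MB-prefix estimation idea could look something like this. A rough sketch for illustration only; the helper name and constants are made up:

```python
import os

SAMPLE_BYTES = 20 * 1024 * 1024  # sample only the first 20MB

def estimate_line_count(path: str, sample_bytes: int = SAMPLE_BYTES) -> int:
    """Estimate total lines by extrapolating from the average line length in a prefix."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    n_newlines = sample.count(b"\n")
    if len(sample) >= file_size:
        # small file: the count is exact (account for a missing trailing newline)
        return n_newlines + (1 if sample and not sample.endswith(b"\n") else 0)
    avg_line_bytes = len(sample) / max(n_newlines, 1)
    return round(file_size / avg_line_bytes)
```

For files larger than the sample this is only an estimate, which is fine here since the result is just used to pick shard boundaries.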

Comment on lines +187 to +196
# No resharding - return the original gen_kwargs
for base_file, files_iterable in zip(base_files, files_iterables):
    for file in files_iterable:
        yield {
            "base_files": [base_file],
            "files_iterables": [[file]],
            "shard_start_line": [0],
            "shard_end_line": [None],
        }
return
Member

IIUC when num_shards is None then it should shard maximally.

In the case of the CSV loader, the minimum shard size is defined by self.config.chunksize which is 10_000 lines by default. It could make sense to define maximum sharding as aiming for self.config.chunksize lines per shard, WDYT ?
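Aiming for self.config.chunksize lines per shard when num_shards is None would reduce to a ceiling division, roughly (illustrative sketch, not the PR's code):

```python
def default_num_shards(total_lines: int, chunksize: int = 10_000) -> int:
    """Shard maximally: one shard per `chunksize` lines (ceiling division, at least 1)."""
    return max(1, -(-total_lines // chunksize))
```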

Author

Makes sense. I'll refactor this part to aim for self.config.chunksize lines per shard when num_shards is None.

Comment on lines +258 to +259
if shard_start_line > 0:
    read_csv_kwargs["skiprows"] = shard_start_line
Member

skiprows may lead some shards to download a lot of unnecessary data just to skip rows; maybe you could skip bytes instead. Something along these lines maybe?

if skip_bytes is not None:
    with open(file, "rb") as f:
        header = f.readline()
        f.seek(skip_bytes)
        lines = (f.read(n_bytes) + f.readline()).splitlines()
        lines[0] = header  # replace the (possibly partial) first line with the CSV header
    file = io.BytesIO(b"\n".join(lines))
csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **read_csv_kwargs)

e.g. this dataset has multiple CSV files of 4GB each: https://huggingface.co/datasets/Koala-36M/Koala-36M-v1
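A variant of this byte-offset idea can be written so that each shard drops the partial line at its start and reads past its end to finish its last line, keeping shards disjoint and complete. This is a sketch under those conventions; read_shard_lines is a made-up helper, not part of the PR:

```python
import io

def read_shard_lines(path: str, skip_bytes: int, n_bytes: int) -> io.BytesIO:
    """Return a file-like shard re-aligned to line boundaries, with the CSV header prepended."""
    with open(path, "rb") as f:
        header = f.readline()
        f.seek(skip_bytes)
        f.readline()  # drop the partial line; the previous shard reads past its end to own it
        body = f.read(n_bytes) + f.readline()  # finish this shard's last line
    return io.BytesIO(header + body)
```

Under this convention each shard only downloads roughly its own byte range plus one line, instead of skipping all preceding rows.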

Author

@muyihao muyihao Mar 25, 2026

Cool, I'll try it.

Comment on lines +191 to +194
"base_files": [base_file],
"files_iterables": [[file]],
"shard_start_line": [0],
"shard_end_line": [None],
Member

it's probably more practical to have them together like this:

Suggested change
- "base_files": [base_file],
- "files_iterables": [[file]],
- "shard_start_line": [0],
- "shard_end_line": [None],
+ "base_files": [base_file],
+ "files_iterables": [[(file, 0, None)]],

this way later you can do

        for shard_idx, files_iterable in enumerate(files_iterables):
            for file in files_iterable:
                shard_start_line, shard_end_line = None, None
                if isinstance(file, tuple):
                    file, shard_start_line, shard_end_line = file

Member

alternatively you could keep it this way, but in that case you need to zip() on files_iterables:

Suggested change
- "base_files": [base_file],
- "files_iterables": [[file]],
- "shard_start_line": [0],
- "shard_end_line": [None],
+ "base_files": [base_file],
+ "files_iterables": [[file]],
+ "shard_start_lines": [0],
+ "shard_end_lines": [None],
        shard_start_lines = shard_start_lines or [None] * len(base_files)
        shard_end_lines = shard_end_lines or [None] * len(base_files)
        for base_file, files_iterable, shard_start_line, shard_end_line in zip(base_files, files_iterables, shard_start_lines, shard_end_lines):
            for file in files_iterable:

Comment on lines +133 to +143
if num_shards is None:
    # No resharding - return the original gen_kwargs
    for base_file, files_iterable in zip(base_files, files_iterables):
        for file in files_iterable:
            yield {
                "base_files": [base_file],
                "files_iterables": [[file]],
                "shard_start_line": [0],
                "shard_end_line": [None],
            }
    return
Member

when sharding maximally we could use the fact that self.config.chunksize is 10MB by default and aim for this maybe

(yes it has the same name as CSV but is in bytes, not lines. This collision is not ideal and I guess comes from the fact that pandas uses chunksize in lines for CSV and pyarrow uses chunksize in bytes for json)

Author

got it

@muyihao muyihao requested a review from lhoestq March 29, 2026 07:44
@muyihao
Author

muyihao commented Mar 29, 2026

> Cool ! I left a few comments. Mostly on CSV but this also applies to JSON since the logic is almost the same. Let me know what you think :)

Hi @lhoestq, I've made the updates accordingly. Could you please take another look? 😊

@muyihao
Author

muyihao commented Apr 7, 2026

Hi @lhoestq , just a gentle ping on this. I've re-verified the byte-offset logic for CSV/JSONL and confirmed it handles large files efficiently. Let me know if you have any concerns or would like me to adjust the implementation further. Looking forward to your feedback!
