Add reshard support for JSON Lines and CSV file formats #8084

muyihao wants to merge 4 commits into huggingface:main
Conversation
@lhoestq Hi, could you please help review this PR? Thanks!

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
lhoestq left a comment

Cool ! I left a few comments. Mostly on CSV but this also applies to JSON since the logic is almost the same. Let me know what you think :)
```python
with open(
    file,
    "r",
    encoding=self.config.encoding or "utf-8",
    errors=self.config.encoding_errors or "strict",
) as f:
    # Read the file to count lines
    line_count = 0
    while f.readline():
        line_count += 1
```
this can be expensive on big files, maybe limit to the first 20MB to estimate the total number of lines ?
we could also assume that the files contain the same kind of data, so maybe there is no need to check every single file
That's a great point. Iterating through massive files just for a line count is indeed a bottleneck.
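One way to implement the reviewer's suggestion, sketched here for illustration (the function name, the 20MB default, and the sampling logic are assumptions, not part of the PR):

```python
import os


def estimate_line_count(path, sample_bytes=20 * 1024 * 1024):
    # Estimate the total number of lines by sampling only the first
    # `sample_bytes` of the file instead of reading it end to end.
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    if len(sample) >= file_size:
        # Small file: we read all of it, so the count is exact.
        return sample.count(b"\n")
    lines_in_sample = sample.count(b"\n")
    # Extrapolate from the average line length seen in the sample.
    avg_line_bytes = len(sample) / max(lines_in_sample, 1)
    return int(file_size / avg_line_bytes)
```

For files larger than the sample window this returns an estimate, which is fine here since the line count only drives shard sizing.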
```python
# No resharding - return the original gen_kwargs
for base_file, files_iterable in zip(base_files, files_iterables):
    for file in files_iterable:
        yield {
            "base_files": [base_file],
            "files_iterables": [[file]],
            "shard_start_line": [0],
            "shard_end_line": [None],
        }
return
```
IIUC when num_shards is None then it should shard maximally.
In the case of the CSV loader, the minimum shard size is defined by self.config.chunksize which is 10_000 lines by default. It could make sense to define maximum sharding as aiming for self.config.chunksize lines per shard, WDYT ?
Makes sense. I'll refactor this part accordingly.
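The suggested behavior can be sketched as follows (the helper name is hypothetical; the `10_000` default mirrors the CSV loader's `self.config.chunksize` mentioned above):

```python
import math


def maximal_num_shards(total_lines, chunksize=10_000):
    # When num_shards is None, shard maximally: aim for roughly
    # `chunksize` lines per shard, but always produce at least one shard.
    return max(1, math.ceil(total_lines / chunksize))
```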
```python
if shard_start_line > 0:
    read_csv_kwargs["skiprows"] = shard_start_line
```
skiprows may lead some shard to require downloading a lot of unnecessary data to skip the rows, maybe you could skip bytes instead. Something along those lines maybe ?

```python
if skip_bytes is not None:
    with open(file, "rb") as f:
        header = f.readline()
        f.seek(skip_bytes)
        lines = (f.read(n_bytes) + f.readline()).splitlines()
        lines[0] = header
        file = io.BytesIO(b"\n".join(lines))
csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **read_csv_kwargs)
```

e.g. this dataset has multiple CSV files of 4GB each: https://huggingface.co/datasets/Koala-36M/Koala-36M-v1
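For context, the byte offsets this approach needs can be computed once per file, aligned to line boundaries so no shard starts mid-record. A sketch (the helper `shard_byte_ranges` is illustrative, not from the PR):

```python
import os


def shard_byte_ranges(path, num_shards):
    # Split a file into `num_shards` (start, end) byte ranges, nudging each
    # boundary forward to the next newline so shards begin on line boundaries.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, num_shards):
            f.seek(i * size // num_shards)
            f.readline()  # advance past the partially-read line
            offsets.append(f.tell())
    offsets.append(size)
    return [(offsets[i], offsets[i + 1]) for i in range(num_shards)]
```

Each worker can then `seek` straight to its start offset instead of skipping rows one by one.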
```python
"base_files": [base_file],
"files_iterables": [[file]],
"shard_start_line": [0],
"shard_end_line": [None],
```
it's probably more practical to have them together like this:
```diff
- "base_files": [base_file],
- "files_iterables": [[file]],
- "shard_start_line": [0],
- "shard_end_line": [None],
+ "base_files": [base_file],
+ "files_iterables": [[(file, 0, None)]],
```
this way later you can do
```python
for shard_idx, files_iterable in enumerate(files_iterables):
    for file in files_iterable:
        shard_start_line, shard_end_line = None, None
        if isinstance(file, tuple):
            file, shard_start_line, shard_end_line = file
```
alternatively you could keep it this way, but in that case you need to zip() on files_iterables:
```diff
- "base_files": [base_file],
- "files_iterables": [[file]],
- "shard_start_line": [0],
- "shard_end_line": [None],
+ "base_files": [base_file],
+ "files_iterables": [[file]],
+ "shard_start_lines": [0],
+ "shard_end_lines": [None],
```
```python
shard_start_lines = shard_start_lines or [None] * len(base_files)
shard_end_lines = shard_end_lines or [None] * len(base_files)
for base_file, files_iterable, shard_start_line, shard_end_line in zip(
    base_files, files_iterables, shard_start_lines, shard_end_lines
):
    for file in files_iterable:
```

```python
if num_shards is None:
    # No resharding - return the original gen_kwargs
    for base_file, files_iterable in zip(base_files, files_iterables):
        for file in files_iterable:
            yield {
                "base_files": [base_file],
                "files_iterables": [[file]],
                "shard_start_line": [0],
                "shard_end_line": [None],
            }
    return
```
when sharding maximally we could use the fact that self.config.chunksize is 10MB by default and aim for this maybe
(yes it has the same name as CSV but is in bytes, not lines. This collision is not ideal and I guess comes from the fact that pandas uses chunksize in lines for CSV and pyarrow uses chunksize in bytes for json)
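To illustrate the byte-based interpretation of chunksize for JSON Lines, here is a minimal sketch of chunked reading that keeps every record intact (illustrative only; the actual loader delegates chunking to pyarrow):

```python
def iter_jsonl_batches(path, chunksize=10 * 1024 * 1024):
    # Yield byte chunks of roughly `chunksize` bytes from a JSON Lines file,
    # extending each chunk to the next newline so no record is split.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            chunk += f.readline()  # complete the last partial line
            yield chunk
```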
@lhoestq Hi, I've made the updates accordingly, could you please take another look? 😊

Hi @lhoestq, just a gentle ping on this. I've re-verified the byte-offset logic for CSV/JSONL and confirmed it handles large files efficiently. Let me know if you have any concerns or would like me to adjust the implementation further. Looking forward to your feedback!
This PR extends the IterableDataset.reshard() functionality to support JSON Lines and CSV file formats, building on the foundation laid in #7992.
Summary
Usage