Fix raw text paragraph break normalization by kiankyars · Pull Request #4884 · unslothai/unsloth

kiankyars · 2026-04-07T00:39:52Z

Summary

preserve paragraph breaks when cleaning raw text
normalize horizontal whitespace without collapsing newlines
add a regression check for repeated CRLF paragraph separators
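
The difference between collapsing all whitespace and collapsing only horizontal whitespace can be sketched as follows. The `\s+` "before" pattern and the explicit CRLF-to-LF replacement are assumptions made for illustration; the PR text does not show the previous implementation, and both function names are hypothetical.

```python
import re


def collapse_all_whitespace(text: str) -> str:
    # Assumed earlier behaviour: \s+ also matches \r and \n,
    # so paragraph breaks are flattened into single spaces.
    return re.sub(r"\s+", " ", text)


def collapse_horizontal_whitespace(text: str) -> str:
    # Assumption: line endings are normalized to \n first.
    text = text.replace("\r\n", "\n")
    # [^\S\n] matches any whitespace character except \n,
    # so runs of spaces and tabs shrink while newlines survive.
    return re.sub(r"[^\S\n]+", " ", text)


sample = "First paragraph.\r\n\r\nSecond   paragraph."
print(repr(collapse_all_whitespace(sample)))
print(repr(collapse_horizontal_whitespace(sample)))
```

In this sketch the first call loses the blank line between the paragraphs, while the second keeps the `\n\n` separator and still collapses the run of spaces.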

Testing

python3 tests/test_raw_text.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa85bc3ee6

chatgpt-codex-connector · 2026-04-07T00:41:56Z

+        text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
+        text = re.sub(r"[^\S\n]+", " ", text)


Normalize whitespace before stripping non-ASCII chars

clean_text now removes non-ASCII characters before whitespace normalization, so non-ASCII whitespace (for example NBSP \u00A0 from HTML/PDF text) is deleted instead of converted to a space. In this path, inputs like "hello\u00A0world" become "helloworld", which corrupts word boundaries and downstream tokenization; prior behavior preserved the separator because whitespace collapsing happened first.
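
The ordering issue described in this comment can be reproduced with the two substitutions from the diff. This is a minimal sketch, not the project's `clean_text`; the two function names are hypothetical and exist only to contrast the orderings.

```python
import re


def strip_then_collapse(text: str) -> str:
    # Ordering flagged by the review: non-ASCII removal runs first,
    # so NBSP is deleted before it can be treated as whitespace.
    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
    return re.sub(r"[^\S\n]+", " ", text)


def collapse_then_strip(text: str) -> str:
    # Suggested ordering: Unicode whitespace becomes an ASCII space
    # first, and only then are remaining non-ASCII characters removed.
    text = re.sub(r"[^\S\n]+", " ", text)
    return re.sub(r"[^\x20-\x7E\n\t]", "", text)


sample = "hello\u00a0world"
print(repr(strip_then_collapse(sample)))  # 'helloworld'
print(repr(collapse_then_strip(sample)))  # 'hello world'
```

This relies on Python's `\s` matching Unicode whitespace such as `\u00A0` for `str` patterns, which is why `[^\S\n]` can catch the NBSP when it runs first.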

…se leftover doubles

Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.) becomes a single ASCII space instead of being deleted outright. The prior ordering silently merged adjacent words on HTML/PDF/OCR-sourced text: "hello\u00a0world" produced "helloworld" with this PR's earlier ordering; it now produces "hello world".

Also drop \t from the allow-list, since the horizontal-whitespace collapse already normalizes tabs to a single space, and add a targeted [ ]{2,} pass right after the non-ASCII strip so that a non-whitespace non-ASCII character sitting between two spaces ("word1 © word2") does not leave an interior double space. Without this extra pass, clean_text was not idempotent on such inputs: the first call produced "word1  word2" (two spaces) and only the second call collapsed it to "word1 word2".

Fuzz testing over 10000 random inputs now satisfies the idempotence invariant in every case.
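
The three passes described in this commit message can be sketched as below. This is an illustration built only from the regexes named in the message, not the merged `clean_text`; the name `clean_text_sketch` and the `final_space_pass` switch are hypothetical and exist only to show the effect of the extra pass.

```python
import re


def clean_text_sketch(text: str, final_space_pass: bool = True) -> str:
    # 1. Collapse horizontal whitespace (including Unicode whitespace
    #    and tabs) to a single ASCII space; newlines are left alone.
    text = re.sub(r"[^\S\n]+", " ", text)
    # 2. Strip everything outside printable ASCII and newline.
    #    \t is no longer in the allow-list because step 1 handled it.
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    # 3. Collapse double spaces left behind when step 2 removed a
    #    character that sat between two spaces.
    if final_space_pass:
        text = re.sub(r"[ ]{2,}", " ", text)
    return text


sample = "word1 \u00a9 word2"

# Without the final pass, a second call changes the result again.
once = clean_text_sketch(sample, final_space_pass=False)
twice = clean_text_sketch(once, final_space_pass=False)
print(repr(once), repr(twice))

# With the final pass, the first result is already stable.
stable = clean_text_sketch(sample)
print(repr(stable), clean_text_sketch(stable) == stable)
```

With the final pass enabled, the output contains only printable ASCII, newlines, and single spaces, so a second call has nothing left to change.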

…e cases

Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space, en/em space, ideographic space, vertical tab, form feed) normalizing to a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji) sitting between spaces: must not leave an interior double space, and clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave stray leading or trailing spaces on the neighbouring line, and must not swallow an adjacent paragraph break.
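
Two of the cases listed above, the Unicode separators and the idempotence invariant, could be checked roughly like this. The sketch uses a hypothetical stand-in `clean_text_sketch` built from the regexes quoted in this PR rather than the real `clean_text`, so it does not cover CRLF handling or space trimming around newlines; the helper names, alphabet, seed, and trial count are also assumptions.

```python
import random
import re


def clean_text_sketch(text: str) -> str:
    # Hypothetical stand-in for clean_text: collapse horizontal
    # whitespace, strip non-ASCII, then collapse leftover doubles.
    text = re.sub(r"[^\S\n]+", " ", text)
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    return re.sub(r"[ ]{2,}", " ", text)


# NBSP, narrow NBSP, thin space, en space, em space,
# ideographic space, vertical tab, form feed.
SEPARATORS = ["\u00a0", "\u202f", "\u2009", "\u2002", "\u2003", "\u3000", "\v", "\f"]


def check_separators() -> None:
    # Each separator must become one ASCII space, not disappear.
    for sep in SEPARATORS:
        assert clean_text_sketch(f"a{sep}b") == "a b", repr(sep)


def check_idempotence(trials: int = 1000, seed: int = 0) -> None:
    # Small seeded fuzz loop: cleaning an already-cleaned string
    # must return it unchanged.
    rng = random.Random(seed)
    alphabet = "ab \t\n\u00a0\u00a9\u00e9\U0001F600"
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        once = clean_text_sketch(text)
        assert clean_text_sketch(once) == once, repr(text)


check_separators()
check_idempotence()
print("all checks passed")
```

A fixed seed keeps any failure reproducible, and the small alphabet deliberately mixes ASCII, whitespace, and non-ASCII characters so that short strings hit the interesting combinations often.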

danielhanchen · 2026-04-09T11:45:38Z

Thank you!

Fix raw text paragraph break normalization

f91ed4c

kiankyars requested review from danielhanchen and rolandtannous as code owners April 7, 2026 00:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa85bc3

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Apr 7, 2026

danielhanchen added 2 commits April 9, 2026 11:42

danielhanchen merged commit ad59724 into unslothai:main Apr 9, 2026
1 check was pending

danielhanchen merged 4 commits into unslothai:main from kiankyars:fix/raw-text-clean-text-newlines
