Fix raw text paragraph break normalization #4884

Merged
danielhanchen merged 4 commits into unslothai:main from kiankyars:fix/raw-text-clean-text-newlines on Apr 9, 2026

@kiankyars
Contributor

Summary

  • preserve paragraph breaks when cleaning raw text
  • normalize horizontal whitespace without collapsing newlines
  • add a regression check for repeated CRLF paragraph separators

Testing

  • python3 tests/test_raw_text.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa85bc3ee6

Comment thread unsloth/dataprep/raw_text.py Outdated
Comment on lines +248 to +249
text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
text = re.sub(r"[^\S\n]+", " ", text)
P2: Normalize whitespace before stripping non-ASCII chars

clean_text now removes non-ASCII characters before whitespace normalization, so non-ASCII whitespace (for example NBSP \u00A0 from HTML/PDF text) is deleted instead of converted to a space. In this path, inputs like "hello\u00A0world" become "helloworld", which corrupts word boundaries and downstream tokenization; prior behavior preserved the separator because whitespace collapsing happened first.
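
To make the ordering issue concrete, here is a minimal sketch that applies the two substitutions from the commented lines in both orders. The function names are hypothetical and exist only for this illustration; the rest of clean_text is not reproduced.

```python
import re

# The two patterns from the reviewed lines (+248 to +249).
NON_ASCII = re.compile(r"[^\x20-\x7E\n\t]")
HORIZONTAL_WS = re.compile(r"[^\S\n]+")  # any whitespace except "\n"


def strip_then_collapse(text: str) -> str:
    """Order used in the reviewed commit: strip non-ASCII first."""
    text = NON_ASCII.sub("", text)
    return HORIZONTAL_WS.sub(" ", text)


def collapse_then_strip(text: str) -> str:
    """Order suggested by the review: collapse whitespace first."""
    text = HORIZONTAL_WS.sub(" ", text)
    return NON_ASCII.sub("", text)


sample = "hello\u00A0world"
print(repr(strip_then_collapse(sample)))  # 'helloworld'
print(repr(collapse_then_strip(sample)))  # 'hello world'
```

Python's `\s` matches Unicode whitespace for `str` patterns, which is why the NBSP survives as a plain space when the collapse runs first.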

…se leftover doubles

Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip
so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.)
becomes a single ASCII space instead of being deleted outright. The prior
ordering silently merged adjacent words on HTML/PDF/OCR-sourced text:
"hello\u00a0world" used to produce "helloworld" after this PR; it now
produces "hello world".

Also drop \t from the allow-list since the horizontal-whitespace collapse
already normalizes tabs to a single space, and add a targeted [ ]{2,} pass
right after the non-ASCII strip so that a non-whitespace non-ASCII character
sitting between two spaces ("word1 (c) word2") does not leave an interior
double space. Without this extra pass, clean_text was not idempotent on
such inputs: the first call produced "word1  word2" and only the second
call collapsed it to "word1 word2". Fuzz testing over 10000 random inputs
now satisfies the idempotence invariant in every case.
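
The idempotence point above can be sketched as follows. This is an illustration under assumptions, not the code from raw_text.py: the helper names are hypothetical, only the three regex passes named in the commit message are modeled, and the small fuzz loop stands in for the 10000-input run mentioned above.

```python
import random
import re


def clean_two_pass(text: str) -> str:
    """Collapse horizontal whitespace, then strip non-ASCII (no extra pass)."""
    text = re.sub(r"[^\S\n]+", " ", text)
    return re.sub(r"[^\x20-\x7E\n]", "", text)


def clean_three_pass(text: str) -> str:
    """Same as above, plus the targeted [ ]{2,} pass after the strip."""
    return re.sub(r"[ ]{2,}", " ", clean_two_pass(text))


sample = "word1 \u00a9 word2"
print(repr(clean_two_pass(sample)))    # 'word1  word2' (interior double space)
print(repr(clean_three_pass(sample)))  # 'word1 word2'

# Tiny fuzz check of the invariant f(f(x)) == f(x) over a hand-picked alphabet.
rng = random.Random(0)
alphabet = "ab \t\n\r\u00a0\u00a9\u00e9\u3000"
for _ in range(2000):
    s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
    once = clean_three_pass(s)
    assert clean_three_pass(once) == once
```

The two-pass variant is not idempotent on the sample because the double space it leaves behind is only collapsed by the whitespace pass of a second call.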
…e cases

Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space,
  en/em space, ideographic space, vertical tab, form feed) normalizing to
  a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji)
  sitting between spaces: must not leave an interior double space, and
  clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave
  stray leading or trailing spaces on the neighbouring line, and must not
  swallow an adjacent paragraph break.
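
The cases listed above could be exercised along these lines. `clean_text_sketch` is a hypothetical stand-in written for this example, not the merged clean_text: the CRLF normalization, the trimming of spaces around newlines, and the cap of two consecutive newlines are assumptions made so the checks are self-contained.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical approximation of the cleaning steps discussed in this PR."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # assumed CRLF handling
    text = re.sub(r"[^\S\n]+", " ", text)        # horizontal whitespace -> one space
    text = re.sub(r"[^\x20-\x7E\n]", "", text)   # strip remaining non-ASCII
    text = re.sub(r"[ ]{2,}", " ", text)         # collapse leftover double spaces
    text = re.sub(r" ?\n ?", "\n", text)         # assumed trim around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)       # assumed paragraph-break cap
    return text.strip()


# Unicode horizontal whitespace becomes a single ASCII space.
for ws in ["\u00a0", "\u202f", "\u2009", "\u2002", "\u2003", "\u3000", "\v", "\f"]:
    assert clean_text_sketch("a" + ws + "b") == "a b"

# Mixed paragraph + Unicode whitespace input from the commit message.
mixed = "Section\u00a01\r\n\r\nBody\ftext\u202Fhere"
assert clean_text_sketch(mixed) == "Section 1\n\nBody text here"

# Tab collapsing and space trimming around newlines.
assert clean_text_sketch("a\t\tb  \n  c") == "a b\nc"

# Non-ASCII between spaces: no interior double space, and idempotent.
once = clean_text_sketch("word1 \u00a9 word2")
assert once == "word1 word2"
assert clean_text_sketch(once) == once

# Non-ASCII next to a newline: no stray spaces, paragraph break preserved.
assert clean_text_sketch("line1 \u00e9\nline2") == "line1\nline2"
assert clean_text_sketch("para1\n\n\u00e9 para2") == "para1\n\npara2"
```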
@danielhanchen
Contributor

Thank you!

@danielhanchen danielhanchen merged commit ad59724 into unslothai:main Apr 9, 2026
1 check was pending