Fix raw text paragraph break normalization #4884
danielhanchen merged 4 commits into unslothai:main from
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa85bc3ee6
text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
text = re.sub(r"[^\S\n]+", " ", text)
Normalize whitespace before stripping non-ASCII chars
clean_text now removes non-ASCII characters before whitespace normalization, so non-ASCII whitespace (for example NBSP \u00A0 from HTML/PDF text) is deleted instead of converted to a space. In this path, inputs like "hello\u00A0world" become "helloworld", which corrupts word boundaries and downstream tokenization; prior behavior preserved the separator because whitespace collapsing happened first.
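The ordering issue can be reproduced in isolation. The following is a minimal sketch, not the project's actual clean_text: the two helper names are hypothetical, and only the two re.sub calls are taken from the diff above.

```python
import re


def strip_then_collapse(text: str) -> str:
    # Ordering flagged in the review: the non-ASCII strip runs first,
    # so NBSP is deleted before it can be treated as whitespace.
    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
    return re.sub(r"[^\S\n]+", " ", text)


def collapse_then_strip(text: str) -> str:
    # Suggested ordering: any non-newline whitespace (including NBSP)
    # becomes one ASCII space, and only then is non-ASCII removed.
    text = re.sub(r"[^\S\n]+", " ", text)
    return re.sub(r"[^\x20-\x7E\n\t]", "", text)


sample = "hello\u00a0world"
print(repr(strip_then_collapse(sample)))  # 'helloworld'
print(repr(collapse_then_strip(sample)))  # 'hello world'
```

In Python 3, `\s` on a str pattern matches Unicode whitespace, which is why `[^\S\n]` catches NBSP when it runs first.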
…se leftover doubles
Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip
so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.)
becomes a single ASCII space instead of being deleted outright. The prior
ordering silently merged adjacent words on HTML/PDF/OCR-sourced text:
with the earlier ordering in this PR, "hello\u00a0world" produced
"helloworld"; it now produces "hello world".
Also drop \t from the allow-list since the horizontal-whitespace collapse
already normalizes tabs to a single space, and add a targeted [ ]{2,} pass
right after the non-ASCII strip so that a non-whitespace non-ASCII character
sitting between two spaces ("word1 © word2") does not leave an interior
double space. Without this extra pass, clean_text was not idempotent on
such inputs: the first call produced "word1  word2" (two spaces) and only
the second call collapsed it to "word1 word2". In fuzz testing over 10000
random inputs, the idempotence invariant now holds in every case.
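The idempotence argument can be illustrated with a small sketch. This is an assumption-laden stand-in, not the merged clean_text: `normalize_sketch` and its `final_pass` flag are hypothetical names, the fuzz alphabet is made up for illustration, and only the three regex passes described in this commit message are modelled.

```python
import random
import re


def normalize_sketch(text: str, final_pass: bool = True) -> str:
    # Hypothetical stand-in for the relevant part of clean_text.
    # 1. Collapse every run of non-newline whitespace to one ASCII space.
    text = re.sub(r"[^\S\n]+", " ", text)
    # 2. Strip everything outside printable ASCII and newline
    #    (\t is no longer in the allow-list).
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    # 3. Collapse double spaces left where a non-ASCII character was removed.
    if final_pass:
        text = re.sub(r"[ ]{2,}", " ", text)
    return text


sample = "word1 \u00a9 word2"

# Without the final pass, two calls are needed to reach a fixed point.
once = normalize_sketch(sample, final_pass=False)
twice = normalize_sketch(once, final_pass=False)
print(repr(once), repr(twice))

# With the final pass, a second call changes nothing on random inputs.
rng = random.Random(0)
alphabet = "ab \t\n\u00a0\u202f\u3000\u00a9\u00e9"
for _ in range(10_000):
    s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
    out = normalize_sketch(s)
    assert normalize_sketch(out) == out
```

After all three passes the output contains only printable ASCII and newlines with no run of two or more spaces, so none of the patterns can match again on a second call.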
…e cases
Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space,
en/em space, ideographic space, vertical tab, form feed) normalizing to
a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji)
sitting between spaces: must not leave an interior double space, and
clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave
stray leading or trailing spaces on the neighbouring line, and must not
swallow an adjacent paragraph break.
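The cases listed above could be exercised roughly as follows. This is only a sketch under stated assumptions: `clean_text_sketch` is a hypothetical stand-in written for this example (the real clean_text may differ, in particular in how it normalizes line endings, trims around newlines, and caps paragraph breaks), and the expected strings are what this stand-in produces, not outputs confirmed from the PR's test file.

```python
import re


def clean_text_sketch(text: str) -> str:
    # Hypothetical stand-in; every step below is an assumption.
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[^\S\n]+", " ", text)        # horizontal whitespace -> one space
    text = re.sub(r"[^\x20-\x7E\n]", "", text)   # drop non-ASCII
    text = re.sub(r"[ ]{2,}", " ", text)         # leftover double spaces
    text = re.sub(r" ?\n ?", "\n", text)         # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)       # at most one blank line
    return text.strip()


def run_checks() -> None:
    # Unicode horizontal whitespace becomes a single ASCII space.
    for ws in ["\u00a0", "\u202f", "\u2009", "\u2002", "\u2003", "\u3000", "\v", "\f"]:
        assert clean_text_sketch(f"a{ws}b") == "a b"

    cases = {
        # Mixed paragraph + Unicode whitespace input.
        "Section\u00a01\r\n\r\nBody\ftext\u202Fhere": "Section 1\n\nBody text here",
        # Tab collapsing and space trimming around newlines.
        "a\t\tb  \n  c": "a b\nc",
        # Non-ASCII between spaces leaves no interior double space.
        "word1 \u00a9 word2": "word1 word2",
        "hi \U0001F600 there": "hi there",
        # Non-ASCII next to a newline: no stray spaces, break preserved.
        "para one \u00e9\n\n\u00e9 para two": "para one\n\npara two",
    }
    for raw, expected in cases.items():
        cleaned = clean_text_sketch(raw)
        assert cleaned == expected
        assert clean_text_sketch(cleaned) == cleaned  # idempotence


run_checks()
```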
Thank you!