fix: Python-Rust combining char diff in isalnum#7612
fix: Python-Rust combining char diff in isalnum#7612joshuamegnauth54 wants to merge 2 commits intoRustPython:mainfrom
Conversation
Related to: RustPython#7518 Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that is_alphanumeric (char function in Rust) is different from isalnum (Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme. It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughUpdated alphanumeric checks to consult ICU canonical combining class data so combining marks are excluded from Changes
Sequence Diagram(s)sequenceDiagram
participant PythonTest as Python test / Caller
participant VM as VM (`PyStr::isalnum`)
participant ICU as ICU property data (CodePointMapData<CanonicalCombiningClass>)
participant SRE as sre_engine (`is_uni_alnum`)
PythonTest->>VM: call isalnum() for each char/code point
VM->>SRE: delegate alnum check (char property)
SRE->>ICU: load CodePointMapData<CanonicalCombiningClass>
SRE->>ICU: get(c) for code point
ICU-->>SRE: return CanonicalCombiningClass
SRE-->>VM: return boolean (is_alnum && CCC == NotReordered)
VM-->>PythonTest: final isalnum result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@extra_tests/snippets/stdlib_re.py`:
- Around line 82-84: The commented-out test shows that is_uni_word() currently
inherits is_uni_alnum() which uses Rust's char::is_alphanumeric() and therefore
wrongly treats combining marks (e.g. U+0345, category Mn) as word characters;
update is_uni_alnum() (and thus is_uni_word()) to exclude characters whose
Unicode General Category is a Mark (Mn, Mc, Me) rather than relying solely on
char::is_alphanumeric(), for example by querying the character's general
category and returning false for Mark categories so the test re.match(r"\w",
"\u0345") will not match; alternatively, if you intend to keep this limitation,
uncomment the test and mark it as an expected failure with a comment referencing
issue `#7518` and the functions is_uni_alnum() / is_uni_word().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 04cb8ffa-9709-4846-a2ad-187862235904
📒 Files selected for processing (3)
crates/vm/src/builtins/str.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/stdlib_re.py
| let map = CodePointMapData::<CanonicalCombiningClass>::new(); | ||
| char::try_from(ch) | ||
| .map(|x| x.is_alphanumeric()) | ||
| .map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered) |
There was a problem hiding this comment.
| .map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered) | |
| .map(|x| x.is_alphanumeric() && CanonicalCombiningClass::for_char(x) == CanonicalCombiningClass::NotReordered) |
Closes: #7518
Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that
is_alphanumeric(char function in Rust) is different fromisalnum(Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme.It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.
This doesn't actually fix #7518, but it fixes a similar issue based on that report.Actually, CodeRabbit pointed me in the right direction so now it does fix #7518.^Note the differences which are accounted for by the new tests.
Summary by CodeRabbit
Bug Fixes
Tests