fix: Python-Rust combining char diff in isalnum by joshuamegnauth54 · Pull Request #7612 · RustPython/RustPython

joshuamegnauth54 · 2026-04-15T00:17:03Z

Closes: #7518

Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that is_alphanumeric (char function in Rust) is different from isalnum (Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme.

It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.

~~This doesn't actually fix #7518, but it fixes a similar issue based on that report.~~ Actually, CodeRabbit pointed me in the right direction so now it does fix #7518.

assert!('\u{006e}'.is_alphanumeric());
assert!(!'\u{0303}'.is_alphanumeric());
assert!('\u{00f1}'.is_alphanumeric());
assert!('\u{0345}'.is_alphanumeric());

for raw in 0x0363..=0x036f {
    let c = char::from_u32(raw).unwrap();
    assert!(c.is_alphanumeric());
}

assert '\u006e'.isalnum()
assert not '\u0303'.isalnum()
assert '\u00f1'.isalnum()
assert not '\u0345'.isalnum()
assert not '\u0363'.isalnum()

for raw in range(0x0363, 0x036f):
    assert not chr(raw).isalnum()

^Note the differences which are accounted for by the new tests.

Summary by CodeRabbit

Bug Fixes
- Improved str.isalnum() to correctly treat Unicode combining marks so standalone combining characters are not considered alphanumeric, while base letters with combining marks remain valid.
Tests
- Expanded test coverage verifying combining-character behavior for str.isalnum().
- Added a test ensuring regex word-matching does not treat standalone combining marks as word characters.

Related to: RustPython#7518 Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that is_alphanumeric (char function in Rust) is different from isalnum (Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme. It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.

coderabbitai · 2026-04-15T00:17:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 55ab7a2f-cd0f-498e-95b9-60855bb60b03

📥 Commits

Reviewing files that changed from the base of the PR and between cd8b11d and 64dd760.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

crates/sre_engine/Cargo.toml
crates/sre_engine/src/string.rs
extra_tests/snippets/stdlib_re.py

✅ Files skipped from review due to trivial changes (2)

crates/sre_engine/Cargo.toml
extra_tests/snippets/stdlib_re.py

📝 Walkthrough

Walkthrough

Updated alphanumeric checks to consult ICU canonical combining class data so combining marks are excluded from isalnum(); corresponding tests and a dependency on icu_properties were added.

Changes

Cohort / File(s)	Summary
VM builtin (isalnum) `crates/vm/src/builtins/str.rs`	Rewrote `PyStr::isalnum()` to require both `char::is_alphanumeric()` and that the code point's `CanonicalCombiningClass` is `NotReordered` via ICU lookup.
sre_engine: string predicate `crates/sre_engine/src/string.rs`	`is_uni_alnum` now consults `CodePointMapData<CanonicalCombiningClass>` and excludes combining marks from alphanumeric classification.
sre_engine: dependency `crates/sre_engine/Cargo.toml`	Added `icu_properties` workspace dependency.
Tests: builtin & stdlib `extra_tests/snippets/builtin_str.py`, `extra_tests/snippets/stdlib_re.py`	Added assertions ensuring combining marks (e.g., U+0345 and ranges U+0363..U+036E) are not treated as alphanumeric and that `\w` does not match such marks.

Sequence Diagram(s)

sequenceDiagram
    participant PythonTest as Python test / Caller
    participant VM as VM (`PyStr::isalnum`)
    participant ICU as ICU property data (CodePointMapData<CanonicalCombiningClass>)
    participant SRE as sre_engine (`is_uni_alnum`)

    PythonTest->>VM: call isalnum() for each char/code point
    VM->>SRE: delegate alnum check (char property)
    SRE->>ICU: load CodePointMapData<CanonicalCombiningClass>
    SRE->>ICU: get(c) for code point
    ICU-->>SRE: return CanonicalCombiningClass
    SRE-->>VM: return boolean (is_alnum && CCC == NotReordered)
    VM-->>PythonTest: final isalnum result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Replace unmaintained unic crates #7555 — Similar migration to icu_properties and ICU-based Unicode property lookups affecting VM builtins.

Suggested reviewers

ShaharNaveh
youknowone

Poem

🐰 I hopped through code to mend the snare,
Combining marks no longer pass unaware.
With ICU maps and careful checks,
Alnum's now right—no more perplex! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: Python-Rust combining char diff in isalnum' clearly summarizes the main change: addressing differences between Python and Rust implementations of the isalnum function for combining characters.
Linked Issues check	✅ Passed	The PR successfully addresses issue `#7518` by modifying isalnum() to exclude combining marks (using CanonicalCombiningClass check) and updating regex \w matching in SRE to align with CPython behavior.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing the combining character handling in isalnum and regex \w matching across str.rs, sre_engine modules, and corresponding test files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/stdlib_re.py`:
- Around line 82-84: The commented-out test shows that is_uni_word() currently
inherits is_uni_alnum() which uses Rust's char::is_alphanumeric() and therefore
wrongly treats combining marks (e.g. U+0345, category Mn) as word characters;
update is_uni_alnum() (and thus is_uni_word()) to exclude characters whose
Unicode General Category is a Mark (Mn, Mc, Me) rather than relying solely on
char::is_alphanumeric(), for example by querying the character's general
category and returning false for Mark categories so the test re.match(r"\w",
"\u0345") will not match; alternatively, if you intend to keep this limitation,
uncomment the test and mark it as an expected failure with a comment referencing
issue `#7518` and the functions is_uni_alnum() / is_uni_word().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 04cb8ffa-9709-4846-a2ad-187862235904

📥 Commits

Reviewing files that changed from the base of the PR and between 898da7f and cd8b11d.

📒 Files selected for processing (3)

crates/vm/src/builtins/str.rs
extra_tests/snippets/builtin_str.py
extra_tests/snippets/stdlib_re.py

extra_tests/snippets/stdlib_re.py

Closes: RustPython#7518

ShaharNaveh · 2026-04-15T06:07:24Z

crates/sre_engine/src/string.rs

+    let map = CodePointMapData::<CanonicalCombiningClass>::new();
    char::try_from(ch)
-        .map(|x| x.is_alphanumeric())
+        .map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)


Suggested change

.map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)

.map(|x| x.is_alphanumeric() && CanonicalCombiningClass::for_char(x) == CanonicalCombiningClass::NotReordered)

coderabbitai bot reviewed Apr 15, 2026

View reviewed changes

extra_tests/snippets/stdlib_re.py Outdated Show resolved Hide resolved

fix: Ignore combining characters in SRE

64dd760

Closes: RustPython#7518

ShaharNaveh reviewed Apr 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Python-Rust combining char diff in isalnum#7612

fix: Python-Rust combining char diff in isalnum#7612
joshuamegnauth54 wants to merge 2 commits intoRustPython:mainfrom
joshuamegnauth54:isalnum-combining-character

joshuamegnauth54 commented Apr 15, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 15, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

ShaharNaveh Apr 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	.map(\|x\| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)
	.map(\|x\| x.is_alphanumeric() && CanonicalCombiningClass::for_char(x) == CanonicalCombiningClass::NotReordered)

Conversation

joshuamegnauth54 commented Apr 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ShaharNaveh Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshuamegnauth54 commented Apr 15, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 15, 2026 •

edited

Loading

ShaharNaveh Apr 15, 2026 •

edited

Loading