fix: Python-Rust combining char diff in isalnum by joshuamegnauth54 · Pull Request #7612 · RustPython/RustPython

joshuamegnauth54 · 2026-04-15T00:17:03Z

Closes: #7518

Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that is_alphanumeric (char function in Rust) is different from isalnum (Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme.

It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.

~~This doesn't actually fix #7518, but it fixes a similar issue based on that report.~~ Actually, CodeRabbit pointed me in the right direction so now it does fix #7518.

assert!('\u{006e}'.is_alphanumeric());
assert!(!'\u{0303}'.is_alphanumeric());
assert!('\u{00f1}'.is_alphanumeric());
assert!('\u{0345}'.is_alphanumeric());

for raw in 0x0363..=0x036f {
    let c = char::from_u32(raw).unwrap();
    assert!(c.is_alphanumeric());
}

assert '\u006e'.isalnum()
assert not '\u0303'.isalnum()
assert '\u00f1'.isalnum()
assert not '\u0345'.isalnum()
assert not '\u0363'.isalnum()

for raw in range(0x0363, 0x036f):
    assert not chr(raw).isalnum()

^Note the differences which are accounted for by the new tests.

Summary by CodeRabbit

Bug Fixes
- Corrected str.isalnum() so standalone Unicode combining marks are not treated as alphanumeric, while base letters remain valid.
Tests
- Expanded tests for str.isalnum() covering combining-character cases.
- Added regex test ensuring standalone combining marks are not matched as word characters.

coderabbitai · 2026-04-15T00:17:24Z

📝 Walkthrough

Walkthrough

Updated alphanumeric checks to consult ICU canonical combining class data so combining marks are excluded from isalnum(); corresponding tests and a dependency on icu_properties were added.

Changes

Cohort / File(s)	Summary
VM builtin (isalnum) `crates/vm/src/builtins/str.rs`	Rewrote `PyStr::isalnum()` to require both `char::is_alphanumeric()` and that the code point's `CanonicalCombiningClass` is `NotReordered` via ICU lookup.
sre_engine: string predicate `crates/sre_engine/src/string.rs`	`is_uni_alnum` now consults `CodePointMapData<CanonicalCombiningClass>` and excludes combining marks from alphanumeric classification.
sre_engine: dependency `crates/sre_engine/Cargo.toml`	Added `icu_properties` workspace dependency.
Tests: builtin & stdlib `extra_tests/snippets/builtin_str.py`, `extra_tests/snippets/stdlib_re.py`	Added assertions ensuring combining marks (e.g., U+0345 and ranges U+0363..U+036E) are not treated as alphanumeric and that `\w` does not match such marks.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Python test / Caller
    participant VM as VM (`PyStr::isalnum`)
    participant SRE as sre_engine (`is_uni_alnum`)
    participant ICU as ICU properties (CodePointMapData<CanonicalCombiningClass>)

    Test->>VM: call isalnum() on string
    VM->>SRE: check code point alnum predicate
    SRE->>ICU: lookup CanonicalCombiningClass.for_char(code_point)
    ICU-->>SRE: return CanonicalCombiningClass
    SRE-->>VM: return (is_alnum && CCC == NotReordered)
    VM-->>Test: final boolean result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Replace unmaintained unic crates #7555 — Similar migration to icu_properties and ICU-based Unicode property lookups affecting VM builtins.

Suggested reviewers

ShaharNaveh
youknowone

Poem

🐰 I hopped through code with whiskers bright,
Marked the marks that hid from sight.
ICU maps helped set things right,
Alnum now skips the trailing bite—
A tiny patch, a rabbit's delight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: fixing the discrepancy between Python and Rust implementations regarding combining character handling in isalnum().
Linked Issues check	✅ Passed	The PR successfully addresses all coding requirements from issue `#7518` by excluding combining marks (Mn category) from isalnum() via CanonicalCombiningClass checks in both Python and regex implementations.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing combining character handling in isalnum() as specified in the linked issue; no unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/stdlib_re.py`:
- Around line 82-84: The commented-out test shows that is_uni_word() currently
inherits is_uni_alnum() which uses Rust's char::is_alphanumeric() and therefore
wrongly treats combining marks (e.g. U+0345, category Mn) as word characters;
update is_uni_alnum() (and thus is_uni_word()) to exclude characters whose
Unicode General Category is a Mark (Mn, Mc, Me) rather than relying solely on
char::is_alphanumeric(), for example by querying the character's general
category and returning false for Mark categories so the test re.match(r"\w",
"\u0345") will not match; alternatively, if you intend to keep this limitation,
uncomment the test and mark it as an expected failure with a comment referencing
issue `#7518` and the functions is_uni_alnum() / is_uni_word().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 04cb8ffa-9709-4846-a2ad-187862235904

📥 Commits

Reviewing files that changed from the base of the PR and between 898da7f and cd8b11d.

📒 Files selected for processing (3)

crates/vm/src/builtins/str.rs
extra_tests/snippets/builtin_str.py
extra_tests/snippets/stdlib_re.py

ShaharNaveh · 2026-04-15T06:07:24Z

+    let map = CodePointMapData::<CanonicalCombiningClass>::new();
    char::try_from(ch)
-        .map(|x| x.is_alphanumeric())
+        .map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)


Remove line 446 and:

Suggested change

.map(|x| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)

.map(|x| x.is_alphanumeric() && CanonicalCombiningClass::for_char(x) == CanonicalCombiningClass::NotReordered)

ShaharNaveh · 2026-04-15T06:08:12Z

+        let map = CodePointMapData::<CanonicalCombiningClass>::new();
+        !self.data.is_empty()
+            && self.char_all(|c| {
+                c.is_alphanumeric() && map.get(c) == CanonicalCombiningClass::NotReordered


remove line 953 and

Suggested change

c.is_alphanumeric() && map.get(c) == CanonicalCombiningClass::NotReordered

c.is_alphanumeric() && CanonicalCombiningClass::for_char(c) == CanonicalCombiningClass::NotReordered

ShaharNaveh · 2026-04-15T06:08:34Z

+assert not "\u006e\u0303".isalnum()
+assert "\u00f1".isalnum()
+assert not "\u0345".isalnum()
+for raw in range(0x0363, 0x036f):


Suggested change

for raw in range(0x0363, 0x036f):

for raw in range(0x0363, 0x036F):

Related to: RustPython#7518 Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python. This means that is_alphanumeric (char function in Rust) is different from isalnum (Python). To fix the discrepancy, RustPython needs to mimic Python by rejecting certain characters. Some classes of combining characters count as alphanumeric in Rust but not Python. Combining characters are accent marks that are combined with other characters to create a single grapheme. It's possible that this PR is not exhaustive. I fixed the combining character issue BUT I don't know the full range of discrepancies.

Closes: RustPython#7518

joshuamegnauth54 · 2026-04-16T03:29:01Z

I fixed the issues and force pushed. 😁

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 949-954: Replace the current isalnum implementation (which uses
char::is_alphanumeric plus CanonicalCombiningClass check) with a
CPython-compatible predicate that accepts only Unicode letter categories
(Lu/Ll/Lt/Lm/Lo) or number categories (Nd/Nl/Np) — i.e., check the character's
general category starts with 'L' or 'N' — so spacing/combing marks like U+093F
(Mc) are rejected; update the isalnum method (and the char_all predicate it
uses) to use this category test, update the mirror predicate used for \w in the
sre_engine string predicate to the same logic so they stay aligned, and add a
regression test asserting str.isalnum() returns false for U+093F.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 509798df-be24-4b49-ba31-08eb4f8b25c1

📥 Commits

Reviewing files that changed from the base of the PR and between 64dd760 and f79d6df.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

crates/sre_engine/Cargo.toml
crates/sre_engine/src/string.rs
crates/vm/src/builtins/str.rs
extra_tests/snippets/builtin_str.py
extra_tests/snippets/stdlib_re.py

✅ Files skipped from review due to trivial changes (3)

crates/sre_engine/Cargo.toml
extra_tests/snippets/stdlib_re.py
extra_tests/snippets/builtin_str.py

🚧 Files skipped from review as they are similar to previous changes (1)

crates/sre_engine/src/string.rs

coderabbitai · 2026-04-16T03:33:33Z

    fn isalnum(&self) -> bool {
-        !self.data.is_empty() && self.char_all(char::is_alphanumeric)
+        !self.data.is_empty()
+            && self.char_all(|c| {
+                c.is_alphanumeric()
+                    && CanonicalCombiningClass::for_char(c) == CanonicalCombiningClass::NotReordered
+            })


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, examine the actual code in str.rs around the isalnum implementation cd /tmp && find . -name "str.rs" -path "*/vm/src/builtins/*" 2>/dev/null | head -5

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

find . -type f -name "str.rs" | grep -E "(builtins|vm)" | head -10

Repository: RustPython/RustPython

Length of output: 96

🏁 Script executed:

# List the codebase structure git ls-files | grep -E "(str\.rs|string\.rs)" | head -20

Repository: RustPython/RustPython

Length of output: 271

🏁 Script executed:

# Check the actual implementation at the mentioned lines git show HEAD:crates/vm/src/builtins/str.rs | sed -n '940,960p'

Repository: RustPython/RustPython

Length of output: 675

🏁 Script executed:

# Check the sre_engine code mentioned in the review git show HEAD:crates/sre_engine/src/string.rs | sed -n '440,460p'

Repository: RustPython/RustPython

Length of output: 668

🏁 Script executed:

# Check the test file mentioned git show HEAD:extra_tests/snippets/builtin_str.py | sed -n '70,90p'

Repository: RustPython/RustPython

Length of output: 598

🏁 Script executed:

# Check imports in str.rs to see what's available git show HEAD:crates/vm/src/builtins/str.rs | sed -n '1,60p'

Repository: RustPython/RustPython

Length of output: 2247

🌐 Web query:

U+093F DEVANAGARI VOWEL SIGN I Unicode canonical combining class properties

💡 Result:

The Unicode character U+093F DEVANAGARI VOWEL SIGN I has Canonical Combining Class 0 (Not Reordered).

Citations:

1: https://util.unicode.org/UnicodeJsps/character.jsp?a=093F

2: https://www.compart.com/en/unicode/U+093F

3: https://www.fileformat.info/info/unicode/char/93f/index.htm

4: https://www.unicode.org/Public/latest/ucd/UnicodeData.txt

🌐 Web query:

U+0303 combining tilde canonical combining class general category

💡 Result:

The Unicode character U+0303 (Combining Tilde) has: - Canonical Combining Class: Above (230) - General Category: Nonspacing Mark (Mn)

Citations:

1: https://graphemica.com/0303

2: https://util.unicode.org/UnicodeJsps/character.jsp?a=0303

3: https://www.compart.com/en/unicode/U+0303

4: https://www.fileformat.info/info/unicode/char/0303/index.htm

5: https://codepoints.net/U+0303

🌐 Web query:

CPython str.isalnum() implementation definition Unicode categories

💡 Result:

The CPython implementation of str.isalnum determines if all characters in the string are alphanumeric (letters or numbers) and the string is non-empty. It uses Unicode general categories: a character is alphanumeric if it is alphabetic (general categories L: Letter, specifically "Lm", "Lt", "Lu", "Ll", or "Lo") OR numeric (specifically Decimal_Digit "Nd", though isalnum combines alpha or digit properties). From official documentation: Alphabetic characters are those with general category property being one of "Lm", "Lt", "Lu", "Ll", or "Lo". For isdigit, numeric characters are those with Numeric_Type=Digit (Nd), Decimal (Nd), or Numeric (No, Nl). isalnum is effectively isalpha or isdigit. The C implementation is in Objects/unicodectype.c using _PyUnicode_TypeRecord flags: ALPHA_MASK (0x01) for categories Ll/Lu/Lt/Lo/Lm, and DIGIT_MASK or DECIMAL_MASK for numeric. The function likely checks (ctype->flags & (ALPHA_MASK | DIGIT_MASK)) != 0, though exact isalnum function not directly shown, but follows from isalpha and isdigit patterns and Py_UNICODE_ISALNUM macro. For bytes.isalnum, it uses Py_ISALNUM from pyctype.h table for ASCII. Sources confirm via unicodedata lookups and generated tables from Unicode data.

Citations:

1: https://docs.python.org/3/library/stdtypes.html?highlight=str%20isalphanum

2: https://docs.python.org/3/library/stdtypes.html

3: https://docs.python.org/3/c-api/unicode.html

4: https://github.com/python/cpython/blob/main/Objects/unicodectype.c

5: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c

🏁 Script executed:

# Check the actual combining classes of test characters python3 << 'EOF' import unicodedata test_chars = [ ('\u0303', 'U+0303 COMBINING TILDE'), ('\u0345', 'U+0345 COMBINING GREEK YPOGEGRAMMENI'), ('\u0363', 'U+0363 COMBINING LATIN SMALL LETTER A'), ('\u093F', 'U+093F DEVANAGARI VOWEL SIGN I'), ] for char, name in test_chars: cat = unicodedata.category(char) comb_class = unicodedata.combining(char) print(f"{name}: category={cat}, combining_class={comb_class}") EOF

Repository: RustPython/RustPython

Length of output: 327

🏁 Script executed:

# Check if Rust's is_alphanumeric matches Unicode Alphabetic property # First check what tests actually verify git show HEAD:extra_tests/snippets/builtin_str.py | sed -n '76,95p'

Repository: RustPython/RustPython

Length of output: 601

🏁 Script executed:

# Check if there are any other tests or comments about combining marks git show HEAD:crates/vm/src/builtins/str.rs | grep -A 5 -B 5 "combining\|Combining"

Repository: RustPython/RustPython

Length of output: 708

CanonicalCombiningClass::NotReordered filtering is incomplete for CPython parity.

CPython defines str.isalnum() as isalpha() || isdecimal() || isdigit() || isnumeric(), where isalpha() accepts only Unicode letter categories Lm/Lt/Lu/Ll/Lo. Rust's char::is_alphanumeric() is broader, including Alphabetic property which covers combining marks. Unicode class-zero (NotReordered) marks include spacing combining marks like U+093F (DEVANAGARI VOWEL SIGN I), which has category Mc and combining_class=0. The current filter admits such marks despite CPython rejecting them. Tests cover only non-class-zero combining marks (U+0303/U+0345/U+0363..U+036F, all with combining_class=230 or 240), missing this gap. Switch to a CPython-compatible category check (is_letter_category(L*) || is_digit_category(N*)) and add a regression test for U+093F. The mirror predicate in crates/sre_engine/src/string.rs:444-452 (which has a TODO comment) should be updated in lockstep to keep str.isalnum() and \w aligned.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/builtins/str.rs` around lines 949 - 954, Replace the current isalnum implementation (which uses char::is_alphanumeric plus CanonicalCombiningClass check) with a CPython-compatible predicate that accepts only Unicode letter categories (Lu/Ll/Lt/Lm/Lo) or number categories (Nd/Nl/Np) — i.e., check the character's general category starts with 'L' or 'N' — so spacing/combing marks like U+093F (Mc) are rejected; update the isalnum method (and the char_all predicate it uses) to use this category test, update the mirror predicate used for \w in the sre_engine string predicate to the same logic so they stay aligned, and add a regression test asserting str.isalnum() returns false for U+093F.

ShaharNaveh

Looks great!
tysm:)

joshuamegnauth54 · 2026-04-16T16:25:34Z

CodeRabbit's comment is interesting. 🤔 I assumed there were more edge cases. I didn't look at isalpha() and the rest yet, but there are issues open for them too. I'll try to fix those in a future patch in a more robust manner.

ShaharNaveh · 2026-04-16T17:56:20Z

+assert "\u006e".isalnum()
+assert not "\u0303".isalnum()
+assert not "\u006e\u0303".isalnum()
+assert "\u00f1".isalnum()


it may be worth opening a PR for CPython to test those changes to ensure there's no regression on their end as well (it will get to us as well)

youknowone

Rust and Python differ on alphanumeric characters. Rust follows the Unicode standard closer than Python.

If it does, isn't it better to patch CPython to follow unicode standard better?

joshuamegnauth54 · 2026-04-17T17:25:38Z

Sorry, I'm not sure if I'm supposed to respond here to you two after PRs have been merged. 😅

I don't fully understand the issue, but I think that the differences are due to Unicode version and how each project chooses to define "alphanumeric." I'm trying to figure out if I can just combine Unicode properties to mimic cpython.

coderabbitai bot reviewed Apr 15, 2026

View reviewed changes

Comment thread extra_tests/snippets/stdlib_re.py Outdated

ShaharNaveh reviewed Apr 15, 2026

View reviewed changes

joshuamegnauth54 added 2 commits April 15, 2026 21:55

fix: Ignore combining characters in SRE

f79d6df

Closes: RustPython#7518

joshuamegnauth54 force-pushed the isalnum-combining-character branch from 64dd760 to f79d6df Compare April 16, 2026 03:23

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

ShaharNaveh approved these changes Apr 16, 2026

View reviewed changes

ShaharNaveh reviewed Apr 16, 2026

View reviewed changes

youknowone approved these changes Apr 17, 2026

View reviewed changes

youknowone merged commit aac2070 into RustPython:main Apr 17, 2026
20 checks passed

joshuamegnauth54 deleted the isalnum-combining-character branch April 17, 2026 17:22

	.map(\|x\| x.is_alphanumeric() && map.get(x) == CanonicalCombiningClass::NotReordered)
	.map(\|x\| x.is_alphanumeric() && CanonicalCombiningClass::for_char(x) == CanonicalCombiningClass::NotReordered)

	c.is_alphanumeric() && map.get(c) == CanonicalCombiningClass::NotReordered
	c.is_alphanumeric() && CanonicalCombiningClass::for_char(c) == CanonicalCombiningClass::NotReordered

	for raw in range(0x0363, 0x036f):
	for raw in range(0x0363, 0x036F):

Conversation

joshuamegnauth54 commented Apr 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ShaharNaveh Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 commented Apr 16, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 commented Apr 16, 2026

Uh oh!

ShaharNaveh Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

joshuamegnauth54 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joshuamegnauth54 commented Apr 15, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 15, 2026 •

edited

Loading

ShaharNaveh Apr 15, 2026 •

edited

Loading

ShaharNaveh Apr 15, 2026 •

edited

Loading

ShaharNaveh Apr 16, 2026 •

edited

Loading