Match CPython's islower/isupper exactly#7646
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReworked string casedness detection: added a generic Changes
Sequence Diagram(s)(omitted — changes are focused on algorithmic/case-detection logic within the VM and do not introduce a multi-component sequential flow requiring visualization) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/anystr.rs`:
- Around line 442-456: The is_cased helper (fn is_cased<VALID, INVALID>)
currently only rejects INVALID property and thus lets titlecase characters slip;
update the loop to also detect and reject titlecase characters for each char in
chunk.chars() (e.g., call CharExt::is_titlecase(ch) or check
GeneralCategory::for_char(ch) == GeneralCategory::TitlecaseLetter) so any
titlecase char causes an early return false, preserving the existing VALID
detection logic and final all_cased result.
In `@crates/vm/src/stdlib/_thread.rs`:
- Line 469: The return in function thread_to_id currently returns
handle.as_pthread_t() which is a platform-dependent RawPthread; explicitly cast
that value to u64 before returning to match the function's u64 return type and
ensure Unix portability — locate thread_to_id and change the return to cast
handle.as_pthread_t() into u64 (e.g., as u64) so the conversion is explicit and
clear.
In `@extra_tests/snippets/builtin_str_encode.py`:
- Around line 24-26: Replace the incorrect 6-digit escape and wrong expectation:
use the 8-digit escape sequence \U0001f431 for the emoji in both assertions and
change the first assertion to expect True (emoji bytes are ignored) by asserting
"\U0001f431 CAT".encode().isupper(), while keeping the second assertion as
"\U0001f431 cat".encode().islower().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4a7cc4e6-e414-4b7d-9ca0-4066c74a3fa2
📒 Files selected for processing (5)
crates/vm/src/anystr.rscrates/vm/src/builtins/str.rscrates/vm/src/stdlib/_thread.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/builtin_str_encode.py
a3559f2 to
5c0fa80
Compare
|
Are there any relevant tests in, say, |
|
It looks like @fanninpm I skimmed them but couldn't find any tests for Unicode properties. It makes sense, though, because the properties may change across Unicode versions (e.g. \u0295 above). RustPython can probably remove its Unicode tests too once its string implementations match CPython exactly. |
5c0fa80 to
dde4ac3
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)
444-462: Optional: foldis_casedinto a single pass over chars.Functionally correct and the titlecase rejection addresses the regression flagged on the previous revision. A minor refinement: each chunk is traversed twice (
chunk.chars().any(...)for the invalid/titlecase check, then again for the valid check). A singlefor c in ...loop is a touch more efficient on long strings and mirrors CPython'sunicode_islower_implcontrol flow more directly.♻️ Proposed single-pass refactor
fn is_cased<VALID, INVALID>(&self) -> bool where VALID: BinaryProperty, INVALID: BinaryProperty, { let mut all_cased = false; - for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) { - if chunk.chars().any(|c| { - INVALID::for_char(c) - || GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) - }) { - return false; - } - if !all_cased && chunk.chars().any(VALID::for_char) { - all_cased = true; - } + for c in self + .as_bytes() + .utf8_chunks() + .flat_map(|c| c.valid().chars()) + { + if INVALID::for_char(c) + || GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) + { + return false; + } + if !all_cased && VALID::for_char(c) { + all_cased = true; + } } all_cased }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 444 - 462, The current is_cased implementation iterates each utf8 chunk twice (once to check INVALID::for_char/TitlecaseLetter via GeneralCategory::for_char and again to check VALID::for_char), so refactor is_cased to perform a single pass: for each chunk from self.as_bytes().utf8_chunks().map(|c| c.valid()) iterate each char once, immediately return false if INVALID::for_char(c) or GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) is true, and otherwise set a flag when VALID::for_char(c) is seen; after processing all chunks return that flag. Keep the same trait bounds (VALID: BinaryProperty, INVALID: BinaryProperty) and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 444-462: The current is_cased implementation iterates each utf8
chunk twice (once to check INVALID::for_char/TitlecaseLetter via
GeneralCategory::for_char and again to check VALID::for_char), so refactor
is_cased to perform a single pass: for each chunk from
self.as_bytes().utf8_chunks().map(|c| c.valid()) iterate each char once,
immediately return false if INVALID::for_char(c) or
GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) is
true, and otherwise set a flag when VALID::for_char(c) is seen; after processing
all chunks return that flag. Keep the same trait bounds (VALID: BinaryProperty,
INVALID: BinaryProperty) and behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: d31dd5c7-e5bc-4027-bdd5-f17b032670cf
📒 Files selected for processing (5)
crates/vm/src/anystr.rscrates/vm/src/builtins/str.rscrates/vm/src/stdlib/_thread.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/builtin_str_encode.py
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/vm/src/stdlib/_thread.rs
- extra_tests/snippets/builtin_str.py
- extra_tests/snippets/builtin_str_encode.py
dde4ac3 to
a771228
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)
441-465:is_casednow faithfully mirrors CPython'sunicode_is{upper,lower}_impl(titlecase regression fixed).The previous concern about titlecase code points (e.g.
"\u01c5A".isupper()returningtrue) is resolved by theGeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c))guard, which is equivalent to CPython'sPy_UNICODE_ISTITLE(set onLt). The control flow — early-returnfalseonINVALIDor titlecase, setall_casedonce onVALID, returnall_cased— matches CPython line-for-line.The
as_bytes().utf8_chunks().flat_map(|c| c.valid().chars())pattern is the right choice for theWtf8impl too: isolated surrogates are dropped, but since surrogates are neverLu/Ll/Ltin CPython either, the observable behavior is identical.Optional readability nit (non-blocking): the titlecase test could be written more directly, in line with the nearby
isdecimalstyle onstr.rs:♻️ Optional tweak
- if INVALID::for_char(c) - || GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) - { + if INVALID::for_char(c) + || matches!(GeneralCategory::for_char(c), GeneralCategory::TitlecaseLetter) + { return false; }This also lets you drop
GeneralCategoryGroupfrom the imports at the top of the file if no other user remains.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 441 - 465, Replace the Titlecase guard in is_cased to check the GeneralCategory directly (e.g. compare GeneralCategory::for_char(c) to the TitlecaseLetter variant or call a dedicated is_titlecase method) instead of using GeneralCategoryGroup::TitlecaseLetter.contains(...); update the import list to remove GeneralCategoryGroup if unused (keep GeneralCategory) so the check mirrors the nearby isdecimal style and removes the extra import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 441-465: Replace the Titlecase guard in is_cased to check the
GeneralCategory directly (e.g. compare GeneralCategory::for_char(c) to the
TitlecaseLetter variant or call a dedicated is_titlecase method) instead of
using GeneralCategoryGroup::TitlecaseLetter.contains(...); update the import
list to remove GeneralCategoryGroup if unused (keep GeneralCategory) so the
check mirrors the nearby isdecimal style and removes the extra import.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 5635961c-c3cc-4e5a-bf28-03108999e18b
📒 Files selected for processing (5)
crates/vm/src/anystr.rscrates/vm/src/builtins/str.rscrates/vm/src/stdlib/_thread.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/builtin_str_encode.py
✅ Files skipped from review due to trivial changes (2)
- crates/vm/src/stdlib/_thread.rs
- extra_tests/snippets/builtin_str_encode.py
ead1069 to
0acaac3
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)
444-465:is_casedcorrectly rejects opposite-case and titlecase — matches CPython.The titlecase rejection (previously flagged) is now in place, so strings like
"DžA".isupper()and"Dža".islower()will returnfalseas CPython does.Two minor notes:
- The titlecase test is slightly indirect — a direct equality is clearer and avoids constructing a group:
♻️ Optional simplification
- if INVALID::for_char(c) - || GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) - { + if INVALID::for_char(c) || GeneralCategory::for_char(c) == GeneralCategory::TitlecaseLetter { return false; }
- For
Wtf8, unpaired surrogates are encoded as invalid UTF-8 bytes and get skipped byutf8_chunks().flat_map(|c| c.valid().chars()). Surrogates aren't cased, so this is semantically fine, but worth a brief comment so future readers don't mistake the chunking for a bug.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 444 - 465, The is_cased function should use a direct titlecase comparison and document surrogate behavior: replace the indirect GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) with an equality check GeneralCategory::for_char(c) == GeneralCategory::TitlecaseLetter inside is_cased (keeping the early return where INVALID::for_char(c) triggers false), and add a short comment near the utf8_chunks().flat_map(...) explaining that Wtf8 encodes unpaired surrogates as invalid UTF‑8 bytes which are skipped by utf8_chunks so surrogates won’t be seen/cased; reference the is_cased function and the VALID/INVALID, GeneralCategory, and utf8_chunks symbols when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 444-465: The is_cased function should use a direct titlecase
comparison and document surrogate behavior: replace the indirect
GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c))
with an equality check GeneralCategory::for_char(c) ==
GeneralCategory::TitlecaseLetter inside is_cased (keeping the early return where
INVALID::for_char(c) triggers false), and add a short comment near the
utf8_chunks().flat_map(...) explaining that Wtf8 encodes unpaired surrogates as
invalid UTF‑8 bytes which are skipped by utf8_chunks so surrogates won’t be
seen/cased; reference the is_cased function and the VALID/INVALID,
GeneralCategory, and utf8_chunks symbols when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4741774e-310b-4e4d-88cf-c5c207167f99
📒 Files selected for processing (5)
crates/vm/src/anystr.rscrates/vm/src/builtins/str.rscrates/vm/src/stdlib/_thread.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/builtin_str_encode.py
✅ Files skipped from review due to trivial changes (1)
- crates/vm/src/stdlib/_thread.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- extra_tests/snippets/builtin_str_encode.py
This PR fixes a regression from my last islower/isupper patch. Python's Bytes doesn't assume an encoding, so methods like islower should only consider ASCII casing. I updated islower/isupper for UTF-8 and WTF-8 to match CPython more closely. The two functions now use the same properties as CPython and now match CPython exactly. I updated the unit tests to pass on Python 3.15. Unicode updates sometimes cause properties to shift. I previously tested everything on Python 3.14, but that lead to failures that I assumed were bugs but were actually due to Unicode differences. For example, U+0295 is a lower case letter in older Unicode versions but is NOT in newer versions. One of the new tests is disabled on Python 3.14 for now because it will fail in CI till CI is bumped to 3.15.
0acaac3 to
043dbc5
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)
455-457: Prefer directGeneralCategoryequality over group containment for single-category checks.
GeneralCategoryGroup::TitlecaseLetteris a single-category group, so.contains(...)is unnecessary. The codebase already uses this pattern instr.rswithmatches!(GeneralCategory::for_char(c), GeneralCategory::DecimalNumber)for equivalent checks.♻️ Proposed tweak
- if INVALID::for_char(c) - || GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) - { + if INVALID::for_char(c) + || GeneralCategory::for_char(c) == GeneralCategory::TitlecaseLetter + { return false; }After this change,
GeneralCategoryGroupcan be removed from the import list at lines 7–9, as it has no other uses in this file.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 455 - 457, The code uses GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) for a single-category check; replace that containment call with a direct GeneralCategory comparison (e.g., match or matches! against GeneralCategory::TitlecaseLetter) alongside the existing INVALID::for_char(c) check in the same if, and then remove GeneralCategoryGroup from the imports (it's no longer used in this file); update the condition that currently reads GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) to use direct equality/matching against GeneralCategory::TitlecaseLetter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 455-457: The code uses
GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) for
a single-category check; replace that containment call with a direct
GeneralCategory comparison (e.g., match or matches! against
GeneralCategory::TitlecaseLetter) alongside the existing INVALID::for_char(c)
check in the same if, and then remove GeneralCategoryGroup from the imports
(it's no longer used in this file); update the condition that currently reads
GeneralCategoryGroup::TitlecaseLetter.contains(GeneralCategory::for_char(c)) to
use direct equality/matching against GeneralCategory::TitlecaseLetter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4021c102-3e15-47ce-80f5-80b372772a1d
📒 Files selected for processing (5)
crates/vm/src/anystr.rscrates/vm/src/builtins/str.rscrates/vm/src/stdlib/_thread.rsextra_tests/snippets/builtin_str.pyextra_tests/snippets/builtin_str_encode.py
✅ Files skipped from review due to trivial changes (1)
- crates/vm/src/stdlib/_thread.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- extra_tests/snippets/builtin_str.py
- extra_tests/snippets/builtin_str_encode.py
This PR fixes a regression from my last
islower/isupperpatch. Python's Bytes doesn't assume an encoding, so methods likeislowershould only consider ASCII casing.I updated
islower/isupperfor UTF-8 and WTF-8 to match CPython exactly. The two functions now use the same properties as CPython and in the exact same way that CPython.I updated the unit tests to pass on Python 3.15. Unicode updates sometimes cause properties to shift. I previously tested everything on Python 3.14, but that lead to failures that I assumed were bugs but were actually due to Unicode differences. For example, U+0295 is a lower case letter in older Unicode versions but is NOT in newer versions.
With Python 3.14:
With Python 3.15 and RustPython:
^It's imperative to test with the latest Python because of Unicode differences. I, uh, spent like three hours trying to fix a bug that didn't exist because I was crosschecking the tests with Python 3.14... Oops.
I bumped CI's Python version to 3.15 as well.Reverted because it's more complicated than I thought.Here is CPython's isupper implementation for comparison.
CPython's core logic for
islower(isupperis basically the same)RustPython's
islower:They're the same now! 😁
Summary by CodeRabbit
Bug Fixes
Tests