[mypyc] Add librt.strings.toupper and tolower codepoint primitives #21553

Open
VaggelisD wants to merge 1 commit into python:master from VaggelisD:librt-strings-toupper-tolower

VaggelisD commented May 28, 2026

6th PR of #21418.

This PR introduces two i32 -> i32 case-conversion helpers, alongside the existing classifiers.

The constraint to flag: a single i32 holds one codepoint, but some Unicode case mappings expand to multiple codepoints, e.g. 'ß'.upper() becomes 'SS' and 'ﬁ'.upper() (the U+FB01 ligature) becomes 'FI'.

For those inputs the primitive returns the input unchanged. This is the same split CPython makes between Py_UNICODE_TOUPPER (codepoint) and str.upper() (string), with Py_UNICODE_TOUPPER returning the first codepoint of the expansion.
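As a quick illustration of the expansion problem, the following sketch uses only the built-in `str.upper()` / `str.lower()` to show which single-codepoint inputs map to more than one codepoint. The helper name `expansion_lengths` is made up for this example and is not part of the PR.

```python
def expansion_lengths(ch: str) -> tuple[int, int]:
    """Return (len(ch.upper()), len(ch.lower())) for a 1-character string."""
    return len(ch.upper()), len(ch.lower())


# U+00DF (sharp s) and U+FB01 (fi ligature) expand when uppercased;
# U+0130 (capital I with dot above) expands when lowercased.
for ch in ("a", "\u00df", "\ufb01", "\u0130"):
    up_len, low_len = expansion_lengths(ch)
    print(f"U+{ord(ch):04X}: upper has {up_len} codepoint(s), lower has {low_len}")
```

Any input whose length comes back as 2 or more cannot be represented by an i32 -> i32 helper, which is why such inputs need a special rule.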

Users needing full Unicode case conversion should call s.upper() / s.lower() on the string, for which we already have mypyc primitives (#20948). In ASCII benchmarks, the codepoint primitives are ~5x faster than their str counterparts because they avoid the 1-char allocation.
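To make the difference between the two levels concrete, here is a small pure-Python sketch. `codepoint_upper` is a hypothetical stand-in that only imitates the "return the input unchanged on expansion" rule described above; it is not the `librt.strings` primitive and says nothing about its performance.

```python
def codepoint_upper(cp: int) -> int:
    """Hypothetical stand-in: uppercase one codepoint, unchanged if it expands."""
    out = chr(cp).upper()
    return ord(out) if len(out) == 1 else cp


word = "stra\u00dfe"  # "straße"

# Codepoint-by-codepoint conversion keeps the sharp s as-is.
per_codepoint = "".join(chr(codepoint_upper(ord(c))) for c in word)

# String-level conversion applies the full Unicode mapping.
full = word.upper()

print(per_codepoint)
print(full)
```

The per-codepoint result keeps the same length as the input, while `str.upper()` may grow the string, so code that needs linguistically correct output should stay with the string method.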

Two i32 -> i32 case-conversion helpers mirroring the existing codepoint
classifiers. ASCII fast paths inline (`a..z <-> A..Z` via add/sub 32);
non-ASCII delegates to `str.upper` / `str.lower` on a 1-character string
via a shared LibRTStrings_ChangeCase_slow helper.

When uppercasing or lowercasing expands to multiple codepoints (e.g.
'ß'.upper() == 'SS', 'ﬁ'.upper() == 'FI'), the helper returns the input
unchanged so the signature stays i32 -> i32. Allocation failure aborts
via CPyError_OutOfMemory, matching how LibRTStrings_IsIdentifier handles
OOM and keeping the helpers ERR_NEVER.

Following the inline-in-header pattern landed for isidentifier (python#21522),
the bodies live as `static inline` in librt_strings.h so they compile
directly into both the librt.strings module and mypyc-emitted code with
no capsule indirection.

Stack: depends on the librt.strings.isidentifier primitive (python#21522).
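
The control flow described in the commit message can be modelled in Python roughly as follows. This is a sketch of the stated semantics only: the real helpers are C `static inline` functions, the names `toupper_model`, `tolower_model` and `_change_case_slow` are invented here, and the model assumes the input is a valid codepoint (0..0x10FFFF), which the source does not discuss.

```python
def _change_case_slow(cp: int, upper: bool) -> int:
    """Non-ASCII path: convert via a 1-character string."""
    s = chr(cp)  # assumes 0 <= cp <= 0x10FFFF
    out = s.upper() if upper else s.lower()
    # If the mapping expands to several codepoints, keep the input unchanged.
    return ord(out) if len(out) == 1 else cp


def toupper_model(cp: int) -> int:
    if 0x61 <= cp <= 0x7A:  # ASCII 'a'..'z'
        return cp - 32
    if cp < 0x80:  # other ASCII: nothing to change
        return cp
    return _change_case_slow(cp, upper=True)


def tolower_model(cp: int) -> int:
    if 0x41 <= cp <= 0x5A:  # ASCII 'A'..'Z'
        return cp + 32
    if cp < 0x80:
        return cp
    return _change_case_slow(cp, upper=False)
```

In this model only the slow path allocates a string, which matches the commit message's point that the ASCII paths are plain integer arithmetic.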

VaggelisD force-pushed the librt-strings-toupper-tolower branch from 84a8652 to c2e43f3 on May 28, 2026 09:06

@github-actions

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
