rustpython-unicode should be a shared crate that provides CPython-compatible Unicode
semantics and Unicode data for Rust-based Python implementations and tools.
Its purpose is not to implement Python string objects or high-level string methods. Instead,
it should provide the Unicode foundation that Python runtimes need: character classification,
case mapping, normalization, identifier rules, regex character-class predicates, and
unicodedata-style access to Unicode database information.
This makes it a natural shared dependency between RustPython and Pyre, while keeping actual
string operations in crate::str or the host runtime.
Design Goals
- Match CPython behavior at the Unicode data and semantics level, not just in a few visible
edge cases.
- Use a single version-pinned Unicode data source per targeted CPython release.
- Expose low-level APIs that can be reused by str, re, parser, compiler, and unicodedata.
- Support non-scalar code points (lone surrogates, U+D800 to U+DFFF) where Python behavior
requires it. Rust's char cannot represent them, so the core API should be u32-based rather
than char-based.
- Be usable outside RustPython, especially by other Python-related Rust projects.
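To illustrate the u32 goal above: a Python str may hold a lone surrogate such as U+D800, but Rust's char only covers Unicode scalar values. The following is a minimal sketch using only the standard library; the helper names (is_surrogate, is_code_point, to_scalar) are hypothetical and not part of any proposed API.

```rust
/// Returns true for surrogate code points (U+D800..=U+DFFF). These can appear
/// in a Python str but are not Unicode scalar values.
pub fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// Returns true if `cp` lies inside the Unicode code space (0..=0x10FFFF).
pub fn is_code_point(cp: u32) -> bool {
    cp <= 0x10FFFF
}

/// Converts to `char` when possible. Surrogates and out-of-range values
/// yield `None`, which is why a char-based API cannot cover them.
pub fn to_scalar(cp: u32) -> Option<char> {
    char::from_u32(cp)
}
```

A u32-based predicate can still answer for U+D800 (for example, "not alphabetic"), whereas a char-based one cannot even receive it as an argument.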
Scope
rustpython-unicode should include:
- Unicode character classification used by Python:
  - isalpha
  - isalnum
  - isdecimal
  - isdigit
  - isnumeric
  - isspace
  - isprintable
  - casing predicates if needed
- Identifier-related predicates:
  - is_xid_start
  - is_xid_continue
  - Python identifier helpers
- Regex-oriented Unicode predicates:
  - Unicode \w
  - Unicode \d
  - Unicode \s
  - any other CPython regex character classes that depend on Unicode tables
- Case conversion and casing-related data:
  - lowercase
  - uppercase
  - titlecase
  - casefold
  - full mappings where Python requires them
- Unicode normalization support needed by Python:
  - NFC
  - NFD
  - NFKC
  - NFKD
  - is_normalized
- unicodedata-style database access:
  - general category
  - bidirectional class
  - combining class
  - east asian width
  - mirrored
  - decomposition
  - decimal/digit/numeric values
  - name lookup
  - character lookup by name
  - Unicode age/version checks if needed for CPython compatibility layers
- Versioned Unicode tables aligned with CPython.
Non-Goals
rustpython-unicode should not include:
- Python str object behavior
- slicing, searching, splitting, joining, formatting, padding, or other string algorithms
- Python object model concerns
- interpreter-specific wrappers
Those belong in crate::str or the embedding runtime.
Architecture
A good split would be:
- rustpython-unicode
  - owns Unicode tables and Unicode semantics
  - exposes u32-based predicates and mappings
  - exposes unicodedata-style query APIs
- crate::str
  - owns Python string methods and higher-level string algorithms
  - calls into rustpython-unicode for all Unicode-sensitive behavior
- regex engine
  - calls into rustpython-unicode for Unicode character classes
- unicodedata module
  - becomes a thin wrapper over rustpython-unicode
This keeps one authoritative Unicode path across the runtime.
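The split above could look roughly like the sketch below. It assumes two hypothetical modules: `unicode` stands in for rustpython-unicode (with an ASCII-only placeholder predicate instead of real tables), and `str_layer` stands in for crate::str. The string layer owns the "non-empty and every code point matches" rule that Python's str.is* methods share, and the per-code-point answer comes from the Unicode layer.

```rust
/// Stand-in for the shared Unicode crate. Hypothetical and ASCII-only for
/// brevity; a real implementation would consult generated tables.
mod unicode {
    pub fn is_decimal(cp: u32) -> bool {
        (0x30..=0x39).contains(&cp)
    }
}

/// Stand-in for the string layer, which owns string-level algorithms.
mod str_layer {
    /// Python's str.is* methods return False for the empty string and
    /// otherwise require every code point to satisfy the predicate.
    pub fn all_code_points(s: &[u32], pred: fn(u32) -> bool) -> bool {
        !s.is_empty() && s.iter().all(|&cp| pred(cp))
    }

    /// str.isdecimal() expressed as a thin wrapper over the Unicode layer.
    pub fn isdecimal(s: &[u32]) -> bool {
        all_code_points(s, super::unicode::is_decimal)
    }
}
```

With this shape, fixing a classification bug means changing one predicate in the Unicode layer, and every caller picks up the fix.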
Suggested Public API Direction
pub mod classify {
    pub fn is_alpha(cp: u32) -> bool;
    pub fn is_alnum(cp: u32) -> bool;
    pub fn is_decimal(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_numeric(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
    pub fn is_printable(cp: u32) -> bool;
}
pub mod identifier {
    pub fn is_xid_start(cp: u32) -> bool;
    pub fn is_xid_continue(cp: u32) -> bool;
    pub fn is_python_identifier_start(cp: u32) -> bool;
    pub fn is_python_identifier_continue(cp: u32) -> bool;
}
pub mod regex {
    pub fn is_word(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
}
pub mod case {
    pub fn to_lowercase(cp: u32) -> CaseMapping;
    pub fn to_uppercase(cp: u32) -> CaseMapping;
    pub fn to_titlecase(cp: u32) -> CaseMapping;
    pub fn casefold(cp: u32) -> CaseMapping;
}
pub mod normalize {
    pub fn nfc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn is_normalized_nfc<I: IntoIterator<Item = u32>>(input: I) -> bool;
}
pub mod data {
    pub fn category(cp: u32) -> GeneralCategory;
    pub fn bidirectional(cp: u32) -> BidiClass;
    pub fn combining(cp: u32) -> u8;
    pub fn east_asian_width(cp: u32) -> EastAsianWidth;
    pub fn mirrored(cp: u32) -> bool;
    pub fn decomposition(cp: u32) -> Option;
    pub fn decimal(cp: u32) -> Option;
    pub fn digit(cp: u32) -> Option;
    pub fn numeric(cp: u32) -> Option;
    pub fn name(cp: u32) -> Option<&'static str>;
    pub fn lookup(name: &str) -> Option;
}
The exact types can change, but the important part is the boundary: low-level Unicode
semantics here, string algorithms elsewhere.
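As one possible reading of the CaseMapping return type: full case mappings can expand a single code point into several (for example, U+00DF maps to "SS" in uppercase), so the type needs to hold a short sequence. The sketch below is an assumption, not a settled design. The struct layout and the `single` and `as_slice` helpers are hypothetical, and it borrows the standard library's char::to_uppercase as a stand-in for pinned tables. The standard library's data follows the Rust toolchain's Unicode version rather than CPython's, so a real implementation would not rely on it.

```rust
/// Hypothetical shape for `CaseMapping`: up to three code points stored inline,
/// so no allocation is needed per lookup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CaseMapping {
    len: u8,
    cps: [u32; 3],
}

impl CaseMapping {
    /// Identity mapping for a single code point.
    fn single(cp: u32) -> Self {
        CaseMapping { len: 1, cps: [cp, 0, 0] }
    }

    /// The mapped code points, in order.
    pub fn as_slice(&self) -> &[u32] {
        &self.cps[..self.len as usize]
    }
}

/// Stand-in uppercase mapping built on `char::to_uppercase`, which yields
/// between one and three chars. Non-scalar input maps to itself.
pub fn to_uppercase(cp: u32) -> CaseMapping {
    let c = match char::from_u32(cp) {
        Some(c) => c,
        None => return CaseMapping::single(cp),
    };
    let mut out = CaseMapping { len: 0, cps: [0; 3] };
    for up in c.to_uppercase() {
        out.cps[out.len as usize] = up as u32;
        out.len += 1;
    }
    out
}
```

Returning a small fixed-size value keeps the per-code-point API cheap, and the string layer decides how to assemble the results into a new string.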
Compatibility Model
The crate should define compatibility against a specific CPython line, for example:
- CPython 3.14 Unicode semantics
- version-pinned generated tables
- explicit regeneration workflow when upgrading CPython
That matters because “Unicode-correct” is not enough here. The target is
“CPython-compatible.”
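A concrete case of the gap between the two: CPython's str.isspace() accepts the information separators U+001C to U+001F, while the Unicode White_Space property (which Rust's char::is_whitespace follows) does not. The sketch below hard-codes the code points CPython treats as whitespace. The list is written out by hand for illustration and should be verified against the targeted CPython release, and the function names are hypothetical.

```rust
/// Code points for which CPython's `str.isspace()` is expected to return True.
/// Hand-written for illustration; a real table would be generated.
pub fn cpython_is_space(cp: u32) -> bool {
    matches!(
        cp,
        0x0009..=0x000D
            | 0x001C..=0x001F
            | 0x0020
            | 0x0085
            | 0x00A0
            | 0x1680
            | 0x2000..=0x200A
            | 0x2028
            | 0x2029
            | 0x202F
            | 0x205F
            | 0x3000
    )
}

/// The standard library's answer (Unicode White_Space), extended to u32.
/// Non-scalar values are treated as "not whitespace".
pub fn std_is_whitespace(cp: u32) -> bool {
    char::from_u32(cp).map_or(false, char::is_whitespace)
}

/// Every code point where the two definitions disagree.
pub fn disagreements() -> Vec<u32> {
    (0..=0x10FFFFu32)
        .filter(|&cp| cpython_is_space(cp) != std_is_whitespace(cp))
        .collect()
}
```

A runtime that delegates isspace to the standard library would be Unicode-correct yet differ from CPython on inputs such as "\x1c".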
Why This Is Better Than Ad Hoc Fixes
- str, re, parser, and unicodedata stop drifting apart.
- There is one authoritative source for Unicode behavior.
- Compatibility work becomes table-driven instead of patch-driven.
- Future CPython upgrades become more mechanical and auditable.
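"Table-driven" could mean something like the following sketch: each property is a sorted table of inclusive code point ranges, looked up with a binary search. The three ranges shown are only a small excerpt of the decimal-digit characters, hand-written for illustration. In the proposed crate the tables would be generated from the pinned Unicode data, and the names here (DECIMAL_RANGES, in_ranges) are hypothetical.

```rust
use std::cmp::Ordering;

/// Excerpt of a decimal-digit range table: sorted, non-overlapping, inclusive.
/// A real table would be generated from pinned Unicode data, not hand-written.
static DECIMAL_RANGES: &[(u32, u32)] = &[
    (0x0030, 0x0039), // DIGIT ZERO..DIGIT NINE
    (0x0660, 0x0669), // ARABIC-INDIC DIGIT ZERO..NINE
    (0x06F0, 0x06F9), // EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
];

/// Binary search over a sorted range table.
pub fn in_ranges(table: &[(u32, u32)], cp: u32) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below cp
            } else if lo > cp {
                Ordering::Greater // range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .is_ok()
}

/// Predicate backed purely by the table.
pub fn is_decimal(cp: u32) -> bool {
    in_ranges(DECIMAL_RANGES, cp)
}
```

Under this design, upgrading to a new CPython release amounts to regenerating the tables and reviewing the diff, with the lookup code left unchanged.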