rustpython-unicode #7560

@youknowone

rustpython-unicode should be a shared crate that provides CPython-compatible Unicode
semantics and Unicode data for Rust-based Python implementations and tools.

Its purpose is not to implement Python string objects or high-level string methods. Instead,
it should provide the Unicode foundation that Python runtimes need: character classification,
case mapping, normalization, identifier rules, regex character-class predicates, and
unicodedata-style access to Unicode database information.

This makes it a natural shared dependency between RustPython and Pyre, while keeping actual
string operations in crate::str or the host runtime.

Design Goals

  • Match CPython behavior at the Unicode data and semantics level, not just in a few visible
    edge cases.
  • Use a single version-pinned Unicode data source per targeted CPython release.
  • Expose low-level APIs that can be reused by str, re, parser, compiler, and unicodedata.
  • Support non-scalar code points where Python behavior requires it, so the core API should be
    u32-based rather than char-based.
  • Be usable outside RustPython, especially by other Python-related Rust projects.
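To illustrate the u32 point: Python strings can hold lone surrogates, which Rust's `char` cannot represent. The helper names below are hypothetical and only sketch why a `char`-based API would be too narrow.

```rust
/// Hypothetical helper: true if `cp` lies in the code point range
/// 0..=0x10FFFF, which includes lone surrogates.
pub fn is_python_code_point(cp: u32) -> bool {
    cp <= 0x10FFFF
}

/// Hypothetical helper: true for surrogate code points U+D800..=U+DFFF.
pub fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// Converts to `char` only when `cp` is a Unicode scalar value.
/// `char::from_u32` returns `None` for surrogates and out-of-range values.
pub fn as_scalar(cp: u32) -> Option<char> {
    char::from_u32(cp)
}
```

A surrogate such as `0xD800` passes `is_python_code_point` but yields `None` from `as_scalar`, so predicates taking `char` could never be asked about it.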

Scope

rustpython-unicode should include:

  • Unicode character classification used by Python:
    • isalpha
    • isalnum
    • isdecimal
    • isdigit
    • isnumeric
    • isspace
    • isprintable
    • casing predicates if needed
  • Identifier-related predicates:
    • is_xid_start
    • is_xid_continue
    • Python identifier helpers
  • Regex-oriented Unicode predicates:
    • Unicode \w
    • Unicode \d
    • Unicode \s
    • any other CPython regex character classes that depend on Unicode tables
  • Case conversion and casing-related data:
    • lowercase
    • uppercase
    • titlecase
    • casefold
    • full mappings where Python requires them
  • Unicode normalization support needed by Python:
    • NFC
    • NFD
    • NFKC
    • NFKD
    • is_normalized
  • unicodedata-style database access:
    • general category
    • bidirectional class
    • combining class
    • east asian width
    • mirrored
    • decomposition
    • decimal/digit/numeric values
    • name lookup
    • character lookup by name
    • Unicode age/version checks if needed for CPython compatibility layers
  • Versioned Unicode tables aligned with CPython.

Non-Goals

rustpython-unicode should not include:

  • Python str object behavior
  • slicing, searching, splitting, joining, formatting, padding, or other string algorithms
  • Python object model concerns
  • interpreter-specific wrappers

Those belong in crate::str or the embedding runtime.

Architecture

A good split would be:

  • rustpython-unicode
    • owns Unicode tables and Unicode semantics
    • exposes u32-based predicates and mappings
    • exposes unicodedata-style query APIs
  • crate::str
    • owns Python string methods and higher-level string algorithms
    • calls into rustpython-unicode for all Unicode-sensitive behavior
  • regex engine
    • calls into rustpython-unicode for Unicode character classes
  • unicodedata module
    • becomes a thin wrapper over rustpython-unicode

This keeps one authoritative Unicode path across the runtime.
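A minimal sketch of that layering, assuming the string layer stores code points as `u32`. The `classify` module here is an ASCII-only stand-in for the real crate, and `str_isdecimal` is a hypothetical name for what would live in crate::str.

```rust
/// Stand-in for `rustpython_unicode::classify`. ASCII-only placeholder;
/// the real crate would consult generated Unicode tables.
mod classify {
    pub fn is_decimal(cp: u32) -> bool {
        (0x30..=0x39).contains(&cp)
    }
}

/// String-layer method (hypothetical): owns the Python-level rule
/// "non-empty and every code point is decimal", and delegates the
/// per-code-point question to the Unicode layer.
pub fn str_isdecimal(code_points: &[u32]) -> bool {
    !code_points.is_empty() && code_points.iter().all(|&cp| classify::is_decimal(cp))
}
```

The empty-string rule stays in the string layer; only the per-code-point predicate crosses the crate boundary.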

Suggested Public API Direction

pub mod classify {
    pub fn is_alpha(cp: u32) -> bool;
    pub fn is_alnum(cp: u32) -> bool;
    pub fn is_decimal(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_numeric(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
    pub fn is_printable(cp: u32) -> bool;
}

pub mod identifier {
    pub fn is_xid_start(cp: u32) -> bool;
    pub fn is_xid_continue(cp: u32) -> bool;
    pub fn is_python_identifier_start(cp: u32) -> bool;
    pub fn is_python_identifier_continue(cp: u32) -> bool;
}

pub mod regex {
    pub fn is_word(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
}

pub mod case {
    pub fn to_lowercase(cp: u32) -> CaseMapping;
    pub fn to_uppercase(cp: u32) -> CaseMapping;
    pub fn to_titlecase(cp: u32) -> CaseMapping;
    pub fn casefold(cp: u32) -> CaseMapping;
}

pub mod normalize {
    pub fn nfc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn is_normalized_nfc<I: IntoIterator<Item = u32>>(input: I) -> bool;
}

pub mod data {
    pub fn category(cp: u32) -> GeneralCategory;
    pub fn bidirectional(cp: u32) -> BidiClass;
    pub fn combining(cp: u32) -> u8;
    pub fn east_asian_width(cp: u32) -> EastAsianWidth;
    pub fn mirrored(cp: u32) -> bool;
    // The payload types of the next four `Option`s and of `lookup` are
    // placeholders (assumed, not final).
    pub fn decomposition(cp: u32) -> Option<Decomposition>;
    pub fn decimal(cp: u32) -> Option<u8>;
    pub fn digit(cp: u32) -> Option<u8>;
    pub fn numeric(cp: u32) -> Option<Numeric>;
    pub fn name(cp: u32) -> Option<&'static str>;
    pub fn lookup(name: &str) -> Option<u32>;
}

The exact types can change, but the important part is the boundary: low-level Unicode
semantics here, string algorithms elsewhere.
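As one possible shape for `CaseMapping`: full case mappings can expand a single code point to up to three, so a small fixed buffer is enough and avoids allocation. The layout and the tiny hand-written mapping below are assumptions for illustration, not the proposed implementation; the real crate would use generated tables.

```rust
/// Hypothetical `CaseMapping`: one to three code points, stored inline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CaseMapping {
    buf: [u32; 3],
    len: u8,
}

impl CaseMapping {
    /// Mapping to a single code point (the common case).
    fn one(cp: u32) -> Self {
        Self { buf: [cp, 0, 0], len: 1 }
    }

    /// Mapping to several code points; callers pass at most three.
    fn from_slice(cps: &[u32]) -> Self {
        let mut buf = [0u32; 3];
        buf[..cps.len()].copy_from_slice(cps);
        Self { buf, len: cps.len() as u8 }
    }

    /// The mapped code points, in order.
    pub fn as_slice(&self) -> &[u32] {
        &self.buf[..self.len as usize]
    }
}

/// Illustrative excerpt only: ASCII letters plus two full mappings.
pub fn to_uppercase(cp: u32) -> CaseMapping {
    match cp {
        0x61..=0x7A => CaseMapping::one(cp - 0x20),
        0x00DF => CaseMapping::from_slice(&[0x53, 0x53]), // U+00DF (sharp s) -> "SS"
        0xFB03 => CaseMapping::from_slice(&[0x46, 0x46, 0x49]), // U+FB03 (ffi ligature) -> "FFI"
        _ => CaseMapping::one(cp), // unmapped values, including surrogates, map to themselves
    }
}
```

Returning a value type rather than an iterator keeps the function usable from both the string layer and the regex engine without lifetime concerns.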

Compatibility Model

The crate should define compatibility against a specific CPython line, for example:

  • CPython 3.14 Unicode semantics
  • version-pinned generated tables
  • explicit regeneration workflow when upgrading CPython

That matters because “Unicode-correct” is not enough here. The target is
“CPython-compatible.”
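One concrete case of the gap, as far as I understand CPython's behavior: `str.isspace()` accepts U+001C..U+001F (the information separators), while Rust's `char::is_whitespace` follows the Unicode `White_Space` property and rejects them. The function below is a hand-written sketch with a hypothetical name; the real crate would generate this set from pinned data.

```rust
/// Sketch of Python-style whitespace classification over `u32` code points.
/// The set is hand-copied for illustration and is an assumption about
/// CPython's whitespace table, not generated data.
pub fn py_is_space(cp: u32) -> bool {
    matches!(
        cp,
        0x0009..=0x000D
            | 0x001C..=0x001F // accepted by Python, not in White_Space
            | 0x0020
            | 0x0085
            | 0x00A0
            | 0x1680
            | 0x2000..=0x200A
            | 0x2028
            | 0x2029
            | 0x202F
            | 0x205F
            | 0x3000
    )
}
```

With this sketch, `py_is_space(0x1C)` is true while `'\u{1C}'.is_whitespace()` is false, which is exactly the kind of divergence that a generic Unicode crate would not catch.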

Why This Is Better Than Ad Hoc Fixes

  • str, re, parser, and unicodedata stop drifting apart.
  • There is one authoritative source for Unicode behavior.
  • Compatibility work becomes table-driven instead of patch-driven.
  • Future CPython upgrades become more mechanical and auditable.
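A rough sketch of what "table-driven" could look like: a sorted range table searched with binary search. The table here is a three-row hand-written excerpt and the names are hypothetical; in the crate the table would be emitted by the regeneration workflow.

```rust
use std::cmp::Ordering;

/// Illustrative excerpt: sorted, non-overlapping inclusive ranges of
/// decimal digits (general category Nd). Not a complete table.
static DECIMAL_RANGES: &[(u32, u32)] = &[
    (0x0030, 0x0039), // ASCII digits
    (0x0660, 0x0669), // Arabic-Indic digits
    (0xFF10, 0xFF19), // fullwidth digits
];

/// Binary search over inclusive ranges: each range is compared against
/// `cp`, so a range above `cp` is "Greater" and one below is "Less".
pub fn in_ranges(table: &[(u32, u32)], cp: u32) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if cp < lo {
                Ordering::Greater
            } else if cp > hi {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Predicate backed purely by the table above.
pub fn is_decimal(cp: u32) -> bool {
    in_ranges(DECIMAL_RANGES, cp)
}
```

Under this shape, a CPython upgrade changes the generated table and leaves the lookup code untouched, so the diff to review is data only.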
