rustpython-unicode

  rustpython-unicode should be a shared crate that provides CPython-compatible Unicode
  semantics and Unicode data for Rust-based Python implementations and tools.

  Its purpose is not to implement Python string objects or high-level string methods. Instead,
  it should provide the Unicode foundation that Python runtimes need: character classification,
  case mapping, normalization, identifier rules, regex character-class predicates, and
  unicodedata-style access to Unicode database information.

  This makes it a natural shared dependency between RustPython and Pyre, while keeping actual
  string operations in crate::str or the host runtime.

  Design Goals

  - Match CPython behavior at the Unicode data and semantics level, not just in a few visible
    edge cases.
  - Use a single version-pinned Unicode data source per targeted CPython release.
  - Expose low-level APIs that can be reused by str, re, parser, compiler, and unicodedata.
  - Support non-scalar code points where Python behavior requires it. A Python str can hold
    lone surrogates (U+D800 to U+DFFF), which Rust's char cannot represent, so the core API
    should be u32-based rather than char-based.
  - Be usable outside RustPython, especially by other Python-related Rust projects.
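
  To make the u32-versus-char goal concrete, here is a minimal sketch of a code point wrapper. The CodePoint type and its methods are hypothetical names used only for illustration, not part of any existing crate. It accepts everything Python's chr() accepts, including lone surrogates, and converts to char only when the value is a Unicode scalar value.

```rust
/// A code point in Python's sense: any value in 0..=0x10FFFF,
/// including lone surrogates that Rust's `char` rejects.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CodePoint(u32);

impl CodePoint {
    /// Highest valid Unicode code point.
    pub const MAX: u32 = 0x10FFFF;

    /// Accepts every value in the Unicode code space, surrogates included.
    pub fn new(value: u32) -> Option<Self> {
        (value <= Self::MAX).then_some(CodePoint(value))
    }

    /// Returns the raw numeric value.
    pub fn value(self) -> u32 {
        self.0
    }

    /// True for U+D800..=U+DFFF, which have no `char` representation.
    pub fn is_surrogate(self) -> bool {
        (0xD800..=0xDFFF).contains(&self.0)
    }

    /// Converts to `char` only when the code point is a Unicode scalar value.
    pub fn to_char(self) -> Option<char> {
        char::from_u32(self.0)
    }
}
```

  A char-based API would have to reject or replace the surrogate case before any table lookup, which is exactly where behavior could diverge from CPython.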

  Scope

  rustpython-unicode should include:

  - Unicode character classification used by Python:
      - isalpha
      - isalnum
      - isdecimal
      - isdigit
      - isnumeric
      - isspace
      - isprintable
      - casing predicates (islower, isupper, istitle) if needed
  - Identifier-related predicates:
      - is_xid_start
      - is_xid_continue
      - Python identifier helpers
  - Regex-oriented Unicode predicates:
      - Unicode \w
      - Unicode \d
      - Unicode \s
      - any other CPython regex character classes that depend on Unicode tables
  - Case conversion and casing-related data:
      - lowercase
      - uppercase
      - titlecase
      - casefold
      - full mappings where Python requires them
  - Unicode normalization support needed by Python:
      - NFC
      - NFD
      - NFKC
      - NFKD
      - is_normalized
  - unicodedata-style database access:
      - general category
      - bidirectional class
      - combining class
      - east asian width
      - mirrored
      - decomposition
      - decimal/digit/numeric values
      - name lookup
      - character lookup by name
      - Unicode age/version checks if needed for CPython compatibility layers
  - Versioned Unicode tables aligned with CPython.
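
  As an illustration of the "Python identifier helpers" item, the sketch below layers a whole-identifier check on top of per-code-point XID predicates. It assumes the CPython rule that an identifier is non-empty, starts with an underscore or an XID_Start code point, and continues with XID_Continue code points. The function name is_python_identifier and the two ASCII-only predicates are placeholders standing in for real table lookups.

```rust
/// Checks a sequence of code points against the assumed CPython identifier
/// rule: non-empty, first code point is `_` or XID_Start, the rest are
/// XID_Continue. The predicates are passed in so the table source can vary.
pub fn is_python_identifier(
    cps: &[u32],
    is_xid_start: impl Fn(u32) -> bool,
    is_xid_continue: impl Fn(u32) -> bool,
) -> bool {
    match cps.split_first() {
        // The empty string is never an identifier.
        None => false,
        Some((&first, rest)) => {
            (first == u32::from(b'_') || is_xid_start(first))
                && rest.iter().all(|&cp| is_xid_continue(cp))
        }
    }
}

/// Placeholder: ASCII letters only. A real implementation would consult
/// the version-pinned XID_Start table.
pub fn ascii_xid_start(cp: u32) -> bool {
    matches!(cp, 0x41..=0x5A | 0x61..=0x7A)
}

/// Placeholder: ASCII letters, digits, and underscore. A real implementation
/// would consult the version-pinned XID_Continue table.
pub fn ascii_xid_continue(cp: u32) -> bool {
    ascii_xid_start(cp) || matches!(cp, 0x30..=0x39) || cp == 0x5F
}
```

  Passing the predicates as parameters keeps the rule itself independent of the tables, so the same helper can be tested against tiny stand-in tables and used with the generated ones.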

  Non-Goals

  rustpython-unicode should not include:

  - Python str object behavior
  - slicing, searching, splitting, joining, formatting, padding, or other string algorithms
  - Python object model concerns
  - interpreter-specific wrappers

  Those belong in crate::str or the embedding runtime.

  Architecture

  A good split would be:

  - rustpython-unicode
      - owns Unicode tables and Unicode semantics
      - exposes u32-based predicates and mappings
      - exposes unicodedata-style query APIs
  - crate::str
      - owns Python string methods and higher-level string algorithms
      - calls into rustpython-unicode for all Unicode-sensitive behavior
  - regex engine
      - calls into rustpython-unicode for Unicode character classes
  - unicodedata module
      - becomes a thin wrapper over rustpython-unicode

  This keeps one authoritative Unicode path across the runtime.
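
  The layering can be sketched as follows. The regex predicates are defined purely in terms of the classification layer, so the regex engine and str cannot disagree about what counts as a digit or a space. The Classify trait, the regex_is_* functions, and the AsciiOnly test implementation are hypothetical names. The mapping assumed here follows CPython's re behavior for Unicode patterns: \w is isalnum plus underscore, \d is isdecimal, and \s is isspace.

```rust
/// The classification surface the regex layer depends on.
/// (Hypothetical trait; a real crate might expose free functions instead.)
pub trait Classify {
    fn is_alnum(&self, cp: u32) -> bool;
    fn is_decimal(&self, cp: u32) -> bool;
    fn is_space(&self, cp: u32) -> bool;
}

/// Unicode `\w`: alphanumeric per the classifier, plus underscore.
pub fn regex_is_word<C: Classify>(c: &C, cp: u32) -> bool {
    cp == 0x5F || c.is_alnum(cp)
}

/// Unicode `\d`: decimal digits per the classifier.
pub fn regex_is_digit<C: Classify>(c: &C, cp: u32) -> bool {
    c.is_decimal(cp)
}

/// Unicode `\s`: whitespace per the classifier.
pub fn regex_is_space<C: Classify>(c: &C, cp: u32) -> bool {
    c.is_space(cp)
}

/// Stand-in classifier covering ASCII only, used here in place of the
/// generated Unicode tables.
pub struct AsciiOnly;

impl Classify for AsciiOnly {
    fn is_alnum(&self, cp: u32) -> bool {
        matches!(cp, 0x30..=0x39 | 0x41..=0x5A | 0x61..=0x7A)
    }
    fn is_decimal(&self, cp: u32) -> bool {
        matches!(cp, 0x30..=0x39)
    }
    fn is_space(&self, cp: u32) -> bool {
        // Tab through carriage return, the four separator controls, and space.
        matches!(cp, 0x09..=0x0D | 0x1C..=0x1F | 0x20)
    }
}
```

  Because the regex functions contain no table data of their own, a fix to the classification tables reaches every consumer at once.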

  Suggested Public API Direction

  pub mod classify {
      pub fn is_alpha(cp: u32) -> bool;
      pub fn is_alnum(cp: u32) -> bool;
      pub fn is_decimal(cp: u32) -> bool;
      pub fn is_digit(cp: u32) -> bool;
      pub fn is_numeric(cp: u32) -> bool;
      pub fn is_space(cp: u32) -> bool;
      pub fn is_printable(cp: u32) -> bool;
  }

  pub mod identifier {
      pub fn is_xid_start(cp: u32) -> bool;
      pub fn is_xid_continue(cp: u32) -> bool;
      pub fn is_python_identifier_start(cp: u32) -> bool;
      pub fn is_python_identifier_continue(cp: u32) -> bool;
  }

  pub mod regex {
      pub fn is_word(cp: u32) -> bool;
      pub fn is_digit(cp: u32) -> bool;
      pub fn is_space(cp: u32) -> bool;
  }

  pub mod case {
      pub fn to_lowercase(cp: u32) -> CaseMapping;
      pub fn to_uppercase(cp: u32) -> CaseMapping;
      pub fn to_titlecase(cp: u32) -> CaseMapping;
      pub fn casefold(cp: u32) -> CaseMapping;
  }

  pub mod normalize {
      pub fn nfc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
      pub fn nfd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
      pub fn nfkc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
      pub fn nfkd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
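      pub fn is_normalized_nfc<I: IntoIterator<Item = u32>>(input: I) -> bool;
      pub fn is_normalized_nfd<I: IntoIterator<Item = u32>>(input: I) -> bool;
      pub fn is_normalized_nfkc<I: IntoIterator<Item = u32>>(input: I) -> bool;
      pub fn is_normalized_nfkd<I: IntoIterator<Item = u32>>(input: I) -> bool;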
  }

  pub mod data {
      pub fn category(cp: u32) -> GeneralCategory;
      pub fn bidirectional(cp: u32) -> BidiClass;
      pub fn combining(cp: u32) -> u8;
      pub fn east_asian_width(cp: u32) -> EastAsianWidth;
      pub fn mirrored(cp: u32) -> bool;
      pub fn decomposition(cp: u32) -> Option<Decomposition>;
      pub fn decimal(cp: u32) -> Option<u8>;
      pub fn digit(cp: u32) -> Option<u8>;
      pub fn numeric(cp: u32) -> Option<NumericValue>;
      pub fn name(cp: u32) -> Option<&'static str>;
      pub fn lookup(name: &str) -> Option<u32>;
  }

  The exact types can change, but the important part is the boundary: low-level Unicode
  semantics here, string algorithms elsewhere.
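
  The CaseMapping return type is left undefined above. One possible shape, sketched here under stated assumptions, is a small fixed-capacity value holding up to three code points, since full case mappings can expand a single code point (for example U+00DF to "SS"). The field layout and the as_slice method are hypothetical. The to_uppercase body borrows the Rust standard library's mapping only as a placeholder: std follows its own Unicode version rather than CPython's, so a real implementation would read from the version-pinned tables instead.

```rust
/// Result of a full case mapping: one to three code points, stored inline.
/// (Hypothetical layout, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CaseMapping {
    len: u8,
    cps: [u32; 3],
}

impl CaseMapping {
    /// A mapping that yields exactly one code point.
    pub fn single(cp: u32) -> Self {
        CaseMapping { len: 1, cps: [cp, 0, 0] }
    }

    /// The mapped code points, in order.
    pub fn as_slice(&self) -> &[u32] {
        &self.cps[..self.len as usize]
    }
}

/// Placeholder uppercase mapping. Scalar values are delegated to std;
/// surrogates and out-of-range values map to themselves.
pub fn to_uppercase(cp: u32) -> CaseMapping {
    match char::from_u32(cp) {
        None => CaseMapping::single(cp),
        Some(ch) => {
            let mut out = CaseMapping { len: 0, cps: [0; 3] };
            // std's uppercase iterator yields at most three chars.
            for upper in ch.to_uppercase() {
                out.cps[out.len as usize] = u32::from(upper);
                out.len += 1;
            }
            out
        }
    }
}
```

  Returning a small inline value avoids allocating for the common one-to-one case while still covering the expansions Python requires.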

  Compatibility Model

  The crate should define compatibility against a specific CPython line, for example:

  - CPython 3.14 Unicode semantics
  - version-pinned generated tables
  - explicit regeneration workflow when upgrading CPython

  That matters because “Unicode-correct” is not enough here. The target is
  “CPython-compatible.”
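
  One way to make the pinning explicit is to record the targeted CPython line and the Unicode version of the generated tables as constants, and fail the build if they disagree. This is a minimal sketch: the names CompatProfile, TARGET, GENERATED_TABLES_VERSION, and unidata_version are hypothetical, and the Unicode version 16.0.0 for CPython 3.14 is an assumption that should be checked against unicodedata.unidata_version on the targeted CPython build.

```rust
/// A Unicode version, ordered lexicographically by (major, minor, patch).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct UnicodeVersion {
    pub major: u8,
    pub minor: u8,
    pub patch: u8,
}

/// The CPython line this build claims compatibility with.
pub struct CompatProfile {
    pub cpython: (u8, u8),
    pub unicode: UnicodeVersion,
}

/// Assumed target: CPython 3.14 with Unicode 16.0.0 (verify before relying on it).
pub const TARGET: CompatProfile = CompatProfile {
    cpython: (3, 14),
    unicode: UnicodeVersion { major: 16, minor: 0, patch: 0 },
};

/// In a real crate this constant would be emitted by the table generator.
pub const GENERATED_TABLES_VERSION: UnicodeVersion =
    UnicodeVersion { major: 16, minor: 0, patch: 0 };

// Compile-time guard: regenerating tables for another Unicode version
// without updating TARGET (or the reverse) fails the build.
const _: () = assert!(
    TARGET.unicode.major == GENERATED_TABLES_VERSION.major
        && TARGET.unicode.minor == GENERATED_TABLES_VERSION.minor
        && TARGET.unicode.patch == GENERATED_TABLES_VERSION.patch
);

/// Version string in the same "major.minor.patch" form that
/// `unicodedata.unidata_version` uses.
pub fn unidata_version() -> String {
    let v = GENERATED_TABLES_VERSION;
    format!("{}.{}.{}", v.major, v.minor, v.patch)
}
```

  With a guard like this, upgrading to a new CPython release becomes a deliberate two-step change: regenerate the tables, then update the declared target.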

  Why This Is Better Than Ad Hoc Fixes

  - str, re, parser, and unicodedata stop drifting apart.
  - There is one authoritative source for Unicode behavior.
  - Compatibility work becomes table-driven instead of patch-driven.
  - Future CPython upgrades become more mechanical and auditable.

rustpython-unicode #7560
