Replace unmaintained unic crates
#7555
Merged
youknowone
merged 12 commits into
RustPython:main
from
ShaharNaveh:replace-unmaintained-unic-char
Apr 2, 2026
6abe80a Use maintained crates (ShaharNaveh)
36c3886 Fix literal (ShaharNaveh)
23335bd Fix some of vm (ShaharNaveh)
de8cd0a Fix vm (ShaharNaveh)
15b4a82 Fix stdlib (ShaharNaveh)
4789fc0 Migrate more at stdlib (ShaharNaveh)
b0583ab Use shortform (ShaharNaveh)
47dadaf Mark failing test (ShaharNaveh)
273ad65 Fix test marks (ShaharNaveh)
5a776f5 cargo shear (ShaharNaveh)
0cc47eb Clippy (ShaharNaveh)
3245cd1 clippy (ShaharNaveh)
Fix stdlib
commit 15b4a828e710d70c1b1fae52b224ca6d75aa4e09
🧩 Analysis chain
🌐 Web query: ICU 2.1.2 unicode version support which unicode standard version
💡 Result: ICU 2.1.2 supports Unicode Standard version 2.1.2.

🌐 Web query: ICU Unicode data version 15 16 icu_normalizer icu_properties 2.1
💡 Result: ICU Unicode data versions 15 and 16 are supported in ICU4C releases: ICU 72/74 for Unicode 15/15.1, ICU 76 for Unicode 16. ICU 78 uses Unicode 17. In ICU4X (Rust), the `icu_normalizer` and `icu_properties` crates at version 2.1 (released 2025-10-28) use data generated from ICU version release-78.1rc / icu4x/2025-10-28 tag and CLDR 48, corresponding to Unicode 17 (not 15 or 16). Earlier ICU4X versions aligned with Unicode 15 (ICU 74, 2023) and 16 (ICU 76, 2024). `icu_normalizer` depends on `icu_properties` for decomposition/composition properties. No specific "icu_properties 2.1" data version beyond the shared ICU4X data pipeline.
🌐 Web query: unic-ucd-age 0.9.0 unicode version support UNIC
💡 Result: The `unic-ucd-age` 0.9.0 crate, part of the UNIC (Unicode and Internationalization Crates for Rust) project, provides access to the Unicode character Age property from the Unicode Character Database (UCD). It supports Unicode 10.0.0 data, as indicated in the project's documentation for the `unic` super-crate version 0.9.0, which explicitly states it includes Unicode 10.0.0 data and algorithms. The crate exposes a `UNICODE_VERSION` constant representing the Unicode version of its data and includes the `Age` struct and `CharAge` trait for querying when characters were first assigned in Unicode versions. Released on 2019-03-03, it depends on `unic-ucd-version` ^0.9.0 and is marked as unmaintained. Example usage: `Age::of('A')` returns `UnicodeVersion { major: 1, minor: 1, micro: 0 }`, matching Unicode 1.1 assignment.
Resolve the mixed Unicode database versions.

`Age::of()` and `UNICODE_VERSION` come from `unic_ucd_age` 0.9.0 (Unicode 10.0.0), but `category()`, `bidirectional()`, `east_asian_width()`, `normalize()`, `is_normalized()`, and `mirrored()` all use ICU properties and normalization (Unicode 17.0). This causes `unidata_version` to report 10.0.0 while queries run against Unicode 17.0 data, breaking Python's contract that `unicodedata.unidata_version` describes the actual Unicode database used by the module. Characters added after Unicode 10.0.0 will either be filtered out by `check_age()` or return incorrect properties, causing silent data corruption.
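To make the mismatch concrete, here is a minimal, std-only sketch of an age gate backed by a table frozen at Unicode 10.0.0. Everything in it is an assumption for illustration: `true_age`, `stale_age`, and this `check_age` signature are hypothetical stand-ins, not the code in `crates/stdlib/src/unicodedata.rs`, and the three-entry table is a toy substitute for real UCD data.

```rust
/// A Unicode version as (major, minor, micro); tuples compare lexicographically.
type Version = (u8, u8, u8);

/// Hypothetical version of the age table (standing in for unic_ucd_age 0.9.0).
const AGE_DATA_VERSION: Version = (10, 0, 0);

/// Toy stand-in for the Age property: the version in which `c` was assigned.
fn true_age(c: char) -> Option<Version> {
    match c {
        'A' => Some((1, 1, 0)),
        '\u{20AC}' => Some((2, 1, 0)),   // EURO SIGN
        '\u{1F970}' => Some((11, 0, 0)), // SMILING FACE WITH SMILING EYES AND THREE HEARTS
        _ => None,
    }
}

/// What a table frozen at AGE_DATA_VERSION can report:
/// anything assigned later looks unassigned.
fn stale_age(c: char) -> Option<Version> {
    true_age(c).filter(|age| *age <= AGE_DATA_VERSION)
}

/// Hypothetical age gate: accept `c` only if the stale table knows it
/// and it was assigned no later than `target`.
fn check_age(c: char, target: Version) -> bool {
    stale_age(c).is_some_and(|age| age <= target)
}
```

Under these assumptions, `check_age('\u{1F970}', AGE_DATA_VERSION)` is `false` even though a property lookup backed by newer data would recognise the character, which is the kind of divergence the review comment describes.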
What is the recommended replacement crate for `unic-ucd-age`?
@ShaharNaveh, the recommended replacement for `unic-ucd-age` is the `icu_properties` crate, which you're already pulling in as a workspace dependency in this PR.

The `icu_properties` crate provides definitions of Unicode Properties and APIs for retrieving property data in an appropriate data structure, and is published both as its own crate and as part of the `icu` meta-crate.

The `Age` property is available via `icu_properties::props::Age` (an enumerated property), accessible through the standard `CodePointMapData` API.

For `UNICODE_VERSION` (the constant used to report `unidata_version` and drive `check_age`), you can derive it from `icu_properties_data`. `icu_properties_data` ships data generated with a specific ICU and Unicode version, so you'd hard-code or derive the version string to match the `icu_properties`/`icu_normalizer` data actually bundled. This is also the correct fix for the mixed-version issue flagged in my earlier review comment: once `unic_ucd_age` is removed, `unidata_version` will accurately reflect the Unicode version used by all the ICU-backed functions.

The migration would look roughly like:

- Remove `unic_ucd_age` from `crates/stdlib/Cargo.toml`
- Replace `Age::of(c)` in `check_age()` with `CodePointMapData::<Age>::new().get(c)` (returning an `Age` enum value you can compare against a target version)
- Replace `UNICODE_VERSION` with the corresponding ICU Unicode version constant (e.g., hard-coded or from a `const` based on the compiled data version)

Would you like me to draft the full diff for this migration?
`icu_properties::props::Age` doesn't exist
🧩 Analysis chain
@ShaharNaveh, you're absolutely right, and I apologize for the incorrect suggestion. After verifying the `icu_properties::props` module docs, `Age` is indeed not exposed as a property in `icu_properties` v2. The `props` module includes things like `BidiClass`, `GeneralCategory`, `EastAsianWidth`, `CanonicalCombiningClass`, `XidStart`/`XidContinue`, etc., but not `Age`.

So the options for resolving the mixed-version concern are:

1. Keep `unic_ucd_age` as-is: it's the last remaining `unic` crate in use, but it serves a narrow and specific purpose (age-based filtering in `check_age()` and reporting `UNICODE_VERSION`). Since replacing it would require either a different crate or significant rework, this may be acceptable as a known limitation to address in a follow-up.
2. Hard-code the Unicode version constant to match the ICU data version (e.g. Unicode 16 or 17 depending on which `icu_properties`/`icu_normalizer` data version is bundled), and remove the age-based filtering from `check_age()` entirely. This eliminates `unic_ucd_age` at the cost of dropping the age-based character filtering that `UCD.ucd_3_2_0` relies on.
3. Use a different crate that provides the Age property, such as the `unicode-age` or `ucd-trie`-based crates, but those may have their own maintenance concerns.

Given the scope of this PR (replacing unmaintained `unic` crates), option 1 (keeping `unic_ucd_age` for now) seems most pragmatic. The mixed-version issue I flagged is a real concern but may be pre-existing and better addressed separately. Would you like me to open a follow-up issue to track this?
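For illustration only, the hard-coded-constant option might be sketched as below. The `UnicodeVersion` struct and `unidata_version` function are hypothetical names, and the value 17.0.0 is an assumption taken from the search result quoted earlier in this thread; it would need to be checked against the data actually bundled with the ICU4X crates in use.

```rust
use std::fmt;

/// A Unicode version triple; derived ordering compares major, then minor, then micro.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UnicodeVersion {
    major: u8,
    minor: u8,
    micro: u8,
}

impl fmt::Display for UnicodeVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}.{}.{}", self.major, self.minor, self.micro)
    }
}

/// ASSUMPTION: the bundled ICU4X data corresponds to Unicode 17.0.0.
/// This constant must be updated by hand whenever the ICU4X crates are bumped.
const UNICODE_VERSION: UnicodeVersion = UnicodeVersion { major: 17, minor: 0, micro: 0 };

/// Hypothetical helper producing the string reported as `unidata_version`.
fn unidata_version() -> String {
    UNICODE_VERSION.to_string()
}
```

The trade-off is that nothing ties the constant to the compiled data, so a dependency bump can silently reintroduce a version mismatch unless a test or review step guards it.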
💡 Result:
In ICU4X `icu_properties` v2.x (incl. `icu_properties` 2.1.2) there is currently no `props::Age` type, so you can't do `CodePointMapData::<props::Age>` (or load an "Age" property map) via the Rust API. The `props` module lists the supported properties, and Age is not among them. [1]

What is supported is the general pattern: enumerated properties (like `Script`) are queried via `CodePointMapData<T>`, e.g. `CodePointMapData::<Script>::new().get('木')`. [2]

Sources:
[1] `icu_properties::props` module items (supported properties list) (docs.rs)
[2] `icu::properties` docs showing `CodePointMapData` usage for enumerated properties (docs.rs)
💡 Result:
icu4x's Rust Unicode-properties API is the `icu_properties` crate (also re-exported as `icu::properties`). It exposes properties via marker types in `icu_properties::props` (used with `CodePointSetData`/`CodePointMapData`) and helper modules like `sets`/`maps`. (docs.rs)

However, the Unicode `Age` property (the Unicode version a code point was first assigned) is not currently implemented/exposed by `icu_properties`: it does not appear in the crate's "all items" list of available properties. (docs.rs)

If you specifically need `Age`, you have to source it elsewhere, for example:

- ... `Age` property is defined as the version of Unicode in which the code point was assigned). (unicode.org)
- ... `Age`/`u_charAge`). (unicode-org.github.io)
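One possible way to source `Age` independently, sketched here as an assumption and not as something proposed in this PR, is to build a lookup table from the UCD's `DerivedAge.txt`, whose data lines have the form `START[..END] ; MAJOR.MINOR # comment`. The function names and the three sample lines below are illustrative; a real table would be generated from the full file for the Unicode version being targeted.

```rust
use std::cmp::Ordering;

/// One table entry: inclusive code point range and the (major, minor) age.
type AgeRange = (u32, u32, (u8, u8));

/// Illustrative sample in the DerivedAge.txt line format (not the full file).
const SAMPLE: &str = "\
# DerivedAge-style sample
0020..007E    ; 1.1 #  [95] SPACE..TILDE
20AC          ; 2.1 #       EURO SIGN
1F96C..1F970  ; 11.0 #   [5] LEAFY GREEN..SMILING FACE WITH SMILING EYES AND THREE HEARTS
";

/// Parse one line; comment-only and blank lines yield `None`.
fn parse_line(line: &str) -> Option<AgeRange> {
    // Drop the trailing comment, then split "range ; version".
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, version) = data.split_once(';')?;
    let range = range.trim();
    // A single code point is treated as a one-element range.
    let (start, end) = range.split_once("..").unwrap_or((range, range));
    let start = u32::from_str_radix(start.trim(), 16).ok()?;
    let end = u32::from_str_radix(end.trim(), 16).ok()?;
    let (major, minor) = version.trim().split_once('.')?;
    Some((start, end, (major.parse().ok()?, minor.parse().ok()?)))
}

/// Parse a whole file into a table sorted by range start.
fn parse_ages(text: &str) -> Vec<AgeRange> {
    let mut table: Vec<AgeRange> = text.lines().filter_map(parse_line).collect();
    table.sort_by_key(|entry| entry.0);
    table
}

/// Binary-search the sorted, non-overlapping ranges for `c`.
fn age_of(table: &[AgeRange], c: char) -> Option<(u8, u8)> {
    let cp = c as u32;
    table
        .binary_search_by(|&(start, end, _)| {
            if cp < start {
                Ordering::Greater
            } else if cp > end {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .ok()
        .map(|index| table[index].2)
}
```

With `let table = parse_ages(SAMPLE);`, a call such as `age_of(&table, 'A')` looks up the range containing U+0041. Code points absent from the table come back as `None`, which corresponds to "unassigned" for the data version the table was built from.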
@ShaharNaveh, you're correct, and I apologize for the bad suggestion. The `Age` property is not currently implemented or exposed by `icu_properties`; it does not appear in the `icu_properties::props` module's list of supported properties.

So `unic_ucd_age` cannot be replaced by `icu_properties` at this time. The practical options are:

- Keep `unic_ucd_age` as-is for this PR: it's the only remaining `unic` crate after your changes, serving the specific purpose of age-based filtering and version reporting. This is a reasonable stopping point for this PR.

Given the scope of this PR, keeping `unic_ucd_age` and deferring full replacement seems most pragmatic. Would you like me to open a follow-up issue to track replacing `unic_ucd_age` once `icu_properties` gains Age support, or if a suitable alternative crate is found?