unicodedata: Const, embedded version #8131

Open
joshuamegnauth54 wants to merge 1 commit into RustPython:main from joshuamegnauth54:ucd-const-version

@joshuamegnauth54 commented Jun 19, 2026

Contributor

Summary

unicodedata's version string can be const-evaluated, so it doesn't need to allocate.

Summary by CodeRabbit

  • Chores
    • Updated Unicode data version handling to use a simplified “modern vs legacy” selection, ensuring consistent behavior across Unicode lookups.
  • Bug Fixes
    • Improved numeric and property lookup consistency by unifying the modern/legacy branching across related Unicode queries.
  • Documentation
    • Adjusted the exposed Unicode data version value to be a string with a stable static lifetime (unidata_version), aligning it across module and instance accessors.

coderabbitai Bot commented Jun 19, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 15e4d831-08ee-4a2d-8d0f-77c9e8716c14

📥 Commits

Reviewing files that changed from the base of the PR and between 5a42103 and 6315476.

📒 Files selected for processing (2)
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs

📝 Walkthrough

Walkthrough

build.rs is updated to emit a RUST_UNICODE_VERSION Cargo environment variable derived from char::UNICODE_VERSION components. unicodedata.rs removes the UnicodeVersion struct and UNICODE_VERSION constant, replacing the Ucd struct's version field with a modern: bool flag. All UCD method branches and lookup_numeric_val are updated accordingly, and unidata_version is simplified to return a &'static str.

Changes

Unicode Version Flag Refactor

| Layer | File(s) | Summary |
|---|---|---|
| Build-time RUST_UNICODE_VERSION env var | crates/stdlib/build.rs | Adds a println! Cargo directive to set RUST_UNICODE_VERSION from char::UNICODE_VERSION components as a compile-time env var. |
| Ucd struct and lookup_numeric_val refactor | crates/stdlib/src/unicodedata.rs | Removes UnicodeVersion/UNICODE_VERSION and fmt imports; changes Ucd's stored field to modern: bool with new(modern: bool); updates lookup_numeric_val signature to (ch, modern: bool); updates module imports and module_exec to use Ucd::new(true). |
| UCD method branching to self.modern | crates/stdlib/src/unicodedata.rs | All UCD query methods (category, bidirectional, east_asian_width, mirrored, combining, decomposition, numeric_type_matches, digit, decimal) replace self.unic_version.major comparisons with self.modern. |
| Python-exposed version API | crates/stdlib/src/unicodedata.rs | numeric switches to lookup_numeric_val(ch, self.modern); unidata_version getter and module-level attribute become const fn returning &'static str (RUST_UNICODE_VERSION or "3.2.0"); ucd_3_2_0 returns Ucd::new(false). |
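
The shape described in the walkthrough can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: the constant `MODERN_UNICODE_VERSION` and its value are placeholders assumed for illustration (the PR reportedly reads the value from the `RUST_UNICODE_VERSION` compile-time env var instead), and the real `Ucd` type carries Python-object plumbing that is omitted here.

```rust
// Placeholder (assumption): stands in for the version string that the PR
// reportedly obtains from the RUST_UNICODE_VERSION compile-time env var.
pub const MODERN_UNICODE_VERSION: &str = "16.0.0";

/// Simplified stand-in for the `Ucd` object: a single flag replaces the
/// previously stored version struct.
pub struct Ucd {
    modern: bool,
}

impl Ucd {
    /// `true` selects the modern tables, `false` the legacy 3.2.0 behavior.
    pub const fn new(modern: bool) -> Self {
        Self { modern }
    }

    /// Both branches are string literals with a 'static lifetime, so the
    /// getter can be a `const fn` and never allocates.
    pub const fn unidata_version(&self) -> &'static str {
        if self.modern {
            MODERN_UNICODE_VERSION
        } else {
            "3.2.0"
        }
    }
}
```

Under these assumptions, `Ucd::new(false)` plays the role of `ucd_3_2_0` and `Ucd::new(true)` the role of the module-level default.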

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A version struct too big to keep,
I swapped it for a boolean leap!
modern: bool — so clean, so bright,
The old UnicodeVersion lost the fight.
Now "3.2.0" stays frozen in time,
And build.rs sets the version at compile-time! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 63.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: refactoring unicodedata to use const-evaluated, embedded version information instead of runtime allocation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/stdlib/build.rs`:
- Around line 659-664: The escaped quotes in the println! macro's format string
are causing literal quote characters to be embedded in the RUST_UNICODE_VERSION
environment variable value. Remove the `\"` escape sequences from the format
string so that the version string contains only the numeric value without
literal quotes. The format string should output the version directly as `15.1.0`
rather than `"15.1.0"`.
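
A minimal sketch of the difference this comment describes, assuming the directive is built with a plain format string. The helper names `quoted_directive` and `plain_directive` are hypothetical and do not appear in the PR; only the `cargo:rustc-env=NAME=VALUE` build-script directive and `char::UNICODE_VERSION` are existing Cargo and std features.

```rust
/// Hypothetical helper: reproduces the reported problem. The escaped quotes
/// become part of the value that Cargo stores in the env var.
pub fn quoted_directive(name: &str, v: (u8, u8, u8)) -> String {
    format!("cargo:rustc-env={}=\"{}.{}.{}\"", name, v.0, v.1, v.2)
}

/// Hypothetical helper: the suggested form. The value is just
/// `major.minor.patch` with no surrounding quote characters.
pub fn plain_directive(name: &str, v: (u8, u8, u8)) -> String {
    format!("cargo:rustc-env={}={}.{}.{}", name, v.0, v.1, v.2)
}
```

In a build script, the second form would be emitted with something like `println!("{}", plain_directive("RUST_UNICODE_VERSION", char::UNICODE_VERSION));` and read back in the crate via `env!("RUST_UNICODE_VERSION")`.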

In `@crates/stdlib/src/unicodedata.rs`:
- Line 486: The digit() method at line 486 is hardcoding true as the modern flag
in the lookup_numeric_val(ch, true) call, while the decimal() and numeric()
methods both use lookup_numeric_val(ch, self.modern). This inconsistency means
digit() will always use modern tables regardless of the Unicode version. Change
the hardcoded true to self.modern in the digit() method's lookup_numeric_val()
call to match the consistent behavior used in decimal() and numeric().
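
To illustrate why the hardcoded flag matters, here is a toy sketch. Everything in it is assumed for illustration: the lookup table is invented (a private-use code point is treated as a digit only in the "modern" table), the return type is simplified, and `digit_hardcoded` is a hypothetical name for the behavior the comment flags. The real `lookup_numeric_val` consults actual Unicode data.

```rust
/// Toy lookup (assumption): ASCII digits are known to both tables, while the
/// private-use code point U+E000 is pretended to be a digit only when
/// `modern` is true. This is invented data, not real Unicode properties.
pub fn lookup_numeric_val(ch: char, modern: bool) -> Option<u32> {
    match ch {
        '0'..='9' => ch.to_digit(10),
        '\u{E000}' if modern => Some(7),
        _ => None,
    }
}

pub struct Ucd {
    pub modern: bool,
}

impl Ucd {
    /// The flagged pattern: ignores `self.modern`, so a legacy instance
    /// still answers from the modern table.
    pub fn digit_hardcoded(&self, ch: char) -> Option<u32> {
        lookup_numeric_val(ch, true)
    }

    /// The suggested pattern: consistent with `decimal()` and `numeric()`.
    pub fn digit(&self, ch: char) -> Option<u32> {
        lookup_numeric_val(ch, self.modern)
    }
}
```

With a legacy instance (`Ucd { modern: false }`), the two methods disagree on the toy code point, which is the inconsistency the comment points out.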

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fca68cc1-521d-46c8-961b-917baeb34e36

📥 Commits

Reviewing files that changed from the base of the PR and between fe2a7db and 5a42103.

📒 Files selected for processing (2)
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs

@joshuamegnauth54

Contributor Author

I'm not sure, but the Windows test seems to have failed on something completely unrelated to this patch. 🤔

@ShaharNaveh

Contributor

I'm not sure, but the Windows test seems to have failed on something completely unrelated to this patch. 🤔

I've restarted it

@ShaharNaveh left a comment

Contributor

I like this cleanup!

tysm:)

2 participants