
gh-95555: Support Unicode property escapes \p{...} in regular expressions #151969

Open
serhiy-storchaka wants to merge 1 commit into python:main from serhiy-storchaka:re-properties


@serhiy-storchaka serhiy-storchaka commented Jun 23, 2026

Member

Add support for \p{property} and \P{property} escapes in Unicode (str) regular expressions, for the properties the engine can resolve without the unicodedata database. They are matched either as CATEGORY opcodes (character predicates and combinations of them) or as fixed sets of character ranges, so neither the matcher nor the compiler gains a unicodedata dependency.

Supported in this change:

  • many General_Category values — the groups L, N, Z, C and the values Lu, Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
  • the binary properties Alphabetic, Lowercase, Uppercase, Numeric, Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
  • the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph, lower, print, space, upper, word and xdigit;
  • the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point, Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.
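
As a rough illustration of the "fixed sets of character ranges" approach, the sketch below builds ordinary character classes for two of the code-point classes listed above, using today's `re` module. The range values are the stable definitions from the Unicode Standard; the `RANGES` table and the `char_class` helper are hypothetical names for this example and are not taken from the PR.

```python
import re

# Fixed code point ranges for two of the code-point classes named above.
# The values follow the Unicode Standard; the table layout is an assumption
# made for this example, not the PR's data structure.
RANGES = {
    "Join_Control": [(0x200C, 0x200D)],
    "Noncharacter_Code_Point": (
        [(0xFDD0, 0xFDEF)]
        # The last two code points of each of the 17 planes.
        + [(plane << 16 | 0xFFFE, plane << 16 | 0xFFFF) for plane in range(17)]
    ),
}


def char_class(name, negate=False):
    """Build a character-class pattern from fixed ranges (hypothetical helper)."""
    body = "".join(f"\\U{lo:08x}-\\U{hi:08x}" for lo, hi in RANGES[name])
    return f"[{'^' if negate else ''}{body}]"


# Roughly what \p{Noncharacter_Code_Point} is meant to match.
nonchar = re.compile(char_class("Noncharacter_Code_Point"))
print(bool(nonchar.fullmatch("\ufdd0")), bool(nonchar.fullmatch("a")))
```

Passing `negate=True` gives the complement, which is the role `\P{...}` plays for the same property.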

Property and value names use loose matching (UAX #44, rule UAX44-LM3), and a property may be spelled \p{Lu}, \p{gc=Lu} or \p{name=yes}.
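
To make the loose-matching rule and the three spellings concrete, here is a minimal sketch of how the text between the braces could be normalized and split. It covers the case, whitespace, underscore and hyphen parts of UAX44-LM3; the rule also ignores an initial "is" prefix, which is left out here for brevity. The function names `loose_key` and `parse_property` are hypothetical and do not come from the PR.

```python
import re


def loose_key(name):
    """Normalize a symbolic name: ignore case, whitespace, '_' and '-'.

    Simplified form of UAX44-LM3 (the optional "is" prefix is not handled).
    """
    return re.sub(r"[\s_\-]+", "", name).casefold()


def parse_property(body):
    """Split the text inside \\p{...} into (property, value) loose keys.

    A bare name such as "Lu" has no explicit property, so None is returned
    in the first slot and the caller decides how to resolve it.
    """
    name, sep, value = body.partition("=")
    if not sep:
        return (None, loose_key(name))
    return (loose_key(name), loose_key(value))


print(parse_property("Lu"))
print(parse_property("gc=Lu"))
print(parse_property("XID_Start = Yes"))
```

Normalizing both the lookup table keys and the user's spelling with the same function is what lets `XID_Start`, `xid-start` and `xidstart` resolve to one entry.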

The remaining table-based properties (the General_Category values Ll/Lo and the M/P/S families, Block, and the other enumerated properties) require the unicodedata tables and are intentionally left out of this first change, to be added separately.
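
For comparison, the kind of lookup that the table-based General_Category values depend on can be shown with the existing `unicodedata.category()` function. This is only an illustration of why those values need the database; `has_general_category` is a hypothetical helper and says nothing about how a follow-up change would implement them.

```python
import unicodedata


def has_general_category(ch, value):
    """Return True if ch has the given General_Category value.

    A one-letter value such as "P" is treated as a group and matches any
    two-letter category that starts with it (hypothetical helper).
    """
    cat = unicodedata.category(ch)
    return cat == value or (len(value) == 1 and cat.startswith(value))


print(has_general_category("a", "Ll"))   # lowercase letter
print(has_general_category("!", "P"))    # punctuation group
print(has_general_category("A", "Ll"))   # uppercase, so not Ll
```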

…xpressions

Add support for \p{property} and \P{property} in Unicode (str) regular
expressions, for the properties the engine can resolve without the
unicodedata database.  They are matched either as CATEGORY opcodes
(character predicates and combinations of them, see sre.c) or as fixed
sets of character ranges.

Supported properties:

* many General_Category values -- the groups L, N, Z, C and the values Lu,
  Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
* the binary properties Alphabetic, Lowercase, Uppercase, Numeric,
  Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
* the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph,
  lower, print, space, upper, word and xdigit;
* the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point,
  Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@read-the-docs-community


Documentation build overview

📚 cpython-previews | 🛠️ Build #33265162 | 📁 Comparing 8ca0ebe against main (868d9a8)

  🔍 Preview build  

3 files changed
± library/re.html
± whatsnew/3.16.html
± whatsnew/changelog.html

Comment thread Doc/library/re.rst
Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

-__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
+__ https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-4/#G124142

Member


This should be added to:

# The Unicode Database
# --------------------
# When changing UCD version please update
# * Doc/library/stdtypes.rst, and
# * Doc/library/unicodedata.rst
# * Doc/reference/lexical_analysis.rst (three occurrences)
UNIDATA_VERSION = "17.0.0"

Comment thread Lib/re/_properties.py
@@ -0,0 +1,267 @@
#
# Secret Labs' Regular Expression Engine

Member


This wasn't written by the company, nor is it licensed to them?

