gh-95555: Support Unicode property escapes \p{...} in regular expressions #151969
Open
serhiy-storchaka wants to merge 1 commit into
Add support for \p{property} and \P{property} in Unicode (str) regular
expressions, for the properties the engine can resolve without the
unicodedata database. They are matched either as CATEGORY opcodes
(character predicates and combinations of them, see sre.c) or as fixed
sets of character ranges.
Supported properties:
* many General_Category values -- the groups L, N, Z, C and the values Lu,
Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
* the binary properties Alphabetic, Lowercase, Uppercase, Numeric,
Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
* the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph,
lower, print, space, upper, word and xdigit;
* the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point,
Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
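To illustrate the "fixed sets of character ranges" path described in the commit message, here is a minimal Python sketch. It is not the code from this change (the real matcher lives in the C engine and the sre compiler); the names `NONCHARACTER_RANGES`, `JOIN_CONTROL_RANGES` and `in_ranges` are hypothetical. It relies only on the fact that these two properties have small, fixed definitions that need no `unicodedata` lookup.

```python
from bisect import bisect_right

# Noncharacter_Code_Point: U+FDD0..U+FDEF plus the last two code points of
# each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
# Hypothetical table name, for illustration only.
NONCHARACTER_RANGES = [(0xFDD0, 0xFDEF)] + [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(17)
]
NONCHARACTER_RANGES.sort()

# Join_Control: ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOIN_CONTROL_RANGES = [(0x200C, 0x200D)]


def in_ranges(ranges, ch):
    """Binary-search a sorted list of inclusive, non-overlapping ranges."""
    cp = ord(ch)
    starts = [lo for lo, _ in ranges]
    i = bisect_right(starts, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= ranges[i][1]


if __name__ == "__main__":
    for ch in ("\uFDD0", "\uFDF0", chr(0x10FFFF), "\u200D"):
        print(f"U+{ord(ch):04X}",
              in_ranges(NONCHARACTER_RANGES, ch),
              in_ranges(JOIN_CONTROL_RANGES, ch))
```

Because the tables are constant, a check like this needs no database access at match time, which is the property the commit message relies on.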
Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
__ https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-4/#G124142
Member
This should be added to:
cpython/Tools/unicode/makeunicodedata.py
Lines 42 to 48 in 868d9a8
@@ -0,0 +1,267 @@
#
# Secret Labs' Regular Expression Engine
Member
This wasn't written by the company, nor is it licensed to them?
Add support for \p{property} and \P{property} escapes in Unicode (str) regular expressions, for the properties the engine can resolve without the unicodedata database. They are matched either as CATEGORY opcodes (character predicates and combinations of them) or as fixed sets of character ranges, so neither the matcher nor the compiler gains a unicodedata dependency.

Supported in this change:
* many General_Category values -- the groups L, N, Z, C and the values Lu, Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
* the binary properties Alphabetic, Lowercase, Uppercase, Numeric, Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
* the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph, lower, print, space, upper, word and xdigit;
* the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point, Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.

Property and value names use loose matching (UAX #44 UAX44-LM3), and a property may be spelled \p{Lu}, \p{gc=Lu} or \p{name=yes}.

The remaining table-based properties (the General_Category values Ll/Lo and the M/P/S families, Block, and the other enumerated properties) require the unicodedata tables and are intentionally left out of this first change, to be added separately.

re should support \p{...} character properties #95555
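The loose matching and the three spellings mentioned in the description can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser from this change: `loose_key` and `parse_property` are hypothetical names, and the sketch assumes UAX44-LM3 means ignoring case, whitespace, underscores, hyphens and an initial "is" prefix, applied to both sides of any comparison. How the actual patch resolves names to opcodes or range sets is not shown here.

```python
import re


def loose_key(name):
    """Reduce a property or value name to a UAX44-LM3-style comparison key."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_\-]+", "", name).lower()
    # Drop an initial "is" prefix; both comparands must go through this
    # function so that the stripping stays consistent.
    if key.startswith("is"):
        key = key[2:]
    return key


def parse_property(body):
    """Split the text between the braces of a property escape.

    Returns (property_key, value_key); property_key is None for the bare
    form, where the text is either a value such as Lu or a binary property.
    """
    name, sep, value = body.partition("=")
    if not sep:
        return (None, loose_key(name))
    return (loose_key(name), loose_key(value))


if __name__ == "__main__":
    for body in ("Lu", "gc=Lu", "Alphabetic=yes", "Is_Upper-case"):
        print(body, "->", parse_property(body))
```

A lookup table keyed by `loose_key` output would then let spellings such as `XID_Start`, `xid start` and `xidstart` resolve to the same entry.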