gh-152033: Optimize category escapes outside character sets#152035
Open
serhiy-storchaka wants to merge 2 commits into
Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and ``\W``) that occur outside a character set are now compiled directly to a single CATEGORY opcode instead of being wrapped in an IN block. This removes the IN wrapper (three code words) and an indirect charset() call, and makes such an escape a simple repeatable unit so that, for example, ``\d+`` uses the REPEAT_ONE fast path; a CATEGORY case is added to SRE(count). The transformation preserves behaviour exactly. For category-heavy patterns the compiled byte code is about 20% smaller and matching is up to ~2x faster, with no effect on patterns that do not use bare category escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_get_charset_prefix() did not recognize a leading bare CATEGORY opcode, so a pattern starting with a category escape (such as ``\d``) lost its SRE_INFO_CHARSET prefix and search() could no longer skip non-matching start positions -- a regression relative to the IN-wrapped form. Handle CATEGORY there too, which restores the charset-prefix optimization and makes search() of category-prefixed patterns up to ~1.9x faster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The character class escapes `\d`, `\D`, `\s`, `\S`, `\w` and `\W` are currently always compiled to an `IN` block containing a single `CATEGORY` item, even when they occur outside a character set.

This PR compiles such an escape directly to a bare `CATEGORY` opcode, removing the `IN` wrapper (three code words) and the indirect `SRE(charset)` call. It also makes a category escape a "simple" repeatable unit, so `\d+` now uses the `REPEAT_ONE` fast path instead of the generic `REPEAT`/`MAX_UNTIL` loop; a `CATEGORY` case is added to `SRE(count)` accordingly.

The transformation preserves behaviour exactly (the engine already matched the same category); only the compiled byte code changes.
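
For anyone who wants to look at the compiled form themselves, here is a minimal sketch that captures what the public `re.DEBUG` flag prints for a bare escape and for the equivalent character-set form. The helper name `debug_dump` is hypothetical, and the dump format is an implementation detail that differs between CPython versions, so no expected output is shown.

```python
import contextlib
import io
import re


def debug_dump(pattern: str) -> str:
    """Return the text that re.DEBUG prints while compiling `pattern`.

    `debug_dump` is an illustrative helper, not part of this PR.
    """
    buffer = io.StringIO()
    # re.DEBUG writes its dump to stdout, so capture it in a buffer.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


if __name__ == "__main__":
    # Compare a bare escape, a repeated bare escape, and the
    # character-set spelling of the same category.
    for pattern in (r"\d", r"\d+", r"[\d]+"):
        print(f"--- {pattern} ---")
        print(debug_dump(pattern))
```

Running this on builds with and without the change should make the difference in the emitted code visible, while the matching behaviour stays the same.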
In a release build I measure ~1.3x geometric-mean speedup across a range of category-heavy patterns (roughly 1.7–2.0x on scans like `\d+`, `\s+` and `\S+`, and ~1.1–1.2x on realistic tokenizing, date and IP-address patterns), together with ~20% smaller compiled byte code for those patterns. Patterns that do not use bare category escapes are unaffected.

Verified with the full `test_re` suite, a broad sweep of regex-dependent stdlib modules, and ~80k differential fuzz cases compared against the unmodified engine (0 mismatches).

🤖 Generated with Claude Code
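
A rough sketch of a differential check in the same spirit follows. It is not the fuzz harness used for this PR: instead of comparing against an unmodified engine, it compares each bare escape with its character-set spelling (for example `\d+` against `[\d]+`) inside a single interpreter, on the assumption that the two forms must always produce the same matches. The names `ESCAPES`, `ALPHABET`, `random_text` and `count_mismatches` are hypothetical.

```python
import random
import re

# The six category escapes covered by the change.
ESCAPES = (r"\d", r"\D", r"\s", r"\S", r"\w", r"\W")
# Small alphabet mixing ASCII and non-ASCII letters, digits and spaces
# (Arabic-Indic digit three, e-acute, em space).
ALPHABET = "a Z_9\t\n-\u0663\u00e9\u2003"


def random_text(rng: random.Random, length: int = 40) -> str:
    """Build a random string from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def count_mismatches(cases: int = 1000, seed: int = 0) -> int:
    """Count cases where the bare and in-set forms disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(cases):
        escape = rng.choice(ESCAPES)
        suffix = rng.choice(("", "+", "*", "?", "{2,3}"))
        text = random_text(rng)
        bare = re.findall(escape + suffix, text)
        in_set = re.findall("[" + escape + "]" + suffix, text)
        if bare != in_set:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", count_mismatches())
```

Any non-zero count would point at a behavioural difference between the two compiled forms.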