gh-152033: Optimize category escapes outside character sets #152035

Open
serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:re-category-outside-set
@serhiy-storchaka serhiy-storchaka commented Jun 23, 2026

Member

The character class escapes \d, \D, \s, \S, \w and \W are currently always compiled to an IN block containing a single CATEGORY item, even when they occur outside a character set.

This PR compiles such an escape directly to a bare CATEGORY opcode, removing the IN wrapper (three code words) and the indirect SRE(charset) call. It also makes a category escape a "simple" repeatable unit, so \d+ now uses the REPEAT_ONE fast path instead of the generic REPEAT/MAX_UNTIL loop; a CATEGORY case is added to SRE(count) accordingly.
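One way to look at what the compiler produces for a bare escape is the `re.DEBUG` flag, which prints the parsed pattern (and, on recent CPython versions, a disassembly of the compiled code) to stdout. The following is a minimal sketch, not part of the PR; the helper name `dump_pattern` is hypothetical, and the exact text printed depends on the interpreter build.

```python
import contextlib
import io
import re


def dump_pattern(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        # Patterns compiled with re.DEBUG are not cached, so the dump is
        # printed on every call.
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


if __name__ == "__main__":
    # Compare a bare category escape with the equivalent one-item set.
    for pat in (r"\d+", r"[\d]+"):
        print(f"--- {pat}")
        print(dump_pattern(pat))
```

Running the script on builds with and without the change and comparing the dumps for `\d+` should show whether the IN wrapper is still emitted.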

The transformation preserves behaviour exactly (the engine already matched the same category); only the compiled byte code changes.

In a release build I measure ~1.3x geometric-mean speedup across a range of category-heavy patterns — roughly 1.7–2.0x on scans like \d+, \s+ and \S+, and ~1.1–1.2x on realistic tokenizing, date and IP-address patterns — together with ~20% smaller compiled byte code for those patterns. Patterns that do not use bare category escapes are unaffected.
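The scan figures can be checked locally with a small `timeit` harness. The sketch below is an assumption about how such a measurement might be set up, not the benchmark used for the numbers above; `time_scan`, the input strings and the repeat counts are all illustrative. It needs to be run on both the patched and the unpatched interpreter, and the two outputs compared.

```python
import re
import timeit


def time_scan(pattern, text, number=1000):
    """Return the best-of-5 time in seconds for `number` fullmatch() calls."""
    compiled = re.compile(pattern)
    timer = timeit.Timer(lambda: compiled.fullmatch(text))
    return min(timer.repeat(repeat=5, number=number))


if __name__ == "__main__":
    # Each text is matched completely by its pattern, so the time is
    # dominated by the repeat loop over a single category.
    cases = {
        r"\d+": "0123456789" * 1000,
        r"\s+": " \t\n" * 3000,
        r"\S+": "abcdefghij" * 1000,
    }
    for pattern, text in cases.items():
        print(f"{pattern!r}: {time_scan(pattern, text):.4f} s")
```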

Verified with the full test_re suite, a broad sweep of regex-dependent stdlib modules, and ~80k differential fuzz cases compared against the unmodified engine (0 mismatches).
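Comparing against the unmodified engine requires two builds. A weaker check that runs inside a single interpreter is to compare each bare escape with its one-item set form (for example `\d` against `[\d]`), which must always match identically. The sketch below illustrates that idea only; it is not the fuzz harness used for this PR, and `PAIRS`, `SUFFIXES`, `ALPHABET`, `spans` and `check` are hypothetical names.

```python
import random
import re

# Each bare category escape paired with its equivalent one-item set.
PAIRS = [
    (r"\d", r"[\d]"), (r"\D", r"[\D]"),
    (r"\s", r"[\s]"), (r"\S", r"[\S]"),
    (r"\w", r"[\w]"), (r"\W", r"[\W]"),
]
# Quantifiers appended to both forms, including the unquantified case.
SUFFIXES = ["", "+", "*", "?", "{2,4}", "+?"]
ALPHABET = "a1 _\t-Z9\n."


def spans(pattern, text):
    """Return the spans of all non-overlapping matches of `pattern`."""
    return [m.span() for m in re.finditer(pattern, text)]


def check(cases=2000, seed=0):
    """Return how many random cases give different spans for the two forms."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(cases):
        bare, in_set = rng.choice(PAIRS)
        suffix = rng.choice(SUFFIXES)
        length = rng.randint(0, 20)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        if spans(bare + suffix, text) != spans(in_set + suffix, text):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", check())
```

A correct engine reports zero mismatches; any other count would point at a divergence between the bare and the set-wrapped code paths.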

🤖 Generated with Claude Code

Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and
``\W``) that occur outside a character set are now compiled directly to a
single CATEGORY opcode instead of being wrapped in an IN block.  This
removes the IN wrapper (three code words) and an indirect charset() call,
and makes such an escape a simple repeatable unit so that, for example,
``\d+`` uses the REPEAT_ONE fast path; a CATEGORY case is added to
SRE(count).

The transformation preserves behaviour exactly.  For category-heavy
patterns the compiled byte code is about 20% smaller and matching is up
to ~2x faster, with no effect on patterns that do not use bare category
escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read-the-docs-community Bot commented Jun 23, 2026


Documentation build overview

📚 cpython-previews | 🛠️ Build #33278181 | 📁 Comparing 2cb30f2 against main (5e0747d)

  🔍 Preview build  

2 files changed
± whatsnew/3.16.html
± whatsnew/changelog.html

_get_charset_prefix() did not recognize a leading bare CATEGORY opcode,
so a pattern starting with a category escape (such as ``\d``) lost its
SRE_INFO_CHARSET prefix and search() could no longer skip non-matching
start positions -- a regression relative to the IN-wrapped form.  Handle
CATEGORY there too, which restores the charset-prefix optimization and
makes search() of category-prefixed patterns up to ~1.9x faster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
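
The start-position skipping described in the second commit can be observed by timing `search()` over a long run of characters that cannot start a match. This is a rough sketch under assumed inputs, not a measurement from the PR; `time_search` and the haystack are illustrative, and whether the bare and set-wrapped patterns differ at all depends on the build being tested.

```python
import re
import timeit


def time_search(pattern, text, number=200):
    """Return the best-of-5 time in seconds for `number` search() calls."""
    compiled = re.compile(pattern)
    timer = timeit.Timer(lambda: compiled.search(text))
    return min(timer.repeat(repeat=5, number=number))


if __name__ == "__main__":
    # A long run of non-digits forces search() to reject many start
    # positions before it reaches the only match at the end.
    haystack = "x" * 100_000 + "12:34"
    for pattern in (r"\d\d:\d\d", r"[\d][\d]:[\d][\d]"):
        print(f"{pattern!r}: {time_search(pattern, haystack):.4f} s")
```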