gh-152033: Optimize category escapes outside character sets by serhiy-storchaka · Pull Request #152035 · python/cpython

serhiy-storchaka · 2026-06-23T20:50:43Z

The character class escapes \d, \D, \s, \S, \w and \W are currently always compiled to an IN block containing a single CATEGORY item, even when they occur outside a character set.

This PR compiles such an escape directly to a bare CATEGORY opcode, removing the IN wrapper (three code words) and the indirect SRE(charset) call. It also makes a category escape a "simple" repeatable unit, so \d+ now uses the REPEAT_ONE fast path instead of the generic REPEAT/MAX_UNTIL loop; a CATEGORY case is added to SRE(count) accordingly.
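One way to inspect how a pattern is parsed and compiled on a given build is the public `re.DEBUG` flag, which prints a dump to standard output while compiling. The sketch below captures that dump for `\d+`. The helper name `dump_pattern` is illustrative and not part of this PR, and the exact text printed differs between Python versions, so no expected output is shown.

```python
import contextlib
import io
import re


def dump_pattern(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buf = io.StringIO()
    # re.DEBUG writes its dump with print(), so redirecting stdout captures it.
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


if __name__ == "__main__":
    # Look for whether CATEGORY appears nested under IN or on its own.
    print(dump_pattern(r"\d+"))
```

Comparing the dump from a build with and without the change should show whether the category item is still nested under an IN block.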

The transformation preserves behaviour exactly (the engine already matched the same category); only the compiled byte code changes.

In a release build I measure ~1.3x geometric-mean speedup across a range of category-heavy patterns — roughly 1.7–2.0x on scans like \d+, \s+ and \S+, and ~1.1–1.2x on realistic tokenizing, date and IP-address patterns — together with ~20% smaller compiled byte code for those patterns. Patterns that do not use bare category escapes are unaffected.
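A rough way to reproduce this kind of measurement locally is a `timeit` micro-benchmark such as the sketch below. The sample text, the pattern list and the helper `time_findall` are assumptions chosen for illustration; they are not the benchmark used for the figures above, and absolute numbers depend on the machine and build.

```python
import re
import timeit

# Hypothetical sample input: digits, whitespace, a date and an IP address.
SAMPLE = "item 12345 at 2026-06-23 from 192.168.0.1\n" * 200


def time_findall(pattern, text, number=200):
    """Return the best per-call time in seconds for findall() over `text`."""
    compiled = re.compile(pattern)
    # Take the minimum of several runs to reduce noise from other processes.
    runs = timeit.repeat(lambda: compiled.findall(text), number=number, repeat=3)
    return min(runs) / number


if __name__ == "__main__":
    for pat in (r"\d+", r"\s+", r"\S+", r"\d{4}-\d{2}-\d{2}"):
        print(f"{pat!r}: {time_findall(pat, SAMPLE) * 1e6:.1f} us per call")
```

Running the same script under an interpreter built from `main` and one built from this branch gives a per-pattern ratio that can be compared with the speedups quoted above.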

Verified with the full test_re suite, a broad sweep of regex-dependent stdlib modules, and ~80k differential fuzz cases compared against the unmodified engine (0 mismatches).
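The differential fuzzing above compared against the unmodified engine, which cannot be reproduced inside a single interpreter. As a smaller stand-in, the sketch below checks that a bare escape and its set-wrapped form (for example `\d+` and `[\d]+`) yield identical match spans on random inputs. The alphabet, the quantifier list and the function `check_equivalence` are assumptions made for this example.

```python
import random
import re

ESCAPES = [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]
QUANTIFIERS = ["", "+", "*", "?", "{2,3}"]
# Small alphabet mixing ASCII, whitespace and two non-ASCII characters.
ALPHABET = "ab1 2\t_-\n\u00e9\u0663"


def check_equivalence(cases=1000, seed=0):
    """Count inputs where the bare and set-wrapped forms disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(cases):
        esc = rng.choice(ESCAPES)
        quant = rng.choice(QUANTIFIERS)
        text = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, 20)))
        bare = re.compile(esc + quant)                  # e.g. \d+
        wrapped = re.compile("[" + esc + "]" + quant)   # e.g. [\d]+
        bare_spans = [m.span() for m in bare.finditer(text)]
        wrapped_spans = [m.span() for m in wrapped.finditer(text)]
        if bare_spans != wrapped_spans:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", check_equivalence())
```

Any non-zero count would point to a behavioural difference between the two compiled forms.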

🤖 Generated with Claude Code

Issue: Optimize matching of category escapes (\d, \w, ...) outside character sets #152033

Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and ``\W``) that occur outside a character set are now compiled directly to a single CATEGORY opcode instead of being wrapped in an IN block.

This removes the IN wrapper (three code words) and an indirect charset() call, and makes such an escape a simple repeatable unit so that, for example, ``\d+`` uses the REPEAT_ONE fast path; a CATEGORY case is added to SRE(count).

The transformation preserves behaviour exactly. For category-heavy patterns the compiled byte code is about 20% smaller and matching is up to ~2x faster, with no effect on patterns that do not use bare category escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read-the-docs-community · 2026-06-23T20:55:52Z

Documentation build overview

📚 cpython-previews | 🛠️ Build #33278181 | 📁 Comparing 2cb30f2 against main (5e0747d)

🔍 Preview build

2 files changed

± whatsnew/3.16.html
± whatsnew/changelog.html

_get_charset_prefix() did not recognize a leading bare CATEGORY opcode, so a pattern starting with a category escape (such as ``\d``) lost its SRE_INFO_CHARSET prefix and search() could no longer skip non-matching start positions -- a regression relative to the IN-wrapped form. Handle CATEGORY there too, which restores the charset-prefix optimization and makes search() of category-prefixed patterns up to ~1.9x faster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
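The case this commit targets is a search() whose pattern begins with a category escape and whose input has a long run of characters that cannot start a match. The sketch below builds such an input and uses only the public `re` API. The helper `first_token` and the input sizes are illustrative assumptions; the script shows the workload, not the internal prefix logic.

```python
import re

# Pattern that starts with a bare category escape.
TOKEN = re.compile(r"\d\w*")


def first_token(text):
    """Return (start, matched_text) for the first token, or None."""
    m = TOKEN.search(text)
    if m is None:
        return None
    return m.start(), m.group()


if __name__ == "__main__":
    # 100000 characters that can never begin a match, then one token.
    haystack = "-" * 100_000 + "7abc"
    print(first_token(haystack))
```

Timing `first_token` on a haystack like this, before and after the change, is one way to observe the effect of the charset prefix on search().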

serhiy-storchaka requested a review from AA-Turner as a code owner June 23, 2026 20:50

bedevere-app Bot mentioned this pull request Jun 23, 2026

Optimize matching of category escapes (\d, \w, ...) outside character sets #152033

Open

bedevere-app Bot added the awaiting core review label Jun 23, 2026

serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:re-category-outside-set
