Compile single-category character sets to a bare CATEGORY opcode #152056

Description

@eendebakpt

After gh-152033, \d outside a set compiles to a bare CATEGORY opcode with a fast REPEAT_ONE / count path. But a character set containing exactly one category — [\d], [\w], [^\s], … — is still wrapped in an IN block, even though [\d] = \d and [^\d] = \D.

Such sets can compile to the same bare CATEGORY opcode (the negated form via the complementary category, CH_NEGATE), reusing gh-152033's fast path. @serhiy-storchaka
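The claimed equivalence can be checked from Python without touching the regex internals. The following is a minimal sketch, not part of the proposal: the helper name `disagreements` and the code-point limit are illustrative choices. It compares each single-category set against its shorthand one character at a time.

```python
import re

# Each single-category set paired with the shorthand it should equal.
PAIRS = [
    (r"[\d]", r"\d"), (r"[\w]", r"\w"), (r"[\s]", r"\s"),
    (r"[^\d]", r"\D"), (r"[^\w]", r"\W"), (r"[^\s]", r"\S"),
]

def disagreements(set_form, bare_form, limit=0x3000):
    """Return the code points below `limit` where the two patterns differ."""
    a = re.compile(set_form)
    b = re.compile(bare_form)
    return [
        cp for cp in range(limit)
        if (a.fullmatch(chr(cp)) is None) != (b.fullmatch(chr(cp)) is None)
    ]

for set_form, bare_form in PAIRS:
    # An empty list means the two spellings agree on every tested character.
    print(set_form, bare_form, disagreements(set_form, bare_form))
```

This only confirms that the two spellings match the same characters. It says nothing about how either one is compiled.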

Smaller byte code

The IN wrapper is removed, so each single-category set shrinks:

| pattern | before | after | saved |
|---|---|---|---|
| [\d]+, [\w]+, [\s]+ | 16 | 13 | −3 words |
| [^\d]+, [^\w]+, [^\s]+ | 17 | 13 | −4 words |
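One way to inspect compiled sizes locally is to count the code words the regex compiler emits. This is a rough sketch that relies on private, undocumented modules (`re._compiler` / `re._parser` on Python 3.11+, `sre_compile` / `sre_parse` before that) and on the private `_code` function, so it is an assumption that may break between versions. The helper name `code_words` is hypothetical, and the counts it reports may not be measured the same way as the table above.

```python
try:
    # Python 3.11+: the regex compiler lives in private submodules of `re`.
    from re import _compiler as sre_compile, _parser as sre_parse
except ImportError:
    # Older versions expose the same code as top-level modules.
    import sre_compile
    import sre_parse

def code_words(pattern, flags=0):
    """Number of code words emitted for `pattern` (private API, may change)."""
    parsed = sre_parse.parse(pattern, flags)
    return len(sre_compile._code(parsed, flags))

for pattern in (r"\d+", r"[\d]+", r"\D+", r"[^\d]+"):
    # Results depend on the interpreter build, so none are asserted here.
    print(f"{pattern:8} {code_words(pattern)} words")
```

Running it on builds with and without the change should show whether the set forms reach the same length as the bare shorthands.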

Benchmark

| Benchmark | base | after | speedup |
|---|---|---|---|
| [^\s]+ | 1.58 us | 559 ns | 2.82x |
| [^\d]+ | 1.65 us | 911 ns | 1.81x |
| [^\w]+ | 2.65 us | 1.72 us | 1.55x |
| [\s]+ | 974 ns | 634 ns | 1.54x |
| [\d]+ | 1.06 us | 740 ns | 1.43x |
| findall [^\s]+ | 91.5 us | 69.0 us | 1.33x |
| [\w]+ | 1.49 us | 1.15 us | 1.30x |
| findall [\w]+ | 116 us | 90.5 us | 1.29x |
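For a quick look at the gap on a single interpreter, the set form can be timed against the equivalent shorthand with `timeit`. This is a simplified sketch, not the pyperf benchmark used for the numbers above: the helper `best_time` and its iteration counts are illustrative assumptions, and `timeit` results are noisier than pyperf's.

```python
import re
import timeit

N = 200
# Same non-space subject string as the pyperf script uses for [^\s]+.
text = ("aBcDeF9gHiJkL0mNoPqR.sT?" * N)[:N]

def best_time(pattern, subject, number=20000, repeat=5):
    """Best observed time in seconds for one match() call."""
    match = re.compile(pattern).match
    timings = timeit.repeat(lambda: match(subject), number=number, repeat=repeat)
    return min(timings) / number

for pattern in (r"[^\s]+", r"\S+"):
    print(f"{pattern:8} {best_time(pattern, text) * 1e9:8.0f} ns per match")
```

On an interpreter without the change, the two lines should differ noticeably. With the change, both spellings are expected to take the same path and land close together.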

Real-world usage

Single-category sets are common in real code, especially the negated [^\s] / [^\d] / [^\w], which
people write instead of \S / \D / \W. A few examples:

  • yt-dlp: re.split(r'[^\d]+', curl_cffi.__version__) (version parsing, networking/_curlcffi.py:34), re.sub(r'^[^\d]+\s', '', s) (utils/_utils.py:1850), and _VALID_URL = r'trovovod:(?P<id>[^\s]+)' (extractor/trovo.py:316).
  • Pygments psql lexer: (r"[^\s]+", String.Symbol) and r'\\[^\s]+' (lexers/sql.py:262,270).
  • NLTK Penn Treebank / destructive tokenizers: re.compile(r"([:,])([^\d])") (tokenize/treebank.py:62, tokenize/destructive.py:83).
  • Instagram extractor (yt-dlp): …likeCountClick"[^>]*>[^\d]*([\d,\.]+)
    (extractor/instagram.py:515).
  • stdlib: Lib/re/__init__.py ([^\d], [^\s]), Lib/ctypes/util.py ([^\s], x3), Lib/idlelib/outwin.py ([^\s]), and others.
  • Jinja2 — the urlize URL matcher ([\d]{1,3})(\.[\d]{1,3}){3} / (?::[\d]{1,5})? (jinja2/utils.py:216,221).
  • SymPy 1.14.0 — [\w]* in sympy/plotting/experimental_lambdify.py.
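
Usages like these can be located with a simple textual search. The sketch below is an approximation, not the method used to collect the list above: the name `find_single_category_sets` is hypothetical, it only looks for the literal spellings [\d], [\s], [\w] and their negations, and it does not parse regex syntax, so for example an escaped opening bracket would be misread.

```python
import re

# Literal "[", optional "^", a backslash followed by d, s or w, then "]".
SINGLE_CATEGORY_SET = re.compile(r"\[\^?\\[dsw]\]")

def find_single_category_sets(source_text):
    """Return every single-category set spelled out in `source_text`."""
    return SINGLE_CATEGORY_SET.findall(source_text)

samples = [
    r"re.split(r'[^\d]+', curl_cffi.__version__)",
    r"([\d]{1,3})(\.[\d]{1,3}){3}",
    r"[^>]*>[^\d]*([\d,\.]+)",  # [\d,\.] has several items, so it is skipped
]
for line in samples:
    print(find_single_category_sets(line), "<-", line)
```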

Benchmark script (pyperf):
"""Benchmark: single-category character sets -> bare CATEGORY opcode."""
import re
import pyperf

N = 200
DIGITS  = ("0123456789" * N)[:N]
WORDS   = ("aB_dE9gH_kLmN0pQrStUvW_1" * N)[:N]
SPACES  = ((" \t\n\r\f\v") * N)[:N]
NODIGIT = ("aBcDeF gHiJkL!mNoPqR.sT?" * N)[:N]
NOWORD  = (" \t!.,?;: -+=/\\()[]{}<>|" * N)[:N]
NOSPACE = ("aBcDeF9gHiJkL0mNoPqR.sT?" * N)[:N]

SCANS = [
    ("scan_set_digit",    re.compile(r"[\d]+"),  DIGITS),
    ("scan_set_word",     re.compile(r"[\w]+"),  WORDS),
    ("scan_set_space",    re.compile(r"[\s]+"),  SPACES),
    ("scan_set_notdigit", re.compile(r"[^\d]+"), NODIGIT),
    ("scan_set_notword",  re.compile(r"[^\w]+"), NOWORD),
    ("scan_set_notspace", re.compile(r"[^\s]+"), NOSPACE),
]
DOC = ("The Quick Brown Fox jumps over 12 Lazy Dogs near IP 10_0_0_1 and Node7. " * 50)
FINDS = [
    ("find_set_word",     re.compile(r"[\w]+"),  DOC),
    ("find_set_notspace", re.compile(r"[^\s]+"), DOC),
]

def make_scan(p, s):
    def run():
        assert p.match(s) is not None
    return run

runner = pyperf.Runner()
for name, p, s in SCANS:
    runner.bench_func(name, make_scan(p, s))
for name, p, s in FINDS:
    runner.bench_func(name, (lambda p, s: lambda: p.findall(s))(p, s))

Labels: performance, stdlib, topic-regex
