After gh-152033, \d outside a set compiles to a bare CATEGORY opcode with a fast REPEAT_ONE / count path. But a character set containing exactly one category — [\d], [\w], [^\s], … — is still wrapped in an IN block, even though [\d] = \d and [^\d] = \D.
Such sets can compile to the same bare CATEGORY opcode (the negated form via the complementary category, CH_NEGATE), reusing gh-152033's fast path. @serhiy-storchaka
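The equivalence this proposal relies on can be checked from Python with nothing but the public `re` API. The following is a minimal sketch, not part of the proposed patch: the sample string and the helper name `same_matches` are made up for illustration, and the check covers both the default Unicode mode and `re.ASCII`.

```python
import re

# Illustrative sample: ASCII letters/digits, whitespace, punctuation,
# plus one non-ASCII digit and one non-ASCII letter.
SAMPLE = "ab_12 \t3.4\nxyz-567 \u0663\u00e9"

# Each single-category set paired with the escape it should be equivalent to.
PAIRS = [
    (r"[\d]", r"\d"), (r"[\w]", r"\w"), (r"[\s]", r"\s"),
    (r"[^\d]", r"\D"), (r"[^\w]", r"\W"), (r"[^\s]", r"\S"),
]


def same_matches(set_form, escape_form, text=SAMPLE, flags=0):
    """Return True if both spellings find the same runs in `text`."""
    found_set = re.findall(set_form + "+", text, flags)
    found_escape = re.findall(escape_form + "+", text, flags)
    return found_set == found_escape


for set_form, escape_form in PAIRS:
    for flags in (0, re.ASCII):
        assert same_matches(set_form, escape_form, flags=flags)
```

Because the two spellings match the same strings, the change is only about which opcodes the compiler emits, not about matching behaviour.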
Smaller byte code
The IN wrapper is removed, so each single-category set shrinks:
| pattern | before | after | saved |
|---|---|---|---|
| [\d]+, [\w]+, [\s]+ | 16 | 13 | −3 words |
| [^\d]+, [^\w]+, [^\s]+ | 17 | 13 | −4 words |
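One way to look at the compiled form on a given interpreter is the public `re.DEBUG` flag, which prints the parse tree and, on current CPython versions, a listing of the compiled code. This is a sketch for inspection only: the helper name `debug_dump` is hypothetical, and the exact text printed varies between Python versions, so no particular output is shown here.

```python
import contextlib
import io
import re


def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buf = io.StringIO()
    # re.DEBUG writes to sys.stdout, so capture it for later comparison.
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


# Compare the bare escape with the single-category set spellings.
for pattern in (r"\d+", r"[\d]+", r"[^\d]+"):
    print(pattern)
    print(debug_dump(pattern))
```

Comparing the dumps for `\d+` and `[\d]+` shows whether the set is still wrapped in an IN block on the interpreter being tested.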
Benchmark
| Benchmark | base | after | speedup |
|---|---|---|---|
| [^\s]+ | 1.58 us | 559 ns | 2.82x |
| [^\d]+ | 1.65 us | 911 ns | 1.81x |
| [^\w]+ | 2.65 us | 1.72 us | 1.55x |
| [\s]+ | 974 ns | 634 ns | 1.54x |
| [\d]+ | 1.06 us | 740 ns | 1.43x |
| findall [^\s]+ | 91.5 us | 69.0 us | 1.33x |
| [\w]+ | 1.49 us | 1.15 us | 1.30x |
| findall [\w]+ | 116 us | 90.5 us | 1.29x |
Real-world usage
Single-category sets, especially the negated [^\s] / [^\d] / [^\w], which
people write instead of \S / \D / \W, are used in real code. A few examples:
- yt-dlp: re.split(r'[^\d]+', curl_cffi.__version__) (version parsing, networking/_curlcffi.py:34), re.sub(r'^[^\d]+\s', '', s) (utils/_utils.py:1850), and _VALID_URL = r'trovovod:(?P<id>[^\s]+)' (extractor/trovo.py:316).
- Pygments psql lexer: (r"[^\s]+", String.Symbol) and r'\\[^\s]+' (lexers/sql.py:262,270).
- NLTK Penn Treebank / destructive tokenizers: re.compile(r"([:,])([^\d])") (tokenize/treebank.py:62, tokenize/destructive.py:83).
- Instagram extractor (yt-dlp): …likeCountClick"[^>]*>[^\d]*([\d,\.]+) (extractor/instagram.py:515).
- stdlib: Lib/re/__init__.py ([^\d], [^\s]), Lib/ctypes/util.py ([^\s], x3), Lib/idlelib/outwin.py ([^\s]), and others.
- Jinja2: the urlize URL matcher ([\d]{1,3})(\.[\d]{1,3}){3} / (?::[\d]{1,5})? (jinja2/utils.py:216,221).
- SymPy 1.14.0: [\w]* in sympy/plotting/experimental_lambdify.py.
benchmark script (pyperf)
"""Benchmark: single-category character sets -> bare CATEGORY opcode."""
import re
import pyperf
N = 200
DIGITS = ("0123456789" * N)[:N]
WORDS = ("aB_dE9gH_kLmN0pQrStUvW_1" * N)[:N]
SPACES = ((" \t\n\r\f\v") * N)[:N]
NODIGIT = ("aBcDeF gHiJkL!mNoPqR.sT?" * N)[:N]
NOWORD = (" \t!.,?;: -+=/\\()[]{}<>|" * N)[:N]
NOSPACE = ("aBcDeF9gHiJkL0mNoPqR.sT?" * N)[:N]
SCANS = [
    ("scan_set_digit", re.compile(r"[\d]+"), DIGITS),
    ("scan_set_word", re.compile(r"[\w]+"), WORDS),
    ("scan_set_space", re.compile(r"[\s]+"), SPACES),
    ("scan_set_notdigit", re.compile(r"[^\d]+"), NODIGIT),
    ("scan_set_notword", re.compile(r"[^\w]+"), NOWORD),
    ("scan_set_notspace", re.compile(r"[^\s]+"), NOSPACE),
]
DOC = ("The Quick Brown Fox jumps over 12 Lazy Dogs near IP 10_0_0_1 and Node7. " * 50)
FINDS = [
    ("find_set_word", re.compile(r"[\w]+"), DOC),
    ("find_set_notspace", re.compile(r"[^\s]+"), DOC),
]
def make_scan(p, s):
    def run():
        assert p.match(s) is not None
    return run
runner = pyperf.Runner()
for name, p, s in SCANS:
    runner.bench_func(name, make_scan(p, s))
for name, p, s in FINDS:
    runner.bench_func(name, (lambda p, s: lambda: p.findall(s))(p, s))
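For a rough check without pyperf, the set and escape spellings can also be timed against each other with the standard-library `timeit` module. This is only a sketch under stated assumptions: the helper `time_match`, the subject string, and the iteration count are illustrative choices, and the numbers it prints are much noisier than pyperf results.

```python
import re
import timeit


def time_match(pattern, subject, number=20000):
    """Return the average seconds per match() call for `pattern` on `subject`."""
    compiled = re.compile(pattern)
    total = timeit.timeit(lambda: compiled.match(subject), number=number)
    return total / number


# Subject with no whitespace, so both patterns scan the whole string.
SUBJECT = "aBcDeF9gHiJkL0mNoPqR.sT?" * 8

set_time = time_match(r"[^\s]+", SUBJECT)
escape_time = time_match(r"\S+", SUBJECT)
print(f"[^\\s]+ : {set_time * 1e9:.0f} ns per call")
print(f"\\S+    : {escape_time * 1e9:.0f} ns per call")
```

If the two spellings compile to the same code, the two timings should be close; a consistent gap suggests the set form is still taking the slower path on that interpreter.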
Linked PRs