Skip to content

Speed up matching of case-insensitive character sets #152054

Description

@eendebakpt

Feature or enhancement

Proposal:

A REPEAT_ONE over a case-insensitive character set — e.g. [a-z]+ with re.IGNORECASE — does not use the fast SRE(count) path. The compiled inner opcode is IN_IGNORE / IN_UNI_IGNORE / IN_LOC_IGNORE, none of which has a case in SRE(count). The case-sensitive SRE_OP_IN already has a fast case.

Adding the three IN_*_IGNORE cases to SRE(count) lets them scan inline.

Benchmark

Benchmark before after
[a-z]+ re.I|re.A (IN_IGNORE) 1.28 us 538 ns 2.38x
[a-z]+ re.I (IN_UNI_IGNORE) 1.41 us 711 ns 1.98x
[aeiou]+ re.I 1.35 us 706 ns 1.91x
[a-z0-9]+ re.I 1.31 us 696 ns 1.88x
[a-z0-9_]+ re.I 1.31 us 703 ns 1.86x
[a-z]+ re.L|re.I bytes (IN_LOC_IGNORE) 2.09 us 1.38 us 1.52x
findall [a-z]+ re.I 109 us 88.8 us 1.22x
findall [a-z_][a-z0-9_]* re.I 103 us 89.2 us 1.15x

[^0-9]+ re.I is unchanged — it has no cased members, so it stays a plain
IN (already fast).

benchmark script (pyperf)
"""Benchmark: SRE(count) fast path for case-insensitive set repeats."""
import re
import pyperf

N = 100
MIXED   = ("aBcDeFgHiJkLmNoPqRsTuVwX" * N)[:N]
ALNUM   = ("aB3dE6gH9kLmN0pQrStUvWx1" * N)[:N]
WORD    = ("aB_dE_gH_kLmN_pQrStUvW_1" * N)[:N]
NODIGIT = ("aBcDeF gHiJkL!mNoPqR.sT?" * N)[:N]
BYTES   = MIXED.encode("latin1")

SCANS = [
    ("scan_alpha_uni",  re.compile(r"[a-z]+",    re.I),        MIXED),
    ("scan_alpha_asc",  re.compile(r"[a-z]+",    re.I | re.A), MIXED),
    ("scan_alnum_uni",  re.compile(r"[a-z0-9]+", re.I),        ALNUM),
    ("scan_word_uni",   re.compile(r"[a-z0-9_]+",re.I),        WORD),
    ("scan_neg_uni",    re.compile(r"[^0-9]+",   re.I),        NODIGIT),
    ("scan_vowels_uni", re.compile(r"[aeiou]+",  re.I),        "aAeEiIoOuU" * (N // 10)),
    ("scan_alpha_loc",  re.compile(rb"[a-z]+",   re.L | re.I), BYTES),
]
DOC = ("The Quick Brown Fox jumps over 12 Lazy Dogs near IP 10_0_0_1 and Node7. " * 50)
FINDS = [
    ("find_words_ci", re.compile(r"[a-z]+",           re.I), DOC),
    ("find_ident_ci", re.compile(r"[a-z_][a-z0-9_]*", re.I), DOC),
]

def make_scan(p, s):
    def run():
        assert p.match(s) is not None
    return run

runner = pyperf.Runner()
for name, p, s in SCANS:
    runner.bench_func(name, make_scan(p, s))
for name, p, s in FINDS:
    runner.bench_func(name, (lambda p, s: lambda: p.findall(s))(p, s))

Run under the unpatched and patched builds, then
python -m pyperf compare_to before.json after.json --table.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagestdlibStandard Library Python modules in the Lib/ directorytopic-regextype-featureA feature request or enhancement
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions