Compile single-category character sets to a bare `CATEGORY` opcode

After gh-152033, `\d` outside a set compiles to a bare `CATEGORY` opcode with a fast `REPEAT_ONE` / `count` path. But a character set containing exactly one category (`[\d]`, `[\w]`, `[^\s]`, …) is still wrapped in an `IN` block, even though `[\d]` matches exactly what `\d` matches and `[^\d]` exactly what `\D` matches.

Such sets can compile to the same bare `CATEGORY` opcode (the negated form via the complementary category, `CH_NEGATE`), reusing gh-152033's fast path. @serhiy-storchaka 
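
The equivalence this relies on can be checked from pure Python. The snippet below is only an illustrative sketch: `SAMPLE`, `PAIRS` and `same_matches` are names made up for this example, and it checks matching behaviour only, not the generated byte code.

```python
import re

# Hypothetical sample text mixing digits, word characters, whitespace,
# punctuation and one non-ASCII digit (ARABIC-INDIC DIGIT THREE).
SAMPLE = "id 42, tab\there, under_score, 3.14 and \u0663"

# (set form, bare escape) pairs that are expected to match identically.
PAIRS = [
    (r"[\d]", r"\d"), (r"[^\d]", r"\D"),
    (r"[\w]", r"\w"), (r"[^\w]", r"\W"),
    (r"[\s]", r"\s"), (r"[^\s]", r"\S"),
]

def same_matches(set_form, bare_form, text=SAMPLE, flags=0):
    """Return True if both forms, repeated with `+`, find the same runs."""
    found_set = re.findall(set_form + "+", text, flags)
    found_bare = re.findall(bare_form + "+", text, flags)
    return found_set == found_bare

for set_form, bare_form in PAIRS:
    for flags in (0, re.ASCII):
        assert same_matches(set_form, bare_form, flags=flags), (set_form, bare_form)
```

This holds with and without `re.ASCII`, since the flag changes what a category matches but not the relation between the set form and the bare escape.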

**Smaller byte code**

The `IN` wrapper is removed, so each single-category set shrinks:

| pattern | before | after | saved |
|---|---|---|---|
| `[\d]+`, `[\w]+`, `[\s]+` | 16 | 13 | −3 words |
| `[^\d]+`, `[^\w]+`, `[^\s]+` | 17 | 13 | −4 words |
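
One way to reproduce such word counts locally is to ask the compiler for the code list it generates. The sketch below uses private CPython internals (`re._parser.parse` and `re._compiler._code`, or `sre_parse` / `sre_compile` before 3.11), which are undocumented and may change. `code_len` is a hypothetical helper, and the numbers it prints depend on the interpreter build, so they are not necessarily the ones in the table.

```python
import re

try:  # Python 3.11+ keeps the compiler under the re package
    from re import _compiler, _parser
except ImportError:  # older versions expose the same code as top-level modules
    import sre_compile as _compiler
    import sre_parse as _parser

def code_len(pattern, flags=0):
    """Return the number of code words the regex compiler emits for `pattern`.

    Assumption: relies on the private helper `_code`, which builds the
    INFO block plus the pattern body and the final SUCCESS opcode.
    """
    parsed = _parser.parse(pattern, flags)
    return len(_compiler._code(parsed, flags))

for set_form, bare_form in [(r"[\d]+", r"\d+"), (r"[^\d]+", r"\D+")]:
    print(f"{set_form!r}: {code_len(set_form)} words, "
          f"{bare_form!r}: {code_len(bare_form)} words")
```

Running it on a build with and without the change should show the set form shrinking to the size of the bare escape.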


**Benchmark**

| Benchmark | base | after | speedup |
|---|---|---|---|
| `[^\s]+` | 1.58 us | 559 ns | 2.82x |
| `[^\d]+` | 1.65 us | 911 ns | 1.81x |
| `[^\w]+` | 2.65 us | 1.72 us | 1.55x |
| `[\s]+` | 974 ns | 634 ns | 1.54x |
| `[\d]+` | 1.06 us | 740 ns | 1.43x |
| `findall [^\s]+` | 91.5 us | 69.0 us | 1.33x |
| `[\w]+` | 1.49 us | 1.15 us | 1.30x |
| `findall [\w]+` | 116 us | 90.5 us | 1.29x |
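
The figures above come from the pyperf script at the end of this issue. For a rough check without pyperf, a stdlib-only sketch along these lines compares a set form with its bare escape on the current interpreter. `TEXT` and `best_time` are names invented for this example, and `timeit` results are much noisier than pyperf's.

```python
import re
import timeit

# 200 non-space characters, similar to NOSPACE in the pyperf script.
TEXT = ("aBcDeF9gHiJkL0mNoPqR.sT?" * 200)[:200]

def best_time(pattern, text=TEXT, number=20000, repeat=5):
    """Return the best per-call time in seconds for pattern.match(text)."""
    match = re.compile(pattern).match
    assert match(text) is not None  # make sure the pattern really matches
    timings = timeit.repeat(lambda: match(text), number=number, repeat=repeat)
    return min(timings) / number

if __name__ == "__main__":
    for pattern in (r"[^\s]+", r"\S+"):
        print(f"{pattern!r}: {best_time(pattern) * 1e9:.0f} ns per match")
```

On a build without the change the set form is expected to be slower than the bare escape; with the change the two should be close.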

**Real-world usage**

Single-category sets are common in real code, especially the negated `[^\s]` / `[^\d]` / `[^\w]`, which
people write instead of `\S` / `\D` / `\W`. A few examples:

- yt-dlp: `re.split(r'[^\d]+', curl_cffi.__version__)` (version parsing,  `networking/_curlcffi.py:34`), `re.sub(r'^[^\d]+\s', '', s)`  (`utils/_utils.py:1850`), and `_VALID_URL = r'trovovod:(?P<id>[^\s]+)'`  (`extractor/trovo.py:316`).
- Pygments psql lexer: `(r"[^\s]+", String.Symbol)` and `r'\\[^\s]+'`  (`lexers/sql.py:262,270`).
- NLTK Penn Treebank / destructive tokenizers:   `re.compile(r"([:,])([^\d])")` (`tokenize/treebank.py:62`,  `tokenize/destructive.py:83`).
- Instagram extractor (yt-dlp): `…likeCountClick"[^>]*>[^\d]*([\d,\.]+)`
  (`extractor/instagram.py:515`).
- **stdlib:** `Lib/re/__init__.py` (`[^\d]`, `[^\s]`), `Lib/ctypes/util.py` (`[^\s]`, x3), `Lib/idlelib/outwin.py` (`[^\s]`), and others.
- Jinja2 — the `urlize` URL matcher `([\d]{1,3})(\.[\d]{1,3}){3}` /  `(?::[\d]{1,5})?` (`jinja2/utils.py:216,221`).
- SymPy 1.14.0 — `[\w]*` in `sympy/plotting/experimental_lambdify.py`.

<details>
<summary>benchmark script (pyperf)</summary>

```python
"""Benchmark: single-category character sets -> bare CATEGORY opcode."""
import re
import pyperf

N = 200
DIGITS  = ("0123456789" * N)[:N]
WORDS   = ("aB_dE9gH_kLmN0pQrStUvW_1" * N)[:N]
SPACES  = ((" \t\n\r\f\v") * N)[:N]
NODIGIT = ("aBcDeF gHiJkL!mNoPqR.sT?" * N)[:N]
NOWORD  = (" \t!.,?;: -+=/\\()[]{}<>|" * N)[:N]
NOSPACE = ("aBcDeF9gHiJkL0mNoPqR.sT?" * N)[:N]

SCANS = [
    ("scan_set_digit",    re.compile(r"[\d]+"),  DIGITS),
    ("scan_set_word",     re.compile(r"[\w]+"),  WORDS),
    ("scan_set_space",    re.compile(r"[\s]+"),  SPACES),
    ("scan_set_notdigit", re.compile(r"[^\d]+"), NODIGIT),
    ("scan_set_notword",  re.compile(r"[^\w]+"), NOWORD),
    ("scan_set_notspace", re.compile(r"[^\s]+"), NOSPACE),
]
DOC = ("The Quick Brown Fox jumps over 12 Lazy Dogs near IP 10_0_0_1 and Node7. " * 50)
FINDS = [
    ("find_set_word",     re.compile(r"[\w]+"),  DOC),
    ("find_set_notspace", re.compile(r"[^\s]+"), DOC),
]

def make_scan(p, s):
    def run():
        assert p.match(s) is not None
    return run

runner = pyperf.Runner()
for name, p, s in SCANS:
    runner.bench_func(name, make_scan(p, s))
for name, p, s in FINDS:
    runner.bench_func(name, (lambda p, s: lambda: p.findall(s))(p, s))
```
</details>



### Linked PRs
* gh-152057
