Add a C accelerator for difflib.SequenceMatcher

# Feature or enhancement

### Proposal:

Disclaimer: I guided Claude Code to generate the code, the benchmarks and the summary below. I felt it's safe to use AI for this issue because it is purely a logical module port and does not touch any of the Python core.

# Add a C accelerator for `difflib.SequenceMatcher`

## Enhancement

Add a C accelerator module `_difflib` for `difflib.SequenceMatcher`, modeled after the `decimal`/`_pydecimal`/`_decimal` pattern. Output is bit-identical to the existing pure-Python implementation (including the documented leftmost-maximal-match tie-break); the win comes from constant-factor optimizations (the algorithm is still Ratcliff-Obershelp).

Across six realistic workloads, the accelerator is **5–15× faster than pure Python** on CPython `main` (3.16-dev), and faster than both existing PyPI accelerators (`cdifflib`, `cydifflib`) on every workload tested.

## Preserving the pure-Python implementation

The pure-Python `SequenceMatcher` is preserved verbatim as a reference/fallback. The layout mirrors `decimal`/`datetime`/`pickle`:

- `Lib/_pydifflib.py` — the existing pure-Python implementation, no behavior change. Self-contained; alternative Python implementations can use it directly.
- `Lib/difflib.py` — a thin shim:
  - `from _pydifflib import *` brings in everything (`Differ`, `HtmlDiff`, `unified_diff`, `ndiff`, `get_close_matches`, …)
  - re-exports a handful of private helpers (`_calculate_ratio`, `_mdiff`, `_format_range_unified`, `_format_range_context`) that existing tests reference
  - `lazy from _pydifflib import can_colorize, get_theme` preserves the existing lazy-import contract that `LazyImportTest` enforces while keeping the names visible as runtime attributes (which `test_pyclbr.test_easy` checks)
  - `try: from _difflib import SequenceMatcher` + subclasses it to inherit the slow-path methods (`quick_ratio`, `real_quick_ratio`, `get_grouped_opcodes`) from the pure-Python class
  - rebinds `_pydifflib.SequenceMatcher` to the C-accelerated subclass after defining it, so the helper functions in `_pydifflib` (which look up `SequenceMatcher` in their own module globals) transparently pick up the speedup. Without this rebind, `difflib.unified_diff` and friends would see zero benefit from the accelerator.

When `_difflib` is unavailable, the shim is a no-op and `SequenceMatcher` stays the pure-Python class.

## Optimization phases

Each phase preserves the algorithm; each row shows what changed and why.

| phase | what changed | why |
|---|---|---|
| 1 | C port of `find_longest_match`; `j2len` replaced with paired int arrays + generation counter (no per-row clear) | Eliminates per-row dict allocation + per-`j` dict probe in the inner DP loop. |
| 2 | C port of `chain_b`; type-specialized iteration of `b` for `str`/`list`/`tuple`/`bytes` | `chain_b` became >50% of remaining cost after phase 1; ~25k `setdefault`+`append` in Python. |
| 3 | Full Ratcliff-Obershelp recursion in C; DP and extension passes operate on `int32` label arrays (`a_lbl`, `a_dp`, `b_lbl`, `junk_mask`) | Hundreds of FLM calls per `ratio()` each cross the Python/C boundary; extension passes did `PyObject_RichCompareBool` per element. |
| 4 | Codepoint-keyed `cp_full[]`/`cp_dp[]` arrays for `str` `b`; `str` `a` fast-path reads codepoint and indexes the array | For `str` inputs we paid `PyUnicode_FromOrdinal` + dict-probe per element on both `a` and `b`. |
| 5 | Bytes fast-path (cp arrays of size 256), persistent DP scratch (no per-call `alloca`/`memset`), skip max-cp scan for UCS1 strings | Bytes paid `PyLong_FromLong` per element under autojunk; pure-Python's per-int allocator costs dominated for the bytes workload. |

Phase 1 is reproduced by monkey-patching `_pydifflib.SequenceMatcher.find_longest_match` with an early C port that still talks to the Python `b2j` dict; phase 2 also swaps in a C `__chain_b`; phase 3 disables the codepoint-keyed lookup tables on top of the current code; phase 4 disables only the phase-5-specific changes (bytes fast-path and the UCS1 max-codepoint shortcut). All five measurements run on the same release-mode in-tree CPython `main` build, with pyperf (20 processes × 3 values per workload, default warmup, system tuned via `pyperf system tune`). All times are pyperf mean ± stddev for `SequenceMatcher(...).ratio() + .get_opcodes()`:

| case | pure-Py | phase 1 | phase 2 | phase 3 | phase 4 | phase 5 (final) |
|---|---|---|---|---|---|---|
| str_2k_similar | 908 ± 24 us | 391 ± 12 us | 268 ± 12 us | 228 ± 6 us | 132 ± 2 us | 133 ± 8 us |
| str_8k_similar | 3.73 ± 0.08 ms | 1.61 ± 0.05 ms | 1.12 ± 0.06 ms | 988 ± 38 us | 619 ± 23 us | 628 ± 18 us |
| lines_difflib | 5.64 ± 0.04 ms | 1.29 ± 0.02 ms | 1.25 ± 0.01 ms | 1.03 ± 0.01 ms | 1.09 ± 0.01 ms | 1.09 ± 0.01 ms |
| ints_3k | 16.1 ± 0.0 ms | 3.31 ± 0.01 ms | 3.12 ± 0.02 ms | 2.22 ± 0.01 ms | 2.18 ± 0.02 ms | 2.18 ± 0.02 ms |
| lines_big | 25.0 ± 0.1 ms | 4.77 ± 0.03 ms | 4.46 ± 0.05 ms | 2.87 ± 0.03 ms | 2.84 ± 0.01 ms | 2.84 ± 0.02 ms |
| bytes_8k | 213 ± 0 ms | 23.7 ± 0.2 ms | 23.2 ± 0.2 ms | 15.0 ± 0.1 ms | 14.8 ± 0.1 ms | 14.5 ± 0.1 ms |

Speedup over pure-Python (means only, derived from the row above):

| case | phase 1 | phase 2 | phase 3 | phase 4 | phase 5 (final) |
|---|---:|---:|---:|---:|---:|
| str_2k_similar | 2.32× | 3.39× | 3.98× | 6.88× | 6.83× |
| str_8k_similar | 2.32× | 3.33× | 3.78× | 6.03× | 5.94× |
| lines_difflib | 4.37× | 4.51× | 5.48× | 5.17× | 5.17× |
| ints_3k | 4.86× | 5.16× | 7.25× | 7.39× | 7.39× |
| lines_big | 5.24× | 5.61× | 8.71× | 8.80× | 8.80× |
| bytes_8k | 8.99× | 9.18× | 14.20× | 14.39× | 14.69× |

Reading the table: each phase adds an optimization that monotonically helps the workloads it targets. Phase 3 (label-array DP + recursion-in-C) is the single largest jump on int/line workloads; phase 4 (codepoint-keyed tables for `str`) drives the doubling on `str` workloads; phase 5 contributes a measurable but smaller bytes win, and on the other workloads it's at the noise floor (within ±0.5 σ of phase 4). The pyperf standard deviations are <3% in every cell, so the small phase-4-vs-phase-5 swings on `str_2k`, `str_8k`, and `lines_difflib` are well inside measurement noise — not a regression.

## Benchmark hardware and software

All numbers in this issue were collected on the same machine and software configuration:

- **CPU**: Intel Xeon Silver 4410Y (Sapphire Rapids), 2 sockets × 12 cores × 2 threads = 48 hardware threads, 3.9 GHz max turbo (disabled for benchmarks).
- **RAM**: 256 GB.
- **OS**: Ubuntu 24.04.2 LTS, Linux 6.14.0.
- **Compiler**: GCC 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1).
- **System tuning**: `sudo pyperf system tune` (Turbo Boost disabled, `performance` CPU governor, ASLR off, IRQ affinity pinned).
- **pyperf**: 2.10.0.
- **Python**: CPython `main` at commit `5f4fbc10f68`, built `--enable-shared` release mode (no `--with-pydebug`), `-O3`. Python 3.12 numbers use the system Python 3.12.3 in a dedicated venv on the same host.

## Benchmark workloads

Six workloads, all comparing two sequences that are mostly identical with a small percentage of perturbations — the realistic case difflib is designed for:

- **str_2k_similar / str_8k_similar** — two random lowercase-ASCII strings of 2,000 and 8,000 characters; about 5% of positions randomly substituted. Character-level diff; `b2j` keys are 1-char strings.
- **lines_difflib** — `_pydifflib.py` as a list of ~2,100 lines vs the same list with every 50th line replaced by a comment. Line-level diff over Python source.
- **lines_big** — the same file replicated ×3 (~6,300 lines), every 80th line perturbed. Larger line-level diff with line repetition.
- **ints_3k** — list of 3,000 random ints in [0, 500] vs the same list with about 5% of positions randomly substituted. Tests the path where `b2j` keys are Python ints.
- **bytes_8k** — `bytes` of 8,000 random bytes vs the same with about 5% mutated. Tests the bytes fast-path; pure-Python pays a `PyLong_FromLong` per element when building `b2j`.

The benchmark measures end-to-end `SequenceMatcher(None, a, b)` followed by `s.ratio()` and `s.get_opcodes()` with `autojunk` enabled (the default).

## Ours vs the original pure-Python on CPython `main`

Measured on the in-tree CPython `main` (3.16-dev, release-mode, no `--with-pydebug`) where the patch lives, using pyperf (20 processes × 3 values per workload). Pure Python is `_pydifflib.SequenceMatcher` (the same code as the existing `difflib.SequenceMatcher`); the accelerated column is the C-backed `difflib.SequenceMatcher` after the shim picks up `_difflib`:

| case | pure-Py (mean ± std) | accel (mean ± std) | speedup |
|---|---:|---:|---:|
| str_2k_similar | 908 ± 24 us | 133 ± 8 us | 6.83× |
| str_8k_similar | 3.73 ± 0.08 ms | 628 ± 18 us | 5.94× |
| lines_difflib | 5.64 ± 0.04 ms | 1.09 ± 0.01 ms | 5.17× |
| ints_3k | 16.1 ± 0.0 ms | 2.18 ± 0.02 ms | 7.39× |
| lines_big | 25.0 ± 0.1 ms | 2.84 ± 0.02 ms | 8.80× |
| bytes_8k | 213 ± 0 ms | 14.5 ± 0.1 ms | 14.69× |

<details>
<summary>Bench script (pyperf, ours vs pure-Python on the in-tree CPython main build)</summary>

The script below picks the implementation via the `DIFFLIB_IMPL` env var, so the same script measures any combination of `py` (stdlib `difflib`), `pyours` (the pure-Python `_pydifflib`), `ours` (the in-tree `_difflib` C accelerator), `cdifflib`, or `cydifflib`. For the table above I ran it twice:

```bash
# Pure-Python reference (does not need the C accelerator installed).
DIFFLIB_IMPL=pyours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/main_pyours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

# C accelerator (drops _difflib.cpython-*.so into /tmp).
DIFFLIB_IMPL=ours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/main_ours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

# Statistical comparison via t-test on per-process means:
./python -m pyperf compare_to results/main_pyours.json results/main_ours.json
```

```python
"""pyperf benchmark of difflib.SequenceMatcher implementations.

Selects the implementation via the DIFFLIB_IMPL environment variable
(one of: py, pyours, ours, cdifflib, cydifflib).  This avoids needing
custom argparse arguments that pyperf would have to propagate to
worker processes.
"""

import os
import random
import string

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    rng = random.Random(42)
    s1 = ''.join(rng.choices(string.ascii_lowercase, k=2000))
    s2 = list(s1)
    for _ in range(100):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_2k_similar'] = (s1, ''.join(s2))

    s1 = ''.join(rng.choices(string.ascii_lowercase, k=8000))
    s2 = list(s1)
    for _ in range(400):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_8k_similar'] = (s1, ''.join(s2))

    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    rng = random.Random(7)
    la = [rng.randint(0, 500) for _ in range(3000)]
    lb = la[:]
    for _ in range(150):
        lb[rng.randrange(len(lb))] = rng.randint(0, 500)
    cases['ints_3k'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    rng = random.Random(99)
    ba = bytes(rng.choices(range(256), k=8000))
    bb = bytearray(ba)
    for _ in range(400):
        bb[rng.randrange(len(bb))] = rng.randrange(256)
    cases['bytes_8k'] = (ba, bytes(bb))
    return cases


def resolve_impl(name):
    if name == 'py':
        import difflib
        return difflib.SequenceMatcher
    if name == 'pyours':
        import _pydifflib
        return _pydifflib.SequenceMatcher
    if name == 'ours':
        import _difflib
        return _difflib.SequenceMatcher
    if name == 'cdifflib':
        import cdifflib
        return cdifflib.CSequenceMatcher
    if name == 'cydifflib':
        import cydifflib
        return cydifflib.SequenceMatcher
    raise ValueError(f"unknown impl {name!r}")


def make_workload(SM, a, b):
    def workload():
        s = SM(None, a, b)
        s.ratio()
        s.get_opcodes()
    return workload


def main():
    impl = os.environ['DIFFLIB_IMPL']
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')

    SM = resolve_impl(impl)
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['difflib_impl'] = impl
    runner.metadata['SM_class'] = f"{SM.__module__}.{SM.__qualname__}"

    for name, (a, b) in cases.items():
        runner.bench_func(name, make_workload(SM, a, b))


if __name__ == '__main__':
    main()
```

</details>

## Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

Two existing C accelerators on PyPI: `cdifflib` 1.2.9 (hand-written C) and `cydifflib` 1.2.0 (Cython-generated). The comparison is run on CPython **3.12** (not the in-tree 3.16-dev branch where the patch lives) because `cydifflib`'s wheel build is CMake-based and currently fails to configure against 3.16-dev — it cannot be tested on that runtime. To remove "different Python version" as a confound, I rebuilt the new accelerator against the system Python 3.12 headers and installed all four implementations (pure-Python, `cdifflib`, `cydifflib`, ours) into a single Python 3.12 venv so every column below shares the same interpreter, GC, and pure-Python baseline. All measurements use pyperf (20 processes × 3 values per workload, system tuned via `pyperf system tune`).

**Wall time (pyperf mean ± stddev), all on Python 3.12:**

| case | pure-Py | cdifflib 1.2.9 | cydifflib 1.2.0 | ours |
|---|---:|---:|---:|---:|
| str_2k_similar | 989 ± 21 us | 529 ± 13 us | 901 ± 14 us | 120 ± 4 us |
| str_8k_similar | 3.86 ± 0.05 ms | 2.19 ± 0.07 ms | 33.5 ± 0.0 ms | 520 ± 25 us |
| lines_difflib | 6.15 ± 0.08 ms | 2.74 ± 0.04 ms | 1.88 ± 0.03 ms | 1.08 ± 0.01 ms |
| ints_3k | 16.5 ± 0.1 ms | 8.71 ± 0.13 ms | 4.31 ± 0.09 ms | 2.39 ± 0.01 ms |
| lines_big | 25.8 ± 0.2 ms | 11.6 ± 0.1 ms | 21.2 ± 0.0 ms | 2.99 ± 0.04 ms |
| bytes_8k | 205 ± 1 ms | 127 ± 1 ms | 60.8 ± 0.1 ms | 14.8 ± 0.1 ms |

**Speedup over pure-Python (means only), all measured on Python 3.12:**

| case | cdifflib 1.2.9 | cydifflib 1.2.0 | ours |
|---|---:|---:|---:|
| str_2k_similar | 1.87× | 1.10× | 8.24× |
| str_8k_similar | 1.76× | 0.12× | 7.42× |
| lines_difflib | 2.24× | 3.27× | 5.69× |
| ints_3k | 1.89× | 3.83× | 6.90× |
| lines_big | 2.22× | 1.22× | 8.63× |
| bytes_8k | 1.61× | 3.37× | 13.85× |

<details>
<summary>Bench script (pyperf, 4-way comparison on CPython 3.12)</summary>

Same `bench_pyperf.py` as above; in the 3.12 venv where all four implementations are importable, set `DIFFLIB_IMPL` to one of `py`, `cdifflib`, `cydifflib`, `ours` and run once per impl:

```bash
for impl in py cdifflib cydifflib ours; do
    DIFFLIB_IMPL=$impl venv/bin/python bench_pyperf.py \
        -o results/py312_$impl.json \
        --inherit-environ=DIFFLIB_IMPL
done

# Compare any two with a t-test on the per-process means:
venv/bin/python -m pyperf compare_to results/py312_py.json results/py312_ours.json
```

</details>

Observations:

- **`cydifflib` regresses on char-level diff**: `str_8k_similar` runs at 0.12× of pure Python (≈8× *slower*). Reproducible across pyperf processes (stddev = 0 ms), not a measurement artifact. Looks like per-element Cython wrapping overhead is bad when elements are single codepoints.
- **`cdifflib` is consistent at ~1.6–2.2×**: a modest constant-factor win that targets `find_longest_match` and `get_matching_blocks` only, leaving `chain_b` in Python.
- **Ours is the fastest of the three in all six workloads**, by 1.4× (lines_difflib, where cydifflib is also competitive) to 4× (str_8k_similar) to 9× (bytes_8k) over the next-fastest in each row.

## What lets ours go further

`cdifflib` and `cydifflib` both keep the existing data shape: `b2j` is a `dict[elt, list[int]]`, and the inner loop does `PyDict_GetItem` on Python-object keys per outer `i`. The biggest single change in this work is restructuring the state itself:

1. Every distinct element of `b` is assigned a small integer label at `chain_b` time.
2. Position-indexed `int32_t` arrays (`a_lbl`, `b_lbl`, `a_dp`, `junk_mask`) replace the per-position object lookups.
3. The whole `get_matching_blocks` recursion (queue + sort + collapse) runs in C, with the DP loop and extension passes operating purely on those int32 arrays — no Python C-API calls in the hot path.
4. For `str` and `bytes`, codepoint-keyed lookup tables (`cp_full[256]` for UCS1 / `cp_full[max+1]` for wider strings; `cp_full[256]` for bytes) replace the dict-probe entirely. This is what makes the bytes case 13× rather than ~3×.

Neither `cdifflib` nor `cydifflib` does the integer-label transformation; that explains most of the remaining gap.

## Comparison vs GNU `diff(1)` on line workloads

GNU `diff` is a different program (and a different algorithm — Myers diff, O((n+m)D) where D is the edit distance, not Ratcliff-Obershelp); the comparison only makes sense on line-level inputs because GNU `diff` doesn't natively handle character, integer, or byte sequences. The GNU `diff` column includes its full real-world cost: process spawn + reading both files from disk + writing the unified diff to `/dev/null`. The Python columns measure the in-process work `difflib.unified_diff(a_lines, b_lines, lineterm='')` does (the same work `Tools/build/stable_abi.py`, doctest's failure output, and `unittest.assertMultiLineEqual` do under the hood). All numbers are pyperf mean ± stddev:

| workload | GNU diff 3.10 (subprocess) | pure-Py difflib | ours (C accel) |
|---|---:|---:|---:|
| lines_difflib (~2,100 lines) | 1.48 ± 0.01 ms | 6.16 ± 0.03 ms | 1.14 ± 0.01 ms |
| lines_big (~6,300 lines) | 2.50 ± 0.09 ms | 26.7 ± 0.2 ms | 2.46 ± 0.02 ms |
| lines_huge (~32,000 lines, 5%) | 13.5 ± 0.2 ms | 280 ± 1 ms | 24.6 ± 0.2 ms |

Reading the table:

- **Small/medium line diffs (≤ ~6k lines)**: ours is *competitive with or faster than* GNU `diff` end-to-end. GNU `diff`'s subprocess + file-IO floor (~1–2 ms) is the same order of magnitude as the actual diff work at this size, so the in-process Python+C path comes out ahead.
- **Large line diffs (32k lines)**: GNU `diff` pulls ahead, ~1.8× faster. That's Myers diff's asymptotic edge over Ratcliff-Obershelp showing up at scale. Pure-Python `difflib` is **20× slower than GNU diff** here; ours closes most of that gap but not all of it.

**Why the crossover, briefly**: GNU `diff` uses Myers diff — for "mostly identical" inputs, edit distance D is small so its O((n+m)D) behaves like O(n+m) — *linear* in n. Ratcliff-Obershelp is super-linear in expected case. Holding the perturbation rate constant, scaling n by 15× (2k → 32k lines) grows ours' time by ~22× and GNU diff's algorithm time by ~9×. Combined with GNU diff's fixed ~1 ms subprocess overhead, the crossover lands around **n ≈ 10–15k lines**.

This is well above the sizes CPython callers actually hit (doctest failures, `assertMultiLineEqual`, the 2,833-line stable-ABI manifest), so for real workloads we're in the "GNU diff loses to us because of IO floor" zone. The discussion thread linked at the top covers the asymptotic angle (suffix-automaton work for `autojunk=False`) — that's the orthogonal path that would close the remaining gap on >10k-line inputs.

<details>
<summary>Bench script (pyperf, GNU diff vs difflib.unified_diff)</summary>

The script selects the implementation via `DIFFLIB_IMPL` (`gnudiff`, `pyours`, or `ours`) so all three columns above share the same input generation and pyperf invocation pattern.

```bash
rm -f results/gnu_*.json
for impl in gnudiff pyours ours; do
    DIFFLIB_IMPL=$impl PYTHONPATH=/tmp ./python bench_gnudiff.py \
        -o results/gnu_$impl.json \
        --inherit-environ=DIFFLIB_IMPL,PYTHONPATH
done

# Compare any two via t-test on the per-process means:
./python -m pyperf compare_to results/gnu_pyours.json results/gnu_ours.json
```

```python
"""Compare GNU diff(1) vs difflib.unified_diff on identical line inputs.

GNU diff uses Myers diff, not Ratcliff-Obershelp -- the outputs are
structurally different, but the wall-time comparison is informative for
the line-level workloads (the only case GNU diff handles natively).
Char-level / bytes / int sequences aren't included because GNU diff
doesn't have a meaningful native equivalent.
"""

import os
import random
import subprocess
import tempfile

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    # ~32k lines, ~5% perturbation
    with open(pydifflib_path) as f:
        la = f.readlines() * 15
    lb = la[:]
    rng = random.Random(123)
    for _ in range(len(lb) // 20):
        i = rng.randrange(len(lb))
        lb[i] = f"# changed line {i}\n"
    cases['lines_huge'] = (la, lb)
    return cases


def make_temp_files(a_lines, b_lines):
    fa = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.a')
    fa.writelines(a_lines); fa.close()
    fb = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.b')
    fb.writelines(b_lines); fb.close()
    return fa.name, fb.name


def main():
    impl = os.environ['DIFFLIB_IMPL']  # gnudiff | pyours | ours
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['impl'] = impl

    if impl == 'gnudiff':
        # Write inputs once; bench the subprocess invocation.  This
        # deliberately includes process spawn + file IO + writing the
        # diff to /dev/null -- the real cost of using GNU diff.
        for name, (a, b) in cases.items():
            path_a, path_b = make_temp_files(a, b)
            def workload(_a=path_a, _b=path_b):
                # GNU diff exits 1 when files differ; that's success.
                subprocess.run(
                    ['/usr/bin/diff', '-u', _a, _b],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    check=False,
                )
            runner.bench_func(name, workload)
    else:
        if impl == 'pyours':
            import _pydifflib as mod
        elif impl == 'ours':
            import difflib as mod  # picks up _difflib via the shim
        else:
            raise SystemExit(f"unknown impl {impl!r}")
        unified_diff = mod.unified_diff
        for name, (a, b) in cases.items():
            def workload(_a=a, _b=b, _ud=unified_diff):
                # Materialize the generator to match the work GNU diff
                # does (it writes the whole diff out).
                list(_ud(_a, _b, lineterm=''))
            runner.bench_func(name, workload)


if __name__ == '__main__':
    main()
```

</details>

## Integration details

- **Build**: `Modules/_difflibmodule.c` (heap type, per-interpreter GIL supported, `Py_mod_gil = Py_MOD_GIL_NOT_USED` for free-threading, `PyType_GetModuleByDef` for subclass-safe state lookup), wired via `PY_STDLIB_MOD_SIMPLE([_difflib])` in `configure.ac` and a one-line entry in `Modules/Setup.stdlib.in`. `configure` regenerated via the standard `Tools/build/regen-configure.sh` (autoconf 2.72 docker image); global string tables regenerated via `make regen-all`.
- **Argument Clinic**: all seven methods (`__init__`, `set_seqs`, `set_seq1`, `set_seq2`, `find_longest_match`, `get_matching_blocks`, `get_opcodes`, `ratio`) use `/*[clinic input]` blocks; argument parsing is generated into `Modules/clinic/_difflibmodule.c.h`.
- **Documentation**: an `impl-detail` block added to `Doc/library/difflib.rst` mentions the C accelerator, the `_pydifflib` reference, bit-identical output, and the typical 5–25× speedup range.
- **Tests**: when the accelerator is present, `Lib/test/test_difflib.py` programmatically generates a `*_PurePython` variant of each existing TestCase that monkey-patches `difflib.SequenceMatcher` to `_pydifflib.SequenceMatcher` for the duration of the test (similar in spirit to `test_decimal`'s `C*` / `Py*` class pairs, but non-invasive — no test method changes). With the accelerator built, `test_difflib` runs 100 tests (61 C-path + 39 Py-path); without it, the 60 original tests run as before.
- **Source annotations**: the C file uses `[phase N]` comments throughout to map sections of code to the optimization phases described above, so a reviewer can grep `[phase 4]` to see exactly which code was added for the codepoint-keyed lookup tables, etc.

All 3,611 tests pass across `test_difflib`, `test_pyclbr`, `test_doctest`, `test_unittest`, `test_argparse`, `test_traceback`.

## Cost on real CPython workloads

For the CPython test suite as a whole, the impact is small — `assertEqual` only calls `ndiff` on failure, doctest only computes diffs on failure, `argparse`/`traceback`'s `get_close_matches` filters out 99% of pairs via `quick_ratio` before reaching `ratio()`. On the difflib-using test subset (~33s):

| test | stock | accel | speedup |
|---|---:|---:|---:|
| test_difflib | 1.41s | 0.49s | 2.9× |
| test_doctest | 3.55s | 3.29s | 1.1× |
| test_genericalias | 0.40s | 0.34s | 1.2× |
| (others within noise) | | | |

Where users feel it is during **development**: a failed `unittest.assertEqual` on a 2000-item list goes from 115 ms → 30 ms; doctest's `unified_diff` on a long mismatch goes from ~50 ms → ~4 ms; `Tools/build/stable_abi.py`'s ~2,800-line manifest diff goes from 43 ms → 4 ms (11×). Useful for iteration, not for green CI.

## Previous discussion

- [Algorithmic complexity of difflib](https://discuss.python.org/t/algorithmic-complexity-of-difflib/105844) — the suffix-automaton proposal ([#144132](https://github.com/python/cpython/issues/144132)). That work and this work are orthogonal: SAM addresses pathological / very-large-input asymptotic complexity; this addresses constant-factor cost on typical workloads.
- `cdifflib` and `cydifflib` on PyPI have demonstrated demand for a C-accelerated `SequenceMatcher` for years.

### Has this already been discussed elsewhere?

No response given

### Links to previous discussion of this feature:

_No response_


### Linked PRs
* gh-150188

phase	what changed	why
1	C port of `find_longest_match`; `j2len` replaced with paired int arrays + generation counter (no per-row clear)	Eliminates per-row dict allocation + per-`j` dict probe in the inner DP loop.
2	C port of `chain_b`; type-specialized iteration of `b` for `str`/`list`/`tuple`/`bytes`	`chain_b` became >50% of remaining cost after phase 1; ~25k `setdefault`+`append` in Python.
3	Full Ratcliff-Obershelp recursion in C; DP and extension passes operate on `int32` label arrays (`a_lbl`, `a_dp`, `b_lbl`, `junk_mask`)	Hundreds of FLM calls per `ratio()` each cross the Python/C boundary; extension passes did `PyObject_RichCompareBool` per element.
4	Codepoint-keyed `cp_full[]`/`cp_dp[]` arrays for `str` `b`; `str` `a` fast-path reads codepoint and indexes the array	For `str` inputs we paid `PyUnicode_FromOrdinal` + dict-probe per element on both `a` and `b`.
5	Bytes fast-path (cp arrays of size 256), persistent DP scratch (no per-call `alloca`/`memset`), skip max-cp scan for UCS1 strings	Bytes paid `PyLong_FromLong` per element under autojunk; pure-Python's per-int allocator costs dominated for the bytes workload.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a C accelerator for difflib.SequenceMatcher #150184

Feature or enhancement

Proposal:

Add a C accelerator for `difflib.SequenceMatcher`

Enhancement

Preserving the pure-Python implementation

Optimization phases

Benchmark hardware and software

Benchmark workloads

Ours vs the original pure-Python on CPython `main`

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

What lets ours go further

Comparison vs GNU `diff(1)` on line workloads

Integration details

Cost on real CPython workloads

Previous discussion

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

case	pure-Py	phase 1	phase 2	phase 3	phase 4	phase 5 (final)
str_2k_similar	908 ± 24 us	391 ± 12 us	268 ± 12 us	228 ± 6 us	132 ± 2 us	133 ± 8 us
str_8k_similar	3.73 ± 0.08 ms	1.61 ± 0.05 ms	1.12 ± 0.06 ms	988 ± 38 us	619 ± 23 us	628 ± 18 us
lines_difflib	5.64 ± 0.04 ms	1.29 ± 0.02 ms	1.25 ± 0.01 ms	1.03 ± 0.01 ms	1.09 ± 0.01 ms	1.09 ± 0.01 ms
ints_3k	16.1 ± 0.0 ms	3.31 ± 0.01 ms	3.12 ± 0.02 ms	2.22 ± 0.01 ms	2.18 ± 0.02 ms	2.18 ± 0.02 ms
lines_big	25.0 ± 0.1 ms	4.77 ± 0.03 ms	4.46 ± 0.05 ms	2.87 ± 0.03 ms	2.84 ± 0.01 ms	2.84 ± 0.02 ms
bytes_8k	213 ± 0 ms	23.7 ± 0.2 ms	23.2 ± 0.2 ms	15.0 ± 0.1 ms	14.8 ± 0.1 ms	14.5 ± 0.1 ms

case	phase 1	phase 2	phase 3	phase 4	phase 5 (final)
str_2k_similar	2.32×	3.39×	3.98×	6.88×	6.83×
str_8k_similar	2.32×	3.33×	3.78×	6.03×	5.94×
lines_difflib	4.37×	4.51×	5.48×	5.17×	5.17×
ints_3k	4.86×	5.16×	7.25×	7.39×	7.39×
lines_big	5.24×	5.61×	8.71×	8.80×	8.80×
bytes_8k	8.99×	9.18×	14.20×	14.39×	14.69×

case	pure-Py	cdifflib 1.2.9	cydifflib 1.2.0	ours
str_2k_similar	989 ± 21 us	529 ± 13 us	901 ± 14 us	120 ± 4 us
str_8k_similar	3.86 ± 0.05 ms	2.19 ± 0.07 ms	33.5 ± 0.0 ms	520 ± 25 us
lines_difflib	6.15 ± 0.08 ms	2.74 ± 0.04 ms	1.88 ± 0.03 ms	1.08 ± 0.01 ms
ints_3k	16.5 ± 0.1 ms	8.71 ± 0.13 ms	4.31 ± 0.09 ms	2.39 ± 0.01 ms
lines_big	25.8 ± 0.2 ms	11.6 ± 0.1 ms	21.2 ± 0.0 ms	2.99 ± 0.04 ms
bytes_8k	205 ± 1 ms	127 ± 1 ms	60.8 ± 0.1 ms	14.8 ± 0.1 ms

case	cdifflib 1.2.9	cydifflib 1.2.0	ours
str_2k_similar	1.87×	1.10×	8.24×
str_8k_similar	1.76×	0.12×	7.42×
lines_difflib	2.24×	3.27×	5.69×
ints_3k	1.89×	3.83×	6.90×
lines_big	2.22×	1.22×	8.63×
bytes_8k	1.61×	3.37×	13.85×

workload	GNU diff 3.10 (subprocess)	pure-Py difflib	ours (C accel)
lines_difflib (~2,100 lines)	1.48 ± 0.01 ms	6.16 ± 0.03 ms	1.14 ± 0.01 ms
lines_big (~6,300 lines)	2.50 ± 0.09 ms	26.7 ± 0.2 ms	2.46 ± 0.02 ms
lines_huge (~32,000 lines, 5%)	13.5 ± 0.2 ms	280 ± 1 ms	24.6 ± 0.2 ms

test	stock	accel	speedup
test_difflib	1.41s	0.49s	2.9×
test_doctest	3.55s	3.29s	1.1×
test_genericalias	0.40s	0.34s	1.2×
(others within noise)

Uh oh!

Add a C accelerator for difflib.SequenceMatcher #150184

Description

Feature or enhancement

Proposal:

Add a C accelerator for difflib.SequenceMatcher

Enhancement

Preserving the pure-Python implementation

Optimization phases

Benchmark hardware and software

Benchmark workloads

Ours vs the original pure-Python on CPython main

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

What lets ours go further

Comparison vs GNU diff(1) on line workloads

Integration details

Cost on real CPython workloads

Previous discussion

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs

Metadata

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

Issue actions

Add a C accelerator for `difflib.SequenceMatcher`

Ours vs the original pure-Python on CPython `main`

Comparison vs GNU `diff(1)` on line workloads