Skip to content

Add a C accelerator for difflib.SequenceMatcher #150184

@blhsing

Description

@blhsing

Feature or enhancement

Proposal:

Disclaimer: I guided Claude Code to generate the code, the benchmarks and the summary below. I felt it's safe to use AI for this issue because it is purely a logical module port and does not touch any of the Python core.

Add a C accelerator for difflib.SequenceMatcher

Enhancement

Add a C accelerator module _difflib for difflib.SequenceMatcher, modeled after the decimal/_pydecimal/_decimal pattern. Output is bit-identical to the existing pure-Python implementation (including the documented leftmost-maximal-match tie-break); the win comes from constant-factor optimizations (the algorithm is still Ratcliff-Obershelp).

Across six realistic workloads, the accelerator is 5–15× faster than pure Python on CPython main (3.16-dev), and faster than both existing PyPI accelerators (cdifflib, cydifflib) on every workload tested.

Preserving the pure-Python implementation

The pure-Python SequenceMatcher is preserved verbatim as a reference/fallback. The layout mirrors decimal/datetime/pickle:

  • Lib/_pydifflib.py — the existing pure-Python implementation, no behavior change. Self-contained; alternative Python implementations can use it directly.
  • Lib/difflib.py — a thin shim:
    • from _pydifflib import * brings in everything (Differ, HtmlDiff, unified_diff, ndiff, get_close_matches, …)
    • re-exports a handful of private helpers (_calculate_ratio, _mdiff, _format_range_unified, _format_range_context) that existing tests reference
    • lazy from _pydifflib import can_colorize, get_theme preserves the existing lazy-import contract that LazyImportTest enforces while keeping the names visible as runtime attributes (which test_pyclbr.test_easy checks)
    • try: from _difflib import SequenceMatcher + subclasses it to inherit the slow-path methods (quick_ratio, real_quick_ratio, get_grouped_opcodes) from the pure-Python class
    • rebinds _pydifflib.SequenceMatcher to the C-accelerated subclass after defining it, so the helper functions in _pydifflib (which look up SequenceMatcher in their own module globals) transparently pick up the speedup. Without this rebind, difflib.unified_diff and friends would see zero benefit from the accelerator.

When _difflib is unavailable, the shim is a no-op and SequenceMatcher stays the pure-Python class.

Optimization phases

Each phase preserves the algorithm; each row shows what changed and why.

phase what changed why
1 C port of find_longest_match; j2len replaced with paired int arrays + generation counter (no per-row clear) Eliminates per-row dict allocation + per-j dict probe in the inner DP loop.
2 C port of chain_b; type-specialized iteration of b for str/list/tuple/bytes chain_b became >50% of remaining cost after phase 1; ~25k setdefault+append in Python.
3 Full Ratcliff-Obershelp recursion in C; DP and extension passes operate on int32 label arrays (a_lbl, a_dp, b_lbl, junk_mask) Hundreds of FLM calls per ratio() each cross the Python/C boundary; extension passes did PyObject_RichCompareBool per element.
4 Codepoint-keyed cp_full[]/cp_dp[] arrays for str b; str a fast-path reads codepoint and indexes the array For str inputs we paid PyUnicode_FromOrdinal + dict-probe per element on both a and b.
5 Bytes fast-path (cp arrays of size 256), persistent DP scratch (no per-call alloca/memset), skip max-cp scan for UCS1 strings Bytes paid PyLong_FromLong per element under autojunk; pure-Python's per-int allocator costs dominated for the bytes workload.

Phase 1 is reproduced by monkey-patching _pydifflib.SequenceMatcher.find_longest_match with an early C port that still talks to the Python b2j dict; phase 2 also swaps in a C __chain_b; phase 3 disables the codepoint-keyed lookup tables on top of the current code; phase 4 disables only the phase-5-specific changes (bytes fast-path and the UCS1 max-codepoint shortcut). All five measurements run on the same release-mode in-tree CPython main build, with pyperf (20 processes × 3 values per workload, default warmup, system tuned via pyperf system tune). All times are pyperf mean ± stddev for SequenceMatcher(...).ratio() + .get_opcodes():

case pure-Py phase 1 phase 2 phase 3 phase 4 phase 5 (final)
str_2k_similar 908 ± 24 us 391 ± 12 us 268 ± 12 us 228 ± 6 us 132 ± 2 us 133 ± 8 us
str_8k_similar 3.73 ± 0.08 ms 1.61 ± 0.05 ms 1.12 ± 0.06 ms 988 ± 38 us 619 ± 23 us 628 ± 18 us
lines_difflib 5.64 ± 0.04 ms 1.29 ± 0.02 ms 1.25 ± 0.01 ms 1.03 ± 0.01 ms 1.09 ± 0.01 ms 1.09 ± 0.01 ms
ints_3k 16.1 ± 0.0 ms 3.31 ± 0.01 ms 3.12 ± 0.02 ms 2.22 ± 0.01 ms 2.18 ± 0.02 ms 2.18 ± 0.02 ms
lines_big 25.0 ± 0.1 ms 4.77 ± 0.03 ms 4.46 ± 0.05 ms 2.87 ± 0.03 ms 2.84 ± 0.01 ms 2.84 ± 0.02 ms
bytes_8k 213 ± 0 ms 23.7 ± 0.2 ms 23.2 ± 0.2 ms 15.0 ± 0.1 ms 14.8 ± 0.1 ms 14.5 ± 0.1 ms

Speedup over pure-Python (means only, derived from the row above):

case phase 1 phase 2 phase 3 phase 4 phase 5 (final)
str_2k_similar 2.32× 3.39× 3.98× 6.88× 6.83×
str_8k_similar 2.32× 3.33× 3.78× 6.03× 5.94×
lines_difflib 4.37× 4.51× 5.48× 5.17× 5.17×
ints_3k 4.86× 5.16× 7.25× 7.39× 7.39×
lines_big 5.24× 5.61× 8.71× 8.80× 8.80×
bytes_8k 8.99× 9.18× 14.20× 14.39× 14.69×

Reading the table: each phase adds an optimization that monotonically helps the workloads it targets. Phase 3 (label-array DP + recursion-in-C) is the single largest jump on int/line workloads; phase 4 (codepoint-keyed tables for str) drives the doubling on str workloads; phase 5 contributes a measurable but smaller bytes win, and on the other workloads it's at the noise floor (within ±0.5 σ of phase 4). The pyperf standard deviations are <3% in every cell, so the small phase-4-vs-phase-5 swings on str_2k, str_8k, and lines_difflib are well inside measurement noise — not a regression.

Benchmark hardware and software

All numbers in this issue were collected on the same machine and software configuration:

  • CPU: Intel Xeon Silver 4410Y (Sapphire Rapids), 2 sockets × 12 cores × 2 threads = 48 hardware threads, 3.9 GHz max turbo (disabled for benchmarks).
  • RAM: 256 GB.
  • OS: Ubuntu 24.04.2 LTS, Linux 6.14.0.
  • Compiler: GCC 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1).
  • System tuning: sudo pyperf system tune (Turbo Boost disabled, performance CPU governor, ASLR off, IRQ affinity pinned).
  • pyperf: 2.10.0.
  • Python: CPython main at commit 5f4fbc10f68, built --enable-shared release mode (no --with-pydebug), -O3. Python 3.12 numbers use the system Python 3.12.3 in a dedicated venv on the same host.

Benchmark workloads

Six workloads, all comparing two sequences that are mostly identical with a small percentage of perturbations — the realistic case difflib is designed for:

  • str_2k_similar / str_8k_similar — two random lowercase-ASCII strings of 2,000 and 8,000 characters; about 5% of positions randomly substituted. Character-level diff; b2j keys are 1-char strings.
  • lines_difflib_pydifflib.py as a list of ~2,100 lines vs the same list with every 50th line replaced by a comment. Line-level diff over Python source.
  • lines_big — the same file replicated ×3 (~6,300 lines), every 80th line perturbed. Larger line-level diff with line repetition.
  • ints_3k — list of 3,000 random ints in [0, 500] vs the same list with about 5% of positions randomly substituted. Tests the path where b2j keys are Python ints.
  • bytes_8kbytes of 8,000 random bytes vs the same with about 5% mutated. Tests the bytes fast-path; pure-Python pays a PyLong_FromLong per element when building b2j.

The benchmark measures end-to-end SequenceMatcher(None, a, b) followed by s.ratio() and s.get_opcodes() with autojunk enabled (the default).

Ours vs the original pure-Python on CPython main

Measured on the in-tree CPython main (3.16-dev, release-mode, no --with-pydebug) where the patch lives, using pyperf (20 processes × 3 values per workload). Pure Python is _pydifflib.SequenceMatcher (the same code as the existing difflib.SequenceMatcher); the accelerated column is the C-backed difflib.SequenceMatcher after the shim picks up _difflib:

case pure-Py (mean ± std) accel (mean ± std) speedup
str_2k_similar 908 ± 24 us 133 ± 8 us 6.83×
str_8k_similar 3.73 ± 0.08 ms 628 ± 18 us 5.94×
lines_difflib 5.64 ± 0.04 ms 1.09 ± 0.01 ms 5.17×
ints_3k 16.1 ± 0.0 ms 2.18 ± 0.02 ms 7.39×
lines_big 25.0 ± 0.1 ms 2.84 ± 0.02 ms 8.80×
bytes_8k 213 ± 0 ms 14.5 ± 0.1 ms 14.69×
Bench script (pyperf, ours vs pure-Python on the in-tree CPython main build)

The script below picks the implementation via the DIFFLIB_IMPL env var, so the same script measures any combination of py (stdlib difflib), pyours (the pure-Python _pydifflib), ours (the in-tree _difflib C accelerator), cdifflib, or cydifflib. For the table above I ran it twice:

# Pure-Python reference (does not need the C accelerator installed).
DIFFLIB_IMPL=pyours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/main_pyours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

# C accelerator (drops _difflib.cpython-*.so into /tmp).
DIFFLIB_IMPL=ours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/main_ours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

# Statistical comparison via t-test on per-process means:
./python -m pyperf compare_to results/main_pyours.json results/main_ours.json
"""pyperf benchmark of difflib.SequenceMatcher implementations.

Selects the implementation via the DIFFLIB_IMPL environment variable
(one of: py, pyours, ours, cdifflib, cydifflib).  This avoids needing
custom argparse arguments that pyperf would have to propagate to
worker processes.
"""

import os
import random
import string

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    rng = random.Random(42)
    s1 = ''.join(rng.choices(string.ascii_lowercase, k=2000))
    s2 = list(s1)
    for _ in range(100):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_2k_similar'] = (s1, ''.join(s2))

    s1 = ''.join(rng.choices(string.ascii_lowercase, k=8000))
    s2 = list(s1)
    for _ in range(400):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_8k_similar'] = (s1, ''.join(s2))

    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    rng = random.Random(7)
    la = [rng.randint(0, 500) for _ in range(3000)]
    lb = la[:]
    for _ in range(150):
        lb[rng.randrange(len(lb))] = rng.randint(0, 500)
    cases['ints_3k'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    rng = random.Random(99)
    ba = bytes(rng.choices(range(256), k=8000))
    bb = bytearray(ba)
    for _ in range(400):
        bb[rng.randrange(len(bb))] = rng.randrange(256)
    cases['bytes_8k'] = (ba, bytes(bb))
    return cases


def resolve_impl(name):
    if name == 'py':
        import difflib
        return difflib.SequenceMatcher
    if name == 'pyours':
        import _pydifflib
        return _pydifflib.SequenceMatcher
    if name == 'ours':
        import _difflib
        return _difflib.SequenceMatcher
    if name == 'cdifflib':
        import cdifflib
        return cdifflib.CSequenceMatcher
    if name == 'cydifflib':
        import cydifflib
        return cydifflib.SequenceMatcher
    raise ValueError(f"unknown impl {name!r}")


def make_workload(SM, a, b):
    def workload():
        s = SM(None, a, b)
        s.ratio()
        s.get_opcodes()
    return workload


def main():
    impl = os.environ['DIFFLIB_IMPL']
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')

    SM = resolve_impl(impl)
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['difflib_impl'] = impl
    runner.metadata['SM_class'] = f"{SM.__module__}.{SM.__qualname__}"

    for name, (a, b) in cases.items():
        runner.bench_func(name, make_workload(SM, a, b))


if __name__ == '__main__':
    main()

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

Two existing C accelerators on PyPI: cdifflib 1.2.9 (hand-written C) and cydifflib 1.2.0 (Cython-generated). The comparison is run on CPython 3.12 (not the in-tree 3.16-dev branch where the patch lives) because cydifflib's wheel build is CMake-based and currently fails to configure against 3.16-dev — it cannot be tested on that runtime. To remove "different Python version" as a confound, I rebuilt the new accelerator against the system Python 3.12 headers and installed all four implementations (pure-Python, cdifflib, cydifflib, ours) into a single Python 3.12 venv so every column below shares the same interpreter, GC, and pure-Python baseline. All measurements use pyperf (20 processes × 3 values per workload, system tuned via pyperf system tune).

Wall time (pyperf mean ± stddev), all on Python 3.12:

case pure-Py cdifflib 1.2.9 cydifflib 1.2.0 ours
str_2k_similar 989 ± 21 us 529 ± 13 us 901 ± 14 us 120 ± 4 us
str_8k_similar 3.86 ± 0.05 ms 2.19 ± 0.07 ms 33.5 ± 0.0 ms 520 ± 25 us
lines_difflib 6.15 ± 0.08 ms 2.74 ± 0.04 ms 1.88 ± 0.03 ms 1.08 ± 0.01 ms
ints_3k 16.5 ± 0.1 ms 8.71 ± 0.13 ms 4.31 ± 0.09 ms 2.39 ± 0.01 ms
lines_big 25.8 ± 0.2 ms 11.6 ± 0.1 ms 21.2 ± 0.0 ms 2.99 ± 0.04 ms
bytes_8k 205 ± 1 ms 127 ± 1 ms 60.8 ± 0.1 ms 14.8 ± 0.1 ms

Speedup over pure-Python (means only), all measured on Python 3.12:

case cdifflib 1.2.9 cydifflib 1.2.0 ours
str_2k_similar 1.87× 1.10× 8.24×
str_8k_similar 1.76× 0.12× 7.42×
lines_difflib 2.24× 3.27× 5.69×
ints_3k 1.89× 3.83× 6.90×
lines_big 2.22× 1.22× 8.63×
bytes_8k 1.61× 3.37× 13.85×
Bench script (pyperf, 4-way comparison on CPython 3.12)

Same bench_pyperf.py as above; in the 3.12 venv where all four implementations are importable, set DIFFLIB_IMPL to one of py, cdifflib, cydifflib, ours and run once per impl:

for impl in py cdifflib cydifflib ours; do
    DIFFLIB_IMPL=$impl venv/bin/python bench_pyperf.py \
        -o results/py312_$impl.json \
        --inherit-environ=DIFFLIB_IMPL
done

# Compare any two with a t-test on the per-process means:
venv/bin/python -m pyperf compare_to results/py312_py.json results/py312_ours.json

Observations:

  • cydifflib regresses on char-level diff: str_8k_similar runs at 0.12× of pure Python (≈8× slower). Reproducible across pyperf processes (stddev = 0 ms), not a measurement artifact. Looks like per-element Cython wrapping overhead is bad when elements are single codepoints.
  • cdifflib is consistent at ~1.6–2.2×: a modest constant-factor win that targets find_longest_match and get_matching_blocks only, leaving chain_b in Python.
  • Ours is the fastest of the three in all six workloads, by 1.4× (lines_difflib, where cydifflib is also competitive) to 4× (str_8k_similar) to 9× (bytes_8k) over the next-fastest in each row.

What lets ours go further

cdifflib and cydifflib both keep the existing data shape: b2j is a dict[elt, list[int]], and the inner loop does PyDict_GetItem on Python-object keys per outer i. The biggest single change in this work is restructuring the state itself:

  1. Every distinct element of b is assigned a small integer label at chain_b time.
  2. Position-indexed int32_t arrays (a_lbl, b_lbl, a_dp, junk_mask) replace the per-position object lookups.
  3. The whole get_matching_blocks recursion (queue + sort + collapse) runs in C, with the DP loop and extension passes operating purely on those int32 arrays — no Python C-API calls in the hot path.
  4. For str and bytes, codepoint-keyed lookup tables (cp_full[256] for UCS1 / cp_full[max+1] for wider strings; cp_full[256] for bytes) replace the dict-probe entirely. This is what makes the bytes case 13× rather than ~3×.

Neither cdifflib nor cydifflib does the integer-label transformation; that explains most of the remaining gap.

Comparison vs GNU diff(1) on line workloads

GNU diff is a different program (and a different algorithm — Myers diff, O((n+m)D) where D is the edit distance, not Ratcliff-Obershelp); the comparison only makes sense on line-level inputs because GNU diff doesn't natively handle character, integer, or byte sequences. The GNU diff column includes its full real-world cost: process spawn + reading both files from disk + writing the unified diff to /dev/null. The Python columns measure the in-process work difflib.unified_diff(a_lines, b_lines, lineterm='') does (the same work Tools/build/stable_abi.py, doctest's failure output, and unittest.assertMultiLineEqual do under the hood). All numbers are pyperf mean ± stddev:

workload GNU diff 3.10 (subprocess) pure-Py difflib ours (C accel)
lines_difflib (~2,100 lines) 1.48 ± 0.01 ms 6.16 ± 0.03 ms 1.14 ± 0.01 ms
lines_big (~6,300 lines) 2.50 ± 0.09 ms 26.7 ± 0.2 ms 2.46 ± 0.02 ms
lines_huge (~32,000 lines, 5%) 13.5 ± 0.2 ms 280 ± 1 ms 24.6 ± 0.2 ms

Reading the table:

  • Small/medium line diffs (≤ ~6k lines): ours is competitive with or faster than GNU diff end-to-end. GNU diff's subprocess + file-IO floor (~1–2 ms) is the same order of magnitude as the actual diff work at this size, so the in-process Python+C path comes out ahead.
  • Large line diffs (32k lines): GNU diff pulls ahead, ~1.8× faster. That's Myers diff's asymptotic edge over Ratcliff-Obershelp showing up at scale. Pure-Python difflib is 20× slower than GNU diff here; ours closes most of that gap but not all of it.

Why the crossover, briefly: GNU diff uses Myers diff — for "mostly identical" inputs, edit distance D is small so its O((n+m)D) behaves like O(n+m) — linear in n. Ratcliff-Obershelp is super-linear in expected case. Holding the perturbation rate constant, scaling n by 15× (2k → 32k lines) grows ours' time by ~22× and GNU diff's algorithm time by ~9×. Combined with GNU diff's fixed ~1 ms subprocess overhead, the crossover lands around n ≈ 10–15k lines.

This is well above the sizes CPython callers actually hit (doctest failures, assertMultiLineEqual, the 2,833-line stable-ABI manifest), so for real workloads we're in the "GNU diff loses to us because of IO floor" zone. The discussion thread linked at the top covers the asymptotic angle (suffix-automaton work for autojunk=False) — that's the orthogonal path that would close the remaining gap on >10k-line inputs.

Bench script (pyperf, GNU diff vs difflib.unified_diff)

The script selects the implementation via DIFFLIB_IMPL (gnudiff, pyours, or ours) so all three columns above share the same input generation and pyperf invocation pattern.

rm -f results/gnu_*.json
for impl in gnudiff pyours ours; do
    DIFFLIB_IMPL=$impl PYTHONPATH=/tmp ./python bench_gnudiff.py \
        -o results/gnu_$impl.json \
        --inherit-environ=DIFFLIB_IMPL,PYTHONPATH
done

# Compare any two via t-test on the per-process means:
./python -m pyperf compare_to results/gnu_pyours.json results/gnu_ours.json
"""Compare GNU diff(1) vs difflib.unified_diff on identical line inputs.

GNU diff uses Myers diff, not Ratcliff-Obershelp -- the outputs are
structurally different, but the wall-time comparison is informative for
the line-level workloads (the only case GNU diff handles natively).
Char-level / bytes / int sequences aren't included because GNU diff
doesn't have a meaningful native equivalent.
"""

import os
import random
import subprocess
import tempfile

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    # ~32k lines, ~5% perturbation
    with open(pydifflib_path) as f:
        la = f.readlines() * 15
    lb = la[:]
    rng = random.Random(123)
    for _ in range(len(lb) // 20):
        i = rng.randrange(len(lb))
        lb[i] = f"# changed line {i}\n"
    cases['lines_huge'] = (la, lb)
    return cases


def make_temp_files(a_lines, b_lines):
    fa = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.a')
    fa.writelines(a_lines); fa.close()
    fb = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.b')
    fb.writelines(b_lines); fb.close()
    return fa.name, fb.name


def main():
    impl = os.environ['DIFFLIB_IMPL']  # gnudiff | pyours | ours
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['impl'] = impl

    if impl == 'gnudiff':
        # Write inputs once; bench the subprocess invocation.  This
        # deliberately includes process spawn + file IO + writing the
        # diff to /dev/null -- the real cost of using GNU diff.
        for name, (a, b) in cases.items():
            path_a, path_b = make_temp_files(a, b)
            def workload(_a=path_a, _b=path_b):
                # GNU diff exits 1 when files differ; that's success.
                subprocess.run(
                    ['/usr/bin/diff', '-u', _a, _b],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    check=False,
                )
            runner.bench_func(name, workload)
    else:
        if impl == 'pyours':
            import _pydifflib as mod
        elif impl == 'ours':
            import difflib as mod  # picks up _difflib via the shim
        else:
            raise SystemExit(f"unknown impl {impl!r}")
        unified_diff = mod.unified_diff
        for name, (a, b) in cases.items():
            def workload(_a=a, _b=b, _ud=unified_diff):
                # Materialize the generator to match the work GNU diff
                # does (it writes the whole diff out).
                list(_ud(_a, _b, lineterm=''))
            runner.bench_func(name, workload)


if __name__ == '__main__':
    main()

Integration details

  • Build: Modules/_difflibmodule.c (heap type, per-interpreter GIL supported, Py_mod_gil = Py_MOD_GIL_NOT_USED for free-threading, PyType_GetModuleByDef for subclass-safe state lookup), wired via PY_STDLIB_MOD_SIMPLE([_difflib]) in configure.ac and a one-line entry in Modules/Setup.stdlib.in. configure regenerated via the standard Tools/build/regen-configure.sh (autoconf 2.72 docker image); global string tables regenerated via make regen-all.
  • Argument Clinic: all seven methods (__init__, set_seqs, set_seq1, set_seq2, find_longest_match, get_matching_blocks, get_opcodes, ratio) use /*[clinic input] blocks; argument parsing is generated into Modules/clinic/_difflibmodule.c.h.
  • Documentation: an impl-detail block added to Doc/library/difflib.rst mentions the C accelerator, the _pydifflib reference, bit-identical output, and the typical 5–25× speedup range.
  • Tests: when the accelerator is present, Lib/test/test_difflib.py programmatically generates a *_PurePython variant of each existing TestCase that monkey-patches difflib.SequenceMatcher to _pydifflib.SequenceMatcher for the duration of the test (similar in spirit to test_decimal's C* / Py* class pairs, but non-invasive — no test method changes). With the accelerator built, test_difflib runs 100 tests (61 C-path + 39 Py-path); without it, the 60 original tests run as before.
  • Source annotations: the C file uses [phase N] comments throughout to map sections of code to the optimization phases described above, so a reviewer can grep [phase 4] to see exactly which code was added for the codepoint-keyed lookup tables, etc.

All 3,611 tests pass across test_difflib, test_pyclbr, test_doctest, test_unittest, test_argparse, test_traceback.

Cost on real CPython workloads

For the CPython test suite as a whole, the impact is small — assertEqual only calls ndiff on failure, doctest only computes diffs on failure, argparse/traceback's get_close_matches filters out 99% of pairs via quick_ratio before reaching ratio(). On the difflib-using test subset (~33s):

test stock accel speedup
test_difflib 1.41s 0.49s 2.9×
test_doctest 3.55s 3.29s 1.1×
test_genericalias 0.40s 0.34s 1.2×
(others within noise)

Where users feel it is during development: a failed unittest.assertEqual on a 2000-item list goes from 115 ms → 30 ms; doctest's unified_diff on a long mismatch goes from ~50 ms → ~4 ms; Tools/build/stable_abi.py's ~2,800-line manifest diff goes from 43 ms → 4 ms (11×). Useful for iteration, not for green CI.

Previous discussion

  • Algorithmic complexity of difflib — the suffix-automaton proposal (#144132). That work and this work are orthogonal: SAM addresses pathological / very-large-input asymptotic complexity; this addresses constant-factor cost on typical workloads.
  • cdifflib and cydifflib on PyPI have demonstrated demand for a C-accelerated SequenceMatcher for years.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagetype-featureA feature request or enhancement
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions