gh-151024: Add a C accelerator for html.escape() and html.unescape() #151025
Draft
gaborbernat wants to merge 4 commits into
…ape() escape() scans 1-byte strings word-at-a-time (SWAR) to skip runs with no special character eight bytes at a time and returns the input unchanged when nothing needs escaping. unescape() replaces the regex plus per-match Python callback with a single C pass that bulk-copies the text between references and binary-searches generated HTML5 named-reference and numeric-charref tables. The pure-Python implementations remain as the PEP 399 fallback. The new _html module has no mutable state, declares free-threading support (Py_MOD_GIL_NOT_USED) and a per-interpreter GIL, and is exercised against both implementations plus the WHATWG entities.json and html5lib datasets. Modules/html_entities.h is generated from html.entities by Tools/build/generate_html_entities.py (make regen-html).
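The word-at-a-time scan described above can be illustrated in Python by emulating the eight `uint64_t` byte lanes with plain integers. This is only a sketch of the `haszero` technique under the assumption of one-byte (Latin-1) input, not the C code in this PR; `has_zero_byte`, `word_has_special`, `first_special`, and `escape_if_needed` are hypothetical names chosen for the example.

```python
import html

ONES = 0x0101010101010101    # 0x01 in every byte lane
HIGHS = 0x8080808080808080   # 0x80 in every byte lane
SPECIALS = b"&<>\"'"         # bytes that html.escape() rewrites


def has_zero_byte(word: int) -> bool:
    """Return True if any of the eight byte lanes of `word` is 0x00."""
    # Classic haszero test; the AND with HIGHS keeps only the low 64 bits.
    return ((word - ONES) & ~word & HIGHS) != 0


def word_has_special(word: int) -> bool:
    """Return True if any lane of `word` equals one of the special bytes."""
    # XOR with the broadcast byte turns a matching lane into 0x00.
    return any(has_zero_byte(word ^ (ch * ONES)) for ch in SPECIALS)


def first_special(data: bytes) -> int:
    """Index of the first byte needing escaping, or -1 if there is none."""
    i, n = 0, len(data)
    # Skip eight clean bytes per step.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word_has_special(word):
            break
        i += 8
    # Scalar pass: pinpoints the hit inside a flagged word and covers the tail.
    while i < n:
        if data[i] in SPECIALS:
            return i
        i += 1
    return -1


def escape_if_needed(s: str) -> str:
    """Return `s` itself when it is clean, else defer to html.escape()."""
    data = s.encode("latin-1")  # assumption: one-byte strings only
    if first_special(data) < 0:
        return s
    return html.escape(s)
```

The clean-input case returns the very same string object, which mirrors the "returns the input unchanged" behavior the commit message describes; anything else is handed to the existing escaper.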
`html.escape` and `html.unescape` are pure Python. `escape` makes up to five sequential `str.replace` passes; `unescape` runs a regex with a per-match Python callback over the 2231-entry HTML5 entity table. Both are on hot paths: `html.parser.HTMLParser` calls `unescape` on every data run and attribute value, and `escape` is the standard HTML-output escaper.

What changes
This adds `_html`, a C accelerator, keeping the pure-Python implementations as the PEP 399 fallback.

`escape` scans one-byte strings word-at-a-time (SWAR): it loads eight bytes into a `uint64_t` and tests all lanes with `haszero`-style integer ops, advancing eight safe bytes per step and returning the input unchanged when nothing needs escaping. These are the same `0x0101…`/`0x8080…` masks CPython already uses for ASCII scanning in `unicodeobject.c` and `find_max_char.h`. UCS-2/UCS-4 strings use a scalar pass.

`unescape` makes a single C pass: it bulk-copies the text between `&` references and binary-searches generated named-reference and numeric-charref tables, applying the numeric character reference rules from the spec. The tables live in `Modules/html_entities.h`, generated from `html.entities` by `Tools/build/generate_html_entities.py` (`make regen-html`).

The module holds no mutable state (immutable `str` inputs, read-only tables), so it declares `Py_MOD_GIL_NOT_USED` and per-interpreter GIL support.

Benchmarks
pyperf, C versus the pure-Python fallback on the same interpreter:

- `escape`, clean paragraph
- `escape`, comment body
- `escape`, attribute value
- `escape`, user HTML block
- `unescape`, decode entities
- `unescape`, scraped title
- `unescape`, no entities
- `HTMLParser.feed`, 70 KB page

`unescape` is ~4-5x faster on its own; inside `HTMLParser` it shares the work with tokenization, so a full parse is 1.26x faster. In the top-1000 PyPI packages, 119 (15.5%) use the affected public API directly or through `HTMLParser`, among them airflow, spark, ruff, aws-cli, botocore, and the Azure SDK.

Correctness
Output matches the pure-Python implementation across the WHATWG `entities.json` (2231 named references), the html5lib-tests tokenizer entity suite, a 24k-case differential fuzz, and `test_html`, which now runs against both implementations.
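A differential check of the kind described above could look roughly like the following. This is a minimal sketch, not the PR's test code: `differential_check` and `make_cases` are hypothetical helpers, the hand-picked edge cases and fuzz alphabet are assumptions, and because `_html` is not available in released CPython the usage below passes `html.unescape` as a stand-in for the candidate.

```python
import html
import html.entities
import random


def differential_check(reference, candidate, cases):
    """Return every input on which the two callables disagree."""
    return [case for case in cases if reference(case) != candidate(case)]


def make_cases(seed=0, n_random=1000):
    """Build inputs: every named reference, a few edge cases, random noise."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    # One case per entry in the stdlib HTML5 table (names with and without ';').
    cases = ["&" + name for name in html.entities.html5]
    # Hand-picked numeric and malformed references (illustrative only).
    cases += ["&#0;", "&#x80;", "&#xD800;", "&#x110000;", "&#65", "&notit;", "&&amp;"]
    # Random strings drawn from characters that matter to the tokenizer.
    alphabet = "&#;xX0123456789abcdefAMPltgq <>\"'"
    for _ in range(n_random):
        length = rng.randint(0, 24)
        cases.append("".join(rng.choice(alphabet) for _ in range(length)))
    return cases


if __name__ == "__main__":
    cases = make_cases()
    # Stand-in candidate: swap in the accelerated function where it exists.
    mismatches = differential_check(html.unescape, html.unescape, cases)
    print(f"{len(cases)} cases, {len(mismatches)} mismatches")
```

Returning the disagreeing inputs, rather than a bare pass/fail flag, makes it easy to turn each mismatch into a regression test.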
Resolves #151024.