Skip to content

gh-70398: Fix HTMLParser crash when reset()/feed() is called from a handler#152040

Open
tonghuaroot wants to merge 3 commits into
python:mainfrom
tonghuaroot:fix-gh-70398-htmlparser-reset
Open

gh-70398: Fix HTMLParser crash when reset()/feed() is called from a handler#152040
tonghuaroot wants to merge 3 commits into
python:mainfrom
tonghuaroot:fix-gh-70398-htmlparser-reset

Conversation

@tonghuaroot

Copy link
Copy Markdown
Contributor

A bare assert must never be reachable from ordinary user code. Today,
calling reset() (or feed()) from inside an HTMLParser handler trips
an internal assertion. The fix is not about supporting mid-handler
reset() as a feature; it is about leaving the parser in a clean,
reusable state after a handler swaps the buffer, instead of crashing
while slicing data that no longer exists.

Root cause

goahead() caches the buffer it is parsing in a local variable:

def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        ...

It then loops, slicing rawdata[i:j] and calling handlers. The loop
assumes self.rawdata stays the same object for the whole call. A
handler breaks that assumption: reset() sets self.rawdata = '', and a
re-entrant feed() rebuilds self.rawdata. After such a call, goahead()
keeps walking the now-stale local rawdata with indices that no longer
match self.rawdata. Downstream this produces a regex match against the
wrong string, the result is None, and check_for_whole_start_tag()
hits assert match.

def check_for_whole_start_tag(self, i):
    rawdata = self.rawdata
    match = locatetagend.match(rawdata, i+1)
    assert match          # trips here
    j = match.end()

Why the assert is load-bearing, not a debug-only invariant

Under python -O the assert is compiled out. The None then
propagates one line further to j = match.end(), raising an unrelated,
harder-to-diagnose error from a different call site:

  File ".../html/parser.py", line ..., in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File ".../html/parser.py", line ..., in check_for_whole_start_tag
    j = match.end()
AttributeError: 'NoneType' object has no attribute 'end'

So the assert is acting as real control flow guarding a corrupt state,
not merely documenting an invariant. Honest caveat: in both modes the
failure is a raised exception at parse time (an AssertionError, or an
AttributeError under -O); it is not silent output corruption or an
infinite loop. But a parser should not raise an internal assertion in
response to a legitimate handler action, and -O turning one exception
type into a more confusing one confirms this code path needs an explicit
guard rather than an assertion.

The fix

Detect that self.rawdata is no longer the object goahead() started
with, and stop. Two checks cover the two points at which a handler can
run during the loop: the top of the loop (covers handlers invoked by the
tag/ref parsing helpers on the previous iteration) and immediately after
handle_data().

while i < n:
    if self.rawdata is not rawdata:
        # A handler called reset() or feed() and replaced the buffer
        # we are processing; the rest is handled by a later call.
        return
    ...
        if i < j:
            ...
            self.handle_data(...)
            if self.rawdata is not rawdata:
                # handle_data() called reset() or feed() and replaced the
                # buffer; stop processing the now-stale data.
                return

After reset(), parsing of the current data stops and the parser is back
to a fresh state, ready to be reused. After a re-entrant feed(), the
newly fed data is handled by the in-progress call as usual; the parser is
left usable for later feed()/close() calls.

This answers the original 2016 report (Hibou57): "I expected reset() to
stop the parser." It now does, cleanly, instead of crashing.

Note: a handler that calls feed() mid-parse re-emits the prefix it has
already consumed. That is pre-existing behavior and is left unchanged; the
tests assert only what matters there (no crash, the newly fed data is
parsed, and the parser stays reusable), not the exact duplicated event
sequence.

Compatibility

Normal parsing is unaffected: the new checks only fire when a handler
actually replaces self.rawdata. Output for inputs that do not call
reset()/feed() from a handler is byte-identical before and after,
including the default convert_charrefs=True path and chunked feeding.

Tests

test_reset_in_handler is parameterized over both convert_charrefs
branches (the rawdata.find('<') path and the self.interesting.search
path) and over every handler entry point that can run mid-parse:
handle_data, handle_starttag, handle_endtag, handle_comment, plus
handle_charref and handle_entityref (which only fire under
convert_charrefs=False). A new test_feed_in_handler covers a handler
calling feed() instead of reset(). Each case asserts: no crash, the
expected events up to the swap, rawdata == '' afterward, and that the
parser still parses a fresh document. Every new case crashes on the
unpatched parser and passes on the fixed one.

Refs gh-70398.

…ndler

goahead() reads the buffer to parse into a local variable and assumes
self.rawdata stays unchanged across handler calls.  When a handler
calls reset(), self.rawdata is replaced with an empty string while
goahead() keeps slicing the stale buffer, eventually tripping an
assertion in check_for_whole_start_tag().

Detect that the buffer was replaced (by reset() or a re-entrant feed())
and stop processing the now-stale data, leaving the parser in a clean,
reusable state.
Cover both convert_charrefs branches and every handler entry point
(data/starttag/endtag/comment, plus charref/entityref under
convert_charrefs=False), and add a test for a handler that calls feed()
rather than reset().  Use the support.subTests decorator to match the
surrounding tests.

Trim a redundant line from the top-of-loop comment in goahead().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant