gh-70398: Fix HTMLParser crash when reset()/feed() is called from a handler#152040
Open
tonghuaroot wants to merge 3 commits into
Open
gh-70398: Fix HTMLParser crash when reset()/feed() is called from a handler#152040tonghuaroot wants to merge 3 commits into
tonghuaroot wants to merge 3 commits into
Conversation
…ndler goahead() reads the buffer to parse into a local variable and assumes self.rawdata stays unchanged across handler calls. When a handler calls reset(), self.rawdata is replaced with an empty string while goahead() keeps slicing the stale buffer, eventually tripping an assertion in check_for_whole_start_tag(). Detect that the buffer was replaced (by reset() or a re-entrant feed()) and stop processing the now-stale data, leaving the parser in a clean, reusable state.
Cover both convert_charrefs branches and every handler entry point (data/starttag/endtag/comment, plus charref/entityref under convert_charrefs=False), and add a test for a handler that calls feed() rather than reset(). Use the support.subTests decorator to match the surrounding tests. Trim a redundant line from the top-of-loop comment in goahead().
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A bare
assertmust never be reachable from ordinary user code. Today,calling
reset()(orfeed()) from inside anHTMLParserhandler tripsan internal assertion. The fix is not about supporting mid-handler
reset()as a feature; it is about leaving the parser in a clean,reusable state after a handler swaps the buffer, instead of crashing
while slicing data that no longer exists.
Root cause
goahead()caches the buffer it is parsing in a local variable:It then loops, slicing
rawdata[i:j]and calling handlers. The loopassumes
self.rawdatastays the same object for the whole call. Ahandler breaks that assumption:
reset()setsself.rawdata = '', and are-entrant
feed()rebuildsself.rawdata. After such a call,goahead()keeps walking the now-stale local
rawdatawith indices that no longermatch
self.rawdata. Downstream this produces a regex match against thewrong string, the result is
None, andcheck_for_whole_start_tag()hits
assert match.Why the assert is load-bearing, not a debug-only invariant
Under
python -Otheassertis compiled out. TheNonethenpropagates one line further to
j = match.end(), raising an unrelated,harder-to-diagnose error from a different call site:
So the
assertis acting as real control flow guarding a corrupt state,not merely documenting an invariant. Honest caveat: in both modes the
failure is a raised exception at parse time (an
AssertionError, or anAttributeErrorunder-O); it is not silent output corruption or aninfinite loop. But a parser should not raise an internal assertion in
response to a legitimate handler action, and
-Oturning one exceptiontype into a more confusing one confirms this code path needs an explicit
guard rather than an assertion.
The fix
Detect that
self.rawdatais no longer the objectgoahead()startedwith, and stop. Two checks cover the two points at which a handler can
run during the loop: the top of the loop (covers handlers invoked by the
tag/ref parsing helpers on the previous iteration) and immediately after
handle_data().After
reset(), parsing of the current data stops and the parser is backto a fresh state, ready to be reused. After a re-entrant
feed(), thenewly fed data is handled by the in-progress call as usual; the parser is
left usable for later
feed()/close()calls.This answers the original 2016 report (Hibou57): "I expected
reset()tostop the parser." It now does, cleanly, instead of crashing.
Note: a handler that calls
feed()mid-parse re-emits the prefix it hasalready consumed. That is pre-existing behavior and is left unchanged; the
tests assert only what matters there (no crash, the newly fed data is
parsed, and the parser stays reusable), not the exact duplicated event
sequence.
Compatibility
Normal parsing is unaffected: the new checks only fire when a handler
actually replaces
self.rawdata. Output for inputs that do not callreset()/feed()from a handler is byte-identical before and after,including the default
convert_charrefs=Truepath and chunked feeding.Tests
test_reset_in_handleris parameterized over bothconvert_charrefsbranches (the
rawdata.find('<')path and theself.interesting.searchpath) and over every handler entry point that can run mid-parse:
handle_data,handle_starttag,handle_endtag,handle_comment, plushandle_charrefandhandle_entityref(which only fire underconvert_charrefs=False). A newtest_feed_in_handlercovers a handlercalling
feed()instead ofreset(). Each case asserts: no crash, theexpected events up to the swap,
rawdata == ''afterward, and that theparser still parses a fresh document. Every new case crashes on the
unpatched parser and passes on the fixed one.
Refs gh-70398.