fix: bound _seen_logs and stop retaining exc_info#1717
Conversation
Closes #1714. The module-level `_seen_logs` dict in `_protocol/incoming.py` and the class-level `QuietLogger._seen_logs` in `_logger.py` were unbounded caches keyed by `str(sys.exc_info()[1])`, with `sys.exc_info()` itself stored as the value. The exception strings emitted by `_read_name` / `_decode_labels_at_offset` embed peer-controlled fields (`self.source` carries the peer's IP and ephemeral source port) and byte offsets, so each malformed packet variant generated a fresh key. The stored traceback in turn pinned `self.data` (the raw packet bytes), giving a LAN-local attacker ~9 KB of per-key retention with no upper bound. The stored `exc_info` triple was dead state. Its only reader was removed in 4218d75 (Nov 2018) when `logger('Exception occurred:', exc_info=exc_info)` was rewritten to `exc_info=True`; since then the dict has functioned as a `set` with junk values. The integer counters in `log_warning_once` / `log_exception_once` were written but never read either. Switch both `_seen_logs` instances to a plain `set[str]` and clear them once they reach `_MAX_SEEN_LOGS = 256` entries. Logging behaviour is unchanged — first-occurrence warnings/tracebacks still fire, the suppression on subsequent occurrences still works — but the worst-case memory footprint under malformed input is now bounded to a few hundred short exception strings instead of millions of (str, exc_info-triple) pairs.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1717 +/- ##
==========================================
- Coverage 99.76% 99.76% -0.01%
==========================================
Files 33 33
Lines 3428 3418 -10
Branches 472 469 -3
==========================================
- Hits 3420 3410 -10
Misses 5 5
Partials 3 3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Merging this PR will not alter performance
Performance Changes
Comparing Footnotes |
|
FixSizeOrderedDict might be better here... need to see and benchmark. logging is hot |
|
FixSizeOrderedDict clearly lost in testing performance and this is hot |
Lift the bound-check + eviction logic out of the four QuietLogger classmethods and the parser-side `_log_exception_debug` and into a single `_mark_seen(seen, key)` in `_logger.py`. Both call sites still keep their own `_seen_logs` set (no behavior change vs the previous commit — there are still two independent dedup pools, each capped at `_MAX_SEEN_LOGS`); only the eviction policy is now defined in one place. `_MAX_SEEN_LOGS` likewise moves to `_logger.py` only.
…cap to 512 Smooth out the overflow path: instead of dropping the entire set at once when it hits the bound (which then re-emits warning-level logs for every previously-deduped key on its next occurrence), evict one arbitrary entry per insert past the bound. The cap effectively becomes a recency window of size `_MAX_SEEN_LOGS`. Also raises the bound from 256 to 512. Memory cost stays trivial (~40 KB of short exception strings), but under sustained malformed- packet pressure ops sees a steady trickle of new warnings rather than a burst of re-emissions every few hundred packets.
`set.pop()` removes an arbitrary element (hash-bucket order), not the oldest one — the "recency window" framing in ba5de3a was misleading. Switch the dedup containers to insertion-ordered `dict[str, None]` and evict via `del seen[next(iter(seen))]` so eviction is truly FIFO: the oldest entry is dropped per overflow, and `_MAX_SEEN_LOGS` becomes a genuine recency window. The eviction-path delta vs set+pop (~165 vs ~77 ns/call in a microbench) is dwarfed by the `str(sys.exc_info()[1])` interpolation on the calling path; this only runs when a packet has already failed to parse.
Tighten ``test_seen_logs_is_bounded`` to assert the cap is held exactly and that the oldest 5 keys are evicted first while the newest overflow keys remain — without this, the test would have passed for ``set.pop()`` (random eviction) too, hiding any future regression away from dict-backed FIFO. Also add ``tests/benchmarks/test_mark_seen.py`` so CodSpeed tracks the three branches (hit / fill / churn) of the helper.
|
@bluetoothbot review |
PR Review — fix: bound _seen_logs and stop retaining exc_infoSolid security fix. The bound + FIFO eviction + dropped 🟢 Suggestions1. Test leaves global `_seen_logs` populated (`tests/test_protocol.py`, L981-996)
2. Removal of `QuietLogger._seen_logs` ClassVar is a subtle attribute-surface change (`src/zeroconf/_logger.py`, L42-43)
Checklist
SummarySolid security fix. The bound + FIFO eviction + dropped Automated review by Kōan42220db |
There was a problem hiding this comment.
Pull request overview
This PR addresses a LAN-local DoS/memory-growth issue by bounding exception/log de-duplication state and avoiding retention of sys.exc_info() traceback tuples that can pin large packet buffers in memory.
Changes:
- Introduces a bounded, FIFO-evicting
_mark_seen()helper with_MAX_SEEN_LOGScap and switches de-duplication maps todict[str, None]. - Updates both the incoming packet parser path and
QuietLoggerpaths to use_mark_seen()(and no longer storeexc_infotuples). - Adds regression tests for bounded
_seen_logsplus a CodSpeed benchmark for_mark_seen()behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/zeroconf/_logger.py |
Adds _MAX_SEEN_LOGS, module-level _seen_logs, and _mark_seen(); refactors QuietLogger de-dup logic to use it. |
src/zeroconf/_protocol/incoming.py |
Bounds parser-side _seen_logs and uses _mark_seen() to control exception trace logging. |
tests/test_logger.py |
Updates tests to clear the new module-level _seen_logs and adds FIFO-eviction bounding regression coverage. |
tests/test_protocol.py |
Adds regression coverage ensuring parser _seen_logs cannot grow beyond the cap under varying peer ports. |
tests/benchmarks/test_mark_seen.py |
Adds benchmark coverage for hit/fill/churn scenarios of _mark_seen(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address review feedback on #1717. The previous eviction body — ``del seen[next(iter(seen))]`` — relied on the GIL to serialize the compound iter/next/del. Under the free-threaded build (3.14t) and under multi-instance sync use where multiple ``Zeroconf`` instances share the module-level ``_seen_logs``, callers can race: - Two threads pop the same victim — ``del`` raises ``KeyError`` on the loser - One thread mutates the dict between another's ``iter()`` and ``next()`` — the iterator's mutation-count check raises ``RuntimeError`` ("dictionary changed size during iteration") Switch to ``seen.pop(next(iter(seen), None), None)`` so the pop is idempotent (no ``KeyError``) and the iter start handles the empty- dict edge case (no ``StopIteration``), and wrap the iter/next in a ``try/except RuntimeError`` so concurrent mutation during eviction is absorbed. Worst case under contention is a transient overshoot of the cap by one entry per racing thread, which clears as soon as the contention does. Add ``test_mark_seen_absorbs_runtime_error_during_eviction`` which substitutes a dict subclass whose ``__iter__`` always raises, proving the new ``except`` branch lets the insert still complete. Also tighten ``tests/test_protocol.py::test_seen_logs_is_bounded`` per Kōan's review comment: previously asserted ``<= _MAX_SEEN_LOGS``, which would still pass if a future refactor collapsed the per-port keys to a single dedup string. Now asserts ``== _MAX_SEEN_LOGS`` and verifies port 0's exception text is gone while the highest port's is still present — pins both the bound and that the parser exception path actually enters the dict with per-source-unique keys.
Under free-threaded (3.14t) contention or sustained multi-instance sync use, two callers can both pass the ``len(seen) < _MAX_SEEN_LOGS`` check and both insert before either evicts — leaving the dict transiently above the cap. The previous ``if``-based eviction removed one entry on the next call, but the new insert added one back, so the overshoot never drained and the dict silently grew with thread count. Switch the eviction to ``while len(seen) >= _MAX_SEEN_LOGS:`` so a single non-racing caller drains the accumulated drift back to the cap. Bail on ``RuntimeError`` and ``StopIteration`` so a concurrent mutation can't make the loop spin. Add ``test_mark_seen_drains_drift_above_cap`` to pin the drain behaviour — a regression to single-eviction would leave the dict at ``cap + drift - 1`` and fail the assertion.
|
@bluetoothbot review |
Two Copilot review follow-ups on #1717: - ``_mark_seen`` returned early on cache hits, so a drift-overshoot left behind by concurrent inserts (FT or sync-multi-instance callers racing the ``len < cap`` check) only drained on the next miss. A workload that's mostly hits after a contention burst could keep the dict permanently above the cap. Add a drift-drain ``while`` inside the hit branch, gated on ``len > _MAX_SEEN_LOGS`` so steady-state-at-cap hits remain a single ``len`` + compare past the membership check. The shared pop+race-tolerant idiom factors out into a small ``_evict_oldest`` helper invoked only when the outer ``while`` predicate is true, so the hot-path cost is unchanged. Pin the behaviour with ``test_mark_seen_drains_drift_on_hit_path``. - ``tests/test_protocol.py::test_seen_logs_is_bounded`` asserted FIFO eviction via substring matches on the exception text (``"'1.2.3.4', 0)" in k``). That tied the test to the current ``IncomingDecodeError`` message format, so a future normalization of dedup-key shape (as discussed on #1714) would break the test without changing the bounded/eviction behaviour. Snapshot the actual key the parser inserted per port via ``next(reversed(_seen_logs))`` and assert FIFO by key identity. Also pin ``len(set(keys_per_port)) == _MAX_SEEN_LOGS + overflow`` so a regression that dropped ``self.source`` from the exception text (collapsing all calls to one dedup key) still fails the test rather than silently passing — preserving the per-port- unique-key check from the prior Kōan review.
|
@bluetoothbot review |
Two Copilot review follow-ups on #1717: - ``_mark_seen`` returned early on cache hits, so a drift-overshoot left behind by concurrent inserts (FT or sync-multi-instance callers racing the ``len < cap`` check) only drained on the next miss. A workload that's mostly hits after a contention burst could keep the dict permanently above the cap. Add a drift-drain ``while`` inside the hit branch, gated on ``len > _MAX_SEEN_LOGS`` so steady-state-at-cap hits remain a single ``len`` + compare past the membership check. The shared pop+race-tolerant idiom factors out into a small ``_evict_oldest`` helper invoked only when the outer ``while`` predicate is true, so the hot-path cost is unchanged. Pin the behaviour with ``test_mark_seen_drains_drift_on_hit_path``. - ``tests/test_protocol.py::test_seen_logs_is_bounded`` asserted FIFO eviction via substring matches on the exception text (``"'1.2.3.4', 0)" in k``). That tied the test to the current ``IncomingDecodeError`` message format, so a future normalization of dedup-key shape (as discussed on #1714) would break the test without changing the bounded/eviction behaviour. Snapshot the actual key the parser inserted per port via ``next(reversed(_seen_logs))`` and assert FIFO by key identity. Also pin ``len(set(keys_per_port)) == _MAX_SEEN_LOGS + overflow`` so a regression that dropped ``self.source`` from the exception text (collapsing all calls to one dedup key) still fails the test rather than silently passing — preserving the per-port- unique-key check from the prior Kōan review.
3670833 to
d6cddb9
Compare
|
@bluetoothbot review |
Summary
Closes #1714. Bounds the
_seen_logsexception-dedup state in_protocol/incoming.pyand_logger.py, and stops retainingsys.exc_info()triples in it. A LAN-local peer sending malformedmDNS packets could otherwise grow either dict without bound — each
unique exception string retained a traceback whose frame locals
pinned
self.data(the raw inbound packet, up to 8966 bytes).Details
The dedup key in
_log_exception_debugisstr(sys.exc_info()[1]),and the seven
IncomingDecodeErrormessages raised from_read_name/
_decode_labels_at_offset(RFC 6762 §18 name-decoding error paths)all embed
self.source— which carries the peer's ephemeral sourceport, varying per packet — plus byte
offsetand pointerlink.Each attacker-influenced combination produced a fresh key, so the
unbounded growth was driven by
port × offset × link, not by thesmall set of error classes.
The stored value (the
exc_infotriple) has been dead state since4218d75 (Nov 2018), where
logger('Exception occurred:', exc_info=exc_info)was rewritten toexc_info=Trueduring a mypycleanup. From that point on the dict has functioned as a
set-with-junk-values: the value was never read. Behaviour isunchanged —
exc_info=Truetells the stdlib logger to callsys.exc_info()itself at the call site; the stored copy nevercontributed.
QuietLogger.log_warning_once/log_exception_onceused the samedict to store an integer counter that was likewise never read.
Collapsed all four QuietLogger methods plus the parser-side
_log_exception_debugonto a single_mark_seen(seen, key) -> boolhelper that lives in
_logger.py. Two separate dedup pools arepreserved —
_logger._seen_logs(QuietLogger paths) and_protocol.incoming._seen_logs(parser path) — so a parser floodcannot evict listener/core-side dedup state.
Storage is
dict[str, None](insertion-ordered on Python 3.7+) andthe helper drains oldest-first in a single loop whose target depends
on whether we're about to insert:
Hits drain only when
len > cap(drift correction); misses drainto
cap - 1so the insert lands at exactlycap._MAX_SEEN_LOGS(= 512) is therefore a recency window — under a sustained malformed-
packet flood the oldest deduped key drops one at a time, no burst
re-emission of warning-level logs. Worst-case footprint per pool
≈ 40 KB of short exception strings, vs. previously unbounded with
~9 KB per retained traceback.
Free-threaded (3.14t) safety. The dedup dicts are module-level
globals shared across all
Zeroconfinstances in the process; onthe FT build callers can race without the GIL serialising opcodes.
Individual dict ops (
in,__setitem__,pop,len) are atomicin CPython 3.13+ FT and don't crash, but the compound
iter→nextused to pick the FIFO victim can raiseRuntimeErrorif another thread mutates the dict between the twoops.
_evict_oldesttherefore wraps the pop intry: seen.pop(next(iter(seen)), None) except (RuntimeError, StopIteration): return Falseand the outerwhileshort-circuitson False, so concurrent mutation can't make the loop spin.
The drain runs on every call, including cache hits, so a hit-
heavy workload following a contention burst still corrects the
overshoot — without this, an
if-style check on the miss pathalone would leave the dict permanently above the cap whenever
subsequent calls happened to be all hits. Under sustained inserts
multiple threads can both pass the
len < capcheck and bothinsert, leaving the dict transiently above the cap; the next call
drains the drift back to the cap, so the bound holds at steady
state regardless of thread count.
CWE-400 (Uncontrolled Resource Consumption). LAN-local attack
surface only.
Test plan
poetry run pytest tests/test_logger.py tests/test_protocol.py --timeout=60 -v— passes, including five regressions:test_seen_logs_is_bounded(logger-side) — pins exact-cap +FIFO eviction so a swap to a non-ordered container fails it
test_seen_logs_is_bounded(protocol-side) — snapshots theactual dedup key the parser inserted per port via
next(reversed(_seen_logs))and asserts FIFO eviction by keyidentity (not by substring matching on the exception text), so
a future normalization of the
IncomingDecodeErrormessageformat doesn't false-positive a regression. Also pins
len(set(keys_per_port)) == _MAX_SEEN_LOGS + overflow, so aregression that dropped
self.sourcefrom the exception text(collapsing all 517 calls to one dedup key) still fails the
test rather than silently passing
test_mark_seen_absorbs_runtime_error_during_eviction—RacyDictwhose__iter__raisesRuntimeErrorproves thehelper still inserts the new key when concurrent mutation
breaks the iterator
test_mark_seen_drains_drift_above_cap— seeds the dict tocap + 10(simulating accumulated FT drift) and asserts thenext miss drains all the way back to the cap, so a
regression to single-eviction (one
popper call) would failtest_mark_seen_drains_drift_on_hit_path— same seeded drift,but the call is a hit; pins that the drift drains anyway, so
a regression that early-returned on hits without correcting
overshoot would leave the dict at
cap + driftand failpoetry run pytest --timeout=60— 346 passed, 3 skipped(full suite,
REQUIRE_CYTHON=1)poetry run pre-commit run --files src/zeroconf/_protocol/incoming.py src/zeroconf/_logger.py tests/test_logger.py tests/test_protocol.py tests/benchmarks/test_mark_seen.py— all hooks passREQUIRE_CYTHON=1 poetry install --only=main,dev— Cythonrebuild picks up the new dict-backed
_seen_logsshape inincoming.pytests/benchmarks/test_mark_seen.pycovershit / fill / churn so future regressions show up on the PR
CodSpeed delta
shared dict end at exactly the cap on the GIL build; the FT
drain behaviour is exercised by
test_mark_seen_drains_drift_above_cap/
test_mark_seen_drains_drift_on_hit_pathand by the CImatrix's 3.14t entry
REQUIRE_CYTHON=1): cache-hit at exactly cap = 52 ns/call.The downstream
log.debug/log.warninginvocation dwarfsthat, so the dedup wrapper is not the bottleneck.
assigned alongside the patched release