fix: bound DNSCache record count to prevent unbounded LAN-driven growth#1718
Conversation
Closes #1715. A LAN-local peer multicasting unique-name mDNS responses could grow ``DNSCache.cache`` / ``_expirations`` / ``service_cache`` / ``_expire_heap`` without bound. ``_DNS_PTR_MIN_TTL = 1125`` (RFC 6762 recommendation for ServiceBrowser refresh-floor) actually extends attacker-injected PTR records to ~19 min before they expire naturally, and ``async_expire`` only runs every ``_CACHE_CLEANUP_INTERVAL = 10`` s — too slow to keep up with a sustained flood. On memory-constrained deployments (Home Assistant on a Raspberry Pi is the canonical victim) this trivially OOMs the process. Add a hard cap (``_MAX_CACHE_RECORDS = 10_000`` in ``const.py``) on the total number of records the cache will hold. ``DNSCache`` now tracks ``_total_records`` (incremented on genuinely-new inserts in ``_async_add``, decremented in ``_async_remove``). When the cap is hit, ``_async_evict_oldest`` heappops the closest-to-expiration record and removes it before inserting the new one. Reuses the existing ``_expire_heap`` for victim selection — O(log n) — and skips stale heap entries via the same expiration-equality check ``async_expire`` already uses. All four touched lines compile to direct C int ops in the Cython build (score-0 in ``cython -a`` annotation); the eviction call is a C-level vtable dispatch fired only on overflow. Wall-clock check: below-cap add stays at ~160 ns/record; over-cap add with constant eviction is ~195 ns/record (+36 ns for the heappop + cache delete). Per-source-IP record quotas — the reporter's "Better" suggestion — are deferred to a follow-up PR. This commit closes the unbounded- growth bug class on its own; the quota work changes the fairness properties but not the memory bound. CWE-400 (Uncontrolled Resource Consumption). LAN-local attack surface only.
|
@bluetoothbot review |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1718 +/- ##
=======================================
Coverage 99.76% 99.76%
=======================================
Files 33 33
Lines 3426 3440 +14
Branches 471 473 +2
=======================================
+ Hits 3418 3432 +14
Misses 5 5
Partials 3 3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Merging this PR will not alter performance
Performance Changes
Comparing |
There was a problem hiding this comment.
Pull request overview
This PR bounds DNSCache growth to mitigate LAN-local memory exhaustion from floods of unique mDNS records.
Changes:
- Adds
_MAX_CACHE_RECORDSas a hard DNS cache record cap. - Tracks
_total_recordsand evicts the closest-to-expiration record on overflow. - Adds regression tests and CodSpeed benchmarks for bounded-cache behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/zeroconf/const.py |
Defines the cache record limit. |
src/zeroconf/_cache.py |
Adds record counting and overflow eviction logic. |
src/zeroconf/_cache.pxd |
Updates Cython declarations for the new cache state and helper. |
tests/test_cache.py |
Adds tests for cap enforcement and counter synchronization. |
tests/benchmarks/test_cache_bound.py |
Adds benchmarks for below-cap and at-cap insertion paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Adds ``test_cache_eviction_skips_stale_heap_entries`` which fills the cache to ``_MAX_CACHE_RECORDS``, re-adds the closest-to-expiration record with a longer TTL (which leaves a stale ``(when, record)`` tuple at the head of ``_expire_heap`` and updates ``_expirations`` with the new ``when``), then forces a single eviction and asserts the refreshed record survives while the next-oldest legitimate record is the one that gets evicted. Without this test the ``continue`` branch inside ``_async_evict_oldest`` would be untouched by the suite.
PR Review — fix: bound DNSCache record count to prevent unbounded LAN-driven growthSolid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating 🟢 Suggestions1. `_total_records` can underflow on misuse (`src/zeroconf/_cache.py`, L173)
In practice the in-tree callers (
Not blocking — the invariant test ( 2. Worst-case O(N log N) eviction when heap is stale-dominated (`src/zeroconf/_cache.py`, L129-138)
This is acceptable given the alternative (rebuild inside the eviction hot path) would slow the common case. Worth a one-line comment in 3. Test comment phrasing (`tests/test_cache.py`, L571-573)The comment 'Clear the Purely cosmetic — the test is correct. Checklist
SummarySolid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating Automated review by Kōanc24fa39 |
Address review feedback on #1718. **Orphaned-store bug (blocking).** ``_async_add`` was capturing ``store = self.cache.get(record.key)`` before calling ``_async_evict_oldest``. If the closest-to-expiration victim was the last record under ``record.key`` (e.g. the incoming record shares its key with the victim — A+AAAA pairs, PTR fan-outs, TXT/SRV pairs all hit this in real workloads), ``_remove_key`` would ``del self.cache[record.key]`` during eviction, leaving the captured ``store`` pointing at an orphaned dict. The subsequent ``store[record] = record`` then wrote into a dict no longer reachable through the cache; the new record was effectively lost, and a later ``async_expire`` of that record would ``KeyError`` from ``_remove_key``. Defer the bucket creation: look up ``store`` first (without creating), do the eviction-if-needed pass, then ``cache.get`` again to pick up any deletion eviction caused. The bucket is only created with ``self.cache[record.key] = {}`` after we know eviction has settled. **Stale-prefix scan (suggestion #1).** Under sustained TTL re-adds — e.g. ``async_mark_unique_records_older_than_1s_to_expire`` rewriting TTLs to 1 — ``_expire_heap`` accumulates stale ``(when, record)`` tuples that no longer match ``_expirations[record]``. The previous eviction body would pop those one at a time, doing O(stale_prefix * log n) work per ``_async_add``. Reuse the same ``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > len(expirations) * 2`` heuristic that ``async_expire`` already runs: when the heap is mostly stale, rebuild it once up front, then run the pop loop against only live entries. **Accounting invariant test.** ``_total_records`` is ``cdef public unsigned int``; an off-by-one in ``_async_add`` / ``_async_remove`` would silently underflow in the Cython wheel and pin the cap-check True forever (eviction storm on every add). Rather than defensively zero-clamp the decrement — which would hide the bug — the new ``test_cache_total_records_invariant_under_mixed_ops`` walks the counter through every code path that touches it (fresh inserts, re-adds, DNSService, DNSNsec, shared-key inserts that empty their bucket on removal, full-cap eviction loop, ``async_expire``, ``async_remove_records``) and asserts ``cache._total_records == sum(len(s) for s in cache.cache.values())`` after every step. Any future change that misses an increment or doubles a decrement fails loudly. Regression tests: - ``test_cache_eviction_victim_shares_key_with_new_record`` — reproduces the orphan; would have failed against c24fa39. - ``test_cache_eviction_rebuilds_heap_when_mostly_stale`` — fills to cap then re-adds with two distinct TTLs to push the heap to ~3x MAX, then triggers eviction and asserts the rebuild ran (heap and expirations match after the pass). - ``test_cache_total_records_invariant_under_mixed_ops`` — counter invariant across all touched code paths. The ``.pxd`` adds ``expire_heap_len`` as ``unsigned int`` to ``_async_evict_oldest``'s locals so the new threshold comparison stays a direct C int op in the Cython build.
|
@bluetoothbot review |
Address remaining review feedback on #1718. **Re-add heap growth (Copilot inline at L100, real bug separate from the orphan).** ``_async_add``'s cap check is gated on ``is_new``, but a peer that just replays cached records with a different ``created``/ TTL hits the ``is_new == False`` path: no cap check, no ``_total_records`` change — yet the changed ``when`` makes ``_expirations.get(record) != when`` true, so a fresh ``(when, record)`` tuple is heappushed while the prior tuple stays behind as stale. ``_expire_heap`` grows unbounded between cleanups even though ``cache`` and ``_total_records`` stay flat. At ~10k pps for 10 s that's ~1 M stale heap entries × ~100 B ≈ 100 MB transient. Factor the rebuild check (``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > 2 * len(expirations)`` → filter-and-heapify) into a new ``_maybe_rebuild_heap`` cdef helper. Call it from three sites: - ``_async_add`` immediately after the conditional heappush, so a re-add flood is bounded at every step (amortized O(1) per push, O(N) rebuild fires at most once per N stale pushes) - ``async_expire`` after the pop loop, replacing the existing inline rebuild — same behavior, slightly more aggressive (uses post-pop heap length instead of pre-pop), no test regression - (removed from ``_async_evict_oldest`` — the eager rebuild in ``_async_add`` keeps the heap bounded before eviction is ever triggered, so the eviction-side call was dead code) Cython codegen: the rebuild call from ``_async_add`` only fires when the conditional heappush already fired, so the hot path on duplicate re-adds (same ``when``) is unchanged. The check inside ``_maybe_rebuild_heap`` is two ``len()`` calls + an integer compare — score-0 in ``cython -a``. **Test DRY-up.** Introduce a module-level ``_addr(name, idx, ...)`` helper that builds the boilerplate ``DNSAddress(name, _TYPE_A, _CLASS_IN, ttl, bytes((idx & 0xFF, (idx >> 8) & 0xFF, 0, 1)), created =...)`` so each new test reads top-down instead of being dominated by constructor noise. Test docstrings collapsed to one sentence per ``CLAUDE.md``. **New regression test:** ``test_cache_re_add_flood_does_not_grow_heap_unbounded`` adds 200 records below the cap, then replays them through 10 cycles of varying TTLs (2000 stale pushes total) and asserts the heap stays near the rebuild threshold. The previous code would have grown it to ~11x the record count; the helper keeps it within ``2 * expirations``. **Removed test:** ``test_cache_eviction_rebuilds_heap_when_mostly_stale`` — its premise (build up stale entries, then trigger eviction-side rebuild) is no longer reachable now that ``_async_add`` rebuilds proactively. The same code path is covered by the re-add flood test. **Existing tests updated:** ``test_cache_heap_multi_name_cleanup`` and ``test_cache_heap_pops_order`` previously asserted pre-cleanup heap size (``min_records_to_cleanup + 5``). With proactive rebuild firing inside ``_async_add``, the heap is cleaned up earlier and the post- cleanup invariant (``1 + 5`` after ``async_expire``) still holds, but the intermediate snapshot is no longer accurate. Replaced the stale intermediate assertion with a comment.
``_async_evict_oldest`` falls through to a silent return if ``_expire_heap`` is empty. By the cache invariant every record counted in ``_total_records`` has a heap entry, so this branch is unreachable in normal operation — but it's a defense-in-depth path worth pinning. Force the broken state directly (``_total_records = MAX, _expire_heap = []``) and verify eviction returns gracefully while the subsequent insert still completes.
|
@bluetoothbot review |
Three review follow-ups on #1718: **Bench actually measures eviction (Copilot inline at L43).** ``test_cache_add_at_cap_evicts`` previously created an empty cache and added ``_MAX_CACHE_RECORDS + 1000`` records, so ~91% of the measured time was the below-cap fill path and only the final 1000 inserts evicted — the CodSpeed number was not representative of the eviction performance the name implies. Restructure to pre-fill the cache to the cap *outside* the timed body and consume one fresh unique record per benchmark iteration. Each iteration measures exactly one ``_async_add`` + ``_async_evict_oldest`` cycle, with the cache permanently at the cap. ``_make_records`` gains a ``prefix=`` parameter so the fill records and the overflow pool don't collide on ``name``. **Defensive test pins counter state (Kōan suggestion #3).** ``test_cache_eviction_empty_heap_returns_without_evicting`` already asserts the new record lands; add the corresponding ``_total_records == _MAX_CACHE_RECORDS + 1`` assertion so a future change to the broken-state recovery (refuse the add, clamp the counter) fails this test explicitly rather than silently shifting behaviour. **One-line docstring (Kōan suggestion #2).** ``_maybe_rebuild_heap`` had an 11-line docstring restating the body and including amortized-cost analysis. Per CLAUDE.md the why lives at the call site comment in ``_async_add`` (already does); the function's contract is a single line. **No log on the empty-heap fall-through (Kōan suggestion #1).** Kōan suggested an assertion or debug log; the assertion would crash production on invariant drift, and even a guarded ``log.debug(...)`` adds Python attribute-lookup overhead in the function the cap overflow path runs through. Drift is already caught loudly by ``test_cache_total_records_invariant_under_mixed_ops``. Comment explains the deliberate silence.
|
@bluetoothbot review |
|
bad AI suggestion to add logging in the hot path |
|
need to hand check nsec path.. should add a test to be sure |
The ``new = is_new and not isinstance(record, DNSNsec)`` line in
``_async_add`` only gates the listener-notify return value — the
counter increment, the cache store, and the cap check are all
type-agnostic. A peer can't flood NSec records past the cap today.
But the existing ``new`` flag is exactly the shape a future refactor
might extend to ``_total_records`` ("DNSNsec isn't real data, don't
count it"), which would silently open a bypass. Pin
``_total_records == _MAX_CACHE_RECORDS`` after pushing
``_MAX_CACHE_RECORDS + 100`` unique-name DNSNsec records so any such
regression fails this test loudly.
|
ok happy with it. but should have the bot check it again as I'm tired and might miss something |
The NSEC accounting path is already pinned by |
**Cross-type NSec spot-test (Kōan).** Add ``test_cache_dnsnsec_at_cap_evicts_prior_record``: fill the cache to ``_MAX_CACHE_RECORDS`` with ``DNSAddress`` records, add a single ``DNSNsec``, assert the counter stays at cap (NSEC routed through eviction), the NSEC is reachable via ``cache.cache[nsec.key]`` (orphan fix covers NSEC's key-collision path), and the earliest fill record was the eviction victim. Complements the existing flood test — together they pin NSEC against both ``new``-flag drift and cross-type ``_async_add`` regressions. Uses direct cache lookup instead of ``async_get_unique`` because the latter's type stub excludes ``DNSNsec``. **Unbounded benchmark generator (Copilot inline at test_cache_bound.py:53).** The previous bench used ``iter(_make_records(100_000, ...))`` which raises ``StopIteration`` once exhausted — fast operations under pytest-codspeed can exceed that count. Replace with ``_unbounded_records`` (``itertools.count`` + lazy ``DNSAddress`` construction) so the iterator survives any iteration count CodSpeed chooses. The construction work moves into the timed body but is a constant per-iteration overhead and doesn't distort relative measurements across runs.
|
@bluetoothbot review |
The comment said drift would surface via
``test_cache_total_records_invariant_under_mixed_ops``, but that
test only exercises well-formed paths and never produces a
heap-empty-with-counter-nonzero state. The test that actually pins
the broken-invariant recovery is
``test_cache_eviction_empty_heap_returns_without_evicting``. Per
``CLAUDE.md`` ("default to no comments"; cross-references rot when
referents are renamed) the comment shouldn't exist at all — the
function name and docstring already describe the contract, and the
while-loop fall-through is self-explanatory.
|
ok bot suggestions are worse than alternatives. This is good as is |
Summary
Closes #1715. A LAN-local peer multicasting unique-name mDNS responses
can grow
DNSCache.cache/_expirations/service_cache/_expire_heapwithout bound. With_DNS_PTR_MIN_TTL = 1125extending attacker-injected PTR records to ~19 min and
async_expireonly running every
_CACHE_CLEANUP_INTERVAL = 10s, a low-Mbpsmulticast flood OOMs a Raspberry-Pi-class deployment in minutes.
Adds a hard cap on the total record count with closest-to-expiration
eviction on overflow, plus a shared
_maybe_rebuild_heaphelper thatkeeps
_expire_heapbounded against a second variant of the sameattack (record replay with shifting TTLs, which leaves stale heap
entries without changing the cache size).
Details
Cache cap
_MAX_CACHE_RECORDS = 10_000added toconst.py. Picked as aceiling well above any realistic legitimate workload — Home
Assistant deployments typically sit in the low hundreds — while
staying low enough to keep memory bounded at a few MB worst-case.
DNSCachegains_total_records: unsigned int(declared in_cache.pxd). Incremented in_async_addon genuinely-newinserts; decremented in
_async_remove. The counter isauthoritative — the cap check uses it directly rather than walking
self.cache._async_evict_oldest: when the cap is hit,heappops theclosest-to-expiration
(when, record)from_expire_heapandremoves it before inserting the new arrival. Skips stale heap
entries via the same expiration-equality check
async_expirealready uses.
Eviction-victim collision (orphaned-store bug)
The first iteration of this PR captured
store = self.cache.get(record.key)before eviction. If the closest-to-expiration victim was the last
record under
record.key(which happens whenever the incomingrecord shares its key with the victim — A+AAAA pairs, PTR fan-outs,
TXT/SRV pairs all hit this),
_remove_keywoulddel self.cache[record.key]during eviction, leaving the capturedstorepointing at an orphaned dict. The subsequentstore[record] = recordthen wrote into a dict no longer reachable through thecache; the new record was effectively lost, and a later
async_expireof that record wouldKeyErrorfrom_remove_key.Fix: defer the bucket creation. Look up
storefirst (withoutcreating), do the eviction-if-needed pass, then
cache.getagainto pick up any deletion eviction caused. The bucket is only created
with
self.cache[record.key] = {}after eviction has settled.Replay-flood heap growth
The cap above is gated on
is_new, but a peer that just replayscached records with different
created/TTLs hits theis_new == Falsepath: no cap check, no_total_recordschange — yet thechanged
whenmakes_expirations.get(record) != whentrue, so afresh
(when, record)tuple is heappushed while the prior tuplestays behind as stale.
_expire_heapgrows unbounded betweencleanups even though
cacheand_total_recordsstay flat. At~10k pps for 10 s that's ~1 M stale heap entries × ~100 B ≈ 100 MB
transient.
Factor the heap-rebuild check (
len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > 2 * len(expirations)→ filter-and-heapify) into anew
_maybe_rebuild_heapcdef helper. Called from two sites:_async_addimmediately after the conditional heappush, so are-add flood is bounded at every step. Amortized O(1) per push:
the O(N) rebuild fires at most once per N stale pushes.
async_expireafter the pop loop, replacing the existing inlinerebuild (same behavior, slightly more aggressive — uses post-pop
heap length instead of pre-pop, no test regression).
_async_evict_oldestno longer needs its own rebuild call:proactive rebuild in
_async_addkeeps the heap bounded beforeeviction is ever triggered.
Why not per-source-IP quotas
The reporter's "Better" suggestion is intentionally not pursued.
mDNS rides on UDP with no source authentication; an attacker on the
local segment can multicast with any source address, so a per-IP
quota is bypassed by source-rotation. The LAN itself is the trust
boundary in mDNS; once an attacker is past it, source IPs are not a
meaningful identity. Per-source bookkeeping would add hot-path
overhead and lifecycle complexity (peer mappings need GC as records
expire) in exchange for a fairness property that doesn't survive
contact with the protocol's threat model.
CWE-400 (Uncontrolled Resource Consumption). LAN-local attack
surface only.
Cython perf check
cython -aon the touched_cache.py:Generated C compiles the cap check and counter ops to direct
unsigned intarithmetic, helper invocations to C-level vtabledispatch. The yellow inside
_maybe_rebuild_heapis onePyObject_RichCompareper list-comp iteration (type-genericequality against
_expirations.get(...)) — unavoidable, and onlyruns during the rare rebuild path. The hot path on duplicate
re-adds (same
when) never enters the helper at all because theif self._expirations.get(record) != when:gate trips first.Wall-clock smoke test: below-cap adds run at ~159 ns/record;
sustained-eviction adds run at ~195 ns/record (+36 ns/record for the
heappop + cache delete).
Test plan
poetry run pytest tests/test_cache.py --timeout=60 -v— 33passed including five regressions:
test_cache_size_is_bounded— cap holds exactly under unique-nameflood, FIFO eviction order
test_cache_eviction_skips_stale_heap_entries— heap stale-skipselects next valid victim
test_cache_eviction_victim_shares_key_with_new_record—reproduces the orphan bug against pre-
931358acodetest_cache_re_add_flood_does_not_grow_heap_unbounded— replayattack bounded by the
_maybe_rebuild_heaphelpertest_cache_total_records_invariant_under_mixed_ops— pins_total_records == sum(len(store) for store in cache.cache.values())after every code path that touches it (fresh inserts, re-adds,
DNSService, DNSNsec, shared-key emptying its bucket, full-cap
eviction loop,
async_expire,async_remove_records); anyfuture off-by-one fails loudly rather than silently underflowing
the unsigned counter
poetry run pytest --timeout=60— 345 passed, 3 skipped (fullsuite)
poetry run pre-commit run --files <changed files>— all hookspass
REQUIRE_CYTHON=1 poetry install --only=main,dev— extensionrebuilds against the new
.pxd(added_total_recordsfield,_async_evict_oldestand_maybe_rebuild_heapcdefs)tests/benchmarks/test_cache_bound.pycovering below-cap and at-cap add paths
assigned alongside the patched release