fix: bound DNSCache record count to prevent unbounded LAN-driven growth by bdraco · Pull Request #1718 · python-zeroconf/python-zeroconf

bdraco · 2026-05-18T02:36:57Z

Summary

Closes #1715. A LAN-local peer multicasting unique-name mDNS responses
can grow DNSCache.cache / _expirations / service_cache /
_expire_heap without bound. With _DNS_PTR_MIN_TTL = 1125
extending attacker-injected PTR records to ~19 min and async_expire
only running every _CACHE_CLEANUP_INTERVAL = 10 s, a low-Mbps
multicast flood OOMs a Raspberry-Pi-class deployment in minutes.

Adds a hard cap on the total record count with closest-to-expiration
eviction on overflow, plus a shared _maybe_rebuild_heap helper that
keeps _expire_heap bounded against a second variant of the same
attack (record replay with shifting TTLs, which leaves stale heap
entries without changing the cache size).

Details

Cache cap

_MAX_CACHE_RECORDS = 10_000 added to const.py. Picked as a
ceiling well above any realistic legitimate workload — Home
Assistant deployments typically sit in the low hundreds — while
staying low enough to keep memory bounded at a few MB worst-case.
DNSCache gains _total_records: unsigned int (declared in
_cache.pxd). Incremented in _async_add on genuinely-new
inserts; decremented in _async_remove. The counter is
authoritative — the cap check uses it directly rather than walking
self.cache.
New _async_evict_oldest: when the cap is hit, heappops the
closest-to-expiration (when, record) from _expire_heap and
removes it before inserting the new arrival. Skips stale heap
entries via the same expiration-equality check async_expire
already uses.

Eviction-victim collision (orphaned-store bug)

The first iteration of this PR captured store = self.cache.get(record.key)
before eviction. If the closest-to-expiration victim was the last
record under record.key (which happens whenever the incoming
record shares its key with the victim — A+AAAA pairs, PTR fan-outs,
TXT/SRV pairs all hit this), _remove_key would del self.cache[record.key] during eviction, leaving the captured
store pointing at an orphaned dict. The subsequent store[record] = record then wrote into a dict no longer reachable through the
cache; the new record was effectively lost, and a later
async_expire of that record would KeyError from _remove_key.

Fix: defer the bucket creation. Look up store first (without
creating), do the eviction-if-needed pass, then cache.get again
to pick up any deletion eviction caused. The bucket is only created
with self.cache[record.key] = {} after eviction has settled.

Replay-flood heap growth

The cap above is gated on is_new, but a peer that just replays
cached records with different created/TTLs hits the is_new == False path: no cap check, no _total_records change — yet the
changed when makes _expirations.get(record) != when true, so a
fresh (when, record) tuple is heappushed while the prior tuple
stays behind as stale. _expire_heap grows unbounded between
cleanups even though cache and _total_records stay flat. At
~10k pps for 10 s that's ~1 M stale heap entries × ~100 B ≈ 100 MB
transient.

Factor the heap-rebuild check (len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > 2 * len(expirations) → filter-and-heapify) into a
new _maybe_rebuild_heap cdef helper. Called from two sites:

_async_add immediately after the conditional heappush, so a
re-add flood is bounded at every step. Amortized O(1) per push:
the O(N) rebuild fires at most once per N stale pushes.
async_expire after the pop loop, replacing the existing inline
rebuild (same behavior, slightly more aggressive — uses post-pop
heap length instead of pre-pop, no test regression).

_async_evict_oldest no longer needs its own rebuild call:
proactive rebuild in _async_add keeps the heap bounded before
eviction is ever triggered.

Why not per-source-IP quotas

The reporter's "Better" suggestion is intentionally not pursued.
mDNS rides on UDP with no source authentication; an attacker on the
local segment can multicast with any source address, so a per-IP
quota is bypassed by source-rotation. The LAN itself is the trust
boundary in mDNS; once an attacker is past it, source IPs are not a
meaningful identity. Per-source bookkeeping would add hot-path
overhead and lifecycle complexity (peer mappings need GC as records
expire) in exchange for a fairness property that doesn't survive
contact with the protocol's threat model.

CWE-400 (Uncontrolled Resource Consumption). LAN-local attack
surface only.

Cython perf check

cython -a on the touched _cache.py:

_async_add additions (hot path):
  line  75:  self._total_records: int = 0                          score-0
  line 100:  if is_new and self._total_records >= _MAX_CACHE_RECORDS score-0
  line 101:      self._async_evict_oldest()                         score-0
  line 109:      self._total_records += 1                           score-0
  line 120:      self._maybe_rebuild_heap()  (after stale push)     score-0

_async_remove (decrement):
  line 173:  self._total_records -= 1                               score-0

_maybe_rebuild_heap helper body:
  line 153:  expire_heap_len = len(self._expire_heap)               score-9
  line 155:  expire_heap_len > _MIN_SCHEDULED_RECORD_EXPIRATION     score-0
  line 156:  and expire_heap_len > len(self._expirations) * 2       score-12

async_expire (uses helper):
  line 219:  self._maybe_rebuild_heap()                             score-0

Generated C compiles the cap check and counter ops to direct
unsigned int arithmetic, helper invocations to C-level vtable
dispatch. The yellow inside _maybe_rebuild_heap is one
PyObject_RichCompare per list-comp iteration (type-generic
equality against _expirations.get(...)) — unavoidable, and only
runs during the rare rebuild path. The hot path on duplicate
re-adds (same when) never enters the helper at all because the
if self._expirations.get(record) != when: gate trips first.

Wall-clock smoke test: below-cap adds run at ~159 ns/record;
sustained-eviction adds run at ~195 ns/record (+36 ns/record for the
heappop + cache delete).

Test plan

poetry run pytest tests/test_cache.py --timeout=60 -v — 33
passed including five regressions:
- test_cache_size_is_bounded — cap holds exactly under unique-name
  flood, FIFO eviction order
- test_cache_eviction_skips_stale_heap_entries — heap stale-skip
  selects next valid victim
- test_cache_eviction_victim_shares_key_with_new_record —
  reproduces the orphan bug against pre-931358a code
- test_cache_re_add_flood_does_not_grow_heap_unbounded — replay
  attack bounded by the _maybe_rebuild_heap helper
- test_cache_total_records_invariant_under_mixed_ops — pins
  _total_records == sum(len(store) for store in cache.cache.values())
  after every code path that touches it (fresh inserts, re-adds,
  DNSService, DNSNsec, shared-key emptying its bucket, full-cap
  eviction loop, async_expire, async_remove_records); any
  future off-by-one fails loudly rather than silently underflowing
  the unsigned counter
poetry run pytest --timeout=60 — 345 passed, 3 skipped (full
suite)
poetry run pre-commit run --files <changed files> — all hooks
pass
REQUIRE_CYTHON=1 poetry install --only=main,dev — extension
rebuilds against the new .pxd (added _total_records field,
_async_evict_oldest and _maybe_rebuild_heap cdefs)
CodSpeed bench at tests/benchmarks/test_cache_bound.py
covering below-cap and at-cap add paths
(optional) draft a GitHub Security Advisory so a CVE can be
assigned alongside the patched release

Closes #1715. A LAN-local peer multicasting unique-name mDNS responses could grow ``DNSCache.cache`` / ``_expirations`` / ``service_cache`` / ``_expire_heap`` without bound. ``_DNS_PTR_MIN_TTL = 1125`` (RFC 6762 recommendation for ServiceBrowser refresh-floor) actually extends attacker-injected PTR records to ~19 min before they expire naturally, and ``async_expire`` only runs every ``_CACHE_CLEANUP_INTERVAL = 10`` s — too slow to keep up with a sustained flood. On memory-constrained deployments (Home Assistant on a Raspberry Pi is the canonical victim) this trivially OOMs the process. Add a hard cap (``_MAX_CACHE_RECORDS = 10_000`` in ``const.py``) on the total number of records the cache will hold. ``DNSCache`` now tracks ``_total_records`` (incremented on genuinely-new inserts in ``_async_add``, decremented in ``_async_remove``). When the cap is hit, ``_async_evict_oldest`` heappops the closest-to-expiration record and removes it before inserting the new one. Reuses the existing ``_expire_heap`` for victim selection — O(log n) — and skips stale heap entries via the same expiration-equality check ``async_expire`` already uses. All four touched lines compile to direct C int ops in the Cython build (score-0 in ``cython -a`` annotation); the eviction call is a C-level vtable dispatch fired only on overflow. Wall-clock check: below-cap add stays at ~160 ns/record; over-cap add with constant eviction is ~195 ns/record (+36 ns for the heappop + cache delete). Per-source-IP record quotas — the reporter's "Better" suggestion — are deferred to a follow-up PR. This commit closes the unbounded- growth bug class on its own; the quota work changes the fairness properties but not the memory bound. CWE-400 (Uncontrolled Resource Consumption). LAN-local attack surface only.

bdraco · 2026-05-18T02:37:29Z

@bluetoothbot review

codecov · 2026-05-18T02:38:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (64d143d) to head (39ab755).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1718   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3426     3440   +14     
  Branches      471      473    +2     
=======================================
+ Hits         3418     3432   +14     
  Misses          5        5           
  Partials        3        3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codspeed-hq · 2026-05-18T02:38:48Z

Merging this PR will not alter performance

✅ 9 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
🆕	`test_cache_add_at_cap_evicts`	N/A	57.3 µs	N/A
🆕	`test_cache_add_below_cap`	N/A	2.3 ms	N/A

_{Comparing fix/cache-unbounded-records (39ab755) with master (0ff3c6b)}

Copilot

Pull request overview

This PR bounds DNSCache growth to mitigate LAN-local memory exhaustion from floods of unique mDNS records.

Changes:

Adds _MAX_CACHE_RECORDS as a hard DNS cache record cap.
Tracks _total_records and evicts the closest-to-expiration record on overflow.
Adds regression tests and CodSpeed benchmarks for bounded-cache behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/zeroconf/const.py`	Defines the cache record limit.
`src/zeroconf/_cache.py`	Adds record counting and overflow eviction logic.
`src/zeroconf/_cache.pxd`	Updates Cython declarations for the new cache state and helper.
`tests/test_cache.py`	Adds tests for cap enforcement and counter synchronization.
`tests/benchmarks/test_cache_bound.py`	Adds benchmarks for below-cap and at-cap insertion paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Adds ``test_cache_eviction_skips_stale_heap_entries`` which fills the cache to ``_MAX_CACHE_RECORDS``, re-adds the closest-to-expiration record with a longer TTL (which leaves a stale ``(when, record)`` tuple at the head of ``_expire_heap`` and updates ``_expirations`` with the new ``when``), then forces a single eviction and asserts the refreshed record survives while the next-oldest legitimate record is the one that gets evicted. Without this test the ``continue`` branch inside ``_async_evict_oldest`` would be untouched by the suite.

bluetoothbot · 2026-05-18T02:43:41Z

PR Review — fix: bound DNSCache record count to prevent unbounded LAN-driven growth

Solid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating cache, and replay-flood saturating _expire_heap) are both addressed, with the orphan-store bug correctly handled by the deferred bucket creation. All prior Copilot review comments (orphan store, replay-flood, benchmark prefilling, finite-pool exhaustion) are addressed in the current diff. Test coverage is comprehensive — the test_cache_total_records_invariant_under_mixed_ops test in particular is a strong invariant lock that will catch any future off-by-one. _MAX_CACHE_RECORDS = 10000 is a defensible ceiling for a defense-in-depth bound (RFC 6762 §10 advisory caching). Cython .pxd updates are consistent with the existing unsigned int convention for cache counters. Performance numbers (~159 ns/record below cap, ~195 ns/record at cap) are reasonable. Suggestions in this review are minor and non-blocking — the PR is merge-ready.

🟢 Suggestions

1. `_total_records` can underflow on misuse (`src/zeroconf/_cache.py`, L173)

_async_remove decrements _total_records unconditionally, but the .pxd declares the field as unsigned int. If async_remove_records is ever called with a record that was already removed (double-remove from a caller, or a stray call after async_expire already popped it), the counter wraps to UINT_MAX and the cap effectively disables itself.

In practice the in-tree callers (record_manager, async_expire) guarantee removed records exist in _expirations, so this is defensive rather than a present-day bug. Worth considering either:

A _total_records > 0 guard before decrementing, or
Decrement only when _remove_key actually freed a record (would require threading a bool back from _remove_key).

Not blocking — the invariant test (test_cache_total_records_invariant_under_mixed_ops) keeps the in-tree paths honest, and an underflow would surface as a test failure on any new caller that double-removes.

_remove_key(self.cache, record.key, record)
self._expirations.pop(record, None)
self._total_records -= 1

2. Worst-case O(N log N) eviction when heap is stale-dominated (`src/zeroconf/_cache.py`, L129-138)

_async_evict_oldest may heappop up to len(_expire_heap) stale entries before finding a live one. In normal operation _maybe_rebuild_heap (called after every push in _async_add) keeps stale ratio under 2× live, so this is bounded — but in a corner case where the cap fires repeatedly without any intermediate rebuild trigger (heap just-under-2× ratio when eviction starts), eviction can do significant work.

This is acceptable given the alternative (rebuild inside the eviction hot path) would slow the common case. Worth a one-line comment in _async_evict_oldest noting that the loop relies on _async_add's rebuild to keep the stale prefix bounded, so a future reader doesn't 'fix' it by removing the rebuild call.

def _async_evict_oldest(self) -> None:
    """Drop the closest-to-expiration record to make room for a new one."""
    while self._expire_heap:
        when_record = heappop(self._expire_heap)
        record = when_record[1]
        if self._expirations.get(record) != when_record[0]:
            continue
        self._async_remove(record)
        return

3. Test comment phrasing (`tests/test_cache.py`, L571-573)

The comment 'Clear the _MIN_SCHEDULED_RECORD_EXPIRATION floor so the rebuild engages' reads as 'reset/zero the constant', but the test doesn't actually clear or patch it — it just produces enough re-adds across cycles to exceed it. Consider 'Exceed the _MIN_SCHEDULED_RECORD_EXPIRATION floor across cycles so the rebuild can engage' to avoid the ambiguity.

Purely cosmetic — the test is correct.

# Stay below the cache cap so eviction never fires; the attack here is
# heap growth via re-add, not cap saturation. Clear the
# ``_MIN_SCHEDULED_RECORD_EXPIRATION`` floor so the rebuild engages.

Checklist

Summary

Solid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating cache, and replay-flood saturating _expire_heap) are both addressed, with the orphan-store bug correctly handled by the deferred bucket creation. All prior Copilot review comments (orphan store, replay-flood, benchmark prefilling, finite-pool exhaustion) are addressed in the current diff. Test coverage is comprehensive — the test_cache_total_records_invariant_under_mixed_ops test in particular is a strong invariant lock that will catch any future off-by-one. _MAX_CACHE_RECORDS = 10000 is a defensible ceiling for a defense-in-depth bound (RFC 6762 §10 advisory caching). Cython .pxd updates are consistent with the existing unsigned int convention for cache counters. Performance numbers (~159 ns/record below cap, ~195 ns/record at cap) are reasonable. Suggestions in this review are minor and non-blocking — the PR is merge-ready.

Automated review by Kōanc24fa39
51fbc27
931358a
7c6be63
052b1e2
a2a2d34
38b1517
e5b509a
39ab755

Address review feedback on #1718. **Orphaned-store bug (blocking).** ``_async_add`` was capturing ``store = self.cache.get(record.key)`` before calling ``_async_evict_oldest``. If the closest-to-expiration victim was the last record under ``record.key`` (e.g. the incoming record shares its key with the victim — A+AAAA pairs, PTR fan-outs, TXT/SRV pairs all hit this in real workloads), ``_remove_key`` would ``del self.cache[record.key]`` during eviction, leaving the captured ``store`` pointing at an orphaned dict. The subsequent ``store[record] = record`` then wrote into a dict no longer reachable through the cache; the new record was effectively lost, and a later ``async_expire`` of that record would ``KeyError`` from ``_remove_key``. Defer the bucket creation: look up ``store`` first (without creating), do the eviction-if-needed pass, then ``cache.get`` again to pick up any deletion eviction caused. The bucket is only created with ``self.cache[record.key] = {}`` after we know eviction has settled. **Stale-prefix scan (suggestion ＃1).** Under sustained TTL re-adds — e.g. ``async_mark_unique_records_older_than_1s_to_expire`` rewriting TTLs to 1 — ``_expire_heap`` accumulates stale ``(when, record)`` tuples that no longer match ``_expirations[record]``. The previous eviction body would pop those one at a time, doing O(stale_prefix * log n) work per ``_async_add``. Reuse the same ``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > len(expirations) * 2`` heuristic that ``async_expire`` already runs: when the heap is mostly stale, rebuild it once up front, then run the pop loop against only live entries. **Accounting invariant test.** ``_total_records`` is ``cdef public unsigned int``; an off-by-one in ``_async_add`` / ``_async_remove`` would silently underflow in the Cython wheel and pin the cap-check True forever (eviction storm on every add). Rather than defensively zero-clamp the decrement — which would hide the bug — the new ``test_cache_total_records_invariant_under_mixed_ops`` walks the counter through every code path that touches it (fresh inserts, re-adds, DNSService, DNSNsec, shared-key inserts that empty their bucket on removal, full-cap eviction loop, ``async_expire``, ``async_remove_records``) and asserts ``cache._total_records == sum(len(s) for s in cache.cache.values())`` after every step. Any future change that misses an increment or doubles a decrement fails loudly. Regression tests: - ``test_cache_eviction_victim_shares_key_with_new_record`` — reproduces the orphan; would have failed against c24fa39. - ``test_cache_eviction_rebuilds_heap_when_mostly_stale`` — fills to cap then re-adds with two distinct TTLs to push the heap to ~3x MAX, then triggers eviction and asserts the rebuild ran (heap and expirations match after the pass). - ``test_cache_total_records_invariant_under_mixed_ops`` — counter invariant across all touched code paths. The ``.pxd`` adds ``expire_heap_len`` as ``unsigned int`` to ``_async_evict_oldest``'s locals so the new threshold comparison stays a direct C int op in the Cython build.

bdraco · 2026-05-18T03:07:27Z

@bluetoothbot review

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Address remaining review feedback on #1718. **Re-add heap growth (Copilot inline at L100, real bug separate from the orphan).** ``_async_add``'s cap check is gated on ``is_new``, but a peer that just replays cached records with a different ``created``/ TTL hits the ``is_new == False`` path: no cap check, no ``_total_records`` change — yet the changed ``when`` makes ``_expirations.get(record) != when`` true, so a fresh ``(when, record)`` tuple is heappushed while the prior tuple stays behind as stale. ``_expire_heap`` grows unbounded between cleanups even though ``cache`` and ``_total_records`` stay flat. At ~10k pps for 10 s that's ~1 M stale heap entries × ~100 B ≈ 100 MB transient. Factor the rebuild check (``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > 2 * len(expirations)`` → filter-and-heapify) into a new ``_maybe_rebuild_heap`` cdef helper. Call it from three sites: - ``_async_add`` immediately after the conditional heappush, so a re-add flood is bounded at every step (amortized O(1) per push, O(N) rebuild fires at most once per N stale pushes) - ``async_expire`` after the pop loop, replacing the existing inline rebuild — same behavior, slightly more aggressive (uses post-pop heap length instead of pre-pop), no test regression - (removed from ``_async_evict_oldest`` — the eager rebuild in ``_async_add`` keeps the heap bounded before eviction is ever triggered, so the eviction-side call was dead code) Cython codegen: the rebuild call from ``_async_add`` only fires when the conditional heappush already fired, so the hot path on duplicate re-adds (same ``when``) is unchanged. The check inside ``_maybe_rebuild_heap`` is two ``len()`` calls + an integer compare — score-0 in ``cython -a``. **Test DRY-up.** Introduce a module-level ``_addr(name, idx, ...)`` helper that builds the boilerplate ``DNSAddress(name, _TYPE_A, _CLASS_IN, ttl, bytes((idx & 0xFF, (idx >> 8) & 0xFF, 0, 1)), created =...)`` so each new test reads top-down instead of being dominated by constructor noise. Test docstrings collapsed to one sentence per ``CLAUDE.md``. **New regression test:** ``test_cache_re_add_flood_does_not_grow_heap_unbounded`` adds 200 records below the cap, then replays them through 10 cycles of varying TTLs (2000 stale pushes total) and asserts the heap stays near the rebuild threshold. The previous code would have grown it to ~11x the record count; the helper keeps it within ``2 * expirations``. **Removed test:** ``test_cache_eviction_rebuilds_heap_when_mostly_stale`` — its premise (build up stale entries, then trigger eviction-side rebuild) is no longer reachable now that ``_async_add`` rebuilds proactively. The same code path is covered by the re-add flood test. **Existing tests updated:** ``test_cache_heap_multi_name_cleanup`` and ``test_cache_heap_pops_order`` previously asserted pre-cleanup heap size (``min_records_to_cleanup + 5``). With proactive rebuild firing inside ``_async_add``, the heap is cleaned up earlier and the post- cleanup invariant (``1 + 5`` after ``async_expire``) still holds, but the intermediate snapshot is no longer accurate. Replaced the stale intermediate assertion with a comment.

``_async_evict_oldest`` falls through to a silent return if ``_expire_heap`` is empty. By the cache invariant every record counted in ``_total_records`` has a heap entry, so this branch is unreachable in normal operation — but it's a defense-in-depth path worth pinning. Force the broken state directly (``_total_records = MAX, _expire_heap = []``) and verify eviction returns gracefully while the subsequent insert still completes.

bdraco · 2026-05-18T04:00:14Z

@bluetoothbot review

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Three review follow-ups on #1718: **Bench actually measures eviction (Copilot inline at L43).** ``test_cache_add_at_cap_evicts`` previously created an empty cache and added ``_MAX_CACHE_RECORDS + 1000`` records, so ~91% of the measured time was the below-cap fill path and only the final 1000 inserts evicted — the CodSpeed number was not representative of the eviction performance the name implies. Restructure to pre-fill the cache to the cap *outside* the timed body and consume one fresh unique record per benchmark iteration. Each iteration measures exactly one ``_async_add`` + ``_async_evict_oldest`` cycle, with the cache permanently at the cap. ``_make_records`` gains a ``prefix=`` parameter so the fill records and the overflow pool don't collide on ``name``. **Defensive test pins counter state (Kōan suggestion ＃3).** ``test_cache_eviction_empty_heap_returns_without_evicting`` already asserts the new record lands; add the corresponding ``_total_records == _MAX_CACHE_RECORDS + 1`` assertion so a future change to the broken-state recovery (refuse the add, clamp the counter) fails this test explicitly rather than silently shifting behaviour. **One-line docstring (Kōan suggestion ＃2).** ``_maybe_rebuild_heap`` had an 11-line docstring restating the body and including amortized-cost analysis. Per CLAUDE.md the why lives at the call site comment in ``_async_add`` (already does); the function's contract is a single line. **No log on the empty-heap fall-through (Kōan suggestion ＃1).** Kōan suggested an assertion or debug log; the assertion would crash production on invariant drift, and even a guarded ``log.debug(...)`` adds Python attribute-lookup overhead in the function the cap overflow path runs through. Drift is already caught loudly by ``test_cache_total_records_invariant_under_mixed_ops``. Comment explains the deliberate silence.

bdraco · 2026-05-18T04:10:29Z

@bluetoothbot review

bdraco · 2026-05-18T04:10:57Z

bad AI suggestion to add logging in the hot path

bdraco · 2026-05-18T04:12:43Z

need to hand check nsec path.. should add a test to be sure

The ``new = is_new and not isinstance(record, DNSNsec)`` line in ``_async_add`` only gates the listener-notify return value — the counter increment, the cache store, and the cap check are all type-agnostic. A peer can't flood NSec records past the cap today. But the existing ``new`` flag is exactly the shape a future refactor might extend to ``_total_records`` ("DNSNsec isn't real data, don't count it"), which would silently open a bypass. Pin ``_total_records == _MAX_CACHE_RECORDS`` after pushing ``_MAX_CACHE_RECORDS + 100`` unique-name DNSNsec records so any such regression fails this test loudly.

bdraco · 2026-05-18T04:14:07Z

ok happy with it. but should have the bot check it again as I'm tired and might miss something

bluetoothbot · 2026-05-18T04:16:38Z

@bdraco: need to hand check nsec path.. should add a test to be sure

The NSEC accounting path is already pinned by test_cache_total_records_invariant_under_mixed_ops: it adds a DNSNsec (_total_records == actual() == 21) and then removes it (_total_records == actual() == 20), so the increment/decrement symmetry and the cache-bucket-size invariant both hold across NSEC. What that test doesn't pin is NSEC arriving at the cap — the current _async_add treats NSEC identically to other records for the cap check (is_new and self._total_records >= _MAX_CACHE_RECORDS fires regardless of isinstance(record, DNSNsec)), so an NSEC flood drives eviction the same way an A-record flood does. If you want a hard pin for that specifically, a small focused test would do it: fill to _MAX_CACHE_RECORDS with A records, add one DNSNsec, assert _total_records == _MAX_CACHE_RECORDS, the NSEC is reachable via async_get_unique, and exactly one prior record is gone. Cheap to add and forecloses any future regression where NSEC accidentally gets excluded from the cap path (the is_new / new split already differentiates them once for the return value, which is where a future refactor could easily slip).

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

**Cross-type NSec spot-test (Kōan).** Add ``test_cache_dnsnsec_at_cap_evicts_prior_record``: fill the cache to ``_MAX_CACHE_RECORDS`` with ``DNSAddress`` records, add a single ``DNSNsec``, assert the counter stays at cap (NSEC routed through eviction), the NSEC is reachable via ``cache.cache[nsec.key]`` (orphan fix covers NSEC's key-collision path), and the earliest fill record was the eviction victim. Complements the existing flood test — together they pin NSEC against both ``new``-flag drift and cross-type ``_async_add`` regressions. Uses direct cache lookup instead of ``async_get_unique`` because the latter's type stub excludes ``DNSNsec``. **Unbounded benchmark generator (Copilot inline at test_cache_bound.py:53).** The previous bench used ``iter(_make_records(100_000, ...))`` which raises ``StopIteration`` once exhausted — fast operations under pytest-codspeed can exceed that count. Replace with ``_unbounded_records`` (``itertools.count`` + lazy ``DNSAddress`` construction) so the iterator survives any iteration count CodSpeed chooses. The construction work moves into the timed body but is a constant per-iteration overhead and doesn't distort relative measurements across runs.

bdraco · 2026-05-18T04:22:14Z

@bluetoothbot review

The comment said drift would surface via ``test_cache_total_records_invariant_under_mixed_ops``, but that test only exercises well-formed paths and never produces a heap-empty-with-counter-nonzero state. The test that actually pins the broken-invariant recovery is ``test_cache_eviction_empty_heap_returns_without_evicting``. Per ``CLAUDE.md`` ("default to no comments"; cross-references rot when referents are renamed) the comment shouldn't exist at all — the function name and docstring already describe the contract, and the while-loop fall-through is self-explanatory.

bdraco · 2026-05-18T04:48:37Z

ok bot suggestions are worse than alternatives. This is good as is

bdraco requested a review from Copilot May 18, 2026 02:37

Copilot started reviewing on behalf of bdraco May 18, 2026 02:39 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread src/zeroconf/_cache.py

bdraco had a problem deploying to release May 18, 2026 02:42 — with GitHub Actions Error

bdraco temporarily deployed to release May 18, 2026 02:45 — with GitHub Actions Inactive

bdraco requested a review from Copilot May 18, 2026 03:07

Copilot started reviewing on behalf of bdraco May 18, 2026 03:08 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread src/zeroconf/_cache.py

bdraco temporarily deployed to release May 18, 2026 03:19 — with GitHub Actions Inactive

bdraco temporarily deployed to release May 18, 2026 03:53 — with GitHub Actions Inactive

bdraco temporarily deployed to release May 18, 2026 03:59 — with GitHub Actions Inactive

bdraco requested a review from Copilot May 18, 2026 04:00

Copilot started reviewing on behalf of bdraco May 18, 2026 04:00 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread tests/benchmarks/test_cache_bound.py Outdated

bdraco requested a review from Copilot May 18, 2026 04:10

Copilot started reviewing on behalf of bdraco May 18, 2026 04:15 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread tests/benchmarks/test_cache_bound.py Outdated

bdraco merged commit 0ad3f37 into master May 18, 2026
36 of 37 checks passed

bdraco deleted the fix/cache-unbounded-records branch May 18, 2026 04:48

bluetoothbot mentioned this pull request May 18, 2026

fix: bound QuestionHistory size to prevent LAN-driven OOM #1733

Draft

Conversation

bdraco commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Cache cap

Eviction-victim collision (orphaned-store bug)

Replay-flood heap growth

Why not per-source-IP quotas

Cython perf check

Test plan

Uh oh!

bdraco commented May 18, 2026

Uh oh!

codecov Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codspeed-hq Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Performance Changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

bluetoothbot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review — fix: bound DNSCache record count to prevent unbounded LAN-driven growth

🟢 Suggestions

Checklist

Summary

Uh oh!

bdraco commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

bdraco commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

bdraco commented May 18, 2026

Uh oh!

bdraco commented May 18, 2026

Uh oh!

bdraco commented May 18, 2026

Uh oh!

bdraco commented May 18, 2026

Uh oh!

bluetoothbot commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

bdraco commented May 18, 2026

Uh oh!

bdraco commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bdraco commented May 18, 2026 •

edited

Loading

codecov Bot commented May 18, 2026 •

edited

Loading

codspeed-hq Bot commented May 18, 2026 •

edited

Loading

bluetoothbot commented May 18, 2026 •

edited

Loading