Skip to content

fix: bound DNSCache record count to prevent unbounded LAN-driven growth#1718

Merged
bdraco merged 9 commits into
masterfrom
fix/cache-unbounded-records
May 18, 2026
Merged

fix: bound DNSCache record count to prevent unbounded LAN-driven growth#1718
bdraco merged 9 commits into
masterfrom
fix/cache-unbounded-records

Conversation

@bdraco
Copy link
Copy Markdown
Member

@bdraco bdraco commented May 18, 2026

Summary

Closes #1715. A LAN-local peer multicasting unique-name mDNS responses
can grow DNSCache.cache / _expirations / service_cache /
_expire_heap without bound. With _DNS_PTR_MIN_TTL = 1125
extending attacker-injected PTR records to ~19 min and async_expire
only running every _CACHE_CLEANUP_INTERVAL = 10 s, a low-Mbps
multicast flood OOMs a Raspberry-Pi-class deployment in minutes.

Adds a hard cap on the total record count with closest-to-expiration
eviction on overflow, plus a shared _maybe_rebuild_heap helper that
keeps _expire_heap bounded against a second variant of the same
attack (record replay with shifting TTLs, which leaves stale heap
entries without changing the cache size).

Details

Cache cap

  • _MAX_CACHE_RECORDS = 10_000 added to const.py. Picked as a
    ceiling well above any realistic legitimate workload — Home
    Assistant deployments typically sit in the low hundreds — while
    staying low enough to keep memory bounded at a few MB worst-case.
  • DNSCache gains _total_records: unsigned int (declared in
    _cache.pxd). Incremented in _async_add on genuinely-new
    inserts; decremented in _async_remove. The counter is
    authoritative — the cap check uses it directly rather than walking
    self.cache.
  • New _async_evict_oldest: when the cap is hit, heappops the
    closest-to-expiration (when, record) from _expire_heap and
    removes it before inserting the new arrival. Skips stale heap
    entries via the same expiration-equality check async_expire
    already uses.

Eviction-victim collision (orphaned-store bug)

The first iteration of this PR captured store = self.cache.get(record.key)
before eviction. If the closest-to-expiration victim was the last
record under record.key (which happens whenever the incoming
record shares its key with the victim — A+AAAA pairs, PTR fan-outs,
TXT/SRV pairs all hit this), _remove_key would del self.cache[record.key] during eviction, leaving the captured
store pointing at an orphaned dict. The subsequent store[record] = record then wrote into a dict no longer reachable through the
cache; the new record was effectively lost, and a later
async_expire of that record would KeyError from _remove_key.

Fix: defer the bucket creation. Look up store first (without
creating), do the eviction-if-needed pass, then cache.get again
to pick up any deletion eviction caused. The bucket is only created
with self.cache[record.key] = {} after eviction has settled.

Replay-flood heap growth

The cap above is gated on is_new, but a peer that just replays
cached records with different created/TTLs hits the is_new == False path: no cap check, no _total_records change — yet the
changed when makes _expirations.get(record) != when true, so a
fresh (when, record) tuple is heappushed while the prior tuple
stays behind as stale. _expire_heap grows unbounded between
cleanups even though cache and _total_records stay flat. At
~10k pps for 10 s that's ~1 M stale heap entries × ~100 B ≈ 100 MB
transient.

Factor the heap-rebuild check (len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) > 2 * len(expirations) → filter-and-heapify) into a
new _maybe_rebuild_heap cdef helper. Called from two sites:

  • _async_add immediately after the conditional heappush, so a
    re-add flood is bounded at every step. Amortized O(1) per push:
    the O(N) rebuild fires at most once per N stale pushes.
  • async_expire after the pop loop, replacing the existing inline
    rebuild (same behavior, slightly more aggressive — uses post-pop
    heap length instead of pre-pop, no test regression).

_async_evict_oldest no longer needs its own rebuild call:
proactive rebuild in _async_add keeps the heap bounded before
eviction is ever triggered.

Why not per-source-IP quotas

The reporter's "Better" suggestion is intentionally not pursued.
mDNS rides on UDP with no source authentication; an attacker on the
local segment can multicast with any source address, so a per-IP
quota is bypassed by source-rotation. The LAN itself is the trust
boundary in mDNS; once an attacker is past it, source IPs are not a
meaningful identity. Per-source bookkeeping would add hot-path
overhead and lifecycle complexity (peer mappings need GC as records
expire) in exchange for a fairness property that doesn't survive
contact with the protocol's threat model.

CWE-400 (Uncontrolled Resource Consumption). LAN-local attack
surface only.

Cython perf check

cython -a on the touched _cache.py:

_async_add additions (hot path):
  line  75:  self._total_records: int = 0                          score-0
  line 100:  if is_new and self._total_records >= _MAX_CACHE_RECORDS score-0
  line 101:      self._async_evict_oldest()                         score-0
  line 109:      self._total_records += 1                           score-0
  line 120:      self._maybe_rebuild_heap()  (after stale push)     score-0

_async_remove (decrement):
  line 173:  self._total_records -= 1                               score-0

_maybe_rebuild_heap helper body:
  line 153:  expire_heap_len = len(self._expire_heap)               score-9
  line 155:  expire_heap_len > _MIN_SCHEDULED_RECORD_EXPIRATION     score-0
  line 156:  and expire_heap_len > len(self._expirations) * 2       score-12

async_expire (uses helper):
  line 219:  self._maybe_rebuild_heap()                             score-0

Generated C compiles the cap check and counter ops to direct
unsigned int arithmetic, helper invocations to C-level vtable
dispatch. The yellow inside _maybe_rebuild_heap is one
PyObject_RichCompare per list-comp iteration (type-generic
equality against _expirations.get(...)) — unavoidable, and only
runs during the rare rebuild path. The hot path on duplicate
re-adds (same when) never enters the helper at all because the
if self._expirations.get(record) != when: gate trips first.

Wall-clock smoke test: below-cap adds run at ~159 ns/record;
sustained-eviction adds run at ~195 ns/record (+36 ns/record for the
heappop + cache delete).

Test plan

  • poetry run pytest tests/test_cache.py --timeout=60 -v — 33
    passed including five regressions:
    • test_cache_size_is_bounded — cap holds exactly under unique-name
      flood, FIFO eviction order
    • test_cache_eviction_skips_stale_heap_entries — heap stale-skip
      selects next valid victim
    • test_cache_eviction_victim_shares_key_with_new_record
      reproduces the orphan bug against pre-931358a code
    • test_cache_re_add_flood_does_not_grow_heap_unbounded — replay
      attack bounded by the _maybe_rebuild_heap helper
    • test_cache_total_records_invariant_under_mixed_ops — pins
      _total_records == sum(len(store) for store in cache.cache.values())
      after every code path that touches it (fresh inserts, re-adds,
      DNSService, DNSNsec, shared-key emptying its bucket, full-cap
      eviction loop, async_expire, async_remove_records); any
      future off-by-one fails loudly rather than silently underflowing
      the unsigned counter
  • poetry run pytest --timeout=60 — 345 passed, 3 skipped (full
    suite)
  • poetry run pre-commit run --files <changed files> — all hooks
    pass
  • REQUIRE_CYTHON=1 poetry install --only=main,dev — extension
    rebuilds against the new .pxd (added _total_records field,
    _async_evict_oldest and _maybe_rebuild_heap cdefs)
  • CodSpeed bench at tests/benchmarks/test_cache_bound.py
    covering below-cap and at-cap add paths
  • (optional) draft a GitHub Security Advisory so a CVE can be
    assigned alongside the patched release

Closes #1715.

A LAN-local peer multicasting unique-name mDNS responses could grow
``DNSCache.cache`` / ``_expirations`` / ``service_cache`` /
``_expire_heap`` without bound. ``_DNS_PTR_MIN_TTL = 1125`` (RFC 6762
recommendation for ServiceBrowser refresh-floor) actually extends
attacker-injected PTR records to ~19 min before they expire naturally,
and ``async_expire`` only runs every ``_CACHE_CLEANUP_INTERVAL = 10`` s
— too slow to keep up with a sustained flood. On memory-constrained
deployments (Home Assistant on a Raspberry Pi is the canonical victim)
this trivially OOMs the process.

Add a hard cap (``_MAX_CACHE_RECORDS = 10_000`` in ``const.py``) on the
total number of records the cache will hold. ``DNSCache`` now tracks
``_total_records`` (incremented on genuinely-new inserts in
``_async_add``, decremented in ``_async_remove``). When the cap is
hit, ``_async_evict_oldest`` heappops the closest-to-expiration record
and removes it before inserting the new one. Reuses the existing
``_expire_heap`` for victim selection — O(log n) — and skips stale
heap entries via the same expiration-equality check
``async_expire`` already uses.

All four touched lines compile to direct C int ops in the Cython
build (score-0 in ``cython -a`` annotation); the eviction call is a
C-level vtable dispatch fired only on overflow. Wall-clock check:
below-cap add stays at ~160 ns/record; over-cap add with constant
eviction is ~195 ns/record (+36 ns for the heappop + cache delete).

Per-source-IP record quotas — the reporter's "Better" suggestion —
are deferred to a follow-up PR. This commit closes the unbounded-
growth bug class on its own; the quota work changes the fairness
properties but not the memory bound.

CWE-400 (Uncontrolled Resource Consumption). LAN-local attack
surface only.
@bdraco bdraco requested a review from Copilot May 18, 2026 02:37
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

@bluetoothbot review

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (64d143d) to head (39ab755).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1718   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3426     3440   +14     
  Branches      471      473    +2     
=======================================
+ Hits         3418     3432   +14     
  Misses          5        5           
  Partials        3        3           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 18, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

Benchmark BASE HEAD Efficiency
🆕 test_cache_add_at_cap_evicts N/A 57.3 µs N/A
🆕 test_cache_add_below_cap N/A 2.3 ms N/A

Comparing fix/cache-unbounded-records (39ab755) with master (0ff3c6b)

Open in CodSpeed

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR bounds DNSCache growth to mitigate LAN-local memory exhaustion from floods of unique mDNS records.

Changes:

  • Adds _MAX_CACHE_RECORDS as a hard DNS cache record cap.
  • Tracks _total_records and evicts the closest-to-expiration record on overflow.
  • Adds regression tests and CodSpeed benchmarks for bounded-cache behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/zeroconf/const.py Defines the cache record limit.
src/zeroconf/_cache.py Adds record counting and overflow eviction logic.
src/zeroconf/_cache.pxd Updates Cython declarations for the new cache state and helper.
tests/test_cache.py Adds tests for cap enforcement and counter synchronization.
tests/benchmarks/test_cache_bound.py Adds benchmarks for below-cap and at-cap insertion paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/zeroconf/_cache.py
Adds ``test_cache_eviction_skips_stale_heap_entries`` which fills the
cache to ``_MAX_CACHE_RECORDS``, re-adds the closest-to-expiration
record with a longer TTL (which leaves a stale ``(when, record)``
tuple at the head of ``_expire_heap`` and updates ``_expirations``
with the new ``when``), then forces a single eviction and asserts the
refreshed record survives while the next-oldest legitimate record is
the one that gets evicted. Without this test the ``continue`` branch
inside ``_async_evict_oldest`` would be untouched by the suite.
@bluetoothbot
Copy link
Copy Markdown
Contributor

bluetoothbot commented May 18, 2026

PR Review — fix: bound DNSCache record count to prevent unbounded LAN-driven growth

Solid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating cache, and replay-flood saturating _expire_heap) are both addressed, with the orphan-store bug correctly handled by the deferred bucket creation. All prior Copilot review comments (orphan store, replay-flood, benchmark prefilling, finite-pool exhaustion) are addressed in the current diff. Test coverage is comprehensive — the test_cache_total_records_invariant_under_mixed_ops test in particular is a strong invariant lock that will catch any future off-by-one. _MAX_CACHE_RECORDS = 10000 is a defensible ceiling for a defense-in-depth bound (RFC 6762 §10 advisory caching). Cython .pxd updates are consistent with the existing unsigned int convention for cache counters. Performance numbers (~159 ns/record below cap, ~195 ns/record at cap) are reasonable. Suggestions in this review are minor and non-blocking — the PR is merge-ready.


🟢 Suggestions

1. `_total_records` can underflow on misuse (`src/zeroconf/_cache.py`, L173)

_async_remove decrements _total_records unconditionally, but the .pxd declares the field as unsigned int. If async_remove_records is ever called with a record that was already removed (double-remove from a caller, or a stray call after async_expire already popped it), the counter wraps to UINT_MAX and the cap effectively disables itself.

In practice the in-tree callers (record_manager, async_expire) guarantee removed records exist in _expirations, so this is defensive rather than a present-day bug. Worth considering either:

  • A _total_records > 0 guard before decrementing, or
  • Decrement only when _remove_key actually freed a record (would require threading a bool back from _remove_key).

Not blocking — the invariant test (test_cache_total_records_invariant_under_mixed_ops) keeps the in-tree paths honest, and an underflow would surface as a test failure on any new caller that double-removes.

_remove_key(self.cache, record.key, record)
self._expirations.pop(record, None)
self._total_records -= 1
2. Worst-case O(N log N) eviction when heap is stale-dominated (`src/zeroconf/_cache.py`, L129-138)

_async_evict_oldest may heappop up to len(_expire_heap) stale entries before finding a live one. In normal operation _maybe_rebuild_heap (called after every push in _async_add) keeps stale ratio under 2× live, so this is bounded — but in a corner case where the cap fires repeatedly without any intermediate rebuild trigger (heap just-under-2× ratio when eviction starts), eviction can do significant work.

This is acceptable given the alternative (rebuild inside the eviction hot path) would slow the common case. Worth a one-line comment in _async_evict_oldest noting that the loop relies on _async_add's rebuild to keep the stale prefix bounded, so a future reader doesn't 'fix' it by removing the rebuild call.

def _async_evict_oldest(self) -> None:
    """Drop the closest-to-expiration record to make room for a new one."""
    while self._expire_heap:
        when_record = heappop(self._expire_heap)
        record = when_record[1]
        if self._expirations.get(record) != when_record[0]:
            continue
        self._async_remove(record)
        return
3. Test comment phrasing (`tests/test_cache.py`, L571-573)

The comment 'Clear the _MIN_SCHEDULED_RECORD_EXPIRATION floor so the rebuild engages' reads as 'reset/zero the constant', but the test doesn't actually clear or patch it — it just produces enough re-adds across cycles to exceed it. Consider 'Exceed the _MIN_SCHEDULED_RECORD_EXPIRATION floor across cycles so the rebuild can engage' to avoid the ambiguity.

Purely cosmetic — the test is correct.

# Stay below the cache cap so eviction never fires; the attack here is
# heap growth via re-add, not cap saturation. Clear the
# ``_MIN_SCHEDULED_RECORD_EXPIRATION`` floor so the rebuild engages.

Checklist

  • No hardcoded secrets
  • Input validation at trust boundaries (LAN-mDNS threat model)
  • No unbounded collection growth
  • Cython .pxd updated in same commit as .py
  • Counter invariants pinned by tests
  • Hot path performance preserved (CodSpeed)
  • Underflow safety on counter decrement — suggestion #1
  • Stale-prefix bound documented near eviction — suggestion #2
  • No commits with Co-Authored-By trailers
  • Conventional Commits subject (fix: prefix)

Summary

Solid, well-targeted security fix for the unbounded cache growth in #1715. The two real attack variants (unique-name flood saturating cache, and replay-flood saturating _expire_heap) are both addressed, with the orphan-store bug correctly handled by the deferred bucket creation. All prior Copilot review comments (orphan store, replay-flood, benchmark prefilling, finite-pool exhaustion) are addressed in the current diff. Test coverage is comprehensive — the test_cache_total_records_invariant_under_mixed_ops test in particular is a strong invariant lock that will catch any future off-by-one. _MAX_CACHE_RECORDS = 10000 is a defensible ceiling for a defense-in-depth bound (RFC 6762 §10 advisory caching). Cython .pxd updates are consistent with the existing unsigned int convention for cache counters. Performance numbers (~159 ns/record below cap, ~195 ns/record at cap) are reasonable. Suggestions in this review are minor and non-blocking — the PR is merge-ready.


Automated review by Kōanc24fa39
51fbc27
931358a
7c6be63
052b1e2
a2a2d34
38b1517
e5b509a
39ab755

Address review feedback on #1718.

**Orphaned-store bug (blocking).** ``_async_add`` was capturing
``store = self.cache.get(record.key)`` before calling
``_async_evict_oldest``. If the closest-to-expiration victim was the
last record under ``record.key`` (e.g. the incoming record shares its
key with the victim — A+AAAA pairs, PTR fan-outs, TXT/SRV pairs all
hit this in real workloads), ``_remove_key`` would ``del
self.cache[record.key]`` during eviction, leaving the captured
``store`` pointing at an orphaned dict. The subsequent ``store[record]
= record`` then wrote into a dict no longer reachable through the
cache; the new record was effectively lost, and a later
``async_expire`` of that record would ``KeyError`` from
``_remove_key``.

Defer the bucket creation: look up ``store`` first (without creating),
do the eviction-if-needed pass, then ``cache.get`` again to pick up
any deletion eviction caused. The bucket is only created with
``self.cache[record.key] = {}`` after we know eviction has settled.

**Stale-prefix scan (suggestion #1).** Under sustained TTL re-adds —
e.g. ``async_mark_unique_records_older_than_1s_to_expire`` rewriting
TTLs to 1 — ``_expire_heap`` accumulates stale ``(when, record)``
tuples that no longer match ``_expirations[record]``. The previous
eviction body would pop those one at a time, doing
O(stale_prefix * log n) work per ``_async_add``. Reuse the same
``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION and len(heap) >
len(expirations) * 2`` heuristic that ``async_expire`` already runs:
when the heap is mostly stale, rebuild it once up front, then run
the pop loop against only live entries.

**Accounting invariant test.** ``_total_records`` is ``cdef public
unsigned int``; an off-by-one in ``_async_add`` / ``_async_remove``
would silently underflow in the Cython wheel and pin the cap-check
True forever (eviction storm on every add). Rather than defensively
zero-clamp the decrement — which would hide the bug — the new
``test_cache_total_records_invariant_under_mixed_ops`` walks the
counter through every code path that touches it (fresh inserts,
re-adds, DNSService, DNSNsec, shared-key inserts that empty their
bucket on removal, full-cap eviction loop, ``async_expire``,
``async_remove_records``) and asserts ``cache._total_records ==
sum(len(s) for s in cache.cache.values())`` after every step. Any
future change that misses an increment or doubles a decrement fails
loudly.

Regression tests:
- ``test_cache_eviction_victim_shares_key_with_new_record`` —
  reproduces the orphan; would have failed against c24fa39.
- ``test_cache_eviction_rebuilds_heap_when_mostly_stale`` — fills to
  cap then re-adds with two distinct TTLs to push the heap to ~3x
  MAX, then triggers eviction and asserts the rebuild ran (heap and
  expirations match after the pass).
- ``test_cache_total_records_invariant_under_mixed_ops`` — counter
  invariant across all touched code paths.

The ``.pxd`` adds ``expire_heap_len`` as ``unsigned int`` to
``_async_evict_oldest``'s locals so the new threshold comparison
stays a direct C int op in the Cython build.
@bdraco bdraco requested a review from Copilot May 18, 2026 03:07
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

@bluetoothbot review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread src/zeroconf/_cache.py
Address remaining review feedback on #1718.

**Re-add heap growth (Copilot inline at L100, real bug separate from
the orphan).** ``_async_add``'s cap check is gated on ``is_new``, but
a peer that just replays cached records with a different ``created``/
TTL hits the ``is_new == False`` path: no cap check, no
``_total_records`` change — yet the changed ``when`` makes
``_expirations.get(record) != when`` true, so a fresh ``(when,
record)`` tuple is heappushed while the prior tuple stays behind as
stale. ``_expire_heap`` grows unbounded between cleanups even though
``cache`` and ``_total_records`` stay flat. At ~10k pps for 10 s
that's ~1 M stale heap entries × ~100 B ≈ 100 MB transient.

Factor the rebuild check (``len(heap) > _MIN_SCHEDULED_RECORD_EXPIRATION
and len(heap) > 2 * len(expirations)`` → filter-and-heapify) into a
new ``_maybe_rebuild_heap`` cdef helper. Call it from three sites:

- ``_async_add`` immediately after the conditional heappush, so a
  re-add flood is bounded at every step (amortized O(1) per push,
  O(N) rebuild fires at most once per N stale pushes)
- ``async_expire`` after the pop loop, replacing the existing inline
  rebuild — same behavior, slightly more aggressive (uses post-pop
  heap length instead of pre-pop), no test regression
- (removed from ``_async_evict_oldest`` — the eager rebuild in
  ``_async_add`` keeps the heap bounded before eviction is ever
  triggered, so the eviction-side call was dead code)

Cython codegen: the rebuild call from ``_async_add`` only fires when
the conditional heappush already fired, so the hot path on duplicate
re-adds (same ``when``) is unchanged. The check inside
``_maybe_rebuild_heap`` is two ``len()`` calls + an integer compare —
score-0 in ``cython -a``.

**Test DRY-up.** Introduce a module-level ``_addr(name, idx, ...)``
helper that builds the boilerplate ``DNSAddress(name, _TYPE_A,
_CLASS_IN, ttl, bytes((idx & 0xFF, (idx >> 8) & 0xFF, 0, 1)), created
=...)`` so each new test reads top-down instead of being dominated by
constructor noise. Test docstrings collapsed to one sentence per
``CLAUDE.md``.

**New regression test:** ``test_cache_re_add_flood_does_not_grow_heap_unbounded``
adds 200 records below the cap, then replays them through 10 cycles
of varying TTLs (2000 stale pushes total) and asserts the heap stays
near the rebuild threshold. The previous code would have grown it to
~11x the record count; the helper keeps it within ``2 * expirations``.

**Removed test:** ``test_cache_eviction_rebuilds_heap_when_mostly_stale``
— its premise (build up stale entries, then trigger eviction-side
rebuild) is no longer reachable now that ``_async_add`` rebuilds
proactively. The same code path is covered by the re-add flood test.

**Existing tests updated:** ``test_cache_heap_multi_name_cleanup`` and
``test_cache_heap_pops_order`` previously asserted pre-cleanup heap
size (``min_records_to_cleanup + 5``). With proactive rebuild firing
inside ``_async_add``, the heap is cleaned up earlier and the post-
cleanup invariant (``1 + 5`` after ``async_expire``) still holds, but
the intermediate snapshot is no longer accurate. Replaced the stale
intermediate assertion with a comment.
``_async_evict_oldest`` falls through to a silent return if
``_expire_heap`` is empty. By the cache invariant every record
counted in ``_total_records`` has a heap entry, so this branch is
unreachable in normal operation — but it's a defense-in-depth path
worth pinning. Force the broken state directly
(``_total_records = MAX, _expire_heap = []``) and verify eviction
returns gracefully while the subsequent insert still completes.
@bdraco bdraco requested a review from Copilot May 18, 2026 04:00
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

@bluetoothbot review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread tests/benchmarks/test_cache_bound.py Outdated
Three review follow-ups on #1718:

**Bench actually measures eviction (Copilot inline at L43).**
``test_cache_add_at_cap_evicts`` previously created an empty cache
and added ``_MAX_CACHE_RECORDS + 1000`` records, so ~91% of the
measured time was the below-cap fill path and only the final 1000
inserts evicted — the CodSpeed number was not representative of the
eviction performance the name implies. Restructure to pre-fill the
cache to the cap *outside* the timed body and consume one fresh
unique record per benchmark iteration. Each iteration measures
exactly one ``_async_add`` + ``_async_evict_oldest`` cycle, with the
cache permanently at the cap. ``_make_records`` gains a ``prefix=``
parameter so the fill records and the overflow pool don't collide on
``name``.

**Defensive test pins counter state (Kōan suggestion #3).**
``test_cache_eviction_empty_heap_returns_without_evicting`` already
asserts the new record lands; add the corresponding
``_total_records == _MAX_CACHE_RECORDS + 1`` assertion so a future
change to the broken-state recovery (refuse the add, clamp the
counter) fails this test explicitly rather than silently shifting
behaviour.

**One-line docstring (Kōan suggestion #2).**
``_maybe_rebuild_heap`` had an 11-line docstring restating the body
and including amortized-cost analysis. Per CLAUDE.md the why lives
at the call site comment in ``_async_add`` (already does); the
function's contract is a single line.

**No log on the empty-heap fall-through (Kōan suggestion #1).**
Kōan suggested an assertion or debug log; the assertion would crash
production on invariant drift, and even a guarded ``log.debug(...)``
adds Python attribute-lookup overhead in the function the cap
overflow path runs through. Drift is already caught loudly by
``test_cache_total_records_invariant_under_mixed_ops``. Comment
explains the deliberate silence.
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

@bluetoothbot review

@bdraco bdraco requested a review from Copilot May 18, 2026 04:10
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

bad AI suggestion to add logging in the hot path

@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

need to hand check nsec path.. should add a test to be sure

The ``new = is_new and not isinstance(record, DNSNsec)`` line in
``_async_add`` only gates the listener-notify return value — the
counter increment, the cache store, and the cap check are all
type-agnostic. A peer can't flood NSec records past the cap today.
But the existing ``new`` flag is exactly the shape a future refactor
might extend to ``_total_records`` ("DNSNsec isn't real data, don't
count it"), which would silently open a bypass. Pin
``_total_records == _MAX_CACHE_RECORDS`` after pushing
``_MAX_CACHE_RECORDS + 100`` unique-name DNSNsec records so any such
regression fails this test loudly.
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

ok happy with it. but should have the bot check it again as I'm tired and might miss something

@bluetoothbot
Copy link
Copy Markdown
Contributor

@bdraco: need to hand check nsec path.. should add a test to be sure

The NSEC accounting path is already pinned by test_cache_total_records_invariant_under_mixed_ops: it adds a DNSNsec (_total_records == actual() == 21) and then removes it (_total_records == actual() == 20), so the increment/decrement symmetry and the cache-bucket-size invariant both hold across NSEC. What that test doesn't pin is NSEC arriving at the cap — the current _async_add treats NSEC identically to other records for the cap check (is_new and self._total_records >= _MAX_CACHE_RECORDS fires regardless of isinstance(record, DNSNsec)), so an NSEC flood drives eviction the same way an A-record flood does. If you want a hard pin for that specifically, a small focused test would do it: fill to _MAX_CACHE_RECORDS with A records, add one DNSNsec, assert _total_records == _MAX_CACHE_RECORDS, the NSEC is reachable via async_get_unique, and exactly one prior record is gone. Cheap to add and forecloses any future regression where NSEC accidentally gets excluded from the cap path (the is_new / new split already differentiates them once for the return value, which is where a future refactor could easily slip).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread tests/benchmarks/test_cache_bound.py Outdated
**Cross-type NSec spot-test (Kōan).** Add
``test_cache_dnsnsec_at_cap_evicts_prior_record``: fill the cache to
``_MAX_CACHE_RECORDS`` with ``DNSAddress`` records, add a single
``DNSNsec``, assert the counter stays at cap (NSEC routed through
eviction), the NSEC is reachable via ``cache.cache[nsec.key]``
(orphan fix covers NSEC's key-collision path), and the earliest
fill record was the eviction victim. Complements the existing flood
test — together they pin NSEC against both ``new``-flag drift and
cross-type ``_async_add`` regressions. Uses direct cache lookup
instead of ``async_get_unique`` because the latter's type stub
excludes ``DNSNsec``.

**Unbounded benchmark generator (Copilot inline at
test_cache_bound.py:53).** The previous bench used
``iter(_make_records(100_000, ...))`` which raises ``StopIteration``
once exhausted — fast operations under pytest-codspeed can exceed
that count. Replace with ``_unbounded_records`` (``itertools.count``
+ lazy ``DNSAddress`` construction) so the iterator survives any
iteration count CodSpeed chooses. The construction work moves into
the timed body but is a constant per-iteration overhead and doesn't
distort relative measurements across runs.
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

@bluetoothbot review

The comment said drift would surface via
``test_cache_total_records_invariant_under_mixed_ops``, but that
test only exercises well-formed paths and never produces a
heap-empty-with-counter-nonzero state. The test that actually pins
the broken-invariant recovery is
``test_cache_eviction_empty_heap_returns_without_evicting``. Per
``CLAUDE.md`` ("default to no comments"; cross-references rot when
referents are renamed) the comment shouldn't exist at all — the
function name and docstring already describe the contract, and the
while-loop fall-through is self-explanatory.
@bdraco
Copy link
Copy Markdown
Member Author

bdraco commented May 18, 2026

ok bot suggestions are worse than alternatives. This is good as is

@bdraco bdraco merged commit 0ad3f37 into master May 18, 2026
36 of 37 checks passed
@bdraco bdraco deleted the fix/cache-unbounded-records branch May 18, 2026 04:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Security: Unbounded DNS record cache can be filled by any LAN peer (memory exhaustion)

3 participants