Skip to content

test: drop pending multicast responses before TOCTOU assertion#1701

Merged
bdraco merged 1 commit into
masterfrom
fix-flaky-toctou-race-test-timeout
May 17, 2026
Merged

test: drop pending multicast responses before TOCTOU assertion#1701
bdraco merged 1 commit into
masterfrom
fix-flaky-toctou-race-test-timeout

Conversation

@bdraco
Copy link
Copy Markdown
Member

@bdraco bdraco commented May 17, 2026

Summary

test_service_info_async_request flakes on multiple CI runners (e.g. macOS 3.12 skip_cython, ubuntu 3.13 use_cython) with:

AssertionError: assert [b'\n\x00\x01\x02'] == [b'\n\x00\x01\x03']

Root cause

After tracing the captured packet log, the failure isn't a timeout / convergence problem — a stale queued multicast response ends up overwriting the correct one:

  1. The earlier async_get_service_info / gather calls trigger queries that the local zeroconf answers via out_queue (or out_delay_queue). The queue snapshots the records (SRV → ash-1.local., A → 10.0.1.2) and aggregates the send by up to ~620ms / ~1320ms.
  2. async_update_service(new_info) swaps the registry entry to ash-2.local. / 10.0.1.3 and broadcasts the new records.
  3. With quick_timing, the broadcast finishes within ~30ms, so the stale queued response is still pending when the test reaches the assertion block.
  4. The test calls _clear_cache(...) and async_request(zc, 500) with _is_complete patched to False.
  5. The queued response flushes after the cache clear, re-caching ash-1.local. → 10.0.1.2 and pointing aiosinfo.server at ash-1.local..
  6. The patched loop then keeps asking for an A record for ash-1.local., which nothing in the registry answers, so the final state stays on 10.0.1.2 and the assertion fails.

Evidence from the failing ubuntu run (MulticastOutgoingQueue flushing stale records right after the ash-2 broadcasts):

... answers=[(record[srv,…,xxxyyy._test1-srvc-type._tcp.local.]=120/119,ash-1.local.:80, 0.0),
            (record[txt,…]…, 0.0)],
   additionals=[record[a,…,ash-1.local.]=120/119,10.0.1.2, …]

…followed by repeated questions=[a[…,ash-1.local.], quada[…,ash-1.local.]] queries from the AsyncServiceInfo loop.

Fix

Drop the pending entries from both out_queue and out_delay_queue before clearing the cache. The already-scheduled loop.call_at for each queue fires on the empty deque and is a no-op (the while len(self.queue)… loop in async_ready doesn't iterate, and if answers: is False), so this is deterministic and adds no wall-clock delay.

A code-side fix for the underlying behaviour (invalidating queued responses when the registry changes) is left as a separate concern.

Test plan

  • SKIP_CYTHON=1 poetry run pytest --timeout=60 -q tests/test_asyncio.py::test_service_info_async_request — 5 consecutive runs pass (~4.15s each).
  • poetry run pytest --timeout=60 -q tests/test_asyncio.py::test_service_info_async_request (with Cython extensions) — 3 consecutive runs pass.
  • poetry run pre-commit run --files tests/test_asyncio.py — clean.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 17, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing fix-flaky-toctou-race-test-timeout (aaf120e) with master (dd341a3)

Open in CodSpeed

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (dd341a3) to head (aaf120e).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1701   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3410     3410           
  Branches      464      464           
=======================================
  Hits         3402     3402           
  Misses          5        5           
  Partials        3        3           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@bdraco bdraco force-pushed the fix-flaky-toctou-race-test-timeout branch from 9b372ed to f9167cf Compare May 17, 2026 16:58
@bdraco bdraco changed the title test: restore TOCTOU race assertion timeout to 3000ms test: drain multicast queues before TOCTOU race assertion May 17, 2026
`test_service_info_async_request` patches `_is_complete` to False
to keep `async_request` looping. With `quick_timing`, the earlier
`async_get_service_info` queries leave multicast responses pending
in `out_queue` / `out_delay_queue` that snapshot the records as
they were *before* `async_update_service` swapped the registry
to `new_info`. When those stale answers flush after
`_clear_cache` they re-cache
`xxxyyy._test1-srvc-type._tcp.local. -> ash-1.local. -> 10.0.1.2`
and point `aiosinfo.server` at `ash-1.local.`. The patched loop
then keeps asking for an A record nobody answers, so the final
state stays on `10.0.1.2`:

    AssertionError: assert [b'\\n\\x00\\x01\\x02'] == [b'\\n\\x00\\x01\\x03']

Drop the pending entries from both queues before clearing the
cache. The already-scheduled `loop.call_at` for each queue fires
on the empty deque and is a no-op, so this is deterministic and
adds no wall-clock delay.
@bdraco bdraco force-pushed the fix-flaky-toctou-race-test-timeout branch from f9167cf to aaf120e Compare May 17, 2026 17:03
@bdraco bdraco changed the title test: drain multicast queues before TOCTOU race assertion test: drop pending multicast responses before TOCTOU assertion May 17, 2026
@bdraco bdraco merged commit d2058d9 into master May 17, 2026
65 of 66 checks passed
@bdraco bdraco deleted the fix-flaky-toctou-race-test-timeout branch May 17, 2026 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant