Skip to content

test: shave register/shutdown/TC waits on loopback#1711

Draft
bluetoothbot wants to merge 3 commits into
python-zeroconf:masterfrom
bluetoothbot:koan/speed-up-register-random-wait
Draft

test: shave register/shutdown/TC waits on loopback#1711
bluetoothbot wants to merge 3 commits into
python-zeroconf:masterfrom
bluetoothbot:koan/speed-up-register-random-wait

Conversation

@bluetoothbot
Copy link
Copy Markdown
Contributor

@bluetoothbot bluetoothbot commented May 18, 2026

What

Speed up four slow tests by extracting one production constant (_REGISTER_RANDOM_INTERVAL, the RFC 6762 §8.1 random pre-probe wait in async_check_service) and patching it through quick_timing on loopback, plus three localised test-only patches.

Part of #1707.

Why

async_check_service waits a random 150-250ms before probing — RFC 6762 §8.1 thundering-herd avoidance for real networks, pure overhead on 127.0.0.1. Tests that use quick_timing already get probe/announce/goodbye intervals shrunk, but the inline random.randint on this path is opaque to patching. Three of the slowest tests in #1707 without an in-flight PR (test_async_service_registration_name_conflict, test_name_conflicts, and everything that registers a service) pay this for nothing.

test_shutdown_loop and test_tc_bit_defers_last_response_missing carry similar dead waits — a 1s outer timeout in shutdown_loop against a never-completing pending task, and an 800ms cleanup poll against a 400-500ms RFC 6762 §7.2 TC-bit defer timer.

How

  1. Extract _REGISTER_RANDOM_INTERVAL = (150, 250) as a module-level constant in _core.py and patch it in quick_timing alongside the other loopback shaves. No production change — default still matches the RFC value.
  2. test_shutdown_loop: patch _TASK_AWAIT_TIMEOUT to 0.1s and drop .result(1).result(0.1). Both were just upper bounds on a never-completing pending task.
  3. test_tc_bit_defers_last_response_missing: patch _TC_DELAY_RANDOM_INTERVAL to (10, 20)ms and tighten the cleanup poll loop.
  4. test_integration: wait for the browser's first startup query to fire (cache empty) before registering. The original test relied on the 150-250ms register-random-wait being longer than the browser's 20-120ms first-query delay; with the new fast register that race flipped and the first query started picking up the known PTR via §7.1 known-answer suppression.

Speedups (CPython 3.12, loopback)

test before after
test_shutdown_loop 1.16s 0.26s
test_tc_bit_defers_last_response_missing 0.79s 0.32s
test_async_service_registration_name_conflict 0.72s 0.10s
test_name_conflicts 0.50s 0.10s

Plus a smaller transparent shave on every other quick_timing-using test that registers a service.

Test plan

  • Targeted tests pass three times in a row
  • Full tests/ suite passes (338 passed, 2 skipped, ~47s wall)
  • Pre-commit hooks pass (ruff, mypy, flake8, codespell, pyupgrade)

Quality Report

Changes: 6 files changed, 31 insertions(+), 12 deletions(-)

Code scan: clean

Tests: passed (4 PASSED)

Branch hygiene: clean

Generated by Kōan post-mission quality pipeline

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (4bae30a) to head (98c3785).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1711   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3426     3427    +1     
  Branches      471      471           
=======================================
+ Hits         3418     3419    +1     
  Misses          5        5           
  Partials        3        3           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 18, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing bluetoothbot:koan/speed-up-register-random-wait (98c3785) with master (4bae30a)

Open in CodSpeed

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 18, 2026

all the windows TC tests fail with this pr

    zeroconf:outgoing.py:471 now offsets = questions=1, answers=302, authorities=0, additionals=0
================================== FAILURES ===================================
__________________ test_tc_bit_defers_last_response_missing ___________________

    @patch.object(_listener, "_TC_DELAY_RANDOM_INTERVAL", (10, 20))
    def test_tc_bit_defers_last_response_missing():
        zc = Zeroconf(interfaces=["127.0.0.1"])
        _wait_for_start(zc)
        type_ = "_knowndefer._tcp.local."
        name = "knownname"
        name2 = "knownname2"
        name3 = "knownname3"
    
        registration_name = f"{name}.{type_}"
        registration2_name = f"{name2}.{type_}"
        registration3_name = f"{name3}.{type_}"
    
        desc = {"path": "/~paulsm/"}
        server_name = "ash-2.local."
        server_name2 = "ash-3.local."
        server_name3 = "ash-4.local."
    
        info = r.ServiceInfo(
            type_,
            registration_name,
            80,
            0,
            0,
            desc,
            server_name,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        info2 = r.ServiceInfo(
            type_,
            registration2_name,
            80,
            0,
            0,
            desc,
            server_name2,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        info3 = r.ServiceInfo(
            type_,
            registration3_name,
            80,
            0,
            0,
            desc,
            server_name3,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        zc.registry.async_add(info)
        zc.registry.async_add(info2)
        zc.registry.async_add(info3)
    
        protocol = zc.engine.protocols[0]
        now = r.current_time_millis()
        _clear_cache(zc)
        source_ip = "203.0.113.12"
    
        generated = r.DNSOutgoing(const._FLAGS_QR_QUERY)
        question = r.DNSQuestion(type_, const._TYPE_PTR, const._CLASS_IN)
        generated.add_question(question)
        for _ in range(300):
            # Add so many answers we end up with another packet
            generated.add_answer_at_time(info.dns_pointer(), now)
        generated.add_answer_at_time(info2.dns_pointer(), now)
        generated.add_answer_at_time(info3.dns_pointer(), now)
        packets = generated.packets()
        assert len(packets) == 4
        expected_deferred = []
    
        next_packet = r.DNSIncoming(packets.pop(0))
        expected_deferred.append(next_packet)
        threadsafe_query(zc, protocol, next_packet, source_ip, const._MDNS_PORT, Mock(), ())
        assert protocol._deferred[source_ip] == expected_deferred
        timer1 = protocol._timers[source_ip]
    
        next_packet = r.DNSIncoming(packets.pop(0))
        expected_deferred.append(next_packet)
        threadsafe_query(zc, protocol, next_packet, source_ip, const._MDNS_PORT, Mock(), ())
>       assert protocol._deferred[source_ip] == expected_deferred
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       KeyError: '203.0.113.12'

try harder

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 18, 2026

@bluetoothbot rr

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

PR Review — test: shave register/shutdown/TC waits on loopback

The _REGISTER_RANDOM_INTERVAL extraction is clean and the quick_timing fixture change is the right shape. The test_integration race fix (waiting for the first browser query before registering) is well-reasoned and the explanatory comment correctly cites §7.1 known-answer suppression. The test_shutdown_loop patch is a straightforward test-only tightening.

The blocker is test_tc_bit_defers_last_response_missing: it's failing on Windows per @bdraco. The combination of a 10-20ms patched defer interval and a halved cleanup poll budget (800ms → 400ms) doesn't survive Windows' ~15.6ms scheduler tick. Either restore the original cleanup budget or widen the patched random interval (e.g., (20, 50)ms) so the test's timing assumptions hold across the full CI matrix. Once that's resolved this is good to merge.


🔴 Blocking

1. Cleanup poll timeout halved — too tight for Windows (`tests/test_core.py`, L734-736)

The cleanup poll budget was reduced from 8×100ms=800ms to 20×20ms=400ms — a 50% cut. Combined with the patched _TC_DELAY_RANDOM_INTERVAL=(10, 20), this is failing on Windows (per @bdraco's report).

Two Windows-specific problems compound here:

  1. Coarse timer resolution. Windows' default scheduler tick is ~15.6ms. time.sleep(0.02) typically sleeps ~31ms (next tick), so 20 iterations actually take ~620ms — but the asyncio scheduler driving the TC defer timer has the same coarseness, and a 10-20ms call_later may not fire until 15-30ms+. Under load this can easily push the cleanup past the new 400ms ceiling.

  2. The defer timer is asyncio-driven, not time.sleep-driven. The poll loop runs on the test thread while _TC_DELAY_RANDOM_INTERVAL is consumed by the asyncio event loop's call_later. On Windows the ProactorEventLoop (or SelectorEventLoop) timer slot is bound to the system tick, so squeezing the random window to (10, 20)ms doesn't proportionally reduce when the timer fires — it just floors at the tick boundary.

Suggested fix: either keep the original 800ms cleanup budget (use range(40) × sleep(0.02) if you want finer-grained early-exit), or widen the patch to something like (20, 50)ms which is still 10x faster than the production (400, 500)ms but well clear of the Windows tick.

    for _ in range(20):
        time.sleep(0.02)
        if source_ip not in protocol._timers and source_ip not in protocol._deferred:
            break

🟡 Important

1. Patched interval is below practical Windows asyncio timer resolution (`tests/test_core.py`, L634)

_TC_DELAY_RANDOM_INTERVAL=(10, 20) is shorter than the Windows asyncio scheduler tick (~15.6ms). On Linux this is fine; on Windows the actual fire time will be quantised to the next tick boundary, so the variance is much larger than the (10, 20) range suggests — making the cleanup poll's worst-case timing harder to bound.

A value like (20, 50) or (30, 60) would still cut the production wait by an order of magnitude while staying clear of the Windows tick floor, and would make the test's timing assumptions hold consistently across CI matrices.

@patch.object(_listener, "_TC_DELAY_RANDOM_INTERVAL", (10, 20))

Checklist

  • Cross-platform timing safety — critical #1
  • Production change matches RFC default
  • Race fix in test_integration correctly addresses §7.1 KAS
  • Test docstrings/comments don't pad with rationale
  • No hand-edits to generated .c files
  • No Co-Authored-By trailers

Summary

The _REGISTER_RANDOM_INTERVAL extraction is clean and the quick_timing fixture change is the right shape. The test_integration race fix (waiting for the first browser query before registering) is well-reasoned and the explanatory comment correctly cites §7.1 known-answer suppression. The test_shutdown_loop patch is a straightforward test-only tightening.

The blocker is test_tc_bit_defers_last_response_missing: it's failing on Windows per @bdraco. The combination of a 10-20ms patched defer interval and a halved cleanup poll budget (800ms → 400ms) doesn't survive Windows' ~15.6ms scheduler tick. Either restore the original cleanup budget or widen the patched random interval (e.g., (20, 50)ms) so the test's timing assumptions hold across the full CI matrix. Once that's resolved this is good to merge.


Automated review by Kōanca16a07

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

@bdraco: all the windows TC tests fail with this pr

The Windows failure is almost certainly the combination of two compounding factors in test_tc_bit_defers_last_response_missing:

  1. _TC_DELAY_RANDOM_INTERVAL=(10, 20) is below the Windows scheduler tick (~15.6ms). The asyncio call_later driving the TC defer timer floors at the next tick, so the effective fire time is 15-30ms+ rather than the 10-20ms the patch suggests.

  2. The cleanup poll budget was halvedrange(8) × sleep(0.1) = 800ms became range(20) × sleep(0.02) = 400ms. Under Windows' coarser time.sleep granularity the wall clock is closer to 620ms, but the asyncio side is racing the same tick and can easily overshoot.

Two options to fix:

  • Keep the 800ms cleanup budget (range(40) × sleep(0.02) if you want the finer-grained early exit), or
  • Widen the patched random interval to something like (20, 50) or (30, 60) ms — still a 10x speedup over the production (400, 500) ms, but well clear of the Windows tick floor so the timing is deterministic across the CI matrix.

I'd lean toward the second — it preserves the speedup intent of the PR without making the cleanup poll do double duty as a slack budget for scheduler jitter.

`async_check_service` paid an inline `random.randint(150, 250)` ms wait
on every `register_service` — RFC 6762 §8.1 thundering-herd avoidance
for real networks, pure overhead on 127.0.0.1. Extracted as a module
constant `_REGISTER_RANDOM_INTERVAL` and patched in the `quick_timing`
fixture to (1, 5)ms alongside the existing _CHECK_TIME / _REGISTER_TIME
/ _UNREGISTER_TIME shaves.

Also:
- `test_shutdown_loop`: drop `.result(1)` → `.result(0.1)` and patch
  `_TASK_AWAIT_TIMEOUT` to 0.1s. The 1s waits were `shutdown_loop`'s
  outer timeout on a never-completing `asyncio.sleep(5)` — the test only
  needs the loop to be stoppable while a task is pending.
- `test_tc_bit_defers_last_response_missing`: patch
  `_TC_DELAY_RANDOM_INTERVAL` (400-500ms, RFC 6762 §7.2 TC-bit deferral)
  to (10, 20)ms and tighten the cleanup poll loop.
- `test_integration`: wait for the browser's first startup query to
  fire (with empty cache) before registering. Previously the test
  relied on the register's 150-250ms random wait being longer than the
  browser's 20-120ms first-query delay; with the new fast register that
  race flipped and the first query started seeing the known PTR via
  §7.1 known-answer suppression.

Speedups on loopback (CPython 3.12):

| test                                                  | before | after |
| ----------------------------------------------------- | ------ | ----- |
| test_shutdown_loop                                    | 1.16s  | 0.26s |
| test_tc_bit_defers_last_response_missing              | 0.79s  | 0.32s |
| test_async_service_registration_name_conflict        | 0.72s  | 0.10s |
| test_name_conflicts                                   | 0.50s  | 0.10s |

Plus a smaller transparent shave on every other `quick_timing`
register-based test. Full suite: 338 passed, 2 skipped. No production
behavior change — `_REGISTER_RANDOM_INTERVAL` still defaults to the
RFC 6762 §8.1 value on real networks.
@bluetoothbot
Copy link
Copy Markdown
Contributor Author

Rebase with requested adjustments

Branch koan/speed-up-register-random-wait was rebased onto master and review feedback was applied.

Changes applied

  • CI fix: Applied pre-push CI fix
  • CI fix: Pre-push CI fix applied

Stats

5 files changed, 23 insertions(+), 9 deletions(-)
Actions performed
  • Already-solved check: negative (confidence=high, reasoning=No commit on master extracts _REGISTER_RANDOM_INTERVAL or patches it via quick_timing; recent sp)
  • Rebased koan/speed-up-register-random-wait onto upstream/master
  • Pre-push CI check: previous run #26006589202 failed
  • Applied pre-push CI fix
  • Pre-push CI fix applied
  • Force-pushed koan/speed-up-register-random-wait to origin
  • CI check enqueued in ## CI (async)

CI status

CI will be checked asynchronously.


Automated by Kōan

@bluetoothbot bluetoothbot force-pushed the koan/speed-up-register-random-wait branch from ca16a07 to 6b84b72 Compare May 18, 2026 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants