Skip to content

Don't call callbacks whend holding _handlers_lock#258

Merged
jstasiak merged 1 commit into
python-zeroconf:masterfrom
emontnemery:fix_starve
May 26, 2020
Merged

Don't call callbacks whend holding _handlers_lock#258
jstasiak merged 1 commit into
python-zeroconf:masterfrom
emontnemery:fix_starve

Conversation

@emontnemery

@emontnemery emontnemery commented May 25, 2020

Copy link
Copy Markdown
Collaborator

Don't hold _handlers_lock when calling the service callbacks.

Closes #255

Background:
#239 adds the lock _handlers_lock:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L2209

Which is used in the engine thread:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L2484-L2489

And also by the service browser when issuing the state change callbacks:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L1541-L1546

Both pychromecast and Home Assistant calls Zeroconf.get_service_info from the service callbacks which means the lock may be held for several seconds which will starve the engine thread.

@coveralls

coveralls commented May 25, 2020

Copy link
Copy Markdown

Coverage Status

Coverage remained the same at 92.818% when pulling 0b9152b on emontnemery:fix_starve into 445d7f5 on jstasiak:master.

@coveralls

Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.1%) to 92.69% when pulling 0b9152b on emontnemery:fix_starve into 445d7f5 on jstasiak:master.

@jstasiak

Copy link
Copy Markdown
Collaborator

I need to wrap my head around why exactly does this help (for my own understanding), I'll have a longer look tomorrow unless someone beats me to it (@mattsaxon you're probably the best person to review this).

@jstasiak

Copy link
Copy Markdown
Collaborator

One note to remember when merging this: it'll be nice to have the whole content of the original pull request comment in the commit message in master, it contains crucial information to understand this change.

@emontnemery

emontnemery commented May 25, 2020

Copy link
Copy Markdown
Collaborator Author

I think it's clear it's not good idea to hold the lock when calling self._service_state_changed.fire():

  • Both Home Assistant and pychromecast will call Zeroconf.get_service_info from the state change callbacks
  • If get_service_info() needs to send questions because the answer is not cached, there is a deadlock until get_service_info() times out because the engine will get stuck in handle_response waiting for the lock .

A reflection is that calling get_service_info() when getting the callback seems a bit fragile, any suggestion for how it can be used in a slightly more fault tolerant way?
(This is a question for possibly improving HA / pychromecast)

@mattsaxon mattsaxon left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me. Can I check that this is the only change that was needed to fix the whole issue being fixed as a number of patches were applied for testing purposes, have we backed those out and demonstrated this patch alone fixes the issue?

@emontnemery

emontnemery commented May 26, 2020

Copy link
Copy Markdown
Collaborator Author

@mattsaxon: the other patches were:

  • improve logging, merged in home assistant and zeroconf
  • make pychromecast listen to update callbacks, this has been merged and released
  • delay before retrying the call to get_service_info() in pychromecast, there is already a default 3s timeout in zeroconf so this won't make a difference

Edit: thanks for reviewing and helping out with the issue!

@mattsaxon

Copy link
Copy Markdown
Collaborator

@emontnemery , did you determine if listening for the updates was required and therefore if there is another underlying bug in zeroconf that is needing to be investigated?

@emontnemery

emontnemery commented May 26, 2020

Copy link
Copy Markdown
Collaborator Author

There was always an add callback first in the logs.
The reason it worked better when handling the update callback in pychromecast, was that the data needed by get_service_info was in cache so the call didn't deadlock waiting for the question.

@jstasiak

Copy link
Copy Markdown
Collaborator

Yeah, I've just been looking at logs supplied by @hmmbob in #255 (comment) and there was always an add_service before update_service:

% wget https://www.hmmbob.nl/110-26.1-patched-2.log
(...)
% rg '(add|update)_service' 110-26.1-patched-2.log | sed 's/DEBUG.*discovery\]/(...)/'                                                                                      
2020-05-25 22:40:45 (...) add_service _googlecast._tcp.local., Google-Home-Mini-71d8c580bbd0166b72f10071a427b6cb._googlecast._tcp.local.
2020-05-25 22:40:46 (...) add_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
2020-05-25 22:40:46 (...) add_service _googlecast._tcp.local., Music-Flow-SH8-e2a24bfc4e9e344424f834b969484002._googlecast._tcp.local.
2020-05-25 22:40:48 (...) add_service _googlecast._tcp.local., Google-Nest-Hub-4660f5b8aabe10031a93ebd362367256._googlecast._tcp.local.
2020-05-25 22:40:49 (...) add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4._googlecast._tcp.local.
2020-05-25 22:40:51 (...) add_service _googlecast._tcp.local., Google-Home-Mini-0256c1dd13b44c80849b281970d02d11._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service _googlecast._tcp.local., Google-Home-Mini-71d8c580bbd0166b72f10071a427b6cb._googlecast._tcp.local.
2020-05-25 22:40:55 (...) add_service _googlecast._tcp.local., Google-Home-Mini-71d8c580bbd0166b72f10071a427b6cb._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service done for _googlecast._tcp.local., Google-Home-Mini-71d8c580bbd0166b72f10071a427b6cb._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
2020-05-25 22:40:55 (...) add_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service done for _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service _googlecast._tcp.local., Google-Nest-Hub-4660f5b8aabe10031a93ebd362367256._googlecast._tcp.local.
2020-05-25 22:40:55 (...) add_service _googlecast._tcp.local., Google-Nest-Hub-4660f5b8aabe10031a93ebd362367256._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service done for _googlecast._tcp.local., Google-Nest-Hub-4660f5b8aabe10031a93ebd362367256._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4._googlecast._tcp.local.
2020-05-25 22:40:55 (...) add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service done for _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service _googlecast._tcp.local., Google-Home-Mini-0256c1dd13b44c80849b281970d02d11._googlecast._tcp.local.
2020-05-25 22:40:55 (...) add_service _googlecast._tcp.local., Google-Home-Mini-0256c1dd13b44c80849b281970d02d11._googlecast._tcp.local.
2020-05-25 22:40:55 (...) update_service done for _googlecast._tcp.local., Google-Home-Mini-0256c1dd13b44c80849b281970d02d11._googlecast._tcp.local.

@jstasiak jstasiak merged commit fe86566 into python-zeroconf:master May 26, 2020
@jstasiak

Copy link
Copy Markdown
Collaborator

This has just been released in 0.26.3, nice work everyone.

@jstasiak

Copy link
Copy Markdown
Collaborator

A reflection is that calling get_service_info() when getting the callback seems a bit fragile, any suggestion for how it can be used in a slightly more fault tolerant way?
(This is a question for possibly improving HA / pychromecast)

One way to make this better could be to pass ServiceInfo instance to the callbacks so that calling get_service_info() would become unnecessary (at least inside a callback).

@mattsaxon

Copy link
Copy Markdown
Collaborator

@emontnemery , regarding your reflection on fragility, can you elaborate?
All the examples use it this way, however I do agree that it would be more elegant to pass it into the callback directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zeroconf.get_service_info seems broken in 0.26.0

4 participants