race condition of socket usage #30
…his prevent the threads from using closed sockets.
Thank you for the contribution @veawor. Could you provide some code or a stack trace so I'd understand what the issue is?

Stack trace is as below:
When performing an info query via request(), a listener is started and a packet is formed. As the packet is formed, known answers are taken from the cache and placed into the packet. Then the packet is sent. The sender receives its own packet (via multicast loopback, I assume). At that point the listener is fired and the answers in the packet are propagated back to the object that started the request. This is a really long way around the barn. The PR queries the cache directly in request() and then calls update_record(). If all of the information is in the cache, then no packet is formed, sent, or received. This approach was taken because, for whatever reason, the reception of the packets on Windows via the loopback was proving to be unreliable. The method has the side benefit of being a whole lot faster. This PR also incorporates the join() calls from PR python-zeroconf#30. In addition, it moves the two join() calls in close() to their own thread because they can take quite a while to execute.
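The cache-first flow described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `InfoRequest`, `REQUIRED_KINDS`, the dictionary-based cache, and the `send_query` callback are all hypothetical stand-ins, and only the names `request()` and `update_record()` come from the description.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record kinds a service-info lookup needs (not the library's types).
REQUIRED_KINDS = ("SRV", "TXT", "A")

Cache = Dict[Tuple[str, str], str]


class InfoRequest:
    """Collects the records for one service name (illustrative only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, str] = {}

    def update_record(self, kind: str, value: str) -> None:
        self.records[kind] = value

    def is_complete(self) -> bool:
        return all(kind in self.records for kind in REQUIRED_KINDS)


def request(info: InfoRequest, cache: Cache,
            send_query: Callable[[str, List[str]], None]) -> bool:
    """Answer from the cache first; send a query only for what is missing."""
    # Feed cached answers straight to the requester, skipping the
    # send-and-receive-your-own-packet round trip.
    for kind in REQUIRED_KINDS:
        value = cache.get((info.name, kind))
        if value is not None:
            info.update_record(kind, value)

    if info.is_complete():
        return True  # nothing was formed, sent, or received

    missing = [kind for kind in REQUIRED_KINDS if kind not in info.records]
    send_query(info.name, missing)
    return False
```

With a fully populated cache, `send_query` is never called, which is where the speed-up mentioned above would come from.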
* Fix ability for a cache lookup to match properly

When querying for a service type, the response is processed. During the processing, an info lookup is performed. If the info is not found in the cache, then a query is sent. The trouble is that the info requested is present in the same packet that triggered the lookup, so a query is not necessary. But two problems caused the cache lookup to fail:

1) The info was not yet in the cache. The callback was fired before all answers in the packet were cached.

2) The test for a cache hit did not work, because the cache hit test uses a DNSEntry as the comparison object. But some of the objects in the cache are descendants of DNSEntry and have their own __eq__() defined, which accesses fields only present on the descendant. Thus the test can NEVER work, since the descendant's __eq__() will be used.

Also, continuing the theme of some other recent pull requests, add three _GLOBAL_DONE tests to avoid doing work after the attempted stop, and thus avoid generating (harmless, but annoying) exceptions during shutdown.

* Remove unnecessary packet send in ServiceInfo.request()

When performing an info query via request(), a listener is started and a packet is formed. As the packet is formed, known answers are taken from the cache and placed into the packet. Then the packet is sent. The sender receives its own packet (via multicast loopback, I assume). At that point the listener is fired and the answers in the packet are propagated back to the object that started the request. This is a really long way around the barn.

The PR queries the cache directly in request() and then calls update_record(). If all of the information is in the cache, then no packet is formed, sent, or received. This approach was taken because, for whatever reason, the reception of the packets on Windows via the loopback was proving to be unreliable. The method has the side benefit of being a whole lot faster.

This PR also incorporates the join() calls from PR #30. In addition, it moves the two join() calls in close() to their own thread because they can take quite a while to execute.

* Fix locking race condition in Engine.run()

This fixes a race condition in which the receive engine was waiting against its condition variable under a different lock than the one it used to determine if it needed to wait. This was causing the code to sometimes take 5 seconds to do anything useful.

When fixing the race condition, I decided to also fix the other correctness issues in the loop, which were likely causing the errors that led to the inclusion of the 'except Exception' catch-all. This in turn allowed the EBADF error raised by closing the socket during exit to be used to get out of the select in a timely manner.

Finally, this allowed reorganizing the shutdown code to shut down from the front to the back. That is to say, shut down the recv socket first, which then allows a clean join with the engine thread. After the engine thread exits, most everything else is inert, as all callbacks have been unwound.

* Remove a now invalid test case

With the restructure of shutdown, Listener() now needs to throw EBADF on a closed socket to allow a timely and graceful shutdown.

* Shutdown the service listeners in an organized fashion

Also adds names to the various threads to make debugging easier.

* Improve test coverage

Add more needed shutdown cleanup found via additional test coverage. Force timeout calculation from milli to seconds to use floating point.

* init ServiceInfo._properties

* Add query support and test case for _services._dns-sd._udp.local.

* pep8 cleanup

* Add testcase and fixes for HInfo Record Generation

The DNSHInfo packet generation code was broken. There was no test case for that functionality, and adding a test case showed four issues. Two were related to PY3 strings, one was a mistyped reference to an attribute, and finally the two fields present in the HInfo record were using the wrong encoding, which is what necessitated the change from write_string() to write_character_string().
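The second cache-lookup problem from the first commit above (a base-class probe that can never equal a subclass instance) can be reproduced in isolation. The sketch below uses hypothetical `Entry` and `AddressRecord` classes as stand-ins for DNSEntry and one of its descendants. The subclass `__eq__()` here simply rejects non-subclass operands, which is an assumption about the shape of the bug rather than the library's actual code, and `find_entry()` is one possible workaround, not necessarily the fix that was merged.

```python
class Entry:
    """Hypothetical base class, standing in for DNSEntry."""

    def __init__(self, name, type_):
        self.name = name
        self.type = type_

    def __eq__(self, other):
        return (isinstance(other, Entry)
                and self.name == other.name
                and self.type == other.type)


class AddressRecord(Entry):
    """Hypothetical descendant that compares a field the base lacks."""

    def __init__(self, name, address):
        super().__init__(name, "A")
        self.address = address

    def __eq__(self, other):
        # A plain Entry has no .address, so a base-class probe never matches.
        return (isinstance(other, AddressRecord)
                and self.address == other.address
                and Entry.__eq__(self, other))


def find_entry(records, probe):
    """Match on the base-class fields only, bypassing subclass __eq__()."""
    for record in records:
        if Entry.__eq__(record, probe):
            return record
    return None


cached = [AddressRecord("host.local.", "10.0.0.1")]
probe = Entry("host.local.", "A")

# Python prefers the subclass's __eq__() whichever side the subclass
# instance is on, so the naive membership test cannot succeed.
naive_hit = probe in cached
explicit_hit = find_entry(cached, probe)
```

Calling `Entry.__eq__()` explicitly compares only the name and type, which is what a cache-hit test keyed on a base-class probe presumably intends.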
Thanks so much for the pull request. I made several improvements to ZC() shutdown, including incorporating these changes, before I had access to this repo. As of now I believe the functionality from this PR has been incorporated, and I unfortunately have no good way to merge the PR to give you credit for the change, so I am going to close it. If you think this is an error, or an otherwise bad idea, please comment back. Thanks again for your help.
handler(self.zc) in ServiceBrowser.run() and select.select() in Engine.run() sometimes throw socket.error when ServiceBrowser.cancel() and Zeroconf.close() are invoked from another thread. I've added join() calls to wait for the threads to terminate before closing the socket. I hope this commit makes sense.
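The ordering described here (signal the thread, join it, and only then close the socket) might look like the sketch below. `Receiver` is a hypothetical class written for illustration and is not taken from the PR's diff; it assumes a worker that polls with a short `select.select()` timeout so it notices the stop flag promptly.

```python
import select
import socket
import threading
from typing import List


class Receiver:
    """Hypothetical worker that owns a UDP socket and a polling thread."""

    def __init__(self) -> None:
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(("127.0.0.1", 0))
        self.errors: List[BaseException] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="receiver")
        self._thread.start()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                # Short timeout so the stop flag is re-checked regularly.
                readable, _, _ = select.select([self.sock], [], [], 0.05)
                for sock in readable:
                    sock.recvfrom(8192)
            except (OSError, ValueError) as exc:
                # This is the failure mode if the socket were closed
                # while the thread was still inside select().
                self.errors.append(exc)
                return

    def close(self) -> None:
        self._stop.set()      # 1. ask the thread to finish
        self._thread.join()   # 2. wait until it no longer touches the socket
        self.sock.close()     # 3. only now is closing safe


receiver = Receiver()
receiver.close()
```

The trade-off is that `close()` blocks for up to one polling interval. A later commit in this thread takes the opposite approach, closing the receive socket first and treating the resulting EBADF as the wake-up signal.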