Conversation
After fixing the pool drain I noticed that a particularly fast crawler of mine (courtesy of aiosonic :) was raising OSError about too many open files. Strange, I thought, since I `ulimit -n 16000` on that terminal and have a custom security.conf. It turns out some bad SSL servers leave the connection half-open, and the only way to close the zombie socket is to abort the transport. I tested this and it solves the problem, so I'll push the relevant commit with the fix shortly. Note that for Python 3.11 it requires python/cpython#104485. There's also a memory leak in Python 3.11+ regarding, again, SSL; see python/cpython#109534. The memory leak itself is not relevant to the commit, I just thought I'd share since it particularly limits long-running crawls.
sonic182
left a comment
Awesome work! Your crawler must be awesome!
I have commented on some possible edge cases, mostly around the `while True` loop that may block execution.
```python
if self.connection:
    loop = get_loop()
    loop.create_task(self.connection.release())
pass
```
I think you can just remove the whole method
Will do so. Thank you for your kindness.
```python
self.blocked = False
self.temp_key: Optional[str] = None
self.requests_count = 0
self.cycled = False
```
Can you explain this variable a bit?
I think I need to document this class better; it is a bit complex for the nature of the problem 😄
It's a safeguard against putting the same Connection object into the pool twice. I can test without it and see if it makes a difference.
```python
self.h2handler.cleanup()

else:
    while self.blocked:
```
Is this `__aexit__` step executed in a separate asyncio task? Doesn't this block?
The CI (GH Actions) on Linux is blocked; I guess it is because of this 😮
Maybe it's worth creating a separate task with `asyncio.create_task` that does this active wait (or having a "cleaner" task for it).
It does not block because of the nature of `asyncio.sleep`, which yields to other coroutines.
Sorry to hear about the CI issues.
We really can't easily have a task there, I think, because we need to catch `CancelledError`, which is raised by `wait_for` in `_do_request`. I'll see if I can do it in a more elegant manner. Thanks.
@sonic182 I think we can create a Future object that can be awaited in `__aexit__` if `self.blocked`, without the ugly busy-wait loop. Let's call this Future object the Connection waiter. A Future is just a promise of the resolution of a problem: it can be awaited, and it will yield to other tasks if not yet done, without the need for the also ugly `asyncio.sleep` polling. The beauty of this Future object is that its `set_result()` method can be called from synchronous code paths. So we await this waiter if the Connection is `self.blocked`, and we still get the chance to catch `CancelledError` either way when `RequestTimeout()` comes. I have a test crawl running right now that I would rather not interrupt, but I think I can hack up some code for tomorrow. What do you think?
I'm working on this small PR #454 to include socket read/write in the same connection class (this will be needed later when the Proxy is fully implemented). Maybe the try/finally inside the HTTPResponse class fixes your crawler, by ensuring the connection gets released after reading response data or on an error with it?
I'll definitely give it a try, thanks! EDIT: I see it's merged already, so I'll definitely be giving it a try today still.
Unfortunately I see ConnectionPoolAcquireTimeout very fast even with the latest.
You may need to use Semaphores to limit the concurrency in your crawler. I think you're starting a lot of calls very quickly and the pool size is not big enough for it; you may need to extend the timeouts too.
Please re-read the first message in this pull request and you'll realize the problem is not as you mentioned. EDIT: By the way, my pool size is 16000 clients, which I allow via a custom limits.conf and `ulimit -n 16000`. How could my pool be exhausting? And how does it not exhaust with the changes introduced in this PR?
As I stated previously, I have serious problems with the pool draining very quickly. I only tested with CyclicQueuePool.
I erroneously and naively tried to fix this with #452.
It turns out the problem wasn't just a one-liner. The problem starts with `create_task` in `__del__` inside the HttpResponse class. This task has no guarantee of executing, and as soon as `__del__` exits, the task is most likely garbage collected instantly.
So I discovered I could wait inside the Connection class `__aexit__` function. This alleviated the problem, but there was still fast pool drain.
I tried a little more and it became obvious that the `wait_for` call to the `_do_request` helper function was cancelling `__aexit__` of the Connection class. But I was able to catch `CancelledError` there, so there's a guarantee that we can close the socket by calling `close()` on the StreamWriter and finally put the Connection back into the CyclicQueuePool queue.
There was still pool drain, so I fixed two more situations inside the Connector class where exceptions happen and the Connection is not put back into the queue.
Now my crawler is running stably without pool drain! Thank you for the work!