Conversation
After fixing the pool drain I noticed that a particularly fast crawler of mine (courtesy of aiosonic :) was raising OSError about too many open files. Strange, I thought, since I `ulimit -n 16000` on that terminal and have a custom security.conf. It turns out some bad SSL servers leave the connection half-open, and the only way to close the zombie socket is to abort the transport. I tested this and it solves the problem, so I'll push the relevant commit with the fix shortly. Note that for Python 3.11 it requires python/cpython#104485. There's also a memory leak in Python 3.11+ regarding, again, SSL; see python/cpython#109534. The memory leak itself is not relevant to the commit, I just thought I'd share since it particularly limits long-running crawls.
sonic182
left a comment
Awesome work! Your crawler must be awesome!
I have commented on some possible edge cases, mostly around the `while True` loop that may block execution.
```python
if self.connection:
    loop = get_loop()
    loop.create_task(self.connection.release())
pass
```
I think you can just remove the whole method
Will do so. Thank you for your kindness.
```python
self.blocked = False
self.temp_key: Optional[str] = None
self.requests_count = 0
self.cycled = False
```
Can you explain this variable a bit?
I think I need to document this class better; it is a bit complex for the nature of the problem 😄
It's a safeguard against putting the same Connection object into the pool twice. I can test without it and see if it makes a difference.
```python
self.h2handler.cleanup()

else:
    while self.blocked:
```
Is this `__aexit__` step executed in a separate asyncio task? Doesn't this block?
The CI (GH Actions) on Linux is blocked; I guess it is because of this 😮
Maybe it's worth creating a separate task with `asyncio.create_task` that does this active wait (or having a "cleaner" task for it).
It does not block because of the nature of `asyncio.sleep`, which yields to other coroutines.
Sorry to hear about the CI issues.
We really can't easily have a task there, I think, because we need to catch `CancelledError`, which is raised by `wait_for` in `_do_request`. I'll see if I can do it in a more elegant manner. Thanks.
@sonic182 I think we can create a Future object that can be awaited in `__aexit__` if `self.blocked`, without the ugly busy-wait loop. Let's call this Future object the Connection waiter. A Future is just a promise of the resolution of a problem: it can be awaited, and it will yield to other tasks if not yet done, without the need for the also ugly `asyncio.sleep` polling. The beauty of this Future object is that its `set_result()` method can be called from synchronous code paths. So we await this waiter if the Connection is `self.blocked`, and we still get the chance to catch `CancelledError` either way when `RequestTimeout()` comes. I have a test crawl running right now that I would rather not interrupt, but I think I can hack up some code for tomorrow. What do you think?
I'm working on this small PR #454 to include socket read/write in the same connection class (this will be needed later when the Proxy is fully implemented). Maybe the try/finally inside the HTTPResponse class fixes your crawler, by ensuring the connection gets released after reading response data or on an error with it?
I'll definitely give it a try, thanks! EDIT: I see it's merged already, so I'll definitely be giving it a try today still.
Unfortunately I see ConnectionPoolAcquireTimeout very fast even with the latest.
You may need to use Semaphores to limit the concurrency in your crawler. I think you're starting a lot of calls very quickly and the pool size is not big enough for it; you may need to extend the timeouts too.
Please re-read the first message in this pull request and you'll realize the problem is not as you mentioned. EDIT: By the way, my pool size is 16000 clients, which I allow via a custom limits.conf and `ulimit -n 16000`. How could my pool be exhausting? And how does it not exhaust with the changes introduced in this PR?
As I stated previously, I have serious problems with the pool draining very quickly. I only tested with CyclicQueuePool.
I erroneously and naively tried to fix this with #452.
It turns out the problem wasn't just a one-liner. The problem starts with `create_task` in `__del__` inside the HttpResponse class. This task has no guarantee of executing, and as soon as `__del__` exits, the task is most likely garbage collected instantly.
So I discovered I could wait inside the Connection class `__aexit__` function. This alleviated the problem, but there was still fast pool drain.
I tried a little more and it became obvious that the `wait_for` call to the `_do_request` helper function was cancelling `__aexit__` of the Connection class. But I was able to catch `CancelledError` there, so there's a guarantee that we can close the socket by calling `close()` on the StreamWriter and finally put the Connection back into the CyclicQueuePool queue.
There was still pool drain, so I fixed two more situations inside the Connector class where exceptions happen and the Connection is not put back into the queue.
Now my crawler is running stably without pool drain! Thank you for the work!