Implement dynamic sticky queue polling based on the reported backlog #1438
Conversation
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
I assume it is ok that this conditional is not atomic as a whole? Meaning, it's ok if, in a racy situation, this technically results in an unexpected sticky or normal poll when finishPoll is run by another thread at the same time? Otherwise, you'd be better off switching the atomic integers to regular ones and marking these methods synchronized (calls shouldn't be so frequent and time-sensitive here that synchronized would cause a real problem, IMO).
Yeah, I think it's ok here. The main argument is that the value of stickyBacklogSize is imprecise anyway. We don't control or verify the order in which responses come in, so the value we write into stickyBacklogSize may not even be the actual latest one. Given that, it's not worth fighting for precise decisions in this condition.
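For readers following along, here is a minimal, self-contained sketch of the logic being discussed, assuming a balancer shaped roughly like the one in this PR. Only stickyBacklogSize, stickyPollers, normalPollers, pollersCount, and finishPoll appear in the PR itself; makePoll, the constructor, and the local TaskQueueKind enum are illustrative stand-ins, not the actual merged code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the sticky/normal poll balancer under discussion.
final class StickyQueueBalancer {
  enum TaskQueueKind { STICKY, NORMAL }

  private final int pollersCount;
  private final AtomicInteger stickyPollers = new AtomicInteger(0);
  private final AtomicInteger normalPollers = new AtomicInteger(0);
  // Best-effort estimate; writes race with out-of-order poll responses.
  private volatile long stickyBacklogSize = 0;

  StickyQueueBalancer(int pollersCount) {
    this.pollersCount = pollersCount;
  }

  // Decides which queue the next poll should target. The condition is not
  // atomic as a whole: a concurrent finishPoll can change the counters
  // between the two reads, which is acceptable because stickyBacklogSize
  // is only an estimate anyway.
  TaskQueueKind makePoll() {
    if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
      stickyPollers.incrementAndGet();
      return TaskQueueKind.STICKY;
    }
    normalPollers.incrementAndGet();
    return TaskQueueKind.NORMAL;
  }

  // Called when a poll returns; sticky poll responses carry the server's
  // backlog hint for the sticky queue.
  void finishPoll(TaskQueueKind kind, long backlogHint) {
    if (kind == TaskQueueKind.STICKY) {
      stickyPollers.decrementAndGet();
      // Last-writer-wins: responses can arrive out of order, so this value
      // may briefly be stale. The decision logic tolerates that.
      stickyBacklogSize = backlogHint;
    } else {
      normalPollers.decrementAndGet();
    }
  }
}
```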
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers.get() <= normalPollers.get()) {
In Go, this is stickyBacklogSize > 0 instead of stickyBacklogSize > pollersCount. Is it intentional to check against the poller count here?
The comment above this statement is actually addressing exactly this.
// If pollersCount >= stickyBacklogSize > 0 we want to go back to a normal ratio to avoid a
// situation that too many pollers (all of them in the worst case) will open only sticky queue
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
So yeah, I decided to change the Go condition a little bit to reduce the probability of a worker listening on sticky only because of a tiny backlog of 1 or 2.
Ah, makes sense. So this still guarantees at least as many sticky pollers as normal pollers. In my head I'd rename pollersCount to stickyBacklogThreshold or something to clarify that it's not related to poll threads, but this works too.
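To make the effect of checking against pollersCount (rather than Go's > 0) concrete, a hypothetical run against the sketch above, with pollersCount = 5 and backlog numbers chosen purely for illustration:

```java
final class BalancerConditionDemo {
  public static void main(String[] args) {
    // Hypothetical scenario: 5 poll threads.
    StickyQueueBalancer balancer = new StickyQueueBalancer(5);

    // One sticky poll completes and reports a tiny backlog of 2.
    StickyQueueBalancer.TaskQueueKind first = balancer.makePoll(); // STICKY
    balancer.finishPoll(first, 2);

    // 2 <= pollersCount, so the first clause is false and the 1:1 ratio
    // check takes over: STICKY, NORMAL, STICKY, NORMAL. Under Go's `> 0`
    // condition, all four polls would have targeted the sticky queue.
    for (int i = 0; i < 4; i++) {
      System.out.println(balancer.makePoll());
    }

    // A later sticky poll reports a backlog of 12 > pollersCount: now every
    // free poller prefers the sticky queue until the estimate drops again.
    balancer.finishPoll(StickyQueueBalancer.TaskQueueKind.STICKY, 12);
    for (int i = 0; i < 4; i++) {
      System.out.println(balancer.makePoll()); // STICKY every time
    }
  }
}
```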
What was changed
This PR brings behavior similar to the Go SDK: it balances the number of sticky pollers based on the sticky queue backlog size reported by the server.
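As a rough sketch of how such a balancer plugs into polling, assuming the server reports the sticky backlog as a hint on the workflow task poll response. The types and names below (WorkflowTaskResponse, getBacklogCountHint, poll, dispatch) are assumptions for illustration, not the SDK's actual internals:

```java
// Hypothetical wiring of the balancer into a poller loop.
abstract class BalancedPoller {
  private final StickyQueueBalancer balancer;

  BalancedPoller(StickyQueueBalancer balancer) {
    this.balancer = balancer;
  }

  void pollLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      StickyQueueBalancer.TaskQueueKind kind = balancer.makePoll();
      WorkflowTaskResponse response = poll(kind); // long-poll the chosen queue
      // Feed the server-reported backlog back so the next decision can react.
      balancer.finishPoll(kind, response == null ? 0 : response.getBacklogCountHint());
      if (response != null) {
        dispatch(response);
      }
    }
  }

  abstract WorkflowTaskResponse poll(StickyQueueBalancer.TaskQueueKind kind);

  abstract void dispatch(WorkflowTaskResponse response);

  // Minimal stand-in for a poll response carrying the backlog hint.
  interface WorkflowTaskResponse {
    long getBacklogCountHint();
  }
}
```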
Why?
After looking at the behavior of workers in high-throughput scenarios, it's clear that the static sticky:normal ratio is prone to the following failure mode:
Under a high volume of Workflow Tasks, workers keep pulling tasks from the normal queue even when the sticky queue is already backlogged. The sticky queue then grows further, which leads to the expiration of sticky workflow tasks and workflow cache churn. This snowballs into more replays and decreasing worker throughput. A worker should pull less from the normal task queue if it can't keep up with its sticky queue.
Closes #998