Skip to content

Alternative solution for PYTHON-836#891

Merged
beltran merged 1 commit into
masterfrom
python-836-2
Jan 10, 2018
Merged

Alternative solution for PYTHON-836#891
beltran merged 1 commit into
masterfrom
python-836-2

Conversation

@beltran
Copy link
Copy Markdown
Contributor

@beltran beltran commented Dec 21, 2017

This is a different solution #873 and one that hopefully works better. The problem with the previous one was that for example if we are connected to one host cluster and the scheduled speculative execution happens before the regular attempt from the main thread (this is likely to happen since it's scheduled in 0 seconds). The _query_plan from ResponseFuture will have already been exhausted by the speculative executions and the attempt from the main thread will fail with NoHostAvailable.
In this PR _on_speculative_execute schedules itself again if the attempt from the main thread hasn't yet happened

Comment thread cassandra/cluster.py Outdated
# We reschedule this call until the main thread has succeeded
# making a query
if not self.attempted_hosts:
self.session.cluster.connection_class.create_timer(0.01, self._on_speculative_execute)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what delay to set here, we have some time.sleep(0.1) in other parts so I figured it wouldn't be that bad.

@beltran
Copy link
Copy Markdown
Contributor Author

beltran commented Dec 21, 2017

I've run this branch here a few times without failure, which gives me some confidence that the test won't fail.

@mambocab
Copy link
Copy Markdown
Contributor

mambocab commented Jan 2, 2018

So it looks like this build failed with the delay set to 0.001, which implies to me that this basically depends on the .01 sleep providing a wide enough window to work. However, the code itself doesn't look like it could work that way, so maybe I'm wrong. Do you understand what's going on there?

I don't know what to think about the delay-based solution as a whole. This essentially probabilistically adds a .01 delay to speculative executions with delay < .01, yeah?

Some ideas that are floating through my head on this:

  • Could we try a lock or event-based solution?
  • Would it break request behavior to break off part of the query plan as "hosts we won't call from spec exec"? I'm guessing it's a bad idea, but I wanna get some ideas going.
  • Could we just allow spec exec to consume the query plan and have a fallback query plan for the main execution?
  • This line of thinking makes me wonder how valuable the distinction between speculative executions and non-speculative executions. I know we have to implement it as it exists now, but it's something to think about.

@beltran
Copy link
Copy Markdown
Contributor Author

beltran commented Jan 9, 2018

That build failed because the test wasn't correct, the patch was being done for:

ResponseFuture._on_speculative_execute = patch_and_count(ResponseFuture._on_speculative_execute)

but with this patch _on_speculative_execute could be called a different number of times on each time ( since it's rescheduled by itself), therefore the test now patches send_request,

ResponseFuture.send_request = patch_and_count(ResponseFuture.send_request)

The reason I changed 0.001 to 0.1 is because I realized the function was rescheduled a lot of times sometimes until the main thread finally sent the main request and the speculative query was sent after that. This is a something bad of this patch, figuring out how much to wait.

I don't know what to think about the delay-based solution as a whole. This essentially probabilistically adds a .01 delay to speculative executions with delay < .01, yeah?

Yes

Could we try a lock or event-based solution?

What I don't like about this is that we would lock the event loop thread since it would have to wait for the main thread. Also I don't like creating another Lock object in such a hot path just for this particular case. I think something good of this implementation is that performance stays the same for the majority of the cases.

Would it break request behavior to break off part of the query plan as "hosts we won't call from spec exec"? I'm guessing it's a bad idea, but I wanna get some ideas going.

I'm not sure about how this would work, if the query_plan was a list we could save the first element for the main thread and use the rest of the speculative execution, but it's an iterator so I don't see an easy way to do this. Also there may be execution plans with only one host.

Could we just allow spec exec to consume the query plan and have a fallback query plan for the main execution?

We could, this could however change the behavior depending on the order.
For example if the execution plan is only one node and a query was sent for the speculative execution to that node (exhausting that query plan), then we would send a new query from the main thread using the fallback query plan, sending two queries in total.
But if the original execution plan was exhausted by the main thread we wouldn't send any speculative execution since there would be no more hosts to query.

This line of thinking makes me wonder how valuable the distinction between speculative executions and non-speculative executions. I know we have to implement it as it exists now, but it's something to think about.

👍

I understand why this patch doesn't look too good. But a good thing about it is that it doesn't change the behavior for non speculative execution and adds just one comparison (if not self.attempted_hosts:) for almost the speculative ones. I don't like as well to reschedule something to be called later instead of waiting with a lock but I think for this particular case it's better. WDYT?

@mambocab
Copy link
Copy Markdown
Contributor

mambocab commented Jan 9, 2018

That build failed because the test wasn't correct

Got it, thanks for clearing that up.

... I think for this particular case it's better.

Yeah, you've convinced me -- thanks for having a look and explaining the reasoning. It's easy to say "depending on a hardcoded rescheduling time is bad", but yeah, I agree -- given the way normal and speculative executions are setup, this is better than theoretical blocking or query-plan-altering fixes.

So, +1 to this fix 👍

@beltran beltran merged commit 8ac3fbb into master Jan 10, 2018
@beltran
Copy link
Copy Markdown
Contributor Author

beltran commented Jan 10, 2018

Awesome, thanks for the review!

@beltran beltran deleted the python-836-2 branch January 10, 2018 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants