PYTHON-853: delay timeout by mambocab · Pull Request #899 · apache/cassandra-python-driver

mambocab · 2018-01-04T21:32:07Z

Tries to solve the same issue as #857.

I'm a little wary of adding a new place where ._timer is set, but I think we're OK:

    def _on_speculative_execute(self):
        self._timer = None
        if not self._event.is_set():
            if self._time_remaining is not None:
                if self._time_remaining <= 0:
                    self._on_timeout()
                    return
            self.send_request(error_no_hosts=False)
            self._start_timer()

Here, if _on_timeout overwrites ._timer, that's fine -- we know ._timer is None anyway.

I was worried about deferring to the event loop and returning from _on_timeout without the work actually being done. However, _on_speculative_execute is only ever called from the event loop (see below) so I think it's ok to replace the timer and put the timeoout processing on the event loop, since nothing explicitly waits on the completion of _on_speculative_execute.

    def _start_timer(self):
        if self._timer is None:
            spec_delay = self._spec_execution_plan.next_execution(self._current_host)
            if spec_delay >= 0:
                if self._time_remaining is None or self._time_remaining > spec_delay:
                    self._timer = self.session.cluster.connection_class.create_timer(spec_delay, self._on_speculative_execute)
                    return
            if self._time_remaining is not None:
                self._timer = self.session.cluster.connection_class.create_timer(self._time_remaining, self._on_timeout)

Here, we're definitely ok. If we ever replace ._timer, it's at the end of the execution of the deferred _on_timeout or _on_speculative_execute call.

    def send_request(self, error_no_hosts=True):
        """ Internal """
        # query_plan is an iterator, so this will resume where we last left
        # off if send_request() is called multiple times
        for host in self.query_plan:
            req_id = self._query(host)
            if req_id is not None:
                self._req_id = req_id
                return True
            if self.timeout is not None and time.time() - self._start_time > self.timeout:
                self._on_timeout()
                return True

        if error_no_hosts:
            self._set_final_exception(NoHostAvailable(
                "Unable to complete the operation against any hosts", self._errors))
        return False

I believe that we're ok here as well. Since the call ends after _on_timeout is called, we can't lose any timers -- ._timer will only be set multiple times if _on_timeout reschedules itself multiple times, and that's fine.

I think this covers all the ways that _on_timeout is called -- it's always called by _send_request, sometimes via _on_speculative_execute. It also seems to cover all the ways that ._timer can be set to a new value.

beltran · 2018-01-17T20:18:33Z

+        # PYTHON-853: for short timeouts, we sometimes race with our __init__
+        if self._connection is None:
+            self._timer = self.session.cluster.connection_class.create_timer(0.01, self._on_timeout)
+


Should there be a return here?

Whoops, yes

I like this patch since it's really a solution. But I think it can get into a loop sometimes, for example, against only one C* node:

import time import logging log = logging.getLogger() log.setLevel('DEBUG') handler = logging.StreamHandler() handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")) log.addHandler(handler) from cassandra.cluster import Cluster cluster = Cluster() session = cluster.connect() print(cluster.metadata.all_hosts()) assert len(session._pools.values()) == 1 tuple(session._pools.values())[0].shutdown() # stuck here session.execute("SELECT * from system.local", timeout=0)

Shutting down the pool tuple(session._pools.values())[0].shutdown() like that is not very legit but it could happen before some execute. In this case self._connection == None and is never set, therefore self._on_timeout() is in a loop calling itself.

mambocab · 2018-01-17T21:04:07Z

(Responding to this comment)

Re: tuple(session._pools.values())[0].shutdown() -- Is there some other way that this can happen? I don't think we have to be worried about supporting in particular -- it uses the private _pools attribute, and our shutdown APIs explicitly say not to reuse shutdown stuff. However, if some kind of defuncting or something can cause a similar hang, we definitely want to fix that.

aboudreault · 2018-01-17T21:07:37Z

Not sure to understand the example playing with the pool manually.... but I do see the point of looping.

just brainstorming, if it doesn't make sense _on_timeout() is called before things are ready, shouldn't ResponseQuery._query set a particular events when mechanisms can start? Note that I'm not totally aware of all the context so only both of you know well this issue.

beltran · 2018-01-17T21:23:58Z

Yeah, the user shouldn't/won't call shutdown manually, the script calls it that way so it's deterministic and trying to simulate what would happen if it was called internally by the driver.
shutdown is called for example when the host goes down so it might happen before calling execute therefore resulting in the same hang as in the script.
Following on what Alan said maybe we could call ResponseFuture._start_timer after calling ResponseFuture.send_request, this way connection would be set and if it hasn't it means it won't ever be. What I don't like of this approach is that it would be quite a refactor for such a small and unlikely use case.

aboudreault · 2018-01-17T21:28:02Z

I agree with a such change for a use case like that yeah.

aboudreault · 2018-01-22T19:02:39Z

+            self._timer = self.session.cluster.connection_class.create_timer(
+                0.01,
+                partial(self._on_timeout, attempts=attempts + 1)
+            )


Since the looping issue is only valid for a particular case, I would tend to leave the main API untouched. Especially because the attempts parameter is only taken into account when connection is None... which is not clear with the function signature. What about something hidden like:

if self._connection is None: timer_threshold = getattr(self, '_timer_threshold', 0) self._timer_threshold = timer_threshold + 1 if self._timer_threshold <= 3: self._timer = self.session.cluster.connection_class.create_timer( 0.01, self._on_timeout ) return

That'd work just as well, though I prefer using the kwarg. I think the parameter makes it clear that the counter is only ever passed in. With the attribute soluion, it's not obvious that it's never set in some other method.

I'm not too worried about adding to the _on_timeout API since it's internal and since we can provide a default.

If I add a docstring/comment explaining attempts, would that address your concerns? We also could rename it to _attempts to make it clear that you should only use it in rare cases.

Can't say I agree with this, but that's fine. I just feel that attempts doesn't make sense to be an API parameter, even if it's internal. e.g. From an API perspective, what does it mean? I'm OK with merging anyway... at least we definitively avoid the rare potential looping issue with this. 👍

My thought was that this is a psuedo-recursive function now, so using a parameter would be a simple way to track depth. Thanks for the feedback, sorry to override it here -- I'll add a comment and merge when the tests look good.

beltran reviewed Jan 17, 2018

View reviewed changes

mambocab force-pushed the python-853_retry-implementation branch from 2070368 to 5082c6a Compare January 17, 2018 20:29

aboudreault reviewed Jan 22, 2018

View reviewed changes

mambocab force-pushed the python-853_retry-implementation branch from e93ce0b to 8611fd1 Compare January 22, 2018 20:33

PYTHON-853: delay timeout

3778312

mambocab force-pushed the python-853_retry-implementation branch from 8611fd1 to 3778312 Compare January 22, 2018 20:42

mambocab merged commit b7d2994 into master Jan 23, 2018

mambocab deleted the python-853_retry-implementation branch January 23, 2018 21:07

beltran mentioned this pull request May 8, 2018

Catch exception if conneciton is None #857

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PYTHON-853: delay timeout#899

PYTHON-853: delay timeout#899
mambocab merged 1 commit into
masterfrom
python-853_retry-implementation

mambocab commented Jan 4, 2018 •

edited

Loading

Uh oh!

beltran Jan 17, 2018

Uh oh!

mambocab Jan 17, 2018

Uh oh!

beltran Jan 17, 2018

Uh oh!

mambocab commented Jan 17, 2018

Uh oh!

aboudreault commented Jan 17, 2018

Uh oh!

beltran commented Jan 17, 2018

Uh oh!

aboudreault commented Jan 17, 2018

Uh oh!

aboudreault Jan 22, 2018 •

edited

Loading

Uh oh!

mambocab Jan 22, 2018

Uh oh!

aboudreault Jan 22, 2018

Uh oh!

mambocab Jan 22, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mambocab commented Jan 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

beltran Jan 17, 2018

Choose a reason for hiding this comment

Uh oh!

mambocab Jan 17, 2018

Choose a reason for hiding this comment

Uh oh!

beltran Jan 17, 2018

Choose a reason for hiding this comment

Uh oh!

mambocab commented Jan 17, 2018

Uh oh!

aboudreault commented Jan 17, 2018

Uh oh!

beltran commented Jan 17, 2018

Uh oh!

aboudreault commented Jan 17, 2018

Uh oh!

aboudreault Jan 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mambocab Jan 22, 2018

Choose a reason for hiding this comment

Uh oh!

aboudreault Jan 22, 2018

Choose a reason for hiding this comment

Uh oh!

mambocab Jan 22, 2018

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mambocab commented Jan 4, 2018 •

edited

Loading

aboudreault Jan 22, 2018 •

edited

Loading