Skip to content

PYTHON-853: delay timeout#899

Merged
mambocab merged 1 commit into
masterfrom
python-853_retry-implementation
Jan 23, 2018
Merged

PYTHON-853: delay timeout#899
mambocab merged 1 commit into
masterfrom
python-853_retry-implementation

Conversation

@mambocab

@mambocab mambocab commented Jan 4, 2018

Copy link
Copy Markdown
Contributor

Tries to solve the same issue as #857.

I'm a little wary of adding a new place where ._timer is set, but I think we're OK:

    def _on_speculative_execute(self):
        self._timer = None
        if not self._event.is_set():
            if self._time_remaining is not None:
                if self._time_remaining <= 0:
                    self._on_timeout()
                    return
            self.send_request(error_no_hosts=False)
            self._start_timer()

Here, if _on_timeout overwrites ._timer, that's fine -- we know ._timer is None anyway.

I was worried about deferring to the event loop and returning from _on_timeout without the work actually being done. However, _on_speculative_execute is only ever called from the event loop (see below) so I think it's ok to replace the timer and put the timeoout processing on the event loop, since nothing explicitly waits on the completion of _on_speculative_execute.

    def _start_timer(self):
        if self._timer is None:
            spec_delay = self._spec_execution_plan.next_execution(self._current_host)
            if spec_delay >= 0:
                if self._time_remaining is None or self._time_remaining > spec_delay:
                    self._timer = self.session.cluster.connection_class.create_timer(spec_delay, self._on_speculative_execute)
                    return
            if self._time_remaining is not None:
                self._timer = self.session.cluster.connection_class.create_timer(self._time_remaining, self._on_timeout)

Here, we're definitely ok. If we ever replace ._timer, it's at the end of the execution of the deferred _on_timeout or _on_speculative_execute call.

    def send_request(self, error_no_hosts=True):
        """ Internal """
        # query_plan is an iterator, so this will resume where we last left
        # off if send_request() is called multiple times
        for host in self.query_plan:
            req_id = self._query(host)
            if req_id is not None:
                self._req_id = req_id
                return True
            if self.timeout is not None and time.time() - self._start_time > self.timeout:
                self._on_timeout()
                return True

        if error_no_hosts:
            self._set_final_exception(NoHostAvailable(
                "Unable to complete the operation against any hosts", self._errors))
        return False

I believe that we're ok here as well. Since the call ends after _on_timeout is called, we can't lose any timers -- ._timer will only be set multiple times if _on_timeout reschedules itself multiple times, and that's fine.

I think this covers all the ways that _on_timeout is called -- it's always called by _send_request, sometimes via _on_speculative_execute. It also seems to cover all the ways that ._timer can be set to a new value.

Comment thread cassandra/cluster.py Outdated
# PYTHON-853: for short timeouts, we sometimes race with our __init__
if self._connection is None:
self._timer = self.session.cluster.connection_class.create_timer(0.01, self._on_timeout)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a return here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, yes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this patch since it's really a solution. But I think it can get into a loop sometimes, for example, against only one C* node:

import time
import logging

log = logging.getLogger()
log.setLevel('DEBUG')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
log.addHandler(handler)

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()
print(cluster.metadata.all_hosts())

assert len(session._pools.values()) == 1
tuple(session._pools.values())[0].shutdown()

# stuck here
session.execute("SELECT * from system.local", timeout=0)

Shutting down the pool tuple(session._pools.values())[0].shutdown() like that is not very legit but it could happen before some execute. In this case self._connection == None and is never set, therefore self._on_timeout() is in a loop calling itself.

@mambocab mambocab force-pushed the python-853_retry-implementation branch from 2070368 to 5082c6a Compare January 17, 2018 20:29
@mambocab

Copy link
Copy Markdown
Contributor Author

(Responding to this comment)

Re: tuple(session._pools.values())[0].shutdown() -- Is there some other way that this can happen? I don't think we have to be worried about supporting in particular -- it uses the private _pools attribute, and our shutdown APIs explicitly say not to reuse shutdown stuff. However, if some kind of defuncting or something can cause a similar hang, we definitely want to fix that.

@aboudreault

Copy link
Copy Markdown
Contributor

Not sure to understand the example playing with the pool manually.... but I do see the point of looping.

just brainstorming, if it doesn't make sense _on_timeout() is called before things are ready, shouldn't ResponseQuery._query set a particular events when mechanisms can start? Note that I'm not totally aware of all the context so only both of you know well this issue.

@beltran

beltran commented Jan 17, 2018

Copy link
Copy Markdown
Contributor

Yeah, the user shouldn't/won't call shutdown manually, the script calls it that way so it's deterministic and trying to simulate what would happen if it was called internally by the driver.
shutdown is called for example when the host goes down so it might happen before calling execute therefore resulting in the same hang as in the script.
Following on what Alan said maybe we could call ResponseFuture._start_timer after calling ResponseFuture.send_request, this way connection would be set and if it hasn't it means it won't ever be. What I don't like of this approach is that it would be quite a refactor for such a small and unlikely use case.

@aboudreault

Copy link
Copy Markdown
Contributor

I agree with a such change for a use case like that yeah.

Comment thread cassandra/cluster.py
self._timer = self.session.cluster.connection_class.create_timer(
0.01,
partial(self._on_timeout, attempts=attempts + 1)
)

@aboudreault aboudreault Jan 22, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the looping issue is only valid for a particular case, I would tend to leave the main API untouched. Especially because the attempts parameter is only taken into account when connection is None... which is not clear with the function signature. What about something hidden like:

if self._connection is None:
   timer_threshold = getattr(self, '_timer_threshold', 0)
   self._timer_threshold = timer_threshold + 1
   if self._timer_threshold <= 3:
       self._timer = self.session.cluster.connection_class.create_timer(
            0.01,
            self._on_timeout
       )
   return

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd work just as well, though I prefer using the kwarg. I think the parameter makes it clear that the counter is only ever passed in. With the attribute soluion, it's not obvious that it's never set in some other method.

I'm not too worried about adding to the _on_timeout API since it's internal and since we can provide a default.

If I add a docstring/comment explaining attempts, would that address your concerns? We also could rename it to _attempts to make it clear that you should only use it in rare cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't say I agree with this, but that's fine. I just feel that attempts doesn't make sense to be an API parameter, even if it's internal. e.g. From an API perspective, what does it mean? I'm OK with merging anyway... at least we definitively avoid the rare potential looping issue with this. 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that this is a psuedo-recursive function now, so using a parameter would be a simple way to track depth. Thanks for the feedback, sorry to override it here -- I'll add a comment and merge when the tests look good.

@mambocab mambocab force-pushed the python-853_retry-implementation branch from e93ce0b to 8611fd1 Compare January 22, 2018 20:33
@mambocab mambocab force-pushed the python-853_retry-implementation branch from 8611fd1 to 3778312 Compare January 22, 2018 20:42
@mambocab mambocab merged commit b7d2994 into master Jan 23, 2018
@mambocab mambocab deleted the python-853_retry-implementation branch January 23, 2018 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants