
Making timeout_after/ignore_after truly composable #82

Closed
njsmith opened this issue Oct 14, 2016 · 26 comments
Comments

njsmith (Contributor) commented Oct 14, 2016

As you know I'm a big fan of timeout_after. The awesome thing about it is that it's composable, in the sense that I can treat any random coroutine as a black box and slap a timeout_after around it to impose a timeout on it, so long as the called coroutine follows some simple rules. Basically, it should be written so that when my timeout fires, it promptly cleans up and exits. Not an onerous constraint.

Except... it's hard to follow this rule in coroutines that themselves want to use timeout_after. The docs allude to some of this; here's an example where everything goes wrong because the internal coroutine can't tell whether it's the internal timeout that fired, or the outer one:

  async def get_first_responding(urls):
      for url in urls:
          try:
              # per-URL timeout
              async with timeout_after(5):
                  return (await curio_http.get(url)).text
          except TaskTimeout:
              print("Too slow! Trying the next")
      raise RuntimeError("everyone's too slow!")

  async def main():
      # Global timeout:
      async with timeout_after(10):
          print(await get_first_responding([...]))

I suggest that the key missing feature is that when we see a timeout exception, we need to be able to figure out whether it's one that means that we have timed out and should clean up and let our caller get on with things, versus one that means a thing we called has timed out and we should recover however makes sense (e.g. by falling back to trying another server). Are we inside the timeout_after that fired, or outside?

 outer code       <--- Are we here?
   timeout_after
     inner code       <-- or here?

Proposal: arrange for the inner code to see a different exception type than the outer code.

Implementation: give each timeout_after a magic cookie, and when a timeout fires, inject an instance of the "inner" exception type with the magic cookie attached. When an exception passes through a timeout_after, check if the cookies match, and if so, convert to the "outer" exception type. (And ignore_after should check the cookie, and swallow the exception iff it matches.)
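A synchronous sketch of that cookie mechanism might look like this. None of these names are curio API; timeout_block stands in for timeout_after, and the timer itself is simulated by raising InnerTimeout by hand:

```python
# Sketch of the "magic cookie" proposal, with a synchronous context
# manager standing in for timeout_after. All names are placeholders.

class InnerTimeout(Exception):
    """Seen by code *inside* the timeout_after whose timer fired."""
    def __init__(self, cookie):
        self.cookie = cookie

class OuterTimeout(Exception):
    """Seen by code *outside* the timeout_after whose timer fired."""

class timeout_block:
    _next_cookie = 0

    def __init__(self):
        type(self)._next_cookie += 1
        self.cookie = type(self)._next_cookie

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # If the propagating timeout carries our cookie, we are the block
        # whose timer fired: convert "inner" to "outer" for our caller.
        if isinstance(exc, InnerTimeout) and exc.cookie == self.cookie:
            raise OuterTimeout() from exc
        return False  # any other exception keeps propagating unchanged
```

With this, nested code sees InnerTimeout pass straight through blocks whose cookies don't match, and exactly one boundary performs the conversion.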

When there are stacked timeouts that fire simultaneously, e.g.

  async with timeout_after(5):
      # -- mark --
      async with timeout_after(10):
          ...

then it should always be the outermost timeout whose cookie is used (so the line at -- mark -- should see an "inner" exception, not an "outer" one).

Unsolved problem: I am drawing a blank on what to call these two exception types to distinguish them :-(. OperationTimeOut versus CancelledTimeOut? [Edit: or maybe ExternalTimeout / InternalTimeout? CallerTimeout / CalleeTimeout? OutsideTimeout / InsideTimeout? LocalTimeout?]

(Speaking of timeout exception names, why is it called TaskTimeout? It doesn't really have anything to do with Tasks in particular, does it? I mean, any more so than literally everything else in curio?)

dabeaz (Owner) commented Oct 14, 2016

Hmmm. I'm going to have to wrap my brain around this whole scenario, but I think I see where you're going here... maybe.

Regarding the name "TaskTimeout".... Python already defines a built-in exception TimeoutError. I didn't want to use that because having a Curio task time out is a pretty specific thing. Also, there are parts of curio that run external functions in threads/processes. If one of those were to raise a builtin TimeoutError, I'd want to be able to distinguish that case from curio timeouts. So, having separate exceptions seems like the more sane approach. I didn't want to define curio.TimeoutError to avoid name confusion with the builtin. So, TaskTimeout was it (for lack of an obvious alternative).

njsmith (Contributor) commented Oct 14, 2016

Having slept on it, I think I have a slightly better idea for how to organize the exceptions...

For the "inner" code, the timeout is a signal imposed from outside saying "you have to stop now", and the only correct way to handle it is to clean up and return ASAP.

This is exactly the same rule that code should use when it gets cancelled. Basically, from the point of view of the code that has the timeout imposed on it, a timeout is a cancellation, just a particular kind.

So probably it should be possible to distinguish between "you got timed out" and "you got manually canceled", but the default should be to handle them the same. So let's enhance CancelledError with a reason field ("timeout" vs "task-canceled"), or have a TimeoutCancelledError subclass, or whatever, and use that for the "inner" exception, and most code will just look for CancelledError.

And then the exception that gets raised from timeout_after to indicate that the block timed out can just be TimeoutError or whatever. (I see what you mean about the name confusion. See also.... Not sure that randomly throwing in the word Task helps exactly but not sure I have a better idea either. CurioTimeout?)
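The hierarchy being proposed could be sketched like this. The class names and the `reason` field are illustrative placeholders, not curio's actual API (curio has its own CancelledError; this stand-in just shows the catching behavior):

```python
class CancelledError(Exception):
    """Stand-in for curio's CancelledError: 'stop now, clean up, exit'."""
    def __init__(self, reason="task-canceled"):
        self.reason = reason

class TimeoutCancellationError(CancelledError):
    """A cancellation whose reason is an expired timeout. Code that does
    not care about the reason just catches CancelledError as usual."""
    def __init__(self):
        super().__init__(reason="timeout")
```

The point of the subclassing is that a generic `except CancelledError:` cleanup handler keeps working unchanged, while code that does care can inspect the more specific type or reason.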

dabeaz (Owner) commented Oct 16, 2016

One general concern on nesting is just what happens when you go beyond 2 levels. The two-exception idea definitely gets crazy then.

The idea of making the timeout error subclass CancelledError seems reasonable though. That could probably work. Let me sleep on it.

njsmith (Contributor) commented Oct 16, 2016

beyond 2 levels

No, it's totally fine I think? [btw probably no need for ableist language there?] This is why you need the "cookies" I was talking about: so that for a given exception you can tell which timeout_after triggered it, because that's the timeout_after that marks the boundary between the "inside" and the "outside" for that exception. But given some stack like this:

  code1
  async with timeout_after(...):  # timeout A
      code2
      async with timeout_after(...):  # timeout B
          code3
          async with timeout_after(...):  # timeout C
              code4

If timeout A fires, then code4->code3->code2 see CancelledError and code1 sees TimeoutError-or-whatever-its-called. If timeout B fires, then code4->code3 see CancelledError and code2->code1 see TimeoutError-OWIC. If timeout C fires, then code4 sees CancelledError and code3->code2->code1 see TimeoutError-OWIC.

Hmm, I guess one possible source of confusion here is that in the current code, I think the code jumps through some hoops so that if A's deadline is before B's, then B's deadline gets adjusted earlier so that B "has a chance" to fire? I think it's simpler to think of it as, each timeout_after has its own independent timeout, and if A's happens to come before B's, then oh well, A's fires first and B gets unwound out of before it gets a chance to fire. Of course internally this requires some sort of data structure to keep track of which deadline is soonest, but conceptually we can think of it as independent timers. Does that help?
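The "data structure to keep track of which deadline is soonest" could be as simple as a heap of (deadline, block) entries with lazy deletion when a block exits cleanly. A toy sketch (not curio internals; all names hypothetical):

```python
import heapq

class DeadlineTracker:
    """Toy tracker for independent per-block deadlines: arm a deadline
    when entering a timeout_after, disarm it on clean exit, and ask
    which armed deadline expires soonest."""

    def __init__(self):
        self._heap = []       # (deadline, block_id) pairs
        self._done = set()    # blocks that exited before firing

    def arm(self, deadline, block_id):
        heapq.heappush(self._heap, (deadline, block_id))

    def disarm(self, block_id):
        self._done.add(block_id)  # lazy removal; purged in soonest()

    def soonest(self):
        # Skip entries whose block already exited cleanly.
        while self._heap and self._heap[0][1] in self._done:
            heapq.heappop(self._heap)
        return self._heap[0] if self._heap else None
```

This matches the "independent timers" mental model: each block's deadline is tracked separately, and the kernel only ever needs the soonest one.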

njsmith (Contributor) commented Oct 16, 2016

...Though on further thought actually, that does leave a tricky question about what to do with code like:

async with timeout_after(5):
    async with timeout_after(10):
        try:
            await curio.sleep(100)  # after 5 seconds, the outer timeout expires
        finally:
            # while unwinding the outer timeout, we yield and the *inner* timeout fires
            await curio.sleep(100)

This could get nasty, because in the naive implementation where all timeouts fire when ready, then the inner block's exception could fire second and end up overwriting the outer block's exception. Which would be bad -- the whole point of timeout_after is that it's supposed to be able to impose a timeout on a whole block, so once the outer timeout_after has fired then that should take priority over whatever internal timeouts the block might contain.

I think this means that the right semantics are: whenever an "outer" timeout_after fires, we disarm all the timeout_after blocks inside it. So in a case like the above, the inner timeout_after is effectively a no-op, and this is obvious from the start. But this also can arise in cases where multiple timeouts fire simultaneously:

async with timeout_after(10):
    async with timeout_after(5):
        # note: *not* curio.sleep, simulating a coroutine that doesn't enter the kernel for a while
        time.sleep(20)
        # now when we enter the kernel and trigger a timeout check, both timeouts have expired
        await curio.sleep(1)

Here it was possible in principle for the inner timeout to fire first, but as things turned out they fired simultaneously, so the outer one should override the inner one.

Summing up, I think something like this gives the desired semantics. One nice thing is that it turns out that because we have ruled out inner timeouts firing after outer timeouts, we don't need to tag exceptions with unique cookies after all -- a simple nesting depth suffices:

# logic somewhere inside the kernel used for firing timeouts
if task._deadline_stack[-1] is not None and now > task._deadline_stack[-1]:
    # some timeout is firing, but which? Pop every expired deadline and
    # count how deeply nested the firing timeout_after is.
    nesting_depth = 0
    while (task._deadline_stack
           and task._deadline_stack[-1] is not None
           and now > task._deadline_stack[-1]):
        nesting_depth += 1
        task._deadline_stack.pop()
    exc = CancelledDueToTimeout()
    exc._nesting_depth = nesting_depth
    coro.throw(exc)

class timeout_after:
    async def __aenter__(self):
        task = await curio.current_task()
        deadline_stack = task._deadline_stack
        # hack: if our deadline is later than the enclosing one, it can never fire, but it's
        # simplest if we always push something onto the stack. So we duplicate the topmost
        # deadline, and then if/when we fire then the timeout below us will also fire at the
        # same time, and override us. (A None entry means timeouts are temporarily disabled.)
        deadline = self.deadline
        if (deadline is not None
                and deadline_stack
                and deadline_stack[-1] is not None
                and deadline_stack[-1] < deadline):
            deadline = deadline_stack[-1]
        deadline_stack.append(deadline)

    async def __aexit__(self, type, exc, traceback):
        task = await curio.current_task()
        if exc is None:
            task._deadline_stack.pop()
            return
        elif isinstance(exc, CancelledDueToTimeout):
            # (the kernel already popped our stack entry when it fired)
            exc._nesting_depth -= 1
            if exc._nesting_depth == 0:
                # we're the timeout_after that triggered this cancellation, and the code we
                # were wrapping has all been canceled. So convert to a timeout exception.
                raise CurioTimeoutError from exc
        else:
            # some other exception is propagating; pop our entry and let it go
            task._deadline_stack.pop()
        # otherwise, let the exception continue propagating (possibly with reduced _nesting_depth)

[2016-10-16T02:34: made some edits to the code at the bottom to properly pop the deadline stack on clean exits and to support using the None deadline as a way to temporarily disable timeouts]

dabeaz (Owner) commented Oct 16, 2016

An idea: What if the TaskTimeout exception included a boolean flag to indicate whether or not the specified time period has actually expired? For example:

try:
    async with timeout_after(5):
        await whatever()
except TaskTimeout as e:
    if e.expired:
        # 5-second time period expired
        ...
    else:
        # Some other kind of timeout occurred (an outer timeout perhaps)
        raise

Making something like this work would be a whole lot easier than managing two different exception types, an internal stack of timeouts, or anything too complicated.

Code could obviously ignore the expired flag and bail out for any kind of timeout. However, if you needed to know if the inner timeout expired and not any other, this would tell you (if the inner timeout was not expired, it means the timeout is due to an outer timeout of some kind).

njsmith (Contributor) commented Oct 16, 2016

So this is a flag on the exception that would have a different value depending on whether you catch it inside or outside the triggering timeout_after block? How are you planning to arrange that? Or am I misunderstanding the proposal?

dabeaz (Owner) commented Oct 16, 2016

Every timeout in Curio is initiated by the timeout_after() function. I'm proposing that any TaskTimeout exception that propagates out of it carry an expired flag indicating whether or not the timeout period given to timeout_after() has expired.

One of my general concerns is that timeout handling is already quite tricky. I'm somewhat hesitant to make it even more tricky. However, putting a simple boolean flag on the exception to indicate whether or not the specified time interval has expired seems like it would be rather straightforward. It would also solve the initial problem of knowing if an inner timeout has expired or not.

dabeaz (Owner) commented Oct 17, 2016

I just looked at the code and adding this expired flag turned out to be far easier than I thought (involving just a couple of lines of code). I've added it and pushed the changes. Maybe worth experimenting with.

njsmith (Contributor) commented Oct 17, 2016

Oh, I see, so the flag tells you whether you are outside at least one expired timeout_after? Right, okay, that is easy to compute.

But unfortunately, the semantics I want are that I want to know whether I am inside at least one expired timeout_after :-). Because that's what I need to know in order to decide whether I should be retrying the timed out operation inside me, or just cleaning up and giving up.

code1
async with timeout_after(5):
    code2
    async with timeout_after(10):
        code3

Your flag lets me distinguish the regions {code1+code2} versus {code3}, right? But what we have here is that {code2+code3} have timed out and need to exit, but code1 has not.

It might be helpful to go back to my example at the very top of the thread and think about how you would make it work? I don't see how your flag makes it possible to fix that code, but I could be wrong.

dabeaz (Owner) commented Oct 17, 2016

Wouldn't the first example turn into this?

async def get_first_responding(urls):
    for url in urls:
        try:
            # per-URL timeout
            async with timeout_after(5):
                return (await curio_http.get(url)).text
        except TaskTimeout as e:
            if e.expired:
                print("Too slow! Trying the next")
            else:
                # Outer timeout must have fired. Everyone too slow
                raise

async def main():
    # Global timeout:
    async with timeout_after(10):
        print(await get_first_responding([...]))

Key thing: If you're getting a timeout and the timeout period has not expired, the timeout must have originated elsewhere. The only elsewhere (in this case) is from the outer timeout.

dabeaz (Owner) commented Oct 17, 2016

One note: each timeout_after() is guaranteed to generate one TaskTimeout--even if nested and even if the timeouts occur at the same time. It's possible that both the inner and outer timeout would expire at the same time in this code. However, that would generate two exceptions. The first would cause a cycle back to the next URL. The second timeout would occur almost instantly after that and cause the except block to bail out to the outer function (since the expired flag would now be False).

njsmith (Contributor) commented Oct 17, 2016

Ah, right, yeah -- when exactly one timeout has fired, then "outside exactly one timeout_after" and "inside exactly one timeout_after" are just negations of each other. A few messages back I actually started to take advantage of this to simplify things a lot too :-). But then realized that the two-timeouts-fired-together case can happen:

Here it was possible in principle for the inner timeout to fire first, but as things turned out they fired simultaneously, so the outer one should override the inner one.

...ah, github just live-updated with your followup and I see you saying what I was just working out myself about the cycling back and re-firing :-). Okay, I think I understand.

On first impression I'm not a big fan, for two reasons:

  • it seems like it's asking users to understand and reason about some highly complicated and non-obvious control flow because we want to keep curio's own control flow simpler. But this is the reverse of the general rule of thumb for what libraries should do. (At least when the complexity is unavoidable and can be hidden behind a non-leaky API, both of which I think are plausibly true here?)
  • it sets up a footgun where if user code ever catches-and-handles (i.e. doesn't re-raise) a TaskTimeout without remembering to check for e.expired, then this code has a latent bug that won't show up until someone tries to wrap a timeout around it. I.e. this is one of the best kinds of bug: the buggy code is invisible, it only breaks later due to some non-local effect, and since these are timeouts then as a bonus the breakage is non-deterministic.

But I'll ponder and see if I can reconcile myself to it...

njsmith (Contributor) commented Oct 17, 2016

Hmm, actually, I guess my second objection could be fixed by combining your check-for-expiration-en-passant approach with my two-exception-types proposal. So you'd have a CancelledError that kills the HTTP GET, it hits the inner timeout_after, sees that it's expired, and (instead of setting a flag) gets converted to TaskTimeout, the TaskTimeout gets handled, we retry the GET, and then the CancelledError refires, and this time it manages to escape out to the outer timeout_after.

I'm still uncomfortable with the idea that adding a timeout_after inside a function can-and-must delay when it gets cancelled though... trying to reason through and check flow control paths already accounts for at least 50%+ of the time I spend staring at async/await code :-).

dabeaz (Owner) commented Oct 17, 2016

Just a note that various "footgun" problems with timeouts are addressed in the docs. https://curio.readthedocs.io/en/latest/devel.html#timeouts

I'll need to ponder this more though....

dabeaz (Owner) commented Oct 17, 2016

After sleeping on it, I think I'm coming around to the two-exception idea. Will look at it later today.

dabeaz (Owner) commented Oct 17, 2016

I'm probably going to regret ever suggesting this, but I'm now thinking that three exceptions might be needed here. Consider this scenario:

try:
    async with timeout_after(100):
        async with timeout_after(5):
            await curio.sleep(10)
except TaskTimeout:
    print('It took too long')

The inner timeout will fire and create a TaskTimeout exception. However, that exception isn't caught where it fired. Instead, it propagates out through the enclosing 100 s timeout. This creates an interesting dilemma: is the outer context supposed to receive a TaskTimeout even though its own time interval hasn't expired?

I'm somewhat inclined to say no. Instead, maybe it should receive some kind of operational error, say UncaughtTimeoutError, indicating that a timeout occurred but there was no code to actually catch it at the point of failure.

Short version: You will only get a TaskTimeout if your time interval has actually expired. If you get a timeout and the interval has not expired, that's a problem.
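The conversion rule can be sketched at the timeout boundary like this. It's a synchronous toy: `expired` is injected by hand rather than computed from a real clock, and nothing here is curio's actual implementation (only the exception names follow the comment above):

```python
class TaskTimeout(Exception):
    """Raised to a block whose own time interval expired."""

class UncaughtTimeoutError(Exception):
    """A nested timeout fired but nothing caught it at the point of failure."""

class timeout_boundary:
    def __init__(self, expired):
        self.expired = expired  # injected for illustration only

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if isinstance(exc, TaskTimeout) and not self.expired:
            # A TaskTimeout reached us from a nested block, but our own
            # interval hasn't expired: escalate to an operational error.
            raise UncaughtTimeoutError() from exc
        return False  # anything else propagates unchanged
```

So the 100 s block in the example would see the stray inner TaskTimeout and turn it into UncaughtTimeoutError rather than passing it to its caller as its own timeout.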

dabeaz (Owner) commented Oct 17, 2016

Pushed a change that addresses this with exceptions. Some details in the CHANGES file. Have also updated the docs. I think this is cool--solves a lot of tricky corner cases.

njsmith (Contributor) commented Oct 28, 2016

Potentially useful reference: a discussion in twisted about how to handle timeouts in the context of Deferred objects; they also ended up with a system that injects a cancellation exception, and then later catches it and converts it to a timeout exception.

dabeaz (Owner) commented Oct 28, 2016

Interesting. I wonder if Curio's TaskTimeout should subclass CancelledError. Maybe.

njsmith (Contributor) commented Oct 29, 2016

I expect there will be more fine-tuning of curio's timeout apis as we gain more experience with them :-)

Some lower-priority things that will probably be important to add eventually; might as well write them down here so that they don't (or at least, might not :-)) get lost:

  • Absolute timeouts (timeout_at instead of timeout_after)
  • Ability to query the kernel for the current deadline. This is something I've seen in Google-derived code -- they have a lot of code where some function is implemented by doing a bunch of RPCs, and they want to allow logic like "hmm, I see that my deadline is in 50 ms, but I know it takes 100 ms to get a response from that service; I'm going to give up now instead of wasting bandwidth and remote CPU sending out a request where I know I won't be around to see the answer". Or, if they do decide to issue the RPC, then they pass the deadline along as metadata -- so if my deadline is 50 ms, I might tell the remote side to do as much work as it can in 40 ms.
  • (maybe?) Ability to extend/modify a deadline. I'm thinking about cases like "I want to time out if the peer doesn't download at least 1 KB/s", where each time it downloads 1 KB you extend the deadline a bit. (OTOH maybe this would be better handled in other ways, I'm not sure.)
  • If sendall gets timed out / cancelled, right now the socket is left in an indeterminate state -- there's no way to know how much data was actually sent, so pretty much all you can do is close it. Maybe some operations should tag the exception with extra data (like "bytes actually sent" in the sendall case)? Not sure how useful this would actually be, but if you don't have it then you definitely can't use it.
  • Since this is such a fundamental piece of functionality, possibly it would make sense to have a lowest-level set of primitives that are a bit more flexible, e.g. allowing the injection of an arbitrary exception, or overriding the "translation" logic that converts the exception once it hits the timeout boundary. (Glyph touches on this some in that thread.) Not sure what this would look like exactly, just leaving a breadcrumb for later...
  • Not exactly timeout related, but I think it'd make sense to pull the clock used for timeouts etc. "inside" curio's API. Of course it will be based on time.monotonic normally, but that shouldn't be guaranteed, because if we control the clock then it's possible to do things like have a testing mode where we can manually step the clock to trigger timeouts, or run it at 10x speed, etc.
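The last point above can be sketched as a pluggable clock interface. MockClock is hypothetical, not a curio feature; it just shows why putting the clock behind curio's API would make timeout tests deterministic:

```python
import time

class Clock:
    """Normal operation: just wrap time.monotonic."""
    def now(self):
        return time.monotonic()

class MockClock(Clock):
    """Testing mode: time only advances when manually stepped, so a test
    can trigger a pending timeout without real waiting."""
    def __init__(self, start=0.0):
        self._t = start

    def now(self):
        return self._t

    def step(self, dt):
        # Manually advance the clock, e.g. past a deadline under test.
        self._t += dt
```

A kernel written against `Clock.now()` instead of `time.monotonic()` directly could be handed a MockClock in tests, or a clock running at 10x speed.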

dabeaz (Owner) commented Oct 29, 2016

I like these ideas. I recently changed curio to use absolute clocks internally. So, having a timeout_at() function would be trivial to add. Good catch on sendall(). I think it should set a bytes_sent attribute on cancellation exceptions. That should be possible.

njsmith (Contributor) commented Nov 7, 2016

Ugh, and here's another nasty case that we should be aware of. With async generator support coming in, consider the effect of:

async def ugh():
    async with curio.timeout_after(10):
        yield "have fun"

async def blargh():
    await ugh().__anext__()
    # ugh's timeout fires here, entirely outside of the with block
    await curio.sleep(20)

Probably there is no good solution right now except maybe to detect the worst cases and give some better-than-nothing error message. For the python 3.7 timeframe something like PEP 521 might be the solution...
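One better-than-nothing check might be a top-level guard that warns when a timeout cancellation escapes without ever hitting its timeout_after boundary. Everything here is hypothetical, sketched with a plain generator standing in for a coroutine:

```python
import warnings

class TimeoutCancellation(Exception):
    """Stand-in for the exception a timeout_after injects."""

def drive(gen):
    """Step a generator-based 'task' to completion. If a timeout
    cancellation escapes the top of the stack, its timeout_after block
    never saw it (e.g. the timeout was armed inside an async generator
    that yielded out of the with block), so emit a warning."""
    try:
        for _ in gen:
            pass
    except TimeoutCancellation:
        warnings.warn("timeout fired outside its timeout_after block")

def ugh():
    # simulate a timeout injected after the guarding block was suspended
    raise TimeoutCancellation()
    yield  # unreachable; makes this a generator function
```

This doesn't fix the semantics, but it turns silent misbehavior into a diagnostic, which is roughly the "better-than-nothing error message" mentioned above.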

dabeaz (Owner) commented Nov 8, 2016

Oh, that's kind of evil. I'm not exactly sure what a programmer writing that would expect the behavior to be since there's no way to actually know when control will return from a yield (if at all). I'm not even sure how I'd detect something like that.

njsmith (Contributor) commented Nov 8, 2016

I guess we could at least detect when a timeout exception manages to propagate out the top of the call stack without hitting the context manager that supposedly triggered it, and print a useful warning. But yeah, the point of PEP 521 is exactly to make it possible to detect this kind of nonsense. (Right now if you look at the draft then it just talks about synchronous context managers, but this example convinces me that it should talk about async context managers too.)


dabeaz (Owner) commented Jan 4, 2017

Closing this for now.
