Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rfc] formalizing a split between "synchronous" traps and "blocking" traps #101

Closed
njsmith opened this issue Nov 10, 2016 · 13 comments
Closed

Comments

@njsmith
Copy link
Contributor

njsmith commented Nov 10, 2016

I'd like to modify the trap handling to formalize a distinction between:

  • synchronous traps: traps which always return immediately, which never trigger a reschedule (so you are guaranteed that other tasks haven't touched any shared state), and are not cancellation points.

  • (potentially) blocking traps: traps which might block, or might trigger a reschedule (curio.sleep(0)), and are cancellation points.

The two motivations are: (a) With my user trying to write cancellation-safe code hat on, it's very helpful if some kinds of calls are documented to be synchronous, and to know which ones those are and to have a uniform set of guarantees around them. This has even bitten curio itself recently -- curio.Event was assuming that _queue_reschedule_function was synchronous, but it wasn't; the fix was to make it synchronous. This strongly suggests that synchronicity is and should be a public API guarantee for some operations. (b) With my poking around in curio's internals hat on, I think this would let us simplify some of the code. In particular, synchronous traps could be implemented by calling a function to do the trap-specific work and then having some standard code to take that function's return value/exception, inject that into the invoking task, and reschedule it; this would be an improvement over the current situation where every trap implements that logic in an ad hoc way with ready_appendleft calls scattered around everywhere. It might also make things more efficient in that if we know that we're going to reinvoke the same task, we can skip enqueueing/looping/dequeueing entirely, and it potentially simplifies things like alternative task scheduling policies if the queues only have to handle real rescheduling, not trivial reinvoke-the-same-task "rescheduling".

Posting this here first because I want to get @dabeaz's opinion before I go refactoring a big chunk of curio's internals :-)

@dabeaz
Copy link
Owner

dabeaz commented Nov 11, 2016

Can you say a bit more about what you're thinking about?

One thing that's been on my mind generally is just the interaction between running tasks and the Curio kernel itself. At the moment, I have taken a very strict approach by which all Task->Kernel interaction occurs via traps (and hence the await/async mechanism). There is a certain consistency to doing it that way. It makes it more like an operating system. It also solves the problem of having to carry a reference to the kernel around (like needing the loop in asyncio). Are you thinking that synchronous operations would occur in some other way?

@njsmith
Copy link
Contributor Author

njsmith commented Nov 11, 2016

Oh, sorry, no, I'm not suggesting changing the mechanics of the
task<->kernel API. I'm talking about standardizing the semantics of a
subset of the traps, and then taking advantage of that to let them share
more code internally.

On Nov 11, 2016 8:19 AM, "David Beazley" notifications@github.com wrote:

Can you say a bit more about what you're thinking about?

One thing that's been on my mind generally is just the interaction between
running tasks and the Curio kernel itself. At the moment, I have taken a
very strict approach by which all Task->Kernel interaction occurs via traps
(and hence the await/async mechanism). There is a certain consistency to
doing it that way. It makes it more like an operating system. It also
solves the problem of having to carry a reference to the kernel around
(like needing the loop in asyncio). Are you thinking that synchronous
operations would occur in some other way?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#101 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAlOaH0nuKchxIWusOCG0_Hm7vQCq8adks5q9JVpgaJpZM4KvPhZ
.

@dabeaz
Copy link
Owner

dabeaz commented Nov 11, 2016

I say "go for it!"

@aktiur
Copy link

aktiur commented Nov 12, 2016

I was just thinking about the problem of knowing which Curio calls are indeed synchronous and which are not, with the first hat @njsmith mentioned on.

I think formalizing the traps the way it was done helps, but it does not help with curio's end user facing coroutines for which traps used are an implementation detail. For instance, nowhere does it say in the docs that current_task is synchronous and just gives back information without releasing control of execution.

The fact that there's no differences in calling conventions between potentially blocking and synchronous coroutines bothers me a bit, as I would say it negates in part the advantages in using async/await (as compared to threads) as underlined in Unyielding. I feel that, ideally, it should be obvious which end-user facing coroutines only use synchronous traps and are guaranteed in the future to do so.

At the minimum, I guess those for which it is guaranteed should be documented as such, which they are not right now.

I also have a crazier idea, that seems unlikely to be good, but still sharing.
Right now, most coroutines are blocking, and most async context managers don't block in __aenter__. Maybe curio could make the guarantee that its own async context managers won't block in aenter, and all coroutines that should not block (and that do not already have an equivalent async context manager) should be converted to context managers?

Here is a list of coroutines that should be guaranteed to be synchronous, and that would need to be converted, as far as I can tell:

  • current_task
  • timeout_after and timeout_at (already working as a context manager)
  • ignore_after and ignore_at (already working as a context manager)
  • clock

But a solution should be found for the async context managers that do block in aenter:

  • AsyncFile in the case where they are used with fileobj = None (if I understand correctly, this is a private interface used only when using aopen, so this could be factored out)
  • Lock, Semaphore, RLock,
  • wait
  • Everything that uses _contextadapt (but that should be user provided code, if I'm not mistaken... it may be a problem in code that uses abide)

This seems to be especially problematic for Locks and Semaphores, where context managers are the preferred interface, and there does not seem to be any good solution there.

@dabeaz
Copy link
Owner

dabeaz commented Nov 12, 2016

I think this is being overthought. I don't want to make any guarantee one way or the other about any operation involving async/await other than any such operation might block (or context switch). The operative word here is "might." The fact that certain low-level operations don't block is purely an implementation detail--and one that could change at any moment. It could be different using different kernel implementations or different scheduling priorities.

Most of these "synchronous traps" are low-level operations that are used to assist in the implementation of some higher-level feature. They are not something that an end-user would typically be using directly in their own code. Honestly, I could probably just make them methods on the curio Kernel class, but then all of the synchronous traps would have to change to things like (await get_kernel()).sync_method(). I don't really want to do that. The kernel is meant to be fairly opaque (think OS kernel) and I'd still have to trap into the kernel anyways just to get a reference to the kernel in order to invoke the method. Making the operation its own trap just feels cleaner and more straightforward to me.

If some particular feature is formally specified to be non-blocking, then it can be documented as such. I'm fine with that.

@njsmith
Copy link
Contributor Author

njsmith commented Nov 13, 2016

...I made a long post about this on the curio forum last night, and now it seems to have disappeared. Or maybe it's trapped in some sort of moderation hades?

@aktiur:

I feel that, ideally, it should be obvious which end-user facing coroutines only use synchronous traps and are guaranteed in the future to do so.

Agreed -- my end goal is to figure out what guarantees end-user APIs are or aren't guaranteed to provide, but figuring out the right system for traps seems like a good first step :-).

Maybe curio could make the guarantee that its own async context managers won't block in aenter, and all coroutines that should not block [...] should be converted to context managers?

This is... really weird :-). Context managers already have a clear and well-motivated use case that has nothing to do with synchronous/non-synchronous, and like you note, some of the most important use cases for async context managers are ones that involve blocking in __aenter__.

What might make sense to me is to have a rule that whenever we can make a function non-async, we should, because that is a clear signal to the user about what it can and can't do. So for example, maybe Event.set() should be a regular method, not an async method. (Curio already provides this functionality twice -- SyncEvent.set() is a regular method, and Event.set() does the weird frame hack thing to let itself be called as a sync method from sync code. AFAICT there's no reason for the await event.set() style call to exist.) OTOH, there will still be coroutines like curio.timeout_after and curio.current_task that are synchronous but are all about talking to the event loop, and I guess the best option for these is just to have a convention that we should document clearly whatever guarantees we settle on.

@dabeaz:

I think this is being overthought. I don't want to make any guarantee one way or the other about any operation involving async/await other than any such operation might block (or context switch). The operative word here is "might." The fact that certain low-level operations don't block is purely an implementation detail--and one that could change at any moment. It could be different using different kernel implementations or different scheduling priorities.

As a general principle, I strongly disagree. I started writing an erlang-style task supervisor, and it turns out that this is really annoying if you aren't allowed to even know whether task.cancel is cancellable. That's why I started poking at this in the first place, because it was making it hard to write code on top of curio. curio.Event was recently broken because you accidentally assumed a trap was synchronous and it wasn't. If you're making mistakes like these, it seems only fair to assume that users will as well :-).

Obviously there's a whole substantive conversation to have about exactly which operations should make which guarantees, though. I just think the rule should be more nuanced than "any operation could do anything". That's nice for devs, but not so nice for users. Good APIs have to balance those competing interests, plus things like usability, teachability, etc.

Another general rule for API design is that if the code makes a guarantee but the docs don't, then this is often the worst of all worlds, because user can't count on it, but they end up depending on it anyway, so that the devs lose the freedom to change it. So for example, I like this note from Linus mentioning how some quirks of how Linux schedules processes after fork ended up being stuck in stone for a while because of a bash bug -- and there they're talking about a fully-preemptive scheduler running on multi-core hardware, where the original "guarantee" was pretty weak to start with. The analogous question for curio would be whether curio.spawn runs the child task before returning to the parent task. My preference would be that we document that spawn is synchronous (so the parent always runs first), but if we don't do that then I think we'd be better off making it random (like, I mean literally doing import random and using that inside the spawn implementation) than providing a consistent behavior but refusing to document it.

Also also, even if we, say, go ahead and document that spawn is synchronous now, as long as curio remains unstable we can always change our mind later and un-document it :-). But IMO the point of curio being unstable is give us room to figure out what the eventual stable API should look like, so that includes exploring options for what kinds of stability guarantees make sense.

@dabeaz
Copy link
Owner

dabeaz commented Nov 13, 2016

Trying to decide where to respond... here or the forum. I guess I'll do it here for now.

In the big picture, I agree that it will be useful to more precisely specify which operations are synchronous and which aren't. So, any patch that addresses that will be welcome. Certain details such as the behavior of spawn() are rather arbitrary. Right now, the child task runs first, but that's not based on any underlying principle in particular. One of the tasks has to run next. It just so happened that the child runs first right now.

With regard to Event.set(), that is an extremely special case where I am performing some experiments on the API. I think that method should be async and require an await like all of the other methods. This is because setting an event ultimately might result in the scheduling of a task. If one decides to implement task priorities, it's possible that the newly scheduled task would have a higher-priority than the one that just set the event. Thus, you'd potentially see a context switch right at that point. It needs to be asynchronous. I think that all methods that interact with the kernel in any way should use async/await.

Just on this note: One of the things I'm exploring (experimentally) is how one bridges the divide between async and sync code. I have found that it is very easy for asynchronous code to escape that world and start calling lots of a synchronous functions where you no longer have the ability to use await. Yet, from that space, it still might be useful to communicate back to the async world. To that end, there are just two methods that can be called from synchronous code. One is Event.set(). The other method is Queue.put(). Again, this is highly experimental. I might change this aspect of curio at any time or remove it entirely.

Some further things to think about: Right now task.cancel() waits for the task to actually terminate. I don't know if this is a good idea or not. The decision to do this wasn't entirely arbitrary, but one of my concerns was how you would handle a situation where a child task simply ignored the cancellation request (i.e., caught the exception and continued to run anyways). Making the task.cancel() block forces the matter---if the child task ignores the cancellation request, the task performing the cancel blocks forever. You'll then be forced to figure out what's happening. If cancel() doesn't block, I fear that you could end up with lots of zombie child tasks silently sitting around in the background without really realizing it.

One other note on cancel: It is an error to cancel any task that happens to be sitting on the curio ready queue. If a task is sitting on the ready queue, it means that something happened to make it go there. It needs to run again to fully finish whatever that might have been. A particularly nasty situation concerns locks and synchronization primitives. One reason a task might be on the ready queue is that it has just been granted access to a lock. When the task runs again, it's assumed that it now has the lock. You can't just brute-force cancel a task like that. If you do, it won't get a chance the release the lock and everything will deadlock. Yes, this means that a cancellation request might be deferred until the task makes a future request to the kernel. Those are the breaks--I just don't know any other way around this.

@njsmith
Copy link
Contributor Author

njsmith commented Nov 14, 2016

@dabeaz: when you have a chance could you check if there's some way to rescue my forum post? Looking at the forum it still doesn't appear to be visible, and while I can probably reconstruct it I'd rather not retype all that again :-)

@dabeaz
Copy link
Owner

dabeaz commented Nov 14, 2016

For some reason that message got flagged as spam. I've restored it (I think). Will need to fiddle with forum settings perhaps. Sorry about that.

@imrn
Copy link

imrn commented Nov 14, 2016

@njsmith : Can you provide a 'compact' use case where currently establisted semantics are not enough? And/or any similar semantics that currently exists in alternative async libs?

@njsmith
Copy link
Contributor Author

njsmith commented Nov 18, 2016

My main message on the forum still seems to be hidden, and I have a message from the forum software claiming that "Your post was flagged as spam: the community feels it is an advertisement [...] Multiple community members flagged this post before it was hidden". Which seems... unlikely? But it does look like I can at least retrieve the content, so I'll repost here (with some small edits):


Hi all,

This is more of a high-level design discussion, so I guess I'll try putting it here instead of github issues and see what happens!

I've recently been trying to understand the exact semantics of curio's primitive operations around rescheduling/blocking/cancellation. Like, when I do await curio.whatever(), then:

  • Is it possible for the current task to block?
  • Is it possible for the current task to get scheduled away, and allow some other task to run in the mean time? (This is not quite the same as blocking.)
  • It is possible for the operation to raise a CancelledError?
  • If it does raise CancelledError, then what can I conclude about the success/failure of the original operation?
  • Is there some systematic rule for answering these questions? Can we document it?

In general these questions apply to all of curio's public API, but as a starting point I've just been looking at the low-level traps. It turns out that currently, traps fall roughly into three categories:

  • "Synchronous traps": always return immediately, never schedule away, are never a cancellation point (example: _trap_current_task). I just submitted a patch to handle these in a more uniform manner.
  • "Blocking traps": always block; allow cancellation; cancellation means the trap didn't happen (example: _trap_io)
  • "Rescheduling traps": never block, but do trigger a "reschedule", so other tasks can/will run before the trap returns (example: _trap_spawn). It turns out that these are not cancellation points, though the logic is somewhat confusing.

There are also some weirdos, like the _trap_sleep trap usually blocks, but might reschedule instead (if you just did sleep(0)).

_trap_cancel is also weird. If we want to go cancel a task, then how we handle that depends on what it was doing when it stopped running. (We know that it isn't currently running, because the task that's calling _trap_cancel is running.) Synchronous traps aren't an issue here, because we know that by definition the victim didn't stop running inside a synchronous trap. And blocking traps are OK -- they all have the machinery to cancel the blocking and raise the CancelledError. But there are two other cases too: (1) it's in the middle of a blocking trap, but the operation already succeeded, and it just hasn't been scheduled again yet. In this case we missed our chance to cancel this operation, so we have to wait for it to block again. (2) It's in a rescheduling trap, which looks identical to situation (1). That's why rescheduling traps aren't cancellation points. Either way, we can't just inject a CancelledError immediately. So what happens right now is that if the victim isn't cancellable, then the task that's trying to do the cancellation keeps rescheduling itself and retrying the cancellation, until eventually the victim task does block on something, at which point we kill it.

I'd like to propose the following API design principle: every trap should be either "synchronous" or "blocking", with the semantics described above. And, traps should have the "synchronous" semantics whenever possible.

(Rationale: synchronous traps are much easier to reason about; yield points and cancellation points are both brain-melting, especially when you have too many of them. So better to leave them out by default when possible, and then let users add them back in if necessary.)

If this is accepted, then here are the traps that would need to change (they don't currently fit into either category), and some discussion of the trade-offs:

  • _trap_sleep, b/c of the sleep(0) issue: This is actually a nasty bug, because inserting a task that just does sleep(0) in a loop should allow other tasks to run, but it doesn't quite: since the looping task immediately gets scheduled onto the ready queue, we can never finish this iteration through the run loop; so other tasks that are already runnable do get to run, but we never poll for I/O to let tasks become runnable. So we really should just make sleep(0) "blocking" in the above sense. Fixing this is easy and I have a patch for it. [edit: this is now fixed in master]
  • _trap_join: This is similar to _trap_sleep -- it's usually blocking, but if the task has already terminated then it's rescheduling. The easy fix would be, if the task has already terminated, then it's equivalent to sleep(0).
  • _trap_reschedule_tasks: This is what operations like Event.set or Queue.put use to resume any tasks that are blocked in Event.wait or Queue.get. Right now, these operations have slightly quirky (but deterministic) semantics: the task that does the waking always reschedules itself to run after the tasks that were awoken. There are some tests that fail if this is changed. Still, I think that the benefits of synchronicity are worth changing this. I can't really think of any particular advantages to the current system.
  • _trap_spawn: The question of whether child-runs-first or parent-runs-first is an interesting one in pre-emptive kernels (see e.g.), but I can't see how any of the arguments really carry over (e.g. an argument for child-runs-first is that it will soon call exec, which is irrelevant to curio), and parent-runs-first (i.e., making _trap_spawn return synchronously) is easier to reason about.
  • _trap_cancel: For the reasons described above. I think if we standardize how blocking traps work, then that should make it relatively easy to rework the cancellation logic so that, instead of the canceling task having to hang around and repeatedly retry cancels, we could make it so that we can enqueue a cancel to a task, and then blocking traps check for that and immediately as part of the shared "blocking trap logic" and raise an exception if a cancellation is pending.

Making _trap_reschedule_tasks and _trap_spawn synchronous in particular would also make it easier to experiment with alternative, non-FIFO scheduling policies, because currently those two traps directly expose the FIFO-ness of the scheduler as part of their public API.

Thoughts?

@njsmith
Copy link
Contributor Author

njsmith commented Nov 19, 2016

See #116 for how the sync-vs-blocking approach could work.

@dabeaz
Copy link
Owner

dabeaz commented Jan 4, 2017

Closing this. However, might modify behavior of spawn() back to a child-first arrangement. Ran into something recently where that would have been advantageous. Also, it's documented as doing this.

@dabeaz dabeaz closed this as completed Jan 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants