Proposal: Unify Sockets, Timers, and Channels #360
Comments
Thanks for putting this together, @carllerche!
I'm pretty excited about this, definitely thanks for putting it together @carllerche! Allowing essentially arbitrary types to implement `Evented` is great. Some thoughts of mine:
Overall this all looks really great to me; I'm quite excited to see how it pans out!
Oh yeah, those are great changes. And I agree with one of the points by @alexcrichton, which is to have `Poll` use `&self` instead of `&mut self`.
Some APIs explained here seem internal to mio, so excuse me if I missed or misunderstood anything. Mioco already has
I think Channels should not implement actual queues. They should only carry an atomic notification, while the queue itself is up to the user. That would allow making different versions of them: fixed circular buffer, growable vec, mpsc/mpmc, etc. Of course mio could provide some standard implementations. That would fix #322. So 👍 from me.
Replying to @alexcrichton:

> How much of the lock-free manipulation is necessary?

The primary reason why I am pretty focused on this is the performance of the current implementation. So, whether mutexes are acceptable for 1.0 is a tough call. Benchmarking something like this is tricky without a "real world" workload.

> It'd be nice to remove the wheel aspect from the timer.

It would be, but I am not sure how. Something has to wake up the `Poll` when a timeout expires. The only other alternate strategy that I can think of would be to spawn a timer thread that has a pipe and writes a byte to the pipe when a timeout expires. This would require having a communication channel to the thread as well to register timeouts.

> Are you thinking of exporting a separate Selector type?

It will stay crate private as it is now.

> Do you think it'd be appropriate to have Poll use &self instead of &mut self?

I have been thinking the same, but I didn't bring it up explicitly. I think that it should be possible.
> The `mio::Handler` abstraction could be completely removed, and users could call their own tick.

This is TBD, but I believe that this may be the best option. There isn't too much utility in having it.

> I think Channels should not implement actual queues.

That is the plan.
Cool, sounds good to me. I'll browse the PR as well, but I suspect that you're right in that mutexes may just be too heavy for this use case.
I agree, and the wheel aspect is really just an implementation detail. From an API perspective you just request an event on a delay, which seems super reasonable! I also agree that spawning a thread to send notifications is basically just a non-starter.
The only major API implication I could think of is that you would pass in
The new unified API seems good, but I think the queuing adds a lot of complexity behind the scenes. The proposed alternative seems highly preferable: use timerfd on any platform that supports it, with a compatible implementation on other platforms. You may not even need a thread and pipe; poll supports a timeout, so just calculate the next timeout and pass that. Similar mechanisms exist on Windows.
For Channel, on Linux, you can use eventfd. You still need syscalls, but eventfd has far less overhead than a pipe.
Just anecdotal support for this: we (Dropbox) have rust daemons that push 20+Gbps between individual machines routinely on a few mio event loops. Every operation must have a timeout, of course, to prevent resource pileup and queuing/congestion collapse, so timers are everywhere. Based on profiling we've done, the performance of timers and channels is really critical. So I would be concerned about changes that would slow these down in a substantial way.
General reaction from me, I'm concerned about the implementation complexity, but definitely intrigued by being able to control the behavior of the channel more directly. I also agree having a single "readiness" API is a cleaner abstraction rather than special-casing timers and channels. |
Great write-up. I really like the idea. Generalizing the different concepts makes everything a lot easier. I agree that the implementation can be quite complex, but the overall gain seems worth it. Benchmarks would be nice, since this could potentially have a performance impact.
I'm very sensitive to the performance requirements of Mio. I should probably add a more explicit performance analysis section.

Specifically for Linux, I do not believe that there will be any change in performance assuming one channel and one timer per event loop (which is the 99% case). To port an application from Mio 0.5 to Mio w/ this proposal landed, after creating a `Poll`, the timer and the channel would simply be registered with it.

At runtime, given usual behavior, the timer under heavy load will only queue readiness once per timer tick (default of 100ms). Under heavy load, the channel will queue readiness at most once per call to `Poll::poll`. So, on Linux, the usage of the additional logic is negligible compared to a heavy socket load.

I do think that benchmarks are needed, but I think that they need to be done in a "real world" environment. Synthetic benchmarks in this case will not be very helpful as, by design, the completion queue on Linux is not in the hot path. (On a side note, back to @alexcrichton's point about using a Mutex initially, it may be OK as long as all fast path guards are using an atomic read.)

Regarding code complexity, I want to emphasize strongly that there is no way to remove this code entirely from Mio. It will always have to exist at the very least for Windows. The question is whether or not it should be provided to all platforms.
Also, to all, I am very interested in your opinions about what to do with `EventLoop`.
I very much like the idea of removing `EventLoop`.
I agree with @dwrensha about removing `EventLoop`.
I also agree with @dwrensha that using `Poll` directly is preferable.
Awesome.
I'm not entirely sure about that. I think you can handle all of those cases on Windows using WaitForMultipleObjects in place of poll/epoll. With a socket, you can call WSAEventSelect to get an event you can pass to WaitForMultipleObjects. For a timeout, you can either use the built-in timeout parameter in WaitForMultipleObjects or create a waitable timer object. And for channels, a semaphore should work. Is there some case those can't cover? |
Looks good @carllerche. I think removing `EventLoop` is reasonable.
I have a few thoughts:
Overall, I think this approach is good. Sorry, I'm not familiar with the Windows code at all.
I don't know if "one timer per event loop" is really 99% of use cases. In mioco each coroutine can use one or more timers. Mioco has a test running a million coroutines (each sleeping for a while on its timer) and it works just fine, so I hope such a scenario won't suffer in the new design. I guess it's the same as @tailhook's point 3. It's possible to implement another layer of timer wheel in mioco, but I'd be happy to avoid it.

Also, I'd like to make sure there are no fixed-size-only designs and that growable options can still be supported. E.g. in the above test the wheel size needs to be adjusted, and I was going to file an issue to ask for a "just grow the time wheel if needed" setting. I understand the point of "no allocations for performance", but IMO growing instead of crashing is a much better proposition for anything except embedded applications. I'm not sure if the proposed changes alter anything in that matter, so I'm just making sure.
@dpc To be clear, right now a single timer multiplexes many timeouts onto one wheel, so each coroutine's timeout does not require its own timer.

Another advantage of this proposal is that it exposes the primitives so that, if the default mio timer does not suit the requirements, an alternate timer implementation could be built.

@dpc Also, I disagree strongly about "growing vs crashing". A well designed network application will think about limits. A system that "grows" still has a hard limit, but it is undefined and will be hit at an indeterminate time. We can agree to disagree, but in my experience it is always better to be explicit about limits when building any robust system.
@carllerche You're probably right about network applications. But networking is not the only area where mio/mioco will be used. In fact, I started mioco because I wanted a more convenient way to handle unix pipes in colerr. Mio will be a standard for building any Rust io-driven code, so it needs to accommodate many requirements.

Let's say someone builds the logging part of backup software with mio/mioco. What should the fixed values be? I don't want my backups to fail in 10 years because my system now has 128x more data to back up, on a machine with 8x more threads and 32x more ram, and some fixed values somewhere in the code are too small now. So if possible I'd like to pass the choice of behavior to the end user. And wherever mio uses fixed resources and panics when it runs out of them, I'd be happy to see a flag that makes it allocate more instead.
Maybe this discussion is a little bit off-topic, but it's an important one. On the one hand, I totally agree with you that any system has a limit. On the other hand, it's unclear how to determine the limit. Let's pretend that it's easy to determine the limit for the number of connections (the limit that is not configured in mio, but currently fixed in rotor; not sure about mioco).

Fortunately, with the newest design of rotor you have exactly one timer per state machine. So unless you create a state machine per request (or something like this), it's fine to derive the limit. But when there is no such limit (like in mioco AFAIU), it's just impossible to define. What if you activate another configuration option and there are more timers per state machine? What if the timer wheel is large enough 99.9% of the time but sometimes it crashes? How do you distinguish whether it is some kind of resource leak (like a never removed timer), or just too small a [number-of-fsm * number-of-timers-per-fsm] value?

So I second the growing timer queue solution. Usually, you'll end up with OOM because of running out of memory for buffers, not for timers, I think.
If we're going down to primitives here and leaving higher level abstractions to other applications & libraries, I am inclined to agree that hard limits should also be managed at a higher level, given that there are the means to stat the used resources and act accordingly. |
Even for general purpose networking applications fixed values are not good enough. Let's say you build a network daemon (say a p2p dropbox like syncthing) and distribute it as a binary to users. How are the fixed values to be picked? Some people will want to use it to sync a couple of files between two systems smaller than Raspberry Pis, and some will be syncing terabytes of data every day in a huge farm of servers.
Like others, I am apprehensive about the level of unsafety in the implementation (extensive use of raw atomic primitives and raw pointers, atomics mixed with thread-unsafe tools like Cell), but I think this can be reduced by internal implementation organization without affecting the public API. I conceptually like the proposed API, but it's hard to tell in detail; I think the proposal would be clearer if it included type signatures for the major public methods of the API. All my other thoughts are basically covered by existing discussion; no need to repeat ourselves.
Well, I think there is a good alternative to this for Windows support. If we look closer at the stack "mio > rotor > rotor-stream > rotor-http > basic-http-server" (as an example), we could observe that:
So if we are willing to augment all three of mio, rotor, and rotor-stream, we could have the following benefits:
Another good reason to try this approach is to be able to use another two improvements for applications:
As far as I know both approaches are somewhat completion-based. And while we may accept mediocre performance on Windows, being able to utilize those two at full speed will be super-great. Neither of them can reach full performance by replacing just "mio". On the other hand, I believe (though, can't prove) that it's possible to reach top performance in a userspace network stack and RDMA by enhancing all three (mio, rotor, rotor-stream) with some compile-time options, leaving every protocol on top of them completely unaware of the optimization. All of this is to say that it may be better to move Windows support to a higher level of abstraction.
@tailhook Sorry, I'm not really following what you are saying. |
I've been implementing this proposal on #371 and have been hitting some conceptual issues.

Currently, according to the proposal,

Second, I am having a hard time unifying setting immediate readiness vs. setting delayed readiness (for the timer). Setting immediate readiness is pretty trivial to implement in a way that is thread-safe; setting delayed readiness is not.

This means that while setting immediate readiness should be concurrent (to implement a channel), setting delayed readiness should not. The first solution to this that I can think of is to split up `Registration` into two types. When getting a registration, there would be two constructors:
I believe that this would satisfy the requirements for implementing `Timer` and `Channel`. Thoughts?
Seems very reasonable to split the two types, given that the consumer and producer code are likely to be quite different and in different places, and each has different concerns. I would probably have the Registration constructors return (producer, consumer) rather than (consumer, producer), though.
Also seems like reasonable tradeoffs to me. Out of curiosity, is
@alexcrichton interesting thought, I think you are probably correct. |
I might be missing something important here, but I strongly believe that there should be no synchronization of any kind in Poll (and mio in general). It should be a single threaded (!Sync) abstraction over epoll_wait/kevent. Everything else can be implemented on top of Poll.poll(). For multithreaded apps you can either dispatch actual work to worker threads or create a per-thread event loop. For the "worker threads" case you can even use multiple spsc queues (one per thread), which, given a fast (non-std) spsc queue implementation, will give a huge performance boost. For multiple event loops synchronization is just redundant, since there is nothing to synchronize.
You can (and should) dispatch work to a worker thread, but the worker thread needs a method to send the work back to the IO thread. |
Any update on this proposal? Is it officially the way forward? If so, is there an ETA? |
#371 is passing CI now. It includes the custom readiness queue and a channel implementation. I'm hoping to do the timer next but get this merged in as is. Does anyone wish to review the code? I know it is a somewhat large PR. |
Unify Sockets, Timers, and Channels
Currently, there are two runtime APIs: `Poll` and `EventLoop`. `Poll` is the abstraction handling readiness for sockets and waiting for events on these sockets. `EventLoop` is a wrapper around `Poll` providing a timeout API and a cross thread notification API, as well as the loop / reactive (via `Handler`) abstraction.

I am proposing to extract the timer and the notification channel features into standalone types that implement `Evented` and thus can be used directly with `Poll`.
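For example, usage might look like the following sketch (names such as `Timer::new`, `channel`, and the registration arguments are assumptions for illustration, not a settled API):

```
// Hypothetical sketch of the proposed API; not compilable against any
// released mio version.
let poll = Poll::new().unwrap();
let timer = Timer::new();
let (tx, rx) = channel();

// The timer and the channel receiver are plain `Evented` values,
// registered exactly like a socket would be:
poll.register(&timer, Token(0), EventSet::readable(), PollOpt::edge()).unwrap();
poll.register(&rx, Token(1), EventSet::readable(), PollOpt::edge()).unwrap();
```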
The insight is that a lot of the code that is currently Windows specific is useful to all platforms. Stabilizing the API and providing it to all platforms allows implementing `Evented` for arbitrary types.

Advantages
Disadvantages
The primary disadvantage that I can think of is that the code path
around timers & the notification channel becomes slightly more
complicated. I don't believe that the change would have a meaningful
performance impact.
There is also additional code complexity for all platforms. However,
this code complexity already exists for Windows.
Behavior
An
Evented
would mirror the behavior of a socket registered withepoll. Specifically, in a single threaded environment:
poll
.Poll
is permitted to fire of spurious readiness events except if the value has been dropped.In the presence of concurrency, specifically readiness being modified on
a different thread than
Poll
, a best effort is made to preserve thesesemantics.
Implementation

This section will describe how to implement a custom `Evented` type as well as Mio's internals to handle it. For simplicity and performance, custom `Evented` types will only be able to be registered with a single `Poll`.

It is important to note that the implementation is not intended to replace FD polling on epoll & kqueue. It is meant to work in conjunction with the OS's event queue to support types that cannot be implemented using a socket or other system type that is compatible with the system's event queue.
Readiness Queue

`Poll` will maintain an internal readiness queue, represented as a linked list. The linked list head pointer is an `AtomicPtr`. All of the nodes in the linked list are owned by the `Poll` instance.

The type declarations are for illustration only. The actual implementations will have some additional memory safety requirements.
Registration

When a `MyEvented` value is registered with the event loop, a new `Registration` value is obtained. `Registration` will include the internal `EventSet::dropped()` event in the interest.
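A sketch of what registration might look like for a custom type (the `Registration::new` signature and the `Evented` method shape here are assumptions based on the proposal, not a real API):

```
// Hypothetical; mirrors the proposal's description only.
impl Evented for MyEvented {
    fn register(&self, poll: &mut Poll, token: Token,
                interest: EventSet, opts: PollOpt) -> io::Result<()> {
        // Obtain a new Registration tied to this Poll. The internal
        // EventSet::dropped() event is added to `interest` automatically.
        self.registration.set(Some(Registration::new(poll, token, interest, opts)));
        Ok(())
    }
}
```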
Re-registration

A `Registration`'s `interest` & `PollOpt` can be changed by calling `Registration::update`. The `Poll` reference will not be used, but ensures that `update` is only called from a single thread (the thread that owns the `Poll` reference). This allows safe mutation of `interest` and `opts` without synchronization primitives.

`Registration` will include the internal `EventSet::dropped()` event in the interest.
Triggering readiness notifications

Readiness can be updated using `Registration::set_readiness` and `Registration::unset_readiness`. These can be called concurrently. `set_readiness` adds the given events to the existing `Registration` readiness. `unset_readiness` subtracts the given events from the existing `Registration` readiness. `Registration::set_readiness` also ensures that the registration node is queued for processing.

Delaying readiness
In order to support timeouts, `Registration` has the ability to schedule readiness notifications using `Registration::delay_readiness(events, timeout)`.

There is a big caveat: there is no precise timing guarantee. A delayed readiness event could be triggered much earlier than requested. Also, the readiness timer is coarse grained, so by default it will be rounded to 100ms or so. The one guarantee is that the event will be triggered no later than the requested timeout plus the duration of a timer tick (100ms by default).
Queuing `Registration` for processing

First, atomically update `Registration.queued`, attempting to set the MSB. Check the current delay value; if the requested delay is less than the current one, update the delayed portion of `queued`. If the MSB was successfully set, then the current thread is responsible for queuing the registration node.
Dropping `Registration`

Processing a drop is handled by setting readiness to an internal `Dropped` event. The `Registration` value itself does not own any data, so there is nothing else to do.
Polling

On `Poll::poll()` the following happens:

1. Reset the events on self.
2. Atomically take ownership of the readiness queue.
3. Process the dequeued nodes.
4. Process all delayed readiness nodes that have reached their timeout. The code for this is similar to the current timer code.
Integrating with `Selector`

The readiness queue described above is not intended to replace socket notifications on epoll / kqueue / etc. It is to be used in conjunction with them. To handle this, `PollReadinessQueue` will be able to wake up the selector. This will be implemented in a similar fashion as the current channel implementation: a pipe will be used to force the selector to wake up. The full logic of `poll` combines waiting on the system selector with draining the readiness queue.

Implementing `mio::Channel`
`Channel` is an mpsc queue such that, when messages are pushed onto the channel, `Poll` is woken up and returns a readable readiness event for the `Channel`. The specific queue will be supplied on creation of the `Channel`, allowing the user to choose the behavior around allocation and capacity.

When a new message is sent over the channel, readiness is set. `Poll` will then wake up with a readiness notification, and the user can "poll" messages off of the channel.
Implementing `Timer`

`Timer` is a delay queue. Messages are pushed onto it with a delay, after which the message can be "popped" from the queue. It is implemented using a hashed wheel timer strategy, which is ideal in situations where a large number of timeouts are required and the timer can use coarse precision (by default, 100ms ticks).

The implementation is fairly straightforward. When a timeout is requested, the message is stored in the `Timer` implementation and `Registration::delay_readiness` is called with the timeout. There are some potential optimizations, but those are out of scope for this proposal.
Windows

The readiness queue described in this proposal would replace the current Windows specific implementation. The proposed implementation would be more efficient, as it avoids locking and uses lighter weight data structures (mostly, linked lists vs. vecs).
Outstanding questions

The biggest outstanding question is what to do about `EventLoop`. If this proposal lands, then `EventLoop` becomes a very light shim around `Poll` that dispatches events to the appropriate handler function. The entire implementation would look something like:
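A sketch of how thin that shim could be (method names and the `Handler::ready` dispatch point are assumptions based on the proposal):

```
// Hypothetical shim; for illustration only.
impl<H: Handler> EventLoop<H> {
    pub fn run(&mut self, handler: &mut H) -> io::Result<()> {
        while self.run {
            // Wait for readiness on sockets, timers, and channels alike.
            try!(self.poll.poll(&mut self.events, None));

            // Dispatch every event through the single `ready` entry point.
            for event in self.events.iter() {
                handler.ready(self, event.token(), event.kind());
            }
        }
        Ok(())
    }
}
```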
It will also not be possible to maintain API compatibility. `Handler::notify` and `Handler::timeout` will no longer exist, as `EventLoop` does not know the difference between those two types and other `Evented` types whose notifications are delivered through `ready`.
The options are:

- Update `EventLoop` to follow the new API and keep the minimal implementation.
- Deprecate `EventLoop` and make `Poll` the primary API.
- Keep an `EventLoop` variant that accepts allocations (though this would be post 1.0).
Alternatives

It is possible to implement `Timer` and `Channel` as standalone types without having to implement the readiness queue. For `Timer`, it would require using timerfd on Linux and a timer thread on other platforms. The disadvantage here is minor for Linux, as syscalls can be reduced significantly by only using `timerfd` to track the next timeout in the `Timer` vs. every timeout in the `Timer`.

However, on platforms that don't have `timerfd` available, a polyfill will be needed. This can be done by creating a pipe and spawning a thread. When a timeout is needed, send a request to the thread. The thread writes a byte to the pipe after the timeout has expired. This has overhead, but again it can be amortized by only using the thread/pipe combo for the next timeout in the `Timer` vs. every timeout. Though, there may be some complication with this amortization when registering the `Timer` using level triggered notifications.

For `Channel`, on the other hand, a syscall would be needed for each message enqueued and dequeued. The implementation would be to have a pipe associated with the `Channel`. Each time a message is enqueued, write a byte to the pipe. Whenever a message is dequeued, read a byte.