Should we merge the Stream and Channel interfaces? #959
I quite agree about restricting […]. I'd argue that the distinguishing element of […]: you never want to read a single byte from a stream (even though external reality sometimes forces you to), and you never want to read multiple objects at a time from a channel (except maybe for super-high performance, like in the old […]). Yes, […]
Interesting idea! My initial thought is that people (beginners especially) seem to often have trouble understanding that, no really, TCP/TLS/etc. doesn't preserve message boundaries, not even a little bit. Having that distinction be obvious in the type system and in the names of the functions you call (especially […]).

I think it probably is true that if you want to iterate through "messages" on a stream, correctness-focus says you should have to specify how those messages are delimited. If we have an interface/mechanism/library for that, it becomes trivial to have an […].

A compromise approach might be to make Stream mostly duck-type-compatible with Channel (rename […]).
Another factor to consider: how would all of this interact with passing credentials or FDs over a UNIX domain socket? These show up at a specific byte offset in the stream (though it gets fuzzy if they're sent alongside more than one byte of normal data -- from experimentation on Linux, if you send the control message alongside multiple bytes of data, the receiver will get it in the recvmsg() call that consumes the first byte of that data, and a single call to recvmsg() won't bridge the gap between the last byte of that data and the first byte after it). I guess there's a similar consideration with TCP urgent data, though I don't know if anyone actually uses that.
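For concreteness, here's a standard-library sketch of the behavior described above (not trio code; plain `socket.sendmsg`/`recvmsg` over a POSIX socketpair): the SCM_RIGHTS control message is delivered by the `recvmsg()` call that consumes the first byte of the data it was sent with.

```python
import array
import os
import socket

# Sketch: pass a file descriptor alongside ordinary stream data over a
# UNIX domain socketpair. POSIX-only; variable names are illustrative.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r_fd, w_fd = os.pipe()  # some FD to pass, just for demonstration

parent.sendmsg(
    [b"abc"],  # 3 bytes of normal data...
    # ...plus an SCM_RIGHTS control message carrying one FD
    [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [w_fd]))],
)

# The recvmsg() that consumes the first byte of that data delivers the FD.
data, ancdata, flags, addr = child.recvmsg(
    3, socket.CMSG_SPACE(array.array("i").itemsize)
)
fds = array.array("i")
for level, ctype, cdata in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
        fds.frombytes(cdata[: len(cdata) - (len(cdata) % fds.itemsize)])

os.write(fds[0], b"hi")  # the received FD is a working duplicate
echoed = os.read(r_fd, 2)
print(data, echoed)  # b'abc' b'hi'
```

This is exactly the kind of "data plus out-of-band payload at a byte offset" coupling that is awkward to express in a pure byte-stream abstraction.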
On further thought, there is another wrinkle to setting the […]

Right, that's how we think of it right now. If we switched to using […]
I think there are […] outside the scope of the abstract stream/channel/whatever interface? Certainly the way Trio works right now, you can do those things, but not using […]. I guess the case where this might be tricky is if you want to use […].

So I think we can divide the stuff in this thread into two major ideas.

**First major idea**

Maybe our bytestream interface would be more friendly if we made […]. If we do this, then there's an open question about whether we should make […].

**Second major idea**

Maybe we should somehow connect streams and channels more closely in terms of names/types/concepts. Along these lines, here's another possibility to think about: rename […]
If we go this way, then should EOF be indicated by `receive_some` returning `b""`, or by raising `EndOfChannel`?
On 04.03.19 04:08, Nathaniel J. Smith wrote:
> Maybe we should somehow connect streams and channels more closely in
> terms of names/types/concepts
Well … I'm still convinced that a clean separation of "one thing at a
time" and "multiple things at a time, without a boundary between them"
makes a lot of sense, conceptually as well as for type safety and
whatnot. I can't offhand think of any situation where you'd want to use
one in place of the other.
Also: How would you distinguish between a `Stream[bytes]` and a
`Channel[bytes]`, if `Stream` and `Channel` end up being the same class?
> If we go this way, then should EOF be indicated by `receive_some`
> returning `b""`, or by raising `EndOfChannel`?
Frankly I consider that a Unix wart. I mean, no data available on a
stream raises an error (`EAGAIN`), but a closed stream returns an empty
string?? Also, on a packetized bytestream (which Unix doesn't have,
historically), how do you distinguish between EOF and an empty packet?
The other way around seems way more logical to me.
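To make the contrast concrete, here's a toy sketch (all names hypothetical, not trio's API) showing why the empty-bytes sentinel can't coexist with empty packets, while an exception-based EOF keeps `b""` available as a legitimate message:

```python
# Toy sketch contrasting the two EOF conventions discussed here:
# Unix-style receive_some() returning b"" at EOF, versus raising an
# EndOfChannel-style exception. All names are illustrative.

class EndOfChannel(Exception):
    """Raised instead of returning a sentinel when the peer is done."""

class ToyReceiver:
    def __init__(self, chunks):
        self._chunks = list(chunks)

    def receive_some(self):
        # Unix convention: empty bytes means EOF...
        # ...so an empty packet is indistinguishable from EOF.
        return self._chunks.pop(0) if self._chunks else b""

    def receive(self):
        # Exception convention: EOF is unambiguous, and b"" stays
        # available as a legitimate (empty) message.
        if not self._chunks:
            raise EndOfChannel
        return self._chunks.pop(0)

r = ToyReceiver([b"spam", b"", b"eggs"])
got = []
while True:
    chunk = r.receive_some()
    if chunk == b"":
        break
    got.append(chunk)
print(got)  # [b'spam'] -- the empty packet was mistaken for EOF
```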
-- Matthias Urlichs
What do you think of […]?

EAGAIN is a weird quirk from retrofitting non-blocking operation onto an originally blocking model... also you can't exactly use […]. But anyway, for blocking operations, which is what Trio uses as a model, Unix and C and Python are all consistent about using […].

I guess the other question is, which approach leads to more convenient code in common cases. Supporting […]
Heh, I was just looking at this. Linux actually has two not-quite-standard packetized byte streams: Unix domain sockets with […]
Thinking about this more, I realized that there's also a very straightforward conceptual justification for […]. Think of […]. OK, so then what happens if we take this logic, and apply it to a case where the first call to […]? Another way to think of it: […]
Hiya, I'm unable to make a LoopbackStream that connects to my ConsoleStream. The desired result is a simple echo server in memory without TCP. Type in a line, hit enter, line is redisplayed, repeat. I can wire up Console to do this by itself but my goal is to later replace the 'server' logic with something non-trivial. Seems desirable to have server logic for a stream be independent of what pumps the stream. Is there an easy solution I missed? I'm about to force Memory*Channel into a StapledStream and that doesn't feel right.
Totally missed it. First thing I did with trio was make a Stream do the frame problem: receive many type As and send many type Bs, kinda like Twisted's bytes->string NetstringReceiver. I'd like trio to have one base class so I can easily stack these transforms (Protocols).

I can see how one (Streams?) was dominated by getting multiple bytes to/from the OS and the other (Channel) dealt with single chunks (objects). So yes, it seems that Sockets, Process, Files and MemoryQueues can all share top-level calls. Not sure if making all options possible reaps the benefits of sharing. What fits my head is an app-facing baseclass with receive_some(N) that only returns with 1 to N things, possibly bytes or unichars or dicts or .... The lower layers seem a better place to deal with […].

Anyhoo, I'm rambling. I'm way excited about what trio has and can accomplish. Once I get out of the kiddie pool I'll attempt to contribute more than freshman opinions.
@JefffHofffman Hey, welcome to Trio! 👋

There's a basic philosophical difference between Twisted and Trio that I suspect might be tripping you up. In Twisted, usually it's Twisted that takes charge of making things actually happen. Your job is to build the car, and then Twisted drives it. In Trio, it's much more like "regular" Python, where if you want something to happen, you write a loop, or call a function, or something – you have to drive your car yourself :-). This doesn't mean you can't separate your protocol parsing logic from your stream pumping logic – see #796, and sans-io.readthedocs.io. But you will generally have some loop somewhere.

In Trio's approach, "connecting" two streams doesn't make a lot of sense... you could write a loop to proxy between them, but for an echo server it'd be a lot simpler to just proxy the original stream's input to its output directly :-). The main way we do abstraction is composition: if you want to add TLS encryption to a stream, you wrap it in an `SSLStream`.

BTW, if you're interested in console I/O, you might want to check out #174, which is our tracking issue for console I/O support in Trio. (The last comment in particular has a summary of what needs to happen.)
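For concreteness, here's a minimal sketch of that drive-it-yourself echo loop. `FakeStream` is a stand-in (asyncio is used only to run the coroutine); with trio, `stream` would be a real Stream and the loop body would look the same:

```python
import asyncio

class FakeStream:
    """Stand-in fake; receive_some/send_all mirror trio's Stream methods."""
    def __init__(self, incoming):
        self._incoming = list(incoming)
        self.sent = []

    async def receive_some(self, max_bytes):
        return self._incoming.pop(0) if self._incoming else b""

    async def send_all(self, data):
        self.sent.append(data)

async def echo_loop(stream):
    # The "loop somewhere" from above: pump input straight back to output.
    while True:
        data = await stream.receive_some(4096)
        if not data:  # EOF
            break
        await stream.send_all(data)

stream = FakeStream([b"type a line\n", b"another\n"])
asyncio.run(echo_loop(stream))
print(stream.sent)  # [b'type a line\n', b'another\n']
```

Replacing the `send_all` call with real server logic is then just an edit to the loop body, independent of what kind of stream is being pumped.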
Hmm, adding batched […]
Excellent links. Thx! I think I hit the framing problem early on and attempted to make an internal Stream do `send_all(b"line\n")` and `receive_some(1) -> str("line")`. I thought it could be used not only as a two-pronged plug into the OS, but also as an internal wire to do conversion (a Queue with different in/out types). Probably not the intended usage. I had no difficulties with the pump-it-yourself approach. I think I was aiming for composition with a more functional style […]. Is […]? Perhaps since Channel can do nesting, it's not a big deal. I'll look at the framing topic on how to convert.
Another thought (which I meant to post a week ago but forgot to): separating […]. Yes, it's very convenient to be able to use the same(-sounding) methods for both […]
Occasionally I work with a Subscriber that receives many small messages. Like the rationale for not wanting to go through an await for each byte, there's a use case to process these messages in a batch from one await. Having `*_many()` convenience methods that […]. I'll try out […]
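A sketch of that batching idea (hypothetical `receive_many` helper; `asyncio.Queue` stands in here for a trio memory channel, with `QueueEmpty` playing the role of `WouldBlock`): block for the first message, then greedily drain whatever is already queued, so one `await` hands back a whole batch.

```python
import asyncio

async def receive_many(queue, max_items):
    # Block until at least one message arrives...
    batch = [await queue.get()]
    while len(batch) < max_items:
        try:
            # ...then drain already-queued messages without awaiting.
            batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            break
    return batch

async def main():
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    return await receive_many(q, max_items=3)

batch = asyncio.run(main())
print(batch)  # [0, 1, 2]
```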
Hey all, Raw newbie to trio here. (Hi Nathaniel, long time since monotone :) ) FWIW, and since I didn't see anyone else directly respond to the first idea / second idea suggestion here, I really like the first and am lukewarm at best about the second. For the first, it feels like the perfect little extra affordance; most of the time I don't have any idea what max_nbytes should be and am extremely unlikely to do the testing to figure it out. Having the async for loop just work and take that decision off my hands is perfect. As for major idea two; I think there is too much painful history and collective knowledge around all the arcana of unix socket behavior that the existing streams interface is heir to. Turning everything into channels makes it that much harder for people to reason about how it matches up under the covers. Trio looks super nice; thanks to all the contributors!
This came out of discussion in python-triogh-959
- Remove max_refill_bytes from SSL (python-trio/trio#959)
- Add synchronous close and context managers to the memory channels (python-trio/trio#1797)
This is a weird/radical idea, but @oremanj's comment here (in response to @Badg) raised some red flags for me:

> At the conceptual level, the output from a process is exactly a `Stream` (or `ReceiveStream` or whatever), and if people are trying to jump through hoops to avoid using our `Stream` ABC to represent the Stream concept, then that seems like a bad sign!

So, let's at least go through the thought experiment: what if we got rid of `Stream` and used `Channel[bytes]` instead?

**Basic usability**
Remembering which is a "stream" and which is a "channel" is super annoying. Merging them would eliminate this problem. Also annoying: constantly going through the pointless ritual of inventing a made-up buffer size (@oremanj's `ARBITRARILY_CHOSEN_POWER_OF_TWO`). And writing that `while True` loop over and over is also annoying. Making it a plain `await channel.receive()` or `async for chunk in channel:` would eliminate these annoyances.

**Conceptual level**
For me, the major conceptual difference is that I've thought of `Channel` as inherently preserving object boundaries, i.e., whatever I pass to `send` is what comes out of `receive`. In this way of thinking, a `Stream` is equivalent to a `Channel[single_byte]`, but since handling single bytes individually would be inefficient, it uses batched-send and batched-receive operations. If we do decide to merge `Stream` and `Channel`, then we'd have to change this, and start saying that some `Channel`s don't preserve the 1-to-1 mapping between `send` and `receive`.

I'm not sure how I feel about this. It's certainly doable on a technical level. But conceptually – it feels weird to say that a websocket and a TCP socket are both `Channel[bytes]`, given that one is framed and the other isn't – that's a fundamental difference in their usage. (Right now one is `Channel[bytes]` and the other is `Stream`.) It would mean a `Uint32Framing` adaptor doesn't convert a `Stream` into a `Channel[bytes]`, it converts a `Channel[bytes]` into another `Channel[bytes]`. And that a TCP socket and a UDP socket have the same type. Intuitively this feels weird. It seems like this is a distinction you want to expose, and emphasize, on the type level.

An interesting partial counter-example would be h11: an `h11.Connection` object is essentially a `Channel[h11.Event]`: you send a sequence of objects like `h11.Request`, `h11.Data`, `h11.EndOfMessage`, and then receive a sequence of similar objects. Sometimes, the objects on the sender and receiver sides match 1-to-1, like `Request` and `EndOfMessage`. But sometimes they don't, like `Data`, which might be arbitrarily rechunked! So if you want to treat `h11.Connection` as a `Channel[h11.Event]`, it's sort of simultaneously a 1-to-1 `Channel` and also a re-chunking `Channel`.

One possibility is to distinguish them somehow at the type level, but make them more consistent, or identical, in terms of the operations they happen to implement.
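For concreteness, here's a sans-io sketch of what a `Uint32Framing`-style adaptor has to do (the name comes from the discussion above; the implementation is illustrative): accept arbitrarily rechunked bytes on one side and re-emit the sender's exact frames on the other, buffering any over-read for next time.

```python
import struct

def encode_frame(payload: bytes) -> bytes:
    # Big-endian uint32 length prefix, then the payload.
    return struct.pack(">I", len(payload)) + payload

class Uint32FrameDecoder:
    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Add rechunked bytes; yield each complete framed message."""
        self._buf += data
        while True:
            if len(self._buf) < 4:
                return
            (length,) = struct.unpack_from(">I", self._buf)
            if len(self._buf) < 4 + length:
                return  # over-read stays buffered until more data arrives
            yield bytes(self._buf[4 : 4 + length])
            del self._buf[: 4 + length]

decoder = Uint32FrameDecoder()
wire = encode_frame(b"hello") + encode_frame(b"") + encode_frame(b"world")
# Deliver in awkward 3-byte chunks, as TCP is free to do:
messages = [m for i in range(0, len(wire), 3)
            for m in decoder.feed(wire[i : i + 3])]
print(messages)  # [b'hello', b'', b'world']
```

Note that the unframed side really does need the stronger "never returns more than asked" guarantee nowhere: the decoder simply buffers whatever it gets, which is why framed-on-top-of-unframed is easy while the reverse is not.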
In Liskov terms, a 1-to-1/framed `Channel[bytes]` IS-A rechunking/unframed `Channel[bytes]` – all it does is add stronger guarantees. (Actually, they are not Liskov-compatible – see below!) So merging them would at least have that going for it. But I don't put a huge amount of weight on that – in practice they're used very differently.

**Technical level**
Currently we have:

- `Stream`: `send_all`, `wait_send_all_might_not_block`, `send_eof`, `receive_some`
- `Channel`: `send`, `send_nowait`, `receive`, `receive_nowait`, `clone`, iteration

Problematic bits:
- `wait_send_all_might_not_block`: we actually have an issue open about possibly getting rid of this: Should send_all automatically do wait_send_all_might_not_block after sending the data? #371. If we do that, it would remove this problem :-)
- `send_eof`: we could add this to a bidirectional `Channel`, it wouldn't be too weird
- `*_nowait`: we've been considering moving these to memory channels specifically, instead of the generic `Channel` interface
- `clone`: we've been considering dropping this (same link as for `*_nowait`)

So all those might get sorted out? And `send_all` and `send` are already basically the same. So that just leaves `async def receive()` versus `async def receive_some(max_nbytes)`. And `max_nbytes` is also the obstacle to having iteration. ...basically this is THE core distinction between the two APIs. So, what do we think about `max_nbytes`.
Specifying `max_nbytes` manually all the time is tiresome and annoying, as noted above.

Also, I note that Twisted/asyncio/libuv always handle `max_nbytes` internally, and the user just deals with whatever they get.

Most `Stream` users basically want to read everything, and the only thing `max_nbytes` affects is efficiency, not correctness. In practice it's almost always set arbitrarily. I've never seen anyone even benchmarking different values, except in extreme cases like trying to transmit multiple gigabytes/second through Python. For `SocketStream`, there's some penalty for setting it too big – Python has to first allocate a `max_nbytes`-sized buffer, then `realloc` it down to size (see). And of course if you set it too small then you pay some overhead from doing lots of small `recv`s instead of one big one. So you want some kind of "not too big, not too little" setting.

For other `Stream` implementations, this doesn't apply – for example, `SSLStream.receive_some` forces you to pass `max_nbytes`, and that controls how many bytes it reads out of its internal decrypted data buffer at any one time, but this has no effect at all on how much data it reads at a time from the underlying socket when it needs to refill its buffer. That's controlled by the constructor argument `SSLStream(max_refill_bytes=...)`.

There are also cases where there is a "natural" size to return from `receive_some`. For example:

- In most applications, `SSLStream` might as well return whatever data has already been decrypted and is sitting in memory, instead of spending instructions messing around with buffers.
- A `GunzipStream` might as well return whatever data it got from decompressing the last chunk it read. (This can avoid some non-trivial complications: urllib3.response.GzipDecoder is accidentally quadratic, which allows a malicious server to DoS urllib3 clients urllib3/urllib3#1467)
- When reading from the Windows console, the underlying representation is natively Unicode. This means that when we want to read bytes, we have to transcode into utf-8, which in turn means that it may be impossible to read less than 4 bytes at a time (at least without nasty buffering tricks)
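The `GunzipStream` bullet above can be sketched with the stdlib's `zlib` (the class is hypothetical, not a trio API): each read returns whatever the next compressed chunk happens to decompress to, so the caller never invents a `max_nbytes`.

```python
import zlib

class GunzipChunks:
    """Hypothetical sketch: a "natural chunk size" decompressing reader."""
    def __init__(self, compressed_chunks):
        self._chunks = iter(compressed_chunks)
        self._d = zlib.decompressobj()

    def receive_some(self) -> bytes:
        for chunk in self._chunks:
            out = self._d.decompress(chunk)
            if out:
                return out  # the "natural" amount: one chunk's worth
        # Input exhausted: b"" once the stream is fully drained.
        return b"" if self._d.eof else self._d.flush()

payload = b"hello world " * 1000
compressed = zlib.compress(payload)
# Feed the decompressor in small slices, as a socket might deliver them:
stream = GunzipChunks(
    compressed[i : i + 64] for i in range(0, len(compressed), 64)
)
received = bytearray()
while True:
    data = stream.receive_some()
    if not data:
        break
    received += data
print(bytes(received) == payload)  # True
```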
Given that most people don't tune it at all, I bet if we did a bit of benchmarking then we could pick a default `SocketStream` `recv` size that would work better than 99% of what people currently do. And I guess we'd make this an argument to the `SocketStream`/`SocketChannel` constructor, exactly like how `SSLStream` currently works, so people could override it if they want. This could complicate code where the stream is constructed implicitly, though, like `p = Process(..., stdout=PIPE)` – if you don't want `p.stdout` to use the default `max_nbytes` setting, then how do you specify something different? Some options:

- We could simply set the default and tell everyone to live with it.
- We could add some way to pass this through, like `Process(..., stdout=NewPipe(max_nbytes=...))`.
- We could provide some API to mutate it, like `process.stdout.max_nbytes = new_value`.
- We could tell people with this unusual requirement that they should create their own pipe with whatever settings they want (this functionality is somewhat needed anyway, see Add support for talking to our stdin/stdout/stderr as streams #174, support for windows named pipes #824), then pass in one end by hand.
What about cases where correctness does depend on setting `max_nbytes`? It can never be the case that setting `max_nbytes` too small affects correctness, because `Stream` is already free to truncate `max_nbytes` to some smaller value if it wants to – no guarantees. But, we do make a guarantee that we won't return more than `max_nbytes`.

That... actually is important for correctness in some cases. For example, from this comment:

> This is why we can't quite think of our current `Channel[bytes]` as being a sub-interface of `Stream` – in this one very specific case, `Stream` genuinely has slightly more functionality.

This is probably a rare occurrence in practice. Most protocols need an internal buffer anyway, so any over-reads just go into the buffer for next time. And sometimes you want to hand off between protocols, e.g. `SSLStream.unwrap`, or something like switching from HTTP/1.1 to WebSocket... but in those cases we generally don't try to avoid over-reading from the underlying stream. Instead, we just accept that some over-read may have happened, and give it to the user to deal with (example 1, example 2). And in many cases, it's actually impossible to avoid this in any efficient way – e.g. if you have a newline-delimited protocol, then you have no idea where the next line boundary will be, so the only way to avoid over-read is to read one byte at a time, which is way too inefficient. In theory we could avoid it for TLS (which is length-prefixed), or for other length-prefixed protocols (like `Uint32FramedChannel`), but it doesn't seem worth it in most cases.

The special thing about `trio.input` is that it's sharing the process's `stdin` with who-knows-what-else, so we can't coordinate our buffer usage with other users, and are reduced to this kind of stone-age `receive_some(1)` technique.
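Here's a toy sketch of that stone-age technique (the byte source is a fake, for illustration): by reading one byte at a time we never consume past the delimiter, so everything after it stays available to whoever else shares the handle.

```python
class FakeByteSource:
    """Stand-in for a shared handle we must not over-read."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def receive_some(self, max_nbytes: int) -> bytes:
        chunk = self.data[self.pos : self.pos + max_nbytes]
        self.pos += len(chunk)
        return chunk

def read_line_exactly(source) -> bytes:
    """Consume through the b'\\n' delimiter, returning the line without it."""
    line = bytearray()
    while True:
        b = source.receive_some(1)  # one "syscall" per byte: slow but safe
        if not b or b == b"\n":
            return bytes(line)
        line += b

src = FakeByteSource(b"first line\nrest stays untouched")
line = read_line_exactly(src)
print(line)                 # b'first line'
print(src.data[src.pos:])   # b'rest stays untouched'
```

This depends on the "never return more than `max_nbytes`" guarantee: a `receive` that could return extra buffered bytes would silently steal input from the other users of `stdin`.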
Treat this as a special case for
trio.input
, and implement it using some specialized tools. E.g., make sure thatopen_stdin
can safely be called repeatedly within a single program and returns different handles that don't interfere with each other, and then havetrio.input
doasync with trio.open_stdin(max_nbytes=1): ...
. Or provide some low-levelreceive_some_from_stdin
function, or something.Have some
Channel[bytes]
implementations wherereceive
takes an optionalmax_nbytes
argument, as a matter of convention.Same as previous point, but also formalize this convention as a named sub-interface – though I'm having trouble thinking of a good name! This might help with our problem up above, about wanting some more informative way to describe the type of
Uint32Framing
? But of course proliferating names always has its own cost, especially if the names are awkward.Also, naming the interface creates an interesting challenge: how do you type
StapledChannel
? You wantStapledChannel.receive
to have the same signature asStapledChannel.receive_channel.receive
, and at runtime this is easy – just use*args, **kwargs
. But if we name this sub-interface, then the proper static type forStapledChannel
depends on the static type of itsReceiveChannel
. I'm not sure whether givingStapledChannel
the right static type matters or not.The text was updated successfully, but these errors were encountered: