RFC: Text/Unicode oriented streams #57
fn write_str(&mut self, buf: &str) -> IoResult<()>;

// These are similar to Writer, but based on `write_str` instead of `write`.
fn write_char(&mut self, c: char) -> IoResult<()> { ... }
I would (perhaps naively) expect that this would be the fundamental method. Why would `write_str` be it instead?
It really could be either. See the paragraph below about rust-lang/rust#7771
Wouldn't a text reader for a multi-byte encoding need a tiny buffer (three bytes for UTF-8), in case the underlying byte-oriented stream only gives you part of a character in a read? Blending it in with (I speak with no experience of trying to actually implement such a thing, and I haven’t thought it through thoroughly by any means.)
If this would ever get into the standard library, we may also want to have a separate
@chris-morgan Yes, multi-byte decoders in rust-encoding do have some buffering for this reason, but they do so with custom code (e.g. a couple of struct fields) rather than something like
@lifthrasiir Yes, that sounds like the right thing to do for decoding or encoding stream wrappers.
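To make the buffering point above concrete, here is a minimal sketch of a decoder that carries the trailing bytes of an incomplete character from one read to the next. It is written in present-day Rust rather than the 2014 dialect used in this thread, and the names `Utf8Chunks` and `feed` are hypothetical; it is not how rust-encoding is implemented.

```rust
/// Hypothetical incremental UTF-8 decoder. It keeps the trailing bytes of an
/// incomplete character between calls (at most 3 bytes for well-formed input).
#[derive(Default)]
pub struct Utf8Chunks {
    pending: Vec<u8>,
}

impl Utf8Chunks {
    /// Decodes `input`, appending every complete character to `out`.
    /// Returns `Err(())` if the bytes seen so far are not valid UTF-8.
    pub fn feed(&mut self, input: &[u8], out: &mut String) -> Result<(), ()> {
        self.pending.extend_from_slice(input);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                out.push_str(s);
                self.pending.clear();
                Ok(())
            }
            Err(e) => {
                // `error_len() == None` means the data merely ended in the
                // middle of a character; anything else is a real error.
                if e.error_len().is_some() {
                    return Err(());
                }
                let valid = e.valid_up_to();
                let prefix = std::str::from_utf8(&self.pending[..valid]).map_err(|_| ())?;
                out.push_str(prefix);
                // Keep only the incomplete tail for the next call.
                self.pending.drain(..valid);
                Ok(())
            }
        }
    }
}
```

Feeding the three bytes of `€` split across two calls yields nothing on the first call and the full character on the second, which is the situation described in the comments above.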
The addition of text streams working over byte streams is something I have wanted for a long time. I'm currently working on an XML processing library for Rust, and I'm missing Java's separation between byte and character streams. BTW, Java's concept of byte/character streams is really worth noting. Apart from a somewhat bloated API, the basic idea is really great. It resembles our current composition of iterators, where top layers add features by wrapping lower layers.
I'm cautiously in favor of this proposal. I would like text-oriented streams, but I'm worried about how it would interact with our existing Just today in IRC it was asked how to append a formatted string onto a
I think that makes sense.
So any byte writer is also implicitly a Unicode writer that encodes as UTF-8? I’m conflicted about this. On one hand, I would love to be in a world where we can assume everything is UTF-8 and nobody uses any of legacy encodings anymore. On the other hand this feels a bit sloppy: blurring the distinction between Unicode text and bytes could lead to Mojibake and related bugs. Python says "Explicit is better than implicit." So I think we should either take your suggestion, or go the opposite direction: remove the
Yeah, that’s part of rust-encoding. I suppose that at some point we’ll want to distribute rust-encoding with the compiler, although maybe before that we’ll only import parts of it such as the adaptor you describe here.

pub trait TextReader {
    fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>;
It would be nice if `read` did not necessarily have to write to a `StrBuf` (and incur the related heap allocations at some point). It could be parameterized over some trait that can append `&str`s (or trait objects, if `TextReader` will be largely used as a trait object). It could even be a `TextWriter`, although that could be a broader trait than is necessary.
Also, is `max_bytes` the maximum number of UTF-8 bytes that can be read, or the maximum number of bytes from the underlying encoded data?
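As a rough illustration of the "parameterize over a trait that can append `&str`s" idea, the sketch below uses present-day Rust, where `std::fmt::Write` happens to be such a trait (it has `write_str` and is implemented for `String`). The function `read_into` and the `CharCounter` sink are hypothetical names introduced only for this example, and the source is a plain `&str` to keep the sketch short; here `max_bytes` counts UTF-8 bytes handed to the destination.

```rust
use std::fmt;

/// Hypothetical destination-agnostic read: appends at most `max_bytes` UTF-8
/// bytes of `source` to any `fmt::Write` sink and returns how many bytes
/// were written. A character is never split.
pub fn read_into<S: fmt::Write>(
    source: &str,
    dest: &mut S,
    max_bytes: usize,
) -> Result<usize, fmt::Error> {
    let mut end = max_bytes.min(source.len());
    // Back up to a character boundary (index 0 is always a boundary).
    while !source.is_char_boundary(end) {
        end -= 1;
    }
    dest.write_str(&source[..end])?;
    Ok(end)
}

/// A sink that performs no heap allocation: it only counts characters.
pub struct CharCounter(pub usize);

impl fmt::Write for CharCounter {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.0 += s.chars().count();
        Ok(())
    }
}
```

The same function can fill a `String` or feed `CharCounter`, so callers that do not need an owned buffer avoid the allocation.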
`max_bytes` is meant to be the number of UTF-8 bytes added to `buf`, so that you could pre-allocate with `StrBuf::reserve_additional` before calling `read`.
I’m not very happy with the design of `TextReader::read` here, but I don’t know what else would be better. Suggestions welcome.
After considering the options, I think this is the best. From
pub struct UTF8Writer<W>(pub W);
impl<W: Writer> Writer for UTF8Writer<W> { ... }
impl<W: Writer> TextWriter for UTF8Writer<W> { ... }
And do the equivalent for
Addendum: It occurs to me that this still doesn't handle
@kballard That’s what I had in mind, except without
@SimonSapin There's no reason to not implement
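For readers unfamiliar with the wrapper pattern under discussion, here is a minimal sketch in present-day Rust. It mirrors the two impls in the `UTF8Writer` snippet above; the `TextWriter` trait shown is a hypothetical stand-in for the RFC's trait (using `std::io::Result` instead of `IoResult`), and whether the byte-level impl should exist at all is exactly what the thread is debating.

```rust
use std::io::{self, Write};

/// Hypothetical modern counterpart of the RFC's `TextWriter`.
pub trait TextWriter {
    fn write_str(&mut self, s: &str) -> io::Result<()>;

    /// Default method built on `write_str`, as in the RFC.
    fn write_char(&mut self, c: char) -> io::Result<()> {
        let mut buf = [0u8; 4];
        self.write_str(c.encode_utf8(&mut buf))
    }
}

/// Explicit opt-in: wrapping a byte writer states that text written
/// through it is encoded as UTF-8.
pub struct Utf8Writer<W>(pub W);

/// Byte-level access is forwarded unchanged.
impl<W: Write> Write for Utf8Writer<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.write(buf)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.0.flush()
    }
}

/// Text is encoded as UTF-8, which is already the in-memory form of `&str`.
impl<W: Write> TextWriter for Utf8Writer<W> {
    fn write_str(&mut self, s: &str) -> io::Result<()> {
        self.0.write_all(s.as_bytes())
    }
}
```

With this shape a bare byte writer is not a text writer; the caller has to wrap it, which keeps the choice of encoding visible at the call site.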
Regarding
This is a bit complicated. I suspect "read N bytes and interpret as a string" is more useful than "read and produce N utf-8 bytes", so the meaning of
It's also complicated because any sort of reading like this (for anything besides single-byte encodings, e.g. ASCII) is going to require at least a small internal buffer, in case the read bytes don't constitute an entire character. The alternative is requiring the read methods to take a precise byte count and error out if this doesn't end up hitting a character boundary, and that kind of sucks. Using character counts instead of byte counts doesn't really help either. I think we have to take it as a given that implementations of this may require at least a small internal buffer (e.g. a UTF-8 reader requires at least a
Given all that, I think the fundamental reading method would be
fn read_to_utf8(&mut self, v: &mut [u8], max_bytes_read: uint) -> IoResult<(uint, uint)>;
This reads to a fixed-size
The biggest issue I see with this method is that, because it's reading to a
This could possibly be mitigated by taking some sort of trait that a) produces stronger guarantees about UTF-8 data (e.g. by using
pub trait TextReaderDestination {
    /// Returns the receiver as a `&mut [u8]`.
    ///
    /// Data written to this slice must be valid UTF-8.
    unsafe fn as_utf8_dest_slice<'a>(&'a mut self) -> &'a mut [u8];
}
impl<'l> TextReaderDestination for &'l mut [u8] {
    unsafe fn as_utf8_dest_slice<'a>(&'a mut self) -> &'a mut [u8] {
        self.as_mut_slice()
    }
}
and then we can declare our function
/// Returns (bytes read, bytes written to `v`)
fn read_to_utf8<T: TextReaderDestination>(&mut self, v: T, max_bytes_read: uint) -> IoResult<(uint, uint)>;
This is functionally equivalent, but requires using
With this function, we can then provide some convenience methods:
/// Returns bytes read
fn read_to_strbuf(&mut self, buf: &mut StrBuf, max_bytes_read: uint) -> IoResult<uint> {
    unsafe {
        let v = buf.as_mut_vec();
        v.reserve_additional(max_bytes_read);
        let l = v.len();
        v.set_len(l + max_bytes_read);
        let s = v.mut_slice_from(l);
        v.set_len(l);
        let (read, count) = try!(self.read_to_utf8(s, max_bytes_read));
        debug_assert!(count <= max_bytes_read);
        v.set_len(l + count);
        Ok(read)
    }
}
/// Reads exactly `bytes_to_read` (this may involve several calls to
/// `read_to_utf8()`).
///
/// Returns an error if the result does not lie on a character
/// boundary.
///
/// This is intended to be used when you know precisely how many bytes
/// should comprise the string. If you want to allow smaller
/// reads or partial character reads, use `read_to_strbuf()`.
fn read_strbuf(&mut self, bytes_to_read: uint) -> IoResult<StrBuf> {
    let mut buf = StrBuf::new();
    let mut read = 0;
    while read < bytes_to_read {
        // Ask only for the bytes that are still missing.
        read += try!(self.read_to_strbuf(&mut buf, bytes_to_read - read));
    }
    Ok(buf)
}
If you want to optimize slightly, we can also have
/// Returns the estimated number of UTF-8 bytes that may be emitted
/// as a consequence of reading `byte_count` bytes.
///
/// This is a guess, the actual value may be higher or lower.
fn estimated_utf8_count_for_byte_count(&self, byte_count: uint) -> uint;
Then we can use this for both the capacity of the vector in
There is: IMO cleanly separating the different types of streams to avoid accidentally treating one as the other is worth the inconvenience in the rare case where you do need to mix them and know what you’re doing. And wrapping the result of
A separate idea is to have a function like
fn read_to_writer<W: TextWriter>(&mut self, w: W, max_bytes_read: uint) -> IoResult<uint>;
as the only mechanism for reading. It would require using a
The downside, of course, is that it can't read directly into the destination but must use a separate allocation first, which it then extracts a
This doesn't work. It borrows the original stream, so you have to throw away your

// These are similar to Writer
fn flush(&mut self) -> IoResult<()> { ... }
fn by_ref<'a>(&'a mut self) -> RefTextWriter<'a, Self> { ... }
This should just return `RefWriter` instead of defining a new type. `RefWriter` can be made to implement `TextWriter` when its parameter does.
Below in `TextReader` you already do this, by returning `RefReader`. That may have been a mistake, of course, but I think it's the right action.
`RefWriter` can be made to implement `TextWriter` when its parameter does.
Oh, I didn’t think of that and thought I needed a separate type. I’ll change this as you suggest.
Uhm, indeed. Then an
RefWriter<W> can implement TextWriter when W does, no need for a separate type.
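The point about reusing one reference wrapper can be sketched as follows, in present-day Rust. The simplified `TextWriter` trait (no error type), the `RefWriter` struct and the `greet` function are hypothetical stand-ins written for this example; they are not the definitions from the RFC or from `std::io`.

```rust
/// Simplified, hypothetical text-writer trait (errors omitted for brevity).
pub trait TextWriter {
    fn write_str(&mut self, s: &str);

    /// Borrows the writer so it can be passed by value without giving it up.
    fn by_ref(&mut self) -> RefWriter<'_, Self>
    where
        Self: Sized,
    {
        RefWriter { inner: self }
    }
}

/// Generic "borrowed writer" wrapper; it knows nothing about text or bytes.
pub struct RefWriter<'a, W> {
    inner: &'a mut W,
}

/// The wrapper is a `TextWriter` whenever the wrapped type is one,
/// so no separate `RefTextWriter` type is needed.
impl<'a, W: TextWriter> TextWriter for RefWriter<'a, W> {
    fn write_str(&mut self, s: &str) {
        self.inner.write_str(s)
    }
}

impl TextWriter for String {
    fn write_str(&mut self, s: &str) {
        self.push_str(s)
    }
}

/// Takes a writer by value, as adaptor-style APIs tend to do.
pub fn greet<W: TextWriter>(mut w: W) {
    w.write_str("hello");
}
```

Calling `greet(out.by_ref())` leaves `out` usable afterwards, because only the short-lived `RefWriter` was moved into the function.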
Yes, it should have
Note that using Plus, there may be internal state that is buffered between calls to the underlying
@zkamsler I'm not sure what you mean. Also,
@kballard I had missed the implementing
That only makes sense when wrapping a byte reader and decoding with a given character encoding. There could be
Like
Again, only when decoding from bytes. In rust-encoding, this is specific to the implementation and not necessarily stored as bytes. I don’t think "bytes to read" makes sense in the
Actually you could skip the check with the unsafe
See my Stream size hints RFC. If both get accepted, text streams would naturally have size hints too.
In these cases, "read N bytes and interpret as a string" and "read and produce N utf-8 bytes" are identical, so they don't care which way the count is interpreted. But for the readers that do perform some sort of conversion, "read N bytes and interpret as a string" is much more useful. Besides, my proposed
That method is
Which every single
I think
Trivial example: I have a packed format that embeds strings as . I need to do something like
fn read_packed_string<R: Reader>(r: &mut R) -> IoResult<StrBuf> {
    let count = try!(r.read_le_u16());
    UTF8Reader(r).read_strbuf(count as uint) // count is exact for read_strbuf()
}
This relies on the count being the "bytes to read". Obviously in this case it works either way, but if this is
Similarly, if I'm trying to process a file as fast as I can, the recommendation (put forth by
Basically, the only situation in which "bytes to emit" is useful is when I'm trying to read into a fixed-length area, e.g. what
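For comparison, the "read N bytes and interpret as a string" use case can be sketched in present-day Rust without any text-reader trait. This assumes the layout implied by the code above (a little-endian `u16` byte count followed by that many UTF-8 bytes); the thread does not show `UTF8Reader` or `read_strbuf` in full, so plain `std::io::Read` calls are used instead.

```rust
use std::io::{self, Read};

/// Reads a little-endian `u16` byte count followed by exactly that many
/// bytes, and interprets those bytes as UTF-8.
pub fn read_packed_string<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let count = u16::from_le_bytes(len) as usize;

    // The count is the number of bytes to consume from the stream,
    // not the number of bytes or characters to produce.
    let mut bytes = vec![0u8; count];
    r.read_exact(&mut bytes)?;

    String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because exactly `count` bytes are consumed, the stream is left positioned at the next field, which is the property the "bytes to read" interpretation is meant to guarantee.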
There's nothing more fundamental than
Well yes, but that misses the point. I should not be able to write code without
Except users won't be implementing
Interesting suggestion. It still requires
The size hint proposed there is the wrong hint. For
@kballard I feel that "bytes to read" blows through the abstraction too much. I’d be OK with "an integer that is somehow related to the amount of data to be read", with the
Yeah, that’s what I meant. Users of the trait having to write
In the trait implementations.
We’d have to trust trait implementations to get this right (just like the coercion to
Oh, I see, sorry for the confusion. Yeah, that would probably be useful too.
How is that any better? Certainly no adaptor built on top of generic I can't come up with a single scenario where "number of bytes to emit" is useful, except in the case of writing to a fixed-size buffer, where If you want to allow If there is in fact some use-case for "read 5 characters from this stream, regardless of how many bytes it takes", then I would say that calls for a
Good news, users don't have to write The proposed
Trust them to get something right that doesn't require
That would leave the same burden on the users of the trait instead of just the implementors. If you want to write to
Yes, if there is a
@SimonSapin Ok, you've convinced me that
fn read_to_utf8<'a, T: TextReaderDestination<'a>>(&mut self, dest: T, max_bytes_read: uint) -> IoResult<(&'a str, uint)>;
Experimentally, it's a little more awkward to have the lifetime on the
I'm also still not entirely comfortable with the fact that the implementer could return a
Theoretically, we could keep the existing
So what are you arguing against, then, when you seem to be arguing against the idea of the count as "number of bytes to read"? In the convenience methods |
FWIW I’m not a fan of
I’m arguing to rename it to something less specific, with implementation-defined semantics.
It's ugly, but it's the only way to require That said, it's a bit less important if
Which, as I have said above, completely breaks If you want an abstraction that doesn't come down to bytes in the end, then you should be working with
Because various users that read text need to know how many bytes they just read. Look at Incidentally, when talking about methods that read repeatedly (like
Discussed at https://github.com/mozilla/rust/wiki/Meeting-weekly-2014-06-10. Although higher-level text handling is important, we don't want to merge this at this time. Primarily, we are concerned that we don't understand this domain well enough to avoid making irreversible mistakes (we're already uncertain about the existing I/O). As this sort of API is a pure addition, we would love to see this used successfully outside of the main repo (maybe in Servo). I'm marking this as postponed.
Alright. I agree that this is still too uncertain. I’ll see with @lifthrasiir about doing this in rust-encoding and get some practical use, possibly in Servo.