Byte Ordered Streams — 2020-01-22
While on vacation in Japan I dug into streaming parsing, for fun. It's something that has piqued my interest ever since I saw Mafintosh's csv-parser package back in 2014.
This led to the creation of the omnom library, and as part of it I came up with an ergonomic way to parse and write numbers with a specified endianness to and from streams. An example:
```rust
use std::fs::OpenOptions;
use omnom::prelude::*;

// Open the file for both reading and writing so we can do either below.
let mut file = OpenOptions::new().read(true).write(true).open("foo.txt")?;
file.write_le(56_i32)?;        // Write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be()?;  // Read a u16 as big-endian from a byte stream.
let n: u32 = file.read_be()?;  // Read a u32 as big-endian from a byte stream.
```
Bytes and Endianness
"Endianness" is a word that you might hear occasionally, but if you've never written network drivers or serialization formats you might not have had to learn about before. Basically what it means is "the order in which bytes form numbers".
A byte is the smallest addressable unit of data in a computer, and in Rust it's represented as a u8. A u16 is two bytes long, which means it's the size of two u8s.
```rust
let b1 = [0u8];      // length 1 byte, size equal to u8
let b2 = [0u8, 0u8]; // length 2 bytes, size equal to u16
```
However, in the second case, do we interpret the bytes from left-to-right, or right-to-left? Endianness refers to the order we read bytes in. Little-endian means we interpret the bytes right-to-left: the least significant byte comes first. Big-endian means we interpret them left-to-right: the most significant byte comes first. Different protocols interpret bytes in different orders, but the protocol will usually say which order it uses. The most important thing to be aware of is that there's a difference between the two.
Rust's stdlib supports all three variants: "little-endian", "big-endian", and "native-endian". The latter refers to whichever byte order the current platform uses.
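As a quick illustration using the stdlib's conversion methods, the same two bytes decode to different numbers depending on the byte order we pick:

```rust
let bytes = [0x01, 0x02];
assert_eq!(u16::from_le_bytes(bytes), 0x0201); // little-endian: 513
assert_eq!(u16::from_be_bytes(bytes), 0x0102); // big-endian: 258
// Native-endian resolves to whichever of the two the platform uses.
let _n = u16::from_ne_bytes(bytes);
```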
Reading bytes from a stream
Say we wanted to read a u16 encoded as little-endian from a stream. We would write something like this:
```rust
use std::io::prelude::*;
use std::fs::File;
use std::mem;

let mut f = File::open("foo.txt")?;
let mut buf = [0; mem::size_of::<u16>()];
f.read_exact(&mut buf)?;
let num = u16::from_le_bytes(buf);
```
This uses mem::size_of to ensure the buffer is the right size for the type we're trying to read. It then uses Read::read_exact to read the bytes from the stream, and u16::from_le_bytes to convert them into a u16.
To write a u16 as little-endian to a stream, we can do:
```rust
use std::io::prelude::*;
use std::fs::File;

let mut file = File::create("foo.txt")?;
file.write_all(&99_u16.to_le_bytes())?; // Write a u16 to a stream.
```
As you can see, writing bytes to a stream is fairly straightforward: two API calls combined, one to create the buffer and one to write it. But reading bytes is much more involved: a combination of three APIs that must be used in the right order.
Having to write this for each number read from a stream would feel repetitive, and make code harder to read and write. It's also not very discoverable for people who want to use these APIs: to figure it out you have to learn about several different APIs, and how they interact.
ReadBytes / WriteBytes
The Iterator::collect function is one of the functions that best embodies Rust's values of expressiveness, performance, and correctness. There are entire articles written about just that one function. And rightfully so; it's because of the type system and the borrow checker that we can have a single API that can just as easily convert an Iterator to a HashMap as to a Result<Vec> -- all without any performance penalty. That's quite impressive for a language with as strong a focus on correctness as Rust.
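As a reminder of just how flexible that is, the same collect call can target completely different containers depending only on the type we ask for:

```rust
use std::collections::HashMap;

let v: Vec<i32> = (1..4).collect();                          // [1, 2, 3]
let m: HashMap<i32, i32> = (1..4).map(|n| (n, n)).collect(); // {1: 1, 2: 2, 3: 3}
let r: Result<Vec<i32>, String> = vec![Ok(1), Ok(2)].into_iter().collect(); // Ok([1, 2])
```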
It turns out we can write a similar API for encoding and decoding number types as well. In omnom we've done this by implementing two traits for each number type: ReadBytes and WriteBytes. The APIs look like this:
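Below is a minimal sketch of the shape these traits could take; the exact names and signatures in omnom may differ:

```rust
use std::io::{self, Read, Write};

/// Types that can decode themselves from a reader (sketch).
pub trait ReadBytes: Sized {
    fn read_le_bytes<R: Read>(reader: &mut R) -> io::Result<Self>;
    fn read_be_bytes<R: Read>(reader: &mut R) -> io::Result<Self>;
}

/// Types that can encode themselves into a writer (sketch).
pub trait WriteBytes {
    fn write_le_bytes<W: Write>(&self, writer: &mut W) -> io::Result<()>;
    fn write_be_bytes<W: Write>(&self, writer: &mut W) -> io::Result<()>;
}
```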
We implement these for each individual number type so that it knows how to encode and decode itself from streams. We then add endian-aware read/write methods to the Read and Write traits that can take any T: ReadBytes or T: WriteBytes and read or write it:
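Continuing the sketch (again, illustrative rather than omnom's exact API), here's an implementation for u16 plus extension traits that are blanket-implemented for all readers and writers:

```rust
// Sketch: how a single number type could implement the trait.
impl ReadBytes for u16 {
    fn read_le_bytes<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0; 2];
        reader.read_exact(&mut buf)?;
        Ok(u16::from_le_bytes(buf))
    }
    fn read_be_bytes<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0; 2];
        reader.read_exact(&mut buf)?;
        Ok(u16::from_be_bytes(buf))
    }
}

/// Endian-aware convenience methods for any reader (sketch).
pub trait ReadExt: Read + Sized {
    fn read_le<T: ReadBytes>(&mut self) -> io::Result<T> {
        T::read_le_bytes(self)
    }
    fn read_be<T: ReadBytes>(&mut self) -> io::Result<T> {
        T::read_be_bytes(self)
    }
}
impl<R: Read> ReadExt for R {}

/// Endian-aware convenience methods for any writer (sketch).
pub trait WriteExt: Write + Sized {
    fn write_le<T: WriteBytes>(&mut self, num: T) -> io::Result<()> {
        num.write_le_bytes(self)
    }
    fn write_be<T: WriteBytes>(&mut self, num: T) -> io::Result<()> {
        num.write_be_bytes(self)
    }
}
impl<W: Write> WriteExt for W {}
```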
When put together, this allows reading and writing any number type, since they all implement ReadBytes and WriteBytes. This removes all the boilerplate we saw earlier, and replaces it with a few straightforward APIs.
```rust
use std::fs::OpenOptions;
use omnom::prelude::*;

let mut file = OpenOptions::new().read(true).write(true).open("foo.txt")?;
file.write_le(56_i32)?;       // Write an i32 as little-endian to a stream.
let n: u16 = file.read_be()?; // Read a u16 as big-endian from a stream.
```
Future possibilities
RFC
Making people use the omnom library for such a small API addition doesn't make much sense. I think this would be a fantastic addition to the standard library, making something commonplace easier to author, review, and learn about. Perhaps this would make for a good RFC?
Floating point numbers
Since Rust 1.41, endianness conversions are available for floats as well. This isn't part of omnom yet, but would be a two-line change to add.
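For reference, these are the stdlib conversions such an impl would delegate to:

```rust
// Round-trip a float through its little-endian byte representation.
let bytes = 12.5_f32.to_le_bytes();
assert_eq!(f32::from_le_bytes(bytes), 12.5);
```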
Multi-value reads and writes
Something else that would be a lot of fun to add would be multi-value reads and writes:
```rust
// Proposed
let head: [u32; 4] = file.read_be()?;

// Current
let b1: u32 = file.read_be()?;
let b2: u32 = file.read_be()?;
let b3: u32 = file.read_be()?;
let b4: u32 = file.read_be()?;
let head = [b1, b2, b3, b4];
```
Support for async streams
Another interesting avenue to explore would be to implement these methods for AsyncRead and AsyncWrite. Reading bytes asynchronously from streams is very common, and these methods would bring the same benefits there as their synchronous counterparts.
```rust
use async_std::fs::OpenOptions;
use omnom::prelude::*;

let mut file = OpenOptions::new().read(true).write(true).open("foo.txt").await?;
file.write_le(56_i32).await?;       // Write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be().await?; // Read a u16 as big-endian from a byte stream.
```
In async-std we've implemented Stream::collect using the FromStream and IntoStream traits. I recently wrote about the challenges of these traits, and given how similar they'd be to the traits we're proposing in this post, they'd likely face the same challenges. In short: we can probably implement them today, but not as efficiently as is theoretically possible until we support async fn in traits.
Alternatives
Back in November I wrote a rather long survey of byte-ordered reads and writes, and discussed a few alternatives. The design implemented in omnom ended up being the third option we discussed. But it's probably worthwhile discussing the other two options we considered.
Alternative 1: Inherent Methods on Read/Write
Instead of implementing a new trait for each number type, we could add number-specific methods to the Read and Write traits. If we followed std's naming conventions and added be, le, and ne variants of each method, we would end up implementing 96 new methods on Read and Write (2 traits × 3 endianness kinds × 16 number types).
```rust
use std::io::prelude::*;
use std::io::Cursor;

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16_be()?);
assert_eq!(768, reader.read_u16_be()?);
```
That's a lot of methods to show in the documentation of Read and Write, making them generally harder to browse and explore. It would also prevent us from implementing multi-value reads/writes later on (see Future possibilities).
Alternative 2: ByteOrder inspired
Another option would be to follow byteorder's approach of adding 32 methods on the Read and Write traits, and using Endianness types as type parameters to choose the endianness.
```rust
use std::io::prelude::*;
// Hypothetical import: the BigEndian type as proposed below.
use std::io::{BigEndian, Cursor};

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16::<BigEndian>()?);
assert_eq!(768, reader.read_u16::<BigEndian>()?);
```
Compared to the first option that's still a lot of methods (16 per trait), and it would also prevent us from implementing multi-value reads/writes later on. Additionally, we would need to introduce at least 3 new Endianness types to switch between endianness, which means additional imports in order to use the traits.
```rust
pub mod io {
    pub mod endian {
        pub struct BigEndian;
        pub struct LittleEndian;
        pub struct NativeEndian;
    }
}
```
Prior Art
byteorder
The byteorder crate uses a variation of the second alternative. byteorder was one of the first crates to provide convenience APIs for byte-order-aware operations, so its naming predates the additions to the stdlib.
As of Rust 1.32, numeric types provide built-in methods such as to_le_bytes and from_le_bytes that partially cover the same functionality as byteorder. The purpose of this post is to share additional designs that could replace the remaining uses of byteorder.
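Those built-in methods cover the conversion between numbers and byte arrays, but not the stream reads and writes around it:

```rust
// The stdlib handles the number <-> bytes step...
assert_eq!(517_u16.to_be_bytes(), [2, 5]);
assert_eq!(u16::from_be_bytes([2, 5]), 517);
// ...but getting those bytes in and out of a stream is still manual.
```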
tokio-byteorder
The tokio-byteorder crate is a port of byteorder for tokio's own AsyncRead and AsyncWrite traits. The design mostly matches byteorder, with the exception that it operates on asynchronous byte streams instead of synchronous ones.
```rust
use std::io::Cursor;
use tokio_byteorder::{AsyncReadBytesExt, BigEndian};

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().await?);
assert_eq!(768, rdr.read_u16::<BigEndian>().await?);
```
tokio
Two months ago tokio added support for reading and writing numbers as big-endian only:
```rust
use tokio::io::{self, AsyncReadExt};
use std::io::Cursor;

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16().await?);
assert_eq!(768, reader.read_u16().await?);
```
This currently adds 20 new methods in total to AsyncRead and AsyncWrite. A design to add support for little-endian writes has been proposed, which closely follows the first alternative we outlined (e.g. write_u16_le). If accepted, it would introduce an additional 40 methods on AsyncRead and AsyncWrite. At the time of writing, tokio equivalents of stdlib features such as support for floating point numbers or platform-native endianness do not appear to have been proposed.
nom
nom is a parser-combinator library that operates on byte buffers rather than streams. As such it isn't quite the same as the others in this list. However, it does have support for streaming operations, and provides several functions to read bytes with a given endianness:
```rust
use nom::number::streaming::be_u16;
use nom::{error::ErrorKind, Err, Needed};

let parser = |s| be_u16::<(_, ErrorKind)>(s);

assert_eq!(parser(b"\x00\x01abcd"), Ok((&b"abcd"[..], 0x0001)));
assert_eq!(parser(b"\x01"), Err(Err::Incomplete(Needed::Size(2))));
```
This approach is different from any API we've discussed so far, though the naming scheme somewhat matches the first alternative.
Conclusion
In this post we've covered what endianness is and why streaming parsing is useful, and shown how byte-order-aware parsing can be performed today. We then introduced a novel approach for streaming byte-order-aware parsing, inspired by the stdlib's collect function, and showed how to use a reference implementation. Finally, we covered the alternative API designs we considered, and prior art.
I originally wanted to publish this post in early December after I came back from Japan. But the holidays happened, so I figured I'd make the effort now that we're approaching the end of the first month of the new year.
Overall I'm pretty excited about this API. It's very close to "what if collect worked for numbers?": being able to read and write numbers using three new methods on each trait. I think that's something that would be really nice.
```rust
let head: [u32; 4] = file.read_be()?;
file.write_le(56_i32)?;
```
Thanks to Joshua Gould (Technetos) and Florian Gilcher (Skade) for proof-reading this post.