
Adds a non-blocking raw binary reader #394

Merged
zslayton merged 7 commits into main from binary-buffer on Jul 6, 2022

Conversation

Contributor

@zslayton zslayton commented Jun 25, 2022

The existing `RawBinaryReader` wraps an `io::BufRead` implementation
and pulls additional bytes from it as needed. This allows it to read
large files in a streaming fashion; however, it also means that the
file must be complete. If the data source is still growing (for
example: a file that is being appended to), the streaming reader will
treat the end of the stream as an unexpected EOF and error out.
Likewise, in async contexts where bytes can arrive in separate events,
trying to read a chunk of Ion bytes can fail if the slice doesn't contain
a complete Ion stream.

This PR adds two new types: a `BinaryBuffer` and a `RawBinaryBufferReader`.

The `BinaryBuffer` is a stack-allocated wrapper around an implementation of
`AsRef<[u8]>`. It provides methods to read binary Ion encoding primitives like
`VarUInt`, `VarInt`, and `NOP` from the data source. It is very cheap to
create and does not modify the data source.

The `RawBinaryBufferReader` provides a full `Reader` interface over a
`BinaryBuffer`. If it encounters an unexpected EOF while reading, it returns
an `IonError::Incomplete`. Unlike the original `RawBinaryReader`, however,
it is guaranteed to still be in a good state afterwards. If more data
becomes available later on, it can be appended to the buffer and the
`RawBinaryBufferReader` can try parsing it again.
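
The intended usage pattern looks roughly like the sketch below. (This is a minimal
illustration, not the final API: the import paths, the `append_bytes` method, and
the exact shape of `IonError::Incomplete` are assumptions based on the descriptions
in this PR.)

```rust
// Illustrative only; module paths and signatures may differ in the crate.
use ion_rs::raw_reader::RawStreamItem;
use ion_rs::result::IonError;
use ion_rs::{RawBinaryBufferReader, RawReader};

/// Feeds a newly arrived chunk of bytes to the reader, then reads as many
/// complete top-level items as the buffer now contains.
fn pump_chunk(reader: &mut RawBinaryBufferReader<Vec<u8>>, chunk: &[u8]) {
    reader.append_bytes(chunk);
    loop {
        match reader.next() {
            // The buffered data ended cleanly between top-level values.
            Ok(RawStreamItem::Nothing) => break,
            Ok(_item) => { /* process the value, IVM, or null */ }
            // The buffer ended partway through a value; the reader is still in
            // a good state and will resume once more bytes are appended.
            Err(IonError::Incomplete { .. }) => break,
            Err(error) => panic!("malformed Ion: {:?}", error),
        }
    }
}
```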

Finally, a new configuration for the `ion-tests` integration suite has been
added that exercises the `Reader` interface in the `RawBinaryBufferReader`,
demonstrating that it is spec-compliant.

This PR duplicates a fair amount of functionality that is available
elsewhere in the crate. Follow-on PRs will consolidate these implementations;
it is expected that the blocking raw binary reader will be re-implemented
as a wrapper around the non-blocking reader, making calls to `read` whenever
an `Incomplete` is encountered.

Fixes #4, related to #351.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


codecov bot commented Jun 25, 2022

Codecov Report

Merging #394 (22f0af5) into main (ee29939) will decrease coverage by 0.32%.
The diff coverage is 84.86%.

@@            Coverage Diff             @@
##             main     #394      +/-   ##
==========================================
- Coverage   88.76%   88.44%   -0.33%     
==========================================
  Files          80       82       +2     
  Lines       12622    14027    +1405     
==========================================
+ Hits        11204    12406    +1202     
- Misses       1418     1621     +203     
| Impacted Files | Coverage Δ |
|---|---|
| src/text/raw_text_reader.rs | 90.75% <ø> (+0.33%) ⬆️ |
| src/text/text_buffer.rs | 99.29% <ø> (ø) |
| src/result.rs | 68.71% <22.22%> (-6.83%) ⬇️ |
| src/binary/non_blocking/type_descriptor.rs | 27.50% <27.50%> (ø) |
| src/binary/raw_binary_reader.rs | 92.80% <50.00%> (ø) |
| src/binary/type_code.rs | 94.91% <50.00%> (ø) |
| src/data_source.rs | 91.91% <61.90%> (-5.51%) ⬇️ |
| src/binary/var_uint.rs | 96.84% <80.00%> (+0.10%) ⬆️ |
| src/reader.rs | 84.61% <83.33%> (-0.04%) ⬇️ |
| src/binary/non_blocking/raw_binary_reader.rs | 86.51% <86.51%> (ø) |

... and 10 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor Author

@zslayton zslayton left a comment

🗺️ PR tour

@@ -53,6 +53,7 @@ rstest = "0.9"
 walkdir = "2.3"
 test-generator = "0.3"
 pretty-hex = "0.2"
+memmap = "0.7.0"
Contributor Author

🗺️ This has been added as a dev dependency so it can be used in examples/read_all_values.rs.

-use ion_rs::{IonDataSource, IonType, RawBinaryReader};
+use ion_rs::RawReader;
+use ion_rs::{IonType, RawBinaryReader};
+use memmap::MmapOptions;
Contributor Author

🗺️ I updated this example program to be able to use either a blocking or non-blocking raw reader depending on the command line arguments passed. I used it as a quick-and-dirty benchmark during development. In its current state, the non-blocking reader is a modest improvement over the blocking one, taking about 5% less time to complete on a ~1.3GB file.
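
For reference, here is roughly what the memory-mapped setup in the example looks
like (a sketch: the `RawBinaryBufferReader::new` constructor name is an assumption;
see `examples/read_all_values.rs` for the real code):

```rust
use std::fs::File;

use ion_rs::RawBinaryBufferReader; // assumed export name
use memmap::MmapOptions;

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: read_all_values <file>");
    let file = File::open(path)?;
    // Safety: the file must not be modified or truncated while the map is alive.
    let mmap = unsafe { MmapOptions::new().map(&file)? };
    // The mapped region derefs to &[u8], so it can back the non-blocking reader
    // directly, with no copying and no blocking I/O calls during parsing.
    let _reader = RawBinaryBufferReader::new(&mmap[..]);
    Ok(())
}
```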

use IonType::*;
use RawStreamItem::{Nothing, Null as NullValue, Value, VersionMarker};
let mut count: usize = 0;
loop {
match cursor.next()? {
Contributor Author

This name is an artifact from when the `RawBinaryReader` was still called a `BinaryCursor`.

@@ -30,6 +30,14 @@ pub struct DecodedInt {
}

impl DecodedInt {
pub(crate) fn new(value: Integer, is_negative: bool, size_in_bytes: usize) -> Self {
Contributor Author

🗺️ The encoding primitive constructors (`VarUInt::new`, `DecodedInt::new`, etc.) are now `pub(crate)` so that the `BinaryBuffer` can construct these primitives during parsing.

@@ -5,7 +5,7 @@ const MAX_NIBBLE_VALUE: u8 = 15;
const NIBBLE_SIZE_IN_BITS: u8 = 4;

/// Given a byte, will return a tuple containing the values of its left and right nibbles.
-pub(crate) fn nibbles_from_byte(byte: u8) -> (u8, u8) {
+pub(crate) const fn nibbles_from_byte(byte: u8) -> (u8, u8) {
Contributor Author

🗺️ This function was made const so it could be called in compile-time computations. In particular, this enabled a pre-computed table of type descriptors to be constructed at compile time instead of during each reader's initialization; I'll call that out when it appears further down in the diff.
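
A minimal sketch of the compile-time-table pattern this enables (the struct and
table below are illustrative stand-ins, not the crate's actual `type_descriptor`
items):

```rust
// Illustrative compile-time lookup table; not the crate's actual definitions.
#[derive(Clone, Copy)]
struct TypeDescriptor {
    type_code: u8,
    length_code: u8,
}

const fn nibbles_from_byte(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0x0F)
}

const fn descriptor_for(byte: u8) -> TypeDescriptor {
    let (type_code, length_code) = nibbles_from_byte(byte);
    TypeDescriptor { type_code, length_code }
}

// Because `nibbles_from_byte` is `const`, the whole 256-entry table can be
// evaluated at compile time instead of during each reader's initialization.
const DESCRIPTORS: [TypeDescriptor; 256] = {
    let mut table = [TypeDescriptor { type_code: 0, length_code: 0 }; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = descriptor_for(i as u8);
        i += 1;
    }
    table
};
```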

fn big_integer_from_big_uint(value: BigUint) -> Integer {
Integer::BigInt(BigInt::from(value))
}

Contributor Author

🗺️ These helper functions are a (very) cold path for reading integers. I've marked them `#[inline(never)]` to minimize their footprint at the call sites.
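
For illustration, this is the shape of such a helper with the attribute applied
(simplified here to plain `num-bigint` types rather than the crate's `Integer` enum):

```rust
use num_bigint::{BigInt, BigUint};

// Cold path: integers large enough to need BigUint/BigInt are rare, so keeping
// this function out of line keeps the hot decoding loop's code size down.
#[inline(never)]
fn big_integer_from_big_uint(value: BigUint) -> BigInt {
    BigInt::from(value)
}
```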

@@ -21,17 +20,6 @@ impl<R: RawReader> Iterator for NativeElementIterator<R> {
}
}

impl ElementReader for NativeElementReader {
Contributor Author

🗺️ This impl block was moved further down in the same file.

}
}

pub struct NonBlockingNativeElementReader;
Contributor Author

🗺️ This type is used by our ion-tests integration to demonstrate spec compliance. It is not especially useful in other contexts.

@@ -763,3 +762,144 @@ mod native_element_tests {
non_equivs(NativeElementApi, file_name)
}
}

Contributor Author

🗺️ I added an ion-tests configuration for the non-blocking reader.

/// When the wrapped type is a `Vec<u8>`, data can be appended to the buffer between read
/// operations.
#[derive(Debug, PartialEq)]
pub(crate) struct BinaryBuffer<A: AsRef<[u8]>> {
Contributor Author

@zslayton zslayton Jun 26, 2022

🗺️ This type allows you to read Ion encoding primitives from a &[u8], Vec<u8>, bytes::Buf, or any other container of bytes.

There are many code blocks that are basically copy/pasted from the blocking implementation (for example: reading VarUInt, VarInt, etc). I intend to do a lot of consolidation in a follow-on PR.

I initially tried to use some existing byte buffer implementations from the Rust ecosystem for this, but needed some particular functionality:

  • I needed the buffer to track how many bytes it had consumed over its lifespan.
  • I wanted control over allocations.
  • I needed some instantiations of the buffer to offer methods for adding data, but not others.

It was simpler to just write my own.
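
A condensed sketch of that design (field and method names follow the discussion
in this PR rather than the final code): consumption is tracked over the buffer's
lifespan, and only the `Vec<u8>`-backed instantiation exposes a way to add data.

```rust
// Sketch only; the crate's BinaryBuffer has many more read_* methods.
struct BinaryBuffer<A: AsRef<[u8]>> {
    data: A,
    start: usize,          // index of the next unread byte
    total_consumed: usize, // bytes consumed over the buffer's lifespan
}

impl<A: AsRef<[u8]>> BinaryBuffer<A> {
    fn new(data: A) -> Self {
        BinaryBuffer { data, start: 0, total_consumed: 0 }
    }

    fn bytes(&self) -> &[u8] {
        &self.data.as_ref()[self.start..]
    }

    fn consume(&mut self, n: usize) {
        self.start += n;
        self.total_consumed += n;
    }
}

// Appending new data is only offered when the backing storage can grow.
impl BinaryBuffer<Vec<u8>> {
    fn append_bytes(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }
}
```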

@zslayton zslayton requested review from jobarr-amzn and desaikd June 26, 2022 00:03
/// Type, offset, and length information about the serialized value over which the
/// NonBlockingRawBinaryReader is currently positioned.
#[derive(Clone, Copy, Debug, PartialEq)]
struct EncodedValue {
Contributor Author

🗺️ Each time the binary reader advances, it will populate a new EncodedValue that tracks the stream offsets of the serialized value's components.

Comment on lines 356 to 359
// The `Skipping(n)` state means that we ran out of data on a previous call to `next()`.
// The only way to fix it is to add more input bytes to the buffer using the `append_bytes`
// or `read_from` methods. Adding input bytes to the buffer is only supported
// when the generic type `A` is `Vec<u8>`.
Contributor Author

🗺️ There's a tradeoff lurking here. Moving to the Skipping(n) state means that we leave the reader in no man's land; no operation will succeed until more data has been added.

We could instead have the reader simply return IonError::Incomplete without changing its state. However:

  • This would mean that the only time the reader could advance is when the entire current value is in the buffer. Given that some real-world use cases work with multi-GB values, this feels like an onerous requirement. Having a Skipping(n) state allows the reader to incrementally skip over very large values.
  • If a call to next() (which in turn calls advance_to_next_item()) fails as Incomplete, I don't think users would expect to be able to perform more read operations on the value they were on before calling next().

If a use case for the alternative behavior surfaces down the road, I could see adding a config option for this (disable_incremental_advancement?).
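
To make the tradeoff concrete, here is an illustrative sketch of the incremental-skip
idea (names and signatures are assumptions, not the crate's definitions):

```rust
// Sketch: skip over a large value a chunk at a time instead of requiring the
// whole value to be buffered before the reader can advance.
enum ReaderState {
    Ready,
    // This many more bytes of the current item must be discarded before the
    // reader can advance; it stays in this state until more data is appended.
    Skipping(usize),
}

fn skip_ahead(
    state: &mut ReaderState,
    bytes_available: usize,
    bytes_to_skip: usize,
) -> Result<usize, ()> {
    if bytes_to_skip <= bytes_available {
        *state = ReaderState::Ready;
        Ok(bytes_to_skip) // consume this many bytes, then continue parsing
    } else {
        // Consume everything we have and remember the remainder; the caller
        // reports this to the user as IonError::Incomplete.
        *state = ReaderState::Skipping(bytes_to_skip - bytes_available);
        Err(())
    }
}
```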

Err(incomplete_data_error_raw(label, offset))
}

pub fn incomplete_data_error_raw(label: &'static str, offset: usize) -> IonError {
Contributor

Why not `#[inline(never)]` this raw error function as well, similar to the other raw error functions?

Contributor Author

Great question! The other error methods each allocate a String via format!, which causes their assembly to include code for allocation, formatting, and deallocation. incomplete_data_error_raw holds a &'static str, so it doesn't need to allocate; its assembly is pretty small for now.
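
To illustrate the distinction, here are two simplified standalone functions (not
the crate's actual error constructors):

```rust
// Builds an error message with format!, so its assembly includes allocation,
// formatting, and deallocation code; worth keeping out of line.
#[inline(never)]
fn decoding_error_message(description: &str) -> String {
    format!("decoding error: {}", description)
}

// Carries only a &'static str and an offset; no allocation and tiny assembly,
// so letting it be inlined is harmless.
fn incomplete_data_message(label: &'static str, offset: usize) -> (&'static str, usize) {
    (label, offset)
}
```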

Contributor Author

Here's an assembly dump for decoding_error_raw (which is behind inline(never)):

[screenshot: assembly dump for `decoding_error_raw`]

In contrast, here's the assembly for incomplete_data_error_raw, which usually gets inlined:

[screenshot: assembly dump for `incomplete_data_error_raw`]

I'm using cargo asm to generate these.

}

if !terminated {
// panic!("incomplete (unterm) varint @ pos {}: {:x?}", self.total_consumed(), self.bytes());
Contributor

debug comment?

Suggested change
// panic!("incomplete (unterm) varint @ pos {}: {:x?}", self.total_consumed(), self.bytes());

Contributor Author

Good eye, removed.

self.consume(encoded_size_in_bytes);
Ok(VarInt::new(
magnitude * sign,
!is_positive,
Contributor

Suggestion: everywhere we use `!is_positive` with `VarInt::new`, why not just declare `is_negative` above and use it as-is? The only reason I suggest this is that I had to go back and look at the definition of `VarInt::new` to make sure the statement above is correct.

Contributor Author

Done!

Contributor Author

@zslayton zslayton left a comment

These comments describe changes I introduced in the latest commit. If you haven't looked at the earlier commits, you can probably ignore them and just read the latest version.

Comment on lines 117 to 126
-let next_byte = self.bytes()[0];
+let next_byte = self.data.as_ref()[self.start];
Contributor Author

🗺️ This change sidesteps the process of building a subslice via bytes(). It's a small optimization, but peeking at the next type descriptor might be the single hottest path in the codebase.
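
In other words (a sketch with plain slices rather than the reader's actual fields):

```rust
// Builds a temporary subslice just to read its first byte.
fn peek_via_subslice(data: &[u8], start: usize) -> u8 {
    let remaining = &data[start..]; // what a bytes() helper would return
    remaining[0]
}

// Indexes the underlying storage directly; no intermediate slice is built.
fn peek_direct(data: &[u8], start: usize) -> u8 {
    data[start]
}
```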

self.field_id_length as usize
+ self.annotations_header_length as usize
+ self.header_length()
+ self.value_length()
Contributor Author

This calculation happens when stepping into a container (to know where the end of the container will be), skipping a value (to know where the end of the value will be), and getting the slice of bytes that represent the body of the value. Caching it in the EncodedValue proved to be worth the space tradeoff.
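
A sketch of the caching approach (illustrative field names only): the sum is computed
once when the value's header is parsed and stored alongside the other offsets.

```rust
// Illustrative only; the crate's EncodedValue has more fields than this.
struct CachedLengths {
    field_id_length: usize,
    annotations_header_length: usize,
    header_length: usize,
    value_length: usize,
    // Cached sum of the four components above, filled in at parse time so that
    // step-in, skip, and byte-slice operations don't recompute it.
    total_length: usize,
}

impl CachedLengths {
    fn new(field_id: usize, annotations: usize, header: usize, value: usize) -> Self {
        CachedLengths {
            field_id_length: field_id,
            annotations_header_length: annotations,
            header_length: header,
            value_length: value,
            total_length: field_id + annotations + header + value,
        }
    }

    fn total_length(&self) -> usize {
        self.total_length
    }
}
```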

} else {
self.buffer.consume(bytes_available);
self.state = Skipping(bytes_to_skip - bytes_available);
incomplete_data_error("ahead to next item", self.buffer.total_consumed())
Contributor Author

🗺️ Looking at this method with fresh eyes, I noticed the different states weren't that different after all.

Comment on lines 1072 to 1073
if self.is_eof() || self.is_at_end_of_container() {
return Ok(RawStreamItem::Nothing);
Contributor Author

🗺️ This change hoisted all of the branching involved in is_eof() and is_at_end_of_container() into the consume_nop_padding() method, keeping this method smaller.

@@ -1147,8 +1114,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> {
self.encoded_value.header = header;
// Record the *absolute* offset of the type descriptor -- its offset from the beginning of
// the stream.
-self.encoded_value.header_offset =
-    self.main_buffer.total_consumed() + self.tx_buffer.total_consumed();
+self.encoded_value.header_offset = self.tx_buffer.total_consumed();
Contributor Author

🗺️ Now that tx_buffer holds the original buffer's total_consumed count, this calculation gets simplified.

-total_consumed: 0,
+total_consumed: self.total_consumed,
Contributor Author

🗺️ Previously, the transaction buffer would track its own total consumption. However, I realized this wasn't terribly useful in its existing role. Having it copy the original buffer's `total_consumed` means that if it encounters errors, the error position it reports will be the absolute offset in the Ion stream, which is what we want.

Incidentally, this piece of information was the only reason we stored a reference to the original buffer in the TxReader, so that's been removed as well.

 state: &'a mut ReaderState,
-main_buffer: &'a BinaryBuffer<A>,
Contributor Author

🗺️ Now that the tx_buffer below has the original buffer's total_consumed, this field has been removed.

zslayton and others added 3 commits July 5, 2022 12:17
Co-authored-by: Khushboo <68757952+desaikd@users.noreply.github.com>
Co-authored-by: Khushboo <68757952+desaikd@users.noreply.github.com>
@zslayton zslayton merged commit d7c4348 into main Jul 6, 2022
@zslayton zslayton deleted the binary-buffer branch July 6, 2022 00:29

Successfully merging this pull request may close these issues.

Promote the header byte jump table to a constant