Optimizes reading binary symbol and integer values #396

zslayton · 2022-06-30T16:18:40Z

This PR depends on the changes in PR #394. After that PR has been merged, this one will be rebased on top of main.

This PR:

* Eliminates redundant boundary checking that happens when
  integer and symbol values are read. It does this by offering
  unchecked UInt decoding methods that read a single value from a
  provided slice.
* Removes support for reading symbol IDs that require more than 8
  bytes to encode. This limits the size of the symbol table to about
  18 quintillion symbol values, which seems reasonable.
* Adds `Into<Integer>` implementations for Rust's integer primitives.

Together, these reduced the time needed to read every value in a
1.3GB Ion log file by about 6%.

Fixes #395.
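As an illustration of the conversions mentioned in the list above, here is a minimal sketch. The `Integer` enum below is a hypothetical stand-in with a single variant, and the macro name is invented for this example; the crate's actual type and implementation may differ. The sketch relies on the fact that implementing `From<T> for Integer` also provides `Into<Integer> for T` through the standard library's blanket impl.

```rust
/// Hypothetical stand-in for the crate's `Integer` type (assumption: the
/// real type also has an arbitrary-precision variant, omitted here).
#[derive(Debug, PartialEq)]
pub enum Integer {
    I64(i64),
}

// Implementing `From<T> for Integer` gives `Into<Integer> for T` for free.
// The macro name is illustrative only.
macro_rules! impl_integer_from {
    ($($t:ty),*) => {
        $(
            impl From<$t> for Integer {
                fn from(value: $t) -> Integer {
                    // Every listed type converts losslessly to i64.
                    Integer::I64(i64::from(value))
                }
            }
        )*
    };
}

// Only primitives that fit in an i64 are covered in this sketch.
impl_integer_from!(i8, i16, i32, i64, u8, u16, u32);
```

With these impls in place, a call such as `let i: Integer = 7u8.into();` compiles without an explicit constructor.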

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The existing `RawBinaryReader` wraps an `io::BufRead` implementation and pulls additional bytes from it as needed. This allows it to read large files in a streaming fashion; however, it also means that the file must be complete. If the data source is still growing (for example: a file that is being appended to), the streaming reader will treat the end of the stream as an unexpected EOF and error out. Likewise, in async contexts where bytes can arrive in separate events, trying to read a chunk of Ion bytes can fail if the slice doesn't contain a complete Ion stream.

This PR adds two new types: a `BinaryBuffer` and a `RawBinaryBufferReader`.

The `BinaryBuffer` is a stack-allocated wrapper around an implementation of `AsRef<[u8]>`. It provides methods to read binary Ion encoding primitives like `VarUInt`, `VarInt`, and `NOP` from the data source. It is very cheap to create and does not modify the data source.

The `RawBinaryBufferReader` provides a full `Reader` interface over a `BinaryBuffer`. If it encounters an unexpected EOF while reading, it returns an `IonError::Incomplete`. Unlike the original `RawBinaryReader`, however, it is guaranteed to still be in a good state afterwards. If more data becomes available later on, it can be appended to the buffer and the `RawBinaryBufferReader` can try parsing it again.

Finally, a new configuration for the `ion-tests` integration suite has been added that exercises the `Reader` interface in the `RawBinaryBufferReader`, demonstrating that it is spec-compliant.

This PR duplicates a fair amount of functionality that is available elsewhere in the crate. Follow-on PRs will consolidate these implementations; it is expected that the blocking raw binary reader will be re-implemented as a wrapper around the non-blocking reader, making calls to `read` whenever an `Incomplete` is encountered.
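The retry-after-`Incomplete` behavior described above can be sketched with a toy reader. Everything here is hypothetical: `ToyBufferReader`, `ReadError`, and the one-byte length prefix are invented for illustration and are not the crate's types or real Ion encoding. The point is only that a failed read leaves the reader's position untouched, so the same call can succeed once more bytes have been appended.

```rust
/// Hypothetical error type standing in for `IonError::Incomplete`.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    Incomplete,
}

/// A toy reader over a growable buffer. Each "value" is a one-byte length
/// followed by that many payload bytes (not real Ion encoding).
pub struct ToyBufferReader {
    data: Vec<u8>,
    start: usize, // offset of the first unconsumed byte
}

impl ToyBufferReader {
    pub fn new(data: Vec<u8>) -> Self {
        ToyBufferReader { data, start: 0 }
    }

    /// Appends bytes that arrived after the reader was created.
    pub fn append(&mut self, more: &[u8]) {
        self.data.extend_from_slice(more);
    }

    /// Reads the next value. On `Incomplete`, `start` is left unchanged so
    /// the call can be retried after more data has been appended.
    pub fn next_value(&mut self) -> Result<Vec<u8>, ReadError> {
        let remaining = &self.data[self.start..];
        let length = *remaining.first().ok_or(ReadError::Incomplete)? as usize;
        let payload = remaining.get(1..1 + length).ok_or(ReadError::Incomplete)?;
        let value = payload.to_vec();
        // Advance only after a complete value has been seen.
        self.start += 1 + length;
        Ok(value)
    }
}
```

A blocking wrapper of the kind the text anticipates could loop on `next_value`, pulling more bytes from its source and calling `append` each time `Incomplete` comes back.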


codecov · 2022-06-30T16:27:28Z

Codecov Report

Merging #396 (ceae311) into main (d7c4348) will decrease coverage by 0.02%.
The diff coverage is 78.94%.

@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
- Coverage   88.43%   88.41%   -0.03%     
==========================================
  Files          82       82              
  Lines       14027    14042      +15     
==========================================
+ Hits        12405    12415      +10     
- Misses       1622     1627       +5

Impacted Files	Coverage Δ
src/types/integer.rs	`81.42% <56.25%> (-2.08%)`	⬇️
src/binary/non_blocking/raw_binary_reader.rs	`86.52% <88.88%> (+<0.01%)`	⬆️
src/binary/non_blocking/binary_buffer.rs	`97.04% <100.00%> (-0.03%)`	⬇️
src/binary/uint.rs	`93.63% <100.00%> (+0.43%)`	⬆️
src/reader.rs	`84.61% <0.00%> (-0.49%)`	⬇️
src/text/raw_text_reader.rs	`90.75% <0.00%> (+0.33%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

zslayton · 2022-06-30T16:20:08Z

src/binary/non_blocking/binary_buffer.rs

@@ -93,7 +93,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
    /// If the buffer is not empty, returns `Some(_)` containing the next byte in the buffer.
    /// Otherwise, returns `None`.
    pub fn peek_next_byte(&self) -> Option<u8> {
-        self.bytes().get(0).copied()
+        self.data.as_ref().get(self.start).copied()


🗺️ The original code created a byte slice (`&[u8]`) of the remaining data even though we only needed its first byte. Now it just gets the byte at `self.start` without creating an intermediate slice.
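A minimal sketch of the two approaches, for comparison. `ToyBuffer` is a hypothetical, trimmed-down type with only a `data` and a `start` field; the assumption that the old `bytes()` method sliced the data from `start` onward is inferred from the comment above, not taken from the crate.

```rust
/// Hypothetical, trimmed-down buffer with only the fields this sketch needs.
pub struct ToyBuffer<A: AsRef<[u8]>> {
    data: A,
    start: usize, // offset of the next unread byte
}

impl<A: AsRef<[u8]>> ToyBuffer<A> {
    pub fn new(data: A) -> Self {
        ToyBuffer { data, start: 0 }
    }

    /// Marks up to `n` bytes as read, never moving past the end of the data.
    pub fn consume(&mut self, n: usize) {
        self.start = (self.start + n).min(self.data.as_ref().len());
    }

    /// Old style: build the slice of unread bytes, then take its first element.
    pub fn peek_via_slice(&self) -> Option<u8> {
        let unread = &self.data.as_ref()[self.start..];
        unread.first().copied()
    }

    /// New style: index the backing data at `start` directly.
    pub fn peek_direct(&self) -> Option<u8> {
        self.data.as_ref().get(self.start).copied()
    }
}
```

Both methods return the same result; the second simply skips constructing the intermediate slice.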

zslayton · 2022-06-30T16:22:13Z

src/binary/non_blocking/binary_buffer.rs

@@ -262,7 +262,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
    ///
    /// See: https://amzn.github.io/ion-docs/docs/binary.html#uint-and-int-fields
    pub fn read_uint(&mut self, length: usize) -> IonResult<DecodedUInt> {
-        if length <= mem::size_of::<usize>() {
+        if length <= mem::size_of::<u64>() {


🗺️ Here and below: the code originally compared the length against the size of `usize`, which had the potential to be overly conservative. On 32-bit systems, where `usize` is 4 bytes, this would cause medium-sized values (5 to 8 bytes, which still fit in a `u64`) to be pushed into a BigInt.

In some cases (like reading a UInt that represents a SymbolId), limiting it to usize makes sense. However, that's not this method's responsibility.
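To make the effect of the threshold concrete, here is a small sketch. `Repr` and `choose_repr` are hypothetical names invented for this example; the threshold is passed as a parameter so that the 4-byte `usize` of a 32-bit target can be simulated on any machine.

```rust
use std::mem;

/// Hypothetical tag for which representation a decoded UInt would use.
#[derive(Debug, PartialEq)]
pub enum Repr {
    Small, // fits the fixed-width fast path
    Big,   // falls back to an arbitrary-precision integer
}

/// Chooses a representation for a UInt that is `length` bytes long, given
/// the widest length (in bytes) that the fast path accepts.
pub fn choose_repr(length: usize, max_small_len: usize) -> Repr {
    if length <= max_small_len {
        Repr::Small
    } else {
        Repr::Big
    }
}

/// Size of `usize` on a 32-bit target, hard-coded for illustration.
pub const USIZE_LEN_ON_32_BIT: usize = 4;

/// The threshold used after the change: always 8, regardless of target.
pub fn u64_len() -> usize {
    mem::size_of::<u64>()
}
```

For a 6-byte UInt, `choose_repr(6, USIZE_LEN_ON_32_BIT)` selects `Repr::Big`, while `choose_repr(6, u64_len())` selects `Repr::Small`.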

zslayton · 2022-06-30T16:23:15Z

src/binary/non_blocking/binary_buffer.rs

-            magnitude <<= 8;
-            magnitude |= byte;
-        }
+        let magnitude = DecodedUInt::small_uint_from_slice(uint_bytes);


🗺️ This bit-shifting loop now lives in `DecodedUInt::small_uint_from_slice` so it can be called from multiple locations. I'll point out the other call site below.
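For reference, a minimal sketch of what such a helper might look like, based on the removed loop lines in the hunk above. The function name mirrors the one in the diff, but the signature, the `u64` return type, and the free-function form are assumptions; the actual method is an associated function on `DecodedUInt`.

```rust
/// Hypothetical sketch of the shared helper. It interprets `uint_bytes` as a
/// big-endian unsigned integer. No length check is performed: the caller is
/// expected to pass at most 8 bytes, and any extra leading bytes would be
/// shifted out silently.
pub fn small_uint_from_slice(uint_bytes: &[u8]) -> u64 {
    let mut magnitude: u64 = 0;
    for &byte in uint_bytes {
        // Make room for the next byte, then place it in the low 8 bits.
        magnitude <<= 8;
        magnitude |= u64::from(byte);
    }
    magnitude
}
```

Because the length check is left to the caller, a call site that has already validated the slice avoids repeating that work.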

zslayton · 2022-06-30T16:29:40Z