-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizes reading binary symbol and integer values #396
Conversation
The existing `RawBinaryReader` wraps an `io::BufRead` implementation and pulls additional bytes from it as needed. This allows it to read large files in a streaming fashion; however, it also means that the file must be complete. If the data source is still growing (for example: a file that is being appended to), the streaming reader will treat the end of the stream as an unexpected EOF and error out. Likewise, in async contexts where bytes can arrive in separate events, trying to read chunk of Ion bytes can fail if the slice doesn't contain a complete Ion stream. This PR adds two new types: a `BinaryBuffer` and a `RawBinaryBufferReader`. The `BinaryBuffer` is a stack-allocated wrapper around an implementation of `AsRef<u8>`. It provides methods to read binary Ion encoding primitives like `VarUInt`, `VarInt`, and `NOP` from the data source. It is very cheap to create and does not modify the data source. The `RawBinaryBufferReader` provides a full `Reader` interface over a `BinaryBuffer`. If it encounters an unexpected EOF while reading, it returns an `IonError::Incomplete`. Unlike the original `RawBinaryReader`, however, it is guaranteed to still be in a good state afterwards. If more data becomes available later on, it can be appended to the buffer and the `RawBinaryBufferReader` can try parsing it again. Finally, a new configuration for the `ion-tests` integration suite has been added that exercises the `Reader` interface in the `RawBinaryBufferReader`, demonstrating that it is spec-compliant. This PR duplicates a fair amount of functionality that is available elsewhere in the crate. Follow-on PRs will consolidate these implementations; it is expected that the blocking raw binary reader will be re-implemented as a wrapper around the non-blocking reader, making calls to `read` whenever an `Incomplete` is encountered.
This PR: * Eliminates redundant boundary checking that happens when integer and symbol values are read. It does this by offering unchecked UInt decoding methods that read a single value from a provided slice. * Removes support for reading symbol IDs that require more than 8 bytes to encode. This limits the size of the symbol table to about 18 quintillion symbol values, which seems reasonable. * Adds Into<Integer> implementations for Rust's integer primitives. Together, these reduced the time needed to read every value in a 1.3GB Ion log file by about 6%. Fixes #395.
Codecov Report
@@ Coverage Diff @@
## main #396 +/- ##
==========================================
- Coverage 88.43% 88.41% -0.03%
==========================================
Files 82 82
Lines 14027 14042 +15
==========================================
+ Hits 12405 12415 +10
- Misses 1622 1627 +5
Continue to review full report at Codecov.
|
@@ -93,7 +93,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> { | |||
/// If the buffer is not empty, returns `Some(_)` containing the next byte in the buffer. | |||
/// Otherwise, returns `None`. | |||
pub fn peek_next_byte(&self) -> Option<u8> { | |||
self.bytes().get(0).copied() | |||
self.data.as_ref().get(self.start).copied() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ The original code created a byte slice (&[u8]
) even though we only needed the first byte in the data. Now it just gets the first byte without creating an intermediate slice.
@@ -262,7 +262,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> { | |||
/// | |||
/// See: https://amzn.github.io/ion-docs/docs/binary.html#uint-and-int-fields | |||
pub fn read_uint(&mut self, length: usize) -> IonResult<DecodedUInt> { | |||
if length <= mem::size_of::<usize>() { | |||
if length <= mem::size_of::<u64>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Here and below: the code was originally checking for usize
, which had the potential to be overly conservative. On 32-bit systems, this would cause medium-sized values to be pushed into a BigInt
.
In some cases (like reading a UInt
that represents a SymbolId
), limiting it to usize
makes sense. However, that's not this method's responsibility.
magnitude <<= 8; | ||
magnitude |= byte; | ||
} | ||
let magnitude = DecodedUInt::small_uint_from_slice(uint_bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This bitshifting loop now lives in DecodedUInt::small_uint_from_slice
so it can be called from multiple locations. I'll point out the other caller location below.
decoding_error("found a u64 symbol ID that was too large to fit in a usize") | ||
} | ||
} | ||
UInteger::BigUInt(symbol_id) => Self::try_symbol_id_from_big_uint(symbol_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This code path (try_symbol_id_from_big_uint
) would only succeed if the BigUInt
was small enough that it could have been encoded as a usize
to begin with. This code path existed solely to satisfy one of the ion-tests
:
ion-tests/iontestdata/good/typecodes/T7-large.10n
Text view of its contents:
---------------------------------------------------------------------------
Offset | Length | Binary Ion | Text Ion
---------------------------------------------------------------------------
| 4 | e0 01 00 ea | // Ion 1.0 Version Marker
4 | 6 | 75 00 00 00 00 00 | '$0' // $0
10 | 7 | 76 00 00 00 00 00 00 | '$0' // $0
17 | 8 | 77 00 00 00 00 00 00 00 | '$0' // $0
25 | 9 | 78 00 00 00 00 00 00 00 | '$0' // $0
| | 00 |
34 | 10 | 79 00 00 00 00 00 00 00 | '$0' // $0
| | 00 00 |
44 | 11 | 7a 00 00 00 00 00 00 00 | '$0' // $0
| | 00 00 00 |
55 | 12 | 7b 00 00 00 00 00 00 00 | '$0' // $0
| | 00 00 00 00 |
67 | 13 | 7c 00 00 00 00 00 00 00 | '$0' // $0
| | 00 00 00 00 00 |
80 | 14 | 7d 00 00 00 00 00 00 00 | '$0' // $0
| | 00 00 00 00 00 00 |
94 | 16 | 7e 8e 00 00 00 00 00 00 | '$0' // $0
| | 00 00 00 00 00 00 00 00 |
I have removed that code path altogether and added the test to the skip list. I don't believe this represents a substantive loss of functionality.
let (encoded_value, mut buffer) = self.value_and_buffer(IonType::Symbol)?; | ||
match buffer.read_uint(encoded_value.value_length())?.value() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Previously, we constructed a BinaryBuffer
that wrapped the value bytes. This allowed us to call buffer.read_uint(...)
, which was very ergonomic. However, doing this involved a few redundant steps:
- It constructed the same slice (
&[u8]
) containing the value's body more than once. - It checked the number of bytes remaining in the buffer more than once.
- It had multiple error handling paths even though at most one of them could be called.
let (encoded_value, bytes) = self.value_and_bytes(IonType::Integer)?; | ||
let value: Integer = if bytes.len() <= mem::size_of::<u64>() { | ||
DecodedUInt::small_uint_from_slice(bytes).into() | ||
} else { | ||
DecodedUInt::big_uint_from_slice(bytes).into() | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This optimization is similar to the read_symbol_id
optimization above. We:
- Use
bytes
directly instead of creating an intermediate buffer. - Use our new helper method to read all of
bytes
without extra bounds checking or error reporting.
We also get a minor boost from the fact that small_uint_from_slice
and big_uint_from_slice
return their corresponding types. Instead of getting back a DecodedUInt
(which needs to be destructured in order to turn it into an Integer
), we get back a u64
or a BigUInt
depending on the branch. We can then call the appropriate into()
method to get an Integer
without the extra destructuring step.
@@ -1195,7 +1200,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> { | |||
// Read the length of the annotations sequence | |||
let annotations_length = self.tx_buffer.read_var_uint()?; | |||
|
|||
// Validate that neither the annotations sequence is not empty. | |||
// Validate that the annotations sequence is not empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Fixing a typo.
@@ -1204,7 +1209,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> { | |||
let expected_value_length = annotations_and_value_length | |||
- annotations_length.size_in_bytes() | |||
- annotations_length.value(); | |||
self.tx_buffer.total_consumed(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Removing a stray accessor function that had no impact on the method logic (and was optimized out of the assembly anyway).
@@ -1749,7 +1754,7 @@ mod tests { | |||
fn debug() -> IonResult<()> { | |||
let data = &[ | |||
0xE0, 0x01, 0x00, 0xEA, // IVM | |||
0xc3, 0xd2, 0x84, 0x11, // {'name': true} | |||
0xc3, 0xd2, 0x84, 0x11, // ({'name': true}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Correcting the text Ion in the comment for this unit test.
magnitude |= byte; | ||
} | ||
magnitude | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Our new helper methods that operate directly on the provided byte slice.
This PR depends on the changes in PR #394. After that PR has been merged, this one will be rebased on top of
main
.This PR:
integer and symbol values are read. It does this by offering
unchecked UInt decoding methods that read a single value from a
provided slice.
bytes to encode. This limits the size of the symbol table to about
18 quintillion symbol values, which seems reasonable.
Together, these reduced the time needed to read every value in a
1.3GB Ion log file by about 6%.
Fixes #395.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.