Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizes reading binary symbol and integer values #396

Merged
merged 5 commits into from
Jul 6, 2022
Merged

Conversation

zslayton
Copy link
Contributor

This PR depends on the changes in PR #394. After that PR has been merged, this one will be rebased on top of main.


This PR:

  • Eliminates redundant boundary checking that happens when
    integer and symbol values are read. It does this by offering
    unchecked UInt decoding methods that read a single value from a
    provided slice.
  • Removes support for reading symbol IDs that require more than 8
    bytes to encode. This limits the size of the symbol table to about
    18 quintillion symbol values, which seems reasonable.
  • Adds Into implementations for Rust's integer primitives.

Together, these reduced the time needed to read every value in a
1.3GB Ion log file by about 6%.

Fixes #395.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zslayton added 4 commits June 25, 2022 11:24
The existing `RawBinaryReader` wraps an `io::BufRead` implementation
and pulls additional bytes from it as needed. This allows it to read
large files in a streaming fashion; however, it also means that the
file must be complete. If the data source is still growing (for
example: a file that is being appended to), the streaming reader will
treat the end of the stream as an unexpected EOF and error out.
Likewise, in async contexts where bytes can arrive in separate events,
trying to read chunk of Ion bytes can fail if the slice doesn't contain
a complete Ion stream.

This PR adds two new types: a `BinaryBuffer` and a `RawBinaryBufferReader`.

The `BinaryBuffer` is a stack-allocated wrapper around an implementation of
`AsRef<u8>`. It provides methods to read binary Ion encoding primitives like
`VarUInt`, `VarInt`, and `NOP` from the data source. It is very cheap to
create and does not modify the data source.

The `RawBinaryBufferReader` provides a full `Reader` interface over a
`BinaryBuffer`. If it encounters an unexpected EOF while reading, it returns
an `IonError::Incomplete`. Unlike the original `RawBinaryReader`, however,
it is guaranteed to still be in a good state afterwards. If more data
becomes available later on, it can be appended to the buffer and the
`RawBinaryBufferReader` can try parsing it again.

Finally, a new configuration for the `ion-tests` integration suite has been
added that exercises the `Reader` interface in the `RawBinaryBufferReader`,
demonstrating that it is spec-compliant.

This PR duplicates a fair amount of functionality that is available
elsewhere in the crate. Follow-on PRs will consolidate these implementations;
it is expected that the blocking raw binary reader will be re-implemented
as a wrapper around the non-blocking reader, making calls to `read` whenever
an `Incomplete` is encountered.
This PR:

* Eliminates redundant boundary checking that happens when
integer and symbol values are read. It does this by offering
unchecked UInt decoding methods that read a single value from a
provided slice.
* Removes support for reading symbol IDs that require more than 8
bytes to encode. This limits the size of the symbol table to about
18 quintillion symbol values, which seems reasonable.
* Adds Into<Integer> implementations for Rust's integer primitives.

Together, these reduced the time needed to read every value in a
1.3GB Ion log file by about 6%.

Fixes #395.
@codecov
Copy link

codecov bot commented Jun 30, 2022

Codecov Report

Merging #396 (ceae311) into main (d7c4348) will decrease coverage by 0.02%.
The diff coverage is 78.94%.

@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
- Coverage   88.43%   88.41%   -0.03%     
==========================================
  Files          82       82              
  Lines       14027    14042      +15     
==========================================
+ Hits        12405    12415      +10     
- Misses       1622     1627       +5     
Impacted Files Coverage Δ
src/types/integer.rs 81.42% <56.25%> (-2.08%) ⬇️
src/binary/non_blocking/raw_binary_reader.rs 86.52% <88.88%> (+<0.01%) ⬆️
src/binary/non_blocking/binary_buffer.rs 97.04% <100.00%> (-0.03%) ⬇️
src/binary/uint.rs 93.63% <100.00%> (+0.43%) ⬆️
src/reader.rs 84.61% <0.00%> (-0.49%) ⬇️
src/text/raw_text_reader.rs 90.75% <0.00%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7c4348...ceae311. Read the comment docs.

@@ -93,7 +93,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
/// If the buffer is not empty, returns `Some(_)` containing the next byte in the buffer.
/// Otherwise, returns `None`.
pub fn peek_next_byte(&self) -> Option<u8> {
self.bytes().get(0).copied()
self.data.as_ref().get(self.start).copied()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The original code created a byte slice (&[u8]) even though we only needed the first byte in the data. Now it just gets the first byte without creating an intermediate slice.

@@ -262,7 +262,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
///
/// See: https://amzn.github.io/ion-docs/docs/binary.html#uint-and-int-fields
pub fn read_uint(&mut self, length: usize) -> IonResult<DecodedUInt> {
if length <= mem::size_of::<usize>() {
if length <= mem::size_of::<u64>() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Here and below: the code was originally checking for usize, which had the potential to be overly conservative. On 32-bit systems, this would cause medium-sized values to be pushed into a BigInt.

In some cases (like reading a UInt that represents a SymbolId), limiting it to usize makes sense. However, that's not this method's responsibility.

magnitude <<= 8;
magnitude |= byte;
}
let magnitude = DecodedUInt::small_uint_from_slice(uint_bytes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This bitshifting loop now lives in DecodedUInt::small_uint_from_slice so it can be called from multiple locations. I'll point out the other caller location below.

decoding_error("found a u64 symbol ID that was too large to fit in a usize")
}
}
UInteger::BigUInt(symbol_id) => Self::try_symbol_id_from_big_uint(symbol_id),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This code path (try_symbol_id_from_big_uint) would only succeed if the BigUInt was small enough that it could have been encoded as a usize to begin with. This code path existed solely to satisfy one of the ion-tests:

ion-tests/iontestdata/good/typecodes/T7-large.10n

Text view of its contents:

---------------------------------------------------------------------------
 Offset   |  Length   |        Binary Ion        |         Text Ion
---------------------------------------------------------------------------
          |         4 | e0 01 00 ea              |  // Ion 1.0 Version Marker
        4 |         6 | 75 00 00 00 00 00        |   '$0' // $0
       10 |         7 | 76 00 00 00 00 00 00     |   '$0' // $0
       17 |         8 | 77 00 00 00 00 00 00 00  |   '$0' // $0
       25 |         9 | 78 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00                       |
       34 |        10 | 79 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00                    |
       44 |        11 | 7a 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00                 |
       55 |        12 | 7b 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00              |
       67 |        13 | 7c 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00           |
       80 |        14 | 7d 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00 00        |
       94 |        16 | 7e 8e 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00 00 00 00  |

I have removed that code path altogether and added the test to the skip list. I don't believe this represents a substantive loss of functionality.

Comment on lines -478 to -479
let (encoded_value, mut buffer) = self.value_and_buffer(IonType::Symbol)?;
match buffer.read_uint(encoded_value.value_length())?.value() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, we constructed a BinaryBuffer that wrapped the value bytes. This allowed us to call buffer.read_uint(...), which was very ergonomic. However, doing this involved a few redundant steps:

  • It constructed the same slice (&[u8]) containing the value's body more than once.
  • It checked the number of bytes remaining in the buffer more than once.
  • It had multiple error handling paths even though at most one of them could be called.

Comment on lines +690 to +695
let (encoded_value, bytes) = self.value_and_bytes(IonType::Integer)?;
let value: Integer = if bytes.len() <= mem::size_of::<u64>() {
DecodedUInt::small_uint_from_slice(bytes).into()
} else {
DecodedUInt::big_uint_from_slice(bytes).into()
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This optimization is similar to the read_symbol_id optimization above. We:

  • Use bytes directly instead of creating an intermediate buffer.
  • Use our new helper method to read all of bytes without extra bounds checking or error reporting.

We also get a minor boost from the fact that small_uint_from_slice and big_uint_from_slice return their corresponding types. Instead of getting back a DecodedUInt (which needs to be destructured in order to turn it into an Integer), we get back a u64 or a BigUInt depending on the branch. We can then call the appropriate into() method to get an Integer without the extra destructuring step.

@@ -1195,7 +1200,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> {
// Read the length of the annotations sequence
let annotations_length = self.tx_buffer.read_var_uint()?;

// Validate that neither the annotations sequence is not empty.
// Validate that the annotations sequence is not empty.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Fixing a typo.

@@ -1204,7 +1209,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> {
let expected_value_length = annotations_and_value_length
- annotations_length.size_in_bytes()
- annotations_length.value();
self.tx_buffer.total_consumed();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Removing a stray accessor function that had no impact on the method logic (and was optimized out of the assembly anyway).

@@ -1749,7 +1754,7 @@ mod tests {
fn debug() -> IonResult<()> {
let data = &[
0xE0, 0x01, 0x00, 0xEA, // IVM
0xc3, 0xd2, 0x84, 0x11, // {'name': true}
0xc3, 0xd2, 0x84, 0x11, // ({'name': true})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Correcting the text Ion in the comment for this unit test.

magnitude |= byte;
}
magnitude
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Our new helper methods that operate directly on the provided byte slice.

@zslayton zslayton requested review from desaikd and nirosys June 30, 2022 16:40
Base automatically changed from binary-buffer to main July 6, 2022 00:29
@zslayton zslayton merged commit 938663e into main Jul 6, 2022
@zslayton zslayton deleted the faster-uints branch July 6, 2022 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove support for oversized SymbolIds
3 participants