Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds NativeElementReader, lifts int size restriction #372

Merged
merged 5 commits into from
May 5, 2022

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented May 4, 2022

This PR:

  • Adds NativeElementReader, a pure-Rust implementation of the
    ElementReader API. Iterates over a given byte slice and returns
    materialized OwnedElement representations of the data.
  • Modifies tests/element_test_vectors.rs to be able to run the
    same ion-tests logic using both the IonCElementReader and the
    new NativeElementReader. This includes the creation of a (very
    long) skip list for the NativeElementReader.
  • Renames AnyInt to Integer and moves it to the 'types' module
    alongside Decimal and Timestamp.
  • Renames AnyUInt to UInteger and moves it to the types module.
  • Modifies the UInt decoder to be able to return UInts of any
    size, not just u64s.
  • Modifies the StreamReader trait (and all implementations) to include
    a fn read_integer(&mut self) -> IonResult<Integer> method that can
    return an integer of any size.
  • Fixes a bug in the RawTextReader that caused it to try and step into
    container-typed null values.
  • Adds a check to the Timestamp builder to make sure that the year
    provided is in the range 1-9999 inclusive.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This PR:
* Adds `NativeElementReader` a pure-Rust implementation of the
`ElementReader` API. Iterates over a given byte slice and returns
materialized `OwnedElement` representations of the data.
* Modifies `tests/element_test_vectors.rs` to be able to run the
same `ion-tests` logic using both the `IonCElementReader` and the
new `NativeElementReader`. This includes the creation of a (very
long) skip list for the `NativeElementReader`.
* Renames `AnyInt` to `Integer` and moves it to the 'types' module
alongside `Decimal` and `Timestamp`.
* Renames `AnyUInt` to `UInteger` and moves it to the `types` module.
* Modifies the `UInt` decoder to be able to return `UInt`s of any
size, not just `u64`s.
* Modifies the `StreamReader` trait (and all implementations) to include
a `fn read_integer(&mut self) -> IonResult<Integer>` method that can
return an integer of any size.
* Fixes a bug in the `RawTextReader` that caused it to try and step into
container-typed `null` values.
@zslayton zslayton requested a review from desaikd May 4, 2022 00:40
@@ -11,10 +11,11 @@ use crate::type_qualifier::type_qualifier_symbol;
use digest::{FixedOutput, Output, Reset, Update};
use ion_rs::binary::{self, decimal::DecimalBinaryEncoder, timestamp::TimestampBinaryEncoder};
use ion_rs::types::decimal::Decimal;
use ion_rs::types::integer::Integer;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: Now that the StreamReader trait returns an int of arbitrary size from the new read_integer() method, I've renamed AnyInt to simply Integer, which is more in line with our Decimal and Timestamp wrapper types.

@@ -441,18 +442,35 @@ impl<R: IonDataSource> StreamReader for RawBinaryReader<R> {
}
}

fn read_integer(&mut self) -> IonResult<Integer> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: read_integer returns an Integer, which can be of any size. A doc comment for this method is available on the StreamReader trait that declares it.

itc => unreachable!("Unexpected IonTypeCode: {:?}", itc),
};

Ok(value)
}

fn read_i64(&mut self) -> IonResult<i64> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: read_i64 is now implemented in terms of read_integer. (I'd be open to marking it #[deprecated] at this point.)

Comment on lines +33 to +41
if length <= UINT_STACK_BUFFER_SIZE {
let buffer = &mut [0u8; UINT_STACK_BUFFER_SIZE];
DecodedUInt::read_using_buffer(data_source, length, buffer)
} else {
// We're reading an enormous int. Heap-allocate a Vec to use as storage.
let mut buffer = vec![0u8; length];
DecodedUInt::read_using_buffer(data_source, length, buffer.as_mut_slice())
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: when decoding a UInt, the reader will use a stack-allocated buffer for values that are <= 16 bytes. If a particularly large UInt is encountered, the reader will allocate an appropriately sized Vec to decode it. This should be very rare indeed; however, it does occur in several of the ion-tests. The largest integer in ion-tests is some 1200+ bytes at the moment. At the top of this method, we enforce an overall maximum allowable size, currently set to 2048 bytes. This is to prevent this from becoming a denial-of-service vector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the largest integer we've encountered or heard of in real data is? As an aside I don't think unbounded types are a feature of Ion- I think it's a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the largest integer we've encountered or heard of in real data is?

Excellent question. It's not hard to find ion-java client code using IonReader#bigIntegerValue(), but it's impossible to know whether their data requires that or it's just defensive programming.

A bit of code archeology: back when it was a hobby project,ion-rust originally only supported i64 representations of integer. When Rust v1.26 was released, it introduced a stable i128 data type. I briefly planned to make that the go-to representation for integer; it was stack allocated and had fast intrinsic operations relative to BigInt. Ultimately though, I went with BigInt because I knew there were spec conformance tests that featured enormous integers.

We could have just said "we don't support arbitrarily sized ints" and added those test files to a skip list, but I concluded that when 99.999% of integer use cases will fit in an i64, we don't lose much by using a BigInt instead of an i128 for the remaining .001%.

}
}

fn read_using_buffer<R: IonDataSource>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: this helper method was broken out so the core reading logic would work with either a stack- or heap-allocated buffer.

@@ -0,0 +1,201 @@
use crate::result::{decoding_error, IonError};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: everything in this file was moved wholesale from the value module when I promoted AnyInt to a public-facing type. There are no logic changes.

"Timestamp year '{}' out of range (1-9999)",
self.year
));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: this fixed several failures that were surfaced by ion-tests. I probably should've waited and done it in a follow-on PR, but I got excited.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation question above drove me to read through the ANTLR grammar for text encoding, and I noted the construct used there to enforce this limit:


fragment
YEAR
    : '000'                     [1-9]
    | '00'            [1-9]     DEC_DIGIT
    | '0'   [1-9]     DEC_DIGIT DEC_DIGIT
    | [1-9] DEC_DIGIT DEC_DIGIT DEC_DIGIT
    ;

@@ -0,0 +1,220 @@
use crate::result::IonResult;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: the contents of this file were moved wholesale from the value::reader module when I introduced the NativeElementReader. Now the trait lives in value::reader while the implementations live in value::ion_c_reader and value::native_reader respectively.

@@ -13,56 +13,6 @@ use test_generator::test_resources;
/// Buffer size for our writing tests around round-tripping
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: this file has been heavily refactored so it can run the same tests for both IonCElementReader and NativeElementReader. The logic is unchanged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very reassuring given that the diff is so large GitHub won't load it by default.

type ElementReader = NativeElementReader;

fn global_skip_list() -> SkipList {
&[
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR tour: this is the skip list for the native reader. Glancing at the failures, there seem to be a few bugs that are responsible for the lion's share. I'll start to address them in follow-on PRs.

For the moment, the NativeElementApi is using the native reader but the ion-c writer. Once the native reader is passing most of its ion-tests, I'll switch it over to a native writer. In this way I hope to minimize the research required for each bug; there should only be one untested thing being used at a time.

@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #372 (eee1b2a) into main (1b8cb98) will increase coverage by 0.49%.
The diff coverage is 93.70%.

@@            Coverage Diff             @@
##             main     #372      +/-   ##
==========================================
+ Coverage   90.90%   91.40%   +0.49%     
==========================================
  Files          71       74       +3     
  Lines       10669    10884     +215     
==========================================
+ Hits         9699     9948     +249     
+ Misses        970      936      -34     
Impacted Files Coverage Δ
src/stream_reader.rs 100.00% <ø> (ø)
src/system_reader.rs 85.71% <ø> (ø)
src/text/text_data_source.rs 66.66% <ø> (+33.33%) ⬆️
src/types/mod.rs 84.61% <ø> (ø)
src/types/integer.rs 77.27% <77.27%> (ø)
src/text/parsers/integer.rs 96.82% <91.30%> (-3.18%) ⬇️
src/binary/raw_binary_reader.rs 92.74% <91.66%> (+0.76%) ⬆️
src/text/raw_text_reader.rs 94.60% <94.11%> (+3.79%) ⬆️
src/binary/uint.rs 94.69% <96.96%> (+1.90%) ⬆️
src/value/ion_c_reader.rs 98.00% <98.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b8cb98...eee1b2a. Read the comment docs.

Copy link
Contributor

@jobarr-amzn jobarr-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very exciting! I have some questions to clarify about false/true/nan as annotations but no reason I see not to ship this. My Rust comprehension is still, well, not yet oxidized but it looks good to me.

const MAX_UINT_SIZE_IN_BYTES: usize = mem::size_of::<UIntStorage>();
// This limit is used for stack-allocating buffer space to encode/decode UInts.
const UINT_STACK_BUFFER_SIZE: usize = 16;
// This number was chosen somewhat arbitrarily and could be lifted if a use case demands it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valuable comment, thanks :)

Comment on lines +33 to +41
if length <= UINT_STACK_BUFFER_SIZE {
let buffer = &mut [0u8; UINT_STACK_BUFFER_SIZE];
DecodedUInt::read_using_buffer(data_source, length, buffer)
} else {
// We're reading an enormous int. Heap-allocate a Vec to use as storage.
let mut buffer = vec![0u8; length];
DecodedUInt::read_using_buffer(data_source, length, buffer.as_mut_slice())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the largest integer we've encountered or heard of in real data is? As an aside I don't think unbounded types are a feature of Ion- I think it's a bug.

Comment on lines +1070 to +1087
let mut reader = RawTextReader::new(&pretty_ion[..]);
assert_eq!(reader.next()?, RawStreamItem::Value(IonType::SExpression));
reader.step_in()?;
assert_eq!(reader.next()?, RawStreamItem::Value(IonType::Struct));

reader.step_in()?;
assert_eq!(reader.next()?, RawStreamItem::Value(IonType::Integer));
assert_eq!(reader.field_name()?, RawSymbolToken::Text("a".to_string()));
assert_eq!(reader.read_i64()?, 1);
assert_eq!(reader.next()?, RawStreamItem::Value(IonType::Integer));
assert_eq!(reader.field_name()?, RawSymbolToken::Text("b".to_string()));
assert_eq!(reader.read_i64()?, 2);
reader.step_out()?;

assert_eq!(reader.next()?, RawStreamItem::Value(IonType::Struct));
reader.step_out()?;
Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinions vary on this sort of thing but I'd rather see a different assertion convention, something like helper methods that remove a lot of the noise. It ought to be possible to compose a series of helpers then hand it a reader to verify the expected structure and content from that point.

It never seems worth it for any single test but when we're looking at many KLOC and (hopefully) a solid decade or several in the future of ion-rust hopefully this improves at some point.

Nothing to do here right now, this is merely my reaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that that would be helpful.

The current unit tests in ion-rust are a product of the library's bootstrapping; a lot of them were written back when the library was missing a lot of features/abstractions and have only been updated to keep the project compiling over time. Now that we have some ergonomic features (like text Ion support!), a lot of the tests could be rewritten for improved readability.

Comment on lines +1089 to +1113
#[test]
fn annotation_false() -> IonResult<()> {
// The reader will reject the unquoted boolean keyword 'false' when used as an annotation
let pretty_ion = br#"
false::23
"#;
let mut reader = RawTextReader::new(&pretty_ion[..]);
let result = reader.next();
println!("{:?}", result);
assert!(result.is_err());
Ok(())
}

#[test]
fn annotation_nan() -> IonResult<()> {
// The reader will reject the unquoted float keyword 'nan' when used as an annotation
let pretty_ion = br#"
nan::23
"#;
let mut reader = RawTextReader::new(&pretty_ion[..]);
let result = reader.next();
println!("{:?}", result);
assert!(result.is_err());
Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these two mildly surprising, so I went to go read first the Type Annotations section of the Ion specification and then the text ANTLR grammar. The spec didn't call out this kind of limitation, and my read of the ANTLR grammar (disclaimer I could be bad at this, relevant fragment reproduced below) doesn't seem to support this either.

top_level_value
    : annotation+ top_level_value
    | delimiting_entity
    // numeric literals (if followed by something), need to be followed by
    // whitespace or a token that is either quoted (e.g. string) or
    // starts with punctuation (e.g. clob, struct, list)
    | numeric_entity ws
    | numeric_entity quoted_annotation value
    | numeric_entity delimiting_entity
    // literals that are unquoted symbols or keywords have a similar requirement
    // as the numerics above, they have different productions because the
    // rules for numerics are the same in s-expressions, but keywords
    // have different rules between top-level and s-expressions.
    | keyword_entity ws
    | keyword_entity quoted_annotation value
    | keyword_entity keyword_delimiting_entity
    ;

annotation
    : symbol ws* COLON COLON ws*
    ;

symbol
    : IDENTIFIER_SYMBOL
    // note that this is because we recognize the type names for null
    // they are ordinary symbols on their own
    | TYPE
    | QUOTED_SYMBOL
    ;

IDENTIFIER_SYMBOL
    : [$_a-zA-Z] ([$_a-zA-Z] | DEC_DIGIT)*
    ;

numeric_entity
    : BIN_INTEGER
    | DEC_INTEGER
    | HEX_INTEGER
    | TIMESTAMP
    | FLOAT
    | DECIMAL
    ;

Is this how annotations should work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cases like this, I tend to defer to ion-tests. Here are some relevant tests from the bad examples directory:

ion-java does not allow unquoted keywords as annotations, which jibes with these ion-tests.

"Timestamp year '{}' out of range (1-9999)",
self.year
));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation question above drove me to read through the ANTLR grammar for text encoding, and I noted the construct used there to enforce this limit:


fragment
YEAR
    : '000'                     [1-9]
    | '00'            [1-9]     DEC_DIGIT
    | '0'   [1-9]     DEC_DIGIT DEC_DIGIT
    | [1-9] DEC_DIGIT DEC_DIGIT DEC_DIGIT
    ;

) -> IonResult<Box<dyn Iterator<Item = IonResult<OwnedElement>> + 'b>> {
// Match against the slice to see if it starts with the binary IVM
match data {
&[0xe0, 0x01, 0x00, 0xea, ..] => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to get re-use here between this format detection and the IVM parsing elsewhere in the library?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we probably need a centralized helper method to check for this.

@@ -13,56 +13,6 @@ use test_generator::test_resources;
/// Buffer size for our writing tests around round-tripping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very reassuring given that the diff is so large GitHub won't load it by default.

@zslayton zslayton merged commit d649d54 into main May 5, 2022
@zslayton zslayton deleted the native-element-reader branch May 5, 2022 15:31
zslayton added a commit that referenced this pull request May 6, 2022
* Adds `NativeElementReader`, lifts int size restriction

This PR:
* Adds `NativeElementReader` a pure-Rust implementation of the
`ElementReader` API. Iterates over a given byte slice and returns
materialized `OwnedElement` representations of the data.
* Modifies `tests/element_test_vectors.rs` to be able to run the
same `ion-tests` logic using both the `IonCElementReader` and the
new `NativeElementReader`. This includes the creation of a (very
long) skip list for the `NativeElementReader`.
* Renames `AnyInt` to `Integer` and moves it to the 'types' module
alongside `Decimal` and `Timestamp`.
* Renames `AnyUInt` to `UInteger` and moves it to the `types` module.
* Modifies the `UInt` decoder to be able to return `UInt`s of any
size, not just `u64`s.
* Modifies the `StreamReader` trait (and all implementations) to include
a `fn read_integer(&mut self) -> IonResult<Integer>` method that can
return an integer of any size.
* Fixes a bug in the `RawTextReader` that caused it to try and step into
container-typed `null` values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants