Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework ion-hash representations #268

Merged
merged 8 commits into from
May 27, 2021

Conversation

marcbowes
Copy link
Contributor

@marcbowes marcbowes commented May 26, 2021

Before this commit, every Element visited during ion hash computation
would need to allocate a vector: because each element type can be
represented by a variable number of bytes! For example, a boolean has no
representation while a string is representated by its UTF8 bytes.

This commit fixes that by having the representation implementation write
to the digester directly. For some types (like a BigInt) we will still
need a variable length vector. But in other cases, we now no longer need
to allocate.

While implementing this, I wanted to make the escape method also not
require allocation. This is a simple case of wrapping the update
method to maybe write two bytes per byte written. However, because we
were dependent on the Digest trait, writing the wrapper required
wrapping all of those methods. In this commit, we move away from the
Digest trait and rather use the traits that make up that trait. This
means the escaping wrapper just needs to worry about update.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

marcbowes added 2 commits May 25, 2021 19:24
This is useful for debugging, or printing error messages.
Before this commit, every `Element` visited during ion hash computation
would need to allocate a vector: because each element type can be
represented by a variable number of bytes! For example, a boolean has no
representation while a string is representated by its UTF8 bytes.

This commit fixes that by having the representation implementation write
to the digester directly. For some types (like a BigInt) we will still
need a variable length vector. But in other cases, we now no longer need
to allocate.

While implementing this, I wanted to make the escape method also not
require allocation. This is a simple case of wrapping the `update`
method to maybe write two bytes per byte written. However, because we
were dependent on the `Digest` trait, writing the wrapper required
wrapping all of those methods. In this commit, we move away from the
`Digest` trait and rather use the traits that *make up* that trait. This
means the escaping wrapper just needs to worry about `update`.
@marcbowes marcbowes requested a review from zslayton May 26, 2021 02:35
@codecov
Copy link

codecov bot commented May 26, 2021

Codecov Report

Merging #268 (5849b44) into main (8040a39) will decrease coverage by 0.01%.
The diff coverage is 70.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
- Coverage   91.51%   91.49%   -0.02%     
==========================================
  Files          55       55              
  Lines        8099     8094       -5     
==========================================
- Hits         7412     7406       -6     
- Misses        687      688       +1     
Impacted Files Coverage Δ
ion-hash/src/type_qualifier.rs 89.47% <ø> (ø)
src/types/mod.rs 70.37% <0.00%> (-18.70%) ⬇️
ion-hash/tests/ion_hash_tests.rs 87.93% <84.21%> (+14.30%) ⬆️
ion-hash/src/lib.rs 96.87% <87.50%> (-0.35%) ⬇️
ion-hash/src/representation.rs 88.23% <88.63%> (-2.01%) ⬇️
src/lib.rs 65.92% <0.00%> (+1.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8040a39...5849b44. Read the comment docs.

pub struct IonHasher<D>
where
D: Digest,
D: Update + FixedOutput + Reset + Clone + Default,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the Digest bound to D to make this explicit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the value be? I kind of like enumerating the APIs we need (that we need a thing that goes quack; it need not be a duck).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, I can't think of it--the Digest trait and its factoring is a bit weird IMO.

Comment on lines 23 to 31
match elem.ion_type() {
IonType::Null | IonType::Boolean => todo!(),
IonType::Integer => elem.as_any_int().repr(),
IonType::Float => elem.as_f64().repr(),
IonType::Null | IonType::Boolean => {} // these types have no representation
IonType::Integer => write_repr_integer(elem.as_any_int(), &mut hasher),
IonType::Float => write_repr_float(elem.as_f64(), &mut hasher),
IonType::Decimal | IonType::Timestamp | IonType::Symbol => todo!(),
IonType::String => elem.as_str().repr(),
IonType::String => write_repr_string(elem.as_str(), &mut hasher),
IonType::Clob | IonType::Blob | IonType::List | IonType::SExpression | IonType::Struct => {
todo!()
panic!("type {} is not yet supported", elem.ion_type())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An idea for future factoring (I am not suggesting do it now or even that it is the best factoring), should we model these as extension methods (via trait)? I see this & mut hasher every where and it has a smell to it (I read that as &mut self).

We could play a trick with a trait that has a blanket implementation for all D: Digest + Update to make it work.

Might be overkill TBH, so just a thought...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't really see the value in it. The trait just adds more lines of codes, and no safety.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I quite liked that only the struct one needs FixedOutput+Default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't feel strongly other than it really does look like hasher.write_repr_string(elem.as_str()) to me.

},
Some(AnyInt::BigInt(b)) => Some(b.magnitude().to_bytes_be()),
None => None,
struct UpdateEscaping<'a, U: Update>(&'a mut U);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually those write_repr methods could just go on this struct's implementation....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that newer languages like Rust and Kotlin don't force you to have methods on objects - you're free to organize functions however you want! The current structure in this PR spends no effort trying to tease out or rationalize a new concept here. We have a method that writes out representations, it dispatches with a match statement, and we have an escaping wrapper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't feel super strong about it, it just did trigger a code smell to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upcoming struct variant might make this more clear..

Some(s) => Some(s.as_bytes().into()),
None => None,
}
fn write_repr_integer<U: Update>(value: Option<&AnyInt>, hasher: &mut UpdateEscaping<'_, U>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even more so, this definition looks like it should be a method.

Comment on lines 78 to 85
hasher.update(&[0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
};
}
if v.is_infinite() {
return if v.is_sign_positive() {
hasher.update(&[0x7F, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
} else {
hasher.update(&[0xFF, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, it would be nice to have this in constants (magic numbers and all...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 112 to 113
self.hash_no_repr(elem);
if let Some(repr) = representation::binary_repr(elem) {
self.hasher.update(ion_hash_escape(&repr[..]));
}
representation::write_repr(elem, &mut self.hasher);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got confused by this terminology

no_repr is confusing, I might spell this as hash_type or something, because IIRC, you need to do this for all values, right? write_repr might be better spelled as scalar_repr or something like that.

Channeling my inner @zslayton also I would suggest we spell out the terms to be consistent with the style of the other crates (e.g. representation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Comment on lines +36 to +58
impl fmt::Display for IonType {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{}",
match self {
IonType::Null => "null",
IonType::Boolean => "boolean",
IonType::Integer => "integer",
IonType::Float => "float",
IonType::Decimal => "decimal",
IonType::Timestamp => "timestamp",
IonType::Symbol => "symbol",
IonType::String => "string",
IonType::Clob => "clob",
IonType::Blob => "blob",
IonType::List => "list",
IonType::SExpression => "sexp",
IonType::Struct => "struct",
}
)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, where do you need this? Is the Debug not sufficient for some use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep doing println-debugging and {} is just natural. I can remove it if you want..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No biggie--I was just wondering.

Copy link
Contributor

@almann almann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor concern around some struct related concern, but otherwise 🚢

Comment on lines +85 to +92
/// Ion hash defines representations for some special values.
struct Floats;
impl Floats {
const NEGATIVE_ZERO: [u8; 8] = [0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
const POSITIVE_INFINITY: [u8; 8] = [0x7F, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
const NEGATIVE_INFINITY: [u8; 8] = [0xFF, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
const NAN: [u8; 8] = [0x7F, 0xF8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the namespacing, but I'd be cool with just keeping them flat in the module.

Not part of this PR.
@marcbowes marcbowes merged commit 5ea8564 into amazon-ion:main May 27, 2021
@marcbowes marcbowes deleted the ion-hash-structs branch June 3, 2021 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants