-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework ion-hash representations #268
Conversation
This is useful for debugging, or printing error messages.
Before this commit, every `Element` visited during ion hash computation would need to allocate a vector: because each element type can be represented by a variable number of bytes! For example, a boolean has no representation while a string is representated by its UTF8 bytes. This commit fixes that by having the representation implementation write to the digester directly. For some types (like a BigInt) we will still need a variable length vector. But in other cases, we now no longer need to allocate. While implementing this, I wanted to make the escape method also not require allocation. This is a simple case of wrapping the `update` method to maybe write two bytes per byte written. However, because we were dependent on the `Digest` trait, writing the wrapper required wrapping all of those methods. In this commit, we move away from the `Digest` trait and rather use the traits that *make up* that trait. This means the escaping wrapper just needs to worry about `update`.
Codecov Report
@@ Coverage Diff @@
## main #268 +/- ##
==========================================
- Coverage 91.51% 91.49% -0.02%
==========================================
Files 55 55
Lines 8099 8094 -5
==========================================
- Hits 7412 7406 -6
- Misses 687 688 +1
Continue to review full report at Codecov.
|
pub struct IonHasher<D> | ||
where | ||
D: Digest, | ||
D: Update + FixedOutput + Reset + Clone + Default, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the Digest
bound to D
to make this explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the value be? I kind of like enumerating the APIs we need (that we need a thing that goes quack; it need not be a duck).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, I can't think of it--the Digest
trait and its factoring is a bit weird IMO.
ion-hash/src/representation.rs
Outdated
match elem.ion_type() { | ||
IonType::Null | IonType::Boolean => todo!(), | ||
IonType::Integer => elem.as_any_int().repr(), | ||
IonType::Float => elem.as_f64().repr(), | ||
IonType::Null | IonType::Boolean => {} // these types have no representation | ||
IonType::Integer => write_repr_integer(elem.as_any_int(), &mut hasher), | ||
IonType::Float => write_repr_float(elem.as_f64(), &mut hasher), | ||
IonType::Decimal | IonType::Timestamp | IonType::Symbol => todo!(), | ||
IonType::String => elem.as_str().repr(), | ||
IonType::String => write_repr_string(elem.as_str(), &mut hasher), | ||
IonType::Clob | IonType::Blob | IonType::List | IonType::SExpression | IonType::Struct => { | ||
todo!() | ||
panic!("type {} is not yet supported", elem.ion_type()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An idea for future factoring (I am not suggesting do it now or even that it is the best factoring), should we model these as extension methods (via trait)? I see this & mut hasher
every where and it has a smell to it (I read that as &mut self
).
We could play a trick with a trait that has a blanket implementation for all D: Digest + Update
to make it work.
Might be overkill TBH, so just a thought...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't really see the value in it. The trait just adds more lines of codes, and no safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I quite liked that only the struct one needs FixedOutput+Default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't feel strongly other than it really does look like hasher.write_repr_string(elem.as_str())
to me.
ion-hash/src/representation.rs
Outdated
}, | ||
Some(AnyInt::BigInt(b)) => Some(b.magnitude().to_bytes_be()), | ||
None => None, | ||
struct UpdateEscaping<'a, U: Update>(&'a mut U); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually those write_repr
methods could just go on this struct
's implementation....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like that newer languages like Rust and Kotlin don't force you to have methods on objects - you're free to organize functions however you want! The current structure in this PR spends no effort trying to tease out or rationalize a new concept here. We have a method that writes out representations, it dispatches with a match statement, and we have an escaping wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't feel super strong about it, it just did trigger a code smell to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upcoming struct variant might make this more clear..
ion-hash/src/representation.rs
Outdated
Some(s) => Some(s.as_bytes().into()), | ||
None => None, | ||
} | ||
fn write_repr_integer<U: Update>(value: Option<&AnyInt>, hasher: &mut UpdateEscaping<'_, U>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even more so, this definition looks like it should be a method.
ion-hash/src/representation.rs
Outdated
hasher.update(&[0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]) | ||
}; | ||
} | ||
if v.is_infinite() { | ||
return if v.is_sign_positive() { | ||
hasher.update(&[0x7F, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]) | ||
} else { | ||
hasher.update(&[0xFF, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, it would be nice to have this in constants (magic numbers and all...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ion-hash/src/lib.rs
Outdated
self.hash_no_repr(elem); | ||
if let Some(repr) = representation::binary_repr(elem) { | ||
self.hasher.update(ion_hash_escape(&repr[..])); | ||
} | ||
representation::write_repr(elem, &mut self.hasher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got confused by this terminology
no_repr
is confusing, I might spell this as hash_type
or something, because IIRC, you need to do this for all values, right? write_repr
might be better spelled as scalar_repr
or something like that.
Channeling my inner @zslayton also I would suggest we spell out the terms to be consistent with the style of the other crates (e.g. representation
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
impl fmt::Display for IonType { | ||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | ||
write!( | ||
f, | ||
"{}", | ||
match self { | ||
IonType::Null => "null", | ||
IonType::Boolean => "boolean", | ||
IonType::Integer => "integer", | ||
IonType::Float => "float", | ||
IonType::Decimal => "decimal", | ||
IonType::Timestamp => "timestamp", | ||
IonType::Symbol => "symbol", | ||
IonType::String => "string", | ||
IonType::Clob => "clob", | ||
IonType::Blob => "blob", | ||
IonType::List => "list", | ||
IonType::SExpression => "sexp", | ||
IonType::Struct => "struct", | ||
} | ||
) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, where do you need this? Is the Debug
not sufficient for some use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep doing println-debugging and {}
is just natural. I can remove it if you want..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No biggie--I was just wondering.
Co-authored-by: Almann Goo <almann.goo@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor concern around some struct
related concern, but otherwise 🚢
/// Ion hash defines representations for some special values. | ||
struct Floats; | ||
impl Floats { | ||
const NEGATIVE_ZERO: [u8; 8] = [0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]; | ||
const POSITIVE_INFINITY: [u8; 8] = [0x7F, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]; | ||
const NEGATIVE_INFINITY: [u8; 8] = [0xFF, 0xF0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]; | ||
const NAN: [u8; 8] = [0x7F, 0xF8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the namespacing, but I'd be cool with just keeping them flat in the module.
Not part of this PR.
Before this commit, every
Element
visited during ion hash computationwould need to allocate a vector: because each element type can be
represented by a variable number of bytes! For example, a boolean has no
representation while a string is representated by its UTF8 bytes.
This commit fixes that by having the representation implementation write
to the digester directly. For some types (like a BigInt) we will still
need a variable length vector. But in other cases, we now no longer need
to allocate.
While implementing this, I wanted to make the escape method also not
require allocation. This is a simple case of wrapping the
update
method to maybe write two bytes per byte written. However, because we
were dependent on the
Digest
trait, writing the wrapper requiredwrapping all of those methods. In this commit, we move away from the
Digest
trait and rather use the traits that make up that trait. Thismeans the escaping wrapper just needs to worry about
update
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.