
MD5 Implementation #8272

Closed

Conversation

@DaGenix commented Aug 3, 2013

An MD5 implementation was originally included in #8097, but, since there are a couple different implementations of that digest algorithm (@alco mentioned his implementation on the mailing list just before I opened that PR), it was suggested that I remove it from that PR and open up a new PR to discuss the different implementations and the best way forward. If anyone wants to discuss a different implementation, feel free to present it here and discuss and compare it to this one. I'll just discuss my implementation and I'll leave it to others to present details of theirs.

This implementation relies on the FixedBuffer struct from cryptoutil.rs for managing the input buffer, just like the Sha1 and Sha2 digest implementations do. I tried manually unrolling the loops in the compression function, but I got slightly worse performance when I did that.
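
For anyone comparing implementations, here is a minimal sketch (in today's Rust syntax, not the code from this PR) of the standard MD5 round functions from RFC 1321 and the per-step operation that the compression function loops over, 64 steps per 64-byte block:

```rust
// Not the code from this PR -- just the four MD5 round functions from RFC 1321
// and one step of the compression function, to show what the "compression
// function" above refers to.
fn f(x: u32, y: u32, z: u32) -> u32 { (x & y) | (!x & z) }
fn g(x: u32, y: u32, z: u32) -> u32 { (x & z) | (y & !z) }
fn h(x: u32, y: u32, z: u32) -> u32 { x ^ y ^ z }
fn i(x: u32, y: u32, z: u32) -> u32 { y ^ (x | !z) }

// One step: a = b + ((a + round_fn + msg_word + constant) <<< shift)
fn md5_step(a: u32, b: u32, round_fn: u32, msg_word: u32, constant: u32, shift: u32) -> u32 {
    a.wrapping_add(round_fn)
        .wrapping_add(msg_word)
        .wrapping_add(constant)
        .rotate_left(shift)
        .wrapping_add(b)
}
```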

Outside of the #[test]s, I also tested the implementation by generating 1,000 inputs of up to 10MB in size and checking the MD5 digest calculated by this code against the MD5 digest calculated by Java's implementation.

On my computer, I'm getting the following performance:

test md5::bench::md5_10 ... bench: 52 ns/iter (+/- 1) = 192 MB/s
test md5::bench::md5_1k ... bench: 2819 ns/iter (+/- 44) = 363 MB/s
test md5::bench::md5_64k ... bench: 178566 ns/iter (+/- 4927) = 367 MB/s

@bluss (Member) commented Aug 3, 2013

We need to discuss the I/O traits (Reader and Writer) at some point. Maybe not now; I'm just making you aware. The SipHash in libstd does implement Writer, and I think it would make sense for all digest objects to do so too, eventually.

Right now — since the move to the new I/O traits is not completed — it's not really a convenient time. I was thinking of this because the I/O module already defines reading/writing things like u64_le and so on.
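
As a point of reference, MD5 itself needs exactly this kind of little-endian handling when it appends the 64-bit message length during padding. A small sketch in current Rust (not the 2013 I/O API):

```rust
// Illustrative only: MD5 padding appends a 0x80 byte, zero bytes until the
// length is 56 mod 64, then the message length in bits as a little-endian u64
// -- the kind of place a u64_le writer helper would be used.
fn md5_padding(message_len_bytes: u64) -> Vec<u8> {
    let bit_len = message_len_bytes.wrapping_mul(8);
    let mut pad = vec![0x80u8];
    while (message_len_bytes as usize + pad.len()) % 64 != 56 {
        pad.push(0);
    }
    pad.extend_from_slice(&bit_len.to_le_bytes());
    pad
}
```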

@alco (Contributor) commented Aug 5, 2013

I'm having issues building your branch and I don't have much free time to deal with this anymore. My MD5 implementation is based on the MD4 implementation that is currently in Rust, so it should be pretty straightforward to compare it to this PR if anyone is interested.

@DaGenix (Author) commented Aug 7, 2013

@blake2-ppc I'm not sure I have a fully baked opinion on implementing Writer. One thing that feels weird to me is that the Writer trait has a flush() method. This method is unnecessary for Digests, but if they all implement Writer, then they'd all have to have no-op versions of it.

Writer also defines an error handling protocol. That protocol isn't really applicable to Digests, since they generally can't fail in a way a reasonable application would be able to recover from. Md5 can't fail at all; Sha1 and Sha2 can only fail if the input size is too large. I don't see how an application can recover from that error, though - if the application needs a Sha1 hash but the input is too large, I don't think there's much that can be done. In general, an application that needs a Sha1 hash needs a Sha1 hash specifically and won't be able to detect failure and then fall back to MD5 or Sha-512 - if those hashes were acceptable, it would have used them in the first place.

What about a set of adapter objects that implement Writer and delegate to a Digest? Maybe DigestWriter and DigestStreamWriter? The former would just send the input to Digest, while the latter would send it to the Digest and then also to another Writer for processing?
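
To make that proposal concrete, here is a hypothetical sketch of the two adapters. The names come from the comment above, the Digest trait shown is a simplified stand-in, and the sketch targets today's std::io::Write rather than the Writer trait of the time:

```rust
use std::io::{self, Write};

// A simplified stand-in for the Digest trait under discussion.
trait Digest {
    fn input(&mut self, data: &[u8]);
}

// Feeds written bytes into a Digest and nothing else.
struct DigestWriter<D: Digest> {
    digest: D,
}

impl<D: Digest> Write for DigestWriter<D> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.digest.input(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(()) // nothing buffered in the adapter itself
    }
}

// Feeds written bytes into a Digest and also forwards them to an inner writer.
struct DigestStreamWriter<D: Digest, W: Write> {
    digest: D,
    inner: W,
}

impl<D: Digest, W: Write> Write for DigestStreamWriter<D, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.digest.input(&buf[..n]); // hash only the bytes actually written
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```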

@DaGenix (Author) commented Aug 7, 2013

I rebased to get rid of uses of "foreach."

@emberian (Member) commented Aug 7, 2013

@DaGenix flush should probably have a no-op default implementation. I don't think your concerns about error handling are relevant: it just won't raise the error if it can't happen, and if nothing can be done to handle the error for other hashes, that's fine too. There's no problem using a condition, and a condition is nicer than task failure when the user does want to handle the error.
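
A minimal illustration of that suggestion, using a hypothetical trait rather than the actual Writer trait of the time:

```rust
// Sketch of the default-method idea: if the trait supplies a no-op flush,
// digest-like writers only need to implement write.
trait Writer {
    fn write(&mut self, buf: &[u8]);

    fn flush(&mut self) {
        // default: nothing to do for writers that don't buffer
    }
}

struct NullDigest;

impl Writer for NullDigest {
    fn write(&mut self, _buf: &[u8]) {
        // a real digest would absorb the bytes here
    }
    // no flush() needed -- the default no-op applies
}
```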

@DaGenix (Author) commented Aug 8, 2013

@cmr I wasn't saying that I think anything is necessarily wrong with the Writer trait. All I'm trying to say is that the Writer trait defines a bunch of behavior that is irrelevant for most or all Digests.

  • flush() is one example.
  • A second is that Writer defines raising an io_error condition, which I don't think makes sense for Digests. A caller that expects failed calls to write() to raise io_error might be surprised.
  • If one task writes too much data to a Digest, but doesn't get the result, and then passes that Digest to a child task which does get the result - which task should fail, the parent or the child? If we want the child to fail, we'd have to add extra state handling logic to the Digests for this corner case. If it's the parent that fails, that might be surprising due to the previous point.
  • Right now the Reader and Writer traits are a little inconsistent - Reader allows for reads to return 0 bytes, but the Writer trait doesn't currently allow for write() to write 0 bytes. POSIX write() allows for this, however. So, I'd assume that either Reader or Writer will be changed eventually. If Writer grows the ability to complete without writing everything that was requested, that becomes another thing that doesn't really apply to Digests. Any effort that the caller spends to handle this case is wasted if a Digest is passed in. Maybe LLVM could optimize out that effort in some cases, but I'd imagine it couldn't handle all of them.
  • How often does a caller want to write data to a Digest, but not also have access to the data that was processed? Probably sometimes, but, I'd argue that in the majority of cases, if I compute a Digest on some data, I also want to have the data that was processed to write to a file or a socket. Not every function that sends bytes to a Writer is guaranteed to send the exact same bytes each time.

So, is a Digest really a Writer? The main similarity seems to be that both a Writer and a Digest can accept a bunch of bytes and do something with them, but they differ in how they accept those bytes.

If no one agrees with me, I'm happy to write up a patch that implements Writer for Digests, though I have some reservations about it. I'd suggest that a different PR might be a better place to do that, since that change would affect all existing Digests, not just this one.

@DaGenix (Author) commented Aug 8, 2013

I opened this PR to foster a discussion / comparison of various implementations. Is there more discussion that needs to take place? Is this ready for a review or should this stay open a bit longer before being reviewed?

Palmer Cox added 4 commits August 17, 2013 00:22
The shift_add_check_overflow and shift_add_check_overflow_tuple functions are
rewritten to be more efficient and to make use of the CheckedAdd intrinsic
instead of manually checking for integer overflow.

* The invocation of leading_zeros() is removed and replaced with a simple integer
  comparison. The leading_zeros() method results in a ctlz LLVM instruction,
  which may not be efficient on all architectures; integer comparisons,
  however, are efficient on just about any architecture.
* The methods lose the ability for the caller to specify a particular shift
  value - that functionality wasn't being used and removing it allows for the
  code to be simplified.
* Finally, the methods are renamed to add_bytes_to_bits and
  add_bytes_to_bits_tuple to reflect their very specific purposes.
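
A sketch of the resulting approach (current Rust syntax, not the exact cryptoutil.rs code):

```rust
// Convert a byte count into a bit count and add it to a running total,
// detecting overflow with a plain comparison and a checked addition rather
// than leading_zeros().
fn add_bytes_to_bits(bits: u64, bytes: u64) -> u64 {
    // If any of the top three bits of `bytes` are set, `bytes * 8` would overflow.
    assert!(bytes <= u64::MAX >> 3, "message too long: bit count overflows u64");
    let new_bits = bytes << 3;
    bits.checked_add(new_bits)
        .expect("message too long: bit count overflows u64")
}
```
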
@DaGenix (Author) commented Aug 17, 2013

@cmr I rebased this to combine it with #8501 to be easier on bors.

bors added a commit that referenced this pull request Aug 17, 2013
@bors closed this Aug 17, 2013