
Why hash whole files? #266

Open
matthiasgoergens opened this issue May 30, 2024 · 2 comments

@matthiasgoergens

Thanks for providing this useful tool!

The README says:

Note that there is no byte-by-byte comparison of files anywhere. All available hash functions are at least 128-bit wide, and you don't need to worry about hash collisions. At 10^15 files, the probability of collision is 0.000000001 when using a 128-bit hash, without taking into account the requirement for the files to also match by size.

When you have a group of, say, N files of uniform size S bytes each, then you need to read N * S bytes to compute all the hashes. In contrast, to compare them byte-by-byte for equality, you need to read N * S bytes.

You might notice that both values are the same.

To make the byte-by-byte comparison work you need to be somewhat careful about the order in which you read: I assume you want to avoid having the whole N * S bytes in memory at the same time. On the one hand, that's a complication. On the other hand, you can simplify things because you no longer rely on the hash for correctness, only for speed.

Here's a sketch:

If N is less than outrageously large, you can (conceptually) divide the files into chunks, 'zip' up the chunks, and then compare all the first chunks of all files, then all the second chunks of all files, etc. (A smaller chunk size needs less memory; a bigger chunk size might be more efficient for reading. But the benefits of bigger chunks diminish quickly, even for spinning Rust.)

If N * chunk size is still too large, you can further split and compare some number M << N of chunks at a time, and use divide and conquer. (I can elaborate, if you are interested.)
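To make the first variant concrete, here is a rough Python sketch (illustrative only, not fclones code; the chunk size and function name are arbitrary choices). It reads one group of same-size candidate files in lockstep and splits the group whenever chunks differ, so memory stays around N * chunk size and unique files are dropped as soon as they diverge:

```python
from collections import defaultdict

CHUNK_SIZE = 1 << 20  # 1 MiB per file per round; smaller means less memory

def partition_by_content(paths):
    """Split a list of same-size files into groups of byte-identical files.

    Files are read chunk by chunk in lockstep, so at most
    len(paths) * CHUNK_SIZE bytes are held in memory at a time.
    All files in the group are kept open at once, so a very large group
    would need batching (the divide-and-conquer variant above).
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        groups = [list(paths)]       # start with one candidate group
        while True:
            next_groups = []
            more_data = False
            for group in groups:
                if len(group) < 2:
                    continue         # a singleton cannot have duplicates
                # Read the next chunk of every file in this group and
                # split the group by chunk content.
                buckets = defaultdict(list)
                for p in group:
                    chunk = handles[p].read(CHUNK_SIZE)
                    buckets[chunk].append(p)
                    if chunk:
                        more_data = True
                next_groups.extend(buckets.values())
            groups = next_groups
            if not more_data:        # nothing left to read or compare
                return [g for g in groups if len(g) > 1]
    finally:
        for f in handles.values():
            f.close()
```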

Does this seem reasonable? Did I miss anything?

matthiasgoergens changed the title from "Why hash?" to "Why hash whole files?" on May 30, 2024
@kapitainsky
Contributor

The cost disadvantage of byte-by-byte comparison is not about the number of bytes the program has to read (that is indeed the same as for hash calculation), but about the cost of comparing N arrays (files) of S bytes versus calculating hashes. The latter scales linearly with the number of files, whereas the former's computational cost scales exponentially.

@matthiasgoergens
Author

matthiasgoergens commented May 30, 2024

How do you make it scale exponentially?

Even the most naive version I can come up with would only scale quadratically.

And it's fairly easy to come up with a linear version. I can elaborate on that, if you are interested.
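For instance, here is a minimal sketch of a linear version (illustrative Python only, assuming whole files fit in memory; for large files you would key on successive chunks instead, as in the sketch in my first post). Each file is read exactly once, so the total work is O(N * S) rather than quadratic in N, and the dictionary's own byte-for-byte key comparison means correctness never depends on a hash:

```python
from collections import defaultdict

def group_identical(paths):
    """Group files by exact content in time linear in the total bytes read.

    Each file's content is used as a dictionary key. The dictionary's
    internal hashing is only a speed-up: if two keys collide, Python
    compares them byte for byte, so correctness never rests on the hash.
    """
    by_content = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            by_content[f.read()].append(p)   # whole file as the key
    return [group for group in by_content.values() if len(group) > 1]
```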
