
Why hash whole files? #266

Open
matthiasgoergens opened this issue May 30, 2024 · 2 comments

@matthiasgoergens

Thanks for providing this useful tool!

The README says:

Note that there is no byte-by-byte comparison of files anywhere. All available hash functions are at least 128-bit wide, and you don't need to worry about hash collisions. At 10^15 files, the probability of collision is 0.000000001 when using a 128-bit hash, without taking into account the requirement for the files to also match by size.

When you have a group of, say, N files of uniform size S bytes each, then you need to read N * S bytes to compute all the hashes. In contrast, to compare them byte-by-byte for equality, you need to read N * S bytes.

You might notice that both values are the same.

To make the byte-by-byte comparison work you need to be somewhat careful about the order in which you read: I assume you want to avoid having the whole N * S bytes in memory at the same time. On the one hand, that's a complication. On the other hand, you can simplify things because you no longer rely on the hash for correctness, only for speed.

Here's a sketch:

If N is less than outrageously large, you can (conceptually) divide the files into chunks, 'zip' up the chunks, and then compare all the first chunks of all files, then all the second chunks of all files, etc. (A smaller chunk size needs less memory; a bigger chunk size might be more efficient for reading. But the benefits of bigger chunks diminish quickly, even for spinning Rust.)

If N * chunk size is still too large, you can further split and compare some number M << N of chunks at a time, and use divide and conquer. (I can elaborate, if you are interested.)
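To make the first variant concrete, here is a rough Python sketch (illustrative only, not fclones code; the chunk size and function name are arbitrary choices). It reads one group of same-size candidate files in lockstep and splits the group whenever chunks differ, so memory stays around N * chunk size and unique files are dropped as soon as they diverge:

```python
from collections import defaultdict

CHUNK_SIZE = 1 << 20  # 1 MiB per file per round; smaller means less memory

def partition_by_content(paths):
    """Split a list of same-size files into groups of byte-identical files.

    Files are read chunk by chunk in lockstep, so at most
    len(paths) * CHUNK_SIZE bytes are held in memory at a time.
    All files in the group are kept open at once, so a very large group
    would need batching (the divide-and-conquer variant above).
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        groups = [list(paths)]       # start with one candidate group
        while True:
            next_groups = []
            more_data = False
            for group in groups:
                if len(group) < 2:
                    continue         # a singleton cannot have duplicates
                # Read the next chunk of every file in this group and
                # split the group by chunk content.
                buckets = defaultdict(list)
                for p in group:
                    chunk = handles[p].read(CHUNK_SIZE)
                    buckets[chunk].append(p)
                    if chunk:
                        more_data = True
                next_groups.extend(buckets.values())
            groups = next_groups
            if not more_data:        # nothing left to read or compare
                return [g for g in groups if len(g) > 1]
    finally:
        for f in handles.values():
            f.close()
```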

Does this seem reasonable? Did I miss anything?

matthiasgoergens changed the title from "Why hash?" to "Why hash whole files?" on May 30, 2024
@kapitainsky
Contributor

The cost disadvantage of byte-by-byte comparison is not about the number of bytes the program has to read (that is indeed the same as for hash calculation), but about the cost of comparing N arrays (files) of S bytes versus calculating hashes. The latter scales linearly with the number of files, whereas the former's computational cost scales exponentially.

@matthiasgoergens
Author

matthiasgoergens commented May 30, 2024

How do you make it scale exponentially?

Even the most naive version I can come up with would only scale quadratically.

And it's fairly easy to come up with a linear version. I can elaborate on that, if you are interested.
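For instance, here is a minimal sketch of a linear version (illustrative Python only, assuming whole files fit in memory; for large files you would key on successive chunks instead, as in the sketch in my first post). Each file is read exactly once, so the total work is O(N * S) rather than quadratic in N, and the dictionary's own byte-for-byte key comparison means correctness never depends on a hash:

```python
from collections import defaultdict

def group_identical(paths):
    """Group files by exact content in time linear in the total bytes read.

    Each file's content is used as a dictionary key. The dictionary's
    internal hashing is only a speed-up: if two keys collide, Python
    compares them byte for byte, so correctness never rests on the hash.
    """
    by_content = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            by_content[f.read()].append(p)   # whole file as the key
    return [group for group in by_content.values() if len(group) > 1]
```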
