Thanks for providing this useful tool!

The Readme writes:

Note that there is no byte-by-byte comparison of files anywhere. All available hash functions are at least 128-bit wide, and you don't need to worry about hash collisions. At 10^15 files, the probability of collision is 0.000000001 when using a 128-bit hash, without taking into account the requirement for the files to also match by size.
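(As an aside, not from the Readme: that 0.000000001 figure is consistent with the usual birthday-bound approximation p ≈ n² / 2^(b+1) for n files and a b-bit hash, assuming the hash output is uniformly distributed. A quick back-of-the-envelope check:)

```rust
fn main() {
    // Birthday-bound approximation p ≈ n^2 / 2^(b+1)
    // for n hashed files and a b-bit hash (assumes uniform hash output).
    let n: f64 = 1e15; // number of files
    let b = 128.0;     // hash width in bits
    let p = n * n / 2f64.powf(b + 1.0);
    println!("collision probability ≈ {p:e}"); // prints roughly 1.5e-9
}
```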
When you have a group of, say, N files of uniform size S bytes each, then you need to read N * S bytes to compute all the hashes. In contrast, to compare them byte-by-byte for equality, you also need to read N * S bytes.
You might notice that both values are the same.
To make the byte-for-byte comparison work you need to be somewhat careful about the order in which you read: I assume you want to avoid having all N * S bytes in memory at the same time. On the one hand, that's a complication. On the other hand, things also get simpler, because you no longer rely on the hash for correctness, only for speed.
Here's a sketch:
If N is less than outrageously large, you can (conceptually) divide the files into chunks, 'zip' up the chunks, and then compare all the first chunks of all files, then all the second chunks of all files, and so on. (A smaller chunk size needs less memory; a bigger chunk size might be more efficient for reading. But the benefits of bigger chunks diminish quickly, even for spinning rust.)
If N * chunk size is still too large, you can split further and compare only some number M << N of the chunks at a time, using divide and conquer. (I can elaborate if you are interested.)
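To make the first variant concrete, here is a minimal Rust sketch of the chunked regrouping idea (my own illustration, not code from this project; split_by_content, read_full, and CHUNK_SIZE are made-up names). It assumes a group of files that already matched by size, keeps every file in the group open, and never holds more than one chunk per file in memory:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

// Hypothetical chunk size: smaller needs less memory, bigger reads more efficiently.
const CHUNK_SIZE: usize = 1 << 20; // 1 MiB per file per round

/// Read until `buf` is full or EOF is reached; return the number of bytes read.
fn read_full<R: Read>(r: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match r.read(&mut buf[total..])? {
            0 => break,
            n => total += n,
        }
    }
    Ok(total)
}

/// Split a group of same-size candidate files into subgroups whose contents
/// are byte-for-byte identical, holding only one chunk per file in memory.
fn split_by_content(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    // Open every file in the group once; all of them stay open until done.
    let open: Vec<(PathBuf, File)> = paths
        .into_iter()
        .map(|p| File::open(&p).map(|f| (p, f)))
        .collect::<io::Result<_>>()?;

    let mut pending = vec![open]; // groups that still need comparing
    let mut result = Vec::new();  // finished groups of identical files

    while let Some(group) = pending.pop() {
        if group.len() < 2 {
            // A lone file has nothing left to be compared against.
            result.push(group.into_iter().map(|(p, _)| p).collect());
            continue;
        }
        // Read the next chunk of every file and regroup the files by it.
        let mut by_chunk: HashMap<Vec<u8>, Vec<(PathBuf, File)>> = HashMap::new();
        let mut at_eof = true;
        for (path, mut file) in group {
            let mut buf = vec![0u8; CHUNK_SIZE];
            let n = read_full(&mut file, &mut buf)?;
            buf.truncate(n);
            if n == CHUNK_SIZE {
                at_eof = false; // this file (and its same-size peers) may have more data
            }
            by_chunk.entry(buf).or_default().push((path, file));
        }
        for (_, sub) in by_chunk {
            if at_eof {
                // Same-size files reach EOF together: these groups are final.
                result.push(sub.into_iter().map(|(p, _)| p).collect());
            } else {
                pending.push(sub); // keep comparing the following chunks
            }
        }
    }
    Ok(result)
}

fn main() -> io::Result<()> {
    // Toy usage: pass one group of same-size files on the command line.
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    for group in split_by_content(paths)? {
        if group.len() > 1 {
            println!("identical: {group:?}");
        }
    }
    Ok(())
}
```

The second refinement (looking at only M << N files per pass when even one chunk per file is too much memory) would wrap this in an outer loop over batches; I left it out to keep the sketch short.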
Does this seem reasonable? Did I miss anything?
The cost disadvantage of byte-by-byte comparison is not the number of bytes the program has to read (that is indeed the same as for hash calculation), but the cost of comparing N arrays (files) of S bytes against each other versus computing N hashes. The latter scales linearly with the number of files, while the former, done naively pair by pair, scales quadratically.
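To put rough numbers on that (my own illustration): with N = 1000 candidate files, hashing computes 1000 digests, one linear pass per file, whereas naive pairwise comparison has up to 1000 * 999 / 2 = 499,500 pairs of files to check in the worst case.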