Hash sample optimization #908
Conversation
* Big files above the user-selected threshold can be partially hashed in 3 places.
* If the user is willing to take the risk, we consider files with identical md5samples as being identical.
* Instead of keeping the md5 samples separate, merge them into one hash computed from the selected chunks.
* We don't need to keep a boolean recording whether the user chose to optimize; we can simply compare the value of the threshold, since 0 means no optimization is currently active (see the sketch below).
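A minimal sketch of that last point, with illustrative names that are not dupeGuru's actual attributes: the threshold value itself acts as the on/off switch, so no extra boolean is stored.

```python
def pick_hash_field(file_size, partial_hash_threshold):
    """Illustrative only: 0 disables partial hashing, any other value enables it."""
    if partial_hash_threshold and file_size > partial_hash_threshold:
        return "md5samples"  # one hash merged from the sampled chunks
    return "md5"             # full-file hash
```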
Preliminary review looks good; I'll need to pull this and give it a try.
@glubsy, I did some testing with this and am wondering if we should enforce a minimum value here (it currently allows 1 MB). The reason is that the CHUNK_SIZE used for the three hashes is 1 MB, so setting the limit to anything smaller than 3 MB actually does more work than just hashing the whole file once. I don't know why you would actively set the limit that low, but one could. I tested with the limit at 1 MB: it does not cause issues, but it does do more work than just hashing the files I tested with (just over 1 MB).
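To make that arithmetic concrete (numbers are illustrative, assuming the 1 MiB CHUNK_SIZE mentioned above):

```python
CHUNK_SIZE = 1024 * 1024            # 1 MiB, per the comment above

file_size = int(1.2 * CHUNK_SIZE)   # a file just over 1 MiB
threshold = CHUNK_SIZE              # the lowest value currently allowed

# Hashing the whole file reads every byte exactly once:
full_hash_bytes = file_size                  # ~1.2 MiB

# Sampling reads up to CHUNK_SIZE bytes at three offsets, so for any file
# smaller than 3 * CHUNK_SIZE the (overlapping) reads exceed the file size:
worst_case_sample_bytes = 3 * CHUNK_SIZE     # 3 MiB of reads
assert worst_case_sample_bytes > full_hash_bytes
```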
Ran the tests locally; while they pass, the negative seek causes an exception which is caught on test cases that use files smaller than the chunk size. I would prefer that either the tests be updated to use files of 1 MB or larger (as you cannot actually specify for it to run on smaller files), or that the seek be written so it cannot go negative.
That's a very good point, thanks! I'll work on this very shortly. |
Saved my life in analyzing my movie collection, thanks! |
Had a quick look today, and it just occurred to me:

```python
if size <= CHUNK_SIZE * 3:
    setattr(self, field, self.md5)
```

Wouldn't this be enough?
Computing 3 hash samples for files smaller than 3 MiB (3 * CHUNK_SIZE) is not efficient, since spans of later samples would overlap a previous one. Therefore we can simply return the hash of the entire small file instead.
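To illustrate the overlap with assumed numbers (1 MiB CHUNK_SIZE and the 25% / 60% / end offsets described in this PR):

```python
CHUNK_SIZE = 1024 * 1024      # 1 MiB
size = 2 * CHUNK_SIZE         # a 2 MiB file, below the 3 * CHUNK_SIZE cutoff

first = size * 25 // 100      # sample 1 starts at 0.5 MiB and covers 0.5-1.5 MiB
second = size * 60 // 100     # sample 2 starts at 1.2 MiB and covers 1.2-2.0 MiB
last = size - CHUNK_SIZE      # sample 3 starts at 1.0 MiB and covers 1.0-2.0 MiB
# Sample 2 lies entirely inside sample 3, sample 3 overlaps sample 1, and the
# three reads total ~2.8 MiB for a 2 MiB file: hashing it whole is cheaper.
```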
Branch updated from 6a50a1a to 7b764f1 (Compare).
The "walrus" operator is only available in python 3.8 and later. Fall back to more traditional notation.
Perhaps the Python bytecode is already optimized, but just in case it is not, keep the constant expression pre-computed.
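A sketch of what these two commits describe (paraphrased, not the actual dupeGuru source): the Python 3.8-only walrus operator replaced by a traditional read loop, and the CHUNK_SIZE * 3 expression hoisted into a module-level constant so it is not re-evaluated on every call.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024
MIN_FILE_SIZE = CHUNK_SIZE * 3   # pre-computed once instead of inline "CHUNK_SIZE * 3"

def is_small_file(size):
    # Uses the hoisted constant rather than recomputing the expression each call.
    return size <= MIN_FILE_SIZE

def md5_of_stream(fp):
    md5 = hashlib.md5()
    # Python 3.8+ walrus form would be:  while chunk := fp.read(CHUNK_SIZE): ...
    # Traditional form that also runs on older interpreters:
    while True:
        chunk = fp.read(CHUNK_SIZE)
        if not chunk:
            break
        md5.update(chunk)
    return md5.digest()
```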
Yeah, that looks good to me; the tests seem to be good now as well. I am thinking of changing the edit box for this value, and the one for ignoring small files, to spinboxes instead, to prevent some other input-validation issues unrelated to this change, so I'll handle that separately from this PR. Thanks for the work.
For files bigger than the threshold set by the user, instead of computing the full md5 (which takes time), pick three chunks (at 25% and 60% of the file length, plus the last chunk) and compute the "md5samples" hash from these three samples.
Closes #734.
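A minimal, self-contained sketch of the sampling scheme described above (function name and structure are illustrative, not dupeGuru's actual implementation; it assumes the file is larger than 3 * CHUNK_SIZE, since smaller files fall back to the full md5 as discussed in the conversation, which also avoids the negative seek):

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the chunk size discussed in the review

def md5samples(path, size):
    """Hash three CHUNK_SIZE samples (at 25% and 60% of the file, plus the last
    chunk) merged into one digest. Illustrative sketch, not dupeGuru's code."""
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        fp.seek(size * 25 // 100)             # chunk starting at 25% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(size * 60 // 100)             # chunk starting at 60% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(-CHUNK_SIZE, os.SEEK_END)     # the last CHUNK_SIZE bytes
        md5.update(fp.read(CHUNK_SIZE))
    return md5.digest()
```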