Implement hash cache for md5 hash based on sqlite #942
Conversation
084b8be to 4bdccbe (compare)
@@ -107,11 +108,6 @@ def __getattribute__(self, attrname):
        result = self.INITIAL_INFO[attrname]
    return result

    # This offset is where we should start reading the file to get a partial md5
    # For audio file, it should be where audio data starts
    def _get_md5partial_offset_and_size(self):
Why was this removed and hard-coded into the new calc_md5partial() function? We don't only handle audio files.
It seemed like an unnecessary abstraction. md5partial is always generated the same way, regardless of file type.
I will say this function used to do something more useful before the migration away from hsaudiotag3k; since moving to mutagen this was the one thing "lost", and now the md5 is calculated the same way for audio files as well.
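For context, a partial hash with a fixed offset might look roughly like this (a hedged sketch; the 0x4000 constants come from the diff below, while the function and constant names are illustrative):

```python
import hashlib

# Offset/size for the "partial" hash; the 0x4000 values are from the diff.
PARTIAL_OFFSET = 0x4000
PARTIAL_SIZE = 0x4000

def calc_md5partial(path):
    # Hash a fixed-size chunk starting at a fixed offset, so large files
    # can be compared cheaply before computing the full hash.
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        fp.seek(PARTIAL_OFFSET)
        md5.update(fp.read(PARTIAL_SIZE))
    return md5.digest()
```

The old _get_md5partial_offset_and_size() abstraction allowed these constants to vary per file type (e.g. skipping audio tag headers); the new code fixes them for all files.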
core/directories.py (outdated)
# This offset is where we should start reading the file to get a partial md5
# For audio file, it should be where audio data starts
offset, size = (0x4000, 0x4000)
This should not be hard coded here.
core/app.py (outdated)
@@ -138,7 +138,8 @@ def __init__(self, view, portable=False):
        self.app_mode = AppMode.STANDARD
        self.discarded_file_count = 0
        self.exclude_list = ExcludeList()
        self.directories = directories.Directories(self.exclude_list)
        hash_cache_file = op.join(self.appdata, "hash.cache")
I suggest naming this "hash_cache.db".
Sure, no problem.
core/directories.py (outdated)
        return md5

        md5 = calc_md5(path)
        self.cur.execute(self.insert_query, (str(path), size, mtime_ns, md5, md5partial))
I think it would be best to keep the hash compute function in the File class. Make a separate function wrapper for the REPLACE INTO statement that can be called from File.
I.e. try to get a hash from the DB; if there is none, return None to the File class. The File computes the hash and calls the DB class to insert (replace) the row.
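The suggested split could look roughly like this (a sketch; the class and method names are illustrative, not the PR's actual API):

```python
import hashlib
import sqlite3

class FilesDB:
    """Thin cache wrapper: File asks for a hash, computes and stores it on a miss."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, md5 BLOB)"
        )

    def get(self, path):
        # Return the cached hash, or None so the caller computes it itself.
        row = self.conn.execute(
            "SELECT md5 FROM files WHERE path = ?", (str(path),)
        ).fetchone()
        return row[0] if row else None

    def put(self, path, md5):
        self.conn.execute(
            "REPLACE INTO files (path, md5) VALUES (?, ?)", (str(path), md5)
        )

class File:
    def __init__(self, path, db):
        self.path = path
        self.db = db

    @property
    def md5(self):
        # Cache miss: compute here in File, then hand the row back to the DB.
        value = self.db.get(self.path)
        if value is None:
            with open(self.path, "rb") as fp:
                value = hashlib.md5(fp.read()).digest()
            self.db.put(self.path, value)
        return value
```

This keeps the hash computation in File while the DB class only knows about SELECT and REPLACE INTO.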
core/directories.py (outdated)
@@ -44,6 +47,117 @@ class InvalidPathError(Exception):
    """The path being added is invalid"""

def calc_md5(path):
I think these hash compute functions should be kept in the File class.
core/directories.py (outdated)
        if hash_cache_file:
            self.filesdb = FilesDB(hash_cache_file)
        else:
            self.filesdb = FilesDBDummy()
Perhaps the DB should be instantiated somewhere else. I think in this case, it should be a singleton, and it should be a global variable accessible from the File class.
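One way to expose the DB as a module-level singleton (a hedged sketch; names and the table schema are assumptions, not the PR's code):

```python
import sqlite3

class FilesDB:
    """Process-wide hash cache: call connect() once at startup, then use the
    module-level `filesdb` global from anywhere (e.g. the File class)."""

    def __init__(self):
        self.conn = None

    def connect(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, md5 BLOB)"
        )

# Module-level singleton; other modules import this instead of receiving
# a db argument through every constructor.
filesdb = FilesDB()
```

With this layout, the DupeGuru main class only calls connect() with the cache path, and File never needs a db parameter.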
core/fs.py (outdated)
        self.path = path
        self.db = db
Would be nice to avoid this if possible. A global variable to reference the DB singleton would be good here perhaps.
Not a bad idea but I suggest refactoring a bit.
I am looking forward to this being implemented. This will be a super plus for this wonderful program.
@glubsy I addressed many of your suggestions. I moved the hash calculations back into the Files class, separated the caching steps into get and put, and made the FilesDB class into a singleton. The DupeGuru main class calls Do you have any comments on how to handle cache invalidation, as this is not done at the moment? I don't know if it's a good idea to keep the cache forever, even though I cache based on path, size and modification time.
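A cache keyed on path with size and modification time stored alongside could use a schema like this (illustrative, not necessarily the PR's exact columns; the earlier diff's insert statement suggests these five fields):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,   -- lookup key
        size INTEGER,            -- stored to detect stale entries
        mtime_ns INTEGER,        -- nanosecond modification time
        md5 BLOB,
        md5partial BLOB
    )"""
)
# REPLACE INTO overwrites the existing row when the same path is hashed again.
conn.execute(
    "REPLACE INTO files (path, size, mtime_ns, md5, md5partial) VALUES (?, ?, ?, ?, ?)",
    ("/tmp/example", 123, 456, b"\x00" * 16, b"\x01" * 16),
)
```

Storing size and mtime_ns with the hash is what makes later invalidation possible: a lookup can compare them against the file on disk.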
@Dobatymo, the CI/CD found some issues in the linting step; these will need to be resolved. Regarding the "missing features" listed:
move hash calculation back in to Files class
clear cache now clears hash cache in addition to picture cache
@arsenetar I changed "clear picture cache" to simply "clear cache", which now cleans the hash cache too. I also addressed the linting issues. Maybe cache invalidation can be planned for another time.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
I tried the latest commit 7746004, but checking hash_cache.db, it is empty. I suppose the purpose of having the database is to store the hash result for whatever file you are comparing, so that the next time you open that file you don't have to go through hashing again, since you already have the hash result? Did I miss something?
@chchia it will commit the db when the scan is complete. Try opening the database afterwards, or run the scan a second time and see if it's faster.
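The behavior described can be sketched as follows (a hedged illustration, not the PR's code): rows are inserted during the scan but only become durable when commit() runs at the end, which is why the on-disk DB looks empty mid-scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, md5 BLOB)")

def scan(paths_and_hashes):
    # Inserts accumulate in one transaction during the scan; a single
    # commit at the end is much cheaper than committing per file.
    for path, md5 in paths_and_hashes:
        conn.execute("REPLACE INTO files (path, md5) VALUES (?, ?)", (path, md5))
    conn.commit()  # single commit when the scan is complete
```

Batching the commit trades mid-scan visibility for speed, since each SQLite commit forces a sync to disk.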
I see, I thought it wrote to the db on the fly once a file got hashed. I will wait until the scan completes. Thanks.
By the way, regarding cache invalidation: instead of invalidating after XX days, I think it would be better to invalidate only if the file no longer exists or the file modification time has changed. I assume a file's hash result will not change unless the file is removed or changed.
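That validation strategy could be sketched like this (a hypothetical helper, assuming the size/mtime_ns columns discussed above): compare the stored size and mtime against the file on disk and treat any mismatch, or a missing file, as a cache miss.

```python
import os

def cached_md5(conn, path):
    """Return the cached md5 only if the file still exists with the same
    size and mtime; otherwise the entry is stale and the caller rehashes."""
    row = conn.execute(
        "SELECT size, mtime_ns, md5 FROM files WHERE path = ?", (str(path),)
    ).fetchone()
    if row is None:
        return None
    try:
        st = os.stat(path)
    except OSError:
        # File no longer exists: drop the stale row.
        conn.execute("DELETE FROM files WHERE path = ?", (str(path),))
        return None
    size, mtime_ns, md5 = row
    if st.st_size != size or st.st_mtime_ns != mtime_ns:
        return None
    return md5
```

This makes invalidation lazy: stale entries cost nothing until the path is looked up again, and no time-based expiry is needed.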
Things here are probably fine; I will give it another look. The CI/CD failure has to do with a change between 3.9.6 and 3.9.7 that changed the behavior of shutil.copy() (https://bugs.python.org/issue43219), so it's unrelated to these changes.
This pull request implements caching of md5 (and md5partial) values to a SQLite database.
Enhancements not addressed in this PR: