Implement hash cache for md5 hash based on sqlite #942

Merged — 4 commits merged into arsenetar:master from the hash-cache branch on Nov 24, 2021

Conversation

@Dobatymo (Contributor) commented Sep 14, 2021

This pull request implements caching of md5 (and md5partial) values to a SQLite database.

  • the database file is saved to hash_cache.db in appdata
  • changes are committed once the full scan is done
  • picture cache clearing is generalized to clear the md5 cache as well

Enhancements not addressed in this PR:

  • Automatic cache invalidation (e.g. after x days)
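For illustration, here is a minimal sketch of the caching scheme described above, assuming a single table keyed on path, with size and mtime used to validate hits and the commit deferred until the scan ends. All names are illustrative, not the PR's actual code:

```python
import sqlite3

class HashCache:
    """Sketch only: one table keyed on path; size and mtime_ns
    validate hits; nothing is committed until the scan finishes."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, "
            "size INTEGER, mtime_ns INTEGER, md5 BLOB, md5partial BLOB)"
        )

    def get(self, path, size, mtime_ns, column="md5"):
        assert column in ("md5", "md5partial")
        # A row only counts as a hit if size and mtime still match.
        row = self.conn.execute(
            f"SELECT {column} FROM files WHERE path = ? AND size = ? AND mtime_ns = ?",
            (str(path), size, mtime_ns),
        ).fetchone()
        return row[0] if row else None

    def put(self, path, size, mtime_ns, md5, md5partial):
        self.conn.execute(
            "REPLACE INTO files (path, size, mtime_ns, md5, md5partial) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(path), size, mtime_ns, md5, md5partial),
        )

    def commit(self):
        # Called once, after the full scan is done.
        self.conn.commit()
```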

@@ -107,11 +108,6 @@ def __getattribute__(self, attrname):
result = self.INITIAL_INFO[attrname]
return result

# This offset is where we should start reading the file to get a partial md5
# For audio file, it should be where audio data starts
def _get_md5partial_offset_and_size(self):
Contributor:

Why was this removed and hard coded into the new calc_md5partial() function?
We don't handle audio files only.

Contributor Author:

It seemed like an unnecessary abstraction. md5partial is always generated the same way, regardless of file type.

Owner:

I will say this function used to do something more useful before the migration away from hsaudiotag3k; since moving to mutagen, this was the one thing "lost", and now the md5 is calculated the same way for audio files as well.


# This offset is where we should start reading the file to get a partial md5
# For audio file, it should be where audio data starts
offset, size = (0x4000, 0x4000)
@glubsy (Contributor) commented Oct 24, 2021:

This should not be hard coded here.
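For context, the hardcoded pair means the partial md5 covers 0x4000 bytes starting at offset 0x4000. A minimal sketch of such a function, assuming that is all calc_md5partial() does (the PR's actual implementation may differ):

```python
import hashlib

def calc_md5partial(path, offset=0x4000, size=0x4000):
    # Hash a fixed window of the file: `size` bytes starting at `offset`.
    with open(path, "rb") as fp:
        fp.seek(offset)
        return hashlib.md5(fp.read(size)).digest()
```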

core/app.py Outdated
@@ -138,7 +138,8 @@ def __init__(self, view, portable=False):
self.app_mode = AppMode.STANDARD
self.discarded_file_count = 0
self.exclude_list = ExcludeList()
self.directories = directories.Directories(self.exclude_list)
hash_cache_file = op.join(self.appdata, "hash.cache")
Contributor:

I suggest naming this "hash_cache.db".

Contributor Author:

Sure, no problem.

return md5

md5 = calc_md5(path)
self.cur.execute(self.insert_query, (str(path), size, mtime_ns, md5, md5partial))
Contributor:

I think it would be best to keep the hash compute function in the File class. Make a separate function wrapper for the REPLACE INTO statement that can be called from File.
I.e. try to get a hash from the DB; if there is none, return None to the File class. The File computes the hash and calls the DB class to insert (replace) the row.
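A sketch of the suggested flow, with hypothetical names (db.get and db.put stand in for the wrapper around the REPLACE INTO statement):

```python
import hashlib

def get_or_compute_md5(db, path):
    """Sketch of the suggested flow: ask the DB first, compute only
    on a miss, then hand the result back to the DB to store."""
    md5 = db.get(path)  # hypothetical wrapper; returns None if not cached
    if md5 is None:
        h = hashlib.md5()
        with open(path, "rb") as fp:
            # Read in chunks so large files don't exhaust memory.
            for chunk in iter(lambda: fp.read(1024 * 1024), b""):
                h.update(chunk)
        md5 = h.digest()
        db.put(path, md5)  # hypothetical wrapper around REPLACE INTO
    return md5
```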

@@ -44,6 +47,117 @@ class InvalidPathError(Exception):
"""The path being added is invalid"""


def calc_md5(path):
Contributor:

I think these hash compute functions should be kept in the File class.

Comment on lines 177 to 180
if hash_cache_file:
self.filesdb = FilesDB(hash_cache_file)
else:
self.filesdb = FilesDBDummy()
Contributor:

Perhaps the DB should be instantiated somewhere else. I think in this case, it should be a singleton, and it should be a global variable accessible from the File class.
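One way to realize the suggested singleton in Python is via __new__ plus a module-level global; a sketch with illustrative names only, not the PR's eventual code:

```python
import sqlite3
import threading

class FilesDB:
    """Sketch of a process-wide hash-cache singleton."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance.conn = None
            return cls._instance

    def connect(self, db_path):
        # Called once by the application at startup; the File class
        # then reaches the already-connected instance via the global.
        self.conn = sqlite3.connect(db_path)

# Module-level global referenced from the File class.
filesdb = FilesDB()
```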

core/fs.py Outdated
self.path = path
self.db = db
Contributor:

Would be nice to avoid this if possible. A global variable to reference the DB singleton would be good here perhaps.

@glubsy (Contributor) left a comment:

Not a bad idea but I suggest refactoring a bit.

@chchia commented Oct 29, 2021:

I am looking forward to this being implemented. This will be a super plus for this wonderful program.

@Dobatymo (Contributor Author):

@glubsy I addressed many of your suggestions. I moved the hash calculations back into the File class, separated the caching steps into get and put, and made the FilesDB class into a singleton. The DupeGuru main class calls connect on the singleton.

Do you have any comments on how to handle cache invalidation? It is not done at the moment, but I don't know if it's a good idea to keep the cache forever, even though I cache based on path, size, and modification time.

@arsenetar (Owner):

@Dobatymo, the CI/CD found some issues in the linting step, these will need to be resolved. Regarding the "missing features" listed:

  • The location of the database should follow the picture cache db, which is located in the application data folder. This can be seen in app.py's _get_picture_cache_path(), which is the same as what you are doing, so nothing else is needed there.
  • The GUI will need to be able to manually clear this cache; either we can add a separate action for this, or just use the picture cache clear and change the description to be more generic than "Clear Picture Cache".
  • If invalidation of old items based on when they were added is desired, I would have that happen on open of the db. Although I can see cases for and against this, as it depends on how often a user looks for duplicates and how often files change, I would almost just have a cache size restriction, in either the file size of the db or the number of records, that uses the insert/update date only to decide what to remove to stay under a user-settable limit (a sketch of this follows below).
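A sketch of the size-restriction idea from the last point, assuming the table carries an insert/update timestamp column (here called "updated", a hypothetical name):

```python
def evict_to_limit(conn, max_rows):
    """Keep the cache under max_rows by dropping the rows with the
    oldest insert/update date first ("updated" is a hypothetical column)."""
    (count,) = conn.execute("SELECT COUNT(*) FROM files").fetchone()
    excess = count - max_rows
    if excess > 0:
        conn.execute(
            "DELETE FROM files WHERE path IN ("
            "  SELECT path FROM files ORDER BY updated ASC LIMIT ?)",
            (excess,),
        )
        conn.commit()
```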

Commits added:
  • move hash calculation back in to Files class
  • clear cache now clears hash cache in addition to picture cache
@Dobatymo (Contributor Author):

@arsenetar I changed "clear picture cache" to simply "clear cache", which now cleans the hash cache too. I also addressed the linting issues. Maybe the cache invalidation can be planned for another time.

@Dobatymo Dobatymo marked this pull request as ready for review October 29, 2021 07:16
@sonarqubecloud:

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs
  • 0 Vulnerabilities
  • 0 Security Hotspots
  • 1 Code Smell

No coverage information, 0.0% duplication

@chchia commented Oct 29, 2021:

I have tried the latest commit 7746004; it seems the hash is not stored in the database after a file is hashed.

I have over 250 groups hashed (screenshot), but checking hash_cache.db, it is empty (screenshot).

I suppose the purpose of having the database is to store the hash result for whatever file you are comparing, so that the next time you open that file you don't have to go through hashing again, as you already have the hash result? Did I miss something?

@Dobatymo (Contributor Author) commented Oct 29, 2021:

@chchia it will commit to the db when the scan is complete. Try opening the database afterwards, or run the scan a second time and see if it's faster.
Also, it will only calculate hashes if files of the same size are encountered; otherwise DupeGuru would never even calculate them.
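For context, the same-size prefilter means hashes are only ever computed inside groups of equally sized files; a sketch of that idea (not dupeGuru's actual scan code):

```python
import os
from collections import defaultdict

def size_groups(paths):
    """Group files by size; only groups with more than one file
    ever get hashed, so unique-size files never touch the cache."""
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.getsize(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```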

@chchia commented Oct 29, 2021:


I see, I thought it wrote to the db on the fly once a file got hashed. I will wait until the scan is completed. Thanks.

@chchia commented Oct 29, 2021:

By the way, regarding the cache invalidation: instead of after XX days, I think it would be better to invalidate only if the file no longer exists or the file modification time has changed.

I suppose a file's hash result will not change unless the file is removed or changed.
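This kind of invalidation falls out naturally if the lookup key includes size and mtime, as the PR already does: a stale row simply never matches. A sketch, reusing the hypothetical column names from the earlier sketch:

```python
import os

def cached_md5(conn, path):
    """Return the cached md5 only if the file still exists and its
    size and mtime match the stored row; otherwise None."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None  # file is gone; the stale row can be pruned later
    row = conn.execute(
        "SELECT md5 FROM files WHERE path = ? AND size = ? AND mtime_ns = ?",
        (str(path), st.st_size, st.st_mtime_ns),
    ).fetchone()
    return row[0] if row else None
```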

@arsenetar (Owner):

Things here are probably fine; I will give it another look. The CI/CD failure has to do with a change between Python 3.9.6 and 3.9.7 that changed the behavior of shutil.copy() (https://bugs.python.org/issue43219), so it's unrelated to these changes.

@arsenetar arsenetar merged commit 34f41dc into arsenetar:master Nov 24, 2021
@Dobatymo Dobatymo deleted the hash-cache branch November 24, 2021 03:32