Implement hash cache for md5 hash based on sqlite #942

Merged — 4 commits merged into arsenetar:master from the hash-cache branch on Nov 24, 2021

Conversation

@Dobatymo (Contributor) commented Sep 14, 2021

This pull request implements caching of md5 (and md5partial) values to a SQLite database.

  • the database file is saved to hash_cache.db in appdata
  • changes are committed once the full scan is done
  • picture cache clearing is generalized to clear the md5 cache as well

Enhancements not addressed in this PR:

  • Automatic cache invalidation (e.g. after x days)
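For illustration, here is a minimal sketch of the caching scheme described above, assuming a single table keyed on path, with size and mtime used to validate hits and the commit deferred until the scan ends. All names are illustrative, not the PR's actual code:

```python
import sqlite3

class HashCache:
    """Sketch only: one table keyed on path; size and mtime_ns
    validate hits; nothing is committed until the scan finishes."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, "
            "size INTEGER, mtime_ns INTEGER, md5 BLOB, md5partial BLOB)"
        )

    def get(self, path, size, mtime_ns, column="md5"):
        assert column in ("md5", "md5partial")
        # A row only counts as a hit if size and mtime still match.
        row = self.conn.execute(
            f"SELECT {column} FROM files WHERE path = ? AND size = ? AND mtime_ns = ?",
            (str(path), size, mtime_ns),
        ).fetchone()
        return row[0] if row else None

    def put(self, path, size, mtime_ns, md5, md5partial):
        self.conn.execute(
            "REPLACE INTO files (path, size, mtime_ns, md5, md5partial) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(path), size, mtime_ns, md5, md5partial),
        )

    def commit(self):
        # Called once, after the full scan is done.
        self.conn.commit()
```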

@@ -107,11 +108,6 @@ def __getattribute__(self, attrname):
result = self.INITIAL_INFO[attrname]
return result

# This offset is where we should start reading the file to get a partial md5
# For audio file, it should be where audio data starts
def _get_md5partial_offset_and_size(self):
Contributor:

Why was this removed and hard coded into the new calc_md5partial() function?
We don't handle audio files only.

Contributor Author:

It seemed like an unnecessary abstraction. md5partial is always generated the same way, regardless of file type.

Owner:

I will say this function used to do something more useful before the migration away from hsaudiotag3k; since moving to mutagen, this was the one thing "lost", and now the md5 is calculated the same way for audio files as well.


# This offset is where we should start reading the file to get a partial md5
# For audio file, it should be where audio data starts
offset, size = (0x4000, 0x4000)
@glubsy (Contributor) commented Oct 24, 2021:

This should not be hard coded here.
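For context, the hardcoded pair means the partial md5 covers 0x4000 bytes starting at offset 0x4000. A minimal sketch of such a function, assuming that is all calc_md5partial() does (the PR's actual implementation may differ):

```python
import hashlib

def calc_md5partial(path, offset=0x4000, size=0x4000):
    # Hash a fixed window of the file: `size` bytes starting at `offset`.
    with open(path, "rb") as fp:
        fp.seek(offset)
        return hashlib.md5(fp.read(size)).digest()
```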

core/app.py Outdated
@@ -138,7 +138,8 @@ def __init__(self, view, portable=False):
self.app_mode = AppMode.STANDARD
self.discarded_file_count = 0
self.exclude_list = ExcludeList()
self.directories = directories.Directories(self.exclude_list)
hash_cache_file = op.join(self.appdata, "hash.cache")
Contributor:

I suggest naming this "hash_cache.db".

Contributor Author:

Sure, no problem.

return md5

md5 = calc_md5(path)
self.cur.execute(self.insert_query, (str(path), size, mtime_ns, md5, md5partial))
Contributor:

I think it would be best to keep the hash compute function in the File class. Make a separate function wrapper for the REPLACE INTO statement that can be called from File.
I.e. try to get a hash from the DB; if there is none, return None to the File class. The File computes the hash and calls the DB class to insert (replace) the row.
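A sketch of the suggested flow, with hypothetical names (db.get and db.put stand in for the wrapper around the REPLACE INTO statement):

```python
import hashlib

def get_or_compute_md5(db, path):
    """Sketch of the suggested flow: ask the DB first, compute only
    on a miss, then hand the result back to the DB to store."""
    md5 = db.get(path)  # hypothetical wrapper; returns None if not cached
    if md5 is None:
        h = hashlib.md5()
        with open(path, "rb") as fp:
            # Read in chunks so large files don't exhaust memory.
            for chunk in iter(lambda: fp.read(1024 * 1024), b""):
                h.update(chunk)
        md5 = h.digest()
        db.put(path, md5)  # hypothetical wrapper around REPLACE INTO
    return md5
```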

@@ -44,6 +47,117 @@ class InvalidPathError(Exception):
"""The path being added is invalid"""


def calc_md5(path):
Contributor:

I think these hash compute functions should be kept in the File class.

Comment on lines 177 to 180
if hash_cache_file:
self.filesdb = FilesDB(hash_cache_file)
else:
self.filesdb = FilesDBDummy()
Contributor:

Perhaps the DB should be instantiated somewhere else. I think in this case, it should be a singleton, and it should be a global variable accessible from the File class.
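One way to realize the suggested singleton in Python is via __new__ plus a module-level global; a sketch with illustrative names only, not the PR's eventual code:

```python
import sqlite3
import threading

class FilesDB:
    """Sketch of a process-wide hash-cache singleton."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance.conn = None
            return cls._instance

    def connect(self, db_path):
        # Called once by the application at startup; the File class
        # then reaches the already-connected instance via the global.
        self.conn = sqlite3.connect(db_path)

# Module-level global referenced from the File class.
filesdb = FilesDB()
```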

core/fs.py Outdated
self.path = path
self.db = db
Contributor:

Would be nice to avoid this if possible. A global variable to reference the DB singleton would be good here perhaps.

@glubsy (Contributor) left a comment:

Not a bad idea but I suggest refactoring a bit.

@chchia commented Oct 29, 2021:

I am looking forward to this being implemented. This will be a super plus for this wonderful program.

@Dobatymo (Contributor Author):

@glubsy I addressed many of your suggestions. I moved the hash calculations back into the File class, separated the caching steps into get and put, and made the FilesDB class into a singleton. The DupeGuru main class calls connect on the singleton.

Do you have any comments on how to handle cache invalidation? It is not done at the moment, but I don't know if it's a good idea to keep the cache forever, even though I cache based on path, size, and modification time.

@arsenetar (Owner):

@Dobatymo, the CI/CD found some issues in the linting step, these will need to be resolved. Regarding the "missing features" listed:

  • The location of the database should follow the picture cache db, which is located in the application data folder. This can be seen in app.py's _get_picture_cache_path(), which is the same as what you are doing, so nothing else is needed there.
  • The GUI will need to be able to manually clear this cache; either we can add a separate action for this, or just use the picture cache clear and change the description to be more generic than "Clear Picture Cache".
  • If invalidation of old items based on when they were added is desired, I would have that happen on open of the db. Although I can see cases for and against this, as it depends on how often a user looks for duplicates and how often files change, I would almost just have a cache size restriction, in either the file size of the db or the number of records, that uses the insert/update date only to decide what to remove to stay under a user-settable limit (a sketch of this follows below).
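A sketch of the size-restriction idea from the last point, assuming the table carries an insert/update timestamp column (here called "updated", a hypothetical name):

```python
def evict_to_limit(conn, max_rows):
    """Keep the cache under max_rows by dropping the rows with the
    oldest insert/update date first ("updated" is a hypothetical column)."""
    (count,) = conn.execute("SELECT COUNT(*) FROM files").fetchone()
    excess = count - max_rows
    if excess > 0:
        conn.execute(
            "DELETE FROM files WHERE path IN ("
            "  SELECT path FROM files ORDER BY updated ASC LIMIT ?)",
            (excess,),
        )
        conn.commit()
```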

Commits added:
  • move hash calculation back in to Files class
  • clear cache now clears hash cache in addition to picture cache
@Dobatymo (Contributor Author):

@arsenetar I changed "clear picture cache" to simply "clear cache", which now cleans the hash cache too. I also addressed the linting issues. Maybe the cache invalidation can be planned for another time.

@Dobatymo Dobatymo marked this pull request as ready for review October 29, 2021 07:16
@sonarqubecloud:

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs
  • 0 Vulnerabilities
  • 0 Security Hotspots
  • 1 Code Smell

No coverage information, 0.0% duplication

@chchia commented Oct 29, 2021:

I have tried the latest commit 7746004; it seems the hash is not stored in the database after a file is hashed.

I have over 250 groups hashed (screenshot), but checking hash_cache.db, it is empty (screenshot).

I suppose the purpose of having the database is to store the hash result for whatever file you are comparing, so that the next time you open that file you don't have to go through hashing again, as you already have the hash result? Did I miss something?

@Dobatymo (Contributor Author) commented Oct 29, 2021:

@chchia it will commit to the db when the scan is complete. Try opening the database afterwards, or run the scan a second time and see if it's faster.
Also, it will only calculate hashes if files of the same size are encountered; otherwise DupeGuru would never even calculate them.
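For context, the same-size prefilter means hashes are only ever computed inside groups of equally sized files; a sketch of that idea (not dupeGuru's actual scan code):

```python
import os
from collections import defaultdict

def size_groups(paths):
    """Group files by size; only groups with more than one file
    ever get hashed, so unique-size files never touch the cache."""
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.getsize(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```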

@chchia commented Oct 29, 2021:


I see, I thought it wrote to the db on the fly once a file got hashed. I will wait until the scan is completed. Thanks.

@chchia commented Oct 29, 2021:

By the way, regarding the cache invalidation: instead of after XX days, I think it would be better to invalidate only if the file no longer exists or the file modification time has changed.

I suppose a file's hash result will not change unless the file is removed or changed.
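This kind of invalidation falls out naturally if the lookup key includes size and mtime, as the PR already does: a stale row simply never matches. A sketch, reusing the hypothetical column names from the earlier sketch:

```python
import os

def cached_md5(conn, path):
    """Return the cached md5 only if the file still exists and its
    size and mtime match the stored row; otherwise None."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None  # file is gone; the stale row can be pruned later
    row = conn.execute(
        "SELECT md5 FROM files WHERE path = ? AND size = ? AND mtime_ns = ?",
        (str(path), st.st_size, st.st_mtime_ns),
    ).fetchone()
    return row[0] if row else None
```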

@arsenetar (Owner):

Things here are probably fine; I will give it another look. The CI/CD failure has to do with a change between Python 3.9.6 and 3.9.7 that changed the behavior of shutil.copy() (https://bugs.python.org/issue43219), so it's unrelated to these changes.

@arsenetar arsenetar merged commit 34f41dc into arsenetar:master Nov 24, 2021
@Dobatymo Dobatymo deleted the hash-cache branch November 24, 2021 03:32