This is a Python module with accompanying command-line programs that identify duplicate files within a set of files, or file differences between two sets of files. Two files are treated as identical when both their size and their MD5 hash match. File information is stored in sqlite3 databases, so it can be queried repeatedly without re-scanning the files.
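As an illustration of the size-plus-MD5 comparison described above, here is a minimal sketch. The function names `file_signature` and `same_content` are hypothetical and are not taken from this module; the chunked read is an assumption made so that large files need not be loaded into memory at once.

```python
import hashlib
import os


def file_signature(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex_digest) for the file at `path`.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_content(path_a, path_b):
    """Treat two files as identical when size and MD5 both match."""
    # Comparing sizes first avoids hashing files that cannot be equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_signature(path_a) == file_signature(path_b)
```

Checking the size first is cheap and rules out most non-matches before any file contents are read.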
The first step is to create the file information database(s) with the mk_file_db.py script. Duplicate file analysis needs only a single database. File difference analysis needs two databases, one for each of the two sets of files. Each database must be created with the same version of Python (e.g. 2.7 or 3.x) that is later used to analyze it.
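The database-creation step might look roughly like the sketch below. This is not the code of mk_file_db.py: the function names, the `files` table, and its `path`, `size`, and `md5` columns are assumptions chosen for illustration, and the real script's schema and options may differ.

```python
import hashlib
import os
import sqlite3
from contextlib import closing


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_file_db(root, db_path):
    """Scan `root` recursively and store (path, size, md5) per regular file.

    Hypothetical schema; returns the number of rows in the table.
    Keep `db_path` outside `root` so the database does not index itself.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            "path TEXT PRIMARY KEY, size INTEGER NOT NULL, md5 TEXT NOT NULL)"
        )
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                conn.execute(
                    "INSERT OR REPLACE INTO files (path, size, md5) VALUES (?, ?, ?)",
                    (path, os.path.getsize(path), md5_of(path)),
                )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

For difference analysis, the same function would be called twice, once per set of files, each time with its own database path.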
The next step is to query the file information: find_dupes.py identifies duplicate files within a database, and find_in_B_not_in_A.py identifies file differences between two databases. For now, the output of these programs is a simple list. The programs may later be extended to turn that output into scripts that act on it.
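The two queries can be sketched as follows. This is an illustration only, not the code of find_dupes.py or find_in_B_not_in_A.py. It assumes a hypothetical `files` table with `path`, `size`, and `md5` columns, and it assumes that "in B, not in A" means files in database B whose (size, md5) pair has no match in database A, which is inferred from the script name rather than confirmed by the source.

```python
import sqlite3
from collections import defaultdict
from contextlib import closing


def find_duplicates(db_path):
    """Return lists of paths that share the same (size, md5) pair."""
    groups = defaultdict(list)
    with closing(sqlite3.connect(db_path)) as conn:
        query = "SELECT path, size, md5 FROM files ORDER BY path"
        for path, size, md5 in conn.execute(query):
            groups[(size, md5)].append(path)
    # Only groups with more than one path are duplicates.
    return sorted(paths for paths in groups.values() if len(paths) > 1)


def find_in_b_not_in_a(db_a, db_b):
    """Return paths in database B whose (size, md5) has no match in A."""
    with closing(sqlite3.connect(db_b)) as conn:
        # Attach A under the schema name "other" so one query can see both.
        conn.execute("ATTACH DATABASE ? AS other", (db_a,))
        rows = conn.execute(
            "SELECT b.path FROM main.files AS b "
            "WHERE NOT EXISTS ("
            "  SELECT 1 FROM other.files AS a"
            "  WHERE a.size = b.size AND a.md5 = b.md5) "
            "ORDER BY b.path"
        ).fetchall()
    return [row[0] for row in rows]
```

Both functions return plain Python lists, which matches the simple list output described above and leaves any follow-up action (deleting, copying, generating a script) to the caller.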