New HASH option #875
Perhaps you could run a benchmark by using it with fileMD5 on a few large files. I suspect most of the time is spent on file reading, not hash generation.
Okay, so I did it on a 1.6 GB file. Here is the code:

```python
import sys
import hashlib
from xxhash import xxh64


def fileMD5(filename):
    '''Calculate partial MD5, basically the first and last 8M
    of the file for large files. This should significantly reduce
    the time spent on the creation and comparison of file signatures
    when dealing with large bioinformatics datasets.'''
    # calculate md5 for the specified file
    md5 = hashlib.md5()
    block_size = 2**20  # buffer of 1M
    try:
        # 2**24 = 16M
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                md5.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return md5.hexdigest()


def fileH64(filename):
    # same read loop, but with the 64-bit xxHash algorithm
    h64 = xxh64()
    block_size = 2**20  # buffer of 1M
    try:
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                h64.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return h64.hexdigest()
```

Result:
Not only is it faster, but the result is shorter. Notice I hashed the full 1.6 GB file (but 0.3 s? good for my SSD!).
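For anyone who wants to reproduce the comparison, a minimal timing harness along these lines should do; it reuses the two functions above, and the file path is just a placeholder for any large local file:

```python
import time

def time_hash(func, filename):
    # run one full-file hash and report elapsed time and digest length
    start = time.perf_counter()
    digest = func(filename)
    elapsed = time.perf_counter() - start
    print(f'{func.__name__}: {elapsed:.2f}s  {digest}  ({len(digest)} hex chars)')

# placeholder path; substitute any large file, e.g. a 1.6 GB dataset
for func in (fileMD5, fileH64):
    time_hash(func, 'large_test_file.dat')
```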
The patch is easy, but I am not sure about side effects… Let us keep this ticket open for a while.
Seems to be working all right on different systems. Installation is a bit more difficult because …
I recently came across this library, xxhash64, and its Python binding. It is fast and the keys generated are shorter. Is it worth considering?
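For a concrete sense of the shorter keys: MD5 produces a 128-bit digest (32 hex characters), while xxh64 produces a 64-bit digest (16 hex characters). A quick sketch, assuming the xxhash package from PyPI:

```python
import hashlib
import xxhash

data = b'some file contents'
print(hashlib.md5(data).hexdigest())   # 32 hex characters (128-bit digest)
print(xxhash.xxh64(data).hexdigest())  # 16 hex characters (64-bit digest)
```

The flip side is that a 64-bit digest collides more easily than a 128-bit one, which is presumably the kind of side effect worth weighing before switching.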