New HASH option #875

Closed

gaow opened this issue Feb 1, 2018 · 4 comments

Comments

gaow (Member) commented Feb 1, 2018

I recently came across the xxhash64 library and its python binding. It is fast and the keys it generates are shorter. Is it worth considering?
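
For context, the python binding mirrors the hashlib interface (update/hexdigest), so switching should be mechanical. A minimal usage sketch:

import xxhash  # pip install xxhash

h = xxhash.xxh64()
h.update(b'some data')
print(h.hexdigest())  # 16 hex characters, vs 32 for hashlib.md5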

BoPeng (Contributor) commented Feb 1, 2018

Maybe. Perhaps you can run a benchmark by using it in fileMD5 on a few large files; I suspect most of the time is spent on file reading, not hash generation.
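
One way to separate the two costs is to time both digests on an in-memory buffer, so no disk reads are involved. A minimal sketch (the 256 MB buffer size is an arbitrary choice):

import hashlib
import timeit

import xxhash  # pip install xxhash

buf = b'\x00' * (256 * 2**20)  # 256 MB of zeros, held in memory

for name, fn in [('md5', lambda: hashlib.md5(buf).hexdigest()),
                 ('xxh64', lambda: xxhash.xxh64(buf).hexdigest())]:
    # average over three passes; no file I/O in the timed path
    print(name, round(timeit.timeit(fn, number=3) / 3, 3), 's per pass')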

gaow (Member, Author) commented Feb 1, 2018

OK, so I tried it on a 1.6 GB file. Here is the code:

import sys
import hashlib
from xxhash import xxh64

def fileMD5(filename):
    '''Calculate the MD5 of a file. (The sos version hashes only the
    first and last chunks of large files, which should significantly
    reduce the time spent on the creation and comparison of file
    signatures when dealing with large bioinformatics datasets; this
    stripped-down version hashes the whole file.)'''
    md5 = hashlib.md5()
    block_size = 2**20  # read in 1 MB chunks
    try:
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                md5.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return md5.hexdigest()

def fileH64(filename):
    '''Same loop as fileMD5, but with xxh64.'''
    h = xxh64()
    block_size = 2**20  # read in 1 MB chunks
    try:
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                h.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return h.hexdigest()

Result:

[GW] time python -c "from test import fileMD5; print(fileMD5('snp-gene.sqlite3'))"
916b7f5249367c13fac629e3227089f5

real	0m12.472s
user	0m2.140s
sys	0m0.384s
[GW] time python -c "from test import fileH64; print(fileH64('snp-gene.sqlite3'))"
ecbb636ee52b811e

real	0m0.317s
user	0m0.116s
sys	0m0.200s

Not only is it faster, the digest is also shorter. Note that I hashed the full 1.6 GB file (but 0.3s? good for my SSD!). xxh32 is about the same speed and its digest is even shorter: 3b36e970.
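
For completeness, here is a sketch of the partial-hash strategy the original docstring describes, switched to xxh64. The 16 MB chunk size is an assumption carried over from the 2**24 comment, and fileH64Partial is a hypothetical name, not part of sos:

import os
from xxhash import xxh64

def fileH64Partial(filename, chunk=2**24):
    '''Hash only the first and last 16 MB of large files.'''
    h = xxh64()
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        if size <= 2 * chunk:
            # small file: hash everything
            for block in iter(lambda: f.read(2**20), b''):
                h.update(block)
        else:
            # large file: hash the first and last chunk only
            h.update(f.read(chunk))
            f.seek(size - chunk)
            h.update(f.read(chunk))
    return h.hexdigest()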

BoPeng pushed a commit that referenced this issue Feb 1, 2018
BoPeng (Contributor) commented Feb 1, 2018

The patch is easy, but I am not sure about side effects... Let us keep this ticket open for a while.

BoPeng (Contributor) commented Feb 15, 2018

Seems to be working all right on different systems. Installation is a bit more difficult because pip needs to compile xxhash from source and emits some warnings, but nothing has failed so far.
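
If the build dependency becomes a problem, one option (a sketch, not the actual sos patch; fileHash is a hypothetical helper) is to fall back to hashlib when the binding is unavailable:

try:
    from xxhash import xxh64 as _hasher  # fast path if the binding built
except ImportError:
    from hashlib import md5 as _hasher   # pure-stdlib fallback

def fileHash(filename, block_size=2**20):
    h = _hasher()
    with open(filename, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

One caveat with such a fallback: the two algorithms produce different digests, so signatures written by one branch would not validate against the other.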
