New HASH option #875

Closed

gaow opened this issue Feb 1, 2018 · 4 comments

Comments

gaow (Member) commented Feb 1, 2018

I recently came across the xxhash64 library and its python binding. It is fast and the keys it generates are shorter. Is it worth considering?
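
For context, the python binding mirrors the hashlib interface (update/hexdigest), so switching should be mechanical. A minimal usage sketch:

import xxhash  # pip install xxhash

h = xxhash.xxh64()
h.update(b'some data')
print(h.hexdigest())  # 16 hex characters, vs 32 for hashlib.md5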

BoPeng (Contributor) commented Feb 1, 2018

Maybe. Perhaps you can run a benchmark by using it in fileMD5 on a few large files; I suspect most of the time is spent on file reading, not hash generation.
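
One way to separate the two costs is to time both digests on an in-memory buffer, so no disk reads are involved. A minimal sketch (the 256 MB buffer size is an arbitrary choice):

import hashlib
import timeit

import xxhash  # pip install xxhash

buf = b'\x00' * (256 * 2**20)  # 256 MB of zeros, held in memory

for name, fn in [('md5', lambda: hashlib.md5(buf).hexdigest()),
                 ('xxh64', lambda: xxhash.xxh64(buf).hexdigest())]:
    # average over three passes; no file I/O in the timed path
    print(name, round(timeit.timeit(fn, number=3) / 3, 3), 's per pass')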

gaow (Member, Author) commented Feb 1, 2018

OK, so I tried it on a 1.6 GB file. Here is the code:

import sys
import hashlib
from xxhash import xxh64

def fileMD5(filename):
    '''Calculate the MD5 of a file. (The sos version hashes only the
    first and last chunks of large files, which should significantly
    reduce the time spent on the creation and comparison of file
    signatures when dealing with large bioinformatics datasets; this
    stripped-down version hashes the whole file.)'''
    md5 = hashlib.md5()
    block_size = 2**20  # read in 1 MB chunks
    try:
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                md5.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return md5.hexdigest()

def fileH64(filename):
    '''Same loop as fileMD5, but with xxh64.'''
    h = xxh64()
    block_size = 2**20  # read in 1 MB chunks
    try:
        with open(filename, 'rb') as f:
            while True:
                data = f.read(block_size)
                if not data:
                    break
                h.update(data)
    except IOError as e:
        sys.exit(f'Failed to read {filename}: {e}')
    return h.hexdigest()

Result:

[GW] time python -c "from test import fileMD5; print(fileMD5('snp-gene.sqlite3'))"
916b7f5249367c13fac629e3227089f5

real	0m12.472s
user	0m2.140s
sys	0m0.384s
[GW] time python -c "from test import fileH64; print(fileH64('snp-gene.sqlite3'))"
ecbb636ee52b811e

real	0m0.317s
user	0m0.116s
sys	0m0.200s

Not only is it faster, the digest is also shorter. Note that I hashed the full 1.6 GB file (but 0.3s? good for my SSD!). xxh32 is about the same speed and its digest is even shorter: 3b36e970.
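
For completeness, here is a sketch of the partial-hash strategy the original docstring describes, switched to xxh64. The 16 MB chunk size is an assumption carried over from the 2**24 comment, and fileH64Partial is a hypothetical name, not part of sos:

import os
from xxhash import xxh64

def fileH64Partial(filename, chunk=2**24):
    '''Hash only the first and last 16 MB of large files.'''
    h = xxh64()
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        if size <= 2 * chunk:
            # small file: hash everything
            for block in iter(lambda: f.read(2**20), b''):
                h.update(block)
        else:
            # large file: hash the first and last chunk only
            h.update(f.read(chunk))
            f.seek(size - chunk)
            h.update(f.read(chunk))
    return h.hexdigest()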

BoPeng pushed a commit that referenced this issue Feb 1, 2018
BoPeng (Contributor) commented Feb 1, 2018

The patch is easy, but I am not sure about side effects... Let us keep this ticket open for a while.

BoPeng (Contributor) commented Feb 15, 2018

Seems to be working all right on different systems. Installation is a bit more difficult because pip needs to compile xxhash from source and emits some warnings, but nothing has failed so far.
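
If the build dependency becomes a problem, one option (a sketch, not the actual sos patch; fileHash is a hypothetical helper) is to fall back to hashlib when the binding is unavailable:

try:
    from xxhash import xxh64 as _hasher  # fast path if the binding built
except ImportError:
    from hashlib import md5 as _hasher   # pure-stdlib fallback

def fileHash(filename, block_size=2**20):
    h = _hasher()
    with open(filename, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

One caveat with such a fallback: the two algorithms produce different digests, so signatures written by one branch would not validate against the other.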
