First and foremost, I want to express my sincere appreciation for the excellent and comprehensive data processing code and documentation that you've provided. They've been truly invaluable to me.
I am currently faced with the task of performing global deduplication on Slimpajama and Pile v1. However, my resources are somewhat limited as I only have access to a 1TB memory instance.
In the documentation, you mention the option of splitting the LSH object into multiple buckets, which sounds like a feasible solution for my situation. Unfortunately, I'm not entirely sure how to go about doing this.
Could you possibly provide some guidance or tips on how I might implement this solution? Your expertise and assistance would be immensely appreciated.
Thank you so much for your time and consideration.
After some research, I believe replacing the _H function with _H_32 (4-byte keys) or _H_64 (8-byte keys) will do the trick:
import hashlib
import struct

def sha1_hash32(data):
    """A 32-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate a 32-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 32 bits.
    """
    return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

def sha1_hash64(data):
    """A 64-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate a 64-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 64 bits.
    """
    return struct.unpack('<Q', hashlib.sha1(data).digest()[:8])[0]

def _H(hs):
    # Original behaviour: the full byteswapped band of MinHash values becomes the bucket key.
    return bytes(hs.byteswap().data)

def _H_32(hs):
    # Compress the band into a 4-byte integer key to reduce peak memory.
    return sha1_hash32(bytes(hs.byteswap().data))

def _H_64(hs):
    # Compress the band into an 8-byte integer key to reduce peak memory.
    return sha1_hash64(bytes(hs.byteswap().data))
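For intuition, here is a hypothetical snippet using the functions above with a made-up band (not code from the repo): the per-bucket key shrinks from the full byteswapped band to a fixed 4- or 8-byte integer.

import numpy as np

# Hypothetical band of MinHash values (datasketch stores hash values as uint64).
band = np.array([123456789, 987654321, 42, 7], dtype=np.uint64)

full_key = _H(band)    # 4 values * 8 bytes = a 32-byte bytes key per bucket entry
key_32 = _H_32(band)   # one integer that fits in 4 bytes
key_64 = _H_64(band)   # one integer that fits in 8 bytes

print(len(full_key), key_32 < 2**32, key_64 < 2**64)  # 32 True True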
I think one could also limit the number of concurrent processes in generate_duplicate_pairs.py to cap peak memory usage.
For example, if you have 20 bands, read and process the hashes of the first 10 bands, find and write the duplicate pairs, clear the queues, then proceed to the second 10 bands.
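A rough, self-contained sketch of that chunked idea follows. The reader stub, file layout, and function names are assumptions for illustration only; the real script distributes the work across processes and queues rather than a plain dict.

from collections import defaultdict
from itertools import combinations

def iter_band_records(band_id):
    """Hypothetical reader: yield (bucket_key, doc_id) pairs for one band.

    In the real pipeline this would stream the hashes written by the hashing step;
    here it is a stub so the sketch stays self-contained.
    """
    yield from []

def find_duplicate_pairs_chunked(n_bands=20, bands_per_chunk=10,
                                 out_path="duplicate_pairs.txt"):
    with open(out_path, "w") as out:
        # Process the bands in chunks so only bands_per_chunk bucket tables
        # are resident in memory at any one time.
        for start in range(0, n_bands, bands_per_chunk):
            buckets = defaultdict(list)  # (band_id, bucket_key) -> doc ids for this chunk
            for band_id in range(start, start + bands_per_chunk):
                for key, doc_id in iter_band_records(band_id):
                    buckets[(band_id, key)].append(doc_id)
            # Documents sharing a bucket within a band are candidate duplicates.
            for docs in buckets.values():
                for a, b in combinations(docs, 2):
                    out.write(f"{a} :: {b}\n")
            buckets.clear()  # free this chunk's buckets before the next pass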