Global Deduplication of Slimpajama and Pile v1 with Limited Resources #16

Open
lckdl opened this issue Jul 19, 2023 · 3 comments

@lckdl

lckdl commented Jul 19, 2023

First and foremost, I want to express my sincere appreciation for the excellent and comprehensive data processing code and documentation that you've provided. They've been truly invaluable to me.

I am currently faced with the task of performing global deduplication on Slimpajama and Pile v1. However, my resources are somewhat limited as I only have access to a 1TB memory instance.

In the documentation, you've mentioned the option of splitting the LSH Object into multiple buckets, which could potentially be a feasible solution for my situation. Unfortunately, I'm not entirely sure how to go about doing this.

Could you possibly provide some guidance or tips on how I might implement this solution? Your expertise and assistance would be immensely appreciated.

Thank you so much for your time and consideration.

@world1tree

After some research, I believe replacing the _H function with _H_32 (4 bytes) or _H_64 (8 bytes) will do the trick.

import hashlib
import struct

def sha1_hash32(data):
    """A 32-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate 32-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 32 bits.
    """
    return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

def sha1_hash64(data):
    """A 32-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate 64-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 64 bits.
    """
    return struct.unpack('<Q', hashlib.sha1(data).digest()[:8])[0]

def _H(hs):
    # Original bucket-key function: uses the full byte-swapped hash array as the key.
    return bytes(hs.byteswap().data)

def _H_32(hs):
    # Compresses each bucket key to a 4-byte integer to reduce memory usage.
    return sha1_hash32(bytes(hs.byteswap().data))

def _H_64(hs):
    # Compresses each bucket key to an 8-byte integer to reduce memory usage.
    return sha1_hash64(bytes(hs.byteswap().data))
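For context, here is a minimal sketch of how a bucket-key function like _H_32 is applied when banding a MinHash signature. It assumes the functions above are defined; the signature length, band count, and row count are illustrative, not taken from this repository's configuration.

import numpy as np

# Illustrative MinHash signature: 128 hash values split into 8 bands of 16 rows.
signature = np.random.RandomState(0).randint(0, 2**32, size=128, dtype=np.uint64)
num_bands, rows_per_band = 8, 16

bucket_keys = []
for b in range(num_bands):
    band = signature[b * rows_per_band:(b + 1) * rows_per_band]
    # _H would keep all 16 * 8 = 128 bytes as the key;
    # _H_32 shrinks each key to a 4-byte integer.
    bucket_keys.append(_H_32(band))

print(bucket_keys[:3])  # three 32-bit integer keys

Note that a 4-byte key can collide for bands that are not identical, which adds false-positive candidate pairs; whether that is acceptable depends on how candidates are filtered downstream.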

@frankang

I think you could limit the number of concurrent processes in generate_duplicate_pairs.py to cap the peak memory usage.
For example, if you have 20 bands, read and process the hashes of the first 10 bands, find and write the duplicate pairs, clear the queues, then proceed to the remaining 10 bands.
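A rough sketch of that chunking idea, assuming each band's buckets can be loaded independently; the load_band_buckets helper, file names, and band counts below are hypothetical, not the actual interface of generate_duplicate_pairs.py:

import gc
from collections import defaultdict

NUM_BANDS = 20
BANDS_PER_CHUNK = 10  # process half of the bands at a time to cap peak memory

def load_band_buckets(band_id):
    # Placeholder: in the real pipeline this would read one band's hashes from disk
    # and return {bucket_key: [doc_id, ...]}.
    return defaultdict(list)

for start in range(0, NUM_BANDS, BANDS_PER_CHUNK):
    chunk = [load_band_buckets(b) for b in range(start, start + BANDS_PER_CHUNK)]

    # Write candidate duplicate pairs for this chunk of bands.
    with open(f"duplicate_pairs_bands_{start}_{start + BANDS_PER_CHUNK - 1}.txt", "w") as out:
        for buckets in chunk:
            for key, docs in buckets.items():
                for i in range(len(docs)):
                    for j in range(i + 1, len(docs)):
                        out.write(f"{docs[i]}\t{docs[j]}\n")

    # Drop this chunk's buckets (the "queues") before loading the next chunk.
    del chunk
    gc.collect()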

@ntudy

ntudy commented Mar 18, 2024

Is there any implementation available?
