Global Deduplication of Slimpajama and Pile v1 with Limited Resources #16

Open
lckdl opened this issue Jul 19, 2023 · 3 comments

@lckdl

lckdl commented Jul 19, 2023

First and foremost, I want to express my sincere appreciation for the excellent and comprehensive data processing code and documentation that you've provided. They've been truly invaluable to me.

I am currently faced with the task of performing global deduplication on Slimpajama and Pile v1. However, my resources are somewhat limited as I only have access to a 1TB memory instance.

In the documentation, you've mentioned the option of splitting the LSH Object into multiple buckets, which could potentially be a feasible solution for my situation. Unfortunately, I'm not entirely sure how to go about doing this.

Could you possibly provide some guidance or tips on how I might implement this solution? Your expertise and assistance would be immensely appreciated.

Thank you so much for your time and consideration.

@world1tree

After some research, I believe replacing the _H function with _H_32 (4 bytes) or _H_64 (8 bytes) will do the trick.

import hashlib
import struct

def sha1_hash32(data):
    """A 32-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate 32-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 32 bits.
    """
    return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

def sha1_hash64(data):
    """A 32-bit hash function based on SHA1.

    Args:
        data (bytes): the data to generate 64-bit integer hash from.

    Returns:
        int: an integer hash value that can be encoded using 64 bits.
    """
    return struct.unpack('<Q', hashlib.sha1(data).digest()[:8])[0]

def _H(hs):
    # Original bucket-key function: uses the full byte-swapped hash array as the key.
    return bytes(hs.byteswap().data)

def _H_32(hs):
    # Compresses each bucket key to a 4-byte integer to reduce memory usage.
    return sha1_hash32(bytes(hs.byteswap().data))

def _H_64(hs):
    # Compresses each bucket key to an 8-byte integer to reduce memory usage.
    return sha1_hash64(bytes(hs.byteswap().data))
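For context, here is a minimal sketch of how a bucket-key function like _H_32 is applied when banding a MinHash signature. It assumes the functions above are defined; the signature length, band count, and row count are illustrative, not taken from this repository's configuration.

import numpy as np

# Illustrative MinHash signature: 128 hash values split into 8 bands of 16 rows.
signature = np.random.RandomState(0).randint(0, 2**32, size=128, dtype=np.uint64)
num_bands, rows_per_band = 8, 16

bucket_keys = []
for b in range(num_bands):
    band = signature[b * rows_per_band:(b + 1) * rows_per_band]
    # _H would keep all 16 * 8 = 128 bytes as the key;
    # _H_32 shrinks each key to a 4-byte integer.
    bucket_keys.append(_H_32(band))

print(bucket_keys[:3])  # three 32-bit integer keys

Note that a 4-byte key can collide for bands that are not identical, which adds false-positive candidate pairs; whether that is acceptable depends on how candidates are filtered downstream.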

@frankang

I think you could limit the number of concurrent processes in generate_duplicate_pairs.py to cap the peak memory usage.
For example, if you have 20 bands, read and process the hashes of the first 10 bands, find and write the duplicate pairs, clear the queues, then proceed to the remaining 10 bands.
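A rough sketch of that chunking idea, assuming each band's buckets can be loaded independently; the load_band_buckets helper, file names, and band counts below are hypothetical, not the actual interface of generate_duplicate_pairs.py:

import gc
from collections import defaultdict

NUM_BANDS = 20
BANDS_PER_CHUNK = 10  # process half of the bands at a time to cap peak memory

def load_band_buckets(band_id):
    # Placeholder: in the real pipeline this would read one band's hashes from disk
    # and return {bucket_key: [doc_id, ...]}.
    return defaultdict(list)

for start in range(0, NUM_BANDS, BANDS_PER_CHUNK):
    chunk = [load_band_buckets(b) for b in range(start, start + BANDS_PER_CHUNK)]

    # Write candidate duplicate pairs for this chunk of bands.
    with open(f"duplicate_pairs_bands_{start}_{start + BANDS_PER_CHUNK - 1}.txt", "w") as out:
        for buckets in chunk:
            for key, docs in buckets.items():
                for i in range(len(docs)):
                    for j in range(i + 1, len(docs)):
                        out.write(f"{docs[i]}\t{docs[j]}\n")

    # Drop this chunk's buckets (the "queues") before loading the next chunk.
    del chunk
    gc.collect()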

@ntudy

ntudy commented Mar 18, 2024

Is there any implementation available?
