
Exact dedup details #115

Open
jordane95 opened this issue May 30, 2024 · 1 comment
@jordane95
Hi, from your blog post it seems that RedPajama-V2 performed exact deduplication over all dumps.

My question is: did you perform dedup for each dump individually, or was it done across different dumps? In the latter case, wouldn't loading all previous text hashes into memory incur a large memory overhead? Thanks.

@mauriceweber
Collaborator

Hi @jordane95

Thanks for your question. We performed dedup across all dumps. You are correct that loading all hashes into memory would incur a large memory overhead -- this is why we used a Bloom filter for that purpose, which is a space-efficient probabilistic data structure for testing set membership. This allowed us to deduplicate the entire dataset using less than 500GB of RAM on a single compute node.
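To illustrate the idea (this is only a minimal sketch, not the actual RedPajama-V2 implementation), exact dedup with a Bloom filter can look like the following: each document is hashed, the hash is tested against the filter, and the document is kept only if the hash has not been seen before. All names here (`BloomFilter`, `dedup`) are hypothetical.

```python
# Minimal Bloom-filter dedup sketch (illustrative only; not the
# RedPajama-V2 code). A Bloom filter stores set membership in a fixed-size
# bit array; lookups can yield false positives but never false negatives,
# so some non-duplicates may be dropped, but no duplicate slips through.
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def dedup(docs, bf: BloomFilter):
    """Keep each document only the first time its content hash is seen."""
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in bf:
            bf.add(digest)
            kept.append(doc)
    return kept


bf = BloomFilter(size_bits=1 << 20, num_hashes=7)
docs = ["alpha", "beta", "alpha", "gamma", "beta"]
print(dedup(docs, bf))  # duplicates dropped, first occurrences kept
```

Because the filter's memory footprint is fixed by `size_bits` rather than by the number of stored hashes, a single node can hold the membership state for the whole corpus, at the cost of a tunable false-positive rate.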
