Hi, from your blog post it seems that RedPajama-v2 performed exact dedup over all dumps.
My question is: did you perform dedup for each dump individually, or was it done across different dumps? In the latter case, wouldn't there be a large memory overhead to load all previous text hashes into memory? Thanks.
Thanks for your question. We performed dedup across all dumps. You are correct that loading all hashes into memory would require a large memory overhead -- this is why we used a Bloom filter for that purpose, a space-efficient data structure for testing set membership. This allowed us to deduplicate the entire dataset using less than 500GB of RAM on a single compute node.
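For illustration, here is a minimal sketch of the idea in Python: a single Bloom filter is streamed over documents from all dumps, and a document is kept only if the filter has not seen its content before. The class, hash scheme, and sizes below are hypothetical and not the actual RedPajama-v2 pipeline code; in practice the bit array would be sized for billions of documents and a tolerable false-positive rate.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for exact-dedup illustration (not the production code)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by slicing a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was possibly seen before (i.e., a duplicate)."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


# Hypothetical usage: one filter shared across every dump keeps only first occurrences.
bf = BloomFilter(num_bits=8 * 1024 * 1024, num_hashes=7)  # size chosen for the example only
docs = ["the quick brown fox", "lorem ipsum", "the quick brown fox"]
unique = [d for d in docs if not bf.add(d)]
print(unique)  # the exact duplicate is dropped
```

Because the filter only stores bits rather than the hashes themselves, its memory footprint is fixed by the chosen bit-array size regardless of how many dumps are processed, at the cost of a small, tunable false-positive rate.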