The C4 Multilingual Dataset #5265
Replies: 8 comments · 13 replies
-
That's really awesome! I have a question: what is the dump date of this CC corpus?
-
Congratulations! Where can we find more details about the cleaning process, e.g. is it deduplicated?
-
@dirkgr Can I download a specific language from the mC4 data? I found that every JSON line has three keys: 'text', 'timestamp', 'url'. So do I need to use langdetect to find a specific language?
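You shouldn't need langdetect, since the release is already partitioned by language. A minimal sketch, assuming the Hugging Face `datasets` library exposes mC4 under the name `mc4` with per-language configs (the config name and the `streaming` flag are assumptions here):

```python
# A sketch, assuming the Hugging Face `datasets` library exposes mC4
# as "mc4" with ISO-language-code configs and supports streaming.
from datasets import load_dataset

# Request only one language; the data is already split per language,
# so no langdetect pass over the JSON lines is needed.
mc4_pt = load_dataset("mc4", "pt", split="train", streaming=True)

# Each example is a dict with 'text', 'timestamp', and 'url' keys.
example = next(iter(mc4_pt))
print(example["url"])
```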
-
Hello @dirkgr
The total number of examples according to my count is 87,337,884, which is consistent with the statistics published by Google. However, the token count and the size are not consistent with the numbers in the NAACL paper and this post. I made a summary regarding the differences.
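For what it's worth, whitespace versus SentencePiece tokenization alone can account for large gaps in token counts. A minimal sketch of one way to reproduce such counts over the gzipped JSON-lines shards (the glob pattern is hypothetical):

```python
# A sketch for counting examples and whitespace tokens across gzipped
# JSON-lines shards; the glob pattern is hypothetical, adjust it to
# wherever the downloaded files live.
import glob
import gzip
import json

n_examples = 0
n_tokens = 0
for path in glob.glob("mc4/c4-*.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            n_examples += 1
            # Whitespace tokens; a SentencePiece count will differ.
            n_tokens += len(record["text"].split())

print(f"{n_examples} examples, {n_tokens} whitespace tokens")
```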
-
Great effort!
-
How can I convert the data from the tfrecord JSON files into plain text files?
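A minimal sketch, assuming the JSON-format shards are gzipped JSON lines with a 'text' field (both filenames below are hypothetical):

```python
# A sketch for pulling the raw text out of a gzipped JSON-lines shard;
# both filenames are hypothetical.
import gzip
import json

with gzip.open("c4-pt.tfrecord-00000-of-01024.json.gz", "rt", encoding="utf-8") as fin, \
        open("c4-pt-00000.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        # One document per output line; escape embedded newlines so
        # documents stay one-per-line.
        fout.write(record["text"].replace("\n", "\\n") + "\n")
```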
-
Guys, thank you for the amazing job! A quick question: the PT dataset is Portuguese as spoken in Portugal, right? The one we speak in Brazil is different.
-
The wait has been long, but we are finally able to release the C4 multilingual dataset!
We now have almost 27TB of clean-ish data, in 101 different languages (plus the "undetected" language). For the approximate sizes of uncompressed text per language, and for more detail about the contents of the dataset, check out Table 5 from the mT5 paper.
To get it, head over to the original post about the dataset. The instructions there have been updated, both for the TFDS format (thank you, Google!) and for the JSON format (thank you, Huggingface!).
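For the TFDS route, here is a hedged sketch, assuming the multilingual data is exposed as the `multilingual` config of the TFDS `c4` builder and that a prebuilt copy is readable from GCS:

```python
# A sketch, assuming TFDS exposes the multilingual data as the
# "multilingual" config of its c4 builder, prebuilt on GCS.
import tensorflow_datasets as tfds

# try_gcs=True reads a prebuilt dataset instead of regenerating it,
# which for C4 would otherwise require a large Beam job.
ds = tfds.load("c4/multilingual", split="train", try_gcs=True)
for example in ds.take(1):
    # Fields are scalar tf.Tensors holding UTF-8 bytes.
    print(example["url"].numpy().decode("utf-8"))
```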
Massive thanks to the original authors of the T5 paper and of the mT5 paper, which introduces the multilingual dataset (and model). Among those authors, special thanks to @adarob for making this happen; he was extremely helpful throughout the process of getting this released.