The C4 Multilingual Dataset #5265
Replies: 8 comments · 13 replies
-
That's really awesome! I have a question: what is the dump date of this CC corpus?
-
Congratulations! Where can we find more details about the cleaning process, e.g. is it deduplicated?
-
@dirkgr Can I download a specific language from the mC4 data? I found that every JSON line has three keys: 'text', 'timestamp', 'url'. So do I need to use langdetect to find a specific language?
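You shouldn't need langdetect, since the release is already partitioned by language. A minimal sketch, assuming the Hugging Face `datasets` library exposes mC4 under the name `mc4` with per-language configs (the config name and the `streaming` flag are assumptions here):

```python
# A sketch, assuming the Hugging Face `datasets` library exposes mC4
# as "mc4" with ISO-language-code configs and supports streaming.
from datasets import load_dataset

# Request only one language; the data is already split per language,
# so no langdetect pass over the JSON lines is needed.
mc4_pt = load_dataset("mc4", "pt", split="train", streaming=True)

# Each example is a dict with 'text', 'timestamp', and 'url' keys.
example = next(iter(mc4_pt))
print(example["url"])
```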
-
Hello @dirkgr
The total number of examples according to my count is 87,337,884, which is consistent with the statistics published by Google. However, the token count and the size are not consistent with the numbers in the NAACL paper and this post. I made a summary regarding the differences.
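For what it's worth, whitespace versus SentencePiece tokenization alone can account for large gaps in token counts. A minimal sketch of one way to reproduce such counts over the gzipped JSON-lines shards (the glob pattern is hypothetical):

```python
# A sketch for counting examples and whitespace tokens across gzipped
# JSON-lines shards; the glob pattern is hypothetical, adjust it to
# wherever the downloaded files live.
import glob
import gzip
import json

n_examples = 0
n_tokens = 0
for path in glob.glob("mc4/c4-*.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            n_examples += 1
            # Whitespace tokens; a SentencePiece count will differ.
            n_tokens += len(record["text"].split())

print(f"{n_examples} examples, {n_tokens} whitespace tokens")
```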
-
Great effort!
-
How can I convert the data from the tfrecord JSON files into plain text files?
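A minimal sketch, assuming the JSON-format shards are gzipped JSON lines with a 'text' field (both filenames below are hypothetical):

```python
# A sketch for pulling the raw text out of a gzipped JSON-lines shard;
# both filenames are hypothetical.
import gzip
import json

with gzip.open("c4-pt.tfrecord-00000-of-01024.json.gz", "rt", encoding="utf-8") as fin, \
        open("c4-pt-00000.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        # One document per output line; escape embedded newlines so
        # documents stay one-per-line.
        fout.write(record["text"].replace("\n", "\\n") + "\n")
```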
-
Guys, thank you for the amazing job! A quick question: the PT dataset is Portuguese as spoken in Portugal, right? The one we speak in Brazil is different.
-
The wait has been long, but we are finally able to release the C4 multilingual dataset!
We now have almost 27TB of clean-ish data, in 101 different languages (plus the "undetected" language). For the approximate sizes of uncompressed text per language, and for more detail about the contents of the dataset, check out Table 5 from the mT5 paper.
To get it, head over to the original post about the dataset. The instructions there have been updated, both for the TFDS format (thank you, Google!) and for the JSON format (thank you, Huggingface!).
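For the TFDS route, here is a hedged sketch, assuming the multilingual data is exposed as the `multilingual` config of the TFDS `c4` builder and that a prebuilt copy is readable from GCS:

```python
# A sketch, assuming TFDS exposes the multilingual data as the
# "multilingual" config of its c4 builder, prebuilt on GCS.
import tensorflow_datasets as tfds

# try_gcs=True reads a prebuilt dataset instead of regenerating it,
# which for C4 would otherwise require a large Beam job.
ds = tfds.load("c4/multilingual", split="train", try_gcs=True)
for example in ds.take(1):
    # Fields are scalar tf.Tensors holding UTF-8 bytes.
    print(example["url"].numpy().decode("utf-8"))
```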
Massive thanks to the original authors of the T5 paper and of the mT5 paper, which introduces the multilingual dataset (and model). Among those authors, special thanks to @adarob for making this happen; he was extremely helpful throughout the process of getting this released.