This repository provides cleaned lists of the most frequent words and n-grams (sequences of n words) in the Google Books Ngram Corpus (v3/20200217, all languages), including some English translations, plus customizable Python code which reproduces these lists.
Lists with the most frequent n-grams are provided separately by language and n. Available languages are Chinese (simplified), English, English Fiction, French, German, Hebrew, Italian, Russian, and Spanish. n ranges from 1 to 5. In the provided lists the language subcorpora are restricted to books published in the years 2010-2019, but in the Python code both this and the number of most frequent n-grams included can be adjusted.
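As a rough illustration of what the year restriction means, here is a minimal sketch that sums counts over a range of years. It assumes the raw v3 line layout is the n-gram followed by tab-separated `year,match_count,volume_count` records. The function name `count_in_years` and its parameters are hypothetical and are not taken from this repository's code, which may do this differently.

```python
from collections import Counter


def count_in_years(lines, year_min=2010, year_max=2019):
    """Sum match counts per n-gram over an inclusive range of years."""
    totals = Counter()
    for line in lines:
        # Assumed layout: ngram, then "year,match_count,volume_count" records,
        # all separated by tabs.
        ngram, *records = line.rstrip("\n").split("\t")
        for record in records:
            year, match_count, _volume_count = record.split(",")
            if year_min <= int(year) <= year_max:
                totals[ngram] += int(match_count)
    return totals


# Made-up sample lines, for illustration only.
sample = [
    "bonjour\t2009,5,3\t2010,7,4\t2019,2,1\t2020,9,6\n",
    "merci\t2011,4,2\n",
]

# Keep only the single most frequent n-gram within 2010-2019.
top = count_in_years(sample).most_common(1)
print(top)
```

Widening `year_min` and `year_max`, or changing the argument to `most_common`, corresponds to the two adjustments mentioned above.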
The lists are found in the ngrams directory. For all languages except Hebrew, cleaned lists are provided for the
- 10.000 most frequent 1-grams,
- 5.000 most frequent 2-grams,
- 3.000 most frequent 3-grams,
- 1.000 most frequent 4-grams,
- 1.000 most frequent 5-grams.
For Hebrew, due to the small corpus size, only the 200 most frequent 4-grams and 80 most frequent 5-grams are provided.
All cleaned lists also contain the number of times each n-gram occurs in the corpus (its frequency, column `freq`). For 1-grams (words) there are two additional columns:
- `cumshare`, which for each word contains the cumulative share of all words in the corpus made up by that word and all more frequent words,
- `en`, which contains the English translation of the word obtained using the Google Cloud Translate API (only for non-English languages).
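The cumulative share can be sketched as a running sum of frequencies divided by the total number of words in the corpus. This is a minimal sketch, not the repository's own code: the function name `cumulative_shares` is hypothetical, the numbers are toy values, and exactly which tokens count towards the corpus total is an assumption left to the caller.

```python
from itertools import accumulate


def cumulative_shares(freqs, corpus_total):
    """Cumulative share per word; freqs must be sorted in descending order.

    corpus_total is the number of all words in the corpus, not only the
    words that made it into the list.
    """
    return [round(running / corpus_total, 3) for running in accumulate(freqs)]


# Toy frequencies: three listed words in a corpus of 100 words.
freqs = [50, 30, 10]
shares = cumulative_shares(freqs, corpus_total=100)
print(shares)
```

Because the total covers the whole corpus, the last value in a list stays below 1 whenever less frequent words are left out.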
Here are the first rows of `1grams_french.csv`:
ngram | freq | cumshare | en |
---|---|---|---|
de | 1380202965 | 0.048 | of |
la | 823756863 | 0.077 | the |
et | 651571349 | 0.100 | and |
le | 614855518 | 0.121 | the |
à | 577644624 | 0.142 | at |
l' | 527188618 | 0.160 | the |
les | 503689143 | 0.178 | them |
en | 390657918 | 0.191 | in |
des | 384774428 | 0.205 | of the |
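One way to use the `cumshare` column is to ask how many of the most frequent words are needed to cover a given share of the corpus. The sketch below embeds a few of the rows shown above so that it runs on its own. It assumes the CSV is comma-delimited with the header `ngram,freq,cumshare,en`, and the helper `words_needed` is hypothetical.

```python
import csv
import io

# A few rows from the table above; the comma delimiter is an assumption.
sample_csv = """ngram,freq,cumshare,en
de,1380202965,0.048,of
la,823756863,0.077,the
et,651571349,0.100,and
le,614855518,0.121,the
"""


def words_needed(rows, target_share):
    """Return how many top words reach target_share, or None if never reached."""
    for rank, row in enumerate(rows, start=1):
        if float(row["cumshare"]) >= target_share:
            return rank
    return None


rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(words_needed(rows, 0.10))
```

To run this on the full list, the `io.StringIO` object could be replaced by the file opened with `encoding="utf-8"` and `newline=""`, assuming it sits at `ngrams/1grams_french.csv`.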
The lists found directly in the ngrams directory have been cleaned and are intended for use when developing language-learning materials. The sub-directory ngrams/more contains uncleaned and less cleaned versions which might be of use for e.g. linguists:
- the most frequent raw n-grams as Google stores them (suffixed `0_raw`),
- only keeping entries without part-of-speech (POS) tags (suffixed `1a_no_pos`),
- only keeping entries with POS tags (only for 1-grams, suffixed `1b_with_pos`),
- entries excluded from the final cleaned lists (suffixed `2_removed`).
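The split between entries with and without POS tags can be sketched as follows. This assumes that tags appear either appended to a word (such as `book_NOUN`) or as a standalone placeholder (such as `_NOUN_`), and that the tag set is the one listed in `POS_TAGS`. The helper `has_pos_tag` is hypothetical, and the repository's actual rule may differ.

```python
# Assumed tag set; adjust if the corpus uses other tags.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X", "."}


def has_pos_tag(ngram):
    """True if any token in the n-gram carries or consists of a POS tag."""
    for token in ngram.split(" "):
        # Standalone tag such as _NOUN_
        if token.startswith("_") and token.endswith("_") and token[1:-1] in POS_TAGS:
            return True
        # Tagged word such as book_NOUN
        word, sep, tag = token.rpartition("_")
        if sep and word and tag in POS_TAGS:
            return True
    return False


# Made-up entries, for illustration only.
entries = ["livre", "livre_NOUN", "_DET_ livre", "snake_case"]
no_pos = [e for e in entries if not has_pos_tag(e)]
with_pos = [e for e in entries if has_pos_tag(e)]
print(no_pos, with_pos)
```

Checking the part after the last underscore against a fixed tag set keeps ordinary words that merely contain an underscore in the untagged group.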
To provide some motivation for why learning the most frequent words first may be a good idea when learning a language, the following graph is provided.