The goal of word.lists is to provide easy access to a handful of word-frequency lists.
These lists come primarily from the researchers Charlie Browne, Brent Culligan and Joseph Phillips, and made publicly available at https://www.newgeneralservicelist.org/ and Tom Cobb, whose site https://www.lextutor.ca/ is immensely useful for vocabulary profiling.
You can install the released version of word.lists from... CRAN with:
# no, you can't
# install.packages("word.lists")
And the development version from GitHub with:
# install.packages("devtools")
# just this one for at least a while
devtools::install_github("antdurrant/word.lists")
We’ll see the distribution of words from the NGSL and NAWL in some arbitrary academic-ish text.
A dataset containing the New Academic Word List (NAWL) & New General Service List (NGSL). Difficulty groupings have been arbitrarily set by me, as follows: Group 1: first 500 words of NGSL by frequency & “supplementary” words - months/numbers etc Group 2: next 500 words of NGSL by frequency Group 3: next 1000 words of NGSL by frequency Group 4: remaining NGSL words by frequency (about 800 words) Group 5: academic word list (about 950 words)
library(word.lists)
library(udpipe)
library(dplyr)
What does a list look like?
list_academic %>%
head() %>%
knitr::kable()
lemma | group | on_list |
---|---|---|
authority | 5 | academic |
publish | 5 | academic |
conference | 5 | academic |
aspect | 5 | academic |
client | 5 | academic |
impact | 5 | academic |
Any old text will do:
text <- "There are numerous indicators of success in any field of research.
Though one may be valid in some context, it may be rendered without utility in another."
Annotate it with udpipe, keep only some useful bits
piped_text <- udpipe(text, object = "english") %>%
select(doc_id, sentence_id, token_id, token, lemma, upos)
piped_text %>%
head() %>%
knitr::kable()
doc_id | sentence_id | token_id | token | lemma | upos |
---|---|---|---|---|---|
doc1 | 1 | 1 | There | there | PRON |
doc1 | 1 | 2 | are | be | VERB |
doc1 | 1 | 3 | numerous | numerous | ADJ |
doc1 | 1 | 4 | indicators | indicator | NOUN |
doc1 | 1 | 5 | of | of | ADP |
doc1 | 1 | 6 | success | success | NOUN |
Show the words and where they are in the list
piped_text %>%
left_join(list_academic) %>%
head() %>%
knitr::kable()
#> Joining, by = "lemma"
doc_id | sentence_id | token_id | token | lemma | upos | group | on_list |
---|---|---|---|---|---|---|---|
doc1 | 1 | 1 | There | there | PRON | 1 | general |
doc1 | 1 | 2 | are | be | VERB | 1 | general |
doc1 | 1 | 3 | numerous | numerous | ADJ | 4 | general |
doc1 | 1 | 4 | indicators | indicator | NOUN | 5 | academic |
doc1 | 1 | 5 | of | of | ADP | 1 | general |
doc1 | 1 | 6 | success | success | NOUN | 2 | general |
Order by most advanced words first
joined_text <- piped_text %>%
left_join(list_academic) %>%
filter(upos != "PUNCT") %>%
arrange(desc(group))
#> Joining, by = "lemma"
joined_text %>%
head() %>%
knitr::kable()
doc_id | sentence_id | token_id | token | lemma | upos | group | on_list |
---|---|---|---|---|---|---|---|
doc1 | 1 | 4 | indicators | indicator | NOUN | 5 | academic |
doc1 | 2 | 5 | valid | valid | ADJ | 5 | academic |
doc1 | 2 | 13 | rendered | render | VERB | 5 | academic |
doc1 | 2 | 15 | utility | utility | NOUN | 5 | academic |
doc1 | 1 | 3 | numerous | numerous | ADJ | 4 | general |
doc1 | 2 | 8 | context | context | NOUN | 3 | general |
How many in each group?
joined_text %>%
count(on_list, group) %>%
arrange(desc(group))%>%
knitr::kable()
on_list | group | n |
---|---|---|
academic | 5 | 4 |
general | 4 | 1 |
general | 3 | 1 |
general | 2 | 2 |
general | 1 | 19 |
We can see that this has a fairly high proportion of “academic” words with the majority falling under the most frequent thousand-word grouping, which is totally normal.
If I am teaching Japanese-speaking kids:
I want to get a wordlist
piped_text %>%
word.lists::get_wordlist(language = "jpn")%>%
knitr::kable()
#> Joining, by = "upos"
token_id | token | lemma | upos | pos | translation |
---|---|---|---|---|---|
1 | There | there | PRON | ||
2 | are | be | VERB | v | である || ではある || ございます || ある |
3 | numerous | numerous | ADJ | a | ぎょうさん || おびただしい |
4 | indicators | indicator | NOUN | n | 指数 || 指標 || 兆候 || 前兆 |
5 | of | of | ADP | ||
6 | success | success | NOUN | n | サクセス || 上首尾 || 成功 || ウイナー |
7 | in | in | ADP | ||
8 | any | any | DET | ||
9 | field | field | NOUN | n | 土地 || 小野 || 前線 || 征野 |
10 | of | of | ADP | ||
11 | research | research | NOUN | n | リサーチ || 研究 || 問い合わせ |
1 | Though | though | SCONJ | ||
2 | one | one | NUM | ||
3 | may | may | AUX | v | |
4 | be | be | AUX | v | である || ではある || ございます || ある |
5 | valid | valid | ADJ | a | 妥当 || 有効 |
6 | in | in | ADP | ||
7 | some | some | DET | ||
8 | context | context | NOUN | n | コンテキスト || 前後関係 |
10 | it | it | PRON | ||
11 | may | may | AUX | v | |
12 | be | be | AUX | v | である || ではある || ございます || ある |
13 | rendered | render | VERB | v | 供給+する || 提供+する || 作る || 作り出す |
14 | without | without | ADP | ||
15 | utility | utility | NOUN | n | 公益法人 || 実用性 || 有益さ || 公共サービス |
16 | in | in | ADP | ||
17 | another | another | DET |
Okay, but I know my students know the first thousand or so words
piped_text %>%
word.lists::get_wordlist(language = "jpn") %>%
left_join(list_academic) %>%
filter(upos != "PUNCT",
group > 2) %>%
select(on_list, group, token, lemma, upos, translation) %>%
knitr::kable()
#> Joining, by = "upos"
#> Joining, by = "lemma"
on_list | group | token | lemma | upos | translation |
---|---|---|---|---|---|
general | 4 | numerous | numerous | ADJ | ぎょうさん || おびただしい |
academic | 5 | indicators | indicator | NOUN | 指数 || 指標 || 兆候 || 前兆 |
academic | 5 | valid | valid | ADJ | 妥当 || 有効 |
general | 3 | context | context | NOUN | コンテキスト || 前後関係 |
academic | 5 | rendered | render | VERB | 供給+する || 提供+する || 作る || 作り出す |
academic | 5 | utility | utility | NOUN | 公益法人 || 実用性 || 有益さ || 公共サービス |
Or maybe I am teaching Finnish-speaking kids who just need academic words:
piped_text %>%
word.lists::get_wordlist(language = "eus") %>%
left_join(list_academic) %>%
filter(upos != "PUNCT",
group == 5) %>%
select(on_list, group, token, lemma, upos, translation) %>%
knitr::kable()
#> Joining, by = "upos"
#> Joining, by = "lemma"
on_list | group | token | lemma | upos | translation |
---|---|---|---|---|---|
academic | 5 | indicators | indicator | NOUN | zenbaki indize || adierazgailu || adierazle |
academic | 5 | valid | valid | ADJ | |
academic | 5 | rendered | render | VERB | bihurtu || bilakatu || eman || eragin |
academic | 5 | utility | utility | NOUN | zerbitzu publiko || baliagarritasun || erabilgarritasun || baliogarritasun |
The translations only go English > OTHER. They all come from the Open Multilingual Wordnet, which itself is an international project using the fundamental work done by the Princeton wordnet. Where there is no entry in the Open Multilingual Wordnet, the translation will come out blank. It is not perfect, but it is a heck of a lot easier than going through texts one word at a time.
The translation work runs through Python’s nltk module.
word.lists::nltk_languages %>%
knitr::kable()
wordnet | lang | synsets | words | senses | core | licence |
---|---|---|---|---|---|---|
Albanet | als | 4675 | 5988 | 9599 | 31% | CC BY 3.0 |
Arabic WordNet (AWN v2) | arb | 9916 | 17785 | 37335 | 47% | CC BY SA 3.0 |
BulTreeBank Wordnet (BTB-WN) | bul | 4959 | 6720 | 8936 | 99% | CC BY 3.0 |
Chinese Open Wordnet | cmn | 42312 | 61533 | 79809 | 100% | wordnet |
Chinese Wordnet (Taiwan) | qcn | 4913 | 3206 | 8069 | 28% | wordnet |
DanNet | dan | 4476 | 4468 | 5859 | 81% | wordnet |
Greek Wordnet | ell | 18049 | 18227 | 24106 | 57% | Apache 2.0 |
Princeton WordNet | eng | 117659 | 148730 | 206978 | 100% | wordnet |
Persian Wordnet | fas | 17759 | 17560 | 30461 | 41% | Free to use |
FinnWordNet | fin | 116763 | 129839 | 189227 | 100% | CC BY 3.0 |
WOLF (Wordnet Libre du Français) | fra | 59091 | 55373 | 102671 | 92% | CeCILL-C |
Hebrew Wordnet | heb | 5448 | 5325 | 6872 | 27% | wordnet |
Croatian Wordnet | hrv | 23120 | 29008 | 47900 | 100% | CC BY 3.0 |
IceWordNet | isl | 4951 | 11504 | 16004 | 99% | CC BY 3.0 |
MultiWordNet | ita | 35001 | 41855 | 63133 | 83% | CC BY 3.0 |
ItalWordnet | ita | 15563 | 19221 | 24135 | 48% | ODC-BY 1.0 |
Japanese Wordnet | jpn | 57184 | 91964 | 158069 | 95% | wordnet |
Multilingual Central Repository | cat | 45826 | 46531 | 70622 | 81% | CC BY 3.0 |
Multilingual Central Repository | eus | 29413 | 26240 | 48934 | 71% | CC BY 3.0 |
Multilingual Central Repository | glg | 19312 | 23124 | 27138 | 36% | CC BY 3.0 |
Multilingual Central Repository | spa | 38512 | 36681 | 57764 | 76% | CC BY 3.0 |
Wordnet Bahasa | ind | 38085 | 36954 | 106688 | 94% | MIT |
Wordnet Bahasa | zsm | 36911 | 33932 | 105028 | 96% | MIT |
Open Dutch WordNet | nld | 30177 | 43077 | 60259 | 67% | CC BY SA 4.0 |
Norwegian Wordnet | nno | 3671 | 3387 | 4762 | 66% | wordnet |
Norwegian Wordnet | nob | 4455 | 4186 | 5586 | 81% | wordnet |
plWordNet | pol | 33826 | 45387 | 52378 | 54% | wordnet |
OpenWN-PT | por | 43895 | 54071 | 74012 | 84% | CC BY-SA |
Romanian Wordnet | ron | 56026 | 49987 | 84638 | 94% | CC BY SA |
Lithuanian WordNet | lit | 9462 | 11395 | 16032 | 35% | CC BY SA 3.0 |
Slovak WordNet | slk | 18507 | 29150 | 44029 | 58% | CC BY SA 3.0 |
sloWNet | slv | 42583 | 40233 | 70947 | 86% | CC BY SA 3.0 |
Swedish (SALDO) | swe | 6796 | 5824 | 6904 | 99% | CC-BY 3.0 |
Thai Wordnet | tha | 73350 | 82504 | 95517 | 81% | wordnet |
TODO: Bring this functionality to an app