Merge pull request #51 from LuminosoInsight/version1.7
Version 1.7: update tokenization, update Wikipedia data, add languages
alin-luminoso authored Sep 8, 2017
2 parents dcef581 + 61b2e40 commit 721a1e9
Showing 81 changed files with 25,728 additions and 25,534 deletions.
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,23 @@
## Version 1.7.0 (2017-08-25)

- Tokenization will always keep Unicode graphemes together, including
  complex emoji introduced in Unicode 10 (see the sketch below)
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the
pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough
source data to be included
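As an illustration of the grapheme change, here is a minimal sketch using
`wordfreq.tokenize`; the output shown in the comment is an assumption for
illustration, not a value taken from this release:

```python
from wordfreq import tokenize

# '👨‍👩‍👧' is one grapheme built from three code points joined by
# zero-width joiners; grapheme-aware tokenization keeps it whole
# instead of splitting it into its component emoji.
print(tokenize('I saw 👨‍👩‍👧 today', 'en'))
# assumed output: ['i', 'saw', '👨‍👩‍👧', 'today']
```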


## Version 1.6.1 (2017-05-10)

- Depend on langcodes 1.4, with a new language-matching system that does not
depend on SQLite.

This prevents silly conflicts where langcodes' SQLite connection was
preventing langcodes from being used in threads.


## Version 1.6.0 (2017-01-05)

- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
163 changes: 85 additions & 78 deletions README.md
@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:

python3 setup.py install


## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.


### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

pip3 install jieba


### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.


### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.


## Usage
@@ -175,10 +113,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
 
     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
wordlist, in descending frequency order.
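For example, a minimal sketch of consuming it lazily (the particular words
you get depend on the wordlist data):

```python
from itertools import islice

from wordfreq import iter_wordlist

# iter_wordlist is a generator, so islice lets us peek at the top
# of the list without materializing the whole wordlist.
for word in islice(iter_wordlist('en'), 5):
    print(word)
```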
@@ -197,10 +135,12 @@ will select each random word from 2^n words.

 If you happen to want an easy way to get [a memorable, xkcd-style
 password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function `random_ascii_words`,
-limiting the selection to words that can be typed in ASCII.
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
 
 [xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
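As a sketch of that use case, assuming the defaults of 5 words at 12 bits
per word (which multiply out to 60 bits):

```python
from wordfreq import random_ascii_words

# Each word is drawn from the 2^12 most common ASCII-typeable words,
# so 5 words contribute 12 bits of entropy apiece, 60 bits in total.
passphrase = random_ascii_words(lang='en', nwords=5, bits_per_word=12)
print(passphrase)  # output is random by design
```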


## Sources and supported languages
@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:
 Language    Code   #  Large?    WP    Subs  News  Books Web   Twit. Redd. Misc.
 ──────────────────────────────┼────────────────────────────────────────────────
 Arabic      ar     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bosnian     bs [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
+Bengali     bn     3  -       │ Yes   -     Yes   -     -     Yes   -     -
+Bosnian     bs [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Bulgarian   bg     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Catalan     ca     4  -       │ Yes   Yes   Yes   -     -     Yes   -     -
+Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
+Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Czech       cs     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Danish      da     3  -       │ Yes   Yes   -     -     -     Yes   -     -
-German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
+Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
 English     en     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Finnish     fi     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 French      fr     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Hebrew      he     4  -       │ Yes   Yes   -     Yes   -     Yes   -     -
 Hindi       hi     3  -       │ Yes   -     -     -     -     Yes   Yes   -
-Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Hungarian   hu     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian  id     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Italian     it     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese    ja     5  Yes     │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean      ko     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
+Macedonian  mk     3  -       │ Yes   Yes   Yes   -     -     -     -     -
 Malay       ms     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Norwegian   nb [2] 4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
+Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Polish      pl     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 Portuguese  pt     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Romanian    ro     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Russian     ru     6  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
 Serbian     sr [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
+Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Swedish     sv     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
 Turkish     tr     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Ukrainian   uk     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba

[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
languages" below.

 Some languages provide 'large' wordlists, including words with a Zipf frequency
-between 1.0 and 3.0. These are available in 12 languages that are covered by
+between 1.0 and 3.0. These are available in 13 languages that are covered by
 enough data sources.
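A hedged sketch of reaching into a 'large' wordlist; the word chosen here is
an illustrative assumption, and the value you get back depends on the data:

```python
from wordfreq import zipf_frequency

# The default 'combined' list stops around Zipf 3.0; passing
# wordlist='large' extends coverage down to Zipf 1.0 in the
# languages that have a large list.
print(zipf_frequency('quixotic', 'en', wordlist='large'))
```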


@@ -314,7 +256,7 @@ into multiple tokens:
     >>> zipf_frequency('New York', 'en')
     5.35
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.56
+    3.55

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
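The half-harmonic-mean combination mentioned above is simple enough to sketch
directly; this is an illustration of the formula, not wordfreq's internal code:

```python
# Combine per-token frequencies as the reciprocal of the sum of
# reciprocals. For two tokens this is half their harmonic mean, and
# the result is always below the frequency of the rarest token.
def combined_frequency(freqs):
    # assumes every token was found, so all frequencies are positive
    return 1.0 / sum(1.0 / f for f in freqs)

# e.g. combining two tokens that each occur once per thousand words:
print(combined_frequency([1e-3, 1e-3]))  # 0.0005
```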


## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.


### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

pip3 install jieba
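Once Jieba is installed, Chinese lookups work like any other language; a
minimal sketch (the exact value depends on the data version):

```python
from wordfreq import zipf_frequency

# Looking up a Chinese word triggers Jieba-backed tokenization.
print(zipf_frequency('谢谢', 'zh'))  # "thank you"
```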


### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.


### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.
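A quick smoke test you can run after the steps above, as a sketch; if the
MeCab setup is incomplete, these calls raise the error described here rather
than returning tokens:

```python
from wordfreq import tokenize

# Both calls need MeCab plus the matching dictionary to succeed.
print(tokenize('おはようございます', 'ja'))  # Japanese: "good morning"
print(tokenize('안녕하세요', 'ko'))          # Korean: "hello"
```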

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.


## License

`wordfreq` is freely redistributable under the MIT license (see
14 changes: 14 additions & 0 deletions scripts/top_n.py
@@ -0,0 +1,14 @@
"""
A quick script to output the top N words (1000 for now) in each language.
You can send the output to a file and diff it to see changes between wordfreq
versions.
"""
import wordfreq


N = 1000


for lang in sorted(wordfreq.available_languages()):
    for word in wordfreq.top_n_list(lang, N):
        # one word per line, prefixed by its language code
        print('{}\t{}'.format(lang, word))
4 changes: 3 additions & 1 deletion tests/test.py
@@ -35,6 +35,8 @@ def test_freq_examples():
     'he': 'חחח',
     'bg': 'ахаха',
     'uk': 'хаха',
+    'bn': 'হা হা',
+    'mk': 'хаха'
 }


@@ -190,7 +192,7 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        '00 00 00 00')
+        '1 1 1 1')


@raises(ValueError)
Binary file modified wordfreq/data/combined_ar.msgpack.gz
Binary file modified wordfreq/data/combined_bg.msgpack.gz
Binary file added wordfreq/data/combined_bn.msgpack.gz
Binary file modified wordfreq/data/combined_ca.msgpack.gz
Binary file modified wordfreq/data/combined_cs.msgpack.gz
Binary file modified wordfreq/data/combined_da.msgpack.gz
Binary file modified wordfreq/data/combined_de.msgpack.gz
Binary file modified wordfreq/data/combined_el.msgpack.gz
Binary file modified wordfreq/data/combined_en.msgpack.gz
Binary file modified wordfreq/data/combined_es.msgpack.gz
Binary file modified wordfreq/data/combined_fa.msgpack.gz
Binary file modified wordfreq/data/combined_fi.msgpack.gz
Binary file modified wordfreq/data/combined_fr.msgpack.gz
Binary file modified wordfreq/data/combined_he.msgpack.gz
Binary file modified wordfreq/data/combined_hi.msgpack.gz
Binary file modified wordfreq/data/combined_hu.msgpack.gz
Binary file modified wordfreq/data/combined_id.msgpack.gz
Binary file modified wordfreq/data/combined_it.msgpack.gz
Binary file modified wordfreq/data/combined_ja.msgpack.gz
Binary file modified wordfreq/data/combined_ko.msgpack.gz
Binary file added wordfreq/data/combined_mk.msgpack.gz
Binary file modified wordfreq/data/combined_ms.msgpack.gz
Binary file modified wordfreq/data/combined_nb.msgpack.gz
Binary file modified wordfreq/data/combined_nl.msgpack.gz
Binary file modified wordfreq/data/combined_pl.msgpack.gz
Binary file modified wordfreq/data/combined_pt.msgpack.gz
Binary file modified wordfreq/data/combined_ro.msgpack.gz
Binary file modified wordfreq/data/combined_ru.msgpack.gz
Binary file modified wordfreq/data/combined_sh.msgpack.gz
Binary file modified wordfreq/data/combined_sv.msgpack.gz
Binary file modified wordfreq/data/combined_tr.msgpack.gz
Binary file modified wordfreq/data/combined_uk.msgpack.gz
Binary file modified wordfreq/data/combined_zh.msgpack.gz
