Merge pull request #51 from LuminosoInsight/version1.7
Version 1.7: update tokenization, update Wikipedia data, add languages
alin-luminoso authored Sep 8, 2017
2 parents dcef581 + 61b2e40 commit 721a1e9
Showing 81 changed files with 25,728 additions and 25,534 deletions.
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,23 @@
## Version 1.7.0 (2017-08-25)

- Tokenization will always keep Unicode graphemes together, including
  complex emoji introduced in Unicode 10 (see the sketch below)
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the
pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough
source data to be included
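As an illustration of the grapheme change, here is a minimal sketch using
`wordfreq.tokenize`; the output shown in the comment is an assumption for
illustration, not a value taken from this release:

```python
from wordfreq import tokenize

# '👨‍👩‍👧' is one grapheme built from three code points joined by
# zero-width joiners; grapheme-aware tokenization keeps it whole
# instead of splitting it into its component emoji.
print(tokenize('I saw 👨‍👩‍👧 today', 'en'))
# assumed output: ['i', 'saw', '👨‍👩‍👧', 'today']
```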


## Version 1.6.1 (2017-05-10)

- Depend on langcodes 1.4, with a new language-matching system that does not
depend on SQLite.

This prevents silly conflicts where langcodes' SQLite connection was
preventing langcodes from being used in threads.


## Version 1.6.0 (2017-01-05)

- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
163 changes: 85 additions & 78 deletions README.md
@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:

python3 setup.py install


## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.


### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

pip3 install jieba


### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.


### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.


## Usage
@@ -175,10 +113,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
 
     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
wordlist, in descending frequency order.
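For example, a minimal sketch of consuming it lazily (the particular words
you get depend on the wordlist data):

```python
from itertools import islice

from wordfreq import iter_wordlist

# iter_wordlist is a generator, so islice lets us peek at the top
# of the list without materializing the whole wordlist.
for word in islice(iter_wordlist('en'), 5):
    print(word)
```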
@@ -197,10 +135,12 @@ will select each random word from 2^n words.

 If you happen to want an easy way to get [a memorable, xkcd-style
 password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function `random_ascii_words`,
-limiting the selection to words that can be typed in ASCII.
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
 
 [xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
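As a sketch of that use case, assuming the defaults of 5 words at 12 bits
per word (which multiply out to 60 bits):

```python
from wordfreq import random_ascii_words

# Each word is drawn from the 2^12 most common ASCII-typeable words,
# so 5 words contribute 12 bits of entropy apiece, 60 bits in total.
passphrase = random_ascii_words(lang='en', nwords=5, bits_per_word=12)
print(passphrase)  # output is random by design
```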


## Sources and supported languages
@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:
 Language    Code   #  Large?    WP    Subs  News  Books Web   Twit. Redd. Misc.
 ──────────────────────────────┼────────────────────────────────────────────────
 Arabic      ar     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bosnian     bs [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
+Bengali     bn     3  -       │ Yes   -     Yes   -     -     Yes   -     -
+Bosnian     bs [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Bulgarian   bg     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Catalan     ca     4  -       │ Yes   Yes   Yes   -     -     Yes   -     -
+Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
+Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Czech       cs     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Danish      da     3  -       │ Yes   Yes   -     -     -     Yes   -     -
-German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
+Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
 English     en     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Finnish     fi     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 French      fr     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Hebrew      he     4  -       │ Yes   Yes   -     Yes   -     Yes   -     -
 Hindi       hi     3  -       │ Yes   -     -     -     -     Yes   Yes   -
-Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Hungarian   hu     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian  id     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Italian     it     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese    ja     5  Yes     │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean      ko     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
+Macedonian  mk     3  -       │ Yes   Yes   Yes   -     -     -     -     -
 Malay       ms     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Norwegian   nb [2] 4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
+Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Polish      pl     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 Portuguese  pt     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Romanian    ro     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Russian     ru     6  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
 Serbian     sr [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
+Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Swedish     sv     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
 Turkish     tr     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Ukrainian   uk     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba

[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
languages" below.

 Some languages provide 'large' wordlists, including words with a Zipf frequency
-between 1.0 and 3.0. These are available in 12 languages that are covered by
+between 1.0 and 3.0. These are available in 13 languages that are covered by
 enough data sources.
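A hedged sketch of reaching into a 'large' wordlist; the word chosen here is
an illustrative assumption, and the value you get back depends on the data:

```python
from wordfreq import zipf_frequency

# The default 'combined' list stops around Zipf 3.0; passing
# wordlist='large' extends coverage down to Zipf 1.0 in the
# languages that have a large list.
print(zipf_frequency('quixotic', 'en', wordlist='large'))
```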


@@ -314,7 +256,7 @@ into multiple tokens:
     >>> zipf_frequency('New York', 'en')
     5.35
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.56
+    3.55

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
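The half-harmonic-mean combination mentioned above is simple enough to sketch
directly; this is an illustration of the formula, not wordfreq's internal code:

```python
# Combine per-token frequencies as the reciprocal of the sum of
# reciprocals. For two tokens this is half their harmonic mean, and
# the result is always below the frequency of the rarest token.
def combined_frequency(freqs):
    # assumes every token was found, so all frequencies are positive
    return 1.0 / sum(1.0 / f for f in freqs)

# e.g. combining two tokens that each occur once per thousand words:
print(combined_frequency([1e-3, 1e-3]))  # 0.0005
```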


## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.


### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

pip3 install jieba
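Once Jieba is installed, Chinese lookups work like any other language; a
minimal sketch (the exact value depends on the data version):

```python
from wordfreq import zipf_frequency

# Looking up a Chinese word triggers Jieba-backed tokenization.
print(zipf_frequency('谢谢', 'zh'))  # "thank you"
```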


### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.


### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.
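A quick smoke test you can run after the steps above, as a sketch; if the
MeCab setup is incomplete, these calls raise the error described here rather
than returning tokens:

```python
from wordfreq import tokenize

# Both calls need MeCab plus the matching dictionary to succeed.
print(tokenize('おはようございます', 'ja'))  # Japanese: "good morning"
print(tokenize('안녕하세요', 'ko'))          # Korean: "hello"
```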

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.


## License

`wordfreq` is freely redistributable under the MIT license (see
14 changes: 14 additions & 0 deletions scripts/top_n.py
@@ -0,0 +1,14 @@
"""
A quick script to output the top N words (1000 for now) in each language.
You can send the output to a file and diff it to see changes between wordfreq
versions.
"""
import wordfreq


N = 1000


for lang in sorted(wordfreq.available_languages()):
    for word in wordfreq.top_n_list(lang, N):
        # one word per line, prefixed by its language code
        print('{}\t{}'.format(lang, word))
4 changes: 3 additions & 1 deletion tests/test.py
@@ -35,6 +35,8 @@ def test_freq_examples():
     'he': 'חחח',
     'bg': 'ахаха',
     'uk': 'хаха',
+    'bn': 'হা হা',
+    'mk': 'хаха'
 }


@@ -190,7 +192,7 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        '00 00 00 00')
+        '1 1 1 1')


@raises(ValueError)
Binary file modified wordfreq/data/combined_ar.msgpack.gz
Binary file modified wordfreq/data/combined_bg.msgpack.gz
Binary file added wordfreq/data/combined_bn.msgpack.gz
Binary file modified wordfreq/data/combined_ca.msgpack.gz
Binary file modified wordfreq/data/combined_cs.msgpack.gz
Binary file modified wordfreq/data/combined_da.msgpack.gz
Binary file modified wordfreq/data/combined_de.msgpack.gz
Binary file modified wordfreq/data/combined_el.msgpack.gz
Binary file modified wordfreq/data/combined_en.msgpack.gz
Binary file modified wordfreq/data/combined_es.msgpack.gz
Binary file modified wordfreq/data/combined_fa.msgpack.gz
Binary file modified wordfreq/data/combined_fi.msgpack.gz
Binary file modified wordfreq/data/combined_fr.msgpack.gz
Binary file modified wordfreq/data/combined_he.msgpack.gz
Binary file modified wordfreq/data/combined_hi.msgpack.gz
Binary file modified wordfreq/data/combined_hu.msgpack.gz
Binary file modified wordfreq/data/combined_id.msgpack.gz
Binary file modified wordfreq/data/combined_it.msgpack.gz
Binary file modified wordfreq/data/combined_ja.msgpack.gz
Binary file modified wordfreq/data/combined_ko.msgpack.gz
Binary file added wordfreq/data/combined_mk.msgpack.gz
Binary file modified wordfreq/data/combined_ms.msgpack.gz
Binary file modified wordfreq/data/combined_nb.msgpack.gz
Binary file modified wordfreq/data/combined_nl.msgpack.gz
Binary file modified wordfreq/data/combined_pl.msgpack.gz
Binary file modified wordfreq/data/combined_pt.msgpack.gz
Binary file modified wordfreq/data/combined_ro.msgpack.gz
Binary file modified wordfreq/data/combined_ru.msgpack.gz
Binary file modified wordfreq/data/combined_sh.msgpack.gz
Binary file modified wordfreq/data/combined_sv.msgpack.gz
Binary file modified wordfreq/data/combined_tr.msgpack.gz
Binary file modified wordfreq/data/combined_uk.msgpack.gz
Binary file modified wordfreq/data/combined_zh.msgpack.gz
