Unigram counts generated from 16 files provided by NILC - Núcleo Interinstitucional de Linguística Computacional.
Together these files contain more than 681,639,644 tokens:
- Wikipedia (pt-br) - 2016
- Google News
- SubIMDB-PT
- G1
- PNL-Br
- Literary works in the public domain
- Lacio-Web
- Portuguese e-books
- Mundo Estranho
- CHC
- Fapesp
- Textbooks
- Folhinha
- NILC subcorpus
- Para seu filho ler
- SARESP
The files are available at:
http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc
This file was created to provide unigrams for use with the word segmentation algorithm:
https://github.com/grantjenks/python-wordsegment
The scripts used to create this file are npl_word_segment.py and group_files.py.
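
The sketch below shows one possible way to plug the unigram file into python-wordsegment. It assumes a recent release of the library that exposes the Segmenter class, and it assumes the generated file is a tab-separated list of "word<TAB>count" pairs; the file name unigrams_nilc.txt is hypothetical and not defined by this repository.

# Minimal sketch: replace python-wordsegment's English unigrams with the
# NILC-derived counts. Assumptions: Segmenter API (wordsegment >= 1.3) and a
# tab-separated "word<TAB>count" file; "unigrams_nilc.txt" is a placeholder name.
from wordsegment import Segmenter

def load_unigrams(path):
    # Read a "word<TAB>count" file into a dict of float counts.
    unigrams = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, count = line.rstrip("\n").split("\t")
            unigrams[word] = float(count)
    return unigrams

segmenter = Segmenter()
segmenter.load()                    # loads the default English unigrams/bigrams
segmenter.unigrams.clear()          # drop the English unigram counts
segmenter.bigrams.clear()           # scoring then falls back to unigrams only
segmenter.unigrams.update(load_unigrams("unigrams_nilc.txt"))
segmenter.total = sum(segmenter.unigrams.values())  # corpus size used in scoring

# Note: the library's clean() step keeps only ASCII letters and digits, so
# accented Portuguese characters are dropped before segmentation.
print(segmenter.segment("ondeficaobanheiro"))

Clearing the bigram table is a design choice for this sketch: the library's default bigrams were built from English text, so leaving them in place would bias the scores away from the Portuguese unigrams.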