BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
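The log above reports a token count, a unique-word count, and a vocabulary truncated at a minimum count of 5. As a minimal sketch of that kind of counting, assuming whitespace-separated tokens, the following Python shows the idea. The function name `build_vocab` and its return shape are hypothetical and are not taken from the tool that produced the log.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_vocab(tokens: Iterable[str], min_count: int = 5) -> Tuple[int, int, Dict[str, int]]:
    """Count tokens and keep only words seen at least `min_count` times.

    Hypothetical helper for illustration; returns
    (total tokens, unique words, truncated vocabulary).
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    # Drop every word whose frequency is below the threshold.
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return total_tokens, len(counts), vocab


if __name__ == "__main__":
    # Toy input: two frequent words and one rare word.
    sample = ("kitab " * 6 + "qalam " * 5 + "nadir " * 2).split()
    total, unique, vocab = build_vocab(sample, min_count=5)
    print(f"Processed {total} tokens.")
    print(f"Counted {unique} unique words.")
    print(f"Using vocabulary of size {len(vocab)}.")
```

For a corpus of this size the tokens would have to be streamed from disk rather than held in a list, and the counter itself may need a large amount of memory.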
The Arabic corpus {1.9B words} consists of the following resources:
- ShamelaLibrary348.7z link {1.15B}
- UN Arabic corpus mirror1 mirror2 {0.37B}
- AraCorpus.tar.gz link {0.14B}
- Arabic Wikipedia Latest Articles Dump link {0.11B}
- Tashkeela-arabic-diacritized-text-utf8-0.3.zip link {0.07B}
- Arabic Tweets link {0.03B}
- watan-2004.7z link {0.01B}
More resources are listed by Ayman Eddakrouri
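As a quick sanity check, the per-resource sizes listed above can be summed and compared with the stated total of 1.9B words. The dictionary below only restates the figures from the list; the variable names are illustrative.

```python
# Word counts per resource, in billions, as given in the list above.
sizes_in_billions = {
    "ShamelaLibrary348.7z": 1.15,
    "UN Arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia Latest Articles Dump": 0.11,
    "Tashkeela-arabic-diacritized-text-utf8-0.3.zip": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

total = sum(sizes_in_billions.values())
print(f"Total: {total:.2f}B words")
```

The sum comes to roughly 1.88B, which rounds to the 1.9B quoted for the full corpus.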
After downloading the resources from the links above, run make_corpus.sh to automate the extraction, preprocessing, and formatting, and finally to generate a single-line file with the full Arabic corpus. Some of the commands used are discussed in commands.
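The last step, writing the whole corpus to a single-line file, can be sketched in Python as below. This is only an illustration under stated assumptions: it assumes the extracted resources are already plain UTF-8 text files, the helper name `merge_to_single_line` is hypothetical, and it is not a description of what make_corpus.sh actually runs.

```python
from typing import Iterable


def merge_to_single_line(input_paths: Iterable[str], output_path: str) -> int:
    """Write every word from the input files onto one line.

    Hypothetical helper for illustration. Words are separated by single
    spaces and the line ends with a newline. Returns the number of words.
    """
    word_count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            # Stream line by line so large files are never loaded whole.
            with open(path, encoding="utf-8") as src:
                for line in src:
                    words = line.split()  # collapses any run of whitespace
                    if not words:
                        continue
                    if word_count:
                        out.write(" ")
                    out.write(" ".join(words))
                    word_count += len(words)
        out.write("\n")
    return word_count
```

A call such as `merge_to_single_line(["part1.txt", "part2.txt"], "corpus.txt")` (file names are placeholders) would produce one line containing all words from both files in order.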
Because of GitHub's file size limits, the corpus files are not included in this repository.
A zipped tar of the corpus may be downloaded from archive.org. Mirrors of this file are welcome.