GitHub - tarekeldeeb/arabic_corpus: Arabic Dataset Corpus 1.75 Billion Token

BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
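
The log above reports a token count, a unique-word count, and a vocabulary truncated at a minimum count of 5. The source does not say which tool produced it, so the following is only a minimal sketch of that kind of frequency cutoff. The function name `build_vocab` and the toy sentence are hypothetical and are not taken from the repository.

```python
from collections import Counter

def build_vocab(tokens, min_count=5):
    """Count tokens and keep only the words seen at least min_count times."""
    counts = Counter(tokens)
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return counts, vocab

# Toy whitespace-tokenized text (hypothetical data, not from the real corpus).
text = "كتاب قلم كتاب كتاب قلم كتاب كتاب ورقة"
tokens = text.split()

# A low threshold is used here only because the toy text is tiny.
counts, vocab = build_vocab(tokens, min_count=2)

print(f"Processed {len(tokens)} tokens.")
print(f"Counted {len(counts)} unique words.")
print(f"Using vocabulary of size {len(vocab)}.")
```

With the figures in the log, the same cutoff keeps 1539115 of 5329509 unique words, so most distinct word forms occur fewer than 5 times.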

Build the Arabic Corpus

Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

ShamelaLibrary348.7z link {1.15B}
UN Arabic corpus mirror1 mirror2 {0.37B}
AraCorpus.tar.gz link {0.14B}
Arabic Wikipedia Latest Articles Dump link {0.11B}
Tashkeela-arabic-diacritized-text-utf8-0.3.zip link {0.07B}
Arabic Tweets link {0.03B}
watan-2004.7z link {0.01B}
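
As a quick check of the quoted total, the per-resource sizes listed above can be added up. This is a sketch only: the dictionary keys are shortened labels chosen for illustration, and the values are the rounded figures from the list.

```python
# Approximate sizes in billions of words, copied from the list above.
sizes_billion = {
    "ShamelaLibrary348.7z": 1.15,
    "UN Arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia dump": 0.11,
    "Tashkeela": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

total = sum(sizes_billion.values())
print(f"Total: {total:.2f}B words")
```

The rounded figures sum to about 1.88B, which matches the stated 1.9B words.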

More resources are listed by Ayman Eddakrouri

Parse and Process

After downloading the resources from the links above, run make_corpus.sh to automate extraction, preprocessing, and formatting, and finally to generate a single-line file containing the full Arabic corpus. Some of the commands used are discussed in commands.md.
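
To illustrate what a single-line corpus file means, here is a minimal Python sketch that merges several text files into one line separated by single spaces. It is not the logic of make_corpus.sh, which is a shell script whose contents are not shown here. The function name `merge_to_single_line` and the sample files are hypothetical.

```python
import re
import tempfile
from pathlib import Path

def merge_to_single_line(input_paths, output_path):
    """Concatenate text files into one file that holds a single line."""
    first = True
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            # errors="ignore" skips undecodable bytes; an assumption for messy inputs.
            with open(path, encoding="utf-8", errors="ignore") as src:
                for line in src:
                    # Collapse runs of whitespace (including the newline) to one space.
                    cleaned = re.sub(r"\s+", " ", line).strip()
                    if not cleaned:
                        continue  # skip blank lines
                    if not first:
                        out.write(" ")
                    out.write(cleaned)
                    first = False
        out.write("\n")  # single trailing newline ends the one line

# Small demonstration with two temporary files (hypothetical sample text).
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("السلام  عليكم\n\n", encoding="utf-8")
    b.write_text("ورحمة الله\n", encoding="utf-8")
    merged = Path(tmp) / "corpus.txt"
    merge_to_single_line([a, b], merged)
    result = merged.read_text(encoding="utf-8")

print(result)
```

Streaming line by line keeps memory use flat, which matters for inputs of this size.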

Because of GitHub's file size limits, the corpus files are not included in this repository.

Download Pre-built Arabic Corpus

A zipped tar may be downloaded from archive.org. I welcome mirroring this file.

Repository files:

README.md
commands.md
download_corpus.sh
make_corpus.sh
