tarekeldeeb/arabic_corpus

BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
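
The log above reports a count-then-truncate step: every token is counted, and words seen fewer than 5 times are dropped, which keeps 1539115 of the 5329509 unique words (roughly 29%). A minimal sketch of that idea follows. The function name `build_vocab` and the sample tokens are hypothetical and chosen only for illustration; this is not the tool that produced the log.

```python
from collections import Counter


def build_vocab(tokens, min_count=5):
    """Count tokens and keep only those seen at least `min_count` times.

    Hypothetical helper for illustration; not the tool that produced the log.
    """
    counts = Counter(tokens)
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return counts, vocab


if __name__ == "__main__":
    # Tiny made-up sample; the real corpus has about 1.75 billion tokens.
    sample = "w1 w2 w1 w3 w1 w2".split()
    counts, vocab = build_vocab(sample, min_count=2)
    print("Processed", sum(counts.values()), "tokens.")
    print("Counted", len(counts), "unique words.")
    print("Using vocabulary of size", len(vocab))
```

For a corpus of this size the tokens would need to be streamed from disk rather than held in a list; the in-memory version is kept here for brevity.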

Build the Arabic Corpus

Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

  • ShamelaLibrary348.7z link {1.15B}
  • UN Arabic corpus mirror1 mirror2 {0.37B}
  • AraCorpus.tar.gz link {0.14B}
  • Arabic Wikipedia Latest Articles Dump link {0.11B}
  • Tashkeela-arabic-diacritized-text-utf8-0.3.zip link {0.07B}
  • Arabic Tweets link {0.03B}
  • watan-2004.7z link {0.01B}

More resources are listed by Ayman Eddakrouri
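
As a quick check, the per-resource sizes listed above add up to the stated total. The sketch below only repeats the figures from the list; the dictionary keys are shortened labels used for illustration.

```python
# Word counts in billions, copied from the resource list above.
sizes = {
    "ShamelaLibrary348": 1.15,
    "UN Arabic corpus": 0.37,
    "AraCorpus": 0.14,
    "Arabic Wikipedia": 0.11,
    "Tashkeela": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004": 0.01,
}

total = sum(sizes.values())
print(f"Total: {total:.2f}B words")  # 1.88B, which rounds to the stated 1.9B
```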

Parse and Process

After downloading the resources from the links above, run make_corpus.sh to automate extraction, preprocessing, and formatting, and to generate a single-line file with the full Arabic corpus. Some of the commands used are discussed in commands.
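
To illustrate what "a single-line file" means, here is a minimal sketch that merges several plain-text files into one line of space-separated tokens. It is not the contents of make_corpus.sh: the function name `merge_to_single_line` and the file names in the usage comment are hypothetical, and the real script also handles extraction and preprocessing that are not shown here.

```python
def merge_to_single_line(paths, out_path):
    """Write the whitespace-separated tokens of all input files as one line.

    Hypothetical helper for illustration; input files are assumed to be
    UTF-8 plain text. Lines are streamed so large files never sit in memory.
    """
    first = True
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    words = line.split()  # drops newlines and extra spaces
                    if not words:
                        continue  # skip blank lines
                    if not first:
                        out.write(" ")
                    out.write(" ".join(words))
                    first = False


# Example call with placeholder file names:
# merge_to_single_line(["part1.txt", "part2.txt"], "corpus.txt")
```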

Because of GitHub's file size limits, the corpus files themselves are not included in this repository.

Download Pre-built Arabic Corpus

A zipped tar may be downloaded from archive.org. I welcome mirroring this file.

About

Arabic Dataset Corpus 1.75 Billion Token
