duaaalkhafaje/arwikiExtracts

 
 

Arabic Wikipedia Extracts

Document extracts from Arabic Wikipedia, built from the Arabic Wikipedia dumps.

Instructions

  1. Get the corpus dump:
wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2
  2. Get the tool:
git clone https://github.com/attardi/wikiextractor.git

OR

wget https://github.com/attardi/wikiextractor/raw/master/WikiExtractor.py
  3. Extract:
python arwikiExtracts/WikiExtractor.py arwiki-latest-pages-articles.xml.bz2 -o 20190920 --json
  4. Compile and compress (optional):
python json2text.py
7za -v50m a arwiki_20190920.txt.zip arwiki_20190920.txt

To unzip the split archive, extract from its first volume:

7za x arwiki_20190920.txt.zip.001
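
The contents of json2text.py are not shown in this README, so the following is only a minimal sketch of what a compile step of this kind might look like, not the repository's actual script. It assumes the usual layout of WikiExtractor's --json output (subdirectories such as AA/ holding files named wiki_00, wiki_01, ..., each line a JSON object with a "text" field); the function names iter_articles and compile_corpus are hypothetical.

```python
import json
from pathlib import Path


def iter_articles(extract_dir):
    """Yield (title, text) pairs from WikiExtractor --json output files."""
    # Assumption: files are laid out as <extract_dir>/AA/wiki_00, AA/wiki_01, ...
    for path in sorted(Path(extract_dir).glob("*/wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                doc = json.loads(line)  # one JSON object per line
                text = doc.get("text", "").strip()
                if text:  # skip articles with an empty body
                    yield doc.get("title", ""), text


def compile_corpus(extract_dir, out_path):
    """Write one article per line to out_path and return the document count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for _title, text in iter_articles(extract_dir):
            # Collapse newlines and repeated spaces so each article is one line.
            out.write(" ".join(text.split()) + "\n")
            count += 1
    return count
```

Under these assumptions, compile_corpus("20190920", "arwiki_20190920.txt") would produce the text file that the 7za command above compresses.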

License

License: CC BY-SA 4.0

This corpus was extracted with wikiextractor.

Corpus extracts from 20-09-2019

Documents   Words         Vocabulary
953,507     123,079,742   4,437,963

Most frequent words and hapax words
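
The README does not say how the document, word, and vocabulary counts were computed. As a rough sketch only, the code below assumes one document per line and plain whitespace tokenization, so its numbers may differ from the reported ones; corpus_stats and hapax_words are hypothetical helper names.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (documents, words, vocabulary, counts) for an iterable of lines."""
    counts = Counter()
    documents = 0
    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenization
        if not tokens:
            continue  # blank lines are not counted as documents
        documents += 1
        counts.update(tokens)
    words = sum(counts.values())  # total number of tokens
    return documents, words, len(counts), counts


def hapax_words(counts):
    """Return the words that occur exactly once, sorted."""
    return sorted(word for word, n in counts.items() if n == 1)
```

Passing an open file handle for the compiled text file as lines gives the three totals, counts.most_common(20) lists the most frequent words, and hapax_words(counts) lists the words seen only once.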

Corpus extracts from 20-10-2018

Corpus extracts from 20-01-2017

Corpus Information

Documents   Words   Vocabulary
459,208     83.5M   4.7M

To cite this resource:

Motaz Saad and Basem Alijla (2017). WikiDocsAligner: an off-the-shelf Wikipedia Documents Alignment Tool. In The Second Palestinian International Conference on Information and Communication Technology (PICICT 2017).
