Arabic Wikipedia Extracts

Document extracts from Arabic Wikipedia, generated from the official Arabic Wikipedia dumps.

Instructions

Get the corpus dump:

wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2

Get the tool (either clone the repository or download the single script):

git clone https://github.com/attardi/wikiextractor.git

OR

wget https://github.com/attardi/wikiextractor/raw/master/WikiExtractor.py

Extract:

python arwikiExtracts/WikiExtractor.py arwiki-latest-pages-articles.xml.bz2 -o 20190920 --json
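With --json, WikiExtractor writes files such as 20190920/AA/wiki_00, each holding one JSON object per line with fields like "id", "url", "title" and "text". The following is a minimal sketch for reading that output back; the function name iter_documents is an illustrative choice, not part of this repository, and the directory layout is assumed to be the tool's default.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(extract_dir: str) -> Iterator[dict]:
    """Yield one dict per article from a WikiExtractor --json output tree."""
    # Assumed default layout: <extract_dir>/AA/wiki_00, <extract_dir>/AA/wiki_01, ...
    for path in sorted(Path(extract_dir).glob("*/wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)
```

For example, `next(iter_documents("20190920"))["title"]` would give the title of the first extracted article.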

Compile and compress (optional):

python json2text.py
7za -v50m a arwiki_20190920.txt.zip arwiki_20190920.txt
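The contents of json2text.py are not reproduced in this README. As a rough idea of what a compile step of this kind might do, the sketch below concatenates the "text" field of every extracted article into a single UTF-8 file. The function name compile_text, the blank-line separator between articles, and the skipping of empty articles are all assumptions made for illustration, not the behaviour of the actual script.

```python
import json
from pathlib import Path


def compile_text(extract_dir: str, out_path: str) -> int:
    """Write the text of every extracted article to out_path; return the article count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumed WikiExtractor layout: <extract_dir>/AA/wiki_00, ...
        for path in sorted(Path(extract_dir).glob("*/wiki_*")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue
                    text = json.loads(line).get("text", "").strip()
                    if text:  # assumption: articles with empty text are dropped
                        out.write(text + "\n\n")
                        count += 1
    return count
```

A call such as `compile_text("20190920", "arwiki_20190920.txt")` would produce the file that the 7za command above splits into 50 MB volumes.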

To unzip:

7za x arwiki_20190920.txt.zip.001
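Where 7za is not available, the volumes can probably be rejoined by hand. The sketch below assumes that the .001, .002, ... files produced by `7za -v50m` are plain byte-wise slices of one zip archive, so that concatenating them in order restores it; this is an assumption that should be verified on the real files. The function name join_and_extract is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def join_and_extract(first_volume: str, dest_dir: str) -> list:
    """Concatenate <name>.zip.001, .002, ... into <name>.zip and extract it."""
    first = Path(first_volume)       # e.g. arwiki_20190920.txt.zip.001
    archive = first.with_suffix("")  # drops ".001" -> arwiki_20190920.txt.zip
    # Three-digit volume suffixes sort correctly as plain strings.
    volumes = sorted(first.parent.glob(archive.name + ".[0-9][0-9][0-9]"))
    with archive.open("wb") as joined:
        for volume in volumes:
            with volume.open("rb") as part:
                shutil.copyfileobj(part, joined)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Note that this writes the rejoined archive next to the volumes, so it needs roughly as much free disk space again as the volumes themselves.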

License

This corpus was extracted with wikiextractor.

Corpus extracts from 20-09-2019

documents	words	vocabulary
953,507	123,079,742	4,437,963
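The README does not say how these three figures were computed (the repository's text_info.py is not shown here). One plausible reading is: documents = number of articles, words = total whitespace-separated tokens, vocabulary = number of distinct tokens. The sketch below follows that reading; whitespace tokenization and the name corpus_stats are assumptions, so its results need not match the table exactly.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(texts: Iterable[str]) -> Tuple[int, int, int]:
    """Return (documents, words, vocabulary) for an iterable of article texts."""
    documents = 0
    counts = Counter()
    for text in texts:
        documents += 1
        counts.update(text.split())  # assumption: tokens are split on whitespace
    words = sum(counts.values())     # total tokens
    vocabulary = len(counts)         # distinct tokens
    return documents, words, vocabulary
```

For instance, `corpus_stats(["a b a", "c"])` returns `(2, 4, 3)`.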

Most frequent words and hapax words
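Hapax legomena are words that occur exactly once in the corpus. As an illustration of how both lists could be derived from a token stream, here is a small sketch using collections.Counter; the function name frequency_report and the default of ten top words are illustrative choices, and this is not necessarily how the lists in this repository were produced.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_report(
    tokens: Iterable[str], top_n: int = 10
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return the top_n most frequent tokens and the sorted list of hapaxes."""
    counts = Counter(tokens)
    most_frequent = counts.most_common(top_n)
    # A hapax is any token seen exactly once.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return most_frequent, hapaxes
```

On a corpus this size the counter holds several million entries, which should still fit in memory on a typical workstation.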

Corpus extracts from 20-10-2018

Corpus extracts from 20-01-2017

Corpus Information

documents	words	vocabulary
459,208	83.5M	4.7M

To cite this resource:

Motaz Saad and Basem Alijla (2017). WikiDocsAligner: an off-the-shelf Wikipedia Documents Alignment Tool. In The Second Palestinian International Conference on Information and Communication Technology (PICICT 2017).

Repository: duaaalkhafaje/arwikiExtracts