Python-based implementation for extracting documents from cleaned Wikipedia dumps.

mpss2019fn1/docextractor

DocExtractor

A tiny Python 3 script that extracts individual Wikipedia documents from multiple files.

Requirements

  • Python 3
  • pathvalidate (pip3 install pathvalidate)

Usage

python3 doc_extractor.py --source={PATH_TO_SOURCE_DIR} --target={PATH_TO_TARGET_DIR} --extension={FILE_EXTENSION=article}
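The following is a minimal sketch of what such an extraction can look like, not the code of doc_extractor.py itself. It assumes the cleaned dump files wrap each article in <doc id="..." url="..." title="..."> ... </doc> blocks, as WikiExtractor output does; the repository does not state the input format. The names extract_documents, DOC_PATTERN, and main are hypothetical, and the default extension of "article" is only read off the usage line above. The regex fallback for sanitize_filename is likewise an assumption, used when pathvalidate is not installed.

```python
import argparse
import re
from pathlib import Path

try:
    # pathvalidate is the dependency listed under Requirements.
    from pathvalidate import sanitize_filename
except ImportError:
    # Assumed fallback: drop characters that are invalid in common file systems.
    def sanitize_filename(name):
        return re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", name).strip()

# Assumption: one <doc ... title="..."> ... </doc> block per article.
DOC_PATTERN = re.compile(
    r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>\n?(.*?)</doc>', re.DOTALL
)


def extract_documents(source_dir, target_dir, extension="article"):
    """Write every <doc> block found under source_dir to its own file."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    # Collect the input files first so newly written files are never re-read.
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        text = path.read_text(encoding="utf-8", errors="replace")
        for title, body in DOC_PATTERN.findall(text):
            name = sanitize_filename(title) or "untitled"
            out = target / f"{name}.{extension}"
            # Documents with the same sanitized title overwrite each other.
            out.write_text(body.strip() + "\n", encoding="utf-8")
            written.append(out)
    return written


def main(argv=None):
    """Parse the three flags from the usage line and run the extraction."""
    parser = argparse.ArgumentParser(description="Split dump files into documents.")
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    parser.add_argument("--extension", default="article")
    args = parser.parse_args(argv)
    written = extract_documents(args.source, args.target, args.extension)
    print(f"Wrote {len(written)} documents to {args.target}")
```

To try the sketch from the command line, call main() from an if __name__ == "__main__": guard. Keeping the target directory outside the source directory avoids mixing extracted documents with the input files on a second run.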
