Deep Learning on the English Wikipedia Dataset

The goal of this project is to create a Question Answering bot. To get there, I have to solve various Information Retrieval tasks. In this task, I am trying to use convolutional neural networks to encode wikipedia articles into fixed size vectors.

Creating Data

The Wikipedia dump is downloaded as "enwiki-20161201-pages-articles.xml.bz2". This archive is then unzipped and parsed with a python script. (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor , https://github.com/attardi/wikiextractor).

The parser will create lots of files, each containg ~5000 lines. Before performing any machine learning, these need a little pre-processing. This is done with the Apache Spark application found in the folder "TextCleaner"

To run this Spark app, you need to have spark installed. The app can be run with the "run_spark.sh" script. To get the script to run, you need to edit the script such that it points to the path at which your Apache Spark installation is.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
TextCleaner		TextCleaner
.gitignore		.gitignore
README.md		README.md
data_set.py		data_set.py
discriminator.py		discriminator.py
document_batcher.py		document_batcher.py
encoder.py		encoder.py
model.py		model.py
splitscript.py		splitscript.py
train_data_batcher.py		train_data_batcher.py
visualize.py		visualize.py
word_dict.py		word_dict.py
word_dict_5.pickle		word_dict_5.pickle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Deep Learning on the English Wikipedia Dataset

Creating Data

About

Uh oh!

Releases

Packages

Languages

bnoreus/wiki_deep_learning

Folders and files

Latest commit

History

Repository files navigation

Deep Learning on the English Wikipedia Dataset

Creating Data

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages