- Take N classes of Wikipedia articles (laziest way possible).
- For each class: 1000 articles.
- Create an autoencoder to compress the articles.
- Perform classification with a typical classifier.
- Discussion:
    - Compare to a classification on plain text.
    - Compare to PCA.
- Crawling Wikipedia:
    - MediaWiki API, queried with the `requests` Python package.
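A minimal sketch of the crawl step, assuming the MediaWiki `categorymembers` list and `extracts` prop are used to collect titles and plain text per class. The category name `"Physics"` used in examples is illustrative; the plan does not name its N classes.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_params(category, limit=500):
    """Query parameters listing article titles in one category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": 0,      # namespace 0 = articles only
        "cmlimit": limit,      # 500 is the per-request maximum
        "format": "json",
    }

def extract_params(title):
    """Query parameters fetching one article as plain text."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }

def fetch_category_titles(category, limit=500):
    r = requests.get(API_URL, params=category_params(category, limit), timeout=10)
    r.raise_for_status()
    return [m["title"] for m in r.json()["query"]["categorymembers"]]

def fetch_plaintext(title):
    r = requests.get(API_URL, params=extract_params(title), timeout=10)
    r.raise_for_status()
    pages = r.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```

Collecting 1000 articles per class would require paginating with the `cmcontinue` token the API returns when a category has more members than one response holds.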
- Word / Article Representation:
    - Stopword removal using `nltk`.
    - Co-occurrence probability with a default context size of 6 for each token.
    - Article representation as a simple sum of the co-occurrence probabilities of its tokens.
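The representation above can be sketched as follows. "Co-occurrence probability" is read here as each token's row of window counts normalized to sum to 1; the real pipeline would take the stopword list from `nltk.corpus.stopwords`, replaced below by a tiny stand-in set to keep the sketch self-contained.

```python
import numpy as np

# Real pipeline: from nltk.corpus import stopwords; STOP = set(stopwords.words("english"))
STOP = {"the", "a", "an", "of", "and", "is"}  # tiny stand-in list

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOP]

def cooccurrence_probs(tokens, vocab, window=6):
    """Row-normalized co-occurrence counts within a symmetric window."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, tok in enumerate(tokens):
        if tok not in index:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in index:
                counts[index[tok], index[tokens[j]]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0   # avoid division by zero for unseen rows
    return counts / row_sums        # each nonzero row sums to 1

def article_vector(tokens, probs, vocab):
    """Article = simple sum of the co-occurrence rows of its tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = np.zeros(len(vocab))
    for tok in tokens:
        if tok in index:
            vec += probs[index[tok]]
    return vec
```

With this reading, every article becomes a fixed-length vector of vocabulary size, which is exactly the dense input the autoencoder and PCA steps below expect.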
- Compression:
    - AutoEncoder: a simple 4-layer NN (implemented with `tensorflow`).
    - PCA: `sklearn` implementation of PCA.
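One way the 4-layer autoencoder could look in `tensorflow`. The plan only says "a simple 4-layer NN", so the 2-encode/2-decode split, the hidden width of 128, and the code dimension are assumptions.

```python
import tensorflow as tf

def build_autoencoder(input_dim, code_dim=32):
    """Symmetric 4-layer autoencoder: 2 encoding + 2 decoding Dense layers."""
    inputs = tf.keras.Input(shape=(input_dim,))
    h = tf.keras.layers.Dense(128, activation="relu")(inputs)       # layer 1
    code = tf.keras.layers.Dense(code_dim, activation="relu")(h)    # layer 2: bottleneck
    h = tf.keras.layers.Dense(128, activation="relu")(code)         # layer 3
    outputs = tf.keras.layers.Dense(input_dim, activation="linear")(h)  # layer 4
    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, code)  # reuse the front half to emit codes
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# The PCA baseline would compress to the same dimensionality for a fair comparison:
#   from sklearn.decomposition import PCA
#   codes_pca = PCA(n_components=32).fit_transform(X)
```

Training would call `autoencoder.fit(X, X, ...)` (input reconstructs itself), then `encoder.predict(X)` yields the compressed article vectors fed to the classifiers.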
- Classification:
    - LogisticRegression with an `sklearn` pipeline.
    - RandomForest with an `sklearn` pipeline.
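The two classifiers above, each wrapped in an `sklearn` pipeline. The scaling step, hyperparameters, and the random toy data standing in for the compressed article vectors are all assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_classifiers():
    """The two classifiers from the plan, each in a pipeline."""
    return {
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "rf": make_pipeline(RandomForestClassifier(n_estimators=200, random_state=0)),
    }

# Toy stand-in for the compressed article vectors and their class labels.
rng = np.random.RandomState(0)
X = rng.randn(60, 8)
y = (X[:, 0] > 0).astype(int)

scores = {name: cross_val_score(clf, X, y, cv=3).mean()
          for name, clf in make_classifiers().items()}
```

Running the same loop on plain-text features (e.g. the raw representation vectors), autoencoder codes, and PCA codes gives the three accuracy numbers the Discussion section wants to compare.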
- Visualization:
    - Visualization using `matplotlib`.
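A small sketch of the visualization step, assuming the goal is to scatter the first two dimensions of the compressed codes colored by class (the plan does not say which plot is intended).

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display required
import matplotlib.pyplot as plt
import numpy as np

def plot_codes_2d(codes, labels, path="codes.png"):
    """Scatter the first two dimensions of the compressed codes, colored by class."""
    fig, ax = plt.subplots()
    sc = ax.scatter(codes[:, 0], codes[:, 1], c=labels, cmap="tab10", s=12)
    ax.set_xlabel("code dim 0")
    ax.set_ylabel("code dim 1")
    fig.colorbar(sc, ax=ax, label="class")
    fig.savefig(path, dpi=120)
    plt.close(fig)
```

Plotting autoencoder codes and PCA codes side by side would make the compression comparison in the Discussion visual as well as numeric.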