This is my capstone project for Udacity's Machine Learning Engineer Nanodegree program, which I was enrolled in from March to September 2018, built on the Kaggle Spooky Author Identification dataset. In this project, I applied machine learning and NLP techniques to an authorship attribution problem: identifying the horror authors Edgar Allan Poe, H.P. Lovecraft, and Mary Wollstonecraft Shelley from samples of their writing. Sixty iterations of random search with 10-fold cross-validation were run for each of the following six models: an MLP with the top 20,000 most common n-gram features, an MLP with all n-gram features, a CNN with GloVe embeddings, an RNN with GloVe embeddings, a CNN with fastText embeddings, and an RNN with fastText embeddings. Using multiclass logarithmic loss (logloss) as the evaluation metric, the tuned MLP with all n-gram features achieved the lowest validation logloss and therefore performed best. I have published the results in a project report, accessible here.
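For context, logloss heavily penalizes confident but wrong class-probability predictions. A minimal sketch of computing it with scikit-learn (the label order and the probabilities here are made up for illustration, not taken from the project):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = ["EAP", "HPL", "MWS", "EAP"]  # true author labels
y_pred = np.array([                    # predicted class probabilities per sample
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
])

# Lower is better; a perfect classifier would score 0.
print(log_loss(y_true, y_pred, labels=["EAP", "HPL", "MWS"]))
```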
## Install

- Download Anaconda or Miniconda: https://conda.io/docs/user-guide/install/download.html.
- The project was executed on Ubuntu 16.04 LTS with a GeForce GTX 1080 Ti GPU to run random search on the neural network models (a quick GPU sanity check is sketched after this list). If you don't have TensorFlow installed on your system, please refer to the TensorFlow installation docs: https://www.tensorflow.org/install/.
- View the requirements for this project in the `requirements.txt` file; note that you may need different package versions based on your operating system.
- After you have Anaconda or Miniconda installed, run `conda create --name authorid python=3.6.5`.
- Run `source activate authorid`.
- Run `pip install -r requirements.txt`.
- Run `KERAS_BACKEND=tensorflow python -c "from keras import backend"`.
- Run `python -m ipykernel install --user --name authorid --display-name "authorid"`.
- If you don't already have the required input files, perform the steps in Download Input Files first.
- Run `jupyter notebook` and open one of the notebook files in the `code/` folder.
- Select `Kernel > Change kernel > authorid` to change the kernel.
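The GPU check mentioned above is not part of the project notebooks; it is just a quick way to confirm TensorFlow can see the card before launching long random-search runs:

```python
# Optional sanity check (an assumption, not part of the project code): list the
# devices TensorFlow can use; a GPU machine should show a '/device:GPU:0' entry.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name)
```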
## Download Input Files

- Create a Kaggle account and download the data from https://www.kaggle.com/c/spooky-author-identification/data (you might need to accept the competition rules before you are allowed to retrieve the data). Download the 3 files `train.zip`, `test.zip`, and `sample_submission.zip`, unzip them, and add the resulting CSV files to the `input/` folder.
- Download the GloVe `glove.840B.300d.zip` embeddings from https://nlp.stanford.edu/projects/glove/, unzip them, and add them to the `input/embeddings/` folder.
- Download the fastText `crawl-300d-2M.vec.zip` embeddings from https://fasttext.cc/docs/en/english-vectors.html, unzip them, and add them to the `input/embeddings/` folder (a sketch of reading these files follows this list).
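The notebooks consume these files from `input/embeddings/`. As a hedged sketch (not taken from the project code), text-format vector files like these can be read into a `{word: vector}` dict as follows; `load_embeddings` and its `dim` parameter are names I introduce here for illustration:

```python
import numpy as np

def load_embeddings(file_path, dim=300):
    """Read a GloVe/fastText text-format file into a {word: vector} dict."""
    embeddings = {}
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < dim + 1:
                continue  # skip the "<count> <dim>" header in fastText .vec files
            word = " ".join(parts[:-dim])  # some glove.840B tokens contain spaces
            embeddings[word] = np.asarray(parts[-dim:], dtype="float32")
    return embeddings

glove_vectors = load_embeddings("input/embeddings/glove.840B.300d.txt")
```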
## Run

- Start with exploratory data analysis (EDA): `code/eda_and_text_preprocessing.ipynb`.
- Follow along with tests 36 to 53 in `results/kaggle_spooky_author_submission_results.csv`:
  - `results/random_search_model_evaluations.txt`
  - `code/bow_mlp_34_model.ipynb`
  - `code/bow_mlp_33_model.ipynb`
  - `code/glove_cnn_model.ipynb`
  - `code/glove_rnn_model.ipynb`
  - `code/fasttext_cnn_model.ipynb`
  - `code/fasttext_rnn_model.ipynb`
- (Optional) Two scratch notebooks were used to make Kaggle submissions for some of the manual models (the expected submission layout is sketched after this list):
  - `code/test_glove_models.ipynb`
  - `code/test_fasttext_models.ipynb`
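For reference, here is a minimal sketch (not from the project code) of the submission layout the competition expects, assuming the standard `id`, `EAP`, `HPL`, `MWS` columns from `sample_submission.csv`:

```python
import numpy as np
import pandas as pd

test = pd.read_csv("input/test.csv")

# Placeholder uniform probabilities; a real submission would use the model's
# predicted class probabilities for each test sample instead.
probabilities = np.full((len(test), 3), 1.0 / 3)

submission = pd.DataFrame(probabilities, columns=["EAP", "HPL", "MWS"])
submission.insert(0, "id", test["id"])
submission.to_csv("submission.csv", index=False)
```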
## Create a New Model

- Open an existing model notebook in the `code/` folder, select `File > Make a Copy...` from the Jupyter menu, and change the name to the new model.
- Change the title in the notebook to the new model.
- Change the `MODEL_NAME` variable in cell 5 to the new model.
- Change the `EMBEDDINGS_FILE_PATH` variable in cell 5 if necessary.
- Change cell 12 ("Import model-dependent files") if necessary.
- Set the number of iterations `num_random_search_iter` to 1.
- Comment out the line `random_model_params = get_random_model_params()` and replace it with `random_model_params = {}` (see the `models/__init__.py` file for details on what the parameters should be set to); an illustrative override is sketched below.
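As an illustration only, a manual override might look like the following; the parameter keys here are hypothetical, and the authoritative list lives in `models/__init__.py`:

```python
# Hypothetical keys for illustration only; consult models/__init__.py for the
# parameters each model actually expects.
# random_model_params = get_random_model_params()  # commented out for a manual run
random_model_params = {
    "batch_size": 32,
    "dropout_rate": 0.5,
    "learning_rate": 1e-3,
}
```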