ColdFire87/sentdex-tensorflow-tutorial


TensorFlow Neural Networks

Based on sentdex's tutorial at https://www.youtube.com/watch?v=6rDWwL6irG0

Contains code for DNN, LSTM (RNN), and CNN models.

NOTES:

  • LSTM (RNN) code is buggy.
  • CNN only tested with MNIST dataset.

Prerequisites:

  • Anaconda Python distribution, which provides:
    • numpy
    • scipy
    • pandas
    • matplotlib
    • NLTK
  • tensorflow (ideally with GPU support)
  • tqdm (pip install tqdm)

Make sure you have the following folder structure (create folders if needed):

<project_dir>
    |
    |--- preprocessing
                |
                |--- large_data
                |       |
                |       |--- data
                |       |--- saved
                |       |--- temp
                |
                |--- small_data
                        |
                        |--- data
                        |--- saved
     

Only the small dataset is included in the repo.
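The folder layout above can be created with a short script (a convenience sketch, not part of the repo; run it from <project_dir>):

```python
# Sketch: create the expected folder layout under the current directory.
from pathlib import Path

FOLDERS = [
    "preprocessing/large_data/data",
    "preprocessing/large_data/saved",
    "preprocessing/large_data/temp",
    "preprocessing/small_data/data",
    "preprocessing/small_data/saved",
]

for folder in FOLDERS:
    # parents=True creates intermediate folders; exist_ok avoids errors on reruns.
    Path(folder).mkdir(parents=True, exist_ok=True)
```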

Download the large dataset (see the bottom of this README for the link) and place it at preprocessing/large_data/data/training.1600000.processed.noemoticon.csv

Usage:

To preprocess small dataset:

python3 preprocessing/create_sentiment_feature_sets_small.py

This takes as input the data in preprocessing/small_data/data and saves its output to preprocessing/small_data/saved.

To preprocess large dataset:

python3 preprocessing/create_sentiment_feature_sets_large.py

This takes as input the data in preprocessing/large_data/data and saves output to preprocessing/large_data/saved. Temporary files are stored in preprocessing/large_data/temp.

To train neural network:

python3 tf_nn.py

Use the following options in the __main__ block to specify behaviour:

SMALL_DATA = False
USE_MNIST = False

# TODO: LSTM (RNN) model is buggy (chunking part during training)
network_model = MODELS[0]  # (0 - DNN, 1 - LSTM (RNN), 2 - CNN)

Improvements:

1.

The original tutorial uses regular lists for storing large sparse matrices and serializes them either using the pickle module or as CSV files. This leads to very large files.

In one case, a generated CSV file is 19.6GB in size.

By transforming these lists into scipy sparse matrices and serializing them as zipped numpy arrays, we reduce the file size greatly.

In the previous example, the size is reduced from 19.6GB to 26.7MB.

Using sparse matrices also means all the data fits in RAM!
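A minimal sketch of the idea (the matrix shape and the file name features.npz are illustrative, not what the scripts actually use):

```python
# Sketch: store a mostly-zero bag-of-words matrix as a compressed sparse file.
import numpy as np
from scipy import sparse

# A dense bag-of-words matrix is almost entirely zeros.
dense = np.zeros((1000, 5000), dtype=np.float32)
dense[0, 42] = 1.0
dense[1, 7] = 2.0

# CSR stores only the non-zero entries; save_npz writes a zipped numpy archive.
features = sparse.csr_matrix(dense)
sparse.save_npz("features.npz", features)

# Load it back and densify only when a batch is actually needed.
loaded = sparse.load_npz("features.npz")
```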

2.

Added progress bars (using tqdm) for:

  • generating bag of words (word vectors) for large dataset.
  • training epochs
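Wrapping any iterable in tqdm is enough to get a live progress bar; for example (a hypothetical loop, not the repo's actual processing code):

```python
# Sketch: tqdm wraps an iterable and prints a progress bar to stderr.
from tqdm import tqdm

processed = 0
for item in tqdm(range(1000), desc="building bag of words"):
    processed += 1  # stand-in for the real per-sample processing
```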

Datasets from: