The aim of this project is to analyse the performance of various neural networks in identifying the word spoken by a person. The data used is the Google Speech Commands dataset: it contains ~65k audio files, each of which holds a single word spoken by a person together with a tag giving the text for that audio file. There are 30 different words in the dataset, spoken by different people; the task is thus to classify the audio files by the word spoken.
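Each clip in the Speech Commands archive lives in a directory named after the spoken word, so the tag can be read straight from the file path. A minimal sketch (the file name below is hypothetical):

```python
from pathlib import Path

# In the Speech Commands layout, the parent directory name is the label.
wav_path = Path("data/yes/0a7c2a8d_nohash_0.wav")  # hypothetical file name
label = wav_path.parent.name  # -> "yes"
```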
We ran the following neural networks to perform this task:
- LeNet
- VGG
- ResNet
- CNN-RNN
- CNN-1D
- Parallel Net (a combination of two networks trained in parallel; see the sketch after this list)
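To make the last item concrete, below is a minimal sketch of a parallel net, assuming PyTorch and 2-D spectrogram-like inputs; the class name, layer sizes, and input shape are illustrative, not the repository's actual code. Two sub-networks see the same input, are trained side by side, and their features are concatenated before the final classifier.

```python
import torch
import torch.nn as nn

class ParallelNet(nn.Module):
    def __init__(self, num_classes=30):
        super().__init__()
        # Branch A: a small 2-D CNN over the input.
        self.branch_a = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Branch B: an independent CNN with a wider receptive field.
        self.branch_b = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Classifier over the concatenated branch features.
        self.classifier = nn.Linear(16 + 16, num_classes)

    def forward(self, x):
        feats = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
        return self.classifier(feats)

model = ParallelNet()
logits = model(torch.randn(8, 1, 32, 32))  # batch of 8 dummy spectrograms -> (8, 30)
```

A single optimizer over model.parameters() updates both branches at once, which is what "trained in parallel" amounts to here.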
The output/ directory contains the training logs for these networks. The models/ directory contains the trained models for the best-performing configurations; these models can be loaded directly into memory and used for classification, for example as sketched below.
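A model saved as a whole object with torch.save(model, path) can be restored and queried like this (assuming PyTorch; the file name and input shape below are hypothetical):

```python
import torch

model = torch.load("models/cnn1d_best.pt", map_location="cpu")  # hypothetical file name
model.eval()

with torch.no_grad():
    clip = torch.randn(1, 1, 16000)  # dummy one-second clip at 16 kHz; real shape depends on the model
    predicted_class = model(clip).argmax(dim=1)
```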
Please follow these steps to train/test a model:
- mkdir bdml
- cd bdml
- git clone
- module load anaconda3/5.3.1
- conda env create -f requirements.yaml
- source activate bdml
- Download the speech data to this directory:
  wget "http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz"
- gunzip speech_commands_v0.01.tar.gz
- mkdir data; mv speech_commands_v0.01.tar data
- cd data
- tar xopf speech_commands_v0.01.tar
- cd ..
- mkdir speechdata
- cd BDML
- python create_dataset.py ../data --out_path ../speechdata
- python run.py --train_path ../speechdata/train/ --valid_path ../speechdata/valid --test_path ../speechdata/test --model CNN1D
Note: You can specify or change arguments to the run.py script, such as batch_size, model, etc. Information on the other options is in the run.py script; an example with modified arguments is shown below.
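For instance, a run with a different model and batch size might look like the following (hypothetical values; the exact flag spellings, e.g. --batch_size, should be checked against run.py):
- python run.py --train_path ../speechdata/train/ --valid_path ../speechdata/valid --test_path ../speechdata/test --model VGG --batch_size 64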
We have run this project on the NYU Prince server using a Slurm batch script:
- sbatch runbatch.s
Note: You can change the arguments in the runbatch.s script to run with various network configurations.