radurevutchi/sound_phrase_classifier

sound_phrase_classifier

Classifies sound phrases from large scale corpora using NLP, POS tagging, Word Embeddings, and SVMs.

Description

This project replicates the experiments in Section 2 of the paper "Discovering sound concepts and acoustic relations in text", available on IEEE Xplore.

The project processes large-scale text corpora and uses regular expressions and POS tagging to extract candidate sound phrases. I then manually labeled around 3000 of the extracted phrases as sound or non-sound. The resulting labeled set was used to train a linear SVM that classifies a phrase as a sound phrase or a non-sound phrase.
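The extraction step can be illustrated with a small sketch. The README does not list the actual regular expressions, so the "sound(s) of ..." pattern and the four-word limit below are assumptions chosen for illustration, and the POS-tagging filter is left out entirely. The function name `candidate_phrases` is hypothetical.

```python
import re

# Assumed pattern: "sound of" or "sounds of" followed by one to four words.
# This is an illustration only, not the expression used by the project.
SOUND_OF = re.compile(r"\bsounds?\s+of\s+((?:[a-z'-]+\s*){1,4})", re.IGNORECASE)


def candidate_phrases(text):
    """Return the lowercased word sequences that follow 'sound(s) of' in text."""
    return [m.group(1).strip().lower() for m in SOUND_OF.finditer(text)]


sample = "We heard the sound of breaking glass. Later came sounds of a barking dog."
print(candidate_phrases(sample))  # ['breaking glass', 'a barking dog']
```

In a fuller pipeline, each candidate would presumably be POS tagged (for example with nltk) and filtered before being labeled or classified.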

The project runs in Python 3.

Files included

train_sound_clf.py
test_sound_clf.py
run_sound_clf.py

Additional Files:
training_data (training data for train_sound_clf.py)
clf1.model (classifier model trained on word2vec 300d vectors)
sample_document (input for run_sound_clf.py when the last argument is "true")
sample_list (input for run_sound_clf.py when the last argument is "false")
results.txt (output from run_sound_clf.py when input is sample_list)

Files not included (must download)

Google's pretrained word2vec representations model (found here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit )
Stanford's GloVe pretrained vectors (found here: https://nlp.stanford.edu/projects/glove/)
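For orientation, GloVe's pretrained vectors ship as plain text, one word per line followed by its space-separated vector components. The sketch below reads that layout using only the standard library. The function name `parse_glove_lines` is hypothetical and this is not necessarily how the scripts in this repository load embeddings.

```python
def parse_glove_lines(lines):
    """Map each word to its vector, given lines shaped like 'word v1 v2 ... vn'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny made-up 3-dimensional example; real GloVe files use 50 to 300 dimensions.
demo = ["dog 0.1 0.2 0.3", "bark -0.5 0.0 0.25"]
vectors = parse_glove_lines(demo)
print(vectors["bark"])  # [-0.5, 0.0, 0.25]
```

To read a downloaded file, the same function can be given an open file handle, for example `parse_glove_lines(open(path, encoding="utf-8"))`. The Google word2vec model is distributed in a binary format instead and is typically loaded with gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`.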

Dependencies and Libraries

numpy, optunity, gensim, scikit-learn (imported as sklearn), and nltk, plus the standard-library modules pickle, sys, os, and re

How to Use

To train the sound_classifier on new data and get a saved copy of the LinearSVM model, run:
python3 train_sound_clf.py (word2vec/glove) <embeddings_filename> <training_data_filename>
This saves the classifier model to the file 'clf1.model'.
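A linear SVM needs one fixed-length feature vector per phrase. The README does not say how a multi-word phrase is turned into such a vector. A common choice, assumed here purely for illustration, is to average the embeddings of the words in the phrase. The helper name `phrase_vector` is hypothetical.

```python
def phrase_vector(phrase, vectors, dim):
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [sum(column) / len(known) for column in zip(*known)]


# Made-up 2-dimensional embeddings, only to show the arithmetic.
toy = {"barking": [1.0, 0.0], "dog": [0.0, 1.0]}
print(phrase_vector("barking dog", toy, dim=2))  # [0.5, 0.5]
```

Vectors built this way, paired with the manual sound or non-sound labels, are the kind of input a linear SVM such as scikit-learn's `LinearSVC` expects.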

To test the accuracy of the sound classifier on a list of labeled data, run:
python3 test_sound_clf.py (word2vec/glove) <embeddings_filename> <model_filename> <test_data_filename>
This will print the accuracy of the classifier on the test data.
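Accuracy here presumably means the fraction of test phrases whose predicted label matches the manual label. A minimal sketch of that calculation follows, assuming labels are encoded as 1 for sound and 0 for non-sound. Both the encoding and the function name `accuracy` are assumptions, not taken from the scripts.

```python
def accuracy(predicted, expected):
    """Return the fraction of positions where predicted equals expected."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Three of the four made-up predictions match the made-up labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```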



To run the classifier on a large text document or a list of unlabeled sounds, run:
python3 run_sound_clf.py (word2vec/glove) <embeddings_filename> <model_filename> <data_filename> (true/false)
Pass true for a large document or false for a list of sounds. The script processes the input and writes the filtered sound phrases, with their confidence scores, to results.txt.
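The five positional arguments can be checked up front before any embeddings are loaded. The sketch below mirrors the usage line above. The function name `parse_run_args`, the returned keys, the case-insensitive handling of the flag, and the embeddings filename `vectors.bin` are all assumptions for illustration, not the script's actual behavior.

```python
def parse_run_args(argv):
    """Validate arguments shaped like the run_sound_clf.py usage line."""
    if len(argv) != 5:
        raise ValueError(
            "expected: (word2vec/glove) <embeddings_filename> "
            "<model_filename> <data_filename> (true/false)"
        )
    kind, embeddings, model, data, flag = argv
    if kind not in ("word2vec", "glove"):
        raise ValueError("embedding type must be 'word2vec' or 'glove'")
    if flag.lower() not in ("true", "false"):
        raise ValueError("last argument must be 'true' or 'false'")
    return {
        "embedding_type": kind,
        "embeddings_file": embeddings,
        "model_file": model,
        "data_file": data,
        # True means the data file is a large document, False a list of sounds.
        "is_document": flag.lower() == "true",
    }


# 'vectors.bin' is a placeholder name; the other files are listed in this README.
args = parse_run_args(["word2vec", "vectors.bin", "clf1.model", "sample_list", "false"])
print(args["is_document"])  # False
```

In a script, the same function would typically be called as `parse_run_args(sys.argv[1:])`.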



IMPORTANT: A classifier can be trained only on GloVe or word2vec embeddings. Additionally, the training data file and the list of sounds (used when the last argument of run_sound_clf.py is 'false') must match the format of the included example files.
