We try to detect sentence boundaries using deep learning. Created as part of the "Practical Applications of Multimedia Retrieval" seminar at the Hasso-Plattner-Institute, Potsdam, Germany.
We build a Python-based demo using Caffe.
##### Prerequisites
- Clone this repository
- Install Python 2.7 and the packages listed in `requirements.txt`:

  ```sh
  pip install -r requirements.txt
  ```
- Use the NLTK downloader to download the `averaged_perceptron_tagger` and `punkt` models:

  ```sh
  python -m nltk.downloader averaged_perceptron_tagger punkt
  ```
- Set up Caffe, as described here.
- Add the repository's `python` directory to your `PYTHONPATH` (a quick import check is shown after this list):

  ```sh
  export PYTHONPATH=/path/to/sentence-boundary-detection-nn/python:$PYTHONPATH
  ```
- Download the Google word vectors (`GoogleNews-vectors-negative300.bin.gz`) from here, or use this URL directly, and extract the result into the `sentence-boundary-detection-nn/python/demo_data` directory (an example extraction command is shown after this list).
- Paste your trained models into a demo data folder, for example `sentence-boundary-detection-nn/python/demo_data`, with the following structure (an example layout is sketched after this list):
  - `lexical_models`: contains all pretrained models you want to use, each in a separate directory. Each model needs a
    - `.ini`
    - `.caffemodel`
    - `net.prototxt` file.
  - `text_data`: contains all text files that can be used as prediction input.
  - `audio_models`: contains all pretrained audio models, each in a separate directory. Each model needs the same files as described for the lexical models.
  - `audio_examples`: contains all audio files that should be available during the demo, each in a separate directory containing the ctm, energy, and pitch files.
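Once Caffe is set up and the `PYTHONPATH` is extended, a quick way to check the environment is to run the imports directly; this is a minimal sketch, assuming Caffe's Python bindings (pycaffe) were built:

```sh
# Both imports should succeed without errors if caffe and nltk are installed
# and visible on the PYTHONPATH.
python -c "import caffe, nltk; print('environment ok')"
```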
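The word vector archive is a plain gzip file; a minimal extraction sketch, assuming the archive was downloaded into the `demo_data` directory:

```sh
cd sentence-boundary-detection-nn/python/demo_data
# Decompresses the archive in place, yielding GoogleNews-vectors-negative300.bin
# (several gigabytes uncompressed).
gunzip GoogleNews-vectors-negative300.bin.gz
```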
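For orientation, a `demo_data` folder following the structure above could look like this; all directory and file names below the top level are hypothetical examples:

```
demo_data/
├── GoogleNews-vectors-negative300.bin
├── lexical_models/
│   └── example_model/
│       ├── example.ini
│       ├── example.caffemodel
│       └── net.prototxt
├── text_data/
│   └── example_input.txt
├── audio_models/
│   └── example_audio_model/
│       ├── example.ini
│       ├── example.caffemodel
│       └── net.prototxt
└── audio_examples/
    └── example_talk/
        ├── example_talk.ctm
        ├── example_talk.energy
        └── example_talk.pitch
```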
##### Start up
Change into the repository directory and execute the following command. This should work right out of the box, unless you are using a custom `demo_data` folder:

```sh
python web_demo/web.py
```
Optionally, you can specify the locations of the word vectors and the demo data; otherwise, default values are used. For further information, execute:

```sh
python web_demo/web.py -h
```