DEEP Open Catalogue: Speech to Text

Author: Lara Lloret Iglesias (CSIC)

Project: This work is part of the DEEP Hybrid-DataCloud project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 777435.

This is a plug-and-play tool to train and evaluate a speech to text classifier using deep neural networks.

You can find more information about it in the AI4OS Hub.

Table of contents

  1. Installing this module
    1. Local installation
    2. Docker installation
  2. Train a speech classifier
    1. Data preprocessing
    2. Train the classifier
  3. Predict
  4. Acknowledgements

Installing this module

Local installation

Requirements

This project has been tested on Ubuntu 18.04 with Python 3.6.5. Further package requirements are described in the requirements.txt file.

To start using this framework clone the repo:

git clone https://github.com/ai4os-hub/ai4os-speech-to-text-tf
cd ai4os-speech-to-text-tf
pip install -e .

Now run DEEPaaS:

deepaas-run --listen-ip 0.0.0.0

Then open http://0.0.0.0:5000/ui and look for the methods belonging to the speechclas module.
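
If you prefer to call the API programmatically instead of through the web UI, a minimal sketch along the following lines should work, assuming the standard DEEPaaS V2 routes (check the Swagger UI at /ui for the exact paths exposed by your deployment):

import requests

BASE = "http://0.0.0.0:5000"

# List the models exposed by this DEEPaaS instance; the speechclas module
# should appear among them (route assumed from the DEEPaaS V2 API).
resp = requests.get(f"{BASE}/v2/models/")
resp.raise_for_status()
print(resp.json())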

Docker installation

We have also prepared a ready-to-use Docker container to run this module. To run it:

docker search ai4oshub
docker run -ti -p 5000:5000 -p 6006:6006 -p 8888:8888 ai4oshub/ai4os-speech-to-text-tf

Now open http://0.0.0.0:5000/ui and look for the methods belonging to the speechclas module.

Train a speech classifier

Data preprocessing

The first step to train your speech to text neural network is to put your .wav files into folders. The name of each folder should correspond to the label for those particular audio files.
Put your audio files in the ./data/dataset_files folder.

Alternatively, you can provide a URL pointing to a tar.gz file containing all the folders with the training files. This will automatically download the tar.gz, read the labels, and get everything ready to start the training.
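
For example, a dataset with the two hypothetical labels yes and no (the label names are only illustrative) would be laid out as:

data/dataset_files/
    yes/
        recording_001.wav
        recording_002.wav
    no/
        recording_001.wav
        recording_002.wav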

Train the classifier

Go to http://0.0.0.0:5000/ui and look for the TRAIN POST method. Click on 'Try it out', change whatever training args you want and click 'Execute'. The training will be launched and you will be able to follow its status by executing the TRAIN GET method, which also returns a history of all previously executed trainings.
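
The same training can also be launched programmatically. The sketch below assumes the DEEPaaS V2 train routes and uses speechclas as the model name (taken from the module name mentioned above); verify the exact route, model name and training arguments in the Swagger UI:

import requests

BASE = "http://0.0.0.0:5000"
MODEL = "speechclas"  # assumed model name; check it in the Swagger UI

# Launch a training run; training arguments would be passed as parameters
# of this request, mirroring the fields of the TRAIN POST method.
launch = requests.post(f"{BASE}/v2/models/{MODEL}/train/")
launch.raise_for_status()

# Follow the status of the run and the history of previous trainings.
status = requests.get(f"{BASE}/v2/models/{MODEL}/train/")
print(status.json())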

If the module has some sort of training monitoring configured (such as TensorBoard), you will be able to follow it at http://0.0.0.0:6006.

After training you can check the training statistics and the logs, which contain the standard output produced during training together with the confusion matrix computed once training has finished.

Since models of this kind are usually used in mobile phone applications, the training generates the model in .pb format so that it can easily be used to perform inference from a mobile phone app.
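
As an illustration of how such a frozen graph could be loaded for inference outside of DEEPaaS, the sketch below uses the TensorFlow 1.x compatibility API; the file path and the input/output tensor names are hypothetical placeholders and must be replaced by those of your exported model:

import tensorflow as tf

# Load a frozen graph exported in .pb format (the path is a placeholder).
with tf.compat.v1.gfile.GFile("model.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name="")

# Run inference; the tensor names below are placeholders, not necessarily
# the ones used by this module.
with tf.compat.v1.Session(graph=graph) as sess:
    with open("example.wav", "rb") as wav_file:
        wav_data = wav_file.read()
    predictions = sess.run("labels_softmax:0",
                           feed_dict={"wav_data:0": wav_data})
    print(predictions)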

Predict

Note

This module does not come with a pretrained classifier, so you will have to train one first before being able to use the testing methods.

Go to http://0.0.0.0:5000/ui and look for the PREDICT POST method. Click on 'Try it out', change whatever test args you want and click 'Execute'. You can supply either of the following (see the example call after this list):

  • a data argument with a path pointing to a wav file,

OR

  • a url argument with a URL pointing to a wav file. Here is an example of such a URL that you can use for testing purposes.
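
A programmatic call equivalent to the PREDICT POST method could look like the sketch below, covering both options; as before, the route, model name and field names are assumptions based on the DEEPaaS V2 API and should be verified in the Swagger UI:

import requests

BASE = "http://0.0.0.0:5000"
MODEL = "speechclas"  # assumed model name; check it in the Swagger UI

# Option 1: upload a local wav file through the data argument.
with open("example.wav", "rb") as f:
    resp = requests.post(f"{BASE}/v2/models/{MODEL}/predict/",
                         files={"data": f})
print(resp.json())

# Option 2: point the url argument at a wav file hosted somewhere
# (the URL below is just a placeholder).
resp = requests.post(f"{BASE}/v2/models/{MODEL}/predict/",
                     params={"url": "https://example.com/sample.wav"})
print(resp.json())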

Acknowledgements

The network architecture is based on one of the tutorials provided by TensorFlow. The architecture used in this tutorial is based on some of those described in the paper Convolutional Neural Networks for Small-footprint Keyword Spotting. It was chosen because it is comparatively simple, quick to train, and easy to understand, rather than being state of the art. There are lots of different approaches to building neural network models to work with audio, including recurrent networks or dilated (atrous) convolutions. This tutorial is based on the kind of convolutional network that will feel very familiar to anyone who has worked with image recognition.

That may seem surprising at first, since audio is inherently a one-dimensional continuous signal across time, not a 2D spatial problem. We define a window of time we believe our spoken words should fit into, and convert the audio signal in that window into an image. This is done by grouping the incoming audio samples into short segments, just a few milliseconds long, and calculating the strength of the frequencies across a set of bands. Each set of frequency strengths from a segment is treated as a vector of numbers, and those vectors are arranged in time order to form a two-dimensional array. This array of values can then be treated like a single-channel image, and is known as a spectrogram.
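
As a rough illustration of this preprocessing (not the exact parameters used by the module), the following sketch computes such a spectrogram from a mono wav file with scipy:

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Read a wav file (the path is illustrative) and compute its spectrogram:
# short overlapping windows, with the frequency strengths of each window
# arranged in time order to form a 2D array.
sample_rate, samples = wavfile.read("example.wav")
frequencies, times, Sxx = spectrogram(samples, fs=sample_rate,
                                      nperseg=480, noverlap=160)

# Log-scale the magnitudes so quieter frequency bands remain visible; the
# result can be treated like a single-channel image.
log_spectrogram = np.log(Sxx + 1e-10)
print(log_spectrogram.shape)  # (frequency bands, time steps)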

If you consider this project to be useful, please consider citing the DEEP Hybrid DataCloud project:

García, Álvaro López, et al. "A Cloud-Based Framework for Machine Learning Workloads and Applications." IEEE Access 8 (2020): 18681-18692.
