The idea was to create a Speech-to-Sign converter, which could in some measure ease conversation with people who have speech or hearing impairments. Since direct conversion seemed a bit of a stretch, the plan was to break the overall process into two parts:
- Speech to Text
- Text to Sign
While creating Speech to Text seemed fairly easy with the PyAudio library and the various services available from tech giants such as Google, what intrigued me here was seeing how exactly the signals are processed. So the project swiftly changed direction: I ended up using Google's Speech Command Dataset to predict 31 command keywords.
P.S. I do hope I'll get to implement the Speech-to-Sign part eventually, although a sign language dataset isn't available yet. If you do find one... let me know (;_;)
The dataset used for this project was Google's Speech Command Dataset. It contains audio samples, mostly 1 second in length, for 31 command keywords plus a few noise categories, about 65,000 audio samples in total. (Sadly, the size is too large ;_;)
Let's quickly go over how you are supposed to run the above code and how exactly the flow takes place.
If all you want is to get a prediction for an audio file, you can simply run the server.py and client.py files. Make sure to have all the package requirements preinstalled, and have the audio file ready. I'd suggest trimming the audio file to a length that contains only the spoken word and no silence. Using the rest of the code should be easy, although you might need a little knowledge of how Flask works. There's no front end for the web app, so it might not look very colorful, but rest assured, it works anyway.
- Firstly, we put all the data in our dataset directory. Note that it has one folder per category label, and each folder contains the audio samples in OGG format.
- Then we'll use our create dataset.ipynb file to load and transform the data using MFCCs. The data has the following fields: labels (the words), mappings (labels mapped to indexes), MFCCs (lists of MFCC coefficients), and files (file paths). The data is exported in JSON format (see the first sketch after this list).
- Then we'll train the model using train.ipynb. The MFCCs become the X values and the labels become the y values. The get_data function grabs the data and creates train, validation, and test sets for us, and the main function builds, trains, tests, and exports the model (see the training sketch below).
- Then we'll use the spotting keywords.ipynb file to predict the output for a given input file. The predict function first preprocesses the input data (i.e. extracts the MFCCs) and then predicts an index. We keep a dictionary of mappings from which we finally return the predicted word (see the prediction sketch below).
- server.ipynb is used next. The app route is set at '/predict' with the POST method. The predict function takes the input audio file, hands it to the keyword spotter, and returns the predicted output in JSON format using Flask's jsonify (see the server sketch below).
- client.ipynb fetches the audio data (from my GitHub in this case; you can upload a file elsewhere and use it) and calls response = requests.post with the URL set to '/predict' and files set to a dictionary containing the path, the file loaded from that path, and the format. The response is whatever the server returns (see the client sketch below).
The last two involve a very basic use of Flask for local deployment (wouldn't exactly call it deployment ;_;)
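To make the walkthrough concrete, here is a minimal sketch of the dataset-creation step described above. It assumes librosa for loading the audio and computing the MFCCs; the folder, file, and parameter names are illustrative rather than the exact ones used in the notebooks.

```python
# Sketch of the dataset-creation step: walk the category folders, extract MFCCs,
# and export everything to JSON. Paths and parameters below are assumptions.
import json
import os

import librosa

DATASET_PATH = "dataset"    # assumed root folder with one subfolder per keyword
JSON_PATH = "data.json"     # assumed output file
SAMPLES_PER_CLIP = 22050    # 1 second at librosa's default 22,050 Hz sample rate

data = {"mappings": [], "labels": [], "MFCCs": [], "files": []}

categories = sorted(
    d for d in os.listdir(DATASET_PATH)
    if os.path.isdir(os.path.join(DATASET_PATH, d))
)

for label_idx, category in enumerate(categories):
    data["mappings"].append(category)                 # index -> keyword name

    for file_name in os.listdir(os.path.join(DATASET_PATH, category)):
        file_path = os.path.join(DATASET_PATH, category, file_name)
        signal, sr = librosa.load(file_path)          # reads the OGG samples

        # skip clips shorter than 1 second, trim the rest to exactly 1 second
        if len(signal) < SAMPLES_PER_CLIP:
            continue
        signal = signal[:SAMPLES_PER_CLIP]

        # 13 coefficients per frame is a common choice for keyword spotting
        mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                     n_fft=2048, hop_length=512)

        data["labels"].append(label_idx)
        data["MFCCs"].append(mfccs.T.tolist())        # frames x coefficients
        data["files"].append(file_path)

with open(JSON_PATH, "w") as f:
    json.dump(data, f, indent=2)
```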
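The training step could then look roughly like this. The sketch assumes a small TensorFlow/Keras CNN and scikit-learn for the splits; the exact architecture, hyperparameters, and file names in train.ipynb may differ.

```python
# Sketch of the training step: MFCCs as X, label indexes as y, a small CNN, saved as .h5.
import json

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split


def get_data(json_path, test_size=0.1, validation_size=0.1):
    """Load MFCCs/labels from the exported JSON and split into train/val/test sets."""
    with open(json_path, "r") as f:
        data = json.load(f)

    X = np.array(data["MFCCs"])          # shape: (samples, frames, coefficients)
    y = np.array(data["labels"])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_size)

    # CNNs expect a channel dimension: (samples, frames, coefficients, 1)
    return (X_train[..., np.newaxis], X_val[..., np.newaxis], X_test[..., np.newaxis],
            y_train, y_val, y_test)


def build_model(input_shape, num_keywords):
    """Small illustrative CNN; the repo's real architecture may differ."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D((2, 2), padding="same"),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2), padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_keywords, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def main():
    X_train, X_val, X_test, y_train, y_val, y_test = get_data("data.json")
    model = build_model(X_train.shape[1:], num_keywords=31)
    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30, batch_size=32)
    print("test accuracy:", model.evaluate(X_test, y_test)[1])
    model.save("model.h5")


if __name__ == "__main__":
    main()
```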
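Keyword spotting on a single file then boils down to preprocess, predict, and map the index back to a word. A hedged sketch, reusing the model file and MFCC parameters assumed above; the mappings list shown here is a truncated example and would normally be read back from the exported JSON.

```python
# Sketch of the prediction step: MFCCs -> model -> argmax -> keyword.
import librosa
import numpy as np
import tensorflow as tf

SAMPLES_PER_CLIP = 22050
MODEL_PATH = "model.h5"                 # assumed model file from the training sketch

# index -> keyword mapping (truncated example; normally loaded from data.json's "mappings")
MAPPINGS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]


def preprocess(file_path, n_mfcc=13, n_fft=2048, hop_length=512):
    """Turn a ~1 second audio file into the (1, frames, coefficients, 1) tensor the model expects."""
    signal, sr = librosa.load(file_path)
    signal = signal[:SAMPLES_PER_CLIP]
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                 n_fft=n_fft, hop_length=hop_length)
    return mfccs.T[np.newaxis, ..., np.newaxis]   # add batch and channel dimensions


def predict(file_path):
    model = tf.keras.models.load_model(MODEL_PATH)
    probabilities = model.predict(preprocess(file_path))
    return MAPPINGS[int(np.argmax(probabilities))]


if __name__ == "__main__":
    print(predict("test.ogg"))          # hypothetical input file
```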
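The Flask server only needs a single '/predict' route that accepts the uploaded file, runs the keyword spotter, and returns JSON. A minimal sketch, assuming a hypothetical keyword_spotting module that exposes the predict function sketched above.

```python
# Sketch of server.py: POST an audio file to /predict and get the keyword back as JSON.
import os

from flask import Flask, jsonify, request

from keyword_spotting import predict   # hypothetical module wrapping the prediction sketch

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # save the uploaded file to a temporary path so the audio loader can read it from disk
    audio_file = request.files["file"]
    temp_path = "temp_audio.ogg"
    audio_file.save(temp_path)

    keyword = predict(temp_path)
    os.remove(temp_path)

    return jsonify({"keyword": keyword})


if __name__ == "__main__":
    app.run(debug=False)
```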
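And the client is a single requests.post call against that route. The URL and file name below are placeholders.

```python
# Sketch of client.py: upload an audio file to the local Flask server and print the response.
import requests

URL = "http://127.0.0.1:5000/predict"   # Flask's local default; replace with your host
FILE_PATH = "test.ogg"                  # hypothetical audio file

with open(FILE_PATH, "rb") as f:
    response = requests.post(URL, files={"file": (FILE_PATH, f, "audio/ogg")})

print(response.json())                  # e.g. {"keyword": "stop"}
```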
The audio samples were taken into the frequency domain, but we didn't use the classical Fourier transform. Instead, to mimic the way the human ear responds to different frequencies, we used Mel-frequency cepstral coefficients (MFCCs). You can read more about them here. The data_generation file does this job for us, and we saved all the coefficients in JSON format.
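For a concrete feel of the difference, here is a tiny, hypothetical comparison of a plain magnitude spectrum and the MFCC representation for the same one-second clip (librosa assumed; the file name is illustrative).

```python
# Plain FFT spectrum vs. MFCCs for one clip; shapes only, no plotting.
import librosa
import numpy as np

signal, sr = librosa.load("example.ogg", duration=1.0)   # hypothetical 1-second file

# classical frequency-domain view: magnitudes on a linear frequency axis
spectrum = np.abs(np.fft.rfft(signal))
print("FFT bins:", spectrum.shape)        # one value per linear frequency bin

# perceptually motivated view: mel filterbank + log + DCT = MFCCs
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print("MFCC matrix:", mfccs.shape)        # 13 coefficients per short frame
```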
The model was deployed as a web app using Streamlit. Though the functionality right now is limited to record-first-then-upload, a record-and-predict-on-the-spot upgrade could be added in the future. You can find the working deployment here.
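A record-first-then-upload Streamlit front end can be as small as the sketch below; it assumes the same hypothetical keyword_spotting.predict helper as the server sketch, and the file names are illustrative.

```python
# Sketch of a minimal Streamlit app: upload a clip, replay it, show the predicted keyword.
import streamlit as st

from keyword_spotting import predict   # hypothetical module wrapping the prediction sketch

st.title("Speech Command Keyword Spotting")

uploaded = st.file_uploader("Upload a ~1 second recording", type=["ogg", "wav"])

if uploaded is not None:
    st.audio(uploaded)                  # let the user replay the uploaded clip

    # write the upload to disk so the librosa-based preprocessing can read it
    temp_path = "uploaded_audio.ogg"
    with open(temp_path, "wb") as f:
        f.write(uploaded.getbuffer())

    keyword = predict(temp_path)
    st.success(f"Predicted keyword: {keyword}")
```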
Here's how you can use the web app for a demo.