-
Notifications
You must be signed in to change notification settings - Fork 0
/
ai4-metadata.yml
45 lines (40 loc) · 2.58 KB
/
ai4-metadata.yml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
metadata_version: 2.0.0
title: Speech keywords classifier
summary: Train a speech classifier to classify audio files between different keywords.
description: |-
This is a plug-and-play tool to train and evaluate a speech-to-text tool using deep neural networks.
The network architecture is based on one of the tutorials provided by Tensorflow [1].
The architecture used in this tutorial is based on some described in the paper Convolutional Neural Networks for Small-footprint Keyword Spotting [2].
There are lots of different approaches to building neural network models to work with audio including recurrent networks or dilated convolutions.
This model is based on the kind of convolutional network that will feel very familiar to anyone who's worked with image recognition.
That may seem surprising at first though, since audio is inherently a one-dimensional continuous signal across time, not a 2D spatial problem.
We define a window of time we believe our spoken words should fit into, and converting the audio signal in that window into an image.
This is done by grouping the incoming audio samples into short segments, just a few milliseconds long, and calculating the strength of the frequencies across a set of bands.
Each set of frequency strengths from a segment is treated as a vector of numbers, and those vectors are arranged in time order to form a two-dimensional array.
This array of values can then be treated like a single-channel image and is known as a spectrogram. These spectrograms will be the input for the training.
The container does not come with any pretrained model, it has to be trained first on a dataset to be used for prediction.
The PREDICT method expects an audio files as input (or the url of an audio file) and will return a JSON with the top 5 predictions.
<img class='fit', src='https://raw.githubusercontent.com/ai4os-hub/ai4os-speech-to-text-tf/main/reports/figures/speech-to-text.png'/>
**References**
1. https://www.tensorflow.org/tutorials/sequences/audio_recognition
2. https://static.googleusercontent.com/media/research.google.com/es/pubs/archive/43969.pdf
dates:
created: '2019-07-31'
updated: '2024-08-12'
links:
source_code: https://github.com/ai4os-hub/ai4os-speech-to-text-tf
docker_image: ai4oshub/ai4os-speech-to-text-tf
ai4_template: ai4-template/1.9.9
citation: https://static.googleusercontent.com/media/research.google.com/es//pubs/archive/43969.pdf
tags:
- deep learning
- general purpose
tasks:
- Natural Language Processing
categories:
- AI4 trainable
- AI4 inference
libraries:
- TensorFlow
data-type:
- Audio