The Speech Command Recognizer is a JavaScript module that enables recognition of spoken commands comprised of simple isolated English words from a small vocabulary. The default vocabulary includes the following words: the ten digits from "zero" to "nine", "up", "down", "left", "right", "go", "stop", "yes", "no", as well as the additional categories of "unknown word" and "background noise".
It uses the web browser's WebAudio API. It is built on top of TensorFlow.js and can perform inference and transfer learning entirely in the browser, using WebGL GPU acceleration.
The underlying deep neural network has been trained using the TensorFlow Speech Commands Dataset.
For more details on the data set, see:
Warden, P. (2018) "Speech commands: A dataset for limited-vocabulary speech recognition" https://arxiv.org/pdf/1804.03209.pdf
A speech command recognizer can be used in two ways:
- Online streaming recognition, during which the library automatically
opens an audio input channel using the browser's
getUserMedia
and WebAudio APIs (requesting permission from user) and performs real-time recognition on the audio input. - Offline recognition, in which you provide a pre-constructed TensorFlow.js
Tensor object or a
Float32Array
and the recognizer will return the recognition results.
To use the speech-command recognizer, first create a recognizer instance,
then start the streaming recognition by calling its listen()
method.
const tf = require('@tensorflow/tfjs');
const speechCommands = require('@tensorflow-models/speech-commands');
// When calling `create()`, you must provide the type of the audio input.
// The two available options are `BROWSER_FFT` and `SOFT_FFT`.
// - BROWSER_FFT uses the browser's native Fourier transform.
// - SOFT_FFT uses JavaScript implementations of Fourier transform
// (not implemented yet).
const recognizer = speechCommands.create('BROWSER_FFT');
// Make sure that the underlying model and metadata are loaded via HTTPS
// requests.
await recognizer.ensureModelLoaded();
// See the array of words that the recognizer is trained to recognize.
console.log(recognizer.wordLabels());
// `listen()` takes two arguments:
// 1. A callback function that is invoked anytime a word is recognized.
// 2. A configuration object with adjustable fields such a
// - includeSpectrogram
// - probabilityThreshold
// - includeEmbedding
recognizer.listen(result => {
// - result.scores contains the probability scores that correspond to
// recognizer.wordLabels().
// - result.spectrogram contains the spectrogram of the recognized word.
}, {
includeSpectrogram: true,
probabilityThreshold: 0.75
});
// Stop the recognition in 10 seconds.
setTimeout(() => recognizer.stopListening(), 10e3);
When calling speechCommands.create()
, you can specify the vocabulary
the loaded model will be able to recognize. This is specified as the second,
optional argument to speechCommands.create()
. For example:
const recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');
Currently, the supported vocabularies are:
- '18w' (default): The 20 item vocaulbary, consisting of: 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'up', 'down', 'left', 'right', 'go', 'stop', 'yes', and 'no', in addition to 'background_noise' and 'unknown'.
- 'directional4w': The four directional words: 'up', 'down', 'left', and 'right', in addition to 'background_noise' and 'unknown'.
'18w' is the default vocabulary.
As the example above shows, you can specify optional parameters when calling
listen()
. The supported parameters are:
overlapFactor
: Controls how often the recognizer performs prediction on spectrograms. Must be >=0 and <1 (default: 0.5). For example, if each spectrogram is 1000 ms long andoverlapFactor
is set to 0.25, the prediction will happen every 250 ms.includeSpectrogram
: Let the callback function be invoked with the spectrogram data included in the argument. Default:false
.probabilityThreshold
: The callback function will be invoked if and only if the maximum probability score of all the words is greater than this threshold. Default:0
.invokeCallbackOnNoiseAndUnknown
: Whether the callback function will be invoked if the "word" with the maximum probability score is the "unknown" or "background noise" token. Default:false
.includeEmbedding
: Whether an internal activation from the underlying model will be included in the callback argument, in addition to the probability scores. Note: if this field is set astrue
, the value ofinvokeCallbackOnNoiseAndUnknown
will be overridden totrue
and the value ofprobabilityThreshold
will be overridden to0
.
To perform offline recognition, you need to have obtained the spectrogram
of an audio snippet through a certain means, e.g., by loading the data
from a .wav file or synthesizing the spectrogram programmatically.
Assuming you have the spectrogram stored in an Array of numbers or
a Float32Array, you can create a tf.Tensor
object. Note that the
shape of the Tensor must match the expectation of the recognizer instance.
E.g.,
const tf = require('@tensorflow/tfjs');
const speechCommands = require('@tensorflow-models/speech-commands');
const recognizer = speechCommands.create('BROWSER_FFT');
// Inspect the input shape of the recognizer's underlying tf.Model.
console.log(recognizer.modelInputShape());
// You will get something like [null, 43, 232, 1].
// - The first dimension (null) is an undetermined batch dimension.
// - The second dimension (e.g., 43) is the number of audio frames.
// - The third dimension (e.g., 232) is the number of frequency data points in
// every frame (i.e., column) of the spectrogram
// - The last dimension (e.g., 1) is fixed at 1. This follows the convention of
// convolutional neural networks in TensorFlow.js and Keras.
// Inspect the sampling frequency and FFT size:
console.log(recognizer.params().sampleRateHz);
console.log(recognizer.params().fftSize);
const x = tf.tensor4d(
mySpectrogramData, [1].concat(recognizer.modelInputShape().slice(1)));
const output = await recognizer.recognize(x);
// output has the same format as `result` in the online streaming example
// above: the `scores` field contains the probabilities of the words.
tf.dispose([x, output]);
Note that you must provide a spectrogram value to the recognize()
call
in order to perform the offline recognition. If recognize()
is called
without a first argument, it will perform one-shot online recognition
by collecting a frame of audio via WebAudio.
By default, a recognizer object will load the underlying
tf.Model via HTTP requests to a centralized location, when its
listen()
or recognize()
method is called the first time.
You can pre-load the model to reduce the latency of the first calls
to these methods. To do that, use the ensureModelLoaded()
method of the
recognizer object. The ensureModelLoaded()
method also "warms up" model after
the model is loaded. "Warm up" means running a few dummy examples through the
model for inference to make sure that the necessary states are set up, so that
subsequent inferences can be fast.
Transfer learning is the process of taking a model trained previously on a dataset (say dataset A) and applying it on a different dataset (say dataset B). To achieve transfer learning, the model needs to be slightly modified and re-trained on dataset B. However, thanks to the training on the original dataset (A), the training on the new dataset (B) takes much less time and computational resource, in addition to requiring a much smaller amount of data than the original training data. The modification process involves removing the top (output) dense layer of the original model and keeping the "base" of the model. Due to its previous training, the base can be used as a good feature extractor for any data similar to the original training data. The removed dense layer is replaced with a new dense layer configured specifically for the new dataset.
The speech-command model is a model suitable for transfer learning on previously unseen spoken words. The original model has been trained on a relatively large dataset (~50k examples from 20 classes). It can be used for transfer learning on words different from the original vocabulary. We provide an API to perform this type of transfer learning. The steps are listed in the example code snippet below
const baseRecognizer = speechCommands.create('BROWSER_FFT');
await baseRecognizer.ensureModelLoaded();
// Each instance of speech-command recognizer supports multiple
// transfer-learning models, each of which can be trained for a different
// new vocabulary.
// Therefore we give a name to the transfer-learning model we are about to
// train ('colors' in this case).
const transferRecognizer = baseRecognizer.createTransfer('colors');
// Call `collectExample()` to collect a number of audio examples
// via WebAudio.
await transferRecognizer.collectExample('red');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('red');
// Don't forget to collect some background-noise examples, so that the
// transfer-learned model will be able to detect moments of silence.
await transferRecognizer.collectExample('_background_noise_');
await transferRecognizer.collectExample('green');
await transferRecognizer.collectExample('blue');
await transferRecognizer.collectExample('_background_noise_');
// ... You would typically want to put `collectExample`
// in the callback of a UI button to allow the user to collect
// any desired number of examples in random order.
// You can check the counts of examples for different words that have been
// collect for this transfer-learning model.
console.log(transferRecognizer.countExamples());
// e.g., {'red': 2, 'green': 2', 'blue': 2, '_background_noise': 2};
// Start training of the transfer-learning model.
// You can specify `epochs` (number of training epochs) and `callback`
// (the Model.fit callback to use during training), among other configuration
// fields.
await transferRecognizer.train({
epochs: 25,
callback: {
onEpochEnd: async (epoch, logs) => {
console.log(`Epoch ${epoch}: loss=${logs.loss}, accuracy=${logs.acc}`);
}
}
});
// After the transfer learning completes, you can start online streaming
// recognition using the new model.
await transferRecognizer.listen(result => {
// - result.scores contains the scores for the new vocabulary, which
// can be checked with:
const words = transferRecognizer.wordLabels();
// `result.scores` contains the scores for the new words, not the original
// words.
for (let i = 0; i < words.length; ++i) {
console.log(`score for word '${words[i]}' = ${result.scores[i]}`);
}
}, {probabilityThreshold: 0.75});
// Stop the recognition in 10 seconds.
setTimeout(() => transferRecognizer.stopListening(), 10e3);
Once examples has been collected with a transfer recognizer,
you can export the examples in serialized form with the serielizedExamples()
method, e.g.,
const serialized = transferRecognizer.serializeExamples();
serialized
is a binary ArrayBuffer
amenable to storage and transmission.
It contains the spectrogram data of the examples, as well as metadata such
as word labels.
You can also serialize the examples from a subset of the words in the transfer recognizer's vocabulary, e.g.,
const serializedWithOnlyFoo = transferRecognizer.serializeExamples('foo');
// Or
const serializedWithOnlyFooAndBar = transferRecognizer.serializeExamples(['foo', 'bar']);
The serialized examples can later be loaded into another instance of
transfer recognizer with the loadExamples()
method, e.g.,
const clearExisting = false;
newTransferRecognizer.loadExamples(serialized, clearExisting);
Theo clearExisting
flag ensures that the examples that newTransferRecognizer
already holds are preserved. If true
, the existing exampels will be cleared.
If clearExisting
is not specified, it'll default to false
.
A developer-oriented live demo is available at this address.
The demo/ folder contains a live demo of the speech-command recognizer. To run it, do
cd speech-commands
yarn
yarn publish-local
cd demo
yarn
yarn link-local
yarn watch