Merge pull request #51 from ReadAlongs/async_considered_harmful
Make the API less asynchronous
dhdaines authored Dec 19, 2022
2 parents 3ba551c + 81867df commit 4a003fd
Showing 8 changed files with 483 additions and 488 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
include(CheckSymbolExists)
include(CheckLibraryExists)
include(TestBigEndian)

project(soundswallower VERSION 0.5.0
DESCRIPTION "An even smaller speech recognizer")

if(CMAKE_PROJECT_NAME STREQUAL PROJECT_NAME)
143 changes: 62 additions & 81 deletions js/README.md
# SoundSwallower: an even smaller speech recognizer

> "Time and change have a voice; eternity is silent. The human ear is
> always searching for one or the other."<br>
> Leena Krohn, _Datura, or a delusion we all see_

SoundSwallower is a refactored version of PocketSphinx intended for embedding in
web applications. The goal is not to provide a fast implementation of
large-vocabulary continuous speech recognition, but rather to provide a _small_
implementation of simple, useful speech technologies.

With that in mind, the current version is limited to finite-state
grammar recognition.

## Installation

SoundSwallower can be installed in your NPM project:

    # From the Internets
    npm install soundswallower

You can also build and install it from source, provided you have
Emscripten and CMake installed:

Look at the [SoundSwallower-Demo
repository](https://github.com/dhdaines/soundswallower-demo) for an
example.

## Basic Usage

The entire package is contained within a module compiled by
Emscripten. The NPM package includes only the compiled code, but you
can rebuild it yourself using [the full source code from
GitHub](https://github.com/ReadAlongs/SoundSwallower) which also
includes C and Python implementations.
that returns a promise that is fulfilled with the actual module once
the WASM code is fully loaded:

```js
const ssjs = await require("soundswallower")();
```

Once you figure out how to get the module, you can try to initialize
the recognizer and recognize some speech.

Great, so let's initialize the recognizer. This possibly involves some long I/O
operations, so it's asynchronous. We follow the construct-then-initialize
pattern. You can use `Promise`s too, of course.

```js
let decoder = new ssjs.Decoder({
  loglevel: "INFO",
  backtrace: true,
});
await decoder.initialize();
```

The optional `loglevel` and `backtrace` options will make it a bit
more verbose, so you can be sure it's actually doing something.

The simplest use case is to recognize some text we already know, which is called
"force alignment". In this case you set this text, which must already be
preprocessed into a whitespace-separated string containing only words in the
dictionary, using `set_align_text`:

```js
decoder.set_align_text("go forward ten meters");
```
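The exact preprocessing is up to you. As a rough sketch (the helper name and
normalization rules here are ours, not part of the SoundSwallower API, and
dictionary lookup is still your responsibility), something like this gets
ordinary text into the expected whitespace-separated form:

```javascript
// Hypothetical helper, not part of the API: lowercase the text, strip
// punctuation and digits, and collapse whitespace so the result is a plain
// space-separated word string suitable for set_align_text().
function preprocessForAlignment(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z' ]+/g, " ") // drop punctuation and digits
    .replace(/\s+/g, " ")       // collapse runs of whitespace
    .trim();
}

console.log(preprocessForAlignment("Go forward, ten meters!"));
// → "go forward ten meters"
```

You would still need to verify that every resulting word exists in the
dictionary (or add it with `add_word`, see below).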

It is also possible to parse a grammar in
[JSGF](https://en.wikipedia.org/wiki/JSGF) format; see below for an
example.

Okay, let's wreck a nice beach! Record yourself saying something,
preferably the sentence "go forward ten meters", using SoX, for
example. Note that we record at 44.1kHz in 32-bit floating point
format as this is the default under JavaScript (due to WebAudio
limitations).

Now you can load it and recognize it with:

```js
let audio = await fs.readFile("goforward.raw");
decoder.start();
decoder.process(audio, false, true);
decoder.stop();
```

The results can be obtained with `get_hyp()` or in a more detailed
format with time alignments using `get_hypseg()`. These are not
asynchronous methods, as they do not depend on or change the state of
the decoder:

```js
console.log(decoder.get_hyp());
console.log(decoder.get_hypseg());
```

If you want even more detailed segmentation (phone and HMM state
level) you can use `get_alignment_json`. For more detail on this
format, see [the PocketSphinx
documentation](https://github.com/cmusphinx/pocketsphinx#usage) as it
is borrowed from there. Since this is JSON, you can create an object
from it and iterate over it:

```js
const result = JSON.parse(decoder.get_alignment_json());
for (const word of result.w) {
  console.log(`word ${word.t} at ${word.b} has duration ${word.d}`);
  for (const phone of word.w) {
    console.log(`phone ${phone.t} at ${phone.b} has duration ${phone.d}`);
  }
}
```
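To make the shape concrete, here is a small sketch over made-up alignment data
in the format described above (`b` = start time, `d` = duration, `t` = text);
the numbers are invented for illustration, not real decoder output:

```javascript
// Invented sample in the documented shape: a list of words under `w`, each
// carrying begin time `b`, duration `d`, text `t`, and its phones under `w`.
const result = {
  w: [
    {
      t: "go", b: 0.0, d: 0.25,
      w: [{ t: "G", b: 0.0, d: 0.1 }, { t: "OW", b: 0.1, d: 0.15 }],
    },
    {
      t: "forward", b: 0.25, d: 0.5,
      w: [{ t: "F", b: 0.25, d: 0.2 }, { t: "ER W ER D", b: 0.45, d: 0.3 }],
    },
  ],
};

// Total aligned duration: end of the last word minus start of the first.
const last = result.w[result.w.length - 1];
const total = last.b + last.d - result.w[0].b;
console.log(total); // → 0.75
```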

Finally, when you are done with the decoder, delete it to release its
memory, because JavaScript garbage collection of WASM objects is
awful:

```js
decoder.delete();
```

## Loading models

By default, SoundSwallower will use a not particularly good acoustic
model and a reasonable dictionary for US English. A model for French
is also available, which you can load by default by setting the
`defaultModel` property in the module object before loading:

```js
const ssjs = {
  defaultModel: "fr-fr",
};
await require("soundswallower")(ssjs);
```

The default model is expected to live under the `model/` directory
relative to the current web page (on the web) or the `soundswallower`
module (in Node.js). You can modify this by setting the `modelBase`
property in the module object when loading, e.g.:

```js
const ssjs = {
  modelBase: "/assets/models/" /* Trailing slash is necessary */,
  defaultModel: "fr-fr",
};
await require("soundswallower")(ssjs);
```

This is simply concatenated to the model name, so you should make sure
to include the trailing slash, e.g. "model/" and not "model"!

## Using grammars

We currently support JSGF for writing grammars. You can parse one
from a JavaScript string and set it in the decoder like this (a
hypothetical pizza-ordering grammar):

```js
decoder.set_jsgf(`#JSGF V1.0;
grammar pizza;
public <order> = [<greeting>] [<want>] [<quantity>] [<size>] [pizza] <toppings>;
<greeting> = hi | hello | yo | howdy;
// ...remaining rules for <want>, <quantity>, <size> and <toppings> collapsed in this diff...
`);
```

Note that all the words in the grammar must first be defined in the
dictionary. You can add custom dictionary words using the `add_word`
method on the `Decoder` object, as long as you speak ArpaBet (or
whatever phoneset the acoustic model uses). IPA and
grapheme-to-phoneme support may become possible in the near future.
If you are going to add a bunch of words, pass `false` as the third
argument for all but the last one, as this will delay the reloading of
the internal state.

```js
decoder.add_word(
  "supercalifragilisticexpialidocious",
  "S UW P ER K AE L IH F R AE JH IH L IH S T IH K EH K S P IY AE L IH D OW SH Y UH S"
);
```
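The update-flag pattern from the paragraph above can be sketched as a loop; the
stand-in `decoder` object and the example pronunciations here are hypothetical,
but the real `add_word` takes the same `(word, pronunciation, update)`
arguments:

```javascript
// Stand-in decoder that just records the update flag it was passed,
// so the batching pattern is visible without loading the real module.
const calls = [];
const decoder = {
  add_word(word, pron, update) {
    calls.push(update);
  },
};

// Hypothetical (word, ArpaBet pronunciation) pairs.
const words = [
  ["hello", "HH AH L OW"],
  ["world", "W ER L D"],
  ["again", "AH G EH N"],
];
words.forEach(([word, pron], i) => {
  // Pass false for every word except the last, so the (expensive)
  // reload of internal state happens only once.
  decoder.add_word(word, pron, i === words.length - 1);
});

console.log(calls); // → [false, false, true]
```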

## Voice activity detection / Endpointing

This is a work in progress, but it is also possible to detect the
start and end of speech in an input stream using an `Endpointer`
object. This requires you to pass buffers of a specific size, which
is understandably difficult since WebAudio also only wants to _give_
you buffers of a specific (and entirely different) size. A better
example is forthcoming, but it looks a bit like this (copied directly
from [the
documentation](https://soundswallower.readthedocs.io/en/latest/soundswallower.js.html#Endpointer.get_in_speech)):
```js
let prev_in_speech = ep.get_in_speech();
let frame_size = ep.get_frame_size();
// Presume `frame` is a Float32Array of frame_size or less
let speech;
if (frame.length < frame_size) speech = ep.end_stream(frame);
else speech = ep.process(frame);
if (speech !== null) {
  if (!prev_in_speech)
    console.log("Speech started at " + ep.get_speech_start());
  if (!ep.get_in_speech())
    console.log("Speech ended at " + ep.get_speech_end());
}
```
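Bridging the size mismatch mentioned above amounts to rebuffering: accumulate
the chunks WebAudio gives you and emit fixed-size frames for the endpointer.
The `FrameBuffer` class below is a hypothetical helper, not part of
SoundSwallower:

```javascript
// Hypothetical rebuffering queue: push() arbitrary-length Float32Array
// chunks in, get back an array of complete frames of exactly frameSize
// samples; any leftover samples are kept for the next call.
class FrameBuffer {
  constructor(frameSize) {
    this.frameSize = frameSize;
    this.pending = new Float32Array(0);
  }
  push(chunk) {
    // Join the leftover samples with the new chunk.
    const joined = new Float32Array(this.pending.length + chunk.length);
    joined.set(this.pending);
    joined.set(chunk, this.pending.length);
    // Slice off as many complete frames as possible.
    const frames = [];
    let pos = 0;
    while (joined.length - pos >= this.frameSize) {
      frames.push(joined.subarray(pos, pos + this.frameSize));
      pos += this.frameSize;
    }
    this.pending = joined.slice(pos); // remainder waits for the next chunk
    return frames;
  }
}

const fb = new FrameBuffer(512);
console.log(fb.push(new Float32Array(300)).length); // → 0 (not enough yet)
console.log(fb.push(new Float32Array(300)).length); // → 1 (512 of 600 consumed)
console.log(fb.pending.length); // → 88
```

Each frame returned by `push()` could then be fed to `ep.process()` as in the
example above, with `ep.end_stream()` on whatever remains at the end of input.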