This is a C++ SDK for Verbit's Streaming Transcription API.
A client using the C++ SDK makes a WebSocket connection to Verbit's Streaming Speech Recognition services, streams media to the service, and gets transcription responses back from it.
The C++ SDK can be downloaded from GitHub.
- C++11 (currently tested with `g++` v4.8+)
- WebSocket++
  - this is a header-only library
  - requires Boost headers for compilation only (libraries not needed)
    - tested with Boost 1.71.0 in Ubuntu 20.04.3 LTS
    - tested with Boost 1.65.1 in Ubuntu 18.04.6 LTS
- JSON for Modern C++ 3.x
  - this is a header-only library
- Doxygen and Graphviz (if building documentation)
To compile and install the SDK from source:
$ make all [ doc ]
$ sudo make [ TARGET=/usr/local ] install [ install-doc ]
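For example, to build everything including the documentation and install under a custom prefix (here `/opt/verbit`, an arbitrary choice):

$ make all doc
$ sudo make TARGET=/opt/verbit install install-doc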
Ubuntu 18.04 LTS and 20.04 LTS packages are also available; the package can be built from source:
$ make package
Verbit's Streaming Speech Recognition services can currently produce two different types of responses; the client can request either or both:

- **Transcript**: this type of response contains the recognized words since the beginning of the current utterance. In automatic speech recognition, the stream of words is segmented into utterances, much as it is in natural human speech. An utterance is recognized incrementally, processing more of the incoming audio at each step. Each utterance starts at a specific start-time and extends its end-time with each step, yielding the most up-to-date result. Note that sequential updates for the same utterance will overlap, each response superseding the previous one, until a response signaling the end of the utterance is received (having `is_final == True`). The `alternatives` array might contain different hypotheses, ordered by confidence level. Example "transcript" responses can be found in `examples/responses/transcript.md`.
- **Captions**: this type of response contains the recognized words within a specific time window. In contrast to the incremental nature of "transcript"-type responses, "captions"-type responses are non-overlapping and consecutive. Only one "captions"-type response covering a specific time-span in the audio will be returned (or none, if no words were uttered). The `is_final` field is always `True` because no updates will be output for the same time-span. The `alternatives` array will always have only one item for captions. Example "captions" responses can be found in `examples/responses/captions.md`.

The full JSON response schema can be found in `examples/responses/schema.md`.
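As a rough illustration of consuming these responses, the sketch below parses a payload with JSON for Modern C++ (already a dependency of this SDK) and prints the best hypothesis. Only `is_final` and `alternatives` are described above; the other keys used here (`response`, `transcript`) are assumptions to be checked against the schema in `examples/responses/schema.md`.

```cpp
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Sketch only: apart from `is_final` and `alternatives`, the key names
// below ("response", "transcript") are assumptions -- verify them
// against examples/responses/schema.md.
void handle_response(const std::string& payload)
{
    const auto msg = nlohmann::json::parse(payload);
    const auto& response = msg.contains("response") ? msg["response"] : msg;

    const bool is_final = response.value("is_final", false);
    const auto& alternatives = response["alternatives"];

    // Captions carry exactly one alternative; transcript alternatives
    // are ordered by confidence, so the first one is the best hypothesis.
    if (!alternatives.empty()) {
        std::cout << (is_final ? "[final] " : "[partial] ")
                  << alternatives[0].value("transcript", "") << '\n';
    }
}
```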
In order to use Verbit's Streaming Speech Recognition services, you must place an order using Verbit's Ordering API. Please refer to the "Ordering API" section of the Python (reference) SDK documentation for details.
After placing the order and obtaining the access token, you can run the client example. This sends audio from a WAV file, and receives Captions-type responses, which are emitted to `stdout`.
$ make bin/example_client
$ export VERBIT_WS_TOKEN="<your_access_token>"
$ bin/example_client test-files/fox-and-grapes.wav
The example source can be used to guide the implementation of your own custom client (a rough sketch of the overall wiring follows this list):

- `examples/example_client.cpp` is the main client program
  - consult `WebSocketStreamingClient` in the SDK documentation for details
- `examples/wav_media_generator.*` shows how to create a custom media generator
  - consult `MediaGenerator` in the SDK documentation for details
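To show how these pieces are intended to fit together, here is a minimal, hypothetical sketch: it derives a custom media generator and hands it to the streaming client. The class names `WebSocketStreamingClient` and `MediaGenerator` are taken from the SDK documentation referenced above, but every header name, method name, and signature below is an assumption; treat `examples/example_client.cpp` and the generated documentation as the authoritative reference.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Assumed header names -- check the SDK's include layout.
#include "media_generator.h"
#include "websocket_streaming_client.h"

// Hypothetical custom media generator: the client pulls successive
// chunks of media from it and streams them over the WebSocket.
// The virtual method names below are assumptions.
class MyMediaGenerator : public MediaGenerator {
public:
    std::vector<char> get_chunk() override
    {
        // Return the next chunk of audio from your own source
        // (capture device, network stream, file, ...).
        return {};
    }
    bool finished() override { return true; } // no more media to send
};

int main()
{
    // The example client takes its access token from VERBIT_WS_TOKEN.
    const char* token = std::getenv("VERBIT_WS_TOKEN");
    if (token == nullptr) return 1; // a valid access token is required

    MyMediaGenerator media;
    WebSocketStreamingClient client{std::string(token)};
    client.run_stream(media); // assumed entry point: connect and stream
    return 0;
}
```

Whatever the exact interface, the separation of concerns is the point: the generator produces media, while the client owns the WebSocket session and delivers responses.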
Documentation for the C++ SDK is installed to `/usr/local/share/doc/verbit_streaming` by `make install` (or `/usr/share/doc/verbit_streaming` by the Ubuntu package). You can build the documentation in the `doc` subdirectory with:
$ make doc
The C++ SDK comes with a set of unit and client tests that can be used to ensure correct functioning of the streaming client. To run them:
$ make test
The C++ SDK also comes with a test WebSocket server, compatible with Verbit's Streaming Speech Recognition services, which can be used for client testing. It requires the UUID library.
By default the test server listens on port `9002/tcp`. To start it:
$ make run-test-server

or run the binary directly, optionally passing a listening port:

test-bin/test_server [ port ]
You can then run the client example against the test server with:
$ export VERBIT_WS_TOKEN="<any_string_that_is_at_least_40_characters_long>"
$ make run-test-example

or invoke the client directly, using `-u` to point it at the test server's URL:

bin/example_client [ -u wss://localhost:9002 ] test-files/thats-good.wav
The test server currently only supports Captions-type responses. It returns fake transcription text (i.e. it does no speech processing on the received media).