EMS (External Media Server)

EMS is a service that enriches your calls with:

  • Speech recognition
  • Speech synthesis
  • Audio playback using the WebSocket API
  • Call termination using the WebSocket API
  • Dynamic speech recognition service configuration

You can use it only in tandem with an AudioSocket client, as EMS itself only receives extracted call audio and sends audio back for playback.

Currently, only one client supports the AudioSocket protocol - Asterisk.

Configuration of external media and the AudioSocket protocol is covered in the official Asterisk documentation.
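For illustration, a minimal Asterisk dialplan that routes a call's audio to EMS via the AudioSocket application might look roughly like this (the extension, UUID, and address are placeholders; check the documentation for your Asterisk version):

[from-internal]
exten => 100,1,Answer()
 same => n,AudioSocket(00000000-0000-0000-0000-000000000000,127.0.0.1:12345)
 same => n,Hangup()

The UUID passed to AudioSocket is the same one you later use to address the call over the WebSocket API.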

While EMS can work fully standalone, relying only on an AudioSocket client, features like Text-to-Speech and Speech-to-Text require an additional connection from your application via WebSocket. With a configured WebSocket stream, EMS can send you speech transcriptions and play back text that you send over the connection.

Currently, only Google Cloud services are supported - Cloud Speech-to-Text and Cloud Text-to-Speech. If you implement support for similar services from other cloud providers, PRs are always appreciated.

Build

EMS uses rustup to manage the Rust toolchain, so ensure that you have it installed.

$ git clone https://github.com/ivan770/ems
$ cd ems
$ cargo build --release
$ ./target/release/ems --help

Usage

To start using EMS, you have to configure the server via the ./ems.toml file (you can change the path via the --config flag):

audiosocket_addr = "0.0.0.0:12345"
websocket_addr = "0.0.0.0:12346"

recognition_driver = "google"
synthesis_driver = "google"

[gcs]
service_account_path = "./my_gcs_key.json"

[gctts]
service_account_path = "./my_gctts_key.json"

Then, launch EMS with ./ems.
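If your configuration file lives elsewhere, pass its path via the --config flag (the path below is just an example):

$ ./ems --config /etc/ems/ems.toml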

Configuration

The config provided above uses the recommended values for optional keys. All available keys and their default values are listed below:

| Name | Default | Description |
| --- | --- | --- |
| threads | All CPU threads | Maximum number of threads EMS can use at runtime |
| audiosocket_addr | Required | Address on which EMS listens for incoming AudioSocket messages |
| websocket_addr | Required | Address on which EMS listens for incoming WebSocket streams |
| message_timeout | 3 | Maximum number of seconds EMS waits between AudioSocket messages; if the elapsed time is greater, EMS closes the AudioSocket connection |
| recognition_config_timeout | 3 | Maximum number of seconds EMS waits for a speech recognition config; if the elapsed time is greater, EMS uses the config provided in ems.toml |
| recognition_driver | | Speech recognition driver used to generate call transcriptions. Supported values: google |
| synthesis_driver | | Speech synthesis driver used for call voice synthesis. Supported values: google |
| gcs | | Google Cloud Speech-to-Text configuration |
| gctts | | Google Cloud Text-to-Speech configuration |
| recognition_fallback | en-US | Speech recognition fallback config, used when the WebSocket client missed the opportunity to send a recognition config |
| loopback_audio | false | Send all received audio back |
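For example, to cap EMS at four runtime threads, relax both timeouts, and echo all received audio back for debugging, you could extend ems.toml like this (values are illustrative):

threads = 4
message_timeout = 5
recognition_config_timeout = 5
loopback_audio = true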

WebSocket API

Every WebSocket message requires you to provide the UUID of a call. Usually, you specify this UUID when registering the external media server.

Requests

Terminate call:

{
    "id": "00000000-0000-0000-0000-000000000000",
    "data": "hangup"
}
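For illustration, sending that request from a Rust client could look roughly like this minimal sketch (the tokio, tokio-tungstenite, futures-util, and serde_json crates and the address are assumptions; any WebSocket client works):

use futures_util::SinkExt;
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the websocket_addr configured in ems.toml.
    let (mut ws, _) = connect_async("ws://127.0.0.1:12346").await?;

    // Hang up the call identified by this UUID.
    let request = serde_json::json!({
        "id": "00000000-0000-0000-0000-000000000000",
        "data": "hangup"
    });

    ws.send(Message::Text(request.to_string().into())).await?;
    Ok(())
}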

Synthesize speech:

gender, speaking_rate and pitch are optional.

{
    "id": "00000000-0000-0000-0000-000000000000",
    "data": {
        "synthesize": {
            "ssml": "Hello, world",
            "language_code": "en-US",
            "gender": "neutral",
            "speaking_rate": 2,
            "pitch": 1
        }
    }
}

Speech recognition config:

profanity_filter and punctuation are optional.

{
    "id": "00000000-0000-0000-0000-000000000000",
    "data": {
        "recognitionConfig": {
            "language": "en-US",
            "profanity_filter": false,
            "punctuation": false
        }
    }
}

Responses

Call transcription:

{
    "id": "00000000-0000-0000-0000-000000000000",
    "data": {
        "transcription": "Hello, world"
    }
}

Recognition config request (sent by EMS when it wants a recognition config for a call; if the client doesn't answer within recognition_config_timeout seconds, the fallback config is used):

{
    "id": "00000000-0000-0000-0000-000000000000",
    "data": "recognitionConfigRequest"
}
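Putting both response types together, a client loop might look roughly like the following sketch (again assuming the tokio, tokio-tungstenite, futures-util, and serde_json crates; the address and the chosen recognition config are placeholders):

use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (mut ws, _) = connect_async("ws://127.0.0.1:12346").await?;

    while let Some(msg) = ws.next().await {
        if let Message::Text(text) = msg? {
            let response: serde_json::Value = serde_json::from_str(&text)?;

            // EMS asks for a recognition config; answer with one,
            // echoing back the call UUID from the request.
            if response["data"] == "recognitionConfigRequest" {
                let config = serde_json::json!({
                    "id": response["id"].clone(),
                    "data": {
                        "recognitionConfig": {
                            "language": "en-US",
                            "profanity_filter": false,
                            "punctuation": true
                        }
                    }
                });
                ws.send(Message::Text(config.to_string().into())).await?;
            // A transcription arrived; print it alongside the call UUID.
            } else if let Some(transcription) = response["data"]["transcription"].as_str() {
                println!("{}: {}", response["id"], transcription);
            }
        }
    }

    Ok(())
}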