3.9.2 Speech Processing in MMIR
An important feature of the MMIR framework is its speech input and output capabilities. Several modules are available for enabling speech input (ASR: Automatic Speech Recognition) and speech output (TTS: Text To Speech) on different platforms and in different environments, using different technologies.
The modules implement a unified API that is accessible via the `MediaManager` (via `mmir.media`). The configuration in `/www/config/configuration.json` determines which modules will be loaded (and be used when invoking the corresponding functions on the `MediaManager`).
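For illustration, a minimal usage sketch (assuming the framework has been initialized and suitable ASR/TTS modules are configured; the functions and callbacks shown here are described in detail further below):

```javascript
// access the unified media API (same functions, regardless of which
// module implementations were configured/loaded)
var media = mmir.media;

// speech input: recognize with end-of-speech detection
media.recognize(
  function success(result){ console.log('understood: ' + result); },
  function error(err){ console.error('ASR failed: ', err); }
);

// speech output: synthesize a text
media.tts({
  text: 'Hello world',
  error: function(err){ console.error('TTS failed: ', err); }
});
```

The following speech modules are available: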
speech input (ASR):

- `webspeechAudioInput.js`
  - Technology: Web Speech API (HTML5)
  - Platforms: HTML5/browser
  - Remarks: included in `mmir-lib`
  - Requirements: the Web Speech API is currently only supported by Google Chrome
  - Configuration ID: `webspeechAudioInput`
- `webasrGoogleImpl.js`
  - Technology: getUserMedia, WebWorker
  - Platforms: HTML5
  - Remarks:
    - can be installed as MMIR Media Plugin mmir-plugin-asr-google-web (see also mmir-plugins-media)
    - encoding to FLAC is done using JavaScript-based encoders
    - uses Google's Speech API (NOTE: requires credentials); for details see the compiled information on using the Google Speech API
  - Requirements:
    - accessing the web service requires credentials (i.e. an account etc. for using the web services)
    - the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web services), e.g. as a Chrome Extension or a Firefox Add-on
  - Configuration Settings: `{"mod": "webAudioInput", "config": "webasrGoogleImpl"}`
- `webasrNuanceImpl.js`
  - Technology: getUserMedia, WebWorker
  - Platforms: HTML5
  - Remarks:
    - can be installed as MMIR Media Plugin mmir-plugin-asr-nuance-web (see also mmir-plugins-media)
    - encoding to AMR is done using JavaScript-based encoders
    - uses Nuance SpeechKit (NOTE: requires credentials)
  - Requirements:
    - accessing the web service requires credentials (i.e. an account etc. for using the web services)
    - the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web services), e.g. as a Chrome Extension or a Firefox Add-on
  - Configuration Settings: `{"mod": "webAudioInput", "config": "webasrNuanceImpl"}`
- `nuanceAudioInput.js`
  - Technology: (native) Nuance SpeechKit
  - Platforms: Android, iOS
  - Remarks: can be installed as Cordova Nuance Speech Plugin (combined with `nuanceTextToSpeech.js`)
  - Requirements: requires credentials (i.e. an account for Nuance Developers)
  - Configuration ID: `nuanceAudioInput`
- `androidAudioInput.js`
  - Technology: the Android system's speech recognition (most of the time this would be Google Speech Recognition)
  - Platforms: Android
  - Remarks: can be installed as Cordova Android Speech Plugin (combined with `androidTextToSpeech.js`)
  - Configuration ID: `androidAudioInput`
speech output (TTS):

- `webttsMaryImpl.js`
  - Technology: Audio (HTML5), MARY (Open Source TTS system)
  - Platforms: HTML5
  - Remarks:
    - included in `mmir-lib`
    - should use its own installation of the MARY server (by default, this module is configured for the public MARY demo server); example entry for setting the MARY service URL (in `config/configuration.json`):

      ```
      ...
      "maryTextToSpeech": {
          "serverBasePath": "http://mary.dfki.de:59125/"
      },
      ...
      ```
  - Configuration Settings: `{"mod": "webAudioTextToSpeech", "config": "webttsMaryImpl"}`
- `webttsNuanceImpl.js`
  - Technology: Audio (HTML5)
  - Platforms: HTML5
  - Remarks:
    - can be installed as MMIR Media Plugin mmir-plugin-tts-nuance-web (see also mmir-plugins-media)
    - uses Nuance SpeechKit (NOTE: requires credentials)
  - Requirements:
    - requires credentials (i.e. an account for Nuance Developers)
    - the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the TTS web service via POST), e.g. as a Chrome Extension or a Firefox Add-on
  - Configuration Settings: `{"mod": "webAudioTextToSpeech", "config": "webttsNuanceImpl"}`
- `ttsSpeakJsImpl.js`
  - Technology: Audio (HTML5), WebWorker
  - Platforms: HTML5
  - Remarks:
    - can be installed as MMIR Media Plugin mmir-plugin-tts-speakjs (see also mmir-plugins-media)
    - uses an emscripten-compiled (i.e. JavaScript) version of eSpeak (Open Source), i.e. synthesis runs locally and does not require credentials or access to a web service
  - Configuration Settings: `{"mod": "webAudioTextToSpeech", "config": "ttsSpeakJsImpl"}`
- `nuanceTextToSpeech.js`
  - Technology: (native) Nuance SpeechKit
  - Platforms: Android, iOS
  - Remarks: can be installed as Cordova Nuance Speech Plugin (combined with `nuanceAudioInput.js`)
  - Requirements: requires credentials (i.e. an account for Nuance Developers)
  - Configuration ID: `nuanceTextToSpeech`
- `androidTextToSpeech.js`
  - Technology: the Android system's Text To Speech engine (most of the time this would be Google's TTS engine)
  - Platforms: Android
  - Remarks: can be installed as Cordova Android Speech Plugin (combined with `androidAudioInput.js`)
  - Configuration ID: `androidTextToSpeech`
Outdated modules (for reference):

speech input (ASR):

- `webasrGooglev1Impl.js`
  - Technology: getUserMedia, WebWorker
  - Platforms: HTML5
  - Remarks:
    - Outdated: the Google Web Speech Recognition service v1 that was used by this module is not available anymore
    - available as MMIR Media Plugin mmir-plugin-asr-googlev1-web (see also mmir-plugins-media)
    - encoding to FLAC is done using JavaScript-based encoders
  - Requirements:
    - the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web services), e.g. as a Chrome Extension or a Firefox Add-on
  - Configuration Settings: `{"mod": "webAudioInput", "config": "webasrGooglev1Impl"}`
- `webasrAtntImpl.js`
  - Technology: getUserMedia, WebWorker
  - Platforms: HTML5
  - Remarks:
    - Outdated: the AT&T Speech service is not available anymore
    - available as MMIR Media Plugin mmir-plugin-asr-atnt-web (see also mmir-plugins-media)
    - encoding to AMR is done using JavaScript-based encoders
    - uses the AT&T Speech API web service (NOTE: requires credentials)
  - Requirements:
    - accessing the web service requires credentials (i.e. an account etc. for using the web services)
    - the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web services), e.g. as a Chrome Extension or a Firefox Add-on
  - Configuration Settings: `{"mod": "webAudioInput", "config": "webasrAtntImpl"}`
TBD add detailed description
The speech modules are loaded at start-up of the app. In principle, there are two different configuration sets: one for the Cordova environment and one for the browser environment.
In `/www/config/configuration.json`:
"mediaManager": {
"plugins": {
"browser": ["html5AudioOutput.js",
"webkitAudioInput.js",
"maryTextToSpeech.js"
],
"cordova": ["cordovaAudioOutput.js",
"androidAudioInput.js",
"androidTextToSpeech.js"
]
}
}
During start-up, the framework tries to detect the platform / execution environment. Note that currently this is only relevant when executing as a Cordova app. If the Cordova plugin `cordova-plugin-device` is installed, the platform corresponds to the property `device.platform` (but all in lower case). Otherwise the platform is "manually" detected using the user-agent property. "Manually" detected platforms are:
- `android`
- `ios`
- `default`
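The following is an illustrative sketch of this detection logic (not the framework's actual implementation):

```javascript
// illustrative sketch of the platform detection described above
function detectPlatform() {
  // Cordova device plugin available: use device.platform (lower-cased)
  if (typeof device !== 'undefined' && device.platform) {
    return device.platform.toLowerCase();
  }
  // otherwise detect "manually" via the user agent
  var ua = navigator.userAgent.toLowerCase();
  if (ua.indexOf('android') !== -1) {
    return 'android';
  }
  if (/ipad|iphone|ipod/.test(ua)) {
    return 'ios';
  }
  return 'default';
}
```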
You can supply a platform-specific configuration by using the platform as a key/name. For example, the configuration:
"mediaManager": {
"plugins": {
"browser": ["html5AudioOutput.js",
"webkitAudioInput.js",
"maryTextToSpeech.js"
],
"cordova": ["cordovaAudioOutput.js",
"html5AudioInputEnc.js",
"maryTextToSpeech.js"
],
"ios": ["cordovaAudioOutput.js",
"nuanceAudioInput.js",
"nuanceTextToSpeech.js"
],
"android": ["cordovaAudioOutput.js",
"androidAudioInput.js",
"androidTextToSpeech.js"
]
}
}
would use the specific configuration `"ios"` when running as a Cordova app on an iOS device, the configuration `"android"` when running on an Android device, and `"cordova"` when running as a Cordova app on any other mobile platform.
TBD add detailed description
It is also possible to load multiple modules and use them independently during the runtime of the app. For this purpose, the `MediaManager` (at `mmir.media`) provides a `ctx` (~ "context") property (cf. also the functions `MediaManager.getFunc(ctx, funcName)`, `MediaManager.perform(ctx, funcName, args)`, and `MediaManager.setDefaultCtx(ctxId)`).
NOTE: In most cases ASR and TTS should work as a singleton / exclusively, i.e. only one ASR or one TTS should be active at a time. This needs special attention when using multiple modules (e.g. canceling ASR on all modules before starting TTS).
For example, the following configuration in `/www/config/configuration.json`:
"mediaManager": {
"plugins": {
"browser": ["html5AudioOutput.js",
"webkitAudioInput.js",
"maryTextToSpeech.js",
{"ctx": "nuance", "mod": "webAudioInput", "config": "webasrNuanceImpl"},
{"ctx": "nuance", "mod": "webTextToSpeech", "config": "webttsNuanceImpl"}
],
"cordova": ["cordovaAudioOutput.js",
"androidAudioInput.js",
"nuanceTextToSpeech.js",
{"ctx": "android", "mod": "androidAudioInput.js"},
{"ctx": "android", "mod": "androidTextToSpeech.js"},
{"ctx": "nuance", "mod": "nuanceAudioInput.js"},
{"ctx": "nuance", "mod": "nuanceTextToSpeech.js"}
]
}
}
will load `maryTextToSpeech.js` as the default TTS module in the browser environment, and `webasrNuanceImpl` (as well as `webttsNuanceImpl`) in the context `nuance`, i.e. `MediaManager.textToSpeech()` will use MARY TTS, and `MediaManager.ctx.nuance.textToSpeech()` will use the HTTP module for the Nuance TTS.
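As a sketch (based on the browser configuration above, and assuming that context modules expose the same functions as the default context), the default and the `nuance` TTS can be invoked independently; note that, per the NOTE above, only one of them should be active at a time:

```javascript
// default context: uses maryTextToSpeech.js
mmir.media.tts({text: 'Hello from the default TTS engine'});

// context "nuance": uses webttsNuanceImpl
// (per the NOTE above: only one TTS should be active at a time)
mmir.media.ctx.nuance.tts({text: 'Hello from the Nuance TTS engine'});

// equivalent invocation via perform(ctx, funcName, args):
mmir.media.perform('nuance', 'tts', [{text: 'Hello from the Nuance TTS engine'}]);
```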
NOTE: the `MediaManager` can be used via `mmir.media`.
```javascript
MediaManager.recognize(success, error)  // with end-of-speech detection
MediaManager.startRecord(success, error[, withIntermediateResults])  // without end-of-speech detection
MediaManager.stopRecord(success, error)
MediaManager.cancelRecognition(success, error)
```
callbacks:

`success(asr_result: String, asr_score: Number, asr_type: String, asr_alternatives: Array, asr_unstable: String)`
- NOTE: the success callback will usually be invoked multiple times (see `asr_type`)
- `asr_result`: the ASR result as returned by the used speech recognizer
- `asr_score`: the score for the result (`-1` indicates that the recognizer did not provide a scoring); the actual value and value range depend on the recognizer
- `asr_type`: one of `"RECORDING_BEGIN"`, `"INTERMEDIATE"`, `"FINAL"`, `"RECORDING_DONE"`
  - `RECORDING_BEGIN`: audio input is active
  - `INTERMEDIATE`: intermediate ASR result
  - `FINAL`: final ASR result
  - `RECORDING_DONE`: audio input is finished
- `asr_alternatives`: if there are alternative results, then this is an array with entries `{result: String, score: Number}`
- `asr_unstable`: if the recognizer supports this feature, then this String may contain unstable/"guessed" parts of the current recognition
`error(err, errCode)`
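For example, a sketch of a success handler for `startRecord()` that distinguishes the `asr_type` values listed above:

```javascript
function onAsrResult(result, score, type, alternatives, unstable) {
  switch (type) {
    case 'RECORDING_BEGIN':
      console.log('audio input is active...');
      break;
    case 'INTERMEDIATE':
      // intermediate result, possibly with an unstable/"guessed" part
      console.log('intermediate: ' + result + (unstable ? ' [' + unstable + ']' : ''));
      break;
    case 'FINAL':
      console.log('final: ' + result + ' (score: ' + score + ')');
      break;
    case 'RECORDING_DONE':
      console.log('audio input finished');
      break;
  }
}

mmir.media.startRecord(onAsrResult, function(err, errCode) {
  console.error('ASR error (' + errCode + '): ' + err);
}, true); // true: with intermediate results
```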
See also the API documentation.
NOTE: the `MediaManager` can be used via `mmir.media`.
```javascript
MediaManager.tts(options, onPlayedCallback, failureCallback, onReadyCallback)
MediaManager.cancelSpeech(success, error)
```
where `options`:

```javascript
{
    text: String | String[]  // text that should be read aloud
  , pauseDuration: Number    // OPTIONAL, the length of the pauses between sentences (i.e. for String arrays) in milliseconds
  , language: String         // OPTIONAL, the language for synthesis (if omitted, the current language setting is used)
  , voice: String            // OPTIONAL, the (language-specific) voice for synthesis; NOTE that the available voices depend on the TTS engine
  , success: Function        // OPTIONAL, the on-playing-completed callback (see arg onPlayedCallback)
  , error: Function          // OPTIONAL, the error callback (see arg failureCallback)
  , ready: Function          // OPTIONAL, the audio-ready callback (see arg onReadyCallback)
}
```
callback `onPlayedCallback`: OPTIONAL callback that is invoked when the audio of the speech synthesis has finished playing: `onPlayedCallback()`

callback `failureCallback`: OPTIONAL callback that is invoked in case an error occurred: `failureCallback(error: String | Error)`

callback `onReadyCallback`: OPTIONAL callback that is invoked when the audio becomes ready / starts to play. If, after the first invocation, the audio is paused in order to prepare the next audio, then the callback will be invoked with `false`, and then with `true` (as first argument) when the audio becomes ready again, i.e. the callback signature is: `onReadyCallback(isReady: Boolean, audio: IAudio)`
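Putting the options together, a minimal usage sketch (the concrete `language` value is an example and must be supported by the configured TTS engine):

```javascript
mmir.media.tts({
  text: ['Welcome!', 'This is the second sentence.'],
  pauseDuration: 500,   // 500 ms pause between the two sentences
  language: 'en',       // example value; depends on the TTS engine
  success: function() { console.log('finished playing'); },
  error: function(err) { console.error('TTS error: ', err); },
  ready: function(isReady, audio) { console.log('audio ready: ' + isReady); }
});
```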
See also the API documentation.
deprecated:
`MediaManager.textToSpeech: function(parameter: String | Array, onPlayedCallback, failureCallback)`
The `SemanticInterpreter` (at `mmir.semantic`) offers grammar-based processing of the ASR results, that is, it translates the natural language text into machine-understandable data.
See also the section about Adding a New Grammar.
`SemanticInterpreter.interpret(phrase, langCode, callback)`

where:
- `phrase`: String, the natural language text that should be parsed
- `langCode`: OPTIONAL String, the language code (identifier) for the parser/grammar
- `callback`: OPTIONAL Function, a callback function that receives the return value (instead of receiving the result as return value from this function directly). The signature for the callback is `callback(result: Object)`, i.e. the result that would be returned by this function itself is passed as an argument into the callback function (see also the documentation for *returns*).
NOTE: in case the grammar for the requested `langCode` is not compiled yet (i.e. not present as executable JavaScript), the corresponding JSON definition of the grammar needs to be compiled first, before processing of the ASR's semantics is possible. In this case, a callback function MUST be supplied in order to receive a result (since the compilation of the grammar may be asynchronous).
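For example, a sketch of interpreting an ASR result (a callback is supplied, so this also works when the grammar still needs to be compiled first; the phrase and language code are placeholders that depend on the actual grammar):

```javascript
mmir.semantic.interpret('please turn the light on', 'en', function(result) {
  // result: the semantic interpretation as produced by the grammar
  console.log(JSON.stringify(result));
});
```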
See also the API documentation.
NOTE: in `mmir-lib` < 4.x, the function `interpret` is called `getASRSemantic`.
< previous: "Setup MMIR for Internationalization" | next: "Getting Started" >