
Speech Processing in MMIR

Speech Modules

One important feature of the MMIR framework is its speech input and output capabilities. Several modules are available for enabling speech input (ASR: Automatic Speech Recognition) and speech output (TTS: Text To Speech) on different platforms and environments, using different technologies.

The modules implement a unified API that is accessible via the MediaManager (at mmir.media). The configuration in /www/config/configuration.json determines which modules will be loaded (and used when invoking the corresponding functions on the MediaManager).
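
For example, once the framework has been initialized, the configured modules can be invoked through this unified API (a minimal sketch; see the Speech Input API and Speech Output API sections below for details):

    // minimal sketch: invoking the unified speech API via mmir.media
    // (which module actually handles the call depends on the configuration)
    mmir.media.recognize(
    	function(result, score, type){ console.log("ASR (" + type + "): " + result); },
    	function(err){ console.error("ASR failed: " + err); }
    );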

Speech Input Modules (ASR)

  • webspeechAudioInput.js
    • Technology: Web Speech API (HTML5)
    • Platforms: HTML5/browser
    • Remarks: included in mmir-lib
    • Requirements: the Web Speech API is currently only supported by Google Chrome
    • Configuration ID: webspeechAudioInput
  • webasrGoogleImpl.js
    • Technology: getUserMedia, WebWorker
    • Platforms: HTML5
    • Requirements:
      • accessing the web service requires credentials (i.e. an account for using the web service)
      • the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web service), e.g. as a Chrome extension or Firefox add-on
    • Configuration Settings: {"mod": "webAudioInput", "config": "webasrGoogleImpl"}
  • webasrNuanceImpl.js
    • Technology: getUserMedia, WebWorker
    • Platforms: HTML5
    • Requirements:
      • accessing the web service requires credentials (i.e. an account for using the web service)
      • the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web service), e.g. as a Chrome extension or Firefox add-on
    • Configuration Settings: {"mod": "webAudioInput", "config": "webasrNuanceImpl"}
  • nuanceAudioInput.js
    • Technology: (native) Nuance SpeechKit
    • Platforms: Android, iOS
    • Remarks: can be installed as Cordova Nuance Speech Plugin (combined with nuanceTextToSpeech.js)
    • Requirements: requires credentials (i.e. a Nuance Developers account)
    • Configuration ID: nuanceAudioInput
  • androidAudioInput.js
    • Technology: the Android system's speech recognition service (in most cases, Google's speech recognizer)
    • Platforms: Android
    • Remarks: can be installed as Cordova Android Speech Plugin (combined with androidTextToSpeech.js)
    • Configuration ID: androidAudioInput
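
The Configuration ID or Configuration Settings given for each module above are used as entries in the mediaManager.plugins list of configuration.json (see the Configuration section below). For example, a sketch that loads the Web Speech module via its ID and the Nuance web ASR module via a settings object:

    "mediaManager": {
    	"plugins": {
    		"browser": ["webspeechAudioInput.js",
    		            {"mod": "webAudioInput", "config": "webasrNuanceImpl"}
    		]
    	}
    }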

Speech Output Modules (TTS)

  • webttsMaryImpl.js
    • Technology: Audio (HTML5), MARY (Open Source TTS system)
    • Platforms: HTML5
    • Remarks:
      • included in mmir-lib
      • applications should use their own installation of the MARY server (by default, this module is configured for the public MARY demo server);
        example entry for setting the MARY service URL (in config/configuration.json):
        ...
        "maryTextToSpeech": {
        	"serverBasePath": "http://mary.dfki.de:59125/"
        },
        ...
    • Configuration Settings: {"mod": "webAudioTextToSpeech", "config": "webttsMaryImpl"}
  • webttsNuanceImpl.js
    • Technology: Audio (HTML5)
    • Platforms: HTML5
    • Requirements:
      • requires credentials (i.e. a Nuance Developers account)
      • the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the TTS web service via POST), e.g. as a Chrome extension or Firefox add-on
    • Configuration Settings: {"mod": "webAudioTextToSpeech", "config": "webttsNuanceImpl"}
  • ttsSpeakJsImpl.js
    • Technology: Audio (HTML5), WebWorker
    • Platforms: HTML5
    • Remarks: synthesis is performed client-side by the JavaScript-based speak.js synthesizer (running in a WebWorker), so no web service account or cross-domain access is required
    • Configuration Settings: {"mod": "webAudioTextToSpeech", "config": "ttsSpeakJsImpl"}
  • nuanceTextToSpeech.js
    • Technology: (native) Nuance SpeechKit
    • Platforms: Android, iOS
    • Remarks: can be installed as Cordova Nuance Speech Plugin (combined with nuanceAudioInput.js)
    • Requirements: requires credentials (i.e. a Nuance Developers account)
    • Configuration ID: nuanceTextToSpeech
  • androidTextToSpeech.js
    • Technology: the Android system's Text To Speech engine (in most cases, Google's TTS engine)
    • Platforms: Android
    • Remarks: can be installed as Cordova Android Speech Plugin (combined with androidAudioInput.js)
    • Configuration ID: androidTextToSpeech

Outdated Speech Modules

(for reference)

  • speech input (ASR)
    • webasrGooglev1Impl.js
      • Technology: getUserMedia, WebWorker
      • Platforms: HTML5
      • Remarks:
        • Outdated: the Google Web Speech Recognition service v1 that was used for this module is not available anymore
        • available as the MMIR media plugin mmir-plugin-asr-googlev1-web (see also mmir-plugins-media)
        • encoding to FLAC is done using JavaScript-based encoders
      • Requirements:
        • the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web service), e.g. as a Chrome extension or Firefox add-on
      • Configuration Settings: {"mod": "webAudioInput", "config": "webasrGooglev1Impl"}
    • webasrAtntImpl.js
      • Technology: getUserMedia, WebWorker
      • Platforms: HTML5
      • Requirements:
        • accessing the web service requires credentials (i.e. an account for using the web service)
        • the app using this module needs to run in a special execution context that allows cross-domain access (for accessing the ASR web service), e.g. as a Chrome extension or Firefox add-on
      • Configuration Settings: {"mod": "webAudioInput", "config": "webasrAtntImpl"}

Configuration

TBD add detailed description

The speech modules are loaded at start-up of the app. In principle, there are two different configuration sets: one for the Cordova environment and one for the browser environment.

In /www/config/configuration.json:

    "mediaManager": {
    	"plugins": {
    		"browser": ["html5AudioOutput.js",
    		            "webkitAudioInput.js",
    		            "maryTextToSpeech.js"
    		],
    		"cordova": ["cordovaAudioOutput.js",
    		            "androidAudioInput.js",
    		            "androidTextToSpeech.js"
    		]
    	}
    }

Platform Specific Configuration

During start-up, the framework tries to detect the platform / execution environment. Note that currently this is only relevant when running as a Cordova app. If the Cordova plugin cordova-plugin-device is installed, the platform corresponds to the property device.platform (converted to lower case). Otherwise, the platform is "manually" detected from the user agent string; a sketch of this detection logic is shown after the following list. "Manually" detected platforms are:

  • android
  • ios
  • default
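
A minimal sketch of this detection logic (illustrative only, not the framework's actual implementation):

    // illustrative sketch of the platform detection described above
    function detectPlatform(){
    	// Cordova environment with cordova-plugin-device installed:
    	if(typeof device !== "undefined" && device.platform){
    		return device.platform.toLowerCase();	// e.g. "android", "ios"
    	}
    	// otherwise: "manual" detection via the user agent
    	var ua = navigator.userAgent.toLowerCase();
    	if(/android/.test(ua)) return "android";
    	if(/iphone|ipad|ipod/.test(ua)) return "ios";
    	return "default";
    }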

You can supply a platform-specific configuration by using the platform name as a key.

For example, the configuration:

    "mediaManager": {
    	"plugins": {
    		"browser": ["html5AudioOutput.js",
    		            "webkitAudioInput.js",
    		            "maryTextToSpeech.js"
    		],
    		"cordova": ["cordovaAudioOutput.js",
    		            "html5AudioInputEnc.js",
    		            "maryTextToSpeech.js"
    		],
    		"ios":     ["cordovaAudioOutput.js",
    		            "nuanceAudioInput.js",
    		            "nuanceTextToSpeech.js"
    		],
    		"android": ["cordovaAudioOutput.js",
    		            "androidAudioInput.js",
    		            "androidTextToSpeech.js"
    		]
    	}
    }

would use the specific configuration "ios" when running as a Cordova app on an iOS device, "android" when running on an Android device, and "cordova" when running as a Cordova app on any other mobile platform.

Configuration for Using Multiple Modules

TBD add detailed description

It is also possible to load multiple modules and use them independently during runtime of the app.

For this purpose, the MediaManager (at mmir.media) provides a ctx (~ "context") property (cf. also the functions MediaManager.getFunc(ctx, funcName), MediaManager.perform(ctx, funcName, args), and MediaManager.setDefaultCtx(ctxId)).

NOTE: In most cases, ASR and TTS should be used exclusively (like a singleton), i.e. only one ASR or one TTS operation should be active at a time. This needs special attention when using multiple modules (e.g. cancel ASR on all modules before starting TTS).

For example, the following configuration in /www/config/configuration.json:

    "mediaManager": {
    	"plugins": {
    		"browser": ["html5AudioOutput.js",
    		            "webkitAudioInput.js",
    		            "maryTextToSpeech.js",
    		            {"ctx": "nuance", "mod": "webAudioInput", "config": "webasrNuanceImpl"},
    		            {"ctx": "nuance", "mod": "webTextToSpeech", "config": "webttsNuanceImpl"}
    		],
    		"cordova": ["cordovaAudioOutput.js",
    		            "androidAudioInput.js",
    		            "nuanceTextToSpeech.js",
    		            {"ctx": "android", "mod": "androidAudioInput.js"},
    		            {"ctx": "android", "mod": "androidTextToSpeech.js"},
    		            {"ctx": "nuance", "mod": "nuanceAudioInput.js"},
    		            {"ctx": "nuance", "mod": "nuanceTextToSpeech.js"}
    		]
    	}
    }

will load maryTextToSpeech.js as the default TTS module in the browser environment, and webasrNuanceImpl (as well as webttsNuanceImpl) in the context nuance, i.e. MediaManager.textToSpeech() will use MARY TTS, and MediaManager.ctx.nuance.textToSpeech() will use the HTTP module for the Nuance TTS.
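
For example (a sketch, assuming the browser configuration above):

    // default context: uses the MARY TTS module
    mmir.media.tts({text: "read aloud with MARY TTS"});

    // "nuance" context: uses the Nuance web TTS module
    mmir.media.ctx.nuance.tts({text: "read aloud with Nuance TTS"});

    // per the NOTE above: before starting TTS, cancel any active ASR
    // in all contexts (callbacks omitted here for brevity)
    mmir.media.cancelRecognition();
    mmir.media.ctx.nuance.cancelRecognition();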

Speech Input API

NOTE: the MediaManager can be accessed via mmir.media.

MediaManager.recognize(success, error)								//with end-of-speech detection
MediaManager.startRecord(success, error[, withIntermediateResults])	//without end-of-speech detection
MediaManager.stopRecord(success, error)

MediaManager.cancelRecognition(success, error)

callbacks:
success(asr_result: String, asr_score: Number, asr_type: String, asr_alternatives: Array, asr_unstable: String)
  - NOTE: the success callback will usually be invoked multiple times (see asr_type)
  - asr_result: the ASR result as returned by the speech recognizer
  - asr_score: the score for the result (-1 indicates that there was no scoring by the recognizer); the actual value and value range depend on the recognizer
  - asr_type: one of "RECORDING_BEGIN", "INTERMEDIATE", "FINAL", "RECORDING_DONE"
    - RECORDING_BEGIN: audio input is active
    - INTERMEDIATE: intermediate ASR result
    - FINAL: final ASR result
    - RECORDING_DONE: audio input is finished
  - asr_alternatives: if there are alternative results, then this is an array with entries {result: String, score: Number}
  - asr_unstable: if the recognizer supports this feature, then this String may contain unstable/"guessed" parts for the current recognition
error(err, errCode)
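
Example usage (a sketch): start recording with intermediate results and stop recording later, e.g. when the user releases a push-to-talk button:

    // start ASR without end-of-speech detection, with intermediate results enabled
    mmir.media.startRecord(function(result, score, type, alternatives, unstable){
    	if(type === "INTERMEDIATE" || type === "FINAL"){
    		console.log("ASR (" + type + "): " + result);
    	}
    }, function(err, errCode){
    	console.error("ASR error (" + errCode + "): " + err);
    }, true);

    // ... later, e.g. on button release:
    mmir.media.stopRecord(function(result){
    	console.log("final result: " + result);
    });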

See also the API documentation.

Speech Output API

NOTE: the MediaManager can be accessed via mmir.media.

MediaManager.tts(options, onPlayedCallback, failureCallback, onReadyCallback)

MediaManager.cancelSpeech(success, error)

where options:
{
    text: String | String[], text that should be read aloud
  , pauseDuration: OPTIONAL Number, the length of the pauses between sentences (i.e. for String Arrays) in milliseconds
  , language: OPTIONAL String, the language for synthesis (if omitted, the current language setting is used)
  , voice: OPTIONAL String, the voice (language specific) for synthesis; NOTE that the specific available voices depend on the TTS engine
  , success: OPTIONAL Function, the on-playing-completed callback (see arg onPlayedCallback)
  , error: OPTIONAL Function, the error callback (see arg failureCallback)
  , ready: OPTIONAL Function, the audio-ready callback (see arg onReadyCallback)
}

callback onPlayedCallback: OPTIONAL callback that is invoked when the audio of the speech synthesis finished playing: onPlayedCallback()

callback failureCallback: OPTIONAL callback that is invoked in case an error occurred: failureCallback(error: String | Error)

callback onReadyCallback: OPTIONAL callback that is invoked when the audio becomes ready / starts playing. If, after the first invocation, the audio is paused in order to prepare the next audio, the callback will be invoked with false, and then with true (as its first argument) when the audio becomes ready again, i.e. the callback signature is: onReadyCallback(isReady: Boolean, audio: IAudio)
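
Example usage (a sketch): reading two sentences with a pause in between:

    mmir.media.tts({
    	text: ["Hello world.", "This text is read aloud by the TTS engine."],
    	pauseDuration: 500,	// 500 ms pause between the two sentences
    	language: "en",
    	success: function(){ console.log("finished playing"); },
    	error: function(err){ console.error("TTS error: " + err); }
    });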

See also the API documentation.

deprecated:
MediaManager.textToSpeech: function(parameter: String | Array, onPlayedCallback, failureCallback)

SemanticInterpreter

The SemanticInterpreter (at mmir.semantic) offers grammar-based processing of ASR results, that is, translating the natural language text into machine-understandable data.
See also the section about Adding a New Grammar.

SemanticInterpreter.interpret(phrase, langCode, callback)

where:
 * phrase: String, the natural language text that should be parsed
 * langCode: OPTIONAL String, the language code (identifier) for the parser/grammar
 * callback: OPTIONAL Function, a callback that receives the result
        (instead of the result being returned by this function directly).
        The signature for the callback is:
              callback(result: Object)
        (i.e. the result that would otherwise be returned by this function is
        passed as an argument into the callback; see also the documentation
        for returns).
        NOTE: in case the grammar for the requested langCode is not compiled
              yet (i.e. not present as executable JavaScript), the
              corresponding JSON definition of the grammar needs to be
              compiled first, before processing the ASR's semantics is
              possible. In this case, a callback function MUST be supplied
              in order to receive a result (since compilation of the grammar
              may be asynchronous).
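
Example usage (a sketch; the structure of the result depends on the app's grammar definition):

    mmir.semantic.interpret("please search for pizza", "en", function(result){
    	// the result object contains the grammar's interpretation of the phrase
    	// (e.g. matched phrases and their semantic values, as defined by the grammar)
    	console.log(JSON.stringify(result));
    });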

See also the API documentation.

NOTE: in mmir-lib < 4.x, the function interpret is called getASRSemantic.
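
If an app needs to run with both older and newer mmir-lib versions, a small compatibility shim is possible (a sketch):

    // use interpret if available (mmir-lib >= 4.x), otherwise fall back to getASRSemantic
    var interpretFunc = mmir.semantic.interpret || mmir.semantic.getASRSemantic;
    interpretFunc.call(mmir.semantic, "please search for pizza", "en", function(result){
    	console.log(JSON.stringify(result));
    });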


< previous: "Setup MMIR for Internationalization" | next: "Getting Started" >
