This repository has been archived by the owner on May 7, 2020. It is now read-only.

new Speech-To-Text service interface #1021

Closed
kdavis-mozilla opened this issue Feb 11, 2016 · 42 comments

@kdavis-mozilla

I wanted to capture here several details related to a new interface for converting speech-to-text that Kai and our group discussed.

  1. A new interface org.eclipse.smarthome.io.multimedia.stt.STTService should be introduced that allows one to bind speech-to-text engines. The name and package of the interface are placeholders for now and may change.
  2. The interface STTService should implement the AudioSink interface of issue #584 (Refactor TTS APIs into a general audio dispatching allowing different sources and sinks).
  3. The interface STTService should expose a method (or methods) that can be passed an AudioSource and return recognition result(s). (Note: the details here are subtle, because the AudioSource may be of indefinite length, recognition results usually have associated confidences, and partial results are often returned while an AudioSource of indefinite length is being processed. So the details of this must be solidified.)
  4. The interface STTService should allow for grammars to be specified for speech recognition. The utility of this is that it allows for lower word error rates when dealing with systems not requiring large vocabulary continuous speech recognition.

If there is anything in the discussion not captured here, please feel free to add it!

@kaikreuzer
Contributor

Sounds pretty complete to me.
Will you come up with a suggestion for the STTService interface, so that this can then be taken as a starting point for further discussion?

@kdavis-mozilla
Author

In the coming week I should be able to create an initial version I can post here for comments.

@tavalin
Contributor

tavalin commented Feb 12, 2016

Re: 4 - simplistically, is this the list of commands that the speech recognition understands?

@kdavis-mozilla
Author

@tavalin It could be that simple:

grammar = hello;

Slightly more complicated:

grammar = [ on | off ];

Or even more complicated:

grammar = <basicCmd>;
<basicCmd> = <startPolite> <command> <endPolite>;
<startPolite> = (please | kindly | could you |  oh  mighty  computer)*;
<command> = <action> <object>;
<action> = [ open | close | delete | move ];
<object> = [the | a] (window | file | menu);
<endPolite> = [ please | thanks | thank you ];

Note, this is written in a pseudo-grammar that is hopefully self-explanatory.
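
For comparison, here is a rough sketch of how the same grammar might look in the JSpeech Grammar Format (JSGF). This is only an illustration; the exact notation (e.g. [] for optional parts, () for grouping, * for repetition) depends on the format we finally settle on:

#JSGF V1.0;
grammar politeness;

public <basicCmd> = <startPolite> <command> <endPolite>;
<startPolite> = (please | kindly | could you | oh mighty computer) *;
<command> = <action> <object>;
<action> = open | close | delete | move;
<object> = [the | a] (window | file | menu);
<endPolite> = [ please | thanks | thank you ];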

@tavalin
Contributor

tavalin commented Feb 12, 2016

That's as I expected and would obviously be useful/required with a "local"/ESH-specific STT service. Probably less so for third-party STT services.

I have some experience trying to integrate openHAB with third-party STT and Human Language Interpreter services, so this and #1028 are things I would like to get involved in.

@kaikreuzer changed the title from "A new Speech-To-Text binding interface sould be introduced" to "new Speech-To-Text service interface" on Feb 15, 2016
@kdavis-mozilla
Author

I've made a sketch of a possible STTService interface

public interface STTService {
    public SpeechRecognitionAlternative[] recognize(AudioSource audioSource, Locale locale, String[] grammars);
    public AudioFormat[] getSupportedFormats();
}

public interface SpeechRecognitionAlternative {
    public String getTranscript();
    public float getConfidence();
}

One thing which I've not completely thought through is how to deal with an AudioSource that is of an indefinite length.

If the AudioSource is of an indefinite length, then recognize() would block until, say, a VAD algorithm determines the speech has ended. This may take quite some time, not to mention the fact that it would miss any speech after the first pause.

The questions this sketch raises are

  • Is it OK for recognize() to block for an indefinite time period?
  • How is speech after the first VAD detected pause dealt with?

To address this, one could alternatively modify the signature of recognize() such that it returns void and takes a UUID, say, to uniquely identify the request,

public interface STTService {
    public void recognize(AudioSource audioSource, Locale locale, String[] grammars, UUID uuid);
    public AudioFormat[] getSupportedFormats();
}

but then, instead of delivering its results through return values, the service returns them by placing Event instances, something like

public class SpeechRecognitionEvent extends AbstractEvent {
  ...
  public UUID getUUID();
  public SpeechRecognitionAlternative[] getSpeechRecognitionAlternatives();
  ...
}

containing recognition results on the bus whenever a recognition occurs. This seems to be the "right way" to do things.

I'd be curious as to others' opinions on this.

@hkuhn42
Contributor

hkuhn42 commented Feb 16, 2016

One thing which I've not completely thought through is how to deal with an AudioSource that is of an indefinite length.

I would not support that in the beginning. Most online APIs do not support it anyway, and I fear it could cause all kinds of resource-locking problems (especially on smaller devices like the Pi). I think restricting the input to a certain maximum time (like 10-30 seconds) per recognition round trip would suffice for most commands. I do not see any use case for streaming-like continuous recognition. The input can always be chopped up and processed in multiple API calls.

@kaikreuzer
Contributor

I think the question of "when to stop" is important and should be considered in the interface design (even if it is a finite length audio source - it might still be 10GB ;-)).

I could imagine an interface similar to our DiscoveryService. This has (non-blocking) start/abort methods and provides its output through a listener interface to its subscribers.
This would give full control to the caller - whenever he is satisfied with the output, he can stop the recognition. If he does "surveillance", he can keep it running indefinitely, even through periods of silence.
And even if there is never any pause in the audio, he can interrupt the recognition at any time.

Just a syntactical remark: In ESH we tend to use Lists and Sets instead of arrays in interface signatures, so I'd suggest to do the same here.

@kdavis-mozilla
Author

Let me try to make a quick sketch of the STT interface with listeners before I run out the door...

I'd guess the core interfaces should be something like

public interface STTService {
    public void recognize(AudioSource audioSource, Locale locale, Collection<String> grammars);
    public Collection<AudioFormat> getSupportedFormats();
    public void addSTTListener(STTListener listener);
    public void removeSTTListener(STTListener listener);
}

public interface STTListener {
    public void speechRecognized(STTService sttService, Collection<SpeechRecognitionAlternative> speechRecognitionAlternatives);
}

public interface SpeechRecognitionAlternative {
    public String getTranscript();
    public float getConfidence();
}

@hkuhn42
Contributor

hkuhn42 commented Feb 17, 2016

@kdavis-mozilla I am still not exactly sure what the grammars parameter is for. Wouldn't the notation / meaning of a grammar be highly dependent on the actual implementation? I looked at the services I played around with and did not find any usage for it.

In general I think that we should not define an interface where multiple possible alternatives are available as a result. I know that many STT / NLP services offer this, but I always found it to cause more trouble than it is worth. If more than one "highly confident" result is available, some kind of handling must be done anyway. Would it not be much easier to just offer the "best / most confident" result and let the implementation decide whether its confidence is high enough (we would probably need a config parameter for this)? If not, some kind of handling must be done anyway. In the end, for real-world applications you cannot handle all results either (turn on everything because the item was unclear). So I would suggest going for the less flexible but much simpler solution of

public interface STTListener {
    public void speechRecognized(STTService sttService, SpeechRecognitionResult  recognitionResult);
}

public interface SpeechRecognitionResult {
    public String getTranscript();
    public float getConfidence();
}

Also I think we should offer a

public Collection<Locale> getSupportedLocales();

@kdavis-mozilla I really think we are getting there :)

@kdavis-mozilla
Author

@hkuhn42 I'm fine with both of your points.

@kaikreuzer
Contributor

getSupportedFormats and getSupportedLocales should return Sets, not Collections - we do not want to have duplicates in there, do we?

What about an abort method as suggested in #1021 (comment)?

@kaikreuzer
Contributor

@kdavis-mozilla Same question as @hkuhn42: What exactly is expected as the grammar parameter? Is this just a plain set of words as a vocabulary? Or is there more structure expected? Then we should probably formalize this in its own class definition.

@kdavis-mozilla
Author

@kaikreuzer I'm fine with having them return a Set.

As for abort() I'm also fine with that. But I think we have to think about how abort() will function with multiple recognitions occurring at the same time.

@kdavis-mozilla
Author

@kaikreuzer Generally, the grammar is specific to the STT engine being used. There are basically as many grammar formalisms as there are STT engines, a slight exaggeration but only slight.

Unfortunately, it's often the case that different engines which claim to use the same grammar formalism implement things in different ways or don't implement the entire spec or....

So, I'm slightly worried about trying to spend time on creating a meta-grammar that can map to any existing grammar. For example, for the WebSpeech API[1] the W3C punted on just this point.

@kaikreuzer
Contributor

But I think we have to think about how abort() will function with multiple recognitions occurring at the same time.

If you require recognize to be thread-safe (which you seem to say with this and which is a potential requirement indeed), you anyhow cannot have add/remove listener methods, but you need to pass the listener as a parameter to the recognize method. Likewise, recognize should return a handle on which you can then call abort(). Allowing multiple recognitions concurrently thus has a huge impact on the interface design, so we need to make a decision here.

Generally, the grammar is specific to the STT engine being used

In this case, I would not add it to the "generic" interface at all - because it simply does not define a clear contract then. It is rather up to the STT service implementation to gather what it needs for operation. It can itself access the item registry and read out the names of items etc.

@kdavis-mozilla
Author

@kaikreuzer @hkuhn42 First another try at the interface

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

public interface STTListener {
    public void speechRecognized(STTService sttService,
                                 SpeechRecognitionResult  recognitionResult);
}

public interface SpeechRecognitionResult {
    public String getTranscript();
    public float getConfidence();
}

Then some comments....

As far as I know, the only point of contention is grammars. I agree grammars does not define a clear contract. However, I think the alternative, having the "STT service implementation gather what it needs for operation", is also less than ideal.

Having the "STT service implementation gather what it needs for operation" implies that the STT service somehow knows what the text-to-action implementation defines as grammatical text. It doesn't know this, and it shouldn't. The implementation of the STTService and the implementation of the text-to-action interface should be decoupled as much as possible.

Requiring the STTService method recognize() be passed grammars is the means by which this decoupling is achieved. With this decoupling the text-to-action implementation need only know the grammar specification, e.g. JSpeech Grammar Format, of the STTService. And the STTService need not know anything about the text-to-action implementation.

While the lack of standard grammars is unfortunate, I think the tight coupling of a STTService implementation and a text-to-action implementation required by not allowing for a grammar is worse.

@kaikreuzer
Contributor

I think the STTListener.speechRecognized method does not require the sttService parameter - after all, the caller who created the listener also already has the sttservice at hand.

Regarding the grammar: I probably still do not have a good understanding of what is really passed in here. Since you say that there is no standard, what will this set of Strings look like and what exactly is the STTService supposed to do with it? Looking at your example above, it seems we need to come up with a certain format that is well defined?

@tavalin
Contributor

tavalin commented Feb 18, 2016

Maybe this is an obvious question, but following on from @kaikreuzer's point about example grammars, will the grammars (in whatever form they take) require localization?

@kdavis-mozilla
Author

@kaikreuzer What if a caller registers with multiple STTService instances? As far as I can tell, in this case the sttService would be required.

There is not one standard; there are multiple standards, all implemented with varying fidelity in various STT engines.

Rather than trying to create our own grammar format, the easier path would be to simply state that the grammar is in, say, the Speech Recognition Grammar Specification format. (Note, this is just an example. I'd like to research the decision before committing to any one format.)

@kdavis-mozilla
Author

@tavalin Yes, the grammars would be locale-specific.

@kaikreuzer
Contributor

What if a caller registers with multiple STTService instances?

Then he will pass different listener instances in the different STTService-recognize() calls, and if he wants the listener to know which service is relevant for it, he can easily inject the service upfront, e.g. in the constructor of the listener.

The easier path would be to simply state that grammar is in xyz format.

That was my point - we would have to settle for one grammar. But if we do so, is the content of this grammar then different depending on the TTAService that we want to use afterwards? If not, isn't the grammar a static configuration for the STTService and thus does not have to be passed in the recognize() call?

@kdavis-mozilla
Author

@kaikreuzer

Then he will pass different listener instances in the different STTService-recognize() calls...

This is fine with me. Just makes callers have many listener instances instead of one.

...we would have to settle for one grammar...

For callers this is great, but it will make implementation of the STTService much harder.

But if we do so, is the content of this grammar then different depending on the TTAService that we want to use afterwards? If not, isn't the grammar a static configuration for the STTService and thus does not have to be passed in the recognize() call?

I think maybe this is pointing to a misunderstanding. I'll try to clarify.

A grammar syntax, for example Speech Recognition Grammar Specification, JSpeech Grammar Format..., specifies what a syntactically valid grammar is. For example, say there is a grammar syntax X that specifies a pipe | is used to separate things that are "or'ed" together. Then a syntactically valid grammar according to the X grammar syntax would be

yes | no

Similarly, one could have a second grammar syntax, the Y grammar syntax, in which a double pipe || is used to separate things that are "or'ed" together. Then a syntactically valid grammar according to the Y grammar syntax would be

yes || no

What we are talking about now is if we are standardizing around the grammar syntax X or the grammar syntax Y.

What is passed as the grammars argument to the method recognize()

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

is a Set of String instances each of which is a syntactically valid grammar according to a particular grammar syntax, for example the Speech Recognition Grammar Specification syntax or the JSpeech Grammar Format syntax or... Generally such syntactically valid grammars increase the accuracy of the STTService by giving it hints as to what it should expect.

The TTAService, as it knows which sentences it "understands", would create a syntactically valid grammar according to a particular grammar syntax, for example the Speech Recognition Grammar Specification syntax or the JSpeech Grammar Format syntax or..., and then pass this syntactically valid grammar to the STTService, giving it hints as to what it should expect.

To make this more concrete: if we decide on the X grammar syntax, a syntactically valid grammar produced by a TTAService might be

Turn off the ceiling light. | Turn on the ceiling light.

This would be passed to the STTService to give it hints as to what to expect. If someone adds a new light and calls it "floor lamp", then the TTAService would have to create a new syntactically valid grammar

Turn off the ceiling light. | Turn on the ceiling light. | Turn off the floor lamp. | Turn on the floor lamp.

and pass it to the STTService.

I hope this is a bit clearer. Sorry this is so long!
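
Purely as an illustration (hypothetical code, not part of the proposed interfaces), a TTAService-like component could assemble such a grammar string from the item labels it knows about, assuming the simple X syntax from above:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: build a flat grammar in the "X syntax" (alternatives
// separated by '|') from a list of switchable item labels.
public class GrammarSketch {
    public static Set<String> buildGrammars(List<String> switchItemLabels) {
        String grammar = switchItemLabels.stream()
                .flatMap(label -> Arrays.stream(new String[] {
                        "Turn off the " + label + ".",
                        "Turn on the " + label + "." }))
                .collect(Collectors.joining(" | "));
        // recognize() takes a Set<String>; here it contains a single grammar string
        return new LinkedHashSet<>(Arrays.asList(grammar));
    }

    public static void main(String[] args) {
        // The single element of the returned set is:
        // Turn off the ceiling light. | Turn on the ceiling light. | Turn off the floor lamp. | Turn on the floor lamp.
        System.out.println(buildGrammars(Arrays.asList("ceiling light", "floor lamp")));
    }
}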

@tavalin
Contributor

tavalin commented Feb 18, 2016

@kdavis-mozilla looking at your example above, if we wanted to enable users to rephrase the command to "Turn the ceiling light on" for example, would we need to add every variation to the set of grammars, or would we expect the implementation to be able to cope?

I guess the question is how much does the service rely on the grammar?

@kdavis-mozilla
Author

@tavalin The short answer is "yes, we would need to add every variation if the service relied on the grammar". The long answer is....

A grammar is a tool to help speech recognition engines that are not able to handle unrestricted speech still be accurate. So a grammar is not required by all engines, only by some.

In particular, having a grammar will be useful if the STT engine has to fit on some "small form factor" device. For example, a device with 512MB of memory will likely not be able to handle unrestricted speech as "generically" the machine learning models used to recognize unrestricted speech take more than 512MB of memory.

However, if the device has sufficient memory, CPU power, and GPU power, there's no need for a grammar as the STT engine can understand unrestricted speech. But, such hardware requirements usually imply that the device is a server, as such machine learning models take at least 3GB of memory and often require beefy GPU cards.

@kaikreuzer
Contributor

@kdavis-mozilla Sorry, this all does not make much sense to me.
I think we need to differentiate between the syntax and the content of a grammar.

yes | no 

and

yes || no 

are two syntactically different grammars. A STT engine would have to KNOW about the syntax that is passed into it, so it could actually declare that it expects a grammar according to JSpeech and nothing else.

Turn off the ceiling light. | Turn on the ceiling light. | Turn off the floor lamp. | Turn on the floor lamp.

Is this a realistic example? Would this be one single string of the grammar parameter set? This imho does not make sense.
To my understanding, the grammars work with placeholders like

$command = Turn $action $object;
$action = on | off
$object = [the | a] (ceiling light | floor lamp);

These will have to be defined by us, because this directly relates to the "intents" that we want to support in TTA.
More precisely, what we have to define is

$command = Turn $action $object;
$action = on | off
$object = [the | a] (<list of all switch item names separated by | >);

But this is nothing the TTAService produces, is it? It rather needs this info as well to do its job (although with this input, the job is almost already done). So I do not see a flow from TTA->STT here (especially as the normal flow is that the output of STT is passed to TTA)...

@kdavis-mozilla
Author

@kaikreuzer Yes, we need to differentiate between the "syntax and the content of a grammar". This is why I was careful to always differentiate between a grammar syntax and a syntactically valid grammar.

In my example, X, with the "or-divider" |, is a grammar syntax, and

yes | no

is a syntactically valid grammar according to the grammar syntax X.

Yes, STT would have to know about the grammar syntax it considers valid, and it would have to declare what grammar syntax it considers valid.

As to...

Turn off the ceiling light. | Turn on the ceiling light. | Turn off the floor lamp. | Turn on the floor lamp.

it is a realistic example. Most grammar syntaxes use placeholders. But you don't have to use placeholders.

You are correct, what we have to define is something like

$command = Turn $action $object;
$action = on | off
$object = [the | a] (<list of all switch item names separated by | >);

However, this syntactically valid grammar should not be defined in a STTService. A STTService is "dumb". It just does speech-to-text.

The "smarts" of the system are in the implementation of the TTAService. The implementation of the TTAService interface knows what sentences can be understood, in other words it knows what sentences can be acted upon. As it knows what sentences are actionable, it can create a syntactically valid grammar for these sentences and pass this syntactically valid grammar to the STTService to assist the STTService in understanding the user's wishes.

Once the STTService does a conversion from speech-to-text, then the resulting text is passed to the TTAService implementation so it can act on the text.

@kaikreuzer
Contributor

The "smarts" of the system are in the implementation of the TTAService.

Right, but this is rather a coincidence and it does not imply that the TTAService interface has to deal with that fact - see my comment here.

I would want a STTService to be operable without having to associate a TTAService - I might simply only want to turn some audio to text and do nothing further with it. By declaring a dependency on a GrammarProvider, the STTService can do so. Also note that the <list of all switch item names separated by | > part will require additional services to be filled in (like the ItemRegistry; see my comment here).

@kdavis-mozilla
Author

@kaikreuzer I think there's nothing saying that the grammars parameter passed to a STTService has to come from a TTAService; as the grammars parameter is simply a Set<String>, it can come from anywhere.

I think the idea of the GrammarProvider interface is fine, and yes, whoever implements it would have to talk to additional services like the ItemRegistry.

@kdavis-mozilla
Author

To hopefully bring this one step closer to a conclusion, I'll try to summarize where we are.

For STT we will, in the package org.eclipse.smarthome.io.voice, introduce the following interfaces

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

public interface STTListener {
    public void speechRecognized(SpeechRecognitionResult  recognitionResult);
}

public interface SpeechRecognitionResult {
    public String getTranscript();
    public float getConfidence();
}
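
For illustration only (a hypothetical caller; sttService, audioSource and the grammar content are assumed, and imports are omitted as in the sketches above), usage could look like this:

STTListener listener = new STTListener() {
    @Override
    public void speechRecognized(SpeechRecognitionResult recognitionResult) {
        // act only on reasonably confident results; the threshold here is arbitrary
        if (recognitionResult.getConfidence() > 0.8f) {
            System.out.println("Heard: " + recognitionResult.getTranscript());
        }
    }
};

Set<String> grammars = Collections.singleton("yes | no");
STTServiceHandle handle = sttService.recognize(listener, audioSource, Locale.US, grammars);
// ... later, once the caller is satisfied with the results:
handle.abort();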

Are there any objections to this API?

@kaikreuzer
Contributor

Shall we have something like a STTListener.done() method to give the engine a chance to tell that there won't be any further results passed to the listener? This could happen, e.g., if the stream of the AudioSource came to an end.

@kdavis-mozilla
Author

@kaikreuzer Wouldn't this only be of use when the AudioSource is of a definite length? For a streaming AudioSource the STTService could not determine if the audio is complete and thus could not call STTListener.done().

@kaikreuzer
Contributor

Well, isn't any stream of a definite length, and for some you just cannot tell when it ends? I mean, even for streams without a length you can run into the situation that the stream ends (because the device with the microphone is turned off, because the user pressed a button to stop recording, or because the incoming call has ended). The caller of the STTService has no clue about this, because it is the STTService itself which has the stream.
Now imagine that you want to send the transcript of an incoming call via email. How will you know when the call is over and you can stop waiting for further SpeechRecognitionResults?

@kdavis-mozilla
Author

@kaikreuzer I agree with all you say. I guess I was hoping not to have to introduce a lot more machinery, which, unfortunately, I think we need.

That being said, I think the STTService should inform the STTListener of more than just when the audio is "done". There are many obvious use cases that require such notifications. For example, a GUI/VUI could need to indicate when the system

  • Started trying to recognize
  • Heard audio
  • Heard speech
  • Didn't hear speech anymore
  • Produced a SpeechRecognitionResult
  • Didn't hear audio anymore
  • Stopped trying to recognize
  • Encountered an error

One can easily think of cases where all of this is of use by, for example, considering the GUI feedback provided by something like Echo's "UFO lights".

So with that being said, how about doing something like this

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

as before, but then also do something slightly new by introducing events to handle the above cases:

public class STTEvent extends EventObject {
 ...
}
public class RecognitionStartEvent extends STTEvent {
 ...
}
public class AudioStartEvent extends STTEvent {
 ...
}
public class SpeechStartEvent extends STTEvent {
 ...
}
public class SpeechStopEvent extends STTEvent {
 ...
}
public class SpeechRecognitionEvent  extends STTEvent {
    public String getTranscript();
    public float getConfidence();
...
}
public class AudioStopEvent extends STTEvent {
 ...
}
public class RecognitionStopEvent extends STTEvent {
 ...
}
public class SpeechRecognitionErrorEvent extends STTEvent {
 ...
}

public interface STTListener {
    public void sttEventReceived(STTEvent  sttEvent);
}
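
To sketch how a consumer might react to these (a hypothetical listener; the class name and the UI reactions are illustrative only):

// Hypothetical sketch: dispatch on the concrete STTEvent type, e.g. to drive a
// "listening" indicator and to forward transcripts.
public class IndicatorSTTListener implements STTListener {
    @Override
    public void sttEventReceived(STTEvent sttEvent) {
        if (sttEvent instanceof SpeechStartEvent) {
            // e.g. light up the "listening" indicator
        } else if (sttEvent instanceof SpeechStopEvent) {
            // e.g. dim the indicator again
        } else if (sttEvent instanceof SpeechRecognitionEvent) {
            SpeechRecognitionEvent event = (SpeechRecognitionEvent) sttEvent;
            System.out.println(event.getTranscript() + " (" + event.getConfidence() + ")");
        } else if (sttEvent instanceof SpeechRecognitionErrorEvent) {
            // e.g. surface the error to the user
        }
        // RecognitionStart/StopEvent and AudioStart/StopEvent can be ignored here
    }
}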

@hkuhn42
Contributor

hkuhn42 commented Feb 26, 2016

But would not all these events and callbacks essentially be useless for any kind of online STT service (in essence, none of those callbacks could be called before such a service finishes)?
Also, what should a calling service "do" with that kind of detailed information? In my understanding, the primary use case for the interface would be some kind of voice command system for ESH.
I think that an interface more along the lines of the current draft of #1081 would make usage and implementation much easier without sacrificing any high-level features.

@kdavis-mozilla
Author

@hkuhn42 Actually, no, they will be of use in such an online case.

In the online case the AudioSource will be of an indefinite length and the STTService can use the various events as follows:

  • A RecognitionStartEvent is fired when recognition starts
  • An AudioStartEvent is fired when the STTService first starts detecting sound
  • A SpeechStartEvent is fired when the STTService first starts detecting speech
  • A SpeechStopEvent is fired when the STTService VAD detects a phrase/sentence ending.
  • A SpeechRecognitionEvent is fired with the recognition of the speech that occurred between the last SpeechStartEvent and SpeechStopEvent events.
  • The previous three steps repeat as often as the user says something
  • An AudioStopEvent is fired when someone turns off the mic
  • A RecognitionStopEvent is fired thereafter
  • A SpeechRecognitionErrorEvent is fired if, say, network access to a continually listening Bluetooth mic is interrupted

@kdavis-mozilla
Author

@hkuhn42 As to what the use cases are, I think the Echo is a good example. It lights up when someone starts talking. It stops being lit up when someone stops. Siri does a similar thing. Beyond good UI, I think this is also a security feature as the system should indicate when it is listening and when it's not.

@hkuhn42
Contributor

hkuhn42 commented Feb 27, 2016

@kdavis-mozilla I understand your point. I also recognize that you probably have a lot more experience with voice recognition.
However, this imho only makes sense in integrated devices like the Echo or maybe a smartphone.
The API does not make as much sense if you split it up. In my rather primitive prototype, audio capturing is done by a browser UI. The speech recognition is done by a web service (Microsoft Oxford).
In this scenario (and a couple of others I can think of), audio capture and thus feedback for 'listening' is done by one service (the browser, a smart mirror, a lightbulb, ...) and the actual voice recognition is done by a different service (in other software on the same device, in the cloud, on an ESH server in the basement, ...). So I think we should at least separate the audio capture related stuff (we can and should, however, keep these requirements in mind if we decide to create a separate audio capture service that handles local microphones, for example via #584).

Also, most of the web services for STT I looked into do not allow or support unlimited audio. They accept only small packages of around 10 seconds per call. Larger portions have to be chunked. So I think the service should also support handling short audio fragments without too much overhead.

@kdavis-mozilla
Author

@hkuhn42 By "split it up" I take it you mean split the audio capture from speech recognition. I'll comment below assuming that's what you mean. If that's not what you mean, then ignore what I say below.

As far as I understand, the API I am suggesting splits "audio capture", the device that creates an AudioSource, from "speech recognition", the code that implements STTService.

In your example the audio from the browser would eventually have to be presented as an AudioSource to be consumed by a STTService. So the "audio capture", creation of an AudioSource, is separated from the "speech recognition", consumption of the AudioSource by a STTService.

Maybe your confusion is in regard to the events AudioStartEvent and AudioStopEvent? All other events (RecognitionStartEvent, SpeechStartEvent, SpeechStopEvent, SpeechRecognitionEvent, RecognitionStopEvent, and SpeechRecognitionErrorEvent) can only be fired by the STTService.

The events AudioStartEvent and AudioStopEvent, or their analog, could indeed also be fired by some other interface, an AudioCaptureService say. Indeed, for an AudioSource of finite length it makes sense for that AudioCaptureService to fire these two events.

If a particular web service only supports small packages of audio, 10 seconds per call say, then the current API handles that. In both the streaming and non-streaming case, 10 seconds after the STTService has fired a RecognitionStartEvent it simply fires a RecognitionStopEvent and does not do anything after that for the corresponding AudioSource.

@hkuhn42
Contributor

hkuhn42 commented Feb 27, 2016

@kdavis-mozilla

By "split it up" I take it you mean split the audio capture from speech recognition. I

Yes, that's what I meant! Sorry if I was too unspecific!

The events AudioStartEvent and AudioStopEvent, or their analog, could indeed also be fired by some other interface, an AudioCaptureService say. Indeed, for an AudioSource of finite length it makes sense for that AudioCaptureService to fire these two events.

I would move AudioStartEvent and AudioStopEvent to said AudioCaptureService. However, if you see a need, we can have them on both. Apart from that, most of the complexity can probably be explained in the documentation, where we will need to describe the order and meaning of these events in detail. So I still think it is quite complex, but I trust your experience in this, so I am fine with the interface 👍.

@kaikreuzer @kdavis-mozilla So from my perspective there is only one minor thing to clarify: my understanding would be that the STTEvents are ESH Events (extend the ESH Event interface) and would also be available on the event bus. Is this correct?

@kdavis-mozilla
Author

@hkuhn42 The events can extend from whatever is most appropriate. I'll let Kai make that call, and we can move AudioStartEvent and AudioStopEvent to an AudioCaptureService, say, but that might complicate usage a bit as the listener must now know about AudioCaptureService instances and STTService instances. I don't know which is best.

I think the API is quite complex, but more or less in line with other similar STT APIs. For example

@kaikreuzer
Contributor

I like @kdavis-mozilla's proposal, as I think the interface stays lean and simple (sttEventReceived(STTEvent sttEvent)) and an implementation only needs to handle what it is interested in. If all you want to know about is the RecognitionStopEvent (as per my example above), you only need to handle this. But it leaves you the possibility to also do more.

my understanding would be that the STTEvents are ESH Events

No, not at all! ESH events are events that are sent on the internal event bus and are as such "globally" available to all interested subscribers. In the STT case, the events are local to a specific recognize call and the listener that was passed in with it. So I'd say we can simply define STTEvent as a marker interface, which does not have to extend anything.

Wrt AudioCaptureService: I would not do this split for now to keep things simple (well, they are already complex enough). Let's deal with that if we really have the need for it.
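
A minimal sketch of that idea (assuming a plain marker interface, not tied to the ESH event bus):

public interface STTEvent {
    // intentionally empty; concrete events such as SpeechRecognitionEvent carry the payload
}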

kdavis-mozilla added a commit to mozilla/smarthome that referenced this issue Mar 17, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
kdavis-mozilla added a commit to mozilla/smarthome that referenced this issue Mar 21, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 13, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 20, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 20, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 21, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 21, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue May 9, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue May 9, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Jun 8, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Jun 8, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>