new Speech-To-Text service interface #1021
Comments
Sounds pretty complete to me.
In the coming week I should be able to create an initial version I can post here for comments.
Re: 4 - simplistically, is this the list of commands that the speech recognition understands?
@tavalin It could be that simple - a plain list of commands - slightly more complicated, or even more complicated. Note, this is in a pseudo-grammar that hopefully is self-explanatory.
That's as I expected and would obviously be useful/required with a "local"/ESH-specific STT service. Probably less so for third-party STT services. I have some experience trying to integrate openHAB with third-party STT and Human Language Interpreter services, so this and #1028 are things I would like to get involved in.
I've made a sketch of a possible STTService:

public interface STTService {
    public SpeechRecognitionAlternative[] recognize(AudioSource audioSource, Locale locale, String[] grammars);
    public AudioFormat[] getSupportedFormats();
}

public interface SpeechRecognitionAlternative {
    public String getTranscript();
    public float getConfidence();
}

One thing which I've not completely thought through is how to deal with an AudioSource of indefinite length. If the AudioSource never ends, a blocking recognize() call could never return its results. The questions this sketch raises are how recognition results, their confidences, and partial results should be reported in that case.
To address this one could, alternatively, modify the signature of recognize():

public interface STTService {
    public void recognize(AudioSource audioSource, Locale locale, String[] grammars, UUID uuid);
    public AudioFormat[] getSupportedFormats();
}

The service then, instead of returning its results through return values, returns them by placing a

public class SpeechRecognitionEvent extends AbstractEvent {
    ...
    public UUID getUUID();
    public SpeechRecognitionAlternative[] getSpeechRecognitionAlternatives();
    ...
}

containing recognition results on the bus whenever a recognition occurs. This seems to be the "right way" to do things. I'd be curious as to others' opinions on this.
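To make the UUID correlation concrete, here is a minimal sketch, assuming the SpeechRecognitionEvent and SpeechRecognitionAlternative interfaces sketched above, of a caller that only reacts to events matching the UUID it passed to recognize(). RecognitionClient and onEvent() are hypothetical names, and the subscription to the actual event bus is omitted.

import java.util.UUID;

public class RecognitionClient {
    private final UUID requestId = UUID.randomUUID();

    // The UUID handed to STTService.recognize(audioSource, locale, grammars, uuid)
    public UUID getRequestId() {
        return requestId;
    }

    // Invoked for every SpeechRecognitionEvent observed on the bus
    public void onEvent(SpeechRecognitionEvent event) {
        if (!requestId.equals(event.getUUID())) {
            return; // result belongs to a different recognition request
        }
        for (SpeechRecognitionAlternative alt : event.getSpeechRecognitionAlternatives()) {
            System.out.printf("%s (confidence %.2f)%n", alt.getTranscript(), alt.getConfidence());
        }
    }
}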
I would not support that in the beginning. Most online APIs do not support it anyway, and I fear it could cause all kinds of resource-locking problems (especially on smaller devices like the Pi). I think restricting the input to a certain maximum time (like 10-30 seconds) per recognition round trip would suffice for most commands. I do not see any use case for streaming-like continuous recognition. The input can always be chopped up and processed in multiple API calls.
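A minimal sketch of such chunking, assuming raw 16 kHz / 16-bit mono PCM input; AudioChunker and recognizeChunk() are illustrative names, not proposed API.

import java.io.IOException;
import java.io.InputStream;

public class AudioChunker {
    // Bytes per second of 16 kHz, 16-bit mono PCM audio (assumed format)
    private static final int BYTES_PER_SECOND = 16_000 * 2;

    public static void process(InputStream audio, int secondsPerChunk) throws IOException {
        byte[] chunk = new byte[BYTES_PER_SECOND * secondsPerChunk];
        int read;
        while ((read = audio.readNBytes(chunk, 0, chunk.length)) > 0) {
            recognizeChunk(chunk, read); // one API round trip per chunk
        }
    }

    private static void recognizeChunk(byte[] data, int length) {
        // placeholder for the actual call to the online STT service
    }
}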
I think the question of "when to stop" is important and should be considered in the interface design (even if it is a finite-length audio source - it might still be 10GB ;-)). I could imagine an interface similar to our DiscoveryService. This has (non-blocking) start/abort methods and provides its output through a listener interface to its subscribers. Just a syntactical remark: in ESH we tend to use Lists and Sets instead of arrays in interface signatures, so I'd suggest to do the same here.
Let me try and make a quick sketch of the STT interface with listeners before I run out the door... I'd guess the core interfaces should be something like:

public interface STTService {
    public void recognize(AudioSource audioSource, Locale locale, Collection<String> grammars);
    public Collection<AudioFormat> getSupportedFormats();
    public void addSTTListener(STTListener listener);
    public void removeSTTListener(STTListener listener);
}

public interface STTListener {
    public void speechRecognized(STTService sttService, Collection<SpeechRecognitionAlternative> speechRecognitionAlternatives);
}

public interface SpeechRecognitionAlternative {
    public String getTranscript();
    public float getConfidence();
}
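As a quick illustration of the listener side, here is a sketch, assuming the interfaces just proposed, of a listener that simply picks the most confident alternative; BestAlternativeListener is a hypothetical name.

import java.util.Collection;

public class BestAlternativeListener implements STTListener {
    @Override
    public void speechRecognized(STTService sttService,
            Collection<SpeechRecognitionAlternative> alternatives) {
        SpeechRecognitionAlternative best = null;
        for (SpeechRecognitionAlternative alt : alternatives) {
            if (best == null || alt.getConfidence() > best.getConfidence()) {
                best = alt; // keep the highest-confidence alternative
            }
        }
        if (best != null) {
            System.out.println("Recognized: " + best.getTranscript());
        }
    }
}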
@kdavis-mozilla I am still not exactly sure what the grammars parameter is for. Would not the notation / meaning of a grammar be highly dependent on the actual implementation? I looked at the services I played around with and did not find any usage for it. In general I think that we should not define an interface where multiple possible alternatives are available as a result. I know that many STT / NLP services offer this, but I always found it to cause more trouble than it is worth: if more than one "highly confident" result is available, some kind of handling must be done anyway. Would it not be much easier to just offer the "best / most confident" result and let the implementation decide whether its confidence is high enough (we would probably need a config param for this)? In the end, for real-world applications, you cannot handle all results anyway (you would not turn on everything just because the item was unclear). So I would suggest going for the less flexible but much simpler solution of returning a single result.
Also I think we should offer a way to query the locales a service supports.
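A minimal sketch of that single-result-plus-threshold idea, assuming the SpeechRecognitionAlternative interface from above; ConfidenceGate and the 0.8 value are illustrative stand-ins for whatever the service configuration would provide.

public class ConfidenceGate {
    private final float minConfidence;

    public ConfidenceGate(float minConfidence) {
        this.minConfidence = minConfidence; // e.g. 0.8 from service configuration
    }

    // Returns the transcript if the result is confident enough, otherwise null
    public String accept(SpeechRecognitionAlternative best) {
        return best.getConfidence() >= minConfidence ? best.getTranscript() : null;
    }
}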
@kdavis-mozilla I really think we are getting there :)
@hkuhn42 I'm fine with both of your points.
What about an abort method as suggested in #1021 (comment)?
@kdavis-mozilla Same question as @hkuhn42: What exactly is expected as the grammar parameter? Is this just a plain set of words as a vocabulary? Or is there more structure expected? Then we should probably formalize this in its own class definition.
@kaikreuzer I'm fine with having them return a Set instead of an array. As for the abort method, I'll work one into the next version of the sketch.
@kaikreuzer Generally, the grammar is specific to the STT engine being used. There are basically as many grammar formalisms as there are STT engines - a slight exaggeration, but only slight. Unfortunately, it's often the case that different engines which claim to use the same grammar formalism implement things in different ways, or don't implement the entire spec, or.... So, I'm slightly worried about trying to spend time on creating a meta-grammar that can map to any existing grammar. For example, for the WebSpeech API[1] the W3C punted on just this point.
If you require a specific grammar
In this case, I would not add it to the "generic" interface at all - because it simply does not define a clear contract then. It is rather up to the STT service implementation to gather what it needs for operation. It can itself access the item registry and read out the names of items etc.
@kaikreuzer @hkuhn42 First another try at the interface:

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

public interface STTListener {
    public void speechRecognized(STTService sttService,
                                 SpeechRecognitionResult recognitionResult);
}

public interface SpeechRecognitionResult {
    public String getTranscript();
    public float getConfidence();
}

Then some comments.... As far as I know, the only point of contention is the grammars parameter. Having the "STT service implementation gather what it needs for operation" implies that the STT service somehow knows what the text-to-action implementation defines as a grammatical text. It doesn't know this, and it shouldn't know. The implementation of the text-to-action service is what defines which texts are grammatical. Requiring the STT service to gather this knowledge itself couples the two in the wrong direction. While the lack of standard grammars is unfortunate, I think the tight coupling of a grammar to the STT engine that consumes it is something we have to live with.
I think the STTService parameter of speechRecognized() is not needed. Regarding the grammar: I probably still do not have a good understanding of what is really passed in here. Since you say that there is no standard, how will this set of Strings look, and what exactly is the STTService supposed to do with it? Looking at your example above, it seems we need to come up with a certain format that is well defined?
Maybe this is an obvious question, but following on from @kaikreuzer's point about example grammars, will the grammars (in whatever form they take) require localization?
@kaikreuzer What if a caller registers the same STTListener with multiple STTServices? The listener then has no way to tell which service a result came from. There is not one standard; there are multiple standards, all implemented with varying fidelity in various STT engines. Rather than trying to create our own grammar format, the easier path would be to simply state that we standardize on one of the existing formats.
@tavalin Yes, the grammars would be locale specific.
Then he will pass different listener instances in the different STTService.recognize() calls, and if he wants the listener to know which service is relevant for it, it can easily inject the service upfront, e.g. in the constructor of the listener.
That was my point - we would have to settle for one grammar. But if we do so, is the content of this grammar then different depending on the TTAService that we want to use afterwards? If not, isn't the grammar a static configuration for the STTService, and thus does not have to be passed in the recognize() call at all?
This is fine with me. It just makes callers have many listener instances instead of one.
For callers this is great, but it will make implementation of the STTService harder, as each service would then have to obtain and maintain its grammar itself.
I think maybe this is pointing to a misunderstanding. I'll try to clarify. A grammar syntax, for example the Speech Recognition Grammar Specification, the JSpeech Grammar Format..., specifies what a syntactically valid grammar is. For example, say there is a grammar syntax X that specifies a pipe | as the "or-divider" between alternatives.
Similarly, one could have a second grammar syntax, the Y grammar syntax, in which a double pipe || plays the role of the "or-divider".
What we are talking about now is if we are standardizing around the grammar syntax X or the grammar syntax Y. What is passed in the grammars parameter of

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

is a set of grammars that are syntactically valid in whichever syntax we standardize on. The STTService implementation then uses these grammars to restrict what it will recognize. To make this more concrete: if we decide on the X grammar syntax, a syntactically valid grammar produced by a caller could be

Turn on the <item> | Turn off the <item>

This would be passed to the recognize() method as an element of the grammars set, and the STTService implementation would translate it into whatever its underlying engine expects and pass it to the engine. I hope this is a bit clearer. Sorry this is so long!
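For what it's worth, here is a small sketch of how a caller might mechanically expand such a placeholder template against item labels before calling recognize(); the <item> placeholder, GrammarBuilder, and the example labels are all illustrative, not a proposed format.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GrammarBuilder {
    // Expand a template once per item label, yielding one grammar per item
    public static Set<String> expand(String template, List<String> itemLabels) {
        Set<String> grammars = new LinkedHashSet<>();
        for (String label : itemLabels) {
            grammars.add(template.replace("<item>", label));
        }
        return grammars;
    }

    public static void main(String[] args) {
        Set<String> grammars = expand("Turn on the <item> | Turn off the <item>",
                List.of("ceiling light", "kitchen light"));
        grammars.forEach(System.out::println);
    }
}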
@kdavis-mozilla Looking at your example above, if we wanted to enable users to rephrase the command to "Turn the ceiling light on", for example, would we need to add every variation to the set of grammars, or would we expect the implementation to be able to cope? I guess the question is: how much does the service rely on the grammar?
@tavalin The short answer: "yes, we would need to add every variation if the service relied on the grammar". The long answer is.... A grammar is a tool to help speech recognition engines that are not able to handle unrestricted speech still be accurate. So, a grammar is not required by all engines, only by some. In particular, having a grammar will be useful if the STT engine has to fit on some "small form factor" device. For example, a device with 512MB of memory will likely not be able to handle unrestricted speech, as "generically" the machine learning models used to recognize unrestricted speech take more than 512MB of memory. However, if the device has sufficient memory, CPU power, and GPU power, there's no need for a grammar, as the STT engine can understand unrestricted speech. But such hardware requirements usually imply that the device is a server, as such machine learning models take at least 3GB of memory and often require beefy GPU cards.
@kdavis-mozilla Sorry, this all does not make much sense to me. Your X and Y examples are two syntactically different grammars. A STT engine would have to KNOW about the syntax that is passed into it, so it could actually declare that it expects a grammar according to JSpeech and nothing else.
Regarding the placeholder example: is this realistic? Would this be one single string of the grammar parameter set? This imho does not make sense.
The placeholders will have to be defined by us, because this directly relates to the "intents" that we want to support in TTA.
But this is nothing the TTAService produces, is it? It rather needs this info as well to do its job (although with this input, the job is almost already done). So I do not see a flow from TTA->STT here (especially as the normal flow is that the output of STT is passed to TTA)...
@kaikreuzer Yes, we need to differentiate between the "syntax and the content of a grammar". This is why I was careful to always differentiate between a grammar syntax and a syntactically valid grammar. In my example X, with the "or-divider" |,

Turn on the <item> | Turn off the <item>

is a syntactically valid grammar according to the grammar syntax X. Yes, an STT engine would have to know about the grammar syntax it considers valid, and it would have to declare what grammar syntax it considers valid. As to whether the placeholder example is realistic:
it is a realistic example. Most grammar syntaxes use placeholders. But you don't have to use placeholders. You are correct, what we have to define is something like the set of placeholders, as these directly relate to the "intents" we want to support.
However, this syntactically valid grammar should not be defined in a static configuration. The "smarts" of the system are in the implementation of the component that produces the grammar - which happens to be the same component that later interprets the recognized text. Once the grammar is produced, it is simply passed along in the recognize() call.
Right, but this is rather a coincidence, and it does not imply that the TTAService interface has to deal with that fact - see my comment here. I would want a TTAService that simply takes a text as input, no matter where that text came from.
@kaikreuzer I think there's nothing saying that the TTAService interface itself has to produce the grammar. I think the idea of the grammars parameter is simply that whoever initiates the recognition can restrict what will be recognized, whether that caller is a TTA implementation or something else.
To hopefully bring this one step closer to a conclusion, I'll try to summarize where we are. For STT we will, in the package org.eclipse.smarthome.io.multimedia.stt, introduce the following interfaces:

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

public interface STTListener {
    public void speechRecognized(SpeechRecognitionResult recognitionResult);
}

public interface SpeechRecognitionResult {
    public String getTranscript();
    public float getConfidence();
}

Are there any objections to this API?
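A brief usage sketch of this summarized API, assuming an AudioSource from #584; the grammar string, the locale, and the fixed 10 s budget are illustrative choices, not part of the proposal.

import java.util.Locale;
import java.util.Set;

public class RecognitionExample {
    public void run(STTService stt, AudioSource source) throws InterruptedException {
        // Print each recognized transcript with its confidence
        STTListener listener = result -> System.out.printf("%s (%.2f)%n",
                result.getTranscript(), result.getConfidence());
        STTServiceHandle handle = stt.recognize(listener, source, Locale.US,
                Set.of("Turn on the <item> | Turn off the <item>"));
        Thread.sleep(10_000);
        handle.abort(); // give up if nothing was recognized within the budget
    }
}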
Shall we have something like a notification that tells the listener when the recognition has ended?
@kaikreuzer Wouldn't this only be of use when the AudioSource is of indefinite length?
Well, isn't any stream of a definite length - and for some you just cannot tell when it ends? I mean, even for streams without a known length you can run into the situation that the stream ends (because the device with the microphone is turned off, because the user pressed a button to stop recording, or because the incoming call has ended). The caller of the recognize() method should be informed about such situations as well.
@kaikreuzer I agree with all you say. I guess I was hoping not to have to introduce a lot more machinery. Which, unfortunately, I think we need. That being said, I think the listener should be informed of the whole lifecycle: when recognition starts and stops, when the audio starts and stops, and when speech starts and stops.
One can easily think of cases where all of this is of use by, for example, considering the GUI feedback provided by something like Echo's "UFO lights". So with that being said, how about doing something like this:

public interface STTService {
    public Set<Locale> getSupportedLocales();
    public Set<AudioFormat> getSupportedFormats();
    public STTServiceHandle recognize(STTListener sttListener,
                                      AudioSource audioSource,
                                      Locale locale,
                                      Set<String> grammars);
}

public interface STTServiceHandle {
    public void abort();
}

as before. But then also doing something slightly new by introducing events to handle the above cases:

public class STTEvent extends EventObject {
    ...
}

public class RecognitionStartEvent extends STTEvent {
    ...
}

public class AudioStartEvent extends STTEvent {
    ...
}

public class SpeechStartEvent extends STTEvent {
    ...
}

public class SpeechStopEvent extends STTEvent {
    ...
}

public class SpeechRecognitionEvent extends STTEvent {
    public String getTranscript();
    public float getConfidence();
    ...
}

public class AudioStopEvent extends STTEvent {
    ...
}

public class RecognitionStopEvent extends STTEvent {
    ...
}

public class SpeechRecognitionErrorEvent extends STTEvent {
    ...
}

public interface STTListener {
    public void sttEventReceived(STTEvent sttEvent);
}
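To show what a consumer of these events might look like, here is a sketch of a listener driving the kind of "UFO lights" feedback mentioned above; FeedbackListener and LightController are hypothetical, and the instanceof dispatch is just one possible style.

public class FeedbackListener implements STTListener {
    private final LightController light = new LightController();

    @Override
    public void sttEventReceived(STTEvent e) {
        if (e instanceof SpeechStartEvent) {
            light.on(); // user started talking
        } else if (e instanceof SpeechStopEvent) {
            light.off(); // user stopped talking
        } else if (e instanceof SpeechRecognitionEvent) {
            SpeechRecognitionEvent r = (SpeechRecognitionEvent) e;
            System.out.printf("%s (%.2f)%n", r.getTranscript(), r.getConfidence());
        } else if (e instanceof SpeechRecognitionErrorEvent) {
            light.off(); // stop signalling on errors too
        }
    }

    // Hypothetical stand-in for whatever indicates "listening" to the user
    static class LightController {
        void on()  { System.out.println("indicator on"); }
        void off() { System.out.println("indicator off"); }
    }
}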
But would not all these events and callbacks essentially be useless for any kind of online STT service (in essence, none of those callbacks could be called before such a service finishes)?
@hkuhn42 Actually, no, they will be of use in such an online case too. In the online case the STTService implementation can still fire the events as it submits audio to, and receives results from, the remote service.
@hkuhn42 As to what the use cases are, I think the Echo is a good example. It lights up when someone starts talking. It stops being lit up when someone stops. Siri does a similar thing. Beyond good UI, I think this is also a security feature, as the system should indicate when it is listening and when it's not.
@kdavis-mozilla I understand your point. I also recognize that you probably have a lot more experience with voice recognition. Also, most of the web services for STT I looked into do not allow or support unlimited audio. They accept only small packages of, say, 10s per call. Larger portions have to be chunked. So I think the service should also support handling short audio fragments without too much overhead.
@hkuhn42 By "split it up" I take it you mean split the audio capture from the speech recognition. I'll comment below assuming that's what you mean. If that's not what you mean, then ignore what I say below. As far as I understand, the API I am suggesting already splits "audio capture", the component that creates an AudioSource, from "speech recognition", the STTService that consumes it. In your example, the audio from the browser would eventually have to be presented as an AudioSource too. Maybe your confusion is in regards to the events. The events AudioStartEvent and AudioStopEvent only describe what the STTService observes on the AudioSource it was handed; they do not imply that the STTService captures audio itself. If a particular web service only supports small packages of audio, 10s per call say, then the current API handles that. In both the streaming and non-streaming case, 10s after the audio begins the implementation can simply make its API call and fire the corresponding events once the results arrive.
Yes, that's what I meant! Sorry if I was too unspecific!
I would move AudioStartEvent and AudioStopEvent to said AudioCaptureService. However, if you see a need, we can have them on both. Apart from that, most of the complexity can probably be explained in the documentation, where we will need to describe the order and meaning of these events in detail. So I still think it is quite complex, but I trust your experience in this, so I am fine with the interface 👍. @kaikreuzer @kdavis-mozilla So from my perspective there is only one minor thing to clarify: my understanding would be that the STTEvents are ESH events (extend the ESH Event interface) and would also be available on the event bus. Is this correct?
@hkuhn42 The events can extend from whatever is most appropriate. I'll let Kai make that call, and we can move AudioStartEvent and AudioStopEvent if that turns out to be cleaner. I think the API is quite complex, but more-or-less in line with other similar STT APIs. For example, the Web Speech API defines an almost identical set of events.
I like @kdavis-mozilla's proposal, as I think the interface stays lean and simple (a single callback that receives all event types).
No, not at all! ESH events are events that are sent on the internal event bus and are as such "globally" available to all interested subscribers. In the STT case, the events are local to a specific recognize() call and are only passed to the listener that was registered for it, so they should not go onto the event bus. Wrt AudioStartEvent and AudioStopEvent, I am fine with keeping them where they are for now.
I wanted to capture here several details related to a new interface for converting speech-to-text that Kai and our group discussed.

1. A new interface org.eclipse.smarthome.io.multimedia.stt.STTService should be introduced that allows one to bind speech-to-text engines. The name and package of the interface may change and is simply a placeholder for now.
2. STTService should implement the AudioSink interface of issue "Refactor TTS APIs into a general audio dispatching allowing different sources and sinks" #584.
3. STTService should expose a method(s) that allow it to be passed an AudioSource and return recognition result(s). (Note: the details of this are subtle, because the AudioSource may be of an indefinite length, recognition results usually have associated confidences, and often partial results are returned when an AudioSource of indefinite length is being processed.... So, the details of this must be solidified.)
4. STTService should allow for grammars to be specified for speech recognition. The utility of this is that it allows for lower word error rates when dealing with systems not requiring large vocabulary continuous speech recognition.

If there is anything in the discussion not captured here, please feel free to add it!