A human language interpreter binding interface #1028
As already stated in #584, I worked on something similar for some time and would like to contribute to this. I already did some research regarding different ways to implement voice command interfaces (so speech to text, to action, to text, to speech) and made an extremely crude prototype (I was sidetracked by implementing an audio API among other things - see ICommandInterpreter in my sylvani repo). |
@hkuhn42 Note that this is about "text to intent", so no "voice" is directly involved here - so I am actually not clear on whether a "voice-to-intent" service (do you have an example for such a service?) would fit in here at all. @tilmankamp: Regarding the "grammar getter" (2): Is this something the service must provide or is this information that the runtime would have to provide to the service? Do you have an example of how that looks? I am not sure if this is really the same "grammar" as on the STTService (where I rather thought that a "vocabulary" is provided)? |
@kaikreuzer: there are services like Nuance Mix which do not convert language to text but interpret that text and deliver a JSON representation (which is also a kind of text) of the intent as result (please see Mix). I was planning on also supporting interpreters for that kind of service in my project. However, you are right in that this does not exactly match the requirements. @tilmankamp: In general I think that it might prove quite difficult to specify a vocabulary inside the API for certain kinds of implementations. To quote some other examples, I was experimenting with using OpenNLP and alternatively Lucene to build indices or models of the items and channels inside openHAB and then match the text input (may it be from a voice-to-text service, a chatbot or a simple textarea) against these. For neither approach would a vocabulary be easy to specify by the service. |
No, this is already the "intent", so also the output of the text-to-intent service suggested here.
Nice website with cool explanations! |
The two services I've used to generate intent data (they can handle voice or text to generate the intent) both have different, but similar formats. Having had a quick look at Mix, they are using what looks like yet another JSON format. The way I could see this working is that ESH picks some sort of standard JSON structure (maybe an established structure or maybe its own) for intents and we generate (or translate, if we are using third-party services like Mix, wit.ai or api.ai etc.) this standard JSON and pass it on to the "intent processor". This should hopefully simplify the job of the "intent processor" if it works on a standard input. |
To sum up my understanding: the human language interpreter would consist of two services: a text-to-intent interpreter which converts natural language to intents (and is locale dependent) and an intent-to-action interpreter which "converts" the intent to actions in ESH (and is locale neutral). Also we define a JSON structure for the intents. Both services will also support returning the "result" of the action. @kaikreuzer should we separate this into a second issue? |
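For illustration only, here is a minimal sketch of what that split could look like in Java; the names (Intent, TextToIntentService, IntentToActionService) are hypothetical and not part of any proposal:
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, locale-neutral intent representation (a Java-side stand-in for the standard intent JSON). */
class Intent {
    private final String name;               // e.g. "switch-on"
    private final Map<String, String> slots; // e.g. {"item" -> "ceiling light"}

    Intent(String name, Map<String, String> slots) {
        this.name = name;
        this.slots = Collections.unmodifiableMap(slots);
    }

    String getName() { return name; }
    Map<String, String> getSlots() { return slots; }
}

/** Locale-dependent part: natural language text to intent. */
interface TextToIntentService {
    Intent extractIntent(Locale locale, String text);
}

/** Locale-neutral part: intent to ESH actions, returning a human language "result" text. */
interface IntentToActionService {
    String execute(Intent intent);
}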
Here are some examples from my English language agents that I've experimented with. wit.ai:
api.ai:
Things common to both:
As you can see, apart from that there are quite a few differences in the structure, depending on the capabilities and direction of the agent. For example, wit.ai returns a confidence rating so you know how confident the engine is that it has successfully extracted the correct intent from the input. api.ai allows you to define text responses that can be displayed or sent to a TTS engine. |
@tilmankamp I wonder about requirement 3
Couldn't part of "executing the resulting actions" be sending a command to the TTS synthesizer to say "The temperature is 28°C", for example? In other words, why does the method have to return a string? |
@hkuhn42 In looking at
public interface ICommandInterpreter {
/**
* Handle a textual command (like turn the head light on) and respond with a textual response
*
* @param command the command to handle
* @return a textual response
*/
public String handleCommand(String command);
}
I have a few comments.
|
@tavalin @hkuhn42 I would really shy away from making any interlingua such as done by wit.ai or api.ai, as it's a real investment in time: interlingua design, interlingua parsing, and it implies heavyweight implementations, as each implementation must understand interlingua+natural language. Such an interlingua implies that NLP tools to do text tokenization, sentence splitting, morphological analysis, suffix treatment, named entity detection... will all have to be in any implementation, making all implementations extremely heavyweight. Not to mention the fact that this would imply similar tools be available in any targeted language, which shuts out many smaller languages. To this end I think keeping the "text-to-intent interpreter", which converts natural language to intents (and is locale dependent), and the "intent-to-action interpreter", which "converts" the intent to actions, behind a single interface is a good idea. It allows one much flexibility in that one is able to make lightweight implementations, simply some RegEx parsing of the text, but also heavyweight implementations that include as many Stanford NLP tools as one likes. |
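To make the "lightweight implementation" point concrete, here is a minimal sketch of a RegEx-only interpreter behind such a single interface; the class name and the wiring to items are hypothetical, not an agreed design:
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical lightweight implementation: one combined text-to-action service based on a single regex. */
class RegexCommandInterpreter {

    // Matches phrases like "turn the ceiling light on" or "turn kitchen lamp off".
    private static final Pattern ON_OFF = Pattern.compile(
            "turn (?:the )?(.+?) (on|off)", Pattern.CASE_INSENSITIVE);

    public String handleCommand(String command) {
        Matcher m = ON_OFF.matcher(command.trim());
        if (m.matches()) {
            String item = m.group(1);
            String state = m.group(2).toUpperCase(Locale.ROOT);
            // A real implementation would post the ON/OFF command to the matching item here.
            return item + " switched " + state;
        }
        return "Sorry, I did not understand: " + command;
    }
}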
@kdavis-mozilla : The interface and its early prototype implementation are meant to process textual input (turn the light on), execute the identified action and respond in a textual form (ok, there was an error, the light is already on ...), and were originally meant to work in conjunction with a voice interpreter and synthesizer. The comment regarding structured text originated in the fact that I also had a look at Nuance Mix. Overall I feel that a simple sentence-in, sentence-out interface would be much easier to use and maintain in the beginning, even if this would mean not being able to easily use advanced services such as Mix. |
@kdavis-mozilla : Having a string return value for a human language response at requirement 3 was just for symmetry reasons. However, somehow the desired sink has to be passed into the routine. So an additional argument would be required - like a target binding id. |
@tilmankamp Good point, we need to specify the sink somehow. |
@tilmankamp @hkuhn42 @tavalin The text-to-intent and intent-to-action split is overkill. It implies an increase in complexity that doesn't justify any current utility we can gain from it. Speaking as one who spent years building just such a text-to-intent system using UIMA, it is not a small undertaking and involves layers and layers of NLP tools that in this case are simply not needed. Not to mention that the use of such NLP tools would imply we use only languages with well-supported NLP ecosystems, i.e. English, French, German, and maybe one or two more. |
@kdavis-mozilla are you saying that we should be doing text-to-action directly or that text-to-intent and then intent-to-action should be part of the same interface? |
@tavalin I think that text-to-action should be done directly |
@kdavis-mozilla a couple of queries/concerns...
Will this mean we need to issue commands according to a rigid grammar rather than natural language expressions? I guess what I'm getting at is, will users need to be conscious of the way they speak for commands to be understood and actioned?
Would this handle multiple commands in one sentence? e.g. "open the blinds and turn the light(s) off"
Can we easily cope with multi-language support doing it this way? |
@tavalin You can issue commands as complicated as you want. (This includes multiple commands in one sentence.) However, you will also have to have an implementation of the "text-to-action" interface that is sufficiently complicated to understand your text commands. I don't think the conversation here has gotten detailed enough to specify if multiple languages are/are not supported by the "text-to-action" interface. However, I would hope that whatever "text-to-action" interface comes out of this discussion, it supports multiple languages. |
@kdavis-mozilla @tavalin : Yes, we really should support multiple languages throughout all involved components. How about a global system configuration property? It could populate its supported values from the Add-On repository. Add-Ons that don't support the selected language will either default to English or fail/complain in the log. |
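As an illustration of that fallback behaviour (names are hypothetical, not an existing ESH API), resolving the globally configured language against an add-on's supported locales could look roughly like this:
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: resolve the globally configured language against what an add-on supports. */
class LanguageResolver {

    /** Falls back to English when the configured language is not supported by the add-on. */
    static Locale resolve(String configuredLanguage, Set<Locale> supportedLocales) {
        Locale requested = Locale.forLanguageTag(configuredLanguage);
        if (supportedLocales.contains(requested)) {
            return requested;
        }
        // Add-ons that don't support the selected language default to English (and should log a warning).
        return Locale.ENGLISH;
    }
}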
I think there is no need to discuss whether multiple languages are supported, only how to implement it :) In essence, Siri, Google Now, Cortana and Alexa created expectations that computers can understand natural language (I know that this is about text interpretation, but chances are the text originates from STT or a chatbot-like infrastructure).
|
So it sounds like the proposed end-to-end solution is as follows: As this issue focuses on the text-to-action service, do we have any ideas for how to implement that? |
@tavalin I think your summary is accurate. As to implementation, I've some ideas. Here are some obvious first cuts...
There are many many possible ways to do this. The only limitation is imagination. |
Another approach I was thinking about was to use a full-text or NLP engine to build a custom dynamic index/model for the active ESH setup (using the available Things and Channels). It is probably not scientific, but my idea was to first try and match the target item (e.g. the omnipresent light) and use this as a base to find out what the user wants by checking what is possible. Talking about the interface, I would definitely add a getSupportedLocales() method.
|
@hkuhn42 Adding public Locale getSupportedLocales() sounds good to me. |
@hkuhn42 my first experiment for this part also used that approach. I used Solr to build an index of my items and tried to match the phrase. It was very simple and worked OK up to a point, but I found it reporting false positives: when I asked "turn on bedroom fan" (which didn't exist), it found a hit against a group called "bedroom". |
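For illustration, here is a naive token-overlap matcher along those lines (hypothetical names, a stand-in for a real Solr/Lucene index); requiring that all non-stop-word tokens of the phrase are covered by an item label is one simple way to reject the "bedroom" group for "turn on bedroom fan":
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical token-overlap matcher for item labels. */
class ItemLabelMatcher {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("turn", "switch", "the", "a", "on", "off", "please"));

    /** Returns the best matching item name, or null if the phrase is not fully covered by any label. */
    static String bestMatch(String phrase, Map<String, String> itemLabelsByName) {
        Set<String> wanted = tokenize(phrase);
        wanted.removeAll(STOP_WORDS); // e.g. "turn on bedroom fan" -> {bedroom, fan}
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : itemLabelsByName.entrySet()) {
            Set<String> labelTokens = tokenize(e.getValue());
            Set<String> covered = new HashSet<>(wanted);
            covered.retainAll(labelTokens);
            double score = wanted.isEmpty() ? 0.0 : (double) covered.size() / wanted.size();
            // Requiring full coverage rejects the "bedroom" group for "turn on bedroom fan":
            // only {bedroom} of {bedroom, fan} is covered, so the score is 0.5, not 1.0.
            if (score > bestScore && score >= 1.0) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    private static Set<String> tokenize(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\W+")));
    }
}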
Another thought: if we are going down the route of a more natural speech/text conversation, then we need to consider contextual information that may accompany the command in order to provide enough information to determine the correct action. Contextual information would probably also be necessary for a machine learning component. |
I think adding |
@kaikreuzer you are right regarding Apple! But please do not be too quiet :)
|
That's a good one (if you are German) 😆 |
And funnily enough, a co-production of me and also Apple 😆 |
Coming from what @tavalin said in regards to the steps that a voice command would take: How about having an inbound counterpart to the current global
Furthermore I would also join Text-to-action and Text-response into one Add-On. It is just more practical to put localized response texts next to their localized input parsers and grammars.
Finally I think that there is no return value or "into-some-other-machine-sinking" of response texts needed, if just the global
So here is an updated version of the proposal:
interface TextToAction {
Set<Locale> getSupportedLocales();
String getGrammar();
void interpret(String text);
}
A spoken command (e.g. "turn ceiling light on") could be captured by an audio input binding that is connected to a
@tilmankamp Not having a text response would reduce the interface to voice output. Scenarios like a chatbot or a "smartclient" which does the text to voice and voice to text are then no longer possible. Also, the whole service would not be usable without a TTSService. Taking into account the other discussion threads (#1021), moving the response handling into a listener might be an option.
|
@hkuhn42 @tilmankamp I guess you should take a look at the comment from Kai and my follow-up for #1021. For the async case this interface has changed a bit. |
@hkuhn42 : I totally agree with you on the given scenarios and your design. I just wanted to be as close to OpenHAB conventions as possible. If a subscriber model is the way to go, I will do it like this. |
I think this will depend on where and how it is used. You could allow specific wirings through configuration (or parameters when initiating it), but for other services we also have a "default" value which refers to the service that should be used, if nothing else is defined. |
@kaikreuzer @hkuhn42 @kdavis-mozilla @tavalin Ok - here is the interface I will implement now. It's the last version of @hkuhn42 - I like the name.
public enum CommandInterpreterResult {
OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}
public interface CommandInterpreter {
void interpret(String command, Locale locale);
Set<Locale> getSupportedLocales();
void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}
public interface CommandInterpreterListener {
void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, String response);
}
Thanks for all the input! |
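A minimal caller-side sketch of how this listener-based variant could be used, e.g. from a console or chat front end (the class name and wiring are hypothetical):
import java.util.Locale;

// Hypothetical caller; the interpreter instance is assumed to come from the service registry.
class ConsoleFrontEnd {

    void ask(CommandInterpreter interpreter, String phrase) {
        // Print the result code and response text once interpretation has finished.
        CommandInterpreterListener listener =
                (ci, result, response) -> System.out.println(result + ": " + response);
        interpreter.registerCommandInterpreterListener(listener);
        // Asynchronous: the call returns immediately, the outcome arrives via the listener.
        interpreter.interpret(phrase, Locale.ENGLISH);
        // A real client would unregister when done:
        // interpreter.removeCommandInterpreterListener(listener);
    }
}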
@tilmankamp This interface will not work. It doesn't specify a grammar. |
Ah - just forgot it - thanks for the hint!
public enum CommandInterpreterResult {
OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}
public interface CommandInterpreter {
void interpret(String command, Locale locale);
Set<Locale> getSupportedLocales();
String getGrammar();
void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}
public interface CommandInterpreterListener {
void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, String response);
} |
@tilmankamp Sorry to be nit picky, but the grammar is locale dependent. |
Makes sense - I also put it into the callback - maybe someone needs it...
public enum CommandInterpreterResult {
OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}
public interface CommandInterpreter {
void interpret(String command, Locale locale);
Set<Locale> getSupportedLocales();
String getGrammar(Locale locale);
void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}
public interface CommandInterpreterListener {
void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, Locale locale, String response);
} |
@tilmankamp Did you consider the threading issues brought up by Kai? |
@kdavis-mozilla : The actual question is whether one should be able to abort a running interpretation. I'm not 100% sure if this makes sense. The service's primary purpose is figuring out which actions to execute. But - yes - it could happen that it executes asynchronous actions/jobs that offer a capability to abort them. Not forwarding this capability would be bad. So I'll add it and also align with you by bundling all result fields into a new result interface:
public enum CommandInterpreterResultCode {
OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, EXECUTION_ABORTED, UNSUPPORTED_PHRASE
}
public interface CommandInterpreterHandle {
public void abort();
}
public interface CommandInterpreterListener {
void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result);
}
public interface CommandInterpreterResult {
String getResponse();
CommandInterpreterResultCode getResultCode();
}
public interface CommandInterpreter {
CommandInterpreterHandle interpret(CommandInterpreterListener listener, Locale locale, String command);
Set<Locale> getSupportedLocales();
String getGrammar(Locale locale);
}
Thanks for the heads up. |
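For illustration, a hypothetical caller could use the returned handle to abort interpretations that take too long:
import java.util.Locale;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical caller: abort an interpretation that does not complete within a deadline.
class TimeoutCaller {

    void interpretWithTimeout(CommandInterpreter interpreter, String command) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        CommandInterpreterListener listener = (ci, result) -> {
            System.out.println(result.getResultCode() + ": " + result.getResponse());
            done.countDown();
        };
        CommandInterpreterHandle handle = interpreter.interpret(listener, Locale.ENGLISH, command);
        if (!done.await(5, TimeUnit.SECONDS)) {
            handle.abort(); // give up on interpretations that take too long
        }
    }
}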
Hm, I think this is overly complex. If I see it right, we do not expect multiple results from the same interpretation being asynchronously delivered. In this case, there is imho no need to have listeners at all. So my suggestion would be:
|
Back to square one. But I would go for |
Sorry ;-)
👍 |
I am still not sure about the grammar. So 👍 for
|
I am feeling the same way. The grammar does not belong to this interface for me. |
@hkuhn42 @kaikreuzer I'm fine with having a |
👍 |
I already thought about the same - it gives better/cleaner support for the non-grammar Kaldi STT use case. |
public interface GrammarProvider {
String getGrammar(Locale locale);
}
public interface HumanLanguageInterpreter {
String interpret(Locale locale, String instruction) throws InterpretationException;
Set<Locale> getSupportedLocales();
} |
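A minimal caller-side sketch (hypothetical class and wiring) of how this merged interface could be used from a voice pipeline, with InterpretationException being the one from the proposal above:
import java.util.Locale;

// Hypothetical glue code between an STT result and the interpreter.
class VoiceCommandHandler {

    private final HumanLanguageInterpreter hli;

    VoiceCommandHandler(HumanLanguageInterpreter hli) {
        this.hli = hli;
    }

    /** Feeds recognized text into the interpreter and returns the text to speak back via TTS. */
    String handle(String recognizedText) {
        Locale locale = Locale.ENGLISH; // in practice: the globally configured locale
        if (!hli.getSupportedLocales().contains(locale)) {
            return "The configured language is not supported by this interpreter.";
        }
        try {
            return hli.interpret(locale, recognizedText); // e.g. "ceiling light on"
        } catch (InterpretationException e) {
            return e.getMessage(); // human language error text, also suitable for TTS
        }
    }
}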
Do we expect multiple GrammarProviders to be potentially present? We will have to add some meta-data, e.g. the grammar syntax it provides, or potentially also an id to reference it (ids would actually be useful for the other TTS, STT and HLI services as well). |
FTR: We have a first version of the HLI merged with #1098. |
In a workshop with Kai we discussed the need for some kind of Text-To-Action binding interface.
The intent is to use it in conjunction with speech to text (new) and text to speech bindings.
Here is a first idea draft:
1. A new interface org.eclipse.smarthome.io.commands.HumanLanguageInterpreter that allows execution of human language commands.
2. A getter for the supported grammar in EBNF form - see the STTService proposal.
3. A human language response string as return value.
A spoken command (e.g. "turn ceiling light on") could be captured by a STTService and passed into the HumanLanguageInterpreter, which would send the according command to the item named "ceiling light". It then could return a human language string saying "ceiling light on", which will be passed into a TTSService binding to be finally sent to some loudspeaker.