
A human language interpreter binding interface #1028

Closed
tilmankamp opened this issue Feb 12, 2016 · 62 comments

@tilmankamp
Copy link
Contributor

In a workshop with Kai we discussed the need for some kind of Text-To-Action binding interface.
The intent is to use it in conjunction with speech to text (new) and text to speech bindings.

Here is a first draft of the idea:

  1. Adding a new interface org.eclipse.smarthome.io.commands.HumanLanguageInterpreter that allows execution of human language commands.
  2. The interface provides a getter for retrieving the supported grammar in some EBNF form - see STTService proposal.
  3. The interface provides a function that takes a human language command as string and returns a human language response as string. It will interpret the command and execute the resulting actions accordingly - e.g. sending commands to items.

A spoken command (e.g. "turn ceiling light on") could be captured by an STTService and passed into the HumanLanguageInterpreter, which would send the corresponding command to the item named "ceiling light". It could then return a human language string saying "ceiling light on", which would be passed into a TTSService binding and finally sent to some loudspeaker.
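A minimal sketch of what such an interface could look like, purely for illustration (the method names and signatures here are assumptions, not a final API):

package org.eclipse.smarthome.io.commands;

// Illustrative sketch only; names and signatures are assumptions, not the final API.
public interface HumanLanguageInterpreter {

    // Supported grammar in some EBNF form (see the STTService proposal).
    String getGrammar();

    // Interprets the command, executes the resulting actions (e.g. sends commands
    // to items) and returns a human language response.
    String interpret(String text);
}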

@tilmankamp tilmankamp changed the title A human text interpreter binding interface A human language interpreter binding interface Feb 12, 2016
@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 12, 2016

As already stated in #584, I worked on something similar for some time and would like to contribute to this. I already did some research regarding different ways to implement voice command interfaces (so speech to text to action to text to speech) and made an extremely crude prototype (I was sidetracked by implementing an audio API, among other things) - see ICommandInterpreter in my sylvani repo.
Regarding a getter for grammar support, I think it would be best if this could be optional, because for certain services that do voice-to-intent instead of mere voice-to-text (like those supported by Amazon Alexa, Microsoft Oxford or Nuance Mix) it could be quite difficult to offer a grammar. However, adding these kinds of services may offer good results.

@kaikreuzer
Copy link
Contributor

@hkuhn42 Note that this is about "text to intent", so no "voice" is directly involved here - so I am actually not clear on whether a "voice-to-intent" service (do you have an example of such a service?) would fit in here at all.

@tilmankamp: Regarding the "grammar getter" (2): Is this something the service must provide, or is this information that the runtime would have to provide to the service? Do you have an example of how that looks? I am not sure if this is really the same "grammar" as on the STTService (where I rather thought that a "vocabulary" is provided)?

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 13, 2016

@kaikreuzer: There are services like Nuance Mix which do not convert language to text but interpret that text and deliver a JSON representation (which is also a kind of text) of the intent as the result (please see Mix). I was planning on also supporting interpreters for that kind of service in my project. However, you are right in that this does not exactly match the requirements.

@tilmankamp: In general I think that it might prove quite difficult to specify a vocabulary inside the API for certain kinds of implementations. To give some other examples, I was experimenting with using OpenNLP and alternatively Lucene to build indices or models of the items and channels inside openHAB and then match the text input (be it from a voice-to-text service, a chatbot or a simple textarea) against these. For neither approach would a vocabulary be easy to specify by the service.

@kaikreuzer
Copy link
Contributor

deliver a json representation (which is also a kind of text)

No, this is already the "intent", so also the output of the text-to-intent service suggested here.

please see Mix.

Nice website with cool explanations!
So what could be done is trying to have a similar JSON structure for "intents", then the "intent processor" (which we have not really defined yet anywhere, maybe we should?) could be shared among such services (i.e. the voice2intent and the text2intent ones). Is there any (de-facto) standard for such intent JSON structures?

@tavalin
Copy link
Contributor

tavalin commented Feb 13, 2016

The two services I've used to generate intent data (they can handle voice or text to generate the intent) both have different, but similar, formats. Having had a quick look at Mix, they are using what looks like yet another JSON format.

The way I could see this working is that ESH picks some sort of standard JSON structure (maybe an established structure or maybe its own) for intents, and we generate (or translate, if we are using third-party services like Mix, wit.ai or api.ai) this standard JSON and pass it on to the "intent processor".

This should hopefully simplify the job of the "intent processor" if it works on a standard input.
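For illustration only, a sketch of what such an ESH-neutral intent structure could look like (the field names here are hypothetical, not an agreed format):

{
  "query": "turn on the light",
  "intent": "on_off",
  "confidence": 0.87,
  "entities": {
    "device": "light",
    "state": "on"
  },
  "response": "Which room?"
}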

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 14, 2016

To sum up my understanding: the human language interpreter would consist of two services: a text-to-intent interpreter, which converts natural language to intents (and is locale-dependent), and an intent-to-action interpreter, which "converts" the intent to actions in ESH and is locale-neutral. Also, we define a JSON structure for the intents. Both services will also support returning the "result" of the action.

@kaikreuzer should we separate this into a second issue?
@tavalin could you give samples of the two notations you have worked with to date? It would really be interesting to have a look at them.

@tavalin
Copy link
Contributor

tavalin commented Feb 14, 2016

Here are some examples from the English language agents that I've experimented with.

wit.ai:

{
  "msg_id" : "df145aa9-de46-4208-86ac-6a0ac552fa80",
  "_text" : "turn on the light",
  "outcomes" : [ {
    "_text" : "turn on the light",
    "confidence" : 0.52,
    "intent" : "on_off",
    "entities" : {
      "state" : [ {
        "type" : "value",
        "value" : "on"
      } ],
      "device" : [ {
        "type" : "value",
        "value" : "light"
      } ]
    }
  } ]
}

api.ai:

{
  "id": "771c5e53-3939-44c0-ac80-89d18ce48e98",
  "timestamp": "2016-02-13T22:19:40.149Z",
  "result": {
    "source": "agent",
    "resolvedQuery": "turn on the light",
    "action": "on_off",
    "actionIncomplete": true,
    "parameters": {
      "device": "light",
      "room": "",
      "state": "on"
    },
    "contexts": [
      {
        "name": "device_room_on_off_dialog_params_room",
        "parameters": {
          "state": "on",
          "device": "light",
          "room": ""
        },
        "lifespan": 1
      },
      {
        "name": "device_room_on_off_dialog_context",
        "parameters": {
          "state": "on",
          "device": "light",
          "room": ""
        },
        "lifespan": 2
      }
    ],
    "metadata": {
      "intentId": "e0e4c588-9bb3-430e-93c9-ed3634f905d7",
      "intentName": "device_room_on_off"
    },
    "fulfillment": {
      "speech": "Which room?"
    }
  },
  "status": {
    "code": 200,
    "errorType": "success"
  }
}

Things common to both:

  • contain the command/query text (regardless of whether the query input was an audio input or text input)
  • contain the name of the action/intent (i.e. action/command)
  • contain the parameters extracted from the input (i.e. devices, states)
  • allow for inputs to contain contextual information

As you can see, apart from that there are quite a few differences in the structure, depending on the capabilities and direction of the agent. For example, wit.ai returns a confidence rating so you know how confident the engine is that it has successfully extracted the correct intent from the input. api.ai allows you to define text responses that can be displayed or sent to a TTS engine.

@kdavis-mozilla
Copy link

@tilmankamp I wonder about requirement 3

"The interface provides a function that takes a human language command as string and returns a human language response as string. It will interpret the command and execute the resulting actions accordingly - e.g. sending commands to items."

Couldn't part of "executing the resulting actions" be sending a command to the TTS synthesizer to say "The temperature is 28C", for example? In other words, why does the method have to return a string?

@kdavis-mozilla
Copy link

@hkuhn42 In looking at ICommandInterpreter

public interface ICommandInterpreter {
    /**
     * Handle a textual command (like turn the head light on) and respond with a textual response 
     * 
     * @param command the command to handle
     * @return a textual response
     */
    public String handleCommand(String command);

}

I have a few comments.

  • What is the input format exactly? You mention "implementations should handle either unstructured..or structured..commands". How is a client of this interface to know what is possible?
  • Does an implementation of this interface actually perform any actions besides providing a textual response? If not, it doesn't seem to fit the spec from tilmankamp.

@kdavis-mozilla
Copy link

@tavalin @hkuhn42 I would really shy away from making any interlingua such as the ones used by wit.ai or api.ai, as it's a real investment in time: interlingua design, interlingua parsing, and it implies heavyweight implementations, as each implementation must understand the interlingua plus natural language.

Such an interlingua implies that NLP tools for text tokenization, sentence splitting, morphological analysis, suffix treatment, named entity detection... will all have to be in any implementation, making all implementations extremely heavyweight. Not to mention the fact that this would imply similar tools being available in any targeted language, which shuts out many smaller languages.

To this end I think keeping the "text-to-intent interpreter", which converts natural language to intents (and is locale-dependent), and the "intent-to-action interpreter", which "converts" the intent to actions, behind a single interface is a good idea.

It allows a lot of flexibility, in that one is able to make lightweight implementations, simply some regex parsing of text, but also heavyweight implementations that include as many Stanford NLP tools as one likes.
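As an illustration of that lightweight end of the spectrum, here is a hedged sketch of a regex-based interpreter behind the single ICommandInterpreter interface quoted above (sendCommandToItem is a hypothetical placeholder, not an ESH API call):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based interpreter; a sketch, not a proposed final implementation.
public class RegexCommandInterpreter implements ICommandInterpreter {

    // Matches phrases like "turn the ceiling light on" or "turn ceiling light off".
    private static final Pattern ON_OFF =
        Pattern.compile("turn (?:the )?(.+?) (on|off)", Pattern.CASE_INSENSITIVE);

    @Override
    public String handleCommand(String command) {
        Matcher m = ON_OFF.matcher(command.trim());
        if (!m.matches()) {
            return "Sorry, I did not understand that.";
        }
        String itemLabel = m.group(1);
        String state = m.group(2).toUpperCase();
        sendCommandToItem(itemLabel, state); // placeholder for the actual ESH item/event API
        return itemLabel + " " + state.toLowerCase();
    }

    private void sendCommandToItem(String itemLabel, String state) {
        // Resolve the item by label and post an ON/OFF command - omitted here.
    }
}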

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 14, 2016

@kdavis-mozilla : The interface and its early prototype implementation are meant to process textual input (turn the light on), execute the identified action and respond in textual form (ok, there was an error, the light is already on ...), and were originally meant to work in conjunction with a voice interpreter and synthesizer.
To give an example, this is what my early prototype currently does:
it captures the sentence turn the light on (or off) with a UI (HTML/JavaScript or command line), sends the audio data via the openHAB UI plugin to the Microsoft Oxford recognition web service, routes the recognized text to the command interpreter, which parses the text, looks for (Hue) lights in ESH/openHAB, sends a turn-on event if it finds at least one and responds with a text, which is in turn sent to the Microsoft Oxford synthesizer. The audio output of the web service is then played by the UI.

The comment regarding structured text originated in the fact that I also had a look at Nuance Mix. Overall I feel that a simple sentence-in, sentence-out interface would be much easier to use and maintain in the beginning, even if this would mean not being able to easily use advanced services such as Mix.

@tilmankamp
Copy link
Contributor Author

@kdavis-mozilla : Having a string return value for a human language response in requirement 3 was just for symmetry reasons. However, the desired sink somehow has to be passed into the routine, so an additional argument would be required - like a target binding id.
@kdavis-mozilla @hkuhn42 @tavalin : I like the idea of splitting it into text-to-intent and intent-to-action. But how do we deal with the/a response? Imagine the complexity in case we really want to do it right (which also includes i18n of the responses):
audio-source -> STT -> text-to-intent (n languages) -> intent-to-action -> i18n-response (n languages) -> TTS -> audio-sink

@kdavis-mozilla
Copy link

@tilmankamp Good point, we need to specify the sink somehow.

@kdavis-mozilla
Copy link

@tilmankamp @hkuhn42 @tavalin The text-to-intent and intent-to-action split is overkill. It implies an increase in complexity that isn't justified by any utility we can currently gain from it.

Speaking as one who spent years building just such a text-to-intent system using UIMA, it is not a small undertaking and involves layers and layers of NLP tools that in this case are simply not needed. Not to mention that the use of such NLP tools would imply we use only languages with well-supported NLP ecosystems, i.e. English, French, German, and maybe one or two more.

@tavalin
Copy link
Contributor

tavalin commented Feb 15, 2016

@kdavis-mozilla are you saying that we should be doing text-to-action directly or that text-to-intent and then intent-to-action should be part of the same interface?

@kdavis-mozilla
Copy link

@tavalin I think that text-to-action should be done directly

@tavalin
Copy link
Contributor

tavalin commented Feb 15, 2016

@kdavis-mozilla a couple of queries/concerns...

Will this mean we need to issue commands according to a rigid grammar rather than natural language expressions? I guess what I'm getting at is: will users need to be conscious of the way they speak for commands to be understood and actioned?

Would this handle multiple commands in one sentence? e.g. "open the blinds and turn the light(s) off"

Can we easily cope with multi language support doing it this way?

@kdavis-mozilla
Copy link

@tavalin You can issue commands as complicated as you want. (This includes multiple commands in one sentence.) However, you will also have to have an implementation of the "text-to-action" interface that is sufficiently complicated to understand your text commands.

I don't think the conversation here has gotten detailed enough to specify if multiple languages are/are not supported by the "text-to-action" interface. However, I would hope that whatever "text-to-action" interface comes out of this discussion it supports multiple languages.

@tilmankamp
Copy link
Contributor Author

@kdavis-mozilla @tavalin : Yes, we really should support multiple languages throughout all involved components. How about a global system configuration property? It could populate its supported values from the Add-On repository. Add-Ons that don't support the selected language will either default to English or fail/complain in the log.

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 15, 2016

I don't think the conversation here has gotten detailed enough to specify if multiple languages are/are not supported by the "text-to-action" interface

I think there is no need to discuss whether multiple languages are supported, only how to implement it :)

In essence, Siri, Google Now, Cortana and Alexa have created the expectation that computers can understand natural language (I know that this is about text interpretation, but chances are the text originates from STT or a chatbot-like infrastructure).
If we want to aim for supporting these non-technical users, we need support for commands in multiple natural languages (not necessarily in one sentence, but at least somehow configurable). Multi-command support would be nice, but I think that's something people can accept if it is not possible.

@kdavis-mozilla

The text-to-intent and intent-to-action split is overkill
You are right, let's go for TTA (text-to-action).

@tavalin
Copy link
Contributor

tavalin commented Feb 15, 2016

So it sounds like the proposed end to end solution is as follows:
Audio source -> STT -> Text-to-action (multiple languages) -> Text response (multiple languages) -> TTS -> Audio sink

As this issue focuses on the text-to-action service, have we any ideas for how to implement that?

@kdavis-mozilla
Copy link

@tavalin I think your summary is accurate.

As to implementation, I've some ideas. Here are some obvious first cuts...

  • A Map keying a limited set of canonical phrases against the possible actions
  • The above Map preceded by a rule engine that rephrases non-canonical phrases to canonical ones
  • The above Map preceded by a k-means clustering algorithm trained to rephrase non-canonical phrases to canonical ones
  • The above Map preceded by an RNN trained to rephrase non-canonical phrases to canonical ones
  • The above Map preceded by a BRNN trained to rephrase non-canonical phrases to canonical ones
  • ...

There are many, many possible ways to do this. The only limitation is imagination. (A minimal sketch of the first option follows below.)
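For illustration, a hedged sketch of that Map-based option (the Runnable actions and the sendCommand helper are placeholders, not ESH API):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "Map of canonical phrases" idea.
public class CanonicalPhraseInterpreter {

    private final Map<String, Runnable> actions = new HashMap<>();

    public CanonicalPhraseInterpreter() {
        // Each canonical phrase maps directly to an action.
        actions.put("turn ceiling light on", () -> sendCommand("CeilingLight", "ON"));
        actions.put("turn ceiling light off", () -> sendCommand("CeilingLight", "OFF"));
    }

    // Returns true if the phrase was known and its action executed.
    public boolean interpret(String phrase) {
        Runnable action = actions.get(phrase.toLowerCase().trim());
        if (action == null) {
            return false; // unknown phrase - a rephrasing stage could normalize it first
        }
        action.run();
        return true;
    }

    private void sendCommand(String itemName, String state) {
        // Placeholder for posting a command to the named item.
    }
}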

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 16, 2016

Another approach I was thinking about was to use a full-text or NLP engine to build a custom dynamic index/model for the active ESH setup (using the available Things and Channels). It is probably not scientific, but my idea was to first try to match the target item (e.g. the omnipresent light) and use this as a base to find out what the user wants by checking what is possible.

Talking about the interface, I would definitely add a
...
public Locale getSupportedLocales()
...
method. If OK with everyone, I would update my existing interface accordingly in the evening, to use as a base for further discussion.


@kdavis-mozilla
Copy link

@hkuhn42 Adding

public Locale getSupportedLocales()

sounds good to me.

@tavalin
Copy link
Contributor

tavalin commented Feb 16, 2016

@hkuhn42 my first experiment for this part also used that approach. I used Solr to build an index of my items and tried to match the phrase. It was very simple and worked OK up to a point, but I found it reporting false positives: when I asked "turn on bedroom fan" (which didn't exist), it found a hit against a group called "bedroom".
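To illustrate why such index matching can produce false positives, here is a hedged toy sketch using plain token overlap instead of Solr (the item labels are hypothetical):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy token-overlap matcher illustrating the false-positive problem.
public class NaiveItemMatcher {

    public static String bestMatch(String query, List<String> itemLabels) {
        Set<String> queryTokens = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        String best = null;
        int bestScore = 0;
        for (String label : itemLabels) {
            int score = 0;
            for (String token : label.toLowerCase().split("\\s+")) {
                if (queryTokens.contains(token)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best; // null if nothing overlaps at all
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("Bedroom", "Living Room Lights", "Kitchen Light");
        // There is no "Bedroom Fan" item, yet the query still matches the "Bedroom" group.
        System.out.println(bestMatch("turn on bedroom fan", items)); // prints "Bedroom"
    }
}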

@tavalin
Copy link
Contributor

tavalin commented Feb 16, 2016

Another thought: if we are going down the route of a more natural speech/text conversation then we need to consider contextual information that may accompany the command in order to provide enough information to determine the correct action.

e.g.
user: "Turn on the living room lights"
response: "OK, turning the living room lights on"
(living room lights turn on)
user: "OK, turn them off"
response: "OK, turning the living room lights off"
(living room lights turn off)

Contextual information would probably also be necessary for a machine learning component.

@tilmankamp
Copy link
Contributor Author

I think adding
public Locale[] getSupportedLocales()
is the best way.

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 16, 2016

@kaikreuzer you are right regarding Apple! But please do not be too quiet :)

On 16.02.2016 at 17:17, Kai Kreuzer notifications@github.com wrote:

What an imperfect Apple solution 8-) Yeah, it is not just voice recognition but home automation in general where Apple has proven that it is a very complex matter, by not coming up with the user-friendly solution everybody had expected from them...

But I fully understand and support all your points above, so I fall quiet again now :-)



@kaikreuzer
Copy link
Contributor

elend to end prototype

That's a good one (if you are German) 😆

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 17, 2016

elend to end prototype

That's a good one (if you are German) 😆

And funny enough a co production of me and also apple 😆

@tilmankamp
Copy link
Contributor Author

Coming from what @tavalin said regarding the steps that a voice command would take:
Audio source -> STT -> Text-to-action (multiple languages) -> Text response (multiple languages) -> TTS -> Audio sink, I want to make some further simplifications:

How about having an inbound counterpart to the current global say command - like interpret? This would also be the name of the interpreting method of the Text-to-action interface. An STT Add-On would use it to pass text into the/a current Text-to-action Add-On.

Furthermore I would also join Text-to-action and Text-response into one Add-On. It is just more practical to put localized response texts next to their localized input parsers and grammars.

Finally I think that there is no return value or "into-some-other-machine-sinking" of response texts needed, if just the global say is used whenever something has to be said (sounds strange - yes).

So here is an updated version of the proposal:

  1. Adding a new interface org.eclipse.smarthome.io.commands.TextToAction that allows execution of human language commands.
  2. The interface provides a getter for retrieving the supported grammar in some EBNF form - see STTService proposal.
  3. The interface provides a function that takes a human language command as string. It will interpret the command and execute the resulting actions accordingly - e.g. sending commands to items. If there should be a textual response to the user, it will use global say command for that.
  4. It supports a getter for retrieving supported languages.
    interface TextToAction {
        Set<Locale> getSupportedLocales();
        String getGrammar();
        void interpret(String text);
    }

A spoken command (e.g. "turn ceiling light on") could be captured by an audio input binding that is connected to an STTService implementation. It will translate the given audio data to its textual representation and call interpret("turn ceiling light on");. A/The current TextToAction Add-On will match one of its supported phrases to the given text and execute the appropriate action - setting the state of CeilingLight to ON. Finally it will call say("ceiling light on");. This will cause the/a current TTSService to send the resulting audio data to a connected audio sink.
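A hedged sketch of how a TextToAction implementation could handle that example (setItemState and say are placeholders standing in for the item registry and the global say command, not actual API calls):

import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; wiring to the real item registry and say command is omitted.
public class SimpleTextToAction implements TextToAction {

    @Override
    public Set<Locale> getSupportedLocales() {
        return Collections.singleton(Locale.ENGLISH);
    }

    @Override
    public String getGrammar() {
        // EBNF-style grammar for the single supported phrase pattern (illustrative).
        return "command = \"turn \", item, \" \", (\"on\" | \"off\");";
    }

    @Override
    public void interpret(String text) {
        if ("turn ceiling light on".equalsIgnoreCase(text.trim())) {
            setItemState("CeilingLight", "ON"); // placeholder for sending the item command
            say("ceiling light on");            // placeholder for the global say command
        }
    }

    private void setItemState(String itemName, String state) { /* omitted */ }

    private void say(String response) { /* omitted */ }
}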

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 17, 2016

@tilmankamp Not having a text response would reduce the interface to voice output. Scenarios like a chatbot or a "smart client" which does the text-to-voice and voice-to-text itself are then no longer possible. Also, the whole service would not be usable without a TTSService.

Taking into account the other discussion thread #1021, moving the response handling into a listener could make sense:

interface CommandInterpreter {
    public void interpret(String command, Locale locale);
    public Set<Locale> getSupportedLocales();
    void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
    void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}

public interface CommandInterpreterListener {
     public void interpreted(CommandInterpreter commandInterpreter, String response); 
}
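For example, a caller-side sketch against these two interfaces (assuming the interpreter instance is obtained elsewhere, e.g. via an OSGi service lookup):

// Hypothetical usage sketch; not part of the proposed API itself.
void wireUp(CommandInterpreter interpreter) {
    interpreter.registerCommandInterpreterListener(new CommandInterpreterListener() {
        @Override
        public void interpreted(CommandInterpreter commandInterpreter, String response) {
            // Forward the textual response to a TTS service, a chatbot, or a UI.
            System.out.println("Response: " + response);
        }
    });
    interpreter.interpret("turn the light on", java.util.Locale.ENGLISH);
}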

@kdavis-mozilla
Copy link

@hkuhn42 @tilmankamp I guess you should take a look at the comment from Kai and my follow-up for #1021. For the async case this interface has changed a bit.

@tilmankamp
Copy link
Contributor Author

@hkuhn42 : I totally agree with you on the given scenarios and your design. I just wanted to be as close to OpenHAB conventions as possible. If a subscriber model is the way to go, I will do it like this.
One question: How is the wiring between subscriber and service supposed to be configured? Just by scripting?

@kaikreuzer
Copy link
Contributor

One question: How is the wiring between subscriber and service supposed to be configured? Just by scripting?

I think this will depend on where and how it is used. You could allow specific wirings through configuration (or parameters when initiating it), but for other services we also have a "default" value which refers to the service that should be used, if nothing else is defined.

@tilmankamp
Copy link
Contributor Author

@kaikreuzer @hkuhn42 @kdavis-mozilla @tavalin Ok - here is the interface I will implement now. It's the last version from @hkuhn42 - I like the name CommandInterpreter and will follow the subscriber model. I also added a structured result to the interpreted callback.

public enum CommandInterpreterResult {
    OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}

public interface CommandInterpreter {
    void interpret(String command, Locale locale);
    Set<Locale> getSupportedLocales();
    void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
    void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}

public interface CommandInterpreterListener {
    void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, String response); 
}

Thanks for all the input!

@kdavis-mozilla
Copy link

@tilmankamp This interface will not work. It doesn't specify a grammar.

@tilmankamp
Copy link
Contributor Author

Ah - just forgot it - thanks for the hint!
Here it is:

public enum CommandInterpreterResult {
    OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}

public interface CommandInterpreter {
    void interpret(String command, Locale locale);
    Set<Locale> getSupportedLocales();
    String getGrammar();
    void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
    void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}

public interface CommandInterpreterListener {
    void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, String response); 
}

@kdavis-mozilla
Copy link

@tilmankamp Sorry to be nit picky, but the grammar is Locale specific.

@tilmankamp
Copy link
Contributor Author

Makes sense - I also put it into the callback - maybe someone needs it...

public enum CommandInterpreterResult {
    OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, UNSUPPORTED_PHRASE
}

public interface CommandInterpreter {
    void interpret(String command, Locale locale);
    Set<Locale> getSupportedLocales();
    String getGrammar(Locale locale);
    void registerCommandInterpreterListener(CommandInterpreterListener interpreterListener);
    void removeCommandInterpreterListener(CommandInterpreterListener interpreterListener);
}

public interface CommandInterpreterListener {
    void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result, Locale locale, String response); 
}

@kdavis-mozilla
Copy link

@tilmankamp Did you consider the threading issues brought up by Kai?

@tilmankamp
Copy link
Contributor Author

@kdavis-mozilla : The actual question is whether one should be able to abort a running interpretation. I'm not 100% sure if this makes sense. The service's primary purpose is figuring out which actions to execute. But - yes - it could happen that it executes asynchronous actions/jobs that offer a capability to abort them. Not forwarding this capability would be bad. So I'll add it and also align with you by bundling all result fields into a new result interface:

public enum CommandInterpreterResultCode {
    OK, INCOMPLETE_PHRASE, UNABLE_TO_EXECUTE, EXECUTION_ABORTED, UNSUPPORTED_PHRASE
}

public interface CommandInterpreterHandle {
    public void abort();
}

public interface CommandInterpreterListener {
    void interpreted(CommandInterpreter commandInterpreter, CommandInterpreterResult result); 
}

public interface CommandInterpreterResult {
    String getResponse();
    CommandInterpreterResultCode getResultCode();
}

public interface CommandInterpreter {
    CommandInterpreterHandle interpret(CommandInterpreterListener listener, Locale locale, String command);
    Set<Locale> getSupportedLocales();
    String getGrammar(Locale locale);
}
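For illustration, a caller could then abort a long-running interpretation via the returned handle (a sketch only; the interpreter instance would come from elsewhere):

// Hypothetical caller-side use of the handle-based variant.
void interpretWithAbort(CommandInterpreter interpreter) {
    CommandInterpreterHandle handle = interpreter.interpret(new CommandInterpreterListener() {
        @Override
        public void interpreted(CommandInterpreter ci, CommandInterpreterResult result) {
            System.out.println(result.getResultCode() + ": " + result.getResponse());
        }
    }, java.util.Locale.ENGLISH, "turn the light on");

    // If the triggered asynchronous actions should be cancelled, the caller can abort.
    handle.abort();
}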

Thanks for the heads up.

@kaikreuzer
Copy link
Contributor

Hm, I think this is overly complex. If I see it right, we do not expect multiple results from the same interpretation being asynchronously delivered. In this case, there is imho no need to have listeners at all.
Assuming that it normally is no long running operation (and if it is, e.g. because it wants to execute something with a delay, the implementation should anyhow schedule a separate job for it), I don't think we need abort either.
One other remark: I don't think "Command" is a good choice, because this term is already widely used throughout ESH for the commands that are sent to items. So should we maybe go for something similar like e.g. "Instruction"?

So my suggestion would be:

public interface InstructionInterpreter {
    String interpret(Locale locale, String instruction) throws InterpretationException;
    Set<Locale> getSupportedLocales();
    String getGrammar(Locale locale);
}

@tilmankamp
Copy link
Contributor Author

Back to square one. But I would go for HumanLanguageInterpreter. The exception would carry a failure response text.
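For illustration, a hedged sketch of such an exception (purely an assumption of how it could look, not a final class):

// Illustrative sketch: the exception carries the human language failure response.
public class InterpretationException extends Exception {

    public InterpretationException(String failureResponse) {
        super(failureResponse);
    }

    // The localized text to be spoken or displayed to the user.
    public String getFailureResponse() {
        return getMessage();
    }
}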

@kaikreuzer
Copy link
Contributor

Back to square one.

Sorry ;-)

But I would go for HumanLanguageInterpreter

👍

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 19, 2016

I am still not sure about the grammar.
However, I am probably just too stuck in a corner to see it, so I will shut up 😄

So 👍 for

public interface HumanLanguageInterpreter {
    String interpret(Locale locale, String instruction) throws InterpretationException;
    Set<Locale> getSupportedLocales();
    String getGrammar(Locale locale);
}

@kaikreuzer
Copy link
Contributor

I am still not sure about the grammar.

I am feeling the same way. The grammar does not belong in this interface for me.
From a software design perspective, I think we should rather introduce an additional GrammarProvider interface, and a TTA engine then happens to implement both HumanLanguageInterpreter and GrammarProvider. An STTService can then (optionally) depend on a GrammarProvider.

@kdavis-mozilla
Copy link

@hkuhn42 @kaikreuzer I'm fine with having a GrammarProvider interface and having a TTA engine that happens to implement both HumanLanguageInterpreter and GrammarProvider.

@hkuhn42
Copy link
Contributor

hkuhn42 commented Feb 19, 2016

👍

@tilmankamp
Copy link
Contributor Author

I already thought about the same - it's better/cleaner support for the non-grammar Kaldi STT use case.

@tilmankamp
Copy link
Contributor Author

public interface GrammarProvider {
    String getGrammar(Locale locale);
}

public interface HumanLanguageInterpreter  {
    String interpret(Locale locale, String instruction) throws InterpretationException;
    Set<Locale> getSupportedLocales();
}
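As Kai suggested, a TTA engine could then implement both interfaces. A hedged sketch (the class name and body are illustrative only):

import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a single engine exposing both capabilities.
public class SimpleTextToActionEngine implements HumanLanguageInterpreter, GrammarProvider {

    @Override
    public Set<Locale> getSupportedLocales() {
        return Collections.singleton(Locale.ENGLISH);
    }

    @Override
    public String getGrammar(Locale locale) {
        // EBNF-style grammar for the phrases this engine understands (illustrative).
        return "command = \"turn \", item, \" \", (\"on\" | \"off\");";
    }

    @Override
    public String interpret(Locale locale, String instruction) throws InterpretationException {
        if (!"turn ceiling light on".equalsIgnoreCase(instruction.trim())) {
            throw new InterpretationException("Sorry, I did not understand that.");
        }
        // Placeholder: send the ON command to the CeilingLight item here.
        return "ceiling light on";
    }
}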

@kaikreuzer
Copy link
Contributor

Do we expect multiple GrammarProviders to potentially be present? We will have to add some meta-data, e.g. the grammar syntax it provides or potentially also an id to reference it (ids are actually needed for the other TTS, STT and HLI services as well).

@kaikreuzer
Copy link
Contributor

FTR: We have a first version of the HLI merged with #1098.
