Refactor TTS APIs into a general audio dispatching allowing different sources and sinks #584
Comments
I wanted to capture here several details related to this issue that Kai and our group discussed. I'll be deliberately pedantic in documenting the details of the discussion, to make sure that all attendees agree on its results.
If any details from the discussion were not captured here, please add them! |
A few comments from my side:
|
Hello, I have been working on something similar for some time now and would like to contribute to this. I am currently on vacation without a computer but will be back tomorrow. The (very early) state of my implementation can be found at https://github.com/hkuhn42/sylvani but it is still very rough...
|
Thanks @hkuhn42, I briefly browsed your repo and that indeed looks very similar to our plans - so it would be great to have you on board here! |
@hkuhn42 I've also taken a look. It does look very similar to what we had in mind. We should try to join forces! |
Ok, how do we go about this (sorry, but as Kai already learned, I am new to this and also to GitHub and may need some support)? In general the repo contains some experiments regarding voice control (text-to-speech and speech-to-text), as I found the current support somewhat lacking. If any of that code is interesting I can also move it over (I apologize for the missing comments and such ;) ). |
I think we should first start to roughly define the API (incl. package name etc.) according to the above requirements and agree on them. |
I think that it covers most of the requirements as far as I understood them. I currently do not have a mechanism that matches the auto-pairing described in point 10. My idea for that was to use a registry to register sources and outputs and also offer methods for simple playback and recording.
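For illustration, such a registry might look roughly like this (all names here are hypothetical, not an agreed API):

public interface AudioRegistry {

    /**
     * Register / unregister available sources and outputs (hypothetical API)
     */
    void addAudioSource(AudioSource source);

    void removeAudioSource(AudioSource source);

    void addAudioOutput(AudioOutput output);

    void removeAudioOutput(AudioOutput output);

    /**
     * Convenience playback: streams the given source to a registered output
     * that supports one of the source's formats
     */
    void play(AudioSource source) throws AudioException;
}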
|
Thinking a bit about the interface:

public interface ISyntheziserService {
public AudioSource synthesize(String text, Locale locale, String voice, AudioFormat requestedFormat)
throws Exception;
}

A few things popped to mind
I haven't had time to look at the audio portion in detail. I'll comment on |
That is actually a point I spent a lot of time thinking about and did not come up with an ideal solution for. Voices do imply a locale, but when using the different services I experimented with, I usually did not care about the actual voice used and rather just used the first one matching my locale (due to the fact that apart from English, locale support gets thin very fast...). Also, for some, like MaryTTS, additional voices have to be installed first. So the two parameters were meant to be one or the other (which is not ideal). A way to tell the service to use any voice for a given locale would be a must-have for me.
My approach was that the implementation should use the default supported format (or one of them). Converting from one of these to another should then be part of an audio API if needed. However, audio transcoding is heavy stuff, so my idea was to start without doing any transcoding and instead rely on the available formats. So, taking into account your comments and some of my thoughts, a new version of the interface could look like:
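(The revised interface itself was not preserved in this thread; the following is only a hypothetical sketch along the lines discussed above, reusing the earlier names, with null meaning "any voice for the locale" or "default format".)

public interface ISyntheziserService {

    /**
     * The audio formats this synthesizer can produce; the first entry is treated as the default
     */
    AudioFormat[] getSupportedFormats();

    /**
     * Synthesizes the given text.
     *
     * @param text the text to synthesize
     * @param locale the requested locale
     * @param voice a concrete voice name, or null to let the service pick any voice for the locale
     * @param requestedFormat one of the supported formats, or null for the default format
     */
    AudioSource synthesize(String text, Locale locale, String voice, AudioFormat requestedFormat)
            throws Exception;
}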
|
@hkuhn42 I'm liking the direction this is going! I have a suggestion, motivated by the WebSpeech API:

public interface Voice {
/**
* Globally unique identifier of the voice
*
* @return A URI uniquely identifying the voice globally
*/
public java.net.URI getVoiceURI();
/**
* Name of the voice, usually used for GUI's or VUI's
*
* @return Name of the voice, may not be globally unique
*/
public String getName();
/**
* Locale of the voice
*
* @return Locale of the voice
*/
public Locale getLocale();
/**
* Indicates if the voice requires internet access
*
* @return A boolean indicating if the voice is local to the hardware
*/
public boolean isLocalVoice();
}

then sufficient information would be provided by the |
@kdavis-mozilla The getDefaultVoice was meant to help find the most suitable voice for a given locale. However, I dropped it, as finding a voice for a locale can always be implemented outside of such a service. Regarding the Voice interface, I see your point, but I would rather not mix the "usage" interface with any kind of voice configuration or management, as this seems to be highly implementation dependent. E.g. for online services, isLocalVoice would have to always return false (so in my understanding the whole service always depends on the internet being available). In this case, I would rather raise an exception if no internet is available. @kaikreuzer should we make a new voice bundle or should this be a part of the existing multimedia bundle? |
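As an aside to the point above about selecting a voice outside the service, a minimal helper could look like this (the class and method names are made up for illustration):

import java.util.Collection;
import java.util.Locale;

public final class VoiceSelector {

    private VoiceSelector() {
    }

    /**
     * Returns the first voice whose language matches the requested locale,
     * or null if the collection contains no matching voice.
     */
    public static Voice findVoice(Collection<Voice> voices, Locale locale) {
        for (Voice voice : voices) {
            if (voice.getLocale().getLanguage().equals(locale.getLanguage())) {
                return voice;
            }
        }
        return null;
    }
}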
@kdavis-mozilla A few questions/comments...
Before we start on implementation, we need to consider the |
We should probably rather talk about an id and a label - the id being the unique technical identifier, the label being the text to show in UIs. Usually, UI labels are also localizable, but for voices this is imho not needed.
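In code that would boil down to something like this (sketch only, reusing the Voice interface from above with id and label instead of URI and name):

public interface Voice {

    /**
     * Unique technical identifier of the voice
     */
    String getId();

    /**
     * Human readable label of the voice for UIs; not necessarily unique or localized
     */
    String getLabel();

    /**
     * Locale of the voice
     */
    Locale getLocale();
}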
I am not sure if we would need this information - so far, we have no indication anywhere whether something is local or remote. And anyhow, I would assume that it usually applies to the whole TTS service and that it is a rare exception that a single TTS service has local AND remote voices.
We need to discuss the namespace for all the new components. I could imagine that we move this out of the "multimedia" package into e.g. |
@kaikreuzer If we want to call the However, I think keeping the As far as "meta-information for UIs" goes, I agree. But I guess we should get the internals out of the way first. The reason I thought As for namespaces, @kaikreuzer whatever you decide upon is fine. |
@kdavis-mozilla Note that for a URI, only scheme and path are mandatory. So
As I said, I think this will usually apply to the whole TTS service, not just to a few voices from it. And then it can be nicely documented for the service itself, i.e. a user who picks it and installs it should already know whether it needs a cloud connection or not. |
@kaikreuzer We can use a On
If that's always the case, we don't need the |
Before we jump into implementation, I think we still need to agree on the contents of org.sylvani.audio, or more specifically what our version of this package will be. A quick summary of the things we've explicitly used:

public interface AudioSource {
/**
* Returns the human readable name of the source
*
* @return the human readable name of the source
*/
public String getName();
/**
* an array containing all supported audio formats
*
* @return all supported audio formats
*/
public AudioFormat[] getSupportedFormats();
/**
* Gives access to an InputStream for reading audio data, the format is the default format
*
* @return InputStream for reading audio data
* @throws AudioException
*/
public InputStream getInputStream() throws AudioException;
/**
* An inputstream for reading audio data, the format is set to the given format
*
* @param format the desired format (one of getSupportedFormats) or null for the default format
* @return InputStream for reading audio data
* @throws AudioException
*/
public InputStream getInputStream(AudioFormat format) throws AudioException;
/**
* Load data from this AudioSource to the given {@link AudioOutput}
*
* @param output
* @throws AudioException
*/
public void stream(AudioOutput output) throws AudioException;
/**
* Returns true if this AudioSource can stream to the given {@link AudioOutput}
* false otherwise
*
* @param output the output to stream to
* @return true if this AudioSource can stream to the given output
*/
public boolean canStream(AudioOutput output);
}
public class AudioFormat {
private AudioCodec codec;
private AudioContainer container;
/**
* bit depth (https://en.wikipedia.org/wiki/Audio_bit_depth)
* or
* bit rate (https://en.wikipedia.org/wiki/Bit_rate)
* depending on codec
*/
private int bits;
/**
* sample frequency
*/
private long frequency;
public AudioCodec getCodec() {
return codec;
}
public void setCodec(AudioCodec codec) {
this.codec = codec;
}
public AudioContainer getContainer() {
return container;
}
public void setContainer(AudioContainer container) {
this.container = container;
}
public int getBits() {
return bits;
}
public void setBits(int bits) {
this.bits = bits;
}
public long getFrequency() {
return frequency;
}
public void setFrequency(long rate) {
this.frequency = rate;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof AudioFormat) {
AudioFormat format = (AudioFormat) obj;
if (format.getCodec() != getCodec()) {
return false;
}
if (format.getContainer() != getContainer()) {
return false;
}
if (format.getBits() != getBits()) {
return false;
}
if (format.getFrequency() != getFrequency()) {
return false;
}
return true;
}
return super.equals(obj);
}
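/**
* Illustrative addition, not part of the original proposal: since equals is overridden above,
* hashCode has to be overridden as well to keep the java.lang.Object contract
*/
@Override
public int hashCode() {
return java.util.Objects.hash(codec, container, bits, frequency);
}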
}

Along with the things we've implicitly used (ignoring exceptions):

public interface AudioOutput {
/**
* Returns the human readable name of the output
*
* @return the human readable name of the output
*/
public String getName();
/**
* Array containing all supported audio formats this output can process
*
* @return all supported audio formats this output can process
*/
public AudioFormat[] getSupportedFormats();
/**
* An output stream for writing audio data in the default {@link AudioFormat} of this output
*
* @return an {@link OutputStream}
* @throws AudioException
*/
public OutputStream getOutputStream() throws AudioException;
/**
* An output stream for writing audio data, the format is set to the given format; throws an {@link UnsupportedAudioFormatException} if the given
* format is not supported
*
* @param format the desired format (one of getSupportedFormats) or null for the default format
* @return an OutputStream to write data to this output
* @throws AudioException thrown among other reasons if the given format is not supported
*/
public OutputStream getOutputStream(AudioFormat format) throws AudioException;
/**
* Process audio data from the provided {@link AudioSource}; throws an {@link AudioException} if no matching format is found
*
* @param source
* @throws AudioException
*/
public void stream(AudioSource source) throws AudioException;
/**
* Returns true if the given {@link AudioSource} can be processed by this output
*
* @param source
* @return true if the AudioSource can be processed
*/
public boolean canStream(AudioSource source);
}
public enum AudioCodec {
/**
* PCM Signed
*
* http://wiki.multimedia.cx/?title=PCM#PCM_Types
*/
PCM_SIGNED,
/**
* PCM Unsigned
*
* http://wiki.multimedia.cx/?title=PCM#PCM_Types
*/
PCM_UNSIGNED,
/**
* MP3 Codec
*/
MP3,
/**
* Vorbis Codec
*/
VORBIS
}
public enum AudioContainer {
/**
* NONE
* AudioCodec encoded data without any container header or footer
*
* e.g. MP3 is a non-container format
*/
NONE,
/**
* Microsoft's WAVE container format
* http://www.zytrax.com/tech/audio/formats.html#wav-format
*
* for a list of codecs supported by WAV see
* http://www.opennetcf.com/library/sdf/html/60ca47dc-0b9d-2be4-a738-d0080c6fe10c.htm
*
* the RIFF audio format
*/
WAVE,
/**
* http://www.xiph.org/ogg/
*/
OGG
} |
This would be a use case for an annotation: We should create two annotations that would mark the services as being @offline and @online. This way, this information could be used by humans and software alike (a rough sketch follows below). However, this could possibly be a general functionality that imho would belong in some base package of ESH. Luckily, we have exactly the right developer to ask at hand: @kaikreuzer what do you think? About the audio package: We obviously should change both
As stated somewhere in the javadoc, I ignored the endianness problem in AudioFormat. We probably should add something for that; a boolean would probably be enough. And fix the typos in the javadoc 😄 |
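For illustration of the @offline / @online idea above, such a marker annotation could be as simple as the following (purely hypothetical; nothing like this exists in ESH today, and @Offline would look analogous):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Marks a service implementation that requires internet access (illustrative only).
 */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface Online {
}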
As the audio API can be used without TTS or STT (for example, to support notification sounds), I suggest we move it to a separate bundle. I created org.eclipse.smarthome.io.audio for that purpose. |
@kaikreuzer @hkuhn42 A naïve question, but why don't we simply use the standard Java package javax.sound.sampled? As far as I can see, |
@kdavis-mozilla I actually thought a lot about this before I wrote my own API definition.
I actually tried to use it for my prototype, but in the end I wrote this API and created an implementation for the JavaSoundAPI. |
@hkuhn42 Great! I just wanted to make sure we weren't overlooking the obvious. |
I still doubt that this is of much value - nobody has ever asked for such a flag for any of the 120+ bindings; it is usually already clear to users what the binding is about and what the technology behind it does. If we really see the need for such formal annotations, we can still add them any time later.
Makes sense! Regarding the interfaces:
|
Having stream on both interfaces is just nice and symmetric but apart from that it could easily be removed and replaced with a comment to point this out.
Fine with me, if we do not have both (like with an actual file) they can easily be equal
Actually yes, I was always thinking about adding a kind of ContentProvider interface that defines content selection. This could also offer named groups/folders and additional metadata about the provider. I wonder whether we should not separate the TTS and audio features into two issues at this point? @kdavis-mozilla I think we are now approaching the point where we should move the discussed TTS code into a bundle to make it easier to refine it. What do you think? |
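Just to make the ContentProvider idea above a bit more concrete, a sketch (all names are hypothetical, not an agreed API):

public interface AudioContentProvider {

    /**
     * Human readable name of the provider
     */
    String getName();

    /**
     * Named groups / folders of content offered by this provider
     */
    Collection<String> getGroups();

    /**
     * Audio sources available within the given group
     */
    Collection<AudioSource> getSources(String group);
}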
You are right. However, for certain use cases (like ringtones) it might be helpful to have some way of identifying the concrete AudioSource / AudioSink.
Sorry, you are right again, I was taken a bit off track there...
No, not at all; imho there is only one thing missing: I would add a way to access an OutputStream as a convenience, to enable users of PlaybackService to not have to create an AudioSource first if they choose not to. So, in slight variation:
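(The varied interface itself is not preserved here; the following is only a hypothetical reading of it, with the convenience method included:)

public interface PlaybackService {

    /**
     * Plays the given source on the default audio output
     */
    void play(AudioSource source) throws AudioException;

    /**
     * Convenience access to an OutputStream of the default output, so callers can write
     * audio data directly without creating an AudioSource first
     */
    OutputStream getOutputStream(AudioFormat format) throws AudioException;
}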
If you both think this is unnecessary, we can leave the method out. If you both are happy with this interface and do not object, I will try to incorporate these changes in my fork and prepare a pull request in the evening (@kaikreuzer: I fear I will need your help with the details). |
@hkuhn42 I think I'm fine with the current interface |
Let's only add it when we really need it. Until then, you can use toString() to have a string representing the instance :-)
You missed the situation that one of us agrees and the other objects 😎 |
@kaikreuzer Good catch. I think you're right:

public interface AudioSink {
Set<AudioFormat> getSupportedFormats();
boolean process(AudioSource audioSource);
OutputStream getOutputStream(AudioFormat audioFormat);
}

The As to threading and the use of an
For the first case I don't think there is anything special to do. The threads collectively "know" what they are doing as they pass around the single For the second case I also don't see a "problem". I would just expect the The question is: What if a given For example an |
My assumption is that this is not possible in 95% of all cases. So the sink can only provide one OutputStream instance at a time and has to make sure that it is only used by one thread.
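For illustration only (not something agreed in this discussion), a sink could enforce that by serializing its processing:

public abstract class SerializedAudioSink implements AudioSink {

    /**
     * process is synchronized so that only one AudioSource is rendered at a time,
     * keeping the single underlying OutputStream confined to one thread.
     */
    @Override
    public synchronized boolean process(AudioSource audioSource) {
        return doProcess(audioSource);
    }

    protected abstract boolean doProcess(AudioSource audioSource);
}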
This is pretty ugly for the caller and the caller would have to retry by himself to find out when the stream might be available again - this is horrifying API design. I would prefer to remove this method and rediscuss, if we really come across a situation where it is desirable to directly obtain the raw output stream. |
Sounds fine to me. |
In general I think that the AudioSource and AudioSink are now quite different in the way they work. However, I agree with @kaikreuzer that we should just go ahead and see how the API works out when used. In my experience you learn most about an API when implementing and using it. |
@kaikreuzer I tried to create a pull request. Please have a look. Thanks! |
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
Initial version of audio API for #584
Implemented APIs with #1132. @kdavis-mozilla You mentioned that you were porting |
On our fork we've implemented
But, we based all of this off our implementation of the Audio interfaces as we didn't have time to wait for this pull request. However, our implementation of the Audio interfaces should be the same as the pull request here, mod whitespace issues and the like. |
Then it is probably now a good time to reconcile things - you should avoid having your fork differing too much from the ESH master. Hope you'll find the time to do PRs for your work! |
Bug: eclipse-archived#584 Also-By: Kelly Davis <kdavis@mozilla.com> Signed-off-by: Harald Kuhn <harald.kuhn@gmail.com>
Initial version of audio API for eclipse-archived#584
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
@kdavis-mozilla (@kaikreuzer) Cfr #1200, where do you stand? |
@kgoderis Unfortunately @kdavis-mozilla and team stopped working on these features. |
@kaikreuzer playSound is M.I.A.? Cfr the initial post in this thread |
Well, at the moment it is still only available in the compat layer, but I plan to migrate this to the new framework (and have it moved to ESH). My biggest issue there is that the plain javasound sink does not support mp3s... Therefore I plan to implement an "enhanced" sink in OH2, which uses the jl library, which is used by the "old" Audio action. |
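For reference, a minimal sketch of mp3 playback via the jl (JLayer) library, which such a sink could wrap (error handling omitted; the class name Mp3Playback is made up for illustration):

import java.io.InputStream;

import javazoom.jl.player.Player;

public class Mp3Playback {

    /**
     * Plays an mp3 stream on the default javasound device; blocks until playback has finished.
     */
    public static void play(InputStream mp3Stream) throws Exception {
        Player player = new Player(mp3Stream);
        try {
            player.play();
        } finally {
            player.close();
        }
    }
}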
I need it for Sonos in my setup, so I will give it a go.
|
This is now fully in place (through quite a few PRs in the last two weeks, which I don't want to list all here). |
migrated from Bugzilla #463802