This repository has been archived by the owner on May 7, 2020. It is now read-only.

Refactor TTS APIs into a general audio dispatching allowing different sources and sinks #584

Closed
kaikreuzer opened this issue Dec 3, 2015 · 59 comments

Comments

@kaikreuzer
Contributor

migrated from Bugzilla #463802
status NEW severity enhancement in component Core for ---
Reported in version unspecified on platform All
Assigned to: Project Inbox

On 2015-04-02 09:11:10 -0400, Kai Kreuzer wrote:

The current TTS implementations combine the creation of the audio stream with its output. E.g. the Google Translate TTS service generates an mp3 and outputs it on the local sound card.
It would be much nicer if the mp3 were created by one service (the "source", here: Google Translate) and the output were done by another one - this way, it could also be played back on Sonos, Squeezebox etc. (the "sink").

Splitting it this way automatically means that it is not limited to TTS - it could indeed be any kind of sound file. So this refactoring should also address the "playSound" action at the same time.

@kdavis-mozilla

I wanted to capture here several details related to this issue that Kai and our group discussed. I'll be deliberately pedantic in documenting the details of the discussion to make sure that all attendees agree on its results.

  1. The current interface TTSService should be refactored
  2. The refactoring should introduce two new interfaces AudioSource and AudioSink, the names may change and are simply placeholders for now.
  3. AudioSource is an interface any binding that is a sound source should implement. For example a binding for a microphone would implement this interface.
  4. An AudioSource can be in any audio format. For example mp3, wav...
  5. An AudioSource must allow a client of the interface to obtain its format. In other words it should expose sufficient get methods for a client to determine if it can play such an AudioSource.
  6. An AudioSource can either be of a finite length, for example the mp3 of a song, or of an indefinite length, for example the live streaming of a speech.
  7. AudioSink is an interface any binding that is a sound player should implement. For example a binding for a Sonos player should implement this interface.
  8. The refactored TTSService interface would redefine the say method to produce an AudioSource.
  9. The voice parameter of the refactored TTSService should uniquely identify the voice used in the say command. (If several TTS engines have voices of the same name, name clashes must be dealt with somehow.)
  10. A mechanism must be put in place to allow, at runtime, an AudioSource to be dynamically paired with a compatible AudioSink. (This may be as simple as having an AudioSink expose a bool isPlayable(AudioSource audioSource) method, then at runtime looping over all AudioSinks looking for the first that returns true from its isPlayable method - see the sketch below.)

If any details from the discussion were not captured here, please add them!
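As an illustration of point 10, a minimal sketch of such a matching loop (method and parameter names are placeholders, not final API; the collection of registered sinks is assumed to come from some registry):

    public AudioSink findSink(AudioSource source, Collection<AudioSink> sinks) {
        // first compatible sink wins; a smarter strategy could be plugged in later
        for (AudioSink sink : sinks) {
            if (sink.isPlayable(source)) {
                return sink;
            }
        }
        return null; // no compatible sink registered
    }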

@kaikreuzer
Contributor Author

A few comments from my side:

  • on 3 and 7: "binding... should implement": So far, the TTS implementations are not bindings, but an add-on type of their own. We will have AudioSources and AudioSinks, which won't be related to bindings, but are simply local services (e.g. the system audio out). But it is right that microphones & speakers on "Things" will also serve as sources and sinks, so we will have to see how we can declare such functionalities within a Thing and how and when these are registered as a service.
  • on 4: It should not be "any" format, but a format from a list that we should define
  • on 8: Additionally, we should re-implement the Audio action and move it to ESH, which would then bring together the right source+sink.
  • on 10: Besides "auto-matching", we should also allow manual selection of sources and sinks (e.g. for the rule action, where the user might want to have different notifications on different speakers). In order to let users do this through UIs as well, we will have to think about adding some meta-data (labels, descriptions, etc.).

@hkuhn42
Contributor

hkuhn42 commented Feb 11, 2016

Hello,

I have been working on something similar for some time now and would like to contribute to this. I am currently on vacation without a computer but will be back tomorrow. The (very early) state of my implementation can be found at https://github.com/hkuhn42/sylvani but it is still very rough...


@kaikreuzer
Contributor Author

Thanks @hkuhn42, I briefly browsed your repo and that indeed looks very similar to our plans - so it would be great to have you on board here!

@kdavis-mozilla

@hkuhn42 I've also taken a look. It does look very similar to what we had in mind. We should try to join forces!

@hkuhn42
Contributor

hkuhn42 commented Feb 12, 2016

Ok, how do we go about this (sorry, but as Kai already learned, I am new to this and also to GitHub and may need some support)?
I could set up a repo with a fork and move the code there. Would that be a good way to get things started?
If so, which package should I put it in, and which code should I add? Just the API? I also started to write basic implementations for the Java Sound API to support local audio devices, but that API is not part of compact profile 2, so it violates the guidelines.

In general, the repo contains some experiments regarding voice control (text-to-speech and speech-to-text), as I found the current support somewhat lacking. If any of that code is interesting, I can also move it over (I apologize for the missing comments and such ;)).

@kaikreuzer
Contributor Author

I think we should first start to roughly define the API (incl. package name etc.) according to the above requirements and agree on them.
@hkuhn42 Do you feel that your current code is already fulfilling all ideas above or does it actually address things that we might have forgotten and that should be considered?
Once we have agreement on this, you could create a fork of ESH, refactor your code if necessary and create a pull request against ESH, on which we can then continue any further discussion.

@hkuhn42
Contributor

hkuhn42 commented Feb 13, 2016

I think that it covers most of the requirements as far as I understood them. I currently do not have a mechanism that matches the auto-pairing described in 10. My idea for that was to use a registry for registering sources and outputs that also offers methods for simple playback and recording.
Also, I have to take a look at that Audio action to understand what it currently does.
For me it comes down to what would be easier for you. You could have a look at the current state of the API, or I can extract it into a fork first. If it does not match your requirements, we can always start from scratch. I added some more documentation to hopefully make it easier for you to find the relevant parts in my repo.

@kdavis-mozilla

Thinking a bit about the ISyntheziserService interface

public interface ISyntheziserService {

    public AudioSource synthesize(String text, Locale locale, String voice, AudioFormat requestedFormat)
            throws Exception;

}

A few things came to mind:

  • A voice is, in all cases I am aware of, bound to a locale. In other words, specifying a voice implies a locale. So, I don't think locale is required as a parameter.
  • If a GUI or voice interface wishes to display or say the available voices, it has no means to obtain the list of possible voices.
  • It's unclear to me if voice also specifies the add-on from which the targeted voice comes. Two TTS add-ons may have voices with the same name. Hence, we should provide a means to disambiguate.
  • Is audio transcoding the responsibility of each ISyntheziserService implementation, or does it simply output one of the n formats it knows of and throw otherwise?
  • If audio transcoding is the responsibility of each ISyntheziserService implementation, I think it shouldn't be and we should introduce a new interface/implementation to do transcoding.

I haven't had time to look at the audio portion in detail. I'll comment on ICommandInterpreter in issue #1028.

@hkuhn42
Contributor

hkuhn42 commented Feb 14, 2016

A voice is, in all cases I am aware of, bound to a locale. In other words, specifying a voice implies a locale. So, I don't think locale is required as a parameter.

That is actually a point I spent a lot of time thinking about, and I did not come up with an ideal solution. Voices do imply a locale, but when using the different services I experimented with, I usually did not care about the actual voice used and rather just used the first one to match my locale (due to the fact that, apart from English, locale support gets thin very fast...). Also, for some, like MaryTTS, additional voices have to be installed first. So the two parameters were meant to be one or the other (which is not ideal). A way to tell the service to use any voice for a given locale would be a must-have for me.
Regarding the actual voice selection, I was torn between using strings for simplicity and an interface which every service would have to implement.

Is audio transcoding the responsibility of each ISyntheziserService implementation, or does it simply output one of the n formats it knows of and throw otherwise?

My approach was that the implementation should output the default supported format (or one of them). Converting from one of these to another should then be part of an audio API if needed. However, audio transcoding is heavy stuff, so my idea was to start without doing any transcoding and rely on the available formats.

So, taking into account your comments and some of my thoughts, a new version of the interface could look like:


    /**
     * Synthesize voice output for the given text with the specified voice and format
     * 
     * Voice and format info should be optional
     * 
     * @param text
     * @param voice
     * @param requestedFormat
     * @return
     * @throws Exception
     */
    public AudioSource synthesize(String text, Voice voice, AudioFormat requestedFormat)
            throws Exception;

    /**
     * Array containing all supported audio formats this SyntheziserService can produce
     *
     * @return an array of AudioFormat
     */
    public AudioFormat[] getSupportedFormats();

    /**
     * Get the default voice for the given locale or null if it is not supported 
     * 
     * @param locale
     * @return
     */
    public Voice getDefaultVoice(Locale locale);

    /**
     * An array containing all supported voices of this SyntheziserService
     * 
     * @return an array of Voice implementations
     */
    public Voice[] getAvailableVoices();
...
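A rough usage sketch for this revised interface (tts stands for some SyntheziserService implementation; passing null as the format would mean "use the service default"):

    Voice voice = tts.getDefaultVoice(Locale.GERMAN);
    AudioSource audio = tts.synthesize("Hallo Welt", voice, null);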

@kdavis-mozilla

@hkuhn42 I'm liking the direction this is going!

I've a suggestion, motivated by the WebSpeech API, for the Voice interface that would let us get rid of the getDefaultVoice() method. If the Voice interface takes a form something like

public interface Voice {
       /**
        * Globally unique identifier of the voice
        *
        * @return A URI uniquely identifying the voice globally
        */
        public java.net.URI getVoiceURI();

       /**
        * Name of the voice, usually used for GUI's or VUI's
        *
        * @return Name of the voice, may not be globally unique
        */
        public String getName();

       /**
        * Locale of the voice
        *
        * @return Locale of the voice
        */
        public Locale getLocale();

       /**
        * Indicates if the voice requires internet access
        *
        * @return A boolean indicating if the voice is local to the hardware
        */
        public boolean isLocalVoice();
}

then sufficient information would be provided by the getAvailableVoices() method to obtain a Voice for a given Locale, which, as far as I understand, is your use case for getDefaultVoice().
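For illustration, a caller could then pick a voice for a locale itself (a sketch only; synthesizer and locale are assumed to be available in the surrounding code):

    Voice match = null;
    for (Voice voice : synthesizer.getAvailableVoices()) {
        // first voice whose language matches the requested locale
        if (locale.getLanguage().equals(voice.getLocale().getLanguage())) {
            match = voice;
            break;
        }
    }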

@hkuhn42
Contributor

hkuhn42 commented Feb 17, 2016

@kdavis-mozilla the getDefaultVoice was meant to help in finding the most suitable voice for a given locale. However, I dropped it, as finding a voice for a locale can always be implemented outside of such a service.

Regarding the Voice interface, I see your point, but I would rather not mix the "usage" interface with any kind of voice or configuration management, as this seems to be highly implementation-dependent. E.g. for online services, isLocalVoice() would always have to return false (so in my understanding the whole service always depends on the internet being available). In this case, I would rather raise an exception if no internet is available.
As for using a URI to identify a voice, I am not sure if this is the way it's done in ESH. Being able to identify a voice is definitely something we should make available.
If nobody objects, I will modify my implementation accordingly later today and, if no more comments come in, move it to an ESH fork for further discussion.

@kaikreuzer should we make a new voice bundle or should this be a part of the existing multimedia bundle?

@kdavis-mozilla

@hkuhn42 A few questions/comments...

  • "...as finding a voice for a locale can alwys be implemented outside of such a service" How? If the service doesn't expose this information I don't see how this is done.
  • A GUI/VUI can also use the TTS service to allow selection of possible voices. I see this as a valid usage. Thus, I see the Voice interface as valid interface to have.
  • Why raise an exception instead of having isLocalVoice() return false? A non-local service is not an "abnormal condition".
  • Currently ESH uses "the name of the voice". Normally voices are named things like "Fred", "Alice"... As such names are so generic there may exist name clashes between various ISyntheziserService implementations. The URI disambiguates this. Maybe we can use something else. But a URI is a natural choice.

Before we start on implementation, we need to consider the AudioFormat and AudioSource interfaces in more detail.

@kaikreuzer
Contributor Author

Currently ESH uses "the name of the voice".

We should probably rather talk about an id and a label - the id being the unique technical identifier, the label being the text to show in UIs. Usually, we also have UI labels localizable, but for voices this is imho not needed.
Btw, regarding "meta-information for UIs", we will probably have to define structures as well, e.g. similar to what we have for bindings. But I think we can keep that for a later phase and concentrate on the "internals" for now. What can easily be done for a start, though, is adding a config description, which will make the service pop up in the Paper UI. See here for some hints on how to do this.

isLocalVoice()

I am not sure if we would need this information - so far, we have no indication anywhere of whether something is local or remote. And anyhow, I would assume that it usually applies to the whole TTS service and that it is a rare exception for a single TTS service to have local AND remote voices.

@kaikreuzer should we make a new voice bundle or should this be a part of the existing multimedia bundle?

We need to discuss the namespace for all the new components. I could imagine that we move this out of the "multimedia" package into e.g.
org.eclipse.smarthome.io.voice.tts,
org.eclipse.smarthome.io.voice.sst and
org.eclipse.smarthome.io.voice.interpreter.
Any better suggestions?

@kdavis-mozilla

@kaikreuzer If we want to call the URI the id and the name the label, I'm fine with that.

However, I think keeping the id a URI is a good idea, as it leads to natural namespaces. The MaryTTS group could name their voices something like http://mary.dfki.de/voices/en/unitselection/Fred, while the FreeTTS group could use something like http://freetts.sourceforge.net/Fred, and the two naming schemes would never conflict.

As far as "meta-information for UIs" I agree. But I guess we should get the internals out of the way first.

The reason I thought isLocalVoice() would be useful is for security conscious users. They'd likely not want anything going out over the network.

As for namespaces, @kaikreuzer, whatever you decide upon is fine.

@kaikreuzer
Contributor Author

@kdavis-mozilla Note that for a URI, only scheme and path are mandatory. So mary:Fred and freetts:Fred are valid URIs as well - and these would be much nicer to use e.g. within a textual rule, where you want to pass them as a parameter like say("mary:Fred", "Hello world!").
And again, for this use case, I think using a simple String instead of the Java URI class is easier. For such unique ids, we usually call the field uid, so this might be a good choice here as well. The JavaDoc can then say that it should have a "prefix:voicename" format.
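A minimal sketch of how such a uid could be split up when resolving the target service (the variable names are only illustrative):

    String uid = "mary:Fred"; // documented "prefix:voicename" format
    int separator = uid.indexOf(':');
    String ttsId = uid.substring(0, separator);      // "mary" - identifies the TTS service
    String voiceName = uid.substring(separator + 1); // "Fred" - the voice within that service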

The reason I thought isLocalVoice() would be useful is for security conscious users. They'd likely not want anything going out over the network.

As I said, I think this will usually apply to the whole TTS service, not just to a few voices from it. And then it can be nicely documented for the service itself, i.e. a user who picks it and installs it should already know whether it needs a cloud connection or not.

@kdavis-mozilla

@kaikreuzer We can use a String and impose validity through documentation, works for me.

On isLocalVoice() I guess the missing part for me was the statement...

...And then it can be nicely documented for the service itself...

If that's always the case, we don't need the isLocalVoice() method.

@kdavis-mozilla

Before we jump into implementation, I think we still need to agree on the contents of org.sylvani.audio or, more specifically, what our version of this package will be.

A quick summary of the things we've explicitly used, AudioSource and AudioFormat:

public interface AudioSource {

    /**
     * Returns the human readable name of the source
     *
     * @return the human readable name of the source
     */
    public String getName();

    /**
     * an array containing all supported audio formats
     *
     * @return
     */
    public AudioFormat[] getSupportedFormats();

    /**
     * Gives access to an InputStream for reading audio data, the format is the default format
     *
     * @return InputStream for reading audio data
     * @throws AudioException
     */
    public InputStream getInputStream() throws AudioException;

    /**
     * An InputStream for reading audio data, the format is set to the given format
     *
     * @param format the desired format (one of getSupportedFormats) or null for the default format
     * @return InputStream for reading audio data
     * @throws AudioException
     */
    public InputStream getInputStream(AudioFormat format) throws AudioException;

    /**
     * Load data from this AudioSource to the given {@link AudioOutput} 
     *
     * @param output
     * @throws AudioException
     */
    public void stream(AudioOutput output) throws AudioException;

    /**
     * Returns true if this AudioSource can stream to the given {@link AudioOutput},
     * false otherwise
     * 
     * @param output the AudioOutput to stream to
     * @return true if this AudioSource can stream to the given output
     */
    public boolean canStream(AudioOutput output);
}

public class AudioFormat {

    private AudioCodec codec;
    private AudioContainer container;
    /**
     * bit depth (https://en.wikipedia.org/wiki/Audio_bit_depth)
     * or
     * bit rate (https://en.wikipedia.org/wiki/Bit_rate)
     * depending on codec
     */
    private int bits;
    /**
     * sample frequency
     */
    private long frequency;

    public AudioCodec getCodec() {
        return codec;
    }

    public void setCodec(AudioCodec codec) {
        this.codec = codec;
    }

    public AudioContainer getContainer() {
        return container;
    }

    public void setContainer(AudioContainer container) {
        this.container = container;
    }

    public int getBits() {
        return bits;
    }

    public void setBits(int bits) {
        this.bits = bits;
    }

    public long getFrequency() {
        return frequency;
    }

    public void setFrequency(long rate) {
        this.frequency = rate;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof AudioFormat) {
            AudioFormat format = (AudioFormat) obj;
            if (format.getCodec() != getCodec()) {
                return false;
            }
            if (format.getContainer() != getContainer()) {
                return false;
            }
            if (format.getBits() != getBits()) {
                return false;
            }
            if (format.getFrequency() != getFrequency()) {
                return false;
            }
            return true;
        }
        return super.equals(obj);
    }
}
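Since equals(Object) is overridden here, a matching hashCode() would normally be added as well to keep the equals/hashCode contract; a minimal sketch (not part of the proposal above):

    @Override
    public int hashCode() {
        int result = codec != null ? codec.hashCode() : 0;
        result = 31 * result + (container != null ? container.hashCode() : 0);
        result = 31 * result + bits;
        result = 31 * result + (int) (frequency ^ (frequency >>> 32));
        return result;
    }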

Along with the things we've implicitly used, ignoring exceptions,

public interface AudioOutput {

    /**
     * Returns the human readable name of the output
     *
     * @return the human readable name of the output
     */
    public String getName();

    /**
     * Array containing all supported audio formats this output can process
     *
     * @return
     */
    public AudioFormat[] getSupportedFormats();

    /**
     * An output stream for writing audio data in the default {@link AudioFormat} of this output
     * 
     * @return an {@link OutputStream} 
     * @throws AudioException 
     */
    public OutputStream getOutputStream() throws AudioException;

    /**
     * An output stream for writing audio, the format is set to the given format; throws an {@link UnsupportedAudioFormatException} if the given
     * format is not supported
     *
     * @param format the desired format (one of getSupportedFormats) or null for the default format
     * @return an OutputStream to write data to this output
     * @throws AudioException thrown among other reasons if the given format is not supported
     */
    public OutputStream getOutputStream(AudioFormat format) throws AudioException;

    /**
     * Process audio data from the provided {@link AudioSource}; throws an {@link AudioException} if no matching format is found
     * 
     * @param source
     * @throws AudioException
     */
    public void stream(AudioSource source) throws AudioException;

    /**
     * Returns true if the given {@link AudioSource} can be processed by this output
     * 
     * @param source
     * @return true if the AudioSource can be processed
     */
    public boolean canStream(AudioSource source);
}

public enum AudioCodec {

    /**
     * PCM Signed
     * 
     * http://wiki.multimedia.cx/?title=PCM#PCM_Types
     */
    PCM_SIGNED,

    /**
     * PCM Unsigned
     * 
     * http://wiki.multimedia.cx/?title=PCM#PCM_Types
     */
    PCM_UNSIGNED,

    /**
     * MP3 Codec
     */
    MP3,

    /**
     * Vorbis Codec
     */
    VORBIS
}

public enum AudioContainer {

    /**
     * AudioCodec encoded data without any container header or footer,
     * e.g. MP3 is a non-container format
     */
    NONE,

    /**
     * Microsoft's WAVE container format (the RIFF audio format)
     * http://www.zytrax.com/tech/audio/formats.html#wav-format
     *
     * for a list of codecs supported by WAV see
     * http://www.opennetcf.com/library/sdf/html/60ca47dc-0b9d-2be4-a738-d0080c6fe10c.htm
     */
    WAVE,

    /**
     * http://www.xiph.org/ogg/
     */
    OGG
}

@hkuhn42
Contributor

hkuhn42 commented Feb 21, 2016

@kaikreuzer @kdavis-mozilla

If that's always the case, we don't need the isLocalVoice() method.

This would be a use case for an annotation: we should create two annotations that would mark the services as being @offline and @online. This way, this information could be used by humans and software alike. However, this could possibly be a general functionality that imho would belong in some base package of ESH. Luckily, we have exactly the right developer to ask at hand: @kaikreuzer, what do you think?

About the audio package:

We should obviously change (in both AudioSource and AudioOutput)

public AudioFormat[] getSupportedFormats();
to
public Set<AudioFormat> getSupportedFormats();

As stated somewhere in the javadoc, I ignored the endianness problem in AudioFormat. We should probably add something for that. A boolean would probably be enough (see the sketch at the end of this comment).

and fix the typos in javadoc 😄
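Putting both suggestions together, the affected parts could look roughly like this (a sketch only; a bigEndian flag is just one possible way to model endianness):

    // in AudioSource and AudioOutput
    public Set<AudioFormat> getSupportedFormats();

    // in AudioFormat
    private boolean bigEndian;

    public boolean isBigEndian() {
        return bigEndian;
    }

    public void setBigEndian(boolean bigEndian) {
        this.bigEndian = bigEndian;
    }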

@hkuhn42
Contributor

hkuhn42 commented Feb 21, 2016

As the audio API can be used without TTS or STT (for example to support notification sounds), I suggest we move it to a separate bundle. I created org.eclipse.smarthome.io.audio for that purpose.

@kdavis-mozilla

@kaikreuzer @hkuhn42 A naïve question, but why don't we simply use the standard java package javax.sound.sampled?

As far as I can see, AudioSource would correspond to TargetDataLine and AudioFormat to AudioFormat.

@hkuhn42
Contributor

hkuhn42 commented Feb 21, 2016

@kdavis-mozilla I actually thought a lot about this before I wrote my own API definition.
I came up with a number of points. From my audio readme:

There are multiple reasons for not using the Java Sound API:
a) it is horrible for so many reasons (among them: nobody seems to know why target is for input and source is for output, and there is an AudioInputStream but no AudioOutputStream)
b) it only defines constants and values for the formats and codecs it supports (and it only supports very basic ones)
c) it is not part of compact profile 2, which is the target for Eclipse SmartHome, so any API using Java Sound can only be an extension
d) it does not offer an easy-to-use way to choose the output or input sources / targets
e) Java itself offers a second sound-related API with the Java Media Framework, and neither is widely supported

I actually tried to use it for my prototype, but in the end I wrote this API and created an implementation for the Java Sound API.

@kdavis-mozilla

@hkuhn42 Great! I just wanted to make sure we weren't overlooking the obvious.

@kaikreuzer
Contributor Author

We should create two annotations that would mark the services as being @offline and @online

I still doubt that this is of much value - nobody has ever asked for such a flag for any of the 120+ bindings; it is usually already clear to users what the binding is about and what the technology behind it does. If we really see the need for such formal annotations, we can still add them at any time later.

As the audio api can be used without TTS or STT (for example to support notification sounds) i suggest we move it to a seperate bundle.

Makes sense!

Regarding the interfaces:

  • I think we do not need both AudioSource.stream and AudioOutput.stream. Having the output stream a source (i.e. only AudioOutput.stream(AudioSource)) should be enough, or am I missing something?
    Likewise I would remove AudioSource.canStream.
  • We should replace the name with a technical id and a human-readable label.
  • What about allowing the source to offer different streams/content, e.g. with public InputStream getInputStream(String contentId) (see the sketch after this list)? We would also need additional methods through which an AudioSource can tell which contentIds exist. I am thinking e.g. of different radio stations, an mp3 library with three different doorbell sounds, etc. Or would you want to keep the content selection completely out of the technical interface? But if so, what would be the alternative? Having an AudioSource service registered for every single content option would definitely be too heavy.
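For illustration, the extra methods on AudioSource could look roughly like this (a sketch only, nothing agreed yet):

    /**
     * Ids of the different contents this source offers, e.g. radio stations or door bell sounds
     */
    public Set<String> getContentIds();

    /**
     * Stream for the given content, or the default content if contentId is null
     */
    public InputStream getInputStream(String contentId) throws AudioException;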

@hkuhn42
Contributor

hkuhn42 commented Feb 22, 2016

I think we do not need both AudioSource.stream and AudioOutput.stream. Having the output stream a source should be enough, or am I missing something? Likewise I would remove AudioSource.canStream.

Having stream on both interfaces is just nice and symmetric but apart from that it could easily be removed and replaced with a comment to point this out.

We should replace the name by a technical id and a human-readable label.

Fine with me; if we do not have both (like with an actual file), they can simply be equal.

Or would you want to keep the content selection completely out of the technical interface

Actually yes, I was always thinking about adding a kind of ContentProvider interface that defines content selection. This could also offer named groups / folders and additional metadata about the provider.
That kind of functionality would probably not be needed for services like STT or TTS that do not offer a selection.

I wonder whether we should not separate the TTS and audio features into two issues at this point?

@kdavis-mozilla I think we are now approaching the point where we should move the discussed TTS code into a bundle to make it easier to refine it. What do you think?

@hkuhn42
Contributor

hkuhn42 commented Mar 2, 2016

@kaikreuzer

Which means that the TTSService is the service displayed to the user and thus requires label, etc. and the audioSource is a concrete instance of a given spoken text.

You are right. However, for certain use cases (like a kind of ringtone feature) it might be helpful to have some way of identifying the concrete AudioSource / AudioSink.

With your examples, we are completely getting rid of the AudioSink interface... I wonder if we should not add it back by having the "hand-over" part clearly isolated in our source/sink concept and not requiring all kinds of different service interfaces for special use cases.

Sorry, you are right again, I was taken a bit off track there...

Or am I missing your recent idea of how the AudioSink interface is meant to be used?

No, not at all; imho there is only one thing missing: I would add a way to access an OutputStream as a convenience, so that users of PlaybackService do not have to create an AudioSource first if they choose not to.

So in slight variation:

public interface PlaybackService {
  AudioSink getAudioSink(String location);
  Set<String> getLocations();
}

public interface AudioSink {
   Set<AudioFormat> getSupportedFormats();
   boolean process(AudioSource audioSource);
   OutputStream getOutputStream();
}
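A short usage sketch under these interfaces (playbackService and ttsAudioSource are assumed to exist; "kitchen" is just an example location):

    AudioSink sink = playbackService.getAudioSink("kitchen");
    if (sink != null && sink.process(ttsAudioSource)) {
        // the audio was handed over to the sink
    }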

If you both think this is unnecessary, we can leave the method out.

If you both are happy with this interface and you do not object, I will try to incorporate these changes in my fork and prepare a pull request in the evening (@kaikreuzer: I fear I will need your help with the details).

@kdavis-mozilla

@hkuhn42 I think I'm fine with the current interface

@kaikreuzer
Contributor Author

However for certain use cases (like a kind of ringtones) it might be helpful to have some way of identifying the concrete AudioSource / AudioSink.

Let's only add it when we really need it. Until then, you can use toString() to have a string representing the instance :-)

If you both think this unnecessary we can leave the method out.
If you both are happy with this interface and you do not object

You missed the situation that one of us agrees and the other objects 😎
The problem that I am seeing with getOutputStream() is that a sink might need to provide a specific output stream depending on the audio format. This method now assumes that the stream can directly deal with any content. So to be clean, we would probably rather have to have a getOutputStream(AudioFormat format). Another problem is multi-threading. Assuming the sink is something like a speaker, there can only be a single source stream being processed at a time. The process method supports this nicely, as it can either block until the previous stream is done or internally queue the request. Handing out the output stream to an unknown number of callers can result in pretty bad concurrency problems and unexpected behavior.
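To make the single-stream constraint concrete, one possible shape of a sink whose process method serializes playback (a sketch; playInternal is a hypothetical implementation detail):

    private final Object playbackLock = new Object();

    @Override
    public boolean process(AudioSource audioSource) {
        synchronized (playbackLock) {
            // only one source at a time; a second caller blocks here until the previous one is done
            return playInternal(audioSource);
        }
    }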

@kdavis-mozilla

@kaikreuzer Good catch. I think you're right, getOutputStream() would have to take an AudioFormat:

public interface AudioSink {
   Set<AudioFormat> getSupportedFormats();
   boolean process(AudioSource audioSource);
   OutputStream getOutputStream(AudioFormat audioFormat);
}

The AudioFormat was there in a previous revision but then somehow got lost in the copy and paste.

As to threading and the use of an OutputStream, I guess there are two cases we have to deal with:

  1. An OutputStream instance from a given AudioSink instance is passed from one thread to another and each thread writes to it.
  2. Several threads have OutputStream instances all from a single AudioSink instance and each thread writes to their own OutputStream instance.

For the first case I don't think there is anything special to do. The threads collectively "know" what they are doing as they pass around the single OutputStream instance.

For the second case I also don't see a "problem". I would just expect the AudioSink, if it's a Sonos player say, to play all the audio from the different threads at the same time.

The question is: What if a given AudioSink can only accept input from a single OutputStream at a time? If this were the case, then it should be up to the AudioSink implementation to enforce this.

For example an AudioSink could have a WeakReference<OutputStream> member to the OutputStream, called outputStreamWeakReference say, and throw an exception when getOutputStream() is called and outputStreamWeakReference.get() is not null.
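A sketch of that guard (the exception type and the createOutputStream helper are assumptions, not agreed API):

    private WeakReference<OutputStream> outputStreamWeakReference;

    public synchronized OutputStream getOutputStream(AudioFormat format) throws AudioException {
        if (outputStreamWeakReference != null && outputStreamWeakReference.get() != null) {
            throw new AudioException("An OutputStream is already in use for this sink");
        }
        OutputStream stream = createOutputStream(format); // implementation specific
        outputStreamWeakReference = new WeakReference<>(stream);
        return stream;
    }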

@kaikreuzer
Contributor Author

I would just expect the AudioSink, if it's a Sonos player say, to play all the audio from the different threads at the same time.

My assumption is that this is not possible in 95% of all cases. So the sink can only provide one OutputStream instance at a time and has to make sure that it is only used by one thread.

and throw an exception when getOutputStream() is called

This is pretty ugly for the caller, who would have to retry by himself to find out when the stream might be available again - this is horrifying API design. I would prefer to remove this method and rediscuss it if we really come across a situation where it is desirable to directly obtain the raw output stream.

@kdavis-mozilla

@kaikreuzer

I would prefer to remove this method and rediscuss, if we really come across a situation where it is desirable to directly obtain the raw output stream.

Sounds fine to me.

@hkuhn42
Contributor

hkuhn42 commented Mar 3, 2016

In general, I think that the AudioSource and AudioSink are now quite different in the way they work.

However, I agree with @kaikreuzer that we should just go ahead and see how the API works out when used. In my experience, you learn most about an API when implementing and using it.

@hkuhn42
Contributor

hkuhn42 commented Mar 3, 2016

@kaikreuzer I tried to create a pull request. Please have a look. Thanks!

kdavis-mozilla added a commit to mozilla/smarthome that referenced this issue Mar 17, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
kaikreuzer added a commit that referenced this issue Mar 23, 2016
Initial version of audio API for #584
@kaikreuzer
Contributor Author

Implemented APIs with #1132.
Now I guess we need to adapt implementations/framework as well for it.

@kdavis-mozilla You mentioned that you were porting TTSServiceMacOS - will you create a PR for it? What else do we need?

@kdavis-mozilla

On our fork we've implemented

But, we based all of this off our implementation of the Audio interfaces as we didn't have time to wait for this pull request.

However, our implementation of the Audio interfaces should be the same as the pull request here, mod whitespace issues and the like.

@kaikreuzer
Contributor Author

we didn't have time to wait for this pull request.

Then it is probably a good time now to reconcile things - you should avoid having your fork differ too much from the ESH master. Hope you'll find the time to do PRs for your work!

cdjackson pushed a commit to cdjackson/smarthome that referenced this issue Mar 28, 2016
Bug: eclipse-archived#584
Also-By: Kelly Davis <kdavis@mozilla.com>
Signed-off-by: Harald Kuhn <harald.kuhn@gmail.com>
cdjackson pushed a commit to cdjackson/smarthome that referenced this issue Mar 28, 2016
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 13, 2016
Signed-off-by: Kelly Davis <kdavis@mozilla.com>
tilmankamp pushed a commit to tilmankamp/smarthome that referenced this issue Apr 21, 2016
Bug: eclipse-archived#584
Also-By: Kelly Davis <kdavis@mozilla.com>
Signed-off-by: Harald Kuhn <harald.kuhn@gmail.com>
reitermarkus pushed a commit to reitermarkus/smarthome that referenced this issue Jun 7, 2016
Bug: eclipse-archived#584
Also-By: Kelly Davis <kdavis@mozilla.com>
Signed-off-by: Harald Kuhn <harald.kuhn@gmail.com>
@kgoderis
Contributor

@kdavis-mozilla (@kaikreuzer) Cfr #1200, where do you stand?

@kaikreuzer
Contributor Author

@kgoderis Unfortunately @kdavis-mozilla and team stopped working on these features.
I am currently trying to reconcile their state of work, clean it up and create a PR from it - I hope to have that available by the end of this week, and I hope @hkuhn42 will be able to help review and comment on it then.

@kgoderis
Contributor

@kaikreuzer playSound is M.I.A.? Cfr the initial post in this thread.

@kaikreuzer
Contributor Author

Well, at the moment it is still only available in the compat layer, but I plan to migrate this to the new framework (and have it moved to ESH).

My biggest issue there is that the plain javasound sink does not support mp3s... Therefore I plan to implement an "enhanced" sink in OH2, which uses the jl library, which is used by the "old" Audio action.
And as long as there is no sink that can play mp3s, I didn't bother to migrate the playSound action.

@kgoderis
Contributor

I need it for Sonos in my setup so I will give it a go


@kaikreuzer
Contributor Author

This is now fully in place (through quite a few PRs in the last two weeks, which I don't want to list all here).
