
[Feature] use the Azure TTS API? #1553

Open
xiaoyifang opened this issue Jun 11, 2024 · 18 comments

@xiaoyifang
Owner

xiaoyifang commented Jun 11, 2024

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-text-to-speech?tabs=windows%2Cterminal&pivots=programming-language-cli#prerequisites

Microsoft's TTS offers very high quality audio; it may be worth implementing as a feature.

Users can select text and use the right-click menu to pronounce it with the above engine.

The implementation can wrap the CLI command or use the provided C++ SDK.

[screenshot]

[screenshot]

xiaoyifang added the “help wanted (PR welcomed)” label on Jun 11, 2024
@shenlebantongying
Collaborator

shenlebantongying commented Jun 13, 2024

The existing forvo/lingualibre are essentially the same.

We could merge them into "Online TTS".


Related art: this popular Anki add-on provides TTS from many services (Azure included):
https://ankiweb.net/shared/info/1436550454 (It has some cringey ads but works OK; I used it a few times in the past.)

Maybe we can copy its UI.

On the left side you can select a service and set various related parameters.

@xiaoyifang
Owner Author

xiaoyifang commented Jun 13, 2024

The existing forvo/lingualibre are essentially the same.

Not the same.
forvo/lingualibre are used for single words and are displayed as a separate dictionary.

Azure TTS can be used for articles or sentences, which means it can work across different dictionaries.

Maybe we can copy its UI.

We can keep the configuration to a minimum; speed and pitch can even be left out.
[screenshot]

@shenlebantongying
Collaborator

shenlebantongying commented Jun 13, 2024

We can keep the configuration to a minimum; speed and pitch can even be left out.

What I really mean is that we shouldn't limit this feature to one specific service provider.

The implementation should make adding new service providers easy 😅

Adding new parameters shouldn't be much harder either, because it is mostly just composing new query URLs.

[screenshot]

xiaoyifang changed the title from “[Feature] use the microsoft text to speech API?” to “[Feature] use the Azure TTS API?” on Jun 13, 2024
@xiaoyifang
Owner Author

Though a bit ambitious at first, I have no objection to this. :-)

@shenlebantongying
Collaborator

After some investigation, I find that this feature should not be implemented with the current dictionary.hh facilities.

Websites/Programs/TTS/Transliteration are inherently different from the other local-storage-based dictionaries.

It was a mistake to merge them into one. All implementations of those “dictionary but actually not” are messy AF. Websites/Programs/TTS/Transliteration were an afterthought in the design of dictionary.hh.

Two options:

  1. Have one single dedicated object that inherits nothing to handle this feature.
  2. Plug it into the current "dictionary.hh" monstrosity.

I find that doing 1 (i.e., writing from scratch) is 10x simpler than 2.


Leaky abstraction in action:

For example, how do you extend the properties of a dictionary with dictionary.hh? Instead of putting properties into the dictionary class, they all live in config.hh. Websites/Programs/TTS/Transliteration need extra properties, so we have the lines below.

dictionary.hh is abstract enough to have a "toHTML" method, but also concrete enough to have "dictionary files", which Websites/Programs/TTS/Transliteration don't have (so they all have to return empty).

goldendict-ng/src/config.hh, lines 448 to 824 at commit 6a91c6b:

```cpp
/// A MediaWiki network dictionary definition
struct MediaWiki
{
  QString id, name, url;
  bool enabled;
  QString icon;
  QString lang;
  MediaWiki():
    enabled( false )
  {
  }
  MediaWiki( QString const & id_,
             QString const & name_,
             QString const & url_,
             bool enabled_,
             QString const & icon_,
             QString const & lang_ = "" ):
    id( id_ ),
    name( name_ ),
    url( url_ ),
    enabled( enabled_ ),
    icon( icon_ ),
    lang( lang_ )
  {
  }
  bool operator==( MediaWiki const & other ) const
  {
    return id == other.id && name == other.name && url == other.url && enabled == other.enabled && icon == other.icon
      && lang == other.lang;
  }
};

/// Any website which can be queried though a simple template substitution
struct WebSite
{
  QString id, name, url;
  bool enabled;
  QString iconFilename;
  bool inside_iframe;
  WebSite():
    enabled( false )
  {
  }
  WebSite( QString const & id_,
           QString const & name_,
           QString const & url_,
           bool enabled_,
           QString const & iconFilename_,
           bool inside_iframe_ ):
    id( id_ ),
    name( name_ ),
    url( url_ ),
    enabled( enabled_ ),
    iconFilename( iconFilename_ ),
    inside_iframe( inside_iframe_ )
  {
  }
  bool operator==( WebSite const & other ) const
  {
    return id == other.id && name == other.name && url == other.url && enabled == other.enabled
      && iconFilename == other.iconFilename && inside_iframe == other.inside_iframe;
  }
};
/// All the WebSites
typedef QVector< WebSite > WebSites;

/// Any DICT server
struct DictServer
{
  QString id, name, url;
  bool enabled;
  QString databases;
  QString strategies;
  QString iconFilename;
  DictServer():
    enabled( false )
  {
  }
  DictServer( QString const & id_,
              QString const & name_,
              QString const & url_,
              bool enabled_,
              QString const & databases_,
              QString const & strategies_,
              QString const & iconFilename_ ):
    id( id_ ),
    name( name_ ),
    url( url_ ),
    enabled( enabled_ ),
    databases( databases_ ),
    strategies( strategies_ ),
    iconFilename( iconFilename_ )
  {
  }
  bool operator==( DictServer const & other ) const
  {
    return id == other.id && name == other.name && url == other.url && enabled == other.enabled
      && databases == other.databases && strategies == other.strategies && iconFilename == other.iconFilename;
  }
};
/// All the DictServers
typedef QVector< DictServer > DictServers;

/// Hunspell configuration
struct Hunspell
{
  QString dictionariesPath;
  typedef QVector< QString > Dictionaries;
  Dictionaries enabledDictionaries;
  bool operator==( Hunspell const & other ) const
  {
    return dictionariesPath == other.dictionariesPath && enabledDictionaries == other.enabledDictionaries;
  }
  bool operator!=( Hunspell const & other ) const
  {
    return !operator==( other );
  }
};
/// All the MediaWikis
typedef QVector< MediaWiki > MediaWikis;

/// Chinese transliteration configuration
struct Chinese
{
  bool enable;
  bool enableSCToTWConversion;
  bool enableSCToHKConversion;
  bool enableTCToSCConversion;
  Chinese();
  bool operator==( Chinese const & other ) const
  {
    return enable == other.enable && enableSCToTWConversion == other.enableSCToTWConversion
      && enableSCToHKConversion == other.enableSCToHKConversion
      && enableTCToSCConversion == other.enableTCToSCConversion;
  }
  bool operator!=( Chinese const & other ) const
  {
    return !operator==( other );
  }
};

struct CustomTrans
{
  bool enable = false;
  QString context;
  bool operator==( CustomTrans const & other ) const
  {
    return enable == other.enable && context == other.context;
  }
  bool operator!=( CustomTrans const & other ) const
  {
    return !operator==( other );
  }
};

/// Romaji transliteration configuration
struct Romaji
{
  bool enable;
  bool enableHepburn;
  bool enableNihonShiki;
  bool enableKunreiShiki;
  bool enableHiragana;
  bool enableKatakana;
  Romaji();
  bool operator==( Romaji const & other ) const
  {
    return enable == other.enable && enableHepburn == other.enableHepburn && enableNihonShiki == other.enableNihonShiki
      && enableKunreiShiki == other.enableKunreiShiki && enableHiragana == other.enableHiragana
      && enableKatakana == other.enableKatakana;
  }
  bool operator!=( Romaji const & other ) const
  {
    return !operator==( other );
  }
};

struct Transliteration
{
  bool enableRussianTransliteration;
  bool enableGermanTransliteration;
  bool enableGreekTransliteration;
  bool enableBelarusianTransliteration;
  CustomTrans customTrans;
#ifdef MAKE_CHINESE_CONVERSION_SUPPORT
  Chinese chinese;
#endif
  Romaji romaji;
  bool operator==( Transliteration const & other ) const
  {
    return enableRussianTransliteration == other.enableRussianTransliteration
      && enableGermanTransliteration == other.enableGermanTransliteration
      && enableGreekTransliteration == other.enableGreekTransliteration
      && enableBelarusianTransliteration == other.enableBelarusianTransliteration && customTrans == other.customTrans &&
#ifdef MAKE_CHINESE_CONVERSION_SUPPORT
      chinese == other.chinese &&
#endif
      romaji == other.romaji;
  }
  bool operator!=( Transliteration const & other ) const
  {
    return !operator==( other );
  }
  Transliteration():
    enableRussianTransliteration( false ),
    enableGermanTransliteration( false ),
    enableGreekTransliteration( false ),
    enableBelarusianTransliteration( false )
  {
  }
};

struct Lingua
{
  bool enable;
  QString languageCodes;
  bool operator==( Lingua const & other ) const
  {
    return enable == other.enable && languageCodes == other.languageCodes;
  }
  bool operator!=( Lingua const & other ) const
  {
    return !operator==( other );
  }
};

struct Forvo
{
  bool enable;
  QString apiKey;
  QString languageCodes;
  Forvo():
    enable( false )
  {
  }
  bool operator==( Forvo const & other ) const
  {
    return enable == other.enable && apiKey == other.apiKey && languageCodes == other.languageCodes;
  }
  bool operator!=( Forvo const & other ) const
  {
    return !operator==( other );
  }
};

struct Program
{
  bool enabled;
  enum Type {
    Audio,
    PlainText,
    Html,
    PrefixMatch,
    MaxTypeValue
  } type;
  QString id, name, commandLine;
  QString iconFilename;
  Program():
    enabled( false )
  {
  }
  Program( bool enabled_,
           Type type_,
           QString const & id_,
           QString const & name_,
           QString const & commandLine_,
           QString const & iconFilename_ ):
    enabled( enabled_ ),
    type( type_ ),
    id( id_ ),
    name( name_ ),
    commandLine( commandLine_ ),
    iconFilename( iconFilename_ )
  {
  }
  bool operator==( Program const & other ) const
  {
    return enabled == other.enabled && type == other.type && name == other.name && commandLine == other.commandLine
      && iconFilename == other.iconFilename;
  }
  bool operator!=( Program const & other ) const
  {
    return !operator==( other );
  }
};
typedef QVector< Program > Programs;

#ifndef NO_TTS_SUPPORT
struct VoiceEngine
{
  bool enabled;
  // engine name.
  QString engine_name;
  QString name;
  // voice name.
  QString voice_name;
  QString iconFilename;
  QLocale locale;
  int volume; // 0~1 allowed
  int rate;   // -1 ~ 1 allowed
  VoiceEngine():
    enabled( false ),
    volume( 50 ),
    rate( 0 )
  {
  }
  VoiceEngine( QString engine_nane_, QString name_, QString voice_name_, QLocale locale_, int volume_, int rate_ ):
    enabled( false ),
    engine_name( engine_nane_ ),
    name( name_ ),
    voice_name( voice_name_ ),
    locale( locale_ ),
    volume( volume_ ),
    rate( rate_ )
  {
  }
  bool operator==( VoiceEngine const & other ) const
  {
    return enabled == other.enabled && engine_name == other.engine_name && name == other.name
      && voice_name == other.voice_name && locale == other.locale && iconFilename == other.iconFilename
      && volume == other.volume && rate == other.rate;
  }
  bool operator!=( VoiceEngine const & other ) const
  {
    return !operator==( other );
  }
};
typedef QVector< VoiceEngine > VoiceEngines;
#endif
```

@xiaoyifang
Owner Author

After some investigation, I find this feature should not be implemented with the current dictionary.hh facilities.

I agree with this. Azure TTS can be used across dictionaries and act on its own. It can be exposed as a single function (for example, in the right-click context menu).

@shenlebantongying
Collaborator

Not sure about the user experience. Azure TTS's endpoint depends on the region, so a user needs to copy both the endpoint and the API key into a super condensed interface 😅


Using this hurl file (https://hurl.dev/):

```
POST {{endpoint}}/cognitiveservices/v1
Ocp-Apim-Subscription-Key: ${Your key here}
X-Microsoft-OutputFormat: ogg-48khz-16bit-mono-opus
Content-Type: application/ssml+xml
User-Agent: WhatEver
<speak version='1.0' xml:lang='en-US'>
    <voice name='en-US-LunaNeural'>
        {{sentence}}
    </voice>
</speak>
```

with

```
hurl ./voice.hurl --variable endpoint="https://eastus.api.cognitive.microsoft.com/" --variable sentence="This is nice!" --output nice.ogg
```

will yield an audio file.

The {{endpoint}} is obtained from the screenshot.
The voice names can be obtained from {{endpoint}}/cognitiveservices/voices/list.


It seems all cloud TTS services support the same "SSML" markup:

https://cloud.google.com/text-to-speech/docs/ssml
https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup
https://docs.aws.amazon.com/polly/latest/dg/ssml.html
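For illustration, the request above can be sketched in Python. The endpoint, key, and voice name are placeholders, and the real app would presumably issue the request via its own HTTP stack (e.g. QNetworkAccessManager) rather than Python; this sketch only assembles the payload and headers without sending anything:

```python
# Sketch of the Azure TTS request shown in the hurl file above.
# Endpoint, API key, and voice name are placeholder values.

def build_ssml(sentence: str, voice: str = "en-US-LunaNeural", lang: str = "en-US") -> str:
    """Build the SSML payload expected by POST {endpoint}/cognitiveservices/v1."""
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice name='{voice}'>{sentence}</voice>"
        "</speak>"
    )


def build_request(endpoint: str, api_key: str, sentence: str) -> dict:
    """Assemble the URL, headers, and body; actually sending it is left to the caller."""
    return {
        "url": endpoint.rstrip("/") + "/cognitiveservices/v1",
        "headers": {
            "Ocp-Apim-Subscription-Key": api_key,
            "X-Microsoft-OutputFormat": "ogg-48khz-16bit-mono-opus",
            "Content-Type": "application/ssml+xml",
            "User-Agent": "WhatEver",
        },
        "body": build_ssml(sentence),
    }
```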

@xiaoyifang
Owner Author

xiaoyifang commented Jun 19, 2024

A little UI improvement:

`POST {{endpoint}}/cognitiveservices/v1`

can be

`POST https://{{region}}.api.cognitive.microsoft.com/cognitiveservices/v1`

Users can then pick the region from a dropdown list of fixed, predefined values.

Voices can also be provided as fixed values in advance.
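A quick sketch of that idea; the region list here is a small illustrative subset, not the full set of Azure regions:

```python
# Sketch: deriving the regional endpoint from a fixed dropdown of regions.
# AZURE_REGIONS is an illustrative subset only.

AZURE_REGIONS = ["eastus", "westus", "westeurope", "southeastasia"]


def endpoint_for_region(region: str) -> str:
    """Map a dropdown selection to the region-specific API endpoint."""
    if region not in AZURE_REGIONS:
        raise ValueError(f"unknown region: {region}")
    return f"https://{region}.api.cognitive.microsoft.com"


def voices_list_url(region: str) -> str:
    """URL of the voice list; it can be fetched once to populate a voice dropdown."""
    return endpoint_for_region(region) + "/cognitiveservices/voices/list"
```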

@shenlebantongying shenlebantongying self-assigned this Jun 19, 2024
@shenlebantongying
Collaborator

I think we can add this one directly under the “Edit” menu instead of “Edit -> Dictionaries”.

Most things on the right side are only "somewhat a dictionary". It was a mistake for morphology and transliteration; they cannot even be shown as an article.


Furthermore, as a separate component, the config can also be separated into a new file beside the main config -> config_cloud_tts.xml.

The timing of saving/reading the config of different components is not entirely the same.

For example, MainWindowGeometry needs to be read/written at program shutdown while the dictionaries don't, so there is no point in putting them in a single config. The “ominous” commitData is overused. Opening/closing the editdictionaries dialog requires initializing/mutating excessive state (e.g., a crash of Qt TTS can bring down the entire dialog).

Putting it somewhere separate also makes adding/removing the feature entirely easy: there is no need for #if feature_x macros, and no need to carefully think about how to plug new feature things into existing ones. Right now it is not clear how to add a config option without jumping around and reading everything that came before.

Everything related to one component in one place

vs

orgy of features


TTS dialog -> read/write config
The TTS engine -> read config

(Side effect: this also makes it easy to build this component as a separate program.)

@xiaoyifang
Owner Author

Most things on the right side are only "somewhat a dictionary". It is a mistake for the morphology and transliteration, they cannot even be shown as an article.

Move them to Edit -> Preferences?

@xiaoyifang
Owner Author

xiaoyifang commented Jun 20, 2024

Furthermore, as a separate component, the config can also be separated into a new file beside config -> config_cloud_tts.xml.

This can be considered. Azure TTS can have its own config file.
It does not have to implement dictionary.hh.

@shenlebantongying
Collaborator

shenlebantongying commented Jun 20, 2024

It would not be hard to replicate AwesomeTTS's audio preview pane 😅

Progress for today: a little app, https://github.com/SourceReviver/temp_ctts_impl

@xiaoyifang
Owner Author

Do you have time to implement this feature?

@shenlebantongying
Collaborator

shenlebantongying commented Jun 25, 2024

I think https://github.com/SourceReviver/temp_ctts_impl is complete as the initial version of this feature.

However, I need to prepare for an exam on Friday, so I will prepare a PR this weekend 😅

@xiaoyifang
Owner Author

Exam first. The PR can wait.

@shenlebantongying shenlebantongying removed their assignment Jul 2, 2024
@shenlebantongying shenlebantongying self-assigned this Jul 13, 2024
@shenlebantongying shenlebantongying removed their assignment Jul 16, 2024
@shenlebantongying
Collaborator

shenlebantongying commented Jul 17, 2024

Offer the user a way to stop playing the current selection, especially when the selection will not finish playing soon.

Refactoring the current "Pronounce" button on the toolbar is needed, I think? Currently, the length of the selection that can be pronounced is limited to 60. We can implement "stop" later.

I don't have another chunk of time until at least August 13. #1685 is usable as of now. Not sure how we should proceed. 😅

@xiaoyifang
Owner Author

Offer the user a way to stop playing the current selection, especially when the selection will not finish playing soon.

Refactoring the current "Pronounce" button on the toolbar is needed, I think? Currently, the length of the selection that can be pronounced is limited to 60. We can implement "stop" later.

I don't have another chunk of time until at least August 13. #1685 is usable as of now. Not sure how we should proceed. 😅

That's OK; I can continue working on it when I'm available.
