French g2p with pronunciation dictionary #7601

mgrafu · 2023-10-02T19:57:46Z

XuesongYang · 2023-10-02T20:54:41Z

LGTM. Could you please also add some unit tests for the tokenizers?

XuesongYang · 2023-10-07T00:24:53Z

tests/collections/common/tokenizers/text_to_speech/test_tokenizer_utils.py

+
+    @pytest.mark.run_only_on('CPU')
+    @pytest.mark.unit
+    def test_french_text_preprocessing_lower(self):
+        input_text = "pomme banane poire"
+        expected_output = "pomme banane poire"
+
+        output = french_text_preprocessing(input_text)
+        assert output == expected_output
+
+    @pytest.mark.run_only_on('CPU')
+    @pytest.mark.unit
+    def test_french_text_preprocessing_mixed(self):
+        input_text = "BONJOUR le Monde!"
+        expected_output = "bonjour le monde!"
+
+        output = french_text_preprocessing(input_text)
+        assert output == expected_output
+
+    @pytest.mark.run_only_on('CPU')
+    @pytest.mark.unit
+    def test_french_text_preprocessing_upper(self):
+        input_text = "A BIENTÔT."
+        expected_output = "a bientôt."
+
+        output = french_text_preprocessing(input_text)
+        assert output == expected_output


These tests cover the text_preprocessing functions, not any of the tokenize functions. Could you please revise them, following the existing unit tests above?

We can remove the three unit tests above and instead add test examples to every test related to any_locale_word_tokenize. For example, you could turn input_text and expected_output in the tests below into lists and add fr-FR examples to them.

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize(self):
        input_text = "apple banana pear"
        expected_output = self._create_expected_output(["apple", " ", "banana", " ", "pear"])

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_accents(self):
        input_text = "The naïve piñata at the café..."
        expected_output = self._create_expected_output(
            ["The", " ", "naïve", " ", "piñata", " ", "at", " ", "the", " ", "café", "..."]
        )

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_numbers(self):
        input_text = r"Three times× four^teen ÷divided by [movies] on \slash."
        expected_output = self._create_expected_output(
            [
                "Three",
                " ",
                "times",
                "× ",
                "four",
                "^",
                "teen",
                " ÷",
                "divided",
                " ",
                "by",
                " [",
                "movies",
                "] ",
                "on",
                " \\",
                "slash",
                ".",
            ]
        )

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output
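The list-driven layout suggested in this thread could look roughly like the sketch below. Note that `simple_word_tokenize` is a hypothetical regex-based stand-in so the snippet runs without NeMo; it does not reproduce the behavior or the return type of `any_locale_word_tokenize`, and the French strings are illustrative rather than taken from the PR.

```python
import re
import unicodedata
from typing import List, Tuple

# Hypothetical stand-in tokenizer (NOT NeMo's any_locale_word_tokenize):
# a token is either a run of Unicode letters or a run of anything else.
_TOKEN_RE = re.compile(r"[^\W\d_]+|[\W\d_]+")


def simple_word_tokenize(text: str) -> List[str]:
    # NFC keeps accented letters such as "ô" as single precomposed characters.
    return _TOKEN_RE.findall(unicodedata.normalize("NFC", text))


# (input_text, expected_output) pairs; covering a new locale means adding rows.
CASES: List[Tuple[str, List[str]]] = [
    ("apple banana pear", ["apple", " ", "banana", " ", "pear"]),
    ("pomme banane poire", ["pomme", " ", "banane", " ", "poire"]),
    ("à bientôt.", ["à", " ", "bientôt", "."]),
]


def test_simple_word_tokenize_cases():
    # One test body exercises every row of the table.
    for input_text, expected_output in CASES:
        assert simple_word_tokenize(input_text) == expected_output
```

In a pytest suite the same table could also be fed to `@pytest.mark.parametrize`, which reports each row as a separate test case.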

XuesongYang · 2023-10-07T00:27:30Z

nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py

+            non_default_punct_list: List of punctuation marks which will be used instead default.
+        """
+
+        fr_alphabet = get_grapheme_character_set(locale="fr-FR", case="mixed")


Specifying case="mixed" means the tokenizer will distinguish uppercase from lowercase letters and assign them different token indices, so we should expect the final character alphabet to look like A-Za-z. But judging from your French test cases, everything seems to be lowercased. Could you please double-check whether you want all lowercase, all uppercase, or mixed?

mgrafu (Oct 11, 2023): Corrected this in a newer commit :)
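The effect of the `case` argument discussed in this thread can be illustrated with a small sketch. `build_grapheme_set` and `token_ids` below are hypothetical helpers written for this example, not NeMo's `get_grapheme_character_set`; the only assumption carried over is that "mixed" keeps both cases as separate symbols, as the review comment describes.

```python
from typing import Dict, List


def build_grapheme_set(lowercase_letters: str, case: str = "lower") -> List[str]:
    """Hypothetical helper: expand a lowercase alphabet according to `case`."""
    lower = set(lowercase_letters.lower())
    upper = set(lowercase_letters.upper())
    if case == "lower":
        chars = lower
    elif case == "upper":
        chars = upper
    elif case == "mixed":
        chars = lower | upper
    else:
        raise ValueError(f"Unsupported case: {case!r}")
    return sorted(chars)


def token_ids(alphabet: List[str]) -> Dict[str, int]:
    # Each symbol in the alphabet gets its own index.
    return {symbol: index for index, symbol in enumerate(alphabet)}


mixed_ids = token_ids(build_grapheme_set("abc", case="mixed"))
lower_ids = token_ids(build_grapheme_set("abc", case="lower"))

# With "mixed", "A" and "a" are distinct tokens; with "lower", "A" is not in
# the alphabet at all, so input text has to be lowercased before tokenizing.
print(len(mixed_ids), len(lower_ids))
print(mixed_ids["A"] != mixed_ids["a"], "A" in lower_ids)
```

If the tests only ever feed lowercased text, a mixed-case alphabet doubles the letter inventory without the uppercase half being exercised, which is presumably why the reviewer asked for the choice to be made explicit.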

XuesongYang · 2023-10-13T18:13:59Z

nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py

@@ -190,6 +204,8 @@ def get_ipa_punctuation_list(locale):
    elif locale == "es-ES":
        # ref: https://en.wikipedia.org/wiki/Spanish_orthography#Punctuation
        punct_set.update(['¿', '¡'])
+    elif locale == "fr-FR":
+        punct_set.update(['–', '“', '”', '…', '̀', '́', '̂', '̈', '̧'])


Regarding the extra punctuation marks used for French, could you please add a comment giving the Unicode code point of each one, as was done for "de-DE"? It is hard to tell apart characters that look alike, such as the two below. Thanks!

    '‒',  # figure dash, U+2012, decimal 8210
    '–',  # en dash, U+2013, decimal 8211
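Rather than typing such comments by hand, one option is to look each character up with Python's standard `unicodedata` module. The helper name `describe_char` and the output format are choices made for this sketch. The characters are written as escapes so that similar-looking glyphs cannot be confused, and the code points listed are a reading of the quoted diff that is worth double-checking against the source file.

```python
import unicodedata


def describe_char(ch: str) -> str:
    """Return e.g. 'en dash, U+2013, decimal 8211' for a single character."""
    name = unicodedata.name(ch, "unnamed").lower()
    return f"{name}, U+{ord(ch):04X}, decimal {ord(ch)}"


# Assumed code points for the fr-FR additions: en dash, curly double quotes,
# horizontal ellipsis, and five combining diacritics.
fr_extra_punct = [
    "\u2013", "\u201c", "\u201d", "\u2026",
    "\u0300", "\u0301", "\u0302", "\u0308", "\u0327",
]

for ch in fr_extra_punct:
    print(describe_char(ch))
```

Each printed line can be pasted after the matching list entry as a trailing comment, in the same style as the two dash examples above.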

XuesongYang

LGTM. Thanks!

* enable prondict g2p for fr
* add processing for contractions
* update ipa lexicon
* debug and add tests
* fix alphabet casing
* fix tokenizer utils tests
* add ipa tokenizer test for fr

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
---------
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

mgrafu added 2 commits September 25, 2023 10:18

enable prondict g2p for fr

8306074

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

add processing for contractions

7c98e3f

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

github-actions bot added the TTS and common labels on Oct 2, 2023

Merge branch 'main' into french_g2p

762f6cc

Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>

mgrafu changed the title from "French g2p" to "French g2p with pronunciation dictionary" on Oct 2, 2023

[pre-commit.ci] auto fixes from pre-commit.com hooks

e397057

for more information, see https://pre-commit.ci

XuesongYang reviewed Oct 2, 2023

nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py (outdated, resolved)

XuesongYang reviewed Oct 2, 2023

nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py (outdated, resolved)

mgrafu and others added 4 commits October 3, 2023 14:06

update ipa lexicon

df6256c

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

debug and add tests

800f022

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6f5045d

for more information, see https://pre-commit.ci

Merge branch 'main' into french_g2p

2ab2188

XuesongYang self-assigned this Oct 7, 2023

XuesongYang reviewed Oct 7, 2023

fix alphabet casing

ef158bc

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

XuesongYang reviewed Oct 13, 2023

nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py (resolved)

XuesongYang reviewed Oct 13, 2023

tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py (resolved)

mgrafu and others added 7 commits October 18, 2023 11:41

fix tokenizer utils tests

3f60c6f

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

add comments to additional punctuation

fac9077

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

ed5c6e7

for more information, see https://pre-commit.ci

add ipa tokenizer test for fr

d1fcae0

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

Merge branch 'french_g2p' of github.com-mgrafu:NVIDIA/NeMo into frenc…

9814a82

…h_g2p

Merge branch 'main' into french_g2p

a6f99b2

Merge branch 'main' into french_g2p

18ed838

XuesongYang approved these changes Oct 20, 2023

XuesongYang merged commit 5895a57 into main Oct 20, 2023
12 checks passed

XuesongYang deleted the french_g2p branch October 20, 2023 19:03
