French g2p with pronunciation dictionary #7601
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
LGTM. Could you please also add some unit tests for the tokenizers?
    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_french_text_preprocessing_lower(self):
        input_text = "pomme banane poire"
        expected_output = "pomme banane poire"

        output = french_text_preprocessing(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_french_text_preprocessing_mixed(self):
        input_text = "BONJOUR le Monde!"
        expected_output = "bonjour le monde!"

        output = french_text_preprocessing(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_french_text_preprocessing_upper(self):
        input_text = "A BIENTÔT."
        expected_output = "a bientôt."

        output = french_text_preprocessing(input_text)
        assert output == expected_output
These tests are for text_preprocessing funcs, not for any tokenize funcs. Could you please revise following the above unit tests?
We can remove the above three unit tests and instead add test examples inside all functions related to any_locale_word_tokenize. For example, you could extend the input_text and expected_output below into lists by adding fr-FR examples.
    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize(self):
        input_text = "apple banana pear"
        expected_output = self._create_expected_output(["apple", " ", "banana", " ", "pear"])
        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_accents(self):
        input_text = "The naïve piñata at the café..."
        expected_output = self._create_expected_output(
            ["The", " ", "naïve", " ", "piñata", " ", "at", " ", "the", " ", "café", "..."]
        )
        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_numbers(self):
        input_text = r"Three times× four^teen ÷divided by [movies] on \slash."
        expected_output = self._create_expected_output(
            [
                "Three",
                " ",
                "times",
                "× ",
                "four",
                "^",
                "teen",
                " ÷",
                "divided",
                " ",
                "by",
                " [",
                "movies",
                "] ",
                "on",
                " \\",
                "slash",
                ".",
            ]
        )
        output = any_locale_word_tokenize(input_text)
        assert output == expected_output
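The list-based pattern suggested in this comment could look roughly like the sketch below. It is only an illustration: toy_word_tokenize is a hypothetical regex stand-in, not NeMo's any_locale_word_tokenize, and its return type (a plain list of strings) is an assumption made so the example runs without NeMo installed. The French inputs are written with escape sequences so the accented letters are unambiguous.

```python
import re


def toy_word_tokenize(text):
    """Hypothetical stand-in tokenizer (NOT the NeMo implementation).

    Splits text into alternating runs of letters and runs of everything else
    (spaces, digits, punctuation).
    """
    return re.findall(r"[^\W\d_]+|[\W\d_]+", text)


# Inputs and expected outputs kept as parallel lists, so adding a locale
# only means appending one entry to each list.
input_texts = [
    "apple banana pear",       # en-US
    "pomme banane poire",      # fr-FR
    "\u00e0 bient\u00f4t.",    # fr-FR with accents: "à bientôt."
]
expected_outputs = [
    ["apple", " ", "banana", " ", "pear"],
    ["pomme", " ", "banane", " ", "poire"],
    ["\u00e0", " ", "bient\u00f4t", "."],
]

for input_text, expected_output in zip(input_texts, expected_outputs):
    assert toy_word_tokenize(input_text) == expected_output
```

In the real test file the loop body would call any_locale_word_tokenize and wrap each expected list with self._create_expected_output, as the existing tests do.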
            non_default_punct_list: List of punctuation marks which will be used instead default.
        """

        fr_alphabet = get_grapheme_character_set(locale="fr-FR", case="mixed")
When you specify case="mixed", the tokenizer will distinguish uppercase and lowercase letters and assign them different token indices. For example, we should expect the final character alphabet to look like A-Za-z. But your French test cases all appear to be lowercase. Could you please double-check whether you want everything lowercased, uppercased, or mixed?
corrected this in newer commit :)
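To make the casing point concrete, here is a small sketch of how the case choice changes the size of a character alphabet and the token indices. build_alphabet and the six-letter toy set are hypothetical and are not NeMo's get_grapheme_character_set; they only illustrate why "mixed" gives each letter two distinct token ids.

```python
def build_alphabet(lowercase_chars, case):
    """Hypothetical helper: derive an alphabet from lowercase letters."""
    lower = set(lowercase_chars)
    upper = {ch.upper() for ch in lower}
    if case == "lower":
        chars = lower
    elif case == "upper":
        chars = upper
    elif case == "mixed":
        chars = lower | upper
    else:
        raise ValueError(f"unsupported case: {case!r}")
    return sorted(chars)


# Toy subset of French letters: a, b, c, à, é, ç (escapes keep them unambiguous).
toy_letters = "abc\u00e0\u00e9\u00e7"

lower_alphabet = build_alphabet(toy_letters, "lower")
mixed_alphabet = build_alphabet(toy_letters, "mixed")

# With "mixed", every letter appears twice, so the vocabulary doubles.
assert len(lower_alphabet) == 6
assert len(mixed_alphabet) == 12

# Upper- and lowercase forms receive different token indices.
token_ids = {ch: idx for idx, ch in enumerate(mixed_alphabet)}
assert token_ids["a"] != token_ids["A"]
```

If the preprocessing step lowercases all input, as the French test cases suggest, the uppercase half of a mixed alphabet would never be used.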
@@ -190,6 +204,8 @@ def get_ipa_punctuation_list(locale):
    elif locale == "es-ES":
        # ref: https://en.wikipedia.org/wiki/Spanish_orthography#Punctuation
        punct_set.update(['¿', '¡'])
    elif locale == "fr-FR":
        punct_set.update(['–', '“', '”', '…', '̀', '́', '̂', '̈', '̧'])
Regarding the extra punctuation marks used in French, could you please add a comment with the Unicode code point for each one, just as "de-DE" does? It is not easy to tell similar-looking characters apart, such as the two below. Thanks!
'‒', # figure dash, U+2012, decimal 8210
'–', # en dash, U+2013, decimal 8211
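Comments in this style can be generated with Python's standard unicodedata module rather than looked up by hand. The sketch below assumes the nine fr-FR additions are the code points listed in FR_EXTRA_PUNCT; that list is a reading of the diff above and should be checked against the actual source file.

```python
import unicodedata

# Assumed code points for the fr-FR additions; verify against the source file.
FR_EXTRA_PUNCT = [
    "\u2013", "\u201c", "\u201d", "\u2026",
    "\u0300", "\u0301", "\u0302", "\u0308", "\u0327",
]


def describe(char):
    """Return 'name, U+XXXX, decimal N' for a single character."""
    code_point = ord(char)
    name = unicodedata.name(char).lower()
    return f"{name}, U+{code_point:04X}, decimal {code_point}"


# Print one annotated list entry per character, ready to paste into the lexicon.
for char in FR_EXTRA_PUNCT:
    print(f"{char!r},  # {describe(char)}")
```

For the en dash, describe returns "en dash, U+2013, decimal 8211", matching the format of the de-DE comments quoted above.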
tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py
LGTM. Thanks!
* enable prondict g2p for fr
* add processing for contractions
* update ipa lexicon
* debug and add tests
* fix alphabet casing
* fix tokenizer utils tests
* add ipa tokenizer test for fr

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
---------
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>