Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

French g2p with pronunciation dictionary #7601

Merged
merged 16 commits into from
Oct 20, 2023
Merged

French g2p with pronunciation dictionary #7601

merged 16 commits into from
Oct 20, 2023

Conversation

mgrafu
Copy link
Collaborator

@mgrafu mgrafu commented Oct 2, 2023

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
@mgrafu mgrafu changed the title French g2p French g2p with pronunciation dictionary Oct 2, 2023
mgrafu and others added 4 commits October 3, 2023 14:06
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
@XuesongYang XuesongYang self-assigned this Oct 7, 2023
Comment on lines 124 to 150

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_french_text_preprocessing_lower(self):
input_text = "pomme banane poire"
expected_output = "pomme banane poire"

output = french_text_preprocessing(input_text)
assert output == expected_output

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_french_text_preprocessing_mixed(self):
input_text = "BONJOUR le Monde!"
expected_output = "bonjour le monde!"

output = french_text_preprocessing(input_text)
assert output == expected_output

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_french_text_preprocessing_upper(self):
input_text = "A BIENTÔT."
expected_output = "a bientôt."

output = french_text_preprocessing(input_text)
assert output == expected_output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are for text_preprocessing funcs, not for any tokenize funcs. Could you pls revise following the above unit tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove the above three unit tests, and instead add test examples inside all functions related to any_locale_word_tokenize. For example, you could extend below input_text and expected_output as a list by adding fr-fr examples.

@pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize(self):
        input_text = "apple banana pear"
        expected_output = self._create_expected_output(["apple", " ", "banana", " ", "pear"])

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_accents(self):
        input_text = "The naïve piñata at the café..."
        expected_output = self._create_expected_output(
            ["The", " ", "naïve", " ", "piñata", " ", "at", " ", "the", " ", "café", "..."]
        )

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_any_locale_word_tokenize_with_numbers(self):
        input_text = r"Three times× four^teen ÷divided by [movies] on \slash."
        expected_output = self._create_expected_output(
            [
                "Three",
                " ",
                "times",
                "× ",
                "four",
                "^",
                "teen",
                " ÷",
                "divided",
                " ",
                "by",
                " [",
                "movies",
                "] ",
                "on",
                " \\",
                "slash",
                ".",
            ]
        )

        output = any_locale_word_tokenize(input_text)
        assert output == expected_output

non_default_punct_list: List of punctuation marks which will be used instead default.
"""

fr_alphabet = get_grapheme_character_set(locale="fr-FR", case="mixed")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you specified case="mixed", meaning the tokenizer will distinguish uppercase and lowercase letters, and assign different token index. For example, we should expect the final char alphabet like A-Za-z. But from your test cases for French, I feel all are lowercase. Could you please double check either you want all lowercased, uppercased, or mixed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected this in newer commit :)

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
@@ -190,6 +204,8 @@ def get_ipa_punctuation_list(locale):
elif locale == "es-ES":
# ref: https://en.wikipedia.org/wiki/Spanish_orthography#Punctuation
punct_set.update(['¿', '¡'])
elif locale == "fr-FR":
punct_set.update(['–', '“', '”', '…', '̀', '́', '̂', '̈', '̧'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the extra punctuations used in French, could you pls add comments of unicode for each punctuation just as what "de-DE" did? It is not easy to pinpoint the difference between similar surfaces, such as below. Thanks!

'‒',  # figure dash, U+2012, decimal 8210
'–',  # en dash, U+2013, decimal 8211

mgrafu and others added 7 commits October 18, 2023 11:41
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Copy link
Collaborator

@XuesongYang XuesongYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@XuesongYang XuesongYang merged commit 5895a57 into main Oct 20, 2023
12 checks passed
@XuesongYang XuesongYang deleted the french_g2p branch October 20, 2023 19:03
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
* enable prondict g2p for fr
* add processing for contractions
* update ipa lexicon
* debug and add tests
* fix alphabet casing
* fix tokenizer utils tests
* add ipa tokenizer test for fr

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

---------

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* enable prondict g2p for fr
* add processing for contractions
* update ipa lexicon
* debug and add tests
* fix alphabet casing
* fix tokenizer utils tests
* add ipa tokenizer test for fr

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

---------

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants