-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add changes for uroman package to handle non-Roman characters #32404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super first draft @nandwalritik! It's very close to being ready 🤗
My only issue with the PR is regarding backwards compatibility (explained below). Otherwise, my only request is slightly more verbose installation instructions in the docs.
Other than that, the PR looks great!
@@ -172,11 +174,12 @@ def prepare_for_tokenization( | |||
filtered_text = self._preprocess_char(text) | |||
|
|||
if has_non_roman_characters(filtered_text) and self.is_uroman: | |||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my only qualm with the PR: it's currently not backwards compatible.
Previously, if a user was omitting the uroman step and inputting non-roman characters, they would have only get a simple warning. Now, the code will error out and stop them from doing this (i.e. breaking backwards compatibility).
I would be in favour of keeping the logger.warning
(rather than an ImportError
) and advising the user to install the uroman
package
logger.warning(
"Text to the tokenizer contains non-Roman characters. To apply the `uroman` pre-processing "
"step automatically, ensure the `uroman` Romanizer is installed with: `pip install uroman`
"Otherwise, apply the Romanizer manually as per the instructions: https://github.com/isi-nlp/uroman
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't noticed the default behaviour, I have modified accordingly.
Now it will only give warning in case the uroman
package is not installed.
Thanks for the inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great after the changes!
docs/source/en/model_doc/vits.md
Outdated
You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable | ||
`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function: | ||
|
||
If the is_uroman attribute is True, the tokenizer will automatically apply the uroman package to your text inputs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we encourage users to install uroman
here and explain why it's required?
pip install --upgrade uroman
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added changes for docs.
@@ -1054,6 +1055,10 @@ def is_phonemizer_available(): | |||
return _phonemizer_available | |||
|
|||
|
|||
def is_uroman_available(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job here!
uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) | ||
|
||
inputs = tokenizer(text=uromaized_text, return_tensors="pt") | ||
inputs = tokenizer(text=text, return_tensors="pt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So much simpler!
Just one doubt here, when I was installing uroman, It requires python version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @nandwalritik! Regarding the requirement of python>=3.10
- let's mention it in the documentation and logging message. If the user has python<3.10
, perhaps we can advise them to use the old method of applying uroman? E.g. using the perl code snippet outside of the tokenizer?
Feel free to request a final review directly from |
ebb09fd
to
7a7ef2d
Compare
Hi @ArthurZucker can you give a final review to the PR, already updated docs as per sanchit's inputs for backward compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for my late review, this looks great to me!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
…gface#32404) * Add changes for uroman package to handle non-Roman characters * Update docs for uroman changes * Modifying error message to warning, for backward compatibility * Update instruction for user to install uroman * Update docs for uroman python version dependency and backward compatibility * Update warning message for python version compatibility with uroman * Refine docs
…gface#32404) * Add changes for uroman package to handle non-Roman characters * Update docs for uroman changes * Modifying error message to warning, for backward compatibility * Update instruction for user to install uroman * Update docs for uroman python version dependency and backward compatibility * Update warning message for python version compatibility with uroman * Refine docs
…gface#32404) * Add changes for uroman package to handle non-Roman characters * Update docs for uroman changes * Modifying error message to warning, for backward compatibility * Update instruction for user to install uroman * Update docs for uroman python version dependency and backward compatibility * Update warning message for python version compatibility with uroman * Refine docs
…gface#32404) * Add changes for uroman package to handle non-Roman characters * Update docs for uroman changes * Modifying error message to warning, for backward compatibility * Update instruction for user to install uroman * Update docs for uroman python version dependency and backward compatibility * Update warning message for python version compatibility with uroman * Refine docs
…gface#32404) * Add changes for uroman package to handle non-Roman characters * Update docs for uroman changes * Modifying error message to warning, for backward compatibility * Update instruction for user to install uroman * Update docs for uroman python version dependency and backward compatibility * Update warning message for python version compatibility with uroman * Refine docs
What does this PR do?
Fixes #32387
Before submitting
Pull Request section?
to it if that's the case. [mms tts] add uroman as a soft-dependency #32387
documentation guidelines, and
here are tips on formatting docstrings. - No
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi @ylacombe
I have added the change for using
uroman
as optional dependency, just want to know if it's correctly implemented.