Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RoFormer] Fix some issues #12397

Merged
merged 18 commits into from
Jul 6, 2021
Merged

[RoFormer] Fix some issues #12397

merged 18 commits into from
Jul 6, 2021

Conversation

JunnYu
Copy link
Contributor

@JunnYu JunnYu commented Jun 28, 2021

What does this PR do?

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@patil-suraj

Copy link
Contributor

@NielsRogge NielsRogge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these fixes. 2 main comments:

  • your PR updates the utils_qa.py file of the Tensorflow examples. Any reason why this is the case?
  • the jieba dependency of the tokenizer should be implemented in file_utils.py and testing_utils.py

is_world_process_zero: bool = True,
log_level: Optional[int] = logging.WARNING,
Copy link
Contributor

@NielsRogge NielsRogge Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you updated this? Same question for the lines below

Don't think this RoFormer PR needs to update the utils_qa.py file of the Tensorflow examples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forked transformers before this pr 276bc14.
And this pr do not run command fix-copies.

I run command fix-copies and then this file examples/tensorflow/question-answering/utils_qa.py change.

Comment on lines 146 to 167
import rjieba
import jieba
except ImportError:
raise ImportError(
"You need to install rjieba to use RoFormerTokenizer."
"See https://pypi.org/project/rjieba/ for installation."
"You need to install jieba to use RoFormerTokenizer."
"See https://pypi.org/project/jieba/ for installation."
)
self.jieba = rjieba
self.jieba = jieba
Copy link
Contributor

@NielsRogge NielsRogge Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct way of handling the jieba dependency @LysandreJik?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to handle it this way as it's the only model that requires it - and if other models arrive, then to upstream it like it is done for the other models.

Comment on lines 184 to 191
try:
import rjieba
import jieba
except ImportError:
raise ImportError(
"You need to install rjieba to use RoFormerTokenizer."
"See https://pypi.org/project/rjieba/ for installation."
"You need to install jieba to use RoFormerTokenizer."
"See https://pypi.org/project/jieba/ for installation."
)
self.jieba = rjieba
self.jieba = jieba
Copy link
Contributor

@NielsRogge NielsRogge Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you already have the try: except block at the init of the tokenizer, is this required here?

Comment on lines 25 to 36
def is_rjieba_available():
return importlib.util.find_spec("rjieba") is not None
def is_jieba_available():
return importlib.util.find_spec("jieba") is not None


def require_rjieba(test_case):
def require_jieba(test_case):
"""
Decorator marking a test that requires Jieba. These tests are skipped when Jieba isn't installed.
"""
if not is_rjieba_available():
return unittest.skip("test requires rjieba")(test_case)
if not is_jieba_available():
return unittest.skip("test requires jieba")(test_case)
else:
return test_case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_jieba_available() should not be defined within the test file. Rather, it should be defined in the file_utils.py file, as can be seen here.

require_jieba should be defined in testing_utils.py instead of here. You can check how this was done for the timm library as an example here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have changed this :) .

Comment on lines 73 to 74
is_world_process_zero (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this process is the main process or not (used to determine if logging/saves should be done).
log_level (:obj:`int`, `optional`, defaults to ``logging.WARNING``):
``logging`` log level (e.g., ``logging.WARNING``)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

logger.setLevel(logging.INFO if is_world_process_zero else logging.WARN)
logger.setLevel(log_level)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

Comment on lines 416 to 423
print(f"Saving predictions to {prediction_file}.")
logger.info(f"Saving predictions to {prediction_file}.")
with open(prediction_file, "w") as writer:
writer.write(json.dumps(all_predictions, indent=4) + "\n")
print(f"Saving nbest_preds to {nbest_file}.")
logger.info(f"Saving nbest_preds to {nbest_file}.")
with open(nbest_file, "w") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if version_2_with_negative:
print(f"Saving null_odds to {null_odds_file}.")
logger.info(f"Saving null_odds to {null_odds_file}.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

logger.setLevel(logging.INFO if is_world_process_zero else logging.WARN)
logger.setLevel(log_level)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

is_world_process_zero: bool = True,
log_level: Optional[int] = logging.WARNING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

Comment on lines 283 to 284
is_world_process_zero (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this process is the main process or not (used to determine if logging/saves should be done).
log_level (:obj:`int`, `optional`, defaults to ``logging.WARNING``):
``logging`` log level (e.g., ``logging.WARNING``)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

@JunnYu JunnYu requested a review from NielsRogge July 2, 2021 10:24
Comment on lines 184 to 186
import jieba

self.jieba = jieba
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here for @LysandreJik.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woul dbe nice to leave the try/except statement

@NielsRogge
Copy link
Contributor

Ok thanks, I just will let Lysandre review the try:except block, that's the only thing remaining.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but why did you change rjieba to jieba? Is the latter better?

src/transformers/models/roformer/configuration_roformer.py Outdated Show resolved Hide resolved
Comment on lines 146 to 167
import rjieba
import jieba
except ImportError:
raise ImportError(
"You need to install rjieba to use RoFormerTokenizer."
"See https://pypi.org/project/rjieba/ for installation."
"You need to install jieba to use RoFormerTokenizer."
"See https://pypi.org/project/jieba/ for installation."
)
self.jieba = rjieba
self.jieba = jieba
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to handle it this way as it's the only model that requires it - and if other models arrive, then to upstream it like it is done for the other models.

Comment on lines 184 to 186
import jieba

self.jieba = jieba
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woul dbe nice to leave the try/except statement

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@JunnYu
Copy link
Contributor Author

JunnYu commented Jul 5, 2021

This looks good, but why did you change rjieba to jieba? Is the latter better?

@LysandreJik
I want to use Hosted inference API in https://huggingface.co.
image

I found CpmTokenizer and XLMTokenizer use jieba.

class CpmTokenizer(XLNetTokenizer):

class XLMTokenizer(PreTrainedTokenizer):

@LysandreJik
Copy link
Member

cc @Narsil

@Narsil
Copy link
Contributor

Narsil commented Jul 5, 2021

@JunnYu Should work now:

https://huggingface.co/junnyu/roformer_chinese_base?text=%E7%94%9F%E6%B4%BB%E7%9A%84%E7%9C%9F%E8%B0%9B%E6%98%AF+%5BMASK%5D%E3%80%82

We do update the API regularly with dependencies, rjiebawas added pretty recently.
Cheers !

@JunnYu
Copy link
Contributor Author

JunnYu commented Jul 5, 2021

@Narsil thank you!

@JunnYu
Copy link
Contributor Author

JunnYu commented Jul 5, 2021

@LysandreJik
(1) Now I use rust jieba .
(2) This pr #12361 add test_training_new_tokenizer and test_training_new_tokenizer_with_special_tokens_change.
I found test_tokenization_roformer.py can't pass these two tests. Because roformer tokenizer has a custom PreTokenizer.

tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(JiebaPreTokenizer(vocab))

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM! Thanks @JunnYu!

@LysandreJik LysandreJik merged commit 626a0a0 into huggingface:master Jul 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants