
Update RoBERTa vocabulary files #255

Merged (6 commits) on Nov 30, 2019

Conversation

gpengzhi (Collaborator)

# Tokenization with fairseq's pretrained RoBERTa
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

tokens = roberta.encode('Hello world!')
print(tokens)  # [    0, 31414,   232,   328,     2]

# Tokenization with Texar-PyTorch's RoBERTaTokenizer
import texar.torch as tx
tokenizer = tx.data.RoBERTaTokenizer(pretrained_model_name='roberta-base')

input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
print(input_ids)  # [0, 31414, 232, 328, 2]
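
As a quick sanity check (not part of the PR; it assumes the two snippets above were run in the same session and that, as the printed outputs suggest, encode_text returns a plain Python list of ids), the two results can be compared directly:

# fairseq's encode() returns a LongTensor, so convert it to a list before
# comparing against Texar's ids.
assert tokens.tolist() == input_ids  # both give [0, 31414, 232, 328, 2]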

gpengzhi (Collaborator, Author)

Resolves #246

ZhitingHu (Member) left a comment

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)

Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

codecov bot commented Nov 27, 2019
Codecov Report

Merging #255 into master will increase coverage by <.01%.
The diff coverage is 70%.

@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   83.04%   83.04%   +<.01%     
==========================================
  Files         195      195              
  Lines       15293    15300       +7     
==========================================
+ Hits        12700    12706       +6     
- Misses       2593     2594       +1
Impacted Files Coverage Δ
texar/torch/data/tokenizers/tokenizer_base.py 89.83% <100%> (+0.04%) ⬆️
texar/torch/data/tokenizers/roberta_tokenizer.py 94.73% <100%> (ø) ⬆️
texar/torch/data/tokenizers/xlnet_tokenizer.py 85.38% <50%> (-0.56%) ⬇️
texar/torch/data/tokenizers/bert_tokenizer.py 88.88% <50%> (-0.81%) ⬇️
texar/torch/data/tokenizers/gpt2_tokenizer.py 89.36% <66.66%> (-0.57%) ⬇️
texar/torch/data/data/data_iterators_utils.py 72.72% <0%> (ø) ⬆️
texar/torch/data/data/data_iterators.py 82.24% <0%> (+0.36%) ⬆️
texar/torch/core/layers.py 88.23% <0%> (+0.53%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d17d502...646de51.

gpengzhi closed this on Nov 27, 2019
gpengzhi reopened this on Nov 27, 2019
gpengzhi (Collaborator, Author) commented Nov 27, 2019

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

max_seq_length is used here just to show the exact same result. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

ZhitingHu (Member)

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

> max_seq_length is used here just to show the same result between ours and theirs. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

If the user doesn't want padding, it's difficult for them to know the correct sequence length, as in this case ('Hello world!' needs seq_length 5). Can we add an argument (or allow a special value of max_seq_length) so that the user can get the encoded text without padding (and without needing to specify max_seq_length explicitly)?

gpengzhi (Collaborator, Author) commented Nov 27, 2019

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?
>
> max_seq_length is used here just to show the same result between ours and theirs. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

> If the user doesn't want padding, it's difficult for them to know the correct sequence length, as in this case ('Hello world!' needs seq_length 5). Can we add an argument (or allow a special value of max_seq_length) so that the user can get the encoded text without padding (and without needing to specify max_seq_length explicitly)?

encode_text returns two values, input_ids and input_mask, so the user can compute the correct sequence length from input_mask. In this case, input_mask is [1, 1, 1, 1, 1, 0, 0, ..., 0].
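
A minimal sketch of that approach (not code from this PR; it assumes texar.torch is installed and the roberta-base weights can be downloaded):

import texar.torch as tx

tokenizer = tx.data.RoBERTaTokenizer(pretrained_model_name='roberta-base')

# Without max_seq_length, encode_text zero-pads the ids up to the tokenizer's
# maximum sequence length and returns a matching 0/1 mask.
input_ids, input_mask = tokenizer.encode_text('Hello world!')

# The mask marks real tokens with 1 and padding with 0, so its sum is the
# true sequence length, which lets us strip the padding.
seq_length = sum(input_mask)
print(seq_length)              # 5
print(input_ids[:seq_length])  # [0, 31414, 232, 328, 2]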

ZhitingHu (Member)

Please try to pass the test and merge ASAP.

gpengzhi closed this on Nov 28, 2019
gpengzhi reopened this on Nov 28, 2019
gpengzhi (Collaborator, Author)

I tried to disable codecov/patch because some of our tests (the pre-trained model tests) are run locally, and codecov/patch is also reported to be not very reliable; other projects have run into the same issue (argoproj/argo-cd#1926). I think codecov/project is good enough to check code quality for now. I tried many ways to disable codecov/patch but failed; it seems someone else hit a similar problem (https://community.codecov.io/t/cannot-disable-codecov-patch-check/682/4). Do you think it is reasonable to merge PRs regardless of the status of codecov/patch? @ZhitingHu

ZhitingHu (Member)

As long as the build is "passing", we're good for now.

gpengzhi merged commit 3484a18 into asyml:master on Nov 30, 2019
gpengzhi mentioned this pull request on Nov 30, 2019
gpengzhi (Collaborator, Author) commented Dec 2, 2019

gpengzhi deleted the roberta-tokenizer branch on February 4, 2020