
Update RoBERTa vocabulary files #255

Merged (6 commits) on Nov 30, 2019

Conversation

gpengzhi (Collaborator)

# Tokenization with fairseq's pretrained RoBERTa
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

tokens = roberta.encode('Hello world!')
print(tokens)  # [    0, 31414,   232,   328,     2]

# Tokenization with Texar-PyTorch's RoBERTaTokenizer
import texar.torch as tx
tokenizer = tx.data.RoBERTaTokenizer(pretrained_model_name='roberta-base')

input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
print(input_ids)  # [0, 31414, 232, 328, 2]
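
As a quick sanity check (not part of the PR; it assumes the two snippets above were run in the same session and that, as the printed outputs suggest, encode_text returns a plain Python list of ids), the two results can be compared directly:

# fairseq's encode() returns a LongTensor, so convert it to a list before
# comparing against Texar's ids.
assert tokens.tolist() == input_ids  # both give [0, 31414, 232, 328, 2]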

gpengzhi (Collaborator, Author)

Resolves #246

ZhitingHu (Member) left a comment

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)

Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

codecov bot commented Nov 27, 2019
Codecov Report

Merging #255 into master will increase coverage by <.01%.
The diff coverage is 70%.

@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   83.04%   83.04%   +<.01%     
==========================================
  Files         195      195              
  Lines       15293    15300       +7     
==========================================
+ Hits        12700    12706       +6     
- Misses       2593     2594       +1
Impacted Files Coverage Δ
texar/torch/data/tokenizers/tokenizer_base.py 89.83% <100%> (+0.04%) ⬆️
texar/torch/data/tokenizers/roberta_tokenizer.py 94.73% <100%> (ø) ⬆️
texar/torch/data/tokenizers/xlnet_tokenizer.py 85.38% <50%> (-0.56%) ⬇️
texar/torch/data/tokenizers/bert_tokenizer.py 88.88% <50%> (-0.81%) ⬇️
texar/torch/data/tokenizers/gpt2_tokenizer.py 89.36% <66.66%> (-0.57%) ⬇️
texar/torch/data/data/data_iterators_utils.py 72.72% <0%> (ø) ⬆️
texar/torch/data/data/data_iterators.py 82.24% <0%> (+0.36%) ⬆️
texar/torch/core/layers.py 88.23% <0%> (+0.53%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d17d502...646de51.

gpengzhi closed this on Nov 27, 2019
gpengzhi reopened this on Nov 27, 2019
gpengzhi (Collaborator, Author) commented Nov 27, 2019

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

max_seq_length is used here just to show the exact same result. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

ZhitingHu (Member)

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?

> max_seq_length is used here just to show the same result between ours and theirs. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

If the user doesn't want padding, it's difficult for them to know the correct sequence length, as in this case ('Hello world!' needs seq_length 5). Can we add an argument (or allow a special value of max_seq_length) so that the user can get the encoded text without padding (and without needing to specify max_seq_length explicitly)?

gpengzhi (Collaborator, Author) commented Nov 27, 2019

> input_ids, _ = tokenizer.encode_text('Hello world!', max_seq_length=5)
>
> Is max_seq_length necessary here? What's the result in this case without setting max_seq_length?
>
> max_seq_length is used here just to show the same result between ours and theirs. If max_seq_length is not specified, the result will be [0, 31414, 232, 328, 2, 0, 0, ..., 0] (zero-padded up to the maximum sequence length).

> If the user doesn't want padding, it's difficult for them to know the correct sequence length, as in this case ('Hello world!' needs seq_length 5). Can we add an argument (or allow a special value of max_seq_length) so that the user can get the encoded text without padding (and without needing to specify max_seq_length explicitly)?

encode_text returns two values, input_ids and input_mask, so the user can compute the correct sequence length from input_mask. In this case, input_mask is [1, 1, 1, 1, 1, 0, 0, ..., 0].
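
A minimal sketch of that approach (not code from this PR; it assumes texar.torch is installed and the roberta-base weights can be downloaded):

import texar.torch as tx

tokenizer = tx.data.RoBERTaTokenizer(pretrained_model_name='roberta-base')

# Without max_seq_length, encode_text zero-pads the ids up to the tokenizer's
# maximum sequence length and returns a matching 0/1 mask.
input_ids, input_mask = tokenizer.encode_text('Hello world!')

# The mask marks real tokens with 1 and padding with 0, so its sum is the
# true sequence length, which lets us strip the padding.
seq_length = sum(input_mask)
print(seq_length)              # 5
print(input_ids[:seq_length])  # [0, 31414, 232, 328, 2]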

ZhitingHu (Member)

Please try to pass the test and merge ASAP.

gpengzhi closed this on Nov 28, 2019
gpengzhi reopened this on Nov 28, 2019
gpengzhi (Collaborator, Author)

I tried to disable codecov/patch because some of our tests (the pre-trained model tests) are run locally, and codecov/patch is also reported to be not very reliable; other projects have run into the same issue (argoproj/argo-cd#1926). I think codecov/project is good enough to check code quality for now. I tried many ways to disable codecov/patch but failed; it seems someone else hit a similar problem (https://community.codecov.io/t/cannot-disable-codecov-patch-check/682/4). Do you think it is reasonable to merge PRs regardless of the status of codecov/patch? @ZhitingHu

ZhitingHu (Member)

As long as the build is "passing", we're good for now.

gpengzhi merged commit 3484a18 into asyml:master on Nov 30, 2019
gpengzhi mentioned this pull request on Nov 30, 2019
gpengzhi (Collaborator, Author) commented Dec 2, 2019

gpengzhi deleted the roberta-tokenizer branch on February 4, 2020