
Update to transformers 2.3.0 & Add ALBERT #990

Merged: 32 commits from update-transformers into master on Jan 28, 2020

Conversation

@HaokunLiu (Member) commented Jan 11, 2020

Important notice

You will need to change PYTORCH_PRETRAINED_BERT_CACHE to HUGGINGFACE_TRANSFORMERS_CACHE in your own environment settings.
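For example, a minimal, hypothetical migration snippet (how you set environment variables in your own setup may differ; the point is only that the new variable needs to be set):

```python
import os

# Hypothetical migration helper: if only the old cache variable is set,
# reuse its value for the new one so the existing cache directory keeps working.
old_cache = os.environ.get("PYTORCH_PRETRAINED_BERT_CACHE")
if old_cache and "HUGGINGFACE_TRANSFORMERS_CACHE" not in os.environ:
    os.environ["HUGGINGFACE_TRANSFORMERS_CACHE"] = old_cache
```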

Updates

This PR updates jiant from pytorch_transformers 1.0 to transformers 2.3. The major changes are:

  1. Add ALBERT to the supported input modules.
  2. Rename mentions of "pytorch_pretrained_bert" and "pytorch_transformers" to "transformers". In some places the name "huggingface transformers" is used to avoid ambiguity. I only wish they would not transform their name again.
  3. As #953 (Adding tokenizer alignment function) pointed out, our temporary fix no longer works because Huggingface reverted their byte-level BPE tokenizer change. This PR changes the alignment function in qa.py back to using TokenAligner.

Related Issues

#972 #920 #730

Other

Transformers 2.3 introduced a new feature, AutoModel, AutoTokenizer, etc., which could be used to simplify some code. It isn't really necessary right now, so I didn't do it.
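For reference, a minimal sketch of what using AutoModel / AutoTokenizer could look like (the checkpoint name is just an example, not something this PR standardizes on):

```python
from transformers import AutoModel, AutoTokenizer

# The Auto* classes pick the concrete architecture (BERT, RoBERTa, ALBERT, ...)
# from the checkpoint name, which removes per-model branching in calling code.
model_name = "albert-xxlarge-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

token_ids = tokenizer.encode("Berlin and Munich", add_special_tokens=True)
```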

@pep8speaks commented Jan 11, 2020

Hello @HaokunLiu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 10:101: E501 line too long (105 > 100 characters)

Line 117:101: E501 line too long (117 > 100 characters)
Line 131:101: E501 line too long (114 > 100 characters)
Line 133:101: E501 line too long (124 > 100 characters)
Line 134:101: E501 line too long (120 > 100 characters)
Line 173:101: E501 line too long (110 > 100 characters)
Line 175:101: E501 line too long (104 > 100 characters)
Line 178:101: E501 line too long (109 > 100 characters)

Line 773:101: E501 line too long (104 > 100 characters)

Line 2729:99: W291 trailing whitespace

You can repair most issues by installing black and running `black -l 100 ./*`. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2020-01-24 22:15:09 UTC

@@ -57,15 +57,15 @@ def test_moses(self):
         ]

         aligner_fn = retokenize.get_aligner_fn("transfo-xl-wt103")
-        tas, tokens = zip(*(aligner_fn(sent) for sent in self.text))
-        tas, tokens = list(tas), list(tokens)
+        token_aligners, tokens = zip(*(aligner_fn(sent) for sent in self.text))
@HaokunLiu (Member, Author) commented on this diff, Jan 14, 2020:

I may have written this variable name in the first place, but I found it hard to understand now, so I changed it to the full name.

@HaokunLiu (Member, Author) commented Jan 15, 2020

So it turns out ALBERT really is better than RoBERTa. I ran some experiments on CoLA and RTE.

| task | roberta-large | albert-xxlarge-v1 | albert-xxlarge-v2 |
|------|---------------|-------------------|-------------------|
| cola | 0.685 | 0.695 | 0.701 |
| rte  | 0.862 | 0.888 | 0.906 |

I didn't use exactly the same hyperparameters, but the results are close to their dev-set results in the paper.

@sleepinyourhat (Contributor) commented Jan 15, 2020 via email

Great! How hard was it to run albert-xxlarge? Did you have to use smaller batch sizes than for roberta-large?

@HaokunLiu (Member, Author) replied:

The inputs for RTE and CoLA are fairly short, so they don't cause any memory problems. But since albert-xxlarge has 4x the hidden size of roberta-large and half the number of layers, it will need smaller batch sizes on tasks with longer inputs.

@HaokunLiu (Member, Author) commented:

Comparing the RoBERTa tokenizer in pytorch_transformers ('ĠBerlin', 'Ġand', 'ĠMunich') with the RoBERTa tokenizer in transformers ('Ber', 'lin', 'Ġand', 'ĠMunich') on QAMR:

| pytorch_transformers | transformers |
|----------------------|--------------|
| 0.730 | 0.726 |

Yes, the new one seems very counter-intuitive, but that's what Huggingface finally settled on. Some people found the new one marginally better on NER: huggingface/transformers#1196
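A rough sketch of how to see the two tokenizations with the transformers tokenizer (the checkpoint name is illustrative, and mapping the old pytorch_transformers output to the leading-space case is my assumption):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")  # example checkpoint

# Sentence-initial word with no leading space: the first piece has no 'Ġ' marker.
print(tokenizer.tokenize("Berlin and Munich"))   # e.g. ['Ber', 'lin', 'Ġand', 'ĠMunich']

# With a leading space, the first word is treated like any other word in the sentence.
print(tokenizer.tokenize(" Berlin and Munich"))  # e.g. ['ĠBerlin', 'Ġand', 'ĠMunich']
```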

@HaokunLiu marked this pull request as ready for review January 15, 2020 18:29
@sleepinyourhat added the "0.x.0 release on fix" label (Put out a new 0.x.0 release when this is fixed) on Jan 15, 2020
@sleepinyourhat (Contributor) commented:

Taking a look now...

@pyeres (Contributor) commented Jan 17, 2020

> @pyeres @pruksmhc, could either of you give the main review?

@HaokunLiu, I'm starting to review now. Here are the validation items we discussed:

  • Performance regression test (performance on some core tasks is not reduced with these changes)
  • Benchmark test (ALBERT performance in jiant is comparable to the paper)
  • Test coverage of the retokenization code


# All the supported input_module from huggingface transformers
# input_modules mapped to the same string share vocabulary
input_module_to_pretokenized = {
A Contributor commented on this diff:

input_module_to_pretokenized -> transformer_input_module_to_tokenizer_id
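To illustrate what the comments above describe, a hypothetical version of such a mapping (entries invented for illustration, not the actual contents of the dict in this PR): input modules that map to the same string share a vocabulary.

```python
# Hypothetical example only: input modules mapped to the same tokenizer id
# share a vocabulary, so their tokenizations align.
transformer_input_module_to_tokenizer_id = {
    "albert-xxlarge-v1": "albert_xxlarge",
    "albert-xxlarge-v2": "albert_xxlarge",
    "roberta-base": "roberta",
    "roberta-large": "roberta",
}
```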

@HaokunLiu (Member, Author) commented:

Actually, don't merge this yet. I found that our current implementation of many tasks relies on the assumption that tokenizing each word independently and concatenating the results gives the same output as tokenizing the full sentence. The affected tasks may include CCG, ReCoRD, and WiC. I need to run some more tests and possibly make changes to these tasks.
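A minimal sketch of the assumption being tested, using the transformers RoBERTa tokenizer (checkpoint and sentence are just examples):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")  # example checkpoint
sentence = "Berlin and Munich"

# Tokenizing the full sentence at once ...
full = tokenizer.tokenize(sentence)

# ... versus tokenizing each word independently and concatenating the pieces.
per_word = [piece for word in sentence.split() for piece in tokenizer.tokenize(word)]

# With a byte-level BPE tokenizer these can disagree, because word-initial
# pieces inside a sentence carry a leading-space marker ('Ġ') that an
# isolated word never gets.
print(full == per_word)
print(full, per_word)
```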

@HaokunLiu (Member, Author) commented:

| Task | Update to transformers | Final baseline in Taskmaster | Initial hyper-parameter search in Taskmaster |
|------|------------------------|------------------------------|----------------------------------------------|
| Target | | | |
| commitbank | 1 | 0.991 | 1 |
| copa | 0.94 | 0.86 | 0.93 |
| wsc | 0.692 | 0.673 | 0.798 |
| rte | 0.866 | 0.835 | 0.862 |
| multirc | 0.628 | 0.474 | 0.574 |
| wic | 0.724 | 0.705 | 0.721 |
| boolq | 0.879 | 0.866 | 0.877 |
| commonsenseqa | 0.735 | 0.74 | 0.734 |
| cosmosqa | 0.814 | 0.819 | 0.817 |
| record | 0.863 | 0.86 | 0.898 |
| INTERM | | | |
| qamr | 0.726 | N/A | 0.735 |
| scitail | 0.976 | N/A | 0.972 |
| social-iqa | 0.766 | N/A | 0.729 |
| ccg | 0.96 | N/A | 0.96 |
| hellaswag | 0.851 | N/A | 0.856 |
| qasrl | 0.664 | N/A | 0.675 |
| sst | 0.961 | N/A | 0.968 |
| qqp | 0.903 | N/A | 0.898 |
| mnli | 0.896 | N/A | 0.901 |

The first column is the updated roberta-large model; all tasks use lr=5e-6 and dropout=0.2. The second column is the average result from three random seeds using the previously found "optimal" learning rate and dropout rate. The third column is the best result we got during hyper-parameter search.

Most results are on par with our previous results. Some are marginally lower, which I think is understandable since this uses just a single hyper-parameter setting. The exception is WSC, which seems very unstable.

@pyeres (Contributor) commented Jan 24, 2020

> Most results are on par with our previous results. Some are marginally lower, which I think is understandable since this uses just a single hyper-parameter setting. The exception is WSC, which seems very unstable.

This looks OK to me: on tasks where performance with the updated transformers code is below the level reported in your "Initial hyper-parameter search in Taskmaster" column, it is still above (or very close to) the performance in your "Final baseline in Taskmaster" column (and I understand that those final baselines are the result of multiple runs).

@pyeres (Contributor) left a review comment:

Looks good to me — thanks for providing the results of your performance and regression tests.

@pyeres (Contributor) commented Jan 28, 2020

Thanks @HaokunLiu for running the additional performance validations. Merging.

@pyeres merged commit 4a9b058 into master on Jan 28, 2020
@pyeres deleted the update-transformers branch January 28, 2020 14:51
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* fix roberta tokenization error

* update transformers

* update alignment func

* trim input_module

* update lm head

* update albert special tokens

* input_module_to_pretokenized -> transformer_input_module_to_tokenizer_id

* update ccg alignment

* fix wic retokenize

* update wic docstring, remove unnecessary condition

* refactor record task to avoid tokenization problem

Co-authored-by: Sam Bowman <bowman@nyu.edu>
@jeswan added the "jiant-v1-legacy" label (Relevant to versions <= v1.3.2) on Sep 17, 2020