Update to transformers 2.3.0 & Add ALBERT #990
Conversation
Hello @HaokunLiu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
You can repair most issues by installing and running black.
Comment last updated at 2020-01-24 22:15:09 UTC
@@ -57,15 +57,15 @@ def test_moses(self):
        ]

        aligner_fn = retokenize.get_aligner_fn("transfo-xl-wt103")
-       tas, tokens = zip(*(aligner_fn(sent) for sent in self.text))
-       tas, tokens = list(tas), list(tokens)
+       token_aligners, tokens = zip(*(aligner_fn(sent) for sent in self.text))
I may have written this variable name in the first place, but I find it hard to understand now, so I changed it to the full name.
So it turns out ALBERT really is better than RoBERTa. I ran some experiments on CoLA and RTE.
I didn't use the exact same hyperparameters, but the results are close to the dev-set results reported in the paper.
The inputs for RTE and CoLA are fairly short, so they don't cause any memory problems. But since albert-xxlarge has 4x the hidden size of roberta-large and half the number of layers, it will have to use smaller batch sizes on tasks with longer inputs (see the rough estimate below).
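As a rough back-of-the-envelope sketch (hidden sizes and layer counts from the model configs; this ignores attention maps, optimizer state, and sequence length, so it's only a proxy):

```python
# Per-token activation footprint scales roughly with hidden_size * num_layers.
roberta_large = 1024 * 24    # hidden=1024, 24 layers
albert_xxlarge = 4096 * 12   # hidden=4096, 12 layers

print(albert_xxlarge / roberta_large)  # 2.0 -> roughly half the batch size fits
```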
Comparing the RoBERTa tokenizer in pytorch_transformers ('ĠBerlin', 'Ġand', 'ĠMunich') with the RoBERTa tokenizer in transformers ('Ber', 'lin', 'Ġand', 'ĠMunich') on QAMR.
Yes, the new one seems very counter-intuitive, but that's what Huggingface finally settled on. Some people found the new behavior marginally better on NER. huggingface/transformers#1196
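For reference, a minimal sketch of the difference, assuming transformers 2.3 and the roberta-large vocab (the key change is that the new tokenizer no longer adds a prefix space to the first word by default):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

# New transformers behavior: no prefix space on the first word, so
# sentence-initial "Berlin" splits differently from mid-sentence words.
print(tokenizer.tokenize("Berlin and Munich"))
# ['Ber', 'lin', 'Ġand', 'ĠMunich']

# A literal leading space reproduces the old pytorch_transformers behavior.
print(tokenizer.tokenize(" Berlin and Munich"))
# ['ĠBerlin', 'Ġand', 'ĠMunich']
```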
Taking a look now...
@HaokunLiu, I'm starting to review now. Here are the validation items we discussed:
# All the supported input_module values from huggingface transformers
# input_modules mapped to the same string share vocabulary
input_module_to_pretokenized = {
input_module_to_pretokenized -> transformer_input_module_to_tokenizer_id
Actually, don't merge this now. I found that our current implementation of many tasks relies on the assumption that tokenizing each word independently and concatenating the results gives the same output as tokenizing the full sentence (see the sketch below). The affected tasks may include CCG, ReCoRD, and WiC. I need to run some more tests, and possibly make some changes to these tasks.
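A minimal sketch of the assumption that breaks, assuming the transformers 2.3 RoBERTa tokenizer (the example sentence and token pieces are illustrative):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

sentence = "Berlin and Munich"

# Tokenizing the full sentence keeps the "Ġ" space marker on every
# word except the first.
full = tokenizer.tokenize(sentence)
# e.g. ['Ber', 'lin', 'Ġand', 'ĠMunich']

# Tokenizing word-by-word and concatenating drops the markers entirely,
# so the pieces no longer match the full-sentence tokenization.
per_word = [t for w in sentence.split() for t in tokenizer.tokenize(w)]

assert full != per_word  # the assumption these tasks relied on fails
```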
The first column is the updated roberta-large model, where all tasks use lr=5e-6 and dropout=0.2; the second column is the average result over three random seeds using the previously found "optimal" learning rate and dropout rate; the third column is the best result we got during hyperparameter search. Most results are on par with our previous results. Some are marginally lower, which I think is understandable since it's just a single hyperparameter setting. The exception is WSC, which seems very unstable.
This looks OK to me: on tasks where performance with this updated transformers code is below the level reported in your "Initial hyper-parameter search in Taskmaster" column, performance with your updated transformers code is above (or very close to) the performance in your "Final baseline in Taskmaster" column (and I understand that these final baselines are the result of multiple runs).
Looks good to me. Thanks for providing the results of your performance and regression tests.
Thanks @HaokunLiu for running the additional performance validations. Merging.
* fix roberta tokenization error
* update transformers
* update alignment func
* trim input_module
* update lm head
* update albert special tokens
* input_module_to_pretokenized -> transformer_input_module_to_tokenizer_id
* update ccg alignment
* fix wic retokenize
* update wic docstring, remove unnecessary condition
* refactor record task to avoid tokenization problem

Co-authored-by: Sam Bowman <bowman@nyu.edu>
Important notice
You will need to change PYTORCH_PRETRAINED_BERT_CACHE to HUGGINGFACE_TRANSFORMERS_CACHE in your own environment settings (see the sketch below).
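For example, a minimal sketch of the rename (the cache path is a placeholder; you can equivalently export the variable in your shell profile):

```python
import os

# Old (pytorch_transformers 1.0):
# os.environ["PYTORCH_PRETRAINED_BERT_CACHE"] = "/path/to/model_cache"

# New (transformers 2.3) -- same path, new variable name:
os.environ["HUGGINGFACE_TRANSFORMERS_CACHE"] = "/path/to/model_cache"
```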
Updates
This PR updates jiant from pytorch_transformers 1.0 to transformers 2.3. The major changes are summarized in the commit list above.
Related Issues
#972 #920 #730
Other
Transformers 2.3 introduced a new feature: AutoModel, AutoTokenizer, etc. This could be used to simplify some code, but it is not really necessary right now, so I didn't do it.
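For illustration, a minimal sketch of what that would look like (the model name here is just an example):

```python
from transformers import AutoModel, AutoTokenizer

# One code path covers BERT, RoBERTa, ALBERT, XLNet, ...: the Auto classes
# dispatch on the pretrained model name.
tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModel.from_pretrained("albert-xxlarge-v2")
```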