
[CLOSED] XLNet support and overhaul/cleanup of BERT support #845

Closed · jeswan opened this issue Sep 17, 2020 · 15 comments
Labels
0.x.0 release on fix (Put out a new 0.x.0 release when this is fixed.) · bug (Something isn't working) · cleanup (This should be fairly easy)

Comments


jeswan commented Sep 17, 2020

Issue by sleepinyourhat
Wednesday Jul 17, 2019 at 19:47 GMT
Originally opened as nyu-mll/jiant#845


There's a lot going on here, and I'm still debugging. Suggestions for tests to add are very welcome!

I'm adding a few semi-related changes that are meant to help with clarity/maintainability:

  • 'auto' is now the default value for args.tokenizer, and should behave correctly for all standard models.
  • pair_task is now a property of Task objects. [update: pair_task is gone.]
  • The addition of start/end/sep/cls tokens now happens slightly later in preprocessing, and it's up to each task object to request it. This lets tasks decide more easily how they want to use [SEP] tokens. It's likely to introduce some subtle bugs, but it should also fix some, and it's basically necessary: XLNet places [CLS] at the end, after the final [SEP], so we'd have had to rewrite all that code anyhow (see the sketch after this list).
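
As a rough illustration of the two layouts that task-driven boundary-token step has to support (a minimal sketch, not jiant's actual preprocessing code; the function names are made up for illustration):

```python
# Minimal sketch of the special-token layouts described above (illustrative only).

def bert_layout(tokens_a, tokens_b=None):
    # BERT: [CLS] A [SEP] (B [SEP]) -- the classification token comes first.
    out = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b:
        out += list(tokens_b) + ["[SEP]"]
    return out

def xlnet_layout(tokens_a, tokens_b=None):
    # XLNet: A <sep> (B <sep>) <cls> -- the classification token comes last,
    # after the final separator, which is why the old front-loaded logic
    # had to be rewritten.
    out = list(tokens_a) + ["<sep>"]
    if tokens_b:
        out += list(tokens_b) + ["<sep>"]
    return out + ["<cls>"]

print(bert_layout(["the", "cat"], ["sat", "down"]))
# ['[CLS]', 'the', 'cat', '[SEP]', 'sat', 'down', '[SEP]']
print(xlnet_layout(["the", "cat"], ["sat", "down"]))
# ['the', 'cat', '<sep>', 'sat', 'down', '<sep>', '<cls>']
```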

I caught a bug along the way:

  • Some tasks, including COPA, MultiRC, and ReCoRD, didn't have pair_task set properly, so we weren't using BERT segment embeddings (i.e., tokens both before and after [SEP] were marked as segment A); see the sketch below.
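
For concreteness, a minimal sketch (illustrative only, not jiant's code) of what correct segment ids look like for a pair input; with the bug, affected tasks were getting segment A (0) everywhere:

```python
# Illustrative sketch: BERT segment (token type) ids for a pair input.
# Everything up to and including the first [SEP] is segment A (0); the
# second sentence and its trailing [SEP] are segment B (1).
def pair_segment_ids(tokens, sep_token="[SEP]"):
    ids, seg = [], 0
    for tok in tokens:
        ids.append(seg)
        if tok == sep_token and seg == 0:
            seg = 1  # switch to segment B after the first separator
    return ids

tokens = ["[CLS]", "the", "cat", "[SEP]", "sat", "down", "[SEP]"]
print(pair_segment_ids(tokens))  # [0, 0, 0, 0, 1, 1, 1]
```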

Note to self:

  • Update the site-side documentation when done.

sleepinyourhat included the following code: https://github.com/nyu-mll/jiant/pull/845/commits

@jeswan added the 0.x.0 release on fix, bug, and cleanup labels on Sep 17, 2020

jeswan commented Sep 17, 2020

Comment by pep8speaks
Wednesday Jul 17, 2019 at 19:47 GMT


Hello @sleepinyourhat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 201:101: E501 line too long (108 > 100 characters)

Line 76:101: E501 line too long (104 > 100 characters)
Line 79:101: E501 line too long (114 > 100 characters)
Line 81:101: E501 line too long (124 > 100 characters)
Line 82:101: E501 line too long (114 > 100 characters)
Line 119:101: E501 line too long (110 > 100 characters)
Line 253:101: E501 line too long (112 > 100 characters)
Line 254:101: E501 line too long (201 > 100 characters)

Line 115:101: E501 line too long (152 > 100 characters)

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2019-08-07 21:30:58 UTC


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Thursday Jul 18, 2019 at 22:13 GMT


Removing WIP tag—I think this is ready for review. Anyone have a moment?

I'm still running some larger-scale tests on GLUE/SuperGLUE to make sure that BERT performance doesn't change, and that XLNet gets semi-sane results.


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Monday Jul 22, 2019 at 15:23 GMT


From the GLUE tasks, I can confirm that BERT models do just as well after the refactor, but XLNet performance is fairly low:

bert-old-g	micro_avg: 0.624, macro_avg: 0.624, cola_mcc: 0.624, cola_accuracy: 0.846
bert-new-g	micro_avg: 0.612, macro_avg: 0.612, cola_mcc: 0.612, cola_accuracy: 0.841
xlnet-new-g	micro_avg: 0.094, macro_avg: 0.094, cola_mcc: 0.094, cola_accuracy: 0.665

bert-old-g	micro_avg: 0.886, macro_avg: 0.886, sts-b_corr: 0.886, sts-b_pearsonr: 0.888, sts-b_spearmanr: 0.885
bert-new-g	micro_avg: 0.885, macro_avg: 0.885, sts-b_corr: 0.885, sts-b_pearsonr: 0.886, sts-b_spearmanr: 0.883
xlnet-new-g	micro_avg: 0.761, macro_avg: 0.761, sts-b_corr: 0.761, sts-b_pearsonr: 0.761, sts-b_spearmanr: 0.761

bert-old-g	micro_avg: 0.834, macro_avg: 0.834, mnli_accuracy: 0.834
bert-new-g	micro_avg: 0.834, macro_avg: 0.834, mnli_accuracy: 0.834
xlnet-new-g	micro_avg: 0.675, macro_avg: 0.675, mnli_accuracy: 0.675

bert-old-g	micro_avg: 0.870, macro_avg: 0.870, mrpc_acc_f1: 0.870, mrpc_accuracy: 0.846, mrpc_f1: 0.894, mrpc_precision: 0.844, mrpc_recall: 0.950
bert-new-g	micro_avg: 0.881, macro_avg: 0.881, mrpc_acc_f1: 0.881, mrpc_accuracy: 0.860, mrpc_f1: 0.902, mrpc_precision: 0.865, mrpc_recall: 0.943
xlnet-new-g	micro_avg: 0.819, macro_avg: 0.819, mrpc_acc_f1: 0.819, mrpc_accuracy: 0.787, mrpc_f1: 0.852, mrpc_precision: 0.812, mrpc_recall: 0.896

bert-old-g	micro_avg: 0.911, macro_avg: 0.911, qnli_accuracy: 0.911
bert-new-g	micro_avg: 0.913, macro_avg: 0.913, qnli_accuracy: 0.913
xlnet-new-g	micro_avg: 0.790, macro_avg: 0.790, qnli_accuracy: 0.790

bert-old-g	micro_avg: 0.863, macro_avg: 0.863, qqp_acc_f1: 0.863, qqp_accuracy: 0.883, qqp_f1: 0.843, qqp_precision: 0.834, qqp_recall: 0.852
bert-new-g	micro_avg: 0.867, macro_avg: 0.867, qqp_acc_f1: 0.867, qqp_accuracy: 0.887, qqp_f1: 0.847, qqp_precision: 0.847, qqp_recall: 0.848
xlnet-new-g	micro_avg: 0.779, macro_avg: 0.779, qqp_acc_f1: 0.779, qqp_accuracy: 0.810, qqp_f1: 0.748, qqp_precision: 0.731, qqp_recall: 0.766

bert-old-g	micro_avg: 0.661, macro_avg: 0.661, rte_accuracy: 0.661
bert-new-g	micro_avg: 0.686, macro_avg: 0.686, rte_accuracy: 0.686
xlnet-new-g	micro_avg: 0.545, macro_avg: 0.545, rte_accuracy: 0.545

bert-old-g	micro_avg: 0.923, macro_avg: 0.923, sst_accuracy: 0.923
bert-new-g	micro_avg: 0.923, macro_avg: 0.923, sst_accuracy: 0.923
xlnet-new-g	micro_avg: 0.796, macro_avg: 0.796, sst_accuracy: 0.796

bert-new-g	micro_avg: 0.380, macro_avg: 0.380, wnli_accuracy: 0.380
bert-old-g	micro_avg: 0.437, macro_avg: 0.437, wnli_accuracy: 0.437
xlnet-new-g	micro_avg: 0.563, macro_avg: 0.563, wnli_accuracy: 0.563

Old vs. new refers to master vs. this branch.


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Monday Jul 22, 2019 at 17:20 GMT


> From the GLUE tasks, I can confirm that BERT models do just as well after the refactor, but XLNet performance is fairly low. This is with only one epoch per task. [Results table quoted from the previous comment.]

I found some differences in the jiant implementation; I don't know if that's the cause.
XLNet uses the embedding from the last token instead of the first (as BERT does with the first token). See pytorch_transformers > modeling_xlnet.py > XLNetForSequenceClassification and pytorch_transformers > modeling_utils.py > SequenceSummary for details.
In jiant, that means pool_type needs to be "final" when using XLNet, I guess?
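
A minimal sketch of the pooling difference being described (illustrative; not the actual jiant or pytorch_transformers code):

```python
# BERT-style classifiers pool the hidden state at the first position ([CLS]
# is at the front), while XLNet's SequenceSummary pools the last position
# (<cls> sits at the end), which is roughly what pool_type = "final" would
# mean here.
import torch

hidden = torch.randn(2, 7, 768)   # (batch, seq_len, hidden_dim)
lengths = torch.tensor([7, 5])    # real (unpadded) sequence lengths

first_token_pool = hidden[:, 0, :]                                    # BERT-style
last_token_pool = hidden[torch.arange(hidden.size(0)), lengths - 1]   # XLNet-style

print(first_token_pool.shape, last_token_pool.shape)  # both torch.Size([2, 768])
```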


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Monday Jul 22, 2019 at 17:23 GMT


Good catch!


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Wednesday Jul 24, 2019 at 14:17 GMT


Okay, I think this is ready for a real review. Thanks to everyone who commented so far!

  • I've tested all the tasks that I know how to run, but there are still a few (LM, CCG) that I don't have an easy test setup for. Be extra skeptical there, @pruksmhc, @anhad13.
  • I added a few tests.
  • I'll update the site once this is checked in.
  • I've run controlled regression tests on all the GLUE/SuperGLUE tasks, and everything looks good. ReCoRD is still running, and CoLA performance is a bit low, but CoLA always has a fair bit of random variation, and it looked good on earlier tests. Here's performance after two epochs with BERT-Base on master, BERT-Base on this branch, and XLNet-Base on this branch.
bert-old	micro_avg: 0.737, macro_avg: 0.737, boolq_acc_f1: 0.737, boolq_accuracy: 0.696, boolq_f1: 0.777, boolq_precision: 0.714, boolq_recall: 0.854
bert-new	micro_avg: 0.759, macro_avg: 0.759, boolq_acc_f1: 0.759, boolq_accuracy: 0.727, boolq_f1: 0.791, boolq_precision: 0.755, boolq_recall: 0.832
xlnet-new	micro_avg: 0.781, macro_avg: 0.781, boolq_acc_f1: 0.781, boolq_accuracy: 0.749, boolq_f1: 0.813, boolq_precision: 0.757, boolq_recall: 0.879

bert-old	micro_avg: 0.511, macro_avg: 0.511, commitbank_accuracy: 0.732, commitbank_f1: 0.511, commitbank_precision: 0.490, commitbank_recall: 0.535
bert-new	micro_avg: 0.497, macro_avg: 0.497, commitbank_accuracy: 0.714, commitbank_f1: 0.497, commitbank_precision: 0.476, commitbank_recall: 0.520
xlnet-new	micro_avg: 0.518, macro_avg: 0.518, commitbank_accuracy: 0.750, commitbank_f1: 0.518, commitbank_precision: 0.502, commitbank_recall: 0.541

bert-old	micro_avg: 0.600, macro_avg: 0.600, copa_accuracy: 0.600
bert-new	micro_avg: 0.600, macro_avg: 0.600, copa_accuracy: 0.600
xlnet-new	micro_avg: 0.560, macro_avg: 0.560, copa_accuracy: 0.560

bert-old	micro_avg: 0.391, macro_avg: 0.391, multirc_ans_f1: 0.623, multirc_qst_f1: 0.551, multirc_em: 0.159, multirc_avg: 0.391
bert-new	micro_avg: 0.390, macro_avg: 0.390, multirc_ans_f1: 0.626, multirc_qst_f1: 0.564, multirc_em: 0.154, multirc_avg: 0.390
xlnet-new	micro_avg: 0.391, macro_avg: 0.391, multirc_ans_f1: 0.622, multirc_qst_f1: 0.549, multirc_em: 0.161, multirc_avg: 0.391

bert-old	micro_avg: 0.682, macro_avg: 0.682, rte-superglue_accuracy: 0.682
bert-new	micro_avg: 0.675, macro_avg: 0.675, rte-superglue_accuracy: 0.675
xlnet-new	micro_avg: 0.661, macro_avg: 0.661, rte-superglue_accuracy: 0.661

bert-old	micro_avg: 0.705, macro_avg: 0.705, wic_accuracy: 0.705, wic_f1: 0.737, wic_precision: 0.665, wic_recall: 0.828
bert-new	micro_avg: 0.713, macro_avg: 0.713, wic_accuracy: 0.713, wic_f1: 0.740, wic_precision: 0.677, wic_recall: 0.815
xlnet-new	micro_avg: 0.672, macro_avg: 0.672, wic_accuracy: 0.672, wic_f1: 0.705, wic_precision: 0.641, wic_recall: 0.784

bert-old	micro_avg: 0.635, macro_avg: 0.635, winograd-coreference_f1: 0.000, winograd-coreference_acc: 0.635
bert-new	micro_avg: 0.635, macro_avg: 0.635, winograd-coreference_f1: 0.000, winograd-coreference_acc: 0.635
xlnet-new	micro_avg: 0.635, macro_avg: 0.635, winograd-coreference_f1: 0.000, winograd-coreference_acc: 0.635

bert-old	micro_avg: 0.596, macro_avg: 0.596, cola_mcc: 0.596, cola_accuracy: 0.835
bert-new	micro_avg: 0.593, macro_avg: 0.593, cola_mcc: 0.593, cola_accuracy: 0.834
xlnet-new       micro_avg: 0.394, macro_avg: 0.394, cola_mcc: 0.394, cola_accuracy: 0.760

bert-old	micro_avg: 0.835, macro_avg: 0.835, mnli_accuracy: 0.835
bert-new	micro_avg: 0.837, macro_avg: 0.837, mnli_accuracy: 0.837
xlnet-new       micro_avg: 0.866, macro_avg: 0.866, mnli_accuracy: 0.866

bert-old	micro_avg: 0.868, macro_avg: 0.868, mrpc_acc_f1: 0.868, mrpc_accuracy: 0.846, mrpc_f1: 0.891, mrpc_precision: 0.862, mrpc_recall: 0.921
bert-new	micro_avg: 0.882, macro_avg: 0.882, mrpc_acc_f1: 0.882, mrpc_accuracy: 0.860, mrpc_f1: 0.903, mrpc_precision: 0.860, mrpc_recall: 0.950
xlnet-new       micro_avg: 0.888, macro_avg: 0.888, mrpc_acc_f1: 0.888, mrpc_accuracy: 0.868, mrpc_f1: 0.908, mrpc_precision: 0.866, mrpc_recall: 0.953

bert-old	micro_avg: 0.913, macro_avg: 0.913, qnli_accuracy: 0.913
bert-new	micro_avg: 0.913, macro_avg: 0.913, qnli_accuracy: 0.913
xlnet-new       micro_avg: 0.916, macro_avg: 0.916, qnli_accuracy: 0.916

bert-old	micro_avg: 0.885, macro_avg: 0.885, qqp_acc_f1: 0.885, qqp_accuracy: 0.902, qqp_f1: 0.868, qqp_precision: 0.856, qqp_recall: 0.881
bert-new	micro_avg: 0.886, macro_avg: 0.886, qqp_acc_f1: 0.886, qqp_accuracy: 0.903, qqp_f1: 0.869, qqp_precision: 0.862, qqp_recall: 0.877
xlnet-new       micro_avg: 0.887, macro_avg: 0.887, qqp_acc_f1: 0.887, qqp_accuracy: 0.903, qqp_f1: 0.870, qqp_precision: 0.856, qqp_recall: 0.885

bert-old	micro_avg: 0.690, macro_avg: 0.690, rte_accuracy: 0.690
bert-new	micro_avg: 0.693, macro_avg: 0.693, rte_accuracy: 0.693
xlnet-new       micro_avg: 0.581, macro_avg: 0.581, rte_accuracy: 0.581

bert-old	micro_avg: 0.923, macro_avg: 0.923, sst_accuracy: 0.923
bert-new	micro_avg: 0.921, macro_avg: 0.921, sst_accuracy: 0.921
xlnet-new       micro_avg: 0.943, macro_avg: 0.943, sst_accuracy: 0.943

bert-old	micro_avg: 0.878, macro_avg: 0.878, sts-b_corr: 0.878, sts-b_pearsonr: 0.880, sts-b_spearmanr: 0.876
bert-new	micro_avg: 0.874, macro_avg: 0.874, sts-b_corr: 0.874, sts-b_pearsonr: 0.876, sts-b_spearmanr: 0.872
xlnet-new       micro_avg: 0.864, macro_avg: 0.864, sts-b_corr: 0.864, sts-b_pearsonr: 0.865, sts-b_spearmanr: 0.862

bert-old	micro_avg: 0.437, macro_avg: 0.437, wnli_accuracy: 0.437
bert-new	micro_avg: 0.310, macro_avg: 0.310, wnli_accuracy: 0.310
xlnet-new       micro_avg: 0.549, macro_avg: 0.549, wnli_accuracy: 0.549


jeswan commented Sep 17, 2020

Comment by W4ngatang
Wednesday Jul 24, 2019 at 21:57 GMT


Looked through most files (only skimmed tasks.py) and left comments. Some general comments:

  • this PR is massive :( It'd be really helpful for readability if it could be broken up, but I'd understand if you didn't have time for it :P.
  • "pair_task is now a property of Task objects": where is this happening? I don't see the relevant changes.
  • Also, if this is the case we should probably just blow up Pair*Task. It's kind of an awkward abstraction for tasks and redundant with the new flag.
  • get_seg_ids seems fishy in the single-input case, and there's no test coverage there (see the sketch after this list).
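
A minimal sketch of the kind of test coverage being asked for (the get_seg_ids stand-in here is hypothetical; jiant's real function and signature differ):

```python
# Hypothetical stand-in for get_seg_ids, just to pin down the single-input
# case: with only one sentence, every position (including the trailing
# [SEP]) should be segment 0.
def get_seg_ids(tokens, sep_token="[SEP]"):
    ids, seg = [], 0
    for tok in tokens:
        ids.append(seg)
        if tok == sep_token:
            seg = 1
    return ids

def test_get_seg_ids_single_input():
    assert get_seg_ids(["[CLS]", "a", "b", "[SEP]"]) == [0, 0, 0, 0]

def test_get_seg_ids_pair_input():
    assert get_seg_ids(["[CLS]", "a", "[SEP]", "b", "[SEP]"]) == [0, 0, 0, 1, 1]

test_get_seg_ids_single_input()
test_get_seg_ids_pair_input()
```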


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Wednesday Jul 24, 2019 at 22:23 GMT


  • Scale: 😬 Yeah. This is all intertwined enough that it'd be tricky to factor this into multiple PRs, and I can't sink too much time into it.
  • pair_task was deleted after some further revisions.
  • Pair*Task: I'm not sure what you're suggesting—it seems genuinely helpful for the more heavy-duty task-specific modules we used with ELMo.
  • get_seg_ids: See above.

I addressed everything else in a commit that I'll push momentarily. Then I think we're done, at least once I'm sure ReCoRD is working and once @pruksmhc verifies CCG and @anhad13 verifies the LM parsing code.


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Thursday Jul 25, 2019 at 14:45 GMT


ReCoRD seems to be working now. Here are results from a short half-epoch run:

bert-old 07/25 04:18:37 AM: micro_avg: 0.463, macro_avg: 0.463, record_f1: 0.467, record_em: 0.458, record_avg: 0.463
bert-new 07/25 04:04:25 AM: micro_avg: 0.471, macro_avg: 0.471, record_f1: 0.475, record_em: 0.467, record_avg: 0.471
xlnet-new 07/25 05:40:53 AM: micro_avg: 0.548, macro_avg: 0.548, record_f1: 0.551, record_em: 0.544, record_avg: 0.548


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Saturday Jul 27, 2019 at 19:05 GMT


@pruksmhc Thanks! I'm testing everything I can, but do make sure to have a close look at CCG. I'm not fully set up to test that.


jeswan commented Sep 17, 2020

Comment by pruksmhc
Monday Jul 29, 2019 at 02:14 GMT


CCG has tokenization as an off-line preprocessing step (which should honestly be changed). I would say just put in the documentation that CCG is not yet set up for XLNet, and leave CCG for another PR.


jeswan commented Sep 17, 2020

Comment by pruksmhc
Monday Jul 29, 2019 at 02:15 GMT


The off-line preprocessing step should honestly be made part of load_data too.


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Tuesday Jul 30, 2019 at 07:32 GMT


@pruksmhc - Mind making an issue?


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Sunday Aug 04, 2019 at 09:45 GMT


This should be ready to go after one last test. @iftenney Any last comments before I merge?


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Tuesday Aug 06, 2019 at 16:51 GMT


  • use_pytorch_transformers: Cleaned up—pushing shortly.
  • pytorch_pretrained_bert_cache: If this is only a preference, I'll leave it. Given the choice between following a simple but misleading/wrong naming convention and using a more complicated but more accurate naming convention, I tend toward the latter.

Replying to the reply-able comments above...
