Bert-large model not attaining ~65% accuracy even after training till 52k timesteps! #10

Open
karthikj11 opened this issue Jul 29, 2020 · 47 comments

Comments

@karthikj11

karthikj11 commented Jul 29, 2020

We are using a P100 GPU and 25 GB of RAM to train the BERT-large model.
But when we tried to run the default code with bs=6 and num_batch_accumulated=4, we got a CUDA out-of-memory error.
So we changed it to bs=2 and num_batch_accumulated=8, since you said anything between 16 and 24 would perform similarly.
But now, after training for 52000 timesteps, the maximum accuracy we got is ~59.6%, at timestep 44000.
Is it taking longer because we changed the batch size? Or is there anything else we are missing?
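
For reference, the two settings above can be compared by their effective batch size. This is a minimal sketch, assuming the effective batch size is simply bs multiplied by num_batch_accumulated (the usual meaning of gradient accumulation); the helper name is made up for illustration and is not part of the repo.

```python
def effective_batch_size(bs: int, num_batch_accumulated: int) -> int:
    """Number of examples contributing to one optimizer step (assumed definition)."""
    return bs * num_batch_accumulated


default_setting = effective_batch_size(6, 4)  # default bs and accumulation
reduced_setting = effective_batch_size(2, 8)  # setting chosen to avoid CUDA OOM
print(default_setting, reduced_setting)  # 24 16
```

Under that assumption, the reduced setting sits at the low end of the 16 to 24 range mentioned above, while the default sits at the high end.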

Results at timesteps 48000 and 52000:

Loading model from logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00048000
DB connections: 100% 166/166 [02:31<00:00, 1.10it/s]
100% 1034/1034 [05:45<00:00, 2.99it/s]
DB connections: 100% 166/166 [00:00<00:00, 448.81it/s]
Wrote eval results to logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step48000.eval
48000 0.5638297872340425

Loading model from logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00052000
DB connections: 100% 166/166 [00:00<00:00, 443.91it/s]
100% 1034/1034 [05:31<00:00, 3.12it/s]
DB connections: 100% 166/166 [00:00<00:00, 467.06it/s]
Wrote eval results to logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step52000.eval
52000 0.586073500967118

@berlino
Collaborator

berlino commented Jul 29, 2020

We tuned the bert model very carefully with a very small lr (3e-6), but the price to pay is that it has to be trained for much longer. I guess you can still expect the performance to be >60% if you keep training. Another option is that you could try a larger bert lr (like 1e-5).

@DevanshChoubey

Hi @berlino,

Can you please tell me when we can expect the re-parsed versions of the corrupt trees? And is that also the reason for the lower accuracy?

@DevanshChoubey

Hi @karthikj11

Is there any possibility that you could share your trained model with me?

@berlino
Collaborator

berlino commented Jul 29, 2020

Hi, @DevanshChoubey I don't think that's the reason.

Sorry for the confusion. There seem to be only one or two SQL queries in Spider that are not well-structured. The patch we apply is actually available in Richard's repo: https://github.com/rshin/seq2struct/blob/master/data/spider-20190205/train_spider.json.patch

@DevanshChoubey

thanks @berlino
kudos to you and @alexpolozov

Just one more question: did you train on a V100 with 16 GB of memory using the same default code, with bs=6 and num_batch_accumulated=4, without any out-of-memory errors?

Could you please share the best trained model, if possible?

@senthurRam33

senthurRam33 commented Jul 30, 2020

@berlino When evaluating on the dev set, did you clean it any further? I ask because I trained the model and tested its accuracy, and it gives me only around 60% at 70000 timesteps.

@alexpolozov
Contributor

alexpolozov commented Jul 30, 2020

@DevanshChoubey We might be able to share our trained checkpoints later but that would require a separate release review process, unfortunately. Will take a few weeks to go through.

@berlino Richard actually re-parses every SQL into an AST. Many ASTs (more than two) in the original Spider release were broken (taoyds/spider#3). The authors fixed their SQL parser later but did not re-generate the ASTs. The patch file is actually orthogonal, it fixes the SQL string in a couple instances, not the AST.
I thought that this would be updated in the latest release of Spider, but – after checking with them – apparently it wasn't. So we need to include the re-parsing code in the preprocessing of this release. This is a big bug in our release and is probably responsible for many people's drop in accuracy.
This is only a release bug. We have run this re-parsing in the very beginning internally, so our ASTs are correct and thus RAT-SQL can fit a more reasonable model.

@berlino
Collaborator

berlino commented Jul 30, 2020

@alexpolozov I see. I guess we could add the script https://github.com/rshin/seq2struct/blob/master/data/spider-20190205/generate.sh and the patch file to this repo, as Richard did. I could help submit a PR to fix this.

@alexpolozov
Contributor

Update: upon our request, there should be a new release of Spider with the fixed ASTs over the next few days. I think this is a better solution in the long term. Once it happens, we can close this issue.

@senthurRam33

senthurRam33 commented Jul 31, 2020

[Images: bert_val_alex_vs_us, bert_train_alex_vs_us]

I have trained the BERT model and plotted the loss of my model against the loss of @alexpolozov's. The two loss curves overlap one another. But the model achieves only 60% accuracy at 80000 steps on the Spider dev set. The result at timestep 80000 is as follows.

Loading model from logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00080000
DB connections: 100% 166/166 [00:00<00:00, 365.57it/s]
100% 1034/1034 [05:31<00:00, 3.12it/s]
DB connections: 100% 166/166 [00:00<00:00, 366.52it/s]
Wrote eval results to logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step80000.eval
80000 0.6015473887814313

So do you think this is the highest accuracy that the BERT-based model can achieve on the current Spider dataset, or are there any ways to improve the accuracy?

@karthikj11
Author

karthikj11 commented Jul 31, 2020

@alexpolozov @berlino Thanks for your quick response.
Is there any way to re-parse the ASTs in the spider dataset by ourselves?
Or should we wait until the new release of Spider dataset with fixed ASTs?

It would be very helpful if you can guide me to re-parse the ASTs in the current spider dataset.

@slamandar

Quoting @berlino: "We tuned the bert model very carefully with a very small lr (3e-6), but the price to pay is that it has to be trained for much longer. I guess you can still expect the performance to be >60% if you keep training. Another option is that you could try a larger bert lr (like 1e-5)."

Hi, have you tried a larger BERT lr such as 1e-5 or even 5e-5? Do these learning rates lead to non-convergence? Thanks.

@DevanshChoubey

DevanshChoubey commented Aug 9, 2020

@berlino @alexpolozov

Hi,
I was able to re-parse all the ASTs with @rshin's generate.sh. It improved things somewhat, but not much: there is still a large difference between my loss and the reference log.txt posted by @alexpolozov:
[image]

Is there anything else you did that we are missing?

Update: Spider released their update with the re-parsed ASTs, but the loss is still about the same, not much difference:
[image]

@senthurRam33

senthurRam33 commented Aug 12, 2020

@DevanshChoubey I had the same loss initially, too. Try running the model for at least 10000 steps, so that the train loss settles around 2.0 and the val loss settles around 6.0.

@DevanshChoubey

@senthurRam33,

Yeah, I got that after training for 6000 steps.

Anyway, what accuracy did you get on eval?

@senthurRam33

It was around 60%, but I haven't checked the model on the newly released Spider dataset.

@dorajam

dorajam commented Aug 13, 2020

@senthurRam33 do you mind sharing which hyperparams you used? Specifically the batch size and LR? I trained for 80K steps and the performance dropped from 56% to around 51% after around the first 50K steps. It never even approached 60%. I used a batch size of 4x4 (bs x num_batch_accumulated). Thanks in advance!

@senthurRam33

senthurRam33 commented Aug 17, 2020

@dorajam I only changed the batch size, because of memory issues, to 2x8 (bs x num_batch_accumulated). The remaining hyperparameters are used as they are. Try retraining the model with the new Spider dataset for improved accuracy.

@senthurRam33

senthurRam33 commented Aug 17, 2020

@alexpolozov @rshin In #7 (comment) you attached your log file. There the training loss goes down to 0, but the val loss stays around 6. If we applied the flooding technique, is there a chance that the val loss would be reduced?
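
For readers unfamiliar with flooding: the idea is to keep the training loss from dropping below a chosen flood level b by optimizing |loss - b| + b instead of the raw loss. The sketch below only shows that transformation on plain floats; the function name and the flood level value are illustrative assumptions, and nothing here is taken from the RAT-SQL code.

```python
def flooded_loss(loss: float, flood_level: float) -> float:
    """Flooding objective: |loss - b| + b.

    Above the flood level this equals the original loss. Below it, the
    value is mirrored upward, so gradient descent on it pushes the
    original loss back toward the flood level instead of toward zero.
    """
    return abs(loss - flood_level) + flood_level


b = 0.5  # hypothetical flood level; it would need tuning
print(flooded_loss(2.0, b))   # 2.0  (unchanged above the flood level)
print(flooded_loss(0.25, b))  # 0.75 (mirrored below the flood level)
```

Whether this would actually reduce the validation loss here is an open question; the sketch only illustrates what the technique computes.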

@ygan

ygan commented Aug 20, 2020

I ran the BERT model for 81K steps with the original parameters but got only 64% accuracy. No double descent appears: the accuracy has already been at 64% since step 30K. The version is 648fc87, from 15 Aug.

@senthurRam33

@ygan Did you train the model with the new Spider dataset??

@zhangyuchen584

Hi @berlino,

Did you use the current default hyperparameters to get 69% accuracy? I noticed that in the code the encoder dropout is set to null (0.2 in your paper), d_x and d_z in the attention layer are set to 128 (256 in the paper), and the decoder dropout is set to 0.2068 (0.21 in the paper).

I ran the BERT model for 78k steps with the default parameters and only got 56% accuracy:
Wrote eval results to logdir/bert_run/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step77100.eval
77100 0.5647969052224371

Sometimes the experiments converge at around 35k steps and then the loss grows larger and larger, yet I get 60% accuracy at 35k, which is more than after training for 78k.

The best accuracy I can currently get is 60%. Could you tell me how to fine-tune further? Many thanks!

@berlino
Collaborator

berlino commented Aug 26, 2020

hi, @zhangyuchen584

In the paper we use dropout 0.21 ≈ 0.2068. Encoder dropout is only used for the non-BERT model (encoderbert does not have this option). Which attention are you referring to?

@berlino
Collaborator

berlino commented Aug 26, 2020

hi, all

I've attached below my original model config, the one used for the evaluation (V3) on the Spider leaderboard. After a careful comparison with it, I think I might have missed committing one config option (really sorry if this turns out to matter!).

The missing option is "loss_type", which I set to "label_smooth", whereas the default is "softmax". My initial experiments showed that this option does not matter if max_steps is 40k. I ended up using "label_smooth" when training for 81k steps, but didn't have time back then to test the difference against "softmax" in the 81k setting. If you try activating label_smooth, please let me know whether it really matters. Thanks.

{
  "model": {
    "decoder": {
      "desc_attn": "mha",
      "dropout": 0.20687225956012834,
      "enc_recurrent_size": 1024,
      "enumerate_order": false,
      "loss_type": "label_smooth",
      "name": "NL2Code",
      "recurrent_size": 512,
      "use_align_loss": true,
      "use_align_mat": true
    },
    "decoder_preproc": {
      "grammar": {
        "clause_order": null,
        "end_with_from": true,
        "factorize_sketch": 2,
        "include_literals": false,
        "infer_from_conditions": true,
        "name": "spider",
        "output_from": true,
        "use_table_pointer": true
      },
      "save_path": "preproc/",
      "use_seq_elem_rules": true
    },
    "encoder": {
      "bert_token_type": true,
      "bert_version": "bert-large-uncased-whole-word-masking",
      "name": "spider-bert",
      "summarize_header": "avg",
      "update_config": {
        "cv_link": true,
        "name": "relational_transformer",
        "num_heads": 8,
        "num_layers": 8,
        "sc_link": true
      },
      "use_column_type": false
    },
    "encoder_preproc": {
      "bert_version": "bert-large-uncased-whole-word-masking",
      "compute_cv_link": true,
      "compute_sc_link": true,
      "db_path": "database/",
      "fix_issue_16_primary_keys": true,
      "include_table_name_in_column": false,
      "save_path": "preproc/"
    },
    "name": "EncDec"
  },
}
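
To make the difference between the two loss types concrete, here is a minimal sketch of the textbook form of label smoothing: the gold class gets weight 1 - epsilon plus its share of epsilon spread uniformly over all classes. This is an assumption about what "label_smooth" means in general; the repo's own implementation and its smoothing value may differ, and the function names and epsilon=0.1 are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_nll(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution.

    epsilon=0.0 recovers the ordinary softmax cross-entropy.
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Uniform share for every class, plus the remaining mass on the gold class.
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * lp
    return loss


logits = [4.0, 1.0, 0.5]
plain = label_smoothed_nll(logits, target=0, epsilon=0.0)
smooth = label_smoothed_nll(logits, target=0, epsilon=0.1)
print(plain, smooth)
```

With smoothing, a very confident correct prediction is still penalized slightly, which discourages the model from driving the training loss all the way to zero.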

@zhangyuchen584

Hi @berlino,

I used the default 'desc_attn' = 'mha'. In the models/attention.py file, in the MultiHeadedAttention class (around line 105), self.d_k = value_size // h, which equals 128 (I ran the experiments with the default parameters). In the paper, the attention uses d_x = d_z = 256.
[image]

Am I misunderstanding? Thank you.
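
The arithmetic behind the 128 can be sketched as follows, assuming value_size is 1024 (the hidden size of bert-large, matching enc_recurrent_size in the config above) and h is 8 (num_heads in the same config). Both values are assumptions about what the default run passes in; the helper name is made up.

```python
def per_head_dim(value_size: int, num_heads: int) -> int:
    """Per-head dimension d_k, computed the same way as value_size // h."""
    if value_size % num_heads != 0:
        raise ValueError("value_size must be divisible by num_heads")
    return value_size // num_heads


print(per_head_dim(1024, 8))  # 128
```

So 128 would be a per-head size under these assumptions; whether the paper's 256 refers to a per-head or a whole-layer dimension is exactly the question being asked here.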

@mellahysf

mellahysf commented Sep 10, 2020

Hi all,

Can someone share the best pretrained model,
along with code to use it directly to generate an SQL query from text?

Thanks.

@DevanshChoubey

@berlino @alexpolozov

Finally, after changing the loss to label_smooth, I got 65% accuracy at around 22000 steps. I'm going to keep training and see how much it can improve.

anyway , thanks again for all the help, you guys are the real heroes...

@Sea-Snell

I was having trouble with this for a while, getting around 59%, but then I tried label_smooth with a BERT lr of 1e-5 and got 68.8% after 42k steps. So I recommend trying that.

@DevanshChoubey

DevanshChoubey commented Sep 16, 2020

Hi @Sea-Snell,

Yeah, exactly. I am also at 67% at 26000 steps and still training, but with the same lr of 3e-6.

Latest update:

got 69.2% accuracy at step 41800.

Seems label_smooth was what made the difference, at least for me.

@sanjayss34

@DevanshChoubey could you please paste your config here? I think I'm using the default settings except with label_smooth but I'm only getting 60.9% accuracy at 23100 steps, rather than 65% like you said. Thanks!

@slamandar

Hi @DevanshChoubey @Sea-Snell

How large is the performance gap between different saved checkpoints in your experiments? For example, I evaluated the checkpoints at 38k, 39k and 40k steps and got 66.4%, 67.9% and 65.8%, so the performance is not stable. Is this normal in your experiments?

@Sea-Snell

I was actually able to get up to 72% in about 50k steps, and the accuracy seemed to stay above 68% consistently, so maybe a little bit unstable, I don't know.

try this config:
local _base = import 'nl2code-base.libsonnet';
local _output_from = true;
local _fs = 2;

function(args) _base(output_from=_output_from, data_path=args.data_path) + {
local data_path = args.data_path,

local lr_s = '%0.1e' % args.lr,
local bert_lr_s = '%0.1e' % args.bert_lr,
local end_lr_s = if args.end_lr == 0 then '0e0' else '%0.1e' % args.end_lr,

local base_bert_enc_size = if args.bert_version == "bert-large-uncased-whole-word-masking" then 1024 else 768,
local enc_size =  base_bert_enc_size,

model_name: 'bs=%(bs)d,lr=%(lr)s,bert_lr=%(bert_lr)s,end_lr=%(end_lr)s,att=%(att)d' % (args + {
    lr: lr_s,
    bert_lr: bert_lr_s,
    end_lr: end_lr_s,
}),

model+: {
    encoder+: {
        name: 'spider-bert',
        batch_encs_update:: null,
        question_encoder:: null,
        column_encoder:: null,
        table_encoder:: null,
        dropout:: null,
        update_config+:  {
            name: 'relational_transformer',
            num_layers: args.num_layers,
            num_heads: 8,
            sc_link: args.sc_link,
            cv_link: args.cv_link,
        },
        summarize_header: args.summarize_header,
        use_column_type: args.use_column_type,
        bert_version: args.bert_version,
        bert_token_type: args.bert_token_type,
        top_k_learnable:: null,
        word_emb_size:: null,
    },
    encoder_preproc+: {
        word_emb:: null,
        min_freq:: null,
        max_count:: null,
        db_path: data_path + "database",
        compute_sc_link: args.sc_link,
        compute_cv_link: args.cv_link,
        fix_issue_16_primary_keys: true,
        bert_version: args.bert_version,
        count_tokens_in_word_emb_for_vocab:: null,
        save_path: data_path + 'nl2code,output_from=%s,fs=%d,emb=bert,cvlink' % [_output_from, _fs],
    },
    decoder_preproc+: {
        grammar+: {
            end_with_from: args.end_with_from,
            clause_order: args.clause_order,
            infer_from_conditions: true,
            factorize_sketch: _fs,
            include_literals: false,
        },
        save_path: data_path + 'nl2code,output_from=%s,fs=%d,emb=bert,cvlink' % [_output_from, _fs],

        compute_sc_link:: null,
        compute_cv_link:: null,
        db_path:: null,
        fix_issue_16_primary_keys:: null,
        bert_version:: null,
    },
    decoder+: {
        name: 'NL2Code',
        dropout: 0.20687225956012834,
        desc_attn: 'mha',
        enc_recurrent_size: enc_size,
        recurrent_size : args.decoder_hidden_size,
        loss_type: 'label_smooth',
        use_align_mat: args.use_align_mat,
        use_align_loss: args.use_align_loss,
    }
},

train+: {
    batch_size: args.bs,
    num_batch_accumulated: args.num_batch_accumulated,
    clip_grad: 1,

    model_seed: args.att,
    data_seed:  args.att,
    init_seed:  args.att,
    
    max_steps: args.max_steps,
},

optimizer: {
    name: 'bertAdamw',
    lr: 0.0,
    bert_lr: 0.0,
},

lr_scheduler+: {
    name: 'bert_warmup_polynomial_group',
    start_lrs: [args.lr, args.bert_lr],
    end_lr: args.end_lr,
    num_warmup_steps: $.train.max_steps / 8,
},

log: {
    reopen_to_flush: true,
}

}
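
The lr_scheduler section of this config sets num_warmup_steps to max_steps / 8. The sketch below shows what a linear warmup followed by a polynomial decay to end_lr could look like. Only the scheduler name, the warmup fraction, and the start/end learning rates come from the config; the decay shape, the power parameter, and the function name are assumptions and not the repo's actual scheduler code.

```python
def warmup_polynomial_lr(step, start_lr, end_lr, max_steps, num_warmup_steps, power=1.0):
    """Linear warmup to start_lr, then polynomial decay to end_lr (assumed shape)."""
    if step < num_warmup_steps:
        return start_lr * step / num_warmup_steps
    decay_steps = max_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return (start_lr - end_lr) * (1.0 - progress) ** power + end_lr


max_steps = 81000                 # value discussed in this thread
num_warmup_steps = max_steps / 8  # as in the config: 10125.0
for step in (0, 5000, 10125, 40000, 81000):
    bert_lr = warmup_polynomial_lr(step, 1e-5, 0.0, max_steps, num_warmup_steps)
    print(step, bert_lr)
```

Under these assumptions the BERT parameters only reach their peak learning rate after about 10k steps, which is one reason early checkpoints are not very informative.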

@slamandar

@Sea-Snell Thanks for sharing! Did you make any improvements to reach 72%, or can the provided code reach 72% by itself?

@sanjayss34

Thank you, @Sea-Snell ! Could you also please paste your version of experiments/spider-bert-run.jsonnet?

@Sea-Snell

the version I trained to 72% was just with the provided code, no improvements. Here is my experiments/spider-bert-run.jsonnet:

{
    logdir: "logdir/bert_run",
    model_config: "configs/spider/nl2code-bert.jsonnet",
    model_config_args: {
        data_path: 'data/spider/',
        bs: 6,
        num_batch_accumulated: 4,
        bert_version: "bert-large-uncased-whole-word-masking",
        summarize_header: "avg",
        use_column_type: false,
        max_steps: 81000,
        num_layers: 8,
        lr: 7.44e-4,
        bert_lr: 1e-5,
        att: 1,
        end_lr: 0,
        sc_link: true,
        cv_link: true,
        use_align_mat: true,
        use_align_loss: true,
        bert_token_type: true,
        decoder_hidden_size: 512,
        end_with_from: true, # equivalent to "SWGOIF" if true
        clause_order: null, # strings like "SWGOIF", it will be prioritized over end_with_from
    },

    eval_name: "bert_run_%s_%d" % [self.eval_use_heuristic, self.eval_beam_size],
    eval_output: "__LOGDIR__/ie_dirs",
    eval_beam_size: 1,
    eval_use_heuristic: true,
    eval_steps: [11200],
    eval_section: "val"
}

@duyvuleo

@Sea-Snell : I used the same config as you shared, except for batch_size=2 and num_batch_accumulated=8 due to an OOM issue (FYI, my GPU is a 16GB V100).
The result I got:

9100 0.402321083172147

9600 0.41779497098646035

10100 0.4090909090909091

The accuracy is pretty low compared to yours. Did I do something wrong, or do I need to wait longer?

Thanks!

@Sea-Snell

I think you need to wait longer. Even if the loss seems to converge the accuracy still increases. I didn't reach 72% until about 42k steps.

@shuqinlee

@duyvuleo I also use the same GPU (16GB V100) with the same batch_size=2 and num_batch_accumulated=8, and a BERT lr of 3e-6, but got only 62% at step 76100 and 59% at step 81000.
What are your results?

@duyvuleo

duyvuleo commented Nov 1, 2020

@shuqinlee : Here are my results, which are about the same as yours:

9100 0.402321083172147

9600 0.41779497098646035

10100 0.4090909090909091

27600 0.5454545454545454

27900 0.5764023210831721

76100 0.6121856866537717

81000 0.59284332688588

BTW, training RatSQL is pretty slow and unstable. I don't know why @Sea-Snell can get 72% accuracy with the same configuration.
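
Since the accuracy fluctuates between checkpoints, it can help to scan all evaluated steps rather than only the last one. The sketch below parses "step accuracy" lines like the ones printed above and picks the best checkpoint; the function name is made up, and the input is simply the numbers from this comment pasted into a string.

```python
def parse_eval_lines(text):
    """Extract (step, accuracy) pairs from lines of the form '<step> <accuracy>'."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and other log output
        try:
            results.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue  # skip lines that are not numeric pairs
    return results


log = """
9100 0.402321083172147
9600 0.41779497098646035
10100 0.4090909090909091
27600 0.5454545454545454
27900 0.5764023210831721
76100 0.6121856866537717
81000 0.59284332688588
"""

results = parse_eval_lines(log)
best_step, best_acc = max(results, key=lambda r: r[1])
final_step, final_acc = results[-1]
print(best_step, round(best_acc * 100, 1))    # 76100 61.2
print(final_step, round(final_acc * 100, 1))  # 81000 59.3
```

Here the best checkpoint is not the final one, which matches the instability described in this thread.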

@Sea-Snell

Sea-Snell commented Nov 2, 2020

I was having unstable training until I changed the loss to label_smooth instead of softmax (which is the default) in the config. Did you do that ? @duyvuleo

@shuqinlee

shuqinlee commented Nov 2, 2020

@duyvuleo Thanks for your result.

Hi @Sea-Snell, I have changed the loss type to 'label_smooth' but still get that result. The only differences between my config and yours are bs (2 vs 6), num_batch_accumulated (8 vs 4) and bert_lr (3e-6 vs 1e-5). But I still only get up to 62%. I don't think this gap can be explained by instability alone.

Did you directly use the Spider dataset from https://yale-lily.github.io/spider, in the version released after 8.3?

Otherwise, the gap could be due to the learning rate and batch size.

@duyvuleo

duyvuleo commented Nov 2, 2020

@Sea-Snell : yes, i already changed the loss type from softmax to label_smooth (in fact, i copied and pasted your posted config, just changed the batch_size).

@shuqinlee

shuqinlee commented Nov 2, 2020

Hi @DevanshChoubey,
I noticed that you also use a V100 GPU and got 69%! I have bs=2, num_batch_accumulated=8 and bert_lr=3e-6, and have changed the loss type to label_smooth, but still get less than 62%.

Could you please share your config, since each attempt takes days? Thanks in advance!!

@duyvuleo

duyvuleo commented Nov 3, 2020

Hi @shuqinlee : I finally managed to get 68-70% accuracy with batch_size=2 by setting num_batch_accumulated=12.

20100 0.6847195357833655
22000 0.6866537717601547
23900 0.7030947775628626

Will let you know if I can get 72%, as @Sea-Snell did, when training up to 80K steps.
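
One way to see why 2x12 can behave like the default 6x4: with gradient accumulation, averaging the gradients of 12 micro-batches of 2 gives the same result as averaging 4 micro-batches of 6, because both cover the same 24 examples. The toy sketch below checks this on a one-parameter squared-error model. It assumes the accumulated gradient is the mean over micro-batches; whether the repo normalizes it exactly this way is an assumption, and all names and data are made up.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, bs, num_batch_accumulated):
    """Average the micro-batch gradients before one (hypothetical) optimizer step."""
    total = 0.0
    for i in range(num_batch_accumulated):
        micro_batch = data[i * bs:(i + 1) * bs]
        total += grad_mse(w, micro_batch) / num_batch_accumulated
    return total


# 24 toy examples on the line y = 3x + 1
data = [(float(i), 3.0 * i + 1.0) for i in range(24)]
w = 0.5

g_full = grad_mse(w, data)
g_6x4 = accumulated_grad(w, data, bs=6, num_batch_accumulated=4)
g_2x12 = accumulated_grad(w, data, bs=2, num_batch_accumulated=12)
print(g_full, g_6x4, g_2x12)  # equal up to floating-point rounding
```

With bs=2 and num_batch_accumulated=8, by contrast, each optimizer step would only see 16 examples, so the per-step gradient is a noisier estimate.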

@shuqinlee

shuqinlee commented Nov 5, 2020

@duyvuleo Wow, that's inspiring!! Turns out the problem was batch_size and num_batch_accumulated. I am now training with batch_size=3, num_batch_accumulated=8 and getting

20100 0.6199226305609284
21100 0.6247582205029013
21400 0.6315280464216635

I will train for more steps to see whether it gets to 68-70%, and try your config later. Again, thanks for your results, this is really helpful.

@duyvuleo

duyvuleo commented Nov 9, 2020

@shuqinlee: It looks like I can get ~72% accuracy with more training:

76100 0.7147001934235977

@mellahysf

Hi @duyvuleo @shuqinlee ,

Can you guys share with me your current best pretrained model?

Thanks.
