
[MXNet] - [BERT] #690

Closed
araitats opened this issue May 2, 2019 · 9 comments · Fixed by apache/mxnet#14868

Comments


araitats commented May 2, 2019

There is a problem with custom BERT model training on recent nightly builds of MXNet 1.5.0 (observed with cu90).
mlm_loss stalls around 7.2x and nsp_acc stalls around 54.
The last mxnet-cu90 version that is still viable is 1.5.0b20190425.
1.5.0b20190426 onward has this issue, so you currently cannot train a custom BERT model with the latest version of MXNet.

I assume there was a change in optimization between April 25th and 26th.

I used the latest version of gluonnlp for the following test, so I don't think the problem is in gluonnlp (0.6.0).
(i.e. pip install https://github.com/dmlc/gluon-nlp/tarball/master )
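For anyone triaging the same symptom, here is an illustrative check (not part of the BERT scripts; the package name mxnet-cu90 and the version strings are simply the ones reported above) that warns if the installed nightly is newer than the last known-good build:

# Illustrative sketch only: warn if the installed mxnet-cu90 nightly is newer than
# the last known-good build (1.5.0b20190425) reported in this issue.
import pkg_resources

LAST_GOOD = pkg_resources.parse_version("1.5.0b20190425")
installed = pkg_resources.get_distribution("mxnet-cu90").version

if pkg_resources.parse_version(installed) > LAST_GOOD:
    print("WARNING: mxnet-cu90 %s is newer than 1.5.0b20190425; "
          "BERT pre-training may stall at mlm_loss ~7.2 / nsp_acc ~54." % installed)
else:
    print("mxnet-cu90 %s predates the regression." % installed)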

With mxnet-cu90==1.5.0b20190425 (This is working)

(mxnet_p36_updated_4) sh-4.2$ python run_pretraining.py --gpus 0,1,2,3,4,5,6,7 --batch_size 8 --lr 1e-3 --data "out_mcg_test-big/part-001.npz" --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 --data_eval "out_mcg_test-big/part-001.npz" --batch_size_eval 8 --ckpt_dir ckpt --ckpt_interval 25000 --accumulate 4 --num_buckets 10 --dtype float16
INFO:root:Namespace(accumulate=4, batch_size=8, batch_size_eval=8, by_token=False, ckpt_dir='ckpt', ckpt_interval=25000, data='out_mcg_test-big/part-001.npz', data_eval='out_mcg_test-big/part-001.npz', dataset_name='book_corpus_wiki_en_uncased', dtype='float16', eval_only=False, gpus='0,1,2,3,4,5,6,7', kvstore='device', log_interval=250, lr=0.001, model='bert_12_768_12', num_buckets=10, num_steps=1000000, pretrained=False, profile=False, seed=0, start_step=0, verbose=False, warmup_ratio=0.01)
INFO:root:Using training data at out_mcg_test-big/part-001.npz 
[20:42:24] src/kvstore/././comm_tree.h:356: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off 
[20:42:24] src/kvstore/././comm_tree.h:365: .vvvv... 
[20:42:24] src/kvstore/././comm_tree.h:365: v.vv.v.. 
[20:42:24] src/kvstore/././comm_tree.h:365: vv.v..v. 
[20:42:24] src/kvstore/././comm_tree.h:365: vvv....v 
... 
[20:42:24] src/kvstore/./././gpu_topology.h:216: cudaDeviceGetP2PAttribute incorrect. Falling back to cudaDeviceEnablePeerAccess for topology detection 
[20:42:24] src/kvstore/././comm_tree.h:380: Using Kernighan-Lin to generate trees 
[20:42:24] src/kvstore/././comm_tree.h:391: Using Tree 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 2 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 768 occurs 114 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 1536 occurs 2 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 3072 occurs 12 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 30522 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 393216 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 589824 occurs 50 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 2359296 occurs 24 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 23440896 occurs 1 times 
INFO:root:[step 249] mlm_loss=8.02087 mlm_acc=6.88167 nsp_loss=0.69021 nsp_acc=53.343 throughput=24.2K tks/s lr=0.0000249 time=315.28 
INFO:root:[step 499] mlm_loss=6.85134 mlm_acc=11.51758 nsp_loss=0.65648 nsp_acc=60.298 throughput=57.7K tks/s lr=0.0000499 time=133.49 
INFO:root:[step 749] mlm_loss=6.60548 mlm_acc=13.98383 nsp_loss=0.58539 nsp_acc=67.169 throughput=57.7K tks/s lr=0.0000749 time=133.54 

With mxnet-cu90==1.5.0b20190426 (This is not working)

(Same command and startup log as above)
INFO:root:[step 249]    mlm_loss=nan    mlm_acc=4.56305 nsp_loss=nan    nsp_acc=54.454  throughput=23.7K tks/s  lr=0.0000249 time=321.78
INFO:root:[step 499]    mlm_loss=7.27492        mlm_acc=5.76089 nsp_loss=0.68847        nsp_acc=54.719  throughput=57.4K tks/s     lr=0.0000499 time=134.22
INFO:root:[step 749]    mlm_loss=7.26470        mlm_acc=5.82224 nsp_loss=0.68894        nsp_acc=54.428  throughput=57.3K tks/s     lr=0.0000749 time=134.40
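Since the bad builds report mlm_loss=nan outright before settling around 7.2x, a guard like the sketch below (not part of run_pretraining.py; the helper name and arguments are made up for illustration) could abort the run at the first non-finite loss instead of burning GPU hours:

# Illustrative sketch only: stop pre-training as soon as a reported loss is non-finite.
import math

def check_losses_finite(step, mlm_loss, nsp_loss):
    # mlm_loss / nsp_loss are assumed to already be Python floats
    # (e.g. pulled off the GPU with .asscalar() before logging).
    for name, value in (("mlm_loss", mlm_loss), ("nsp_loss", nsp_loss)):
        if not math.isfinite(value):
            raise RuntimeError("step %d: %s is %r; aborting the run" % (step, name, value))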

araitats commented May 2, 2019

apache/mxnet#14864

szha added the bug label on May 2, 2019

araitats commented May 2, 2019

I used run_pretraining.py from...
http://gluon-nlp.mxnet.io/_downloads/bert.zip


tlby commented May 3, 2019

I believe this is the same issue I've been seeing, bisected to apache/mxnet@369b66d0f from apache/mxnet#14785

eric-haibin-lin (Member) commented:

@tlby thanks for reporting. Was your issue also about running the pre-training script?


tlby commented May 4, 2019

I was testing with the fine-tuning example. It looks like MXNet reverted the PR in question, so I think things are good now.

eric-haibin-lin (Member) commented:

@tlby thanks!
BTW, I'm adding some enhancements to the fine-tuning script in #692; feel free to comment on it.

ZhennanQin commented:

@eric-haibin-lin @tlby I'm trying to fix the problematic PR in MXNet, but I failed to set up the BERT example to reproduce the regression. The reported command line doesn't work for me because out_mcg_test-big/part-001.npz is not found. I haven't tried gluon-nlp BERT before; may I know the minimal command lines to reproduce this issue on top of MXNet and gluon-nlp only? Thanks in advance.

eric-haibin-lin (Member) commented:

@ZhennanQin are you able to reproduce apache/mxnet#14868 with

python run_pretraining.py --gpus 0 --batch_size 32 --lr 2e-5 --data 'out/*.npz' --warmup_ratio 0.5 --num_steps 20 --pretrained --log_interval=2 --data_eval 'out/*.npz' --batch_size_eval 8 --ckpt_dir ckpt

from http://gluon-nlp.mxnet.io/model_zoo/bert/index.html#bert-pre-training ?

ZhennanQin commented:

@eric-haibin-lin This command works. Thanks. The fixed PR has been filed again as apache/mxnet#14931; please take a look.
