[MXNet] - [BERT] #690
Comments
I used run_pretraining.py from... |
I believe this is the same issue I've been seeing, bisected to apache/mxnet@369b66d0f from apache/mxnet#14785 |
@tlby thanks for reporting. Was your issue also about running the pre-training script? |
I was testing with the finetuning example. It looks like MXNet reverted the PR in question so I think things are good now. |
@eric-haibin-lin @tlby I'm trying to fix the problematic PR in MXNet, but I failed to set up the BERT example to reproduce the regression. The reported command line doesn't work for me because out_mcg_test-big/part-001.npz is not found. I haven't tried gluon-nlp BERT before; may I know the minimal command lines to reproduce this issue on top of MXNet and gluon-nlp only? Thanks in advance. |
@ZhennanQin are you able to reproduce apache/mxnet#14868 with
from http://gluon-nlp.mxnet.io/model_zoo/bert/index.html#bert-pre-training ? |
@eric-haibin-lin This command works. Thanks. The fixed PR has been filed again as apache/mxnet#14931; please take a look. |
There is a problem training a custom BERT model with the later builds of MXNet 1.5.0 (observed with cu90).
mlm_loss stalls around 7.2X and nsp_acc stalls around 54.
The last mxnet-cu90 version that still works is 1.5.0b20190425.
1.5.0b20190426 onward has this issue, so a custom BERT model cannot currently be trained with the latest version of MXNet.
I assume there was a change in optimization between April 25th and 26th.
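For anyone checking their own environment against the boundary reported here, a minimal sketch follows. It assumes the nightly tag has the form `1.5.0bYYYYMMDD` as in the versions quoted in this report; the function names are hypothetical, and the check only encodes the 25th/26th boundary observed above, so it says nothing about later builds that may include a revert or fix.

```python
import re

# Last nightly reported as working in this issue.
LAST_KNOWN_GOOD = "1.5.0b20190425"


def nightly_date(version):
    """Return the YYYYMMDD build date from a nightly tag, or None if absent."""
    match = re.fullmatch(r"\d+\.\d+\.\d+b(\d{8})", version)
    return int(match.group(1)) if match else None


def is_after_last_known_good(version):
    """True if `version` is a 1.5.0 nightly built after LAST_KNOWN_GOOD."""
    date = nightly_date(version)
    if date is None or not version.startswith("1.5.0b"):
        return False
    return date > nightly_date(LAST_KNOWN_GOOD)
```

The version string is meant to be the one pip reports for the installed mxnet-cu90 package (for example, from `pip freeze`).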
I used the latest version of gluonnlp for the following test. I don't think the problem is in gluonnlp (0.6.0).
(i.e. pip install https://github.com/dmlc/gluon-nlp/tarball/master )
With mxnet-cu90==1.5.0b20190425 (This is working)
With mxnet-cu90==1.5.0b20190426 (This is not working)
#(Same)#
INFO:root:[step 249] mlm_loss=nan mlm_acc=4.56305 nsp_loss=nan nsp_acc=54.454 throughput=23.7K tks/s lr=0.0000249 time=321.78
INFO:root:[step 499] mlm_loss=7.27492 mlm_acc=5.76089 nsp_loss=0.68847 nsp_acc=54.719 throughput=57.4K tks/s lr=0.0000499 time=134.22
INFO:root:[step 749] mlm_loss=7.26470 mlm_acc=5.82224 nsp_loss=0.68894 nsp_acc=54.428 throughput=57.3K tks/s lr=0.0000749 time=134.40
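To spot the `nan` losses in a longer log without reading it by eye, something like the sketch below could be used. It assumes log lines shaped like the ones above (`[step N]` followed by `key=value` pairs); the helper names are hypothetical and are not part of run_pretraining.py or gluon-nlp.

```python
import math
import re

STEP_RE = re.compile(r"\[step (\d+)\]")
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line):
    """Parse one training log line into a dict, or return None if no step tag."""
    step_match = STEP_RE.search(line)
    if step_match is None:
        return None
    record = {"step": int(step_match.group(1))}
    for key, value in FIELD_RE.findall(line):
        try:
            record[key] = float(value)  # float("nan") parses to NaN
        except ValueError:
            record[key] = value  # e.g. throughput "23.7K" stays a string
    return record


def nan_fields(record):
    """Return the sorted names of metrics whose value is NaN."""
    return sorted(
        key for key, value in record.items()
        if isinstance(value, float) and math.isnan(value)
    )
```

Applied to the step 249 line above, `nan_fields(parse_log_line(line))` reports `mlm_loss` and `nsp_loss`.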