
Increased Loss/Validation Perplexity on word language model #14722

Closed
nswamy opened this issue Apr 17, 2019 · 1 comment

nswamy commented Apr 17, 2019

Description

Validation perplexity on a word language model went up from 127 to 186 on 02/21. I tested with the cu90 PyPI package 1.5.0b20190221.
Going through the commits that went in on that day, I traced the regression to the change to the softmax accumulation type in #14098.
I verified this by reverting commit 862cbc6 on top of 1.5.0b20190221 and building a static library with the static library build script; as expected, the validation perplexity came back down to 127.
There was another change on 03/12 in #14219 that changed the default data type to fp32, which brought the validation perplexity down to 140. That was an unintended consequence, not the right fix.
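
To make the suspicion concrete, here is a purely illustrative NumPy sketch (not the MXNet kernel) of how the dtype used to accumulate the softmax normalizer can shift the resulting log-probabilities, which is the kind of numerical drift that shows up in the loss:

```python
# Illustrative only: the accumulation dtype of the softmax normalizer changes the
# computed log-probabilities. This mimics the general effect, not MXNet's kernel.
import numpy as np

np.random.seed(0)
logits = np.random.uniform(-1.0, 1.0, size=10000)

exp16 = np.exp(logits.astype(np.float16))
denom_lowp = exp16.sum(dtype=np.float16)       # normalizer accumulated in float16
denom_fp32 = exp16.astype(np.float32).sum()    # normalizer accumulated in float32

logp_lowp = np.log(exp16.astype(np.float64)) - np.log(float(denom_lowp))
logp_fp32 = np.log(exp16.astype(np.float64)) - np.log(float(denom_fp32))
print("max |delta log-prob|:", np.abs(logp_lowp - logp_fp32).max())
```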

Environment info (Required)

Deep Learning Ubuntu Base AMI id: ami-0ff00f007c727c376
Instance Type: P2.16X

What to do:

  1. Create a P2.16X EC2 instance (or similar) using the AMI ID above (the latest when I tested).

  2. `sudo pip install mxnet-cu90==1.5.0b20190220`

  3. `git clone https://github.com/awslabs/deeplearning-benchmark`

  4. Use the script https://github.com/awslabs/deeplearning-benchmark/blob/master/word_language_model/word_language_model_train.py

  5. Run the script: `python word_language_model/word_language_model.py --gpus 8 --nhid 650 --emsize 650 --dropout 0.5 --epochs 40 --data word_language_model/data/ptb. --mode imperative --kvstore device`

  6. Repeat the run after `sudo pip install mxnet-cu90==1.5.0b20190221` (a version-check sketch follows this list).
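
Between steps 2 and 6 it is easy to end up comparing the wrong wheels, so a minimal version check helps; this sketch assumes the pip wheel ships the COMMIT_HASH file used in the output section below:

```python
# Minimal sketch: confirm which MXNet nightly is installed before each run.
# Assumes the wheel ships a COMMIT_HASH file next to the package, as in the output below.
import os
import mxnet as mx

print("mxnet version:", mx.__version__)
commit_file = os.path.join(os.path.dirname(mx.__file__), "COMMIT_HASH")
if os.path.exists(commit_file):
    with open(commit_file) as fh:
        print("commit hash:", fh.read().strip())
```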

Output:

  • Version: 1.5.0b20190220
hostname ; pip show mxnet-cu90 ; tail -n 10 lstm_ptb_imperative.log 
ip-172-31-29-103 
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. 
Name: mxnet-cu90 
Version: 1.5.0b20190220 
INFO:root:[Epoch 38] time cost 26.20s, valid loss 4.84, valid ppl 126.83 
INFO:root:test loss 4.79, test ppl 120.65 
INFO:root:[Epoch 39] time cost 26.03s, valid loss 4.84, valid ppl 126.66 
INFO:root:test loss 4.79, test ppl 120.26 
INFO:root:Best test loss 4.79, test ppl 120.26
  • Version: 1.5.0b20190221
hostname ; pip show mxnet-cu90 ; tail -n 10 lstm_ptb_imperative.log 
ip-172-31-63-76 
Name: mxnet-cu90 
Version: 1.5.0b20190221 
Summary: MXNet is an ultra-scalable deep learning framework. This version uses CUDA-9.0. 
Home-page: https://github.com/apache/incubator-mxnet
Author: UNKNOWN 
Author-email: UNKNOWN 
License: Apache 2.0 
Location: /usr/local/lib/python2.7/dist-packages 
Requires: numpy, requests, graphviz 
You are using pip version 9.0.1, however version 19.0.3 is available. 
You should consider upgrading via the 'pip install --upgrade pip' command. 
INFO:root:test loss 5.23, test ppl 186.97 
INFO:root:[Epoch 36] time cost 25.76s, valid loss 5.27, valid ppl 194.12 
INFO:root:test loss 5.23, test ppl 186.37 
INFO:root:[Epoch 37] time cost 25.64s, valid loss 5.24, valid ppl 189.45 
INFO:root:test loss 5.20, test ppl 181.52 
INFO:root:[Epoch 38] time cost 26.20s, valid loss 5.24, valid ppl 189.03 
INFO:root:test loss 5.20, test ppl 180.95 
INFO:root:[Epoch 39] time cost 25.59s, valid loss 5.23, valid ppl 185.90 
INFO:root:test loss 5.18, test ppl 177.53 
INFO:root:Best test loss 5.18, test ppl 177.53
  • Version: 1.5.0b20190313 -> This gives a perplexity of 140.
hostname ; pip show mxnet-cu90 ; tail -n 10 lstm_ptb_imperative.log ; echo "commit-hash" ; cat /usr/local/lib/python2.7/dist-packages/mxnet/COMMIT_HASH 
ip-172-31-29-103 
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. 
Name: mxnet-cu90 
Version: 1.5.0b20190313 
Summary: MXNet is an ultra-scalable deep learning framework. This version uses CUDA-9.0. 
Home-page: https://github.com/apache/incubator-mxnet
Author: UNKNOWN 
Author-email: UNKNOWN 
License: Apache 2.0 
Location: /usr/local/lib/python2.7/dist-packages 
Requires: numpy, requests, graphviz 
Required-by: 
INFO:root:test loss 4.92, test ppl 136.69 
INFO:root:[Epoch 36] time cost 25.91s, valid loss 4.96, valid ppl 142.06 
INFO:root:test loss 4.91, test ppl 135.81 
INFO:root:[Epoch 37] time cost 25.51s, valid loss 4.95, valid ppl 141.45 
INFO:root:test loss 4.91, test ppl 135.16 
INFO:root:[Epoch 38] time cost 25.57s, valid loss 4.95, valid ppl 141.29 
INFO:root:test loss 4.91, test ppl 134.99 
INFO:root:[Epoch 39] time cost 26.08s, valid loss 4.95, valid ppl 140.87 
INFO:root:test loss 4.90, test ppl 134.48 
INFO:root:Best test loss 4.90, test ppl 134.48 
commit-hash 
4432af1f47a439517eff9a21bef23ef7ae5e4aa4
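
For reference, the logged perplexities are simply the exponential of the reported losses, so the three builds can be compared on either number; a quick check of the values above:

```python
# Perplexity is exp(cross-entropy loss); the rounded loss values from the logs
# reproduce the logged perplexities to within rounding.
import math

for version, loss in [("1.5.0b20190220", 4.84), ("1.5.0b20190221", 5.23), ("1.5.0b20190313", 4.95)]:
    print(version, "valid loss", loss, "-> ppl", round(math.exp(loss), 2))
```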

What have you tried:

  1. Reverted the softmax commit 862cbc6 on top of the 0221 build and reran the test; the validation perplexity dropped back to 127.

  2. I have a PR in progress; with it applied, the validation perplexity drops back to 127:
    https://github.com/apache/incubator-mxnet/compare/master...nswamy:softmax_fix?expand=1
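
For completeness, a hedged sketch of the operator-level knob that #14098 touched, assuming the installed nightly exposes the `dtype` argument on softmax; this only demonstrates requesting float32 accumulation explicitly, it is not the fix in the PR above:

```python
# Hedged sketch: request float32 accumulation/output on softmax explicitly.
# Assumes the nightly exposes the `dtype` argument added around #14098; illustrative only.
import mxnet as mx

x = mx.nd.random.uniform(shape=(2, 8), dtype='float16')
out_default = mx.nd.softmax(x, axis=-1)                   # build's default accumulation behavior
out_fp32 = mx.nd.softmax(x, axis=-1, dtype='float32')     # explicitly request float32
print(out_default.dtype, out_fp32.dtype)
```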

@nswamy self-assigned this Apr 17, 2019
@nswamy changed the title from "Increased Validation Perplexity on word language model" to "Increased Loss/Validation Perplexity on word language model" on Apr 22, 2019
@eric-haibin-lin
Member

It should be fixed by #15037, which makes the behavior the same as before.
