MXNet 2.x significantly slower than 1.x in Sockeye #20636

fhieber · 2021-10-05T07:25:00Z

Description

We observe a significant reduction in Sockeye inference speed with a recent build of MXNet 2.x (master branch). Compared to 1.x versions of MXNet, GPU translation with MXNet 2.x is roughly 1.5x to 2.8x slower in the runs below, depending on instance type and batch size.

For MXNet 2.x, we migrated Sockeye to the Gluon 2.0 interface and adopted the new NumPy namespaces. Otherwise, the code is equivalent to master, with the same level of hybridization (static_alloc=True) in both branches. The pull request/branch can be found here: awslabs/sockeye#953.

The runs below use half precision, on a p3.2xlarge and a g4 instance. Outputs are equal across MXNet versions.

p3.2xlarge instance

batch size 64

mxnet-cu112 2.0.0b20211001:

[INFO:__main__] Processed 3003 lines. Total time: 37.2888, sec/sent: 0.0124, sent/sec: 80.5336

mxnet-cu112 1.7:

[INFO:__main__] Processed 3003 lines. Total time: 20.2805, sec/sent: 0.0068, sent/sec: 148.0735

batch size 1

mxnet-cu112 2.0.0b20211001:

[INFO:__main__] Processed 3003 lines. Total time: 858.3818, sec/sent: 0.2858, sent/sec: 3.4984

mxnet-cu112 1.7:

[INFO:__main__] Processed 3003 lines. Total time: 302.0189, sec/sent: 0.1006, sent/sec: 9.9431

g4 instance

mx18/out.1.bpe.log:[2021-10-04:20:02:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891
mx18/out.64.bpe.log:[2021-10-04:20:03:10:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819
mx20/out.1.bpe.log:[2021-10-04:20:17:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026
mx20/out.64.bpe.log:[2021-10-04:20:18:26:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352
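
To make the slowdown concrete, the total times reported above can be turned into ratios. The following is a minimal sketch and not part of Sockeye: the dictionary layout and the helper names (total_time, slowdown, slowdown_table) are made up for illustration, and only the log text comes from the runs above.

```python
import re

# Tail of each log line reported above, stored as
# (MXNet 1.x line, MXNet 2.x line) per (instance, batch size).
RUNS = {
    ("p3.2xlarge", 64): (
        "Total time: 20.2805, sec/sent: 0.0068, sent/sec: 148.0735",
        "Total time: 37.2888, sec/sent: 0.0124, sent/sec: 80.5336",
    ),
    ("p3.2xlarge", 1): (
        "Total time: 302.0189, sec/sent: 0.1006, sent/sec: 9.9431",
        "Total time: 858.3818, sec/sent: 0.2858, sent/sec: 3.4984",
    ),
    ("g4", 64): (
        "Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819",
        "Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352",
    ),
    ("g4", 1): (
        "Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891",
        "Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026",
    ),
}

TOTAL_TIME = re.compile(r"Total time: ([0-9]+\.[0-9]+)")


def total_time(log_line: str) -> float:
    """Extract the 'Total time' value (seconds) from a Sockeye log line."""
    match = TOTAL_TIME.search(log_line)
    if match is None:
        raise ValueError(f"no 'Total time' field in: {log_line!r}")
    return float(match.group(1))


def slowdown(old_line: str, new_line: str) -> float:
    """Return how many times longer the new run took than the old run."""
    return total_time(new_line) / total_time(old_line)


def slowdown_table(runs):
    """Map each (instance, batch size) key to its slowdown ratio."""
    return {key: slowdown(old, new) for key, (old, new) in runs.items()}


for (instance, batch_size), ratio in slowdown_table(RUNS).items():
    print(f"{instance}, batch size {batch_size}: {ratio:.2f}x slower")
```

On these numbers the ratio is largest at batch size 1 (about 2.8x on p3.2xlarge and 2.3x on g4) and smaller at batch size 64 (about 1.8x and 1.5x).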

To Reproduce

Download the Sockeye sample model
Run translate.sh with the master branch of Sockeye
Run translate.sh with the mx2 branch of Sockeye

Steps to reproduce

wget https://github.com/awslabs/sockeye/releases/download/2.3.22/wmt14_en_de.tgz
tar -xvf wmt14_en_de.tgz
git clone https://github.com/awslabs/sockeye.git
pip install -r sockeye/requirements/requirements.gpu-cu112.txt
mv sockeye/sockeye wmt_14_en_de
cd wmt_14_en_de
bash translate.sh  # translate with master branch
git checkout mx2
# Install nightly build of mx2:
pip uninstall mxnet-cu112 ; pip install --pre -f https://dist.mxnet.io/python 'mxnet-cu112'
bash translate.sh  # translate with mx2 branch

What have you tried to solve it?

Environment

Cuda 11.2 (conda install -c conda-forge nccl cudnn cudatoolkit==11.2)
MXNet 1.8.post0 or MXNet 1.7 vs MXNet 2.x (2.0.0b20211001)

TristonC · 2021-11-05T18:38:25Z

@blchu has been working on this together with @barry-jin. We found significant CPU overhead in 2.x compared to 1.x. One specific op, unravel, runs on the CPU instead of the GPU in 2.x due to the interface change. A fix is in progress. @szha FYI.
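
For readers unfamiliar with the op, here is a small pure-Python sketch of what an unravel operation computes: it converts a flat index into a multi-dimensional index. The beam search framing (scores flattened over beam size and vocabulary size) and all names and sizes below are assumptions chosen for illustration; the thread does not show the Sockeye code that calls the op.

```python
from typing import Sequence, Tuple


def unravel_index(flat_index: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Convert a flat index into a row-major (C-order) index for `shape`."""
    size = 1
    for dim in shape:
        size *= dim
    if not 0 <= flat_index < size:
        raise ValueError(f"index {flat_index} out of range for shape {tuple(shape)}")

    coords = []
    # Peel off the fastest-varying (last) dimension first.
    for dim in reversed(shape):
        flat_index, coord = divmod(flat_index, dim)
        coords.append(coord)
    return tuple(reversed(coords))


# Hypothetical example: scores for 3 hypotheses over a 5-word vocabulary are
# flattened to 15 entries, and a top-k step returned these flat positions.
beam_size, vocab_size = 3, 5
top_flat_indices = [7, 14, 0]

# Each flat position maps back to a (hypothesis, token) pair.
pairs = [unravel_index(i, (beam_size, vocab_size)) for i in top_flat_indices]
print(pairs)  # [(1, 2), (2, 4), (0, 0)]
```

An op like this is cheap in itself, so if it runs on the CPU inside a GPU decoding loop, the cost would come mainly from moving data between device and host at every step.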

blchu · 2022-01-11T23:07:20Z

I've done some additional profiling and noticed that parts of the code are slowed down by functions that currently call asnumpy() and fall back to the equivalent NumPy function. Implementing these functions directly should improve performance considerably. In addition, the __getitem__ function is slower than the NumPy version; moving that code to the backend would improve array indexing performance.

I've attached an image of the profile visualization of the related part of the code (getting the best translations at the end of decoding).
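
The fallback pattern described above can be illustrated with a toy model that counts copies between device and host. This is only a sketch under stated assumptions: FakeDeviceArray, TransferLog, argmax_fallback and run_steps are hypothetical names invented here, not MXNet classes or functions, and no real GPU work happens. The only detail taken from the thread is that a fallback op calls asnumpy() and then uses a host-side implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TransferLog:
    """Counts simulated copies between device and host memory."""
    device_to_host: int = 0
    host_to_device: int = 0


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array. This is NOT an MXNet class."""

    def __init__(self, values: Sequence[float], log: TransferLog) -> None:
        self._values = list(values)
        self._log = log

    def asnumpy(self) -> List[float]:
        # Models a device-to-host copy; here we only count it.
        self._log.device_to_host += 1
        return list(self._values)

    @classmethod
    def from_host(cls, values: Sequence[float], log: TransferLog) -> "FakeDeviceArray":
        # Models copying a host result back to the device.
        log.host_to_device += 1
        return cls(values, log)

    def argmax(self) -> "FakeDeviceArray":
        # "Native" op: the result is produced without leaving the device.
        best = max(range(len(self._values)), key=self._values.__getitem__)
        return FakeDeviceArray([best], self._log)


def argmax_fallback(array: FakeDeviceArray, log: TransferLog) -> FakeDeviceArray:
    """Fallback op: copy to host, compute there, copy the result back."""
    host_values = array.asnumpy()
    best = max(range(len(host_values)), key=host_values.__getitem__)
    return FakeDeviceArray.from_host([best], log)


def run_steps(num_steps: int, use_fallback: bool) -> TransferLog:
    """Simulate a decoding loop that calls argmax once per step."""
    log = TransferLog()
    scores = FakeDeviceArray([0.1, 0.7, 0.2], log)
    for _ in range(num_steps):
        if use_fallback:
            argmax_fallback(scores, log)
        else:
            scores.argmax()
    return log


print("fallback:", run_steps(100, use_fallback=True))
print("native:  ", run_steps(100, use_fallback=False))
```

In this model the fallback path performs two copies per step while the native path performs none, which is the kind of per-call overhead that a backend implementation would avoid.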

fhieber added Bug needs triage labels Oct 5, 2021

szha added Performance Operator and removed needs triage labels Nov 5, 2021

barry-jin mentioned this issue Nov 5, 2021

[NumPy] Wrap unravel_index backend implementation instead of fallback #20730 (Merged)

fhieber closed this as not planned Dec 27, 2022
