This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
We observe a significant reduction in Sockeye inference speed with a recent build of MXNet 2.x (master branch). Compared to 1.x versions of MXNet, GPU translation with MXNet 2.x is ~2x slower.
For MXNet 2.x, we migrated Sockeye to the Gluon 2.0 interface and adopted the new Numpy namespaces. Otherwise, code is equivalent to master with the same level of hybridization (static_alloc=True) in both branches. The pull request/branch can be found here: awslabs/sockeye#953.
The runs below use half-precision and run on a p3.2xlarge. Outputs are equal.
@blchu has been working on this together with @barry-jin. We found significant CPU overhead in 2.x compared to 1.x. One specific op, unravel, runs on the CPU instead of the GPU in 2.x due to the interface change. A fix is in progress. @szha FYI too.
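To make the role of unravel concrete, here is a minimal sketch in plain NumPy (not MXNet). It assumes, for illustration only, a beam-search step in which the best entries of a flattened (hypotheses x vocabulary) score matrix must be mapped back to a hypothesis index and a word index. The helper name `unravel_2d`, the scores, and the beam size are hypothetical. For a 2-D row-major array, the result of `np.unravel_index` can also be obtained with integer division and modulo, which are ordinary element-wise ops.

```python
import numpy as np


def unravel_2d(flat_indices, num_cols):
    """Split flat indices into (row, column) indices of a 2-D row-major array."""
    rows = flat_indices // num_cols  # which hypothesis the entry belongs to
    cols = flat_indices % num_cols   # which vocabulary entry it is
    return rows, cols


# Hypothetical scores: 3 hypotheses, vocabulary of 5 entries, lower is better.
scores = np.array([[0.90, 0.10, 0.40, 0.70, 0.30],
                   [0.20, 0.80, 0.60, 0.50, 0.95],
                   [0.05, 0.15, 0.25, 0.35, 0.45]])
beam_size = 3

# Indices of the beam_size smallest scores in the flattened matrix.
flat_best = np.argsort(scores, axis=None)[:beam_size]

hyp_idx, word_idx = unravel_2d(flat_best, scores.shape[1])

# Reference result from NumPy's own unravel_index for comparison.
ref_hyp, ref_word = np.unravel_index(flat_best, scores.shape)
```

Whether expressing the op this way avoids the CPU fallback in MXNet 2.x is not confirmed in this thread; the sketch only shows what the operation computes.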
I've done some additional profiling and noticed that certain parts of the code are slowed down by functions that currently call asnumpy() and then use the equivalent NumPy array function instead of a native implementation. Implementing these functions directly should improve performance considerably. Also, the __getitem__ function is slower than the NumPy version; moving that code to the backend would improve array indexing performance.
I've attached an image of the profile visualization of the related part of the code (getting the best translations at the end of decoding).
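The thread does not say which profiler produced the visualization. As a rough sketch of how Python-level hot spots like these can be located, the example below uses the standard-library `cProfile` module. The functions `slow_lookup` and `get_best_translations` are hypothetical stand-ins, not Sockeye code. Note that `cProfile` only measures time spent in Python calls; asynchronous GPU work would need a framework-level profiler.

```python
import cProfile
import io
import pstats


def slow_lookup(values, indices):
    # Hypothetical stand-in for a helper doing per-element Python work.
    return [values[i] for i in indices]


def get_best_translations(n):
    # Hypothetical stand-in for the step at the end of decoding.
    values = list(range(n))
    indices = list(range(n - 1, -1, -1))
    return slow_lookup(values, indices)


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report
    of the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(get_best_translations, 1000)
print(report)
```

Functions that dominate the cumulative-time column are the first candidates for a native implementation.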
Description
p3.2xlarge instance
batch size 64
mxnet-cu112 2.0.0b20211001:
mxnet-cu112 1.7:
batch size 1
mxnet-cu112 2.0.0b20211001:
mxnet-cu112 1.7:
g4 instance
To Reproduce
translate.sh with the master branch of Sockeye
translate.sh with the mx2 branch of Sockeye
Steps to reproduce
mv sockeye/sockeye wmt_14_en_de
wmt_14_en_de
bash translate.sh [translate with master branch]
git checkout mx2
pip uninstall mxnet-cu112 ; pip install --pre -f https://dist.mxnet.io/python 'mxnet-cu112'
bash translate.sh [translate with mx2 branch]
What have you tried to solve it?
Environment
conda install -c conda-forge nccl cudnn cudatoolkit==11.2
2.0.0b20211001