
Slow CPU inference in Gluon GRU module #13634

Closed
marekjg opened this issue Dec 13, 2018 · 11 comments

@marekjg

marekjg commented Dec 13, 2018

Description

Gluon.GRU is slow on the CPU compared to ndarray.RNN GRU for the same input.

Environment info

Deep Learning AMI 19, Tesla V100

----------Python Info----------
Version      : 3.7.1
Compiler     : GCC 7.3.0
Build        : ('default', 'Oct 23 2018 19:19:42')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/mxnet
Commit Hash   : b45e1273ece8eba1a011107ce12032af58efe661
----------System Info----------
Platform     : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.10
system       : Linux
node         : ip-172-31-44-214
release      : 4.14.77-70.59.amzn1.x86_64
version      : #1 SMP Mon Nov 12 22:02:45 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2701.073
BogoMIPS:              4600.18
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.7860 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0006 sec, LOAD: 0.5938 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.0175 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 1.0119 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0114 sec, LOAD: 0.4352 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0866 sec.

Minimal reproducible example

from time import time

import mxnet as mx
from mxnet import nd
from mxnet import gluon
from mxnet.gluon import rnn

inp_dim = 1024
hid_dim = 1024
n_layers = 1
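# total fused-GRU parameter count: 3 gates x (W_i2h + b_i2h + W_h2h + b_h2h)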
n_parameters = (inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3
n_steps = 100

for ctx in [mx.cpu(), mx.gpu()]:
    gru_params = nd.random.uniform(low=-1, high=1, shape=(n_parameters,), ctx=ctx)
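    # nd.RNN is the fused kernel: one flat vector carries all weights and biases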
    gru_ndarray = lambda x, h_0: nd.RNN(x, gru_params, h_0, num_layers=n_layers,
                                        state_size=hid_dim, mode='gru', state_outputs=True)
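    # Gluon GRU layer; hybridize() caches the computation graph for repeated calls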
    gru_gluon = rnn.GRU(hid_dim, n_layers, input_size=inp_dim)
    gru_gluon.collect_params().initialize(ctx=ctx)
    gru_gluon.hybridize()

    x = nd.random_normal(0, 1, (1, 1, inp_dim), ctx=ctx)
    h_0 = x

    # Warm-up, just in case (triggers deferred initialization and hybridization)
    _, _ = gru_gluon(x, h_0)
    nd.waitall()

    for method, gru in [('ndarray', gru_ndarray), ('gluon', gru_gluon)]:
        h = h_0
        start = time()
        for step in range(n_steps):
            _, h = gru(x, h)
            if method == 'gluon':
                h = h[0]
        nd.waitall()
        dt = time() - start
        print(ctx, method, dt)

Steps to reproduce

Run the above script with Python.

Output

Gluon.GRU is significantly slower than ndarray.RNN
device, method, time (s):
cpu(0) ndarray 0.07194805145263672
cpu(0) gluon 4.735473394393921
gpu(0) ndarray 0.013593673706054688
gpu(0) gluon 0.04437994956970215

@pengzhao-intel
Contributor

@ciyongch could you help take a look at GRU inference?
Was the fused GRU used?

@TaoLv
Member

TaoLv commented Dec 13, 2018

I think Gluon GRU is calling unfused RNN cells, which consist of stacked FullyConnected and Activation operators, while ndarray.RNN is calling a fused implementation. So to me the performance is as expected.
@marekjg Have you compared the results of the two implementations?

@pengzhao-intel
Contributor

As a next step, @marekjg, if you can build with USE_BLAS=mkl, the performance will improve a lot.
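
As a side note, a quick way to check which features a given binary was built with is the runtime feature API (a sketch assuming MXNet 1.5+; the exact feature names may differ across versions):

import mxnet.runtime

# Inspect the compile-time features of the installed MXNet binary.
features = mxnet.runtime.Features()
for name in ['BLAS_MKL', 'BLAS_OPEN', 'MKLDNN', 'CUDA']:
    print(name, features.is_enabled(name))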

@pengzhao-intel
Contributor

@szha is it possible to use the fused RNN in Gluon?

@marekjg
Author

marekjg commented Dec 13, 2018

Thanks for the quick response. @TaoLv yes, the results are the same, but I removed the comparison and the parameter-loading step for the sake of brevity. @pengzhao-intel I installed mxnet-cu92mkl and there was already a performance boost compared to mxnet-cu92, which I had installed by mistake earlier. Not sure if it helps, but I've checked this script on 1.3, 1.4 (when it was @ master) and now on 1.5.
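
For reference, a sketch of the kind of parity check mentioned above, reusing the names from the repro script. The flat parameter layout (flattened i2h and h2h weights first, then the two biases) is an assumption about how the fused kernel packs a single-layer, unidirectional GRU, not something taken from the original comparison:

# Hypothetical parity check: feed the Gluon layer's parameters to nd.RNN.
# Assumed flat layout for a 1-layer, unidirectional GRU:
#   [l0_i2h_weight, l0_h2h_weight, l0_i2h_bias, l0_h2h_bias], weights flattened.
params = gru_gluon.collect_params()

def get(suffix):
    # Look up by suffix; the full key carries the block prefix (e.g. 'gru0_').
    return [v for k, v in params.items() if k.endswith(suffix)][0].data(ctx)

flat = nd.concat(get('l0_i2h_weight').reshape(-1), get('l0_h2h_weight').reshape(-1),
                 get('l0_i2h_bias'), get('l0_h2h_bias'), dim=0)
out_fused, _ = nd.RNN(x, flat, h_0, num_layers=n_layers, state_size=hid_dim,
                      mode='gru', state_outputs=True)
out_gluon, _ = gru_gluon(x, h_0)
print(nd.max(nd.abs(out_fused - out_gluon)))  # ~0 if the layout assumption holds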

@ciyongch
Contributor

@pengzhao-intel @TaoLv @marekjg Current MXNet already supports fused RNN in Gluon: gluon.rnn.GRU will call the fused GRU, while gluon.rnn.GRUCell will call the FullyConnected + activation implementation. I will take a look at this.
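
To make the distinction concrete, a minimal sketch (reusing names from the repro script) that times the unfused gluon.rnn.GRUCell path for comparison:

# Hypothetical micro-benchmark of the unfused path: rnn.GRUCell runs through
# stacked FullyConnected + activation operators, one time step per call.
cell = rnn.GRUCell(hid_dim, input_size=inp_dim)
cell.initialize(ctx=ctx)

h_cell = [h_0.reshape((1, hid_dim))]  # GRUCell expects (batch, features) states
start = time()
for step in range(n_steps):
    _, h_cell = cell(x.reshape((1, inp_dim)), h_cell)
nd.waitall()
print(ctx, 'gluon-cell', time() - start)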

@ciyongch
Contributor

@marekjg please build MXNet from source with the option USE_BLAS=mkl, since the mxnet-mkl package is currently built with USE_BLAS=openblas by default. Please correct me if this behavior has changed @TaoLv.

@TaoLv
Member

TaoLv commented Dec 13, 2018

@ciyongch Thank you for correcting me. Yes, rnn.GRU is also calling the fused RNN implementation and can be hybridized now.

> @marekjg please build MXNet from source with the option USE_BLAS=mkl, since the mxnet-mkl package is currently built with USE_BLAS=openblas by default. Please correct me if this behavior has changed.

Yes, pip packages are built with openblas.

@szha
Member

szha commented Dec 13, 2018

gluon.rnn.GRU supports unrolling samples of different lengths in the same batch, which is not yet supported in the fused kernel interface. cuDNN supports it, so for the GPU implementation we'd need that integration. The CPU version is yet to be implemented.

@vdantu
Contributor

vdantu commented Dec 13, 2018

@mxnet-label-bot add [Gluon, performance, question]

@eric-haibin-lin
Member

CPU kernels were added: #9977
