This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Problem of exporting FP16 SyncBN model. #13976

Closed
Fiend1213 opened this issue Jan 23, 2019 · 6 comments · Fixed by #14041

@Fiend1213

Description

This is a problem when exporting fp16 model containing SyncBN.

import mxnet as mx

mx.random.seed(42)

def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
    return mx.nd.moveaxis(data, 2, 0).astype('float32') / 255

train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
val_data = mx.gluon.data.vision.MNIST(train=False).transform_first(data_xform)

batch_size = 2000
train_loader = mx.gluon.data.DataLoader(train_data, shuffle=True, batch_size=batch_size)
val_loader = mx.gluon.data.DataLoader(val_data, shuffle=False, batch_size=batch_size)

net = mx.gluon.nn.HybridSequential()
with net.name_scope():
    net.add(mx.gluon.nn.Conv2D(64, (3, 3)))
    net.add(mx.gluon.contrib.nn.SyncBatchNorm())
    net.add(mx.gluon.nn.Conv2D(64, (3, 3)))
    net.add(mx.gluon.contrib.nn.SyncBatchNorm())
    net.add(mx.gluon.nn.Dense(128))
    net.add(mx.gluon.nn.Dense(10))
    net.hybridize()
    net.cast('float16')
print('finish build the network')

ctx = [mx.gpu(int(id)) for id in [0, 1, 2, 3, 4, 5, 6, 7]]
net.initialize(mx.init.Normal(0.01), ctx=ctx)
print('finish initializing')

trainer = mx.gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.04, 'multi_precision':True},
)

loss_function = mx.gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()

num_epochs = 10
for epoch in range(num_epochs):
    print('start training')
    for inputs, labels in train_loader:
        data = mx.gluon.utils.split_and_load(inputs, ctx_list=ctx, batch_axis=0)
        label = mx.gluon.utils.split_and_load(labels, ctx_list=ctx, batch_axis=0)
        losses = []
        with mx.autograd.record():
            for X, Y in zip(data, label):
                outputs = net(X.astype('float16'))
                loss = loss_function(outputs.astype('float32'), Y)
                losses.append(loss)

        for l in losses:
            l.backward()
        trainer.step(batch_size=inputs.shape[0])

    net.export('mnist_syncbn_fp16.params')

Error Message:

File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/gluon/block.py", line 900, in export
    ndarray.save('%s-%04d.params'%(path, epoch), arg_dict)
  File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/ndarray/utils.py", line 273, in save
    keys))
  File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:43:24] include/mxnet/././tensor_blob.h:203: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 2 v.s. given 0
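For anyone decoding the check failure: the integers in `Expected: 2 v.s. given 0` are mshadow/MXNet dtype flags. To the best of my reading of that enum, flag 2 is float16 and flag 0 is float32 — i.e. the exporter expected a float16 array but found a float32 one (the SyncBN auxiliary states). A small sketch of the mapping:

```python
# Mapping of mshadow/MXNet integer dtype flags to dtype names, as I read
# the enum (the same codes MXNet uses for ndarray dtypes).
MSHADOW_TYPE_FLAGS = {
    0: 'float32',
    1: 'float64',
    2: 'float16',
    3: 'uint8',
    4: 'int32',
    5: 'int8',
    6: 'int64',
}

# From the error message: "Expected: 2 v.s. given 0"
expected, given = 2, 0
print('expected dtype:', MSHADOW_TYPE_FLAGS[expected])  # float16
print('given dtype:   ', MSHADOW_TYPE_FLAGS[given])     # float32
```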
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@samskalicky
Contributor

Hi @Fiend1213,

Can you provide the rest of the info from the issue template to help us debug?

Environment info (Required)

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

Package used (Python/R/Scala/Julia):

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
Build config:
(Paste the content of config.mk, or the build command.)

@Fiend1213
Author

Environment info (Required)

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/ubuntu/.local/lib/python3.5/site-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.4.0-1074-aws-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : ip-172-31-28-120
release      : 4.4.0-1074-aws
version      : #84-Ubuntu SMP Thu Dec 6 08:57:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               1200.671
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.16
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5594 sec, LOAD: 0.1404 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0125 sec, LOAD: 0.1469 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.3971 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0024 sec, LOAD: 0.3090 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1254 sec, LOAD: 0.1582 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0090 sec, LOAD: 0.3155 sec.

Build info (Required if built from source)

git clone --recursive https://github.com/apache/incubator-mxnet.git
    cd incubator-mxnet
    echo "USE_OPENCV = 1" >> ./config.mk
    echo "USE_BLAS = openblas" >> ./config.mk
    echo "USE_CUDA = 1" >> ./config.mk
    echo "USE_CUDA_PATH = /usr/local/cuda" >> ./config.mk
    echo "USE_CUDNN = 1" >> ./config.mk
    make -j $(nproc)

@samskalicky
Contributor

samskalicky commented Jan 24, 2019

Thanks @Fiend1213, I tried commenting out this line:

    net.cast('float16')

and then change this line:

outputs = net(X.astype('float16'))

change it to this:

outputs = net(X)

and the script completed. Obviously this means it's not running in float16, but at least it's succeeding. I'm building from source with your build flags now, and will try rerunning and debugging to see what the issue is.

@frankfliu
Contributor

@mxnet-label-bot add [gluon]

@samskalicky
Contributor

samskalicky commented Jan 28, 2019

Hi @Fiend1213

The current implementation of Synchronized Batch Normalization (SyncBN) does not support FP16 training. Since your use case is just inference, and SyncBN behaves exactly the same as regular BN at inference time, you can simply replace SyncBN with regular nn.BatchNorm to resolve your problem.

Please reply if this resolves your issue.
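For intuition on why the swap is safe: at inference time both SyncBN and plain BatchNorm apply the same fixed per-channel affine transform using the stored running statistics; cross-device synchronization only changes how the batch statistics are computed during training. A minimal numpy sketch of the inference-time computation (values and `eps` are illustrative, not from the issue):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: a fixed affine transform.
    SyncBatchNorm computes exactly this at inference; synchronization
    only affected how running_mean/running_var were estimated in training."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

x = np.array([1.0, 2.0, 3.0])
y = batchnorm_inference(x, gamma=1.0, beta=0.0, running_mean=2.0, running_var=1.0)
print(y)  # approximately [-1, 0, 1]
```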
