This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Problem of exporting FP16 SyncBN model. #13976

Closed
Fiend1213 opened this issue Jan 23, 2019 · 6 comments · Fixed by #14041

@Fiend1213

Description

This is a problem when exporting fp16 model containing SyncBN.

import mxnet as mx

mx.random.seed(42)

def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
    return mx.nd.moveaxis(data, 2, 0).astype('float32') / 255

train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
val_data = mx.gluon.data.vision.MNIST(train=False).transform_first(data_xform)

batch_size = 2000
train_loader = mx.gluon.data.DataLoader(train_data, shuffle=True, batch_size=batch_size)
val_loader = mx.gluon.data.DataLoader(val_data, shuffle=False, batch_size=batch_size)

net = mx.gluon.nn.HybridSequential()
with net.name_scope():
    net.add(mx.gluon.nn.Conv2D(64, (3, 3)))
    net.add(mx.gluon.contrib.nn.SyncBatchNorm())
    net.add(mx.gluon.nn.Conv2D(64, (3, 3)))
    net.add(mx.gluon.contrib.nn.SyncBatchNorm())
    net.add(mx.gluon.nn.Dense(128))
    net.add(mx.gluon.nn.Dense(10))
    net.hybridize()
    net.cast('float16')
print('finish build the network')

ctx = [mx.gpu(int(id)) for id in [0, 1, 2, 3, 4, 5, 6, 7]]
net.initialize(mx.init.Normal(0.01), ctx=ctx)
print('finish initializing')

trainer = mx.gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.04, 'multi_precision':True},
)

loss_function = mx.gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()

num_epochs = 10
for epoch in range(num_epochs):
    print('start training')
    for inputs, labels in train_loader:
        data = mx.gluon.utils.split_and_load(inputs, ctx_list=ctx, batch_axis=0)
        label = mx.gluon.utils.split_and_load(labels, ctx_list=ctx, batch_axis=0)
        losses = []
        with mx.autograd.record():
            for X, Y in zip(data, label):
                outputs = net(X.astype('float16'))
                loss = loss_function(outputs.astype('float32'), Y)
                losses.append(loss)

        for l in losses:
            l.backward()
        trainer.step(batch_size=inputs.shape[0])

    net.export('mnist_syncbn_fp16.params')

Error Message:

File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/gluon/block.py", line 900, in export
    ndarray.save('%s-%04d.params'%(path, epoch), arg_dict)
  File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/ndarray/utils.py", line 273, in save
    keys))
  File "/home/ubuntu/Workspace/incubator-mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:43:24] include/mxnet/././tensor_blob.h:203: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 2 v.s. given 0
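For anyone decoding the check failure: the integers in `Expected: 2 v.s. given 0` are mshadow/MXNet dtype flags. To the best of my reading of that enum, flag 2 is float16 and flag 0 is float32 — i.e. the exporter expected a float16 array but found a float32 one (the SyncBN auxiliary states). A small sketch of the mapping:

```python
# Mapping of mshadow/MXNet integer dtype flags to dtype names, as I read
# the enum (the same codes MXNet uses for ndarray dtypes).
MSHADOW_TYPE_FLAGS = {
    0: 'float32',
    1: 'float64',
    2: 'float16',
    3: 'uint8',
    4: 'int32',
    5: 'int8',
    6: 'int64',
}

# From the error message: "Expected: 2 v.s. given 0"
expected, given = 2, 0
print('expected dtype:', MSHADOW_TYPE_FLAGS[expected])  # float16
print('given dtype:   ', MSHADOW_TYPE_FLAGS[given])     # float32
```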
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@samskalicky
Contributor

Hi @Fiend1213,

Can you provide the rest of the info from the issue template to help us debug?

Environment info (Required)

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

Package used (Python/R/Scala/Julia):

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
Build config:
(Paste the content of config.mk, or the build command.)

@Fiend1213
Author

Environment info (Required)

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/ubuntu/.local/lib/python3.5/site-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.4.0-1074-aws-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : ip-172-31-28-120
release      : 4.4.0-1074-aws
version      : #84-Ubuntu SMP Thu Dec 6 08:57:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               1200.671
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.16
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5594 sec, LOAD: 0.1404 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0125 sec, LOAD: 0.1469 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.3971 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0024 sec, LOAD: 0.3090 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1254 sec, LOAD: 0.1582 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0090 sec, LOAD: 0.3155 sec.

Build info (Required if built from source)

git clone --recursive https://github.com/apache/incubator-mxnet.git
    cd incubator-mxnet
    echo "USE_OPENCV = 1" >> ./config.mk
    echo "USE_BLAS = openblas" >> ./config.mk
    echo "USE_CUDA = 1" >> ./config.mk
    echo "USE_CUDA_PATH = /usr/local/cuda" >> ./config.mk
    echo "USE_CUDNN = 1" >> ./config.mk
    make -j $(nproc)

@samskalicky
Contributor

samskalicky commented Jan 24, 2019

Thanks @Fiend1213, I tried commenting out this line:

    net.cast('float16')

and then change this line:

outputs = net(X.astype('float16'))

change it to this:

outputs = net(X)

and the script completed. Obviously this means it's not running in float16, but at least it's succeeding. I'm building from source with your build flags now, and will try rerunning and debugging to see what the issue is.

@frankfliu
Contributor

@mxnet-label-bot add [gluon]

@samskalicky
Contributor

samskalicky commented Jan 28, 2019

Hi @Fiend1213

The current implementation of Synchronized Batch Normalization (SyncBN) does not support FP16 training. Since your use case is just inference, and SyncBN behaves exactly the same as regular BN at inference time, you can simply replace SyncBN with regular nn.BatchNorm to resolve your problem.

Please reply if this resolves your issue.
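For intuition on why the swap is safe: at inference time both SyncBN and plain BatchNorm apply the same fixed per-channel affine transform using the stored running statistics; cross-device synchronization only changes how the batch statistics are computed during training. A minimal numpy sketch of the inference-time computation (values and `eps` are illustrative, not from the issue):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: a fixed affine transform.
    SyncBatchNorm computes exactly this at inference; synchronization
    only affected how running_mean/running_var were estimated in training."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

x = np.array([1.0, 2.0, 3.0])
y = batchnorm_inference(x, gamma=1.0, beta=0.0, running_mean=2.0, running_var=1.0)
print(y)  # approximately [-1, 0, 1]
```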
