This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

v1.0 Stable Release TODO List #2944

Closed
7 of 30 tasks
piiswrong opened this issue Aug 5, 2016 · 32 comments

@piiswrong (Contributor) commented Aug 5, 2016

It's about time for a feature-complete stable release.

We are in the process of a major refactor. While most changes are on the backend side and therefore should not significantly affect users, we do expect to break a few small things, and possibly compatibility with other language bindings.
Authors of the Julia, R, Scala, etc. packages: please stay tuned and adopt the new API. It should be a quick fix, and we will have a guide for the transition.
@thirdwing @pluskid @vchuravy @Ldpe2G

Transition Guide/List of Breaking Changes:

Developer

  1. TBlob and TShape have moved from the mshadow namespace to the mxnet namespace. Fix: change mshadow::TBlob and mshadow::TShape to TBlob and TShape in your code.
  2. Please do not use cudaMalloc and cudaFree directly anywhere in MXNet. Use Storage::Get()->Alloc(size, Context::GPU()) to allocate memory on the current GPU instead.

User

If you trained networks containing a BatchNorm layer on CPU, or on GPU with cuDNN v4 or below, before Jul 5th, you may find your model producing totally wrong results after loading it back for testing. The simplest fix is to load your .param files with ndarray.load, set all arrays whose keys end with '_gamma' to 1.0, and save them back.
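A minimal sketch of the gamma fix described above. It assumes the .param file loads into a dict keyed like 'arg:bn1_gamma'; plain numpy arrays stand in for NDArrays here, and the file names in the comments are hypothetical — in practice you would use mx.nd.load and mx.nd.save around this logic.

```python
import numpy as np

def fix_gamma_params(params):
    """Set every array whose key ends in '_gamma' to all ones, in place."""
    for key, arr in params.items():
        if key.endswith('_gamma'):
            arr[:] = 1.0  # reset gamma to 1.0, as the workaround suggests
    return params

# In practice: params = mx.nd.load('model-0010.params')
params = {
    'arg:bn1_gamma': np.array([0.5, 2.0], dtype=np.float32),
    'arg:fc1_weight': np.array([3.0], dtype=np.float32),
}
fixed = fix_gamma_params(params)
# In practice: mx.nd.save('model-0010-fixed.params', fixed)
```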

  1. If you load a model trained before Dec 2015 for prediction and the model uses BatchNorm, it may output totally wrong results. This can be fixed by adding fix_gamma=True to all BatchNorm layers in your symbol construction script, or adding 'fix_gamma': 'True' to all BatchNorm layers in your .json model file.
  2. sum_axis, max_axis, and min_axis are removed. Please use mx.nd.sum(src, axis=n), mx.nd.max(src, axis=n), and mx.nd.min(src, axis=n) to do the same thing.
  3. element_mask is removed. Please use src * mask.reshape((mask.size, 1, 1, ..., 1)) directly, as binary ops now support broadcasting.
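The element_mask replacement in item 3 can be illustrated with plain numpy broadcasting (a sketch with made-up shapes; the same expression applies to mx.nd arrays now that binary ops broadcast):

```python
import numpy as np

# src: a batch of 2 feature maps, shape (batch, channel, height, width)
src = np.arange(24, dtype=np.float32).reshape((2, 3, 2, 2))
# mask: one 0/1 flag per batch element
mask = np.array([1.0, 0.0], dtype=np.float32)

# Reshape the mask to (batch, 1, 1, 1) so it broadcasts over the
# remaining axes, zeroing out the masked-off batch elements.
masked = src * mask.reshape((mask.size, 1, 1, 1))
```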

TODOs

  1. Refactor Symbolic graph to use NNVM. @tqchen
    1. Finish NNVM and Passes.
    2. Refactor NDArray interface to use nnvm::op and add a Cython version. @piiswrong
    3. Set ndarray function naming convention straight.
    4. Refactor Executor to use NNVM
  2. Bring in NCCL @mli
    1. Use NCCL reduce and broadcast and fix deadlock bug. NCCL is problematic with our engine
    2. Or, write our own ring based P2P reduce
  3. Better Tests @mli @piiswrong
    1. Setup EC2 test server.
    2. Setup GPU+CPU consistency and gradient check.
    3. Run performance regression test
    4. Test & debug c++ prediction interface
  4. Sparse @mli @antinucleon
  5. Better Doc @leopd
    1. Improve doc formatting and readability
    2. Fix confusing language and description.
    3. More tutorials.
    4. Reorganize docs. Put pages where they belong
    5. Improve installation guide. Consider adding a script similar to torch installation script
  6. Misc
    1. Refactor ccoptimizer interface to make writing new ccoptimizers easier. Add ccadam.
    2. Fix memory allocation policy. Explain in the docs that you shouldn't use cudaMalloc in operators; use the temp space request for temporary memory or a pooled allocator for holding states.
    3. ...
  7. Fix known bugs
    1. Fix the CustomOp bug that causes a cyclic dependency: doing multiple "batch_dot" in a loop can leave the layer unable to be linked with a Convolution layer #2945
  8. IO doc and refactor
    1. Move opencv plugin into main repo and use new ndarray interface.
    2. Update IIterator to support multiple data/labels.
    3. Front end based IO with more features like indexing and shuffling
@piiswrong piiswrong added this to the v1.0 milestone Aug 5, 2016
@vchuravy (Contributor) commented Aug 5, 2016

I would propose Float16 support as an additional target.

@antinucleon (Contributor) commented Aug 5, 2016

  1. High-level flexible RNN interface
    1. one2one, one2many, seq2seq
    2. speech example
    3. LM example
    4. distributed data/model-parallel benchmark
    5. attention
    6. memory/NTM
    7. better CTC support

@antinucleon (Contributor)

For the optimizer part, @tqchen and I are thinking about supporting throwing the optimizer into the computation graph, so less C++ code will be needed.

@piiswrong (Contributor, Author)

Until we have RTC, that doesn't help much. You still need at least a 2x buffer.

@antinucleon (Contributor)

We may consider building the docs on EC2, then syncing back to Read the Docs, because the doc build keeps failing with a timeout during compilation.

@piiswrong (Contributor, Author)

Yes, or maybe just host them from EC2.

@tornadomeet (Contributor)

Great!
@piiswrong, what does NNVM mean?

@antinucleon (Contributor)

@vchuravy, we may need to put more effort into int8 rather than fp16. From current info, int8 will be mainstream in the future.

@vchuravy (Contributor) commented Aug 6, 2016

@antinucleon Great to hear. The work @Godricly and I have been doing focuses purely on making our operators support arbitrary DTypes. That should help the Int8 work as well?

(This is off topic, but I would expect fixed-point with Int8 rather than truly Int8?)

@antinucleon (Contributor)

@vchuravy It is still being investigated by @winstywang. If you use int8 directly, there is no performance gain. But the official documentation mentions that for the new Titan X, int8 performance is 44T, almost 4 times that of fp32.

@winstywang (Contributor)

@vchuravy NV should have specific instructions for int8; currently, using int8 directly only brings a 25% performance gain according to our tests.

@Godricly (Contributor) commented Aug 7, 2016

My suggestions are as follows:

  • Documentation (most important)

  • Some kind of graph-creation debugging tool

    It would be nice if we could have a GUI for this; it's painful to debug the graph.

  • Dynamic execution capability for Operators (for example, stochastic depth and fractal networks)

  • CustomOp is not DType compatible yet

  • A simple debugging Operator (just printing output and gradient, so you can insert it anywhere; a switch could decide what to print)

  • Check whether ps-lite is compatible with DType

@piiswrong (Contributor, Author)

Stochastic depth can be done with bucketing.
We have Monitor for debugging.

@antinucleon (Contributor)

With NNVM we may enable fully dynamic execution.

@antinucleon (Contributor)

@piiswrong @leopd We need to move the doc building system to EC2. The Read the Docs build keeps failing because it runs out of build time.

@Godricly (Contributor) commented Aug 8, 2016

@antinucleon Is there any paper available right now on uint8 NNs? And what does NNVM stand for? I'm having a hard time searching for it.

@winstywang (Contributor) commented Aug 8, 2016

Here are some thoughts about the docs:

  • A summary page of all the examples.
  • A summary page of recently added features. Each time a new feature is added, a simple explanation and sample code must be provided.
  • WE CANNOT SAY "YOU CAN JUST USE XXX" to users when there is no doc or simple example for XXX. Each time we mention something, a doc or example must be provided.
  • A step-by-step tutorial teaching beginners how to implement some basic NN operations, such as fine-tuning and feature extraction. These could cover more than 80% of usage.
  • Finish the CS231n homework and projects with Minpy and MXNet.

@piiswrong @antinucleon

@Godricly (Contributor) commented Aug 8, 2016

Another thing I'd like to ask for is a refactor of LSTM, if possible.
Can we hide provide_data and provide_label in an elegant way? I understand that the current approach works pretty well, but exposing the internal stuff may bring some trouble (like the extra provided_data_type for me in fp16 lstm #2564).

@winstywang (Contributor)

I would vote for another issue which is very important for users:

  • Make sure the speed and accuracy in all test cases is the same as or better than Caffe's.
  • Currently we have all kinds of performance issues: CPU slower than Caffe, small batches slower than Caffe, and ResNet on ImageNet worse than Caffe.

@antinucleon (Contributor)

The ResNet issue is caused by IO. Min has reproduced the exact result by using the Torch IO.
The problem is who will do that work.

@tqchen (Member) commented Aug 8, 2016

I hope that for each of the issues raised, people can show up and assign, or self-assign, each issue, so we are moving forward effectively.

@mli (Contributor) commented Aug 8, 2016

It's good to have a single page containing everything, but I totally agree that we can open an issue for each point and cite the links here.

@piiswrong (Contributor, Author)

@mli Yes. If someone wants to talk more about, or start working on, a task, feel free to open a new issue and link it here. Also assign it to the v1.0 milestone.

@antinucleon (Contributor)

Also, we may consider treating warnings as errors in the future.

@yzhliu (Member) commented Aug 18, 2016

I'll list a roadmap for the Scala package this weekend.

@taoari (Contributor) commented Aug 19, 2016

@antinucleon Can I ask what's wrong with IO that causes the performance drop?

@pluskid (Contributor) commented Aug 19, 2016

For docs, I think querying our GitHub issues for the keyword "how to" is a good source for a list of topics to potentially cover.

@windywinter (Contributor)

@piiswrong What does NNVM stand for?

@tornadomeet (Contributor)

@windywinter about NNVM: dmlc/MXNet.jl#115

@sxjscience (Member)

@antinucleon, @jennyzhang0215 and I have implemented MemN2N and NTM and replicated the results in the papers; we may release the code after AAAI or WWW. I can send you the code now if you need it.

@dianyancao

Is it OK to do some code optimization in NNVM? #3105

@RogerBorras

Thanks to all of DMLC for this great effort.
