This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[RoadMap] Legacy issue resolution before 1.0 release #7319

Closed
piiswrong opened this issue Aug 2, 2017 · 31 comments

@piiswrong
Contributor

piiswrong commented Aug 2, 2017

We are working on multiple new features and refactors (gluon, sparse, engine, etc.) toward a 1.0 release, but there are also some legacy issues that need to be resolved. Here is a list of issues I have noted. Feel free to raise new issues or contribute fixes.

@mli @tqchen @eric-haibin-lin @reminisce @asmushetzel @jermainewang @ptrendx

Basic

  • Sort out int32, int64, index_t, mx_uint, etc.
    Currently TShape uses int64_t, but the front-end and back-end interfaces still use uint32_t for indices.
    This needs cleaning up, and an int64_t interface needs to be exposed through a new set of C API functions. The general policy going forward should be to use int64_t for indices/sizes and int32_t for the number of dimensions. Signed ints should be used for function arguments unless there is a really strong reason to use unsigned.
    Most indexing-related interfaces should support negative indexing (see the sketch after this list).
  • Deprecate mshadow.
    Remove usage of mshadow template expressions and replace them with Kernel::Launch or hand-written CPU/GPU kernels.
  • Verify type and shape support for operators.
    Currently, some operators don't support types other than fp32 for legacy reasons. Proper type support and/or documentation should be added.
    Currently, some operators have limited support for tensor ranks (maximum of 5 dims). The limit needs to be increased or removed if possible.
  • Move legacy operators to the new nnvm registration.
    Conv, FC, BN, etc. should be refactored into stateless operators that use thread_local storage for cuDNN tensor descriptors.
  • Remove MKL experimental and use the new storage type interface.
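
For the negative-indexing item above, a minimal Python sketch of the numpy-style convention being asked for. The slice_axis call with positive indices works today; the negative-index spelling in the comments is the target behavior, not necessarily supported in every version:

```python
import mxnet as mx

x = mx.nd.arange(12).reshape((3, 4))

# Works today: positive begin/end on an explicit axis.
cols = mx.nd.slice_axis(x, axis=1, begin=2, end=4)

# Target convention (numpy-style): the same slice expressed with negative
# indices, e.g. axis=-1, begin=-2, end=None, should be accepted uniformly
# across all indexing-related operators.
print(cols.asnumpy())
```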

Stretch goals

  • Support a pybind11 or Cython interface for a faster Python API.
  • Use a variant type instead of strings for operator argument parsing.
@feiyulv

feiyulv commented Aug 3, 2017

Should check kAddTo support for operators.
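
One way to probe this from Python (a rough sketch, assuming the autograd/attach_grad API of the 0.11-era frontend; grad_req='add' maps to kAddTo in the backend):

```python
import mxnet as mx
from mxnet import autograd

x = mx.nd.array([1.0, 2.0, 3.0])
x.attach_grad(grad_req='add')     # accumulate gradients instead of overwriting

for _ in range(2):
    with autograd.record():
        y = (x * x).sum()
    y.backward()

# If the backward kernel honours kAddTo, the accumulated gradient is 4 * x;
# an operator that silently overwrites would report 2 * x instead.
print(x.grad.asnumpy())
```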

@piiswrong
Contributor Author

I saw a few complaints that MXNet doesn't support IDE code completion due to dynamic operator registration.
One solution is to generate an operator file instead of creating the operators on the fly every time.
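
A rough sketch of what such a generated file could look like, purely illustrative: dump stub functions with docstrings for the operators that are normally injected into mx.nd at import time, so an IDE has something static to index (a real generator would also need proper argument signatures):

```python
import inspect
import mxnet as mx

# Write a static stub module for mx.nd so IDEs can index operator names
# and docstrings that are otherwise registered dynamically at import time.
with open('mxnet_nd_stubs.py', 'w') as f:
    for name in sorted(dir(mx.nd)):
        obj = getattr(mx.nd, name)
        if name.startswith('_') or inspect.isclass(obj) or not callable(obj):
            continue
        doc = (inspect.getdoc(obj) or '').replace('"""', "'''")
        f.write('def {}(*args, **kwargs):\n    """{}"""\n\n'.format(name, doc))
```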

@jmacglashan

jmacglashan commented Aug 4, 2017

Yes, an operator file (or otherwise) to support IDE code completion would be greatly welcomed.

@piiswrong
Contributor Author

Also need to change all the DType(expf()) etc. calls to use math functions with the proper precision.
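
For context, a small illustration of the loss this causes when a float64 tensor is routed through a single-precision exp path (a hedged sketch; the exact digits depend on the build):

```python
import mxnet as mx

x = mx.nd.array([20.0], dtype='float64')

full = mx.nd.exp(x)                                           # computed in double precision
downcast = mx.nd.exp(x.astype('float32')).astype('float64')   # via a float32 path

# The difference is non-zero: an expf-based kernel loses digits on float64 input.
print((full - downcast).asnumpy())
```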

@szha
Member

szha commented Aug 5, 2017

The default epsilon in Symbol/NDArray batch norm is too large (1e-3). Gluon now uses 1e-5, which is more commonly used.
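
Until the defaults are aligned, a sketch of pinning the epsilon explicitly on both frontends (parameter names as I recall them for the Symbol and Gluon APIs; double-check against your version):

```python
import mxnet as mx
from mxnet import gluon

# Symbol API: default eps is 1e-3 here, so pass 1e-5 explicitly.
data = mx.sym.Variable('data')
bn_sym = mx.sym.BatchNorm(data=data, eps=1e-5, name='bn')

# Gluon API: already defaults to 1e-5; shown explicitly for symmetry.
bn_gluon = gluon.nn.BatchNorm(epsilon=1e-5)
```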

@eric-haibin-lin
Member

kvstore has a new str interface, while the updater always uses int as the key, which is not consistent. https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L83
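
For reference, the new string-keyed front-end usage (a minimal sketch; the inconsistency noted above is that an updater registered on the same store still receives integer keys internally):

```python
import mxnet as mx

kv = mx.kv.create('local')

# String keys on the front-end interface...
kv.init('weight', mx.nd.ones((2, 3)))
kv.push('weight', mx.nd.ones((2, 3)))

out = mx.nd.zeros((2, 3))
kv.pull('weight', out=out)
print(out.asnumpy())   # ...while the internal updater callback still keys by int.
```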

@jmacglashan

I think the biggest feature mxnet lacks is higher-order gradients (see #5699). This is probably a fairly substantial feature, but is there any plan for this, or for Hessian-vector products, in 1.0?
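
For illustration, the kind of call this would enable, using autograd.grad with create_graph (this is how later MXNet versions expose it; treat the sketch as aspirational, since most ops' backward passes were not themselves differentiable at the time, which is exactly the gap #5699 tracks):

```python
import mxnet as mx
from mxnet import autograd

x = mx.nd.array([3.0])
x.attach_grad()

with autograd.record():
    y = x * x * x                                                # y = x^3
    dy_dx = autograd.grad(y, [x], create_graph=True)[0]          # 3 x^2, still recorded
    d2y_dx2 = autograd.grad(dy_dx, [x], create_graph=False)[0]   # 6 x

print(d2y_dx2.asnumpy())   # expect [18.] once second-order support exists for these ops
```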

@ptrendx
Member

ptrendx commented Aug 8, 2017

For me the biggest thing MXNet lacks is consistent and complete documentation and tutorials. The Gluon tutorial seems pretty awesome (although still incomplete), but the rest of the API does not get such good treatment. It got even worse once you removed most examples from the website (even though I agree that they were not well explained).
From the technical and performance point of view MXNet is great (and probably the best, actually), but it's hard for it to take off when others have a lower barrier to entry and spend a lot on PR.

@ZihengJiang
Contributor

Should enable requesting resources multiple times.

@piiswrong
Contributor Author

piiswrong commented Aug 14, 2017

@ptrendx @madjam @bhavinthaker The removed tutorials need to be brought back ASAP!

@asmushetzel
Contributor

Should we also work on error handling? Basically getting more useful and more consistent messages when a model is not built correctly by the user (shape inference fails, etc.).

@szha
Member

szha commented Aug 23, 2017

Some ops that are differentiable are missing gradients (e.g. 'norm').

@jaanli

jaanli commented Sep 7, 2017

+1 on higher-order gradients #5699

@madjam
Contributor

madjam commented Sep 7, 2017

Create appropriate namespaces so that APIs are grouped logically and do not end up with prefix qualifiers such as linalg_, random_, etc.
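
For concreteness, what the grouped namespaces look like next to the flat, prefix-qualified spellings (assuming a build that includes the namespace refactor discussed in the replies below):

```python
import mxnet as mx

grouped = mx.nd.random.uniform(shape=(2, 3))   # grouped namespace: mx.nd.random.*
flat = mx.nd.random_uniform(shape=(2, 3))      # legacy flat prefix: mx.nd.random_*

print(grouped.shape, flat.shape)
```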

@szha
Member

szha commented Sep 7, 2017

@madjam this is already being worked on by @reminisce and @eric-haibin-lin

@madjam
Contributor

madjam commented Sep 7, 2017

@szha thanks. Is it being tracked in a separate issue?

@szha
Member

szha commented Sep 7, 2017

@madjam I think it's already merged.

@reminisce
Contributor

@madjam Namespace refactoring is covered in this PR. #7604
@eric-haibin-lin may have more coverage for documentation.

@eric-haibin-lin
Member

@madjam the docs for the separate namespaces were merged in #7712
@piiswrong could you update the task status so that people are aware which ones have been assigned / done?

@formath
Contributor

formath commented Sep 13, 2017

The Embedding op should be optimized for large sparse ids. Currently, the embedding layer uses the input id as the raw index into the embedding matrix. In some circumstances ids are generated with a uint64 hash, so raw indexing is not suitable. This feature is much needed in industrial click-through-rate prediction, recommendation systems, and other uses.
Maybe the embedding matrix should look like the following and be partitioned across the server nodes of the parameter server by sparse_id, as in TensorFlow.

sparse_id1 vector
sparse_id2 vector
...
...

@eric-haibin-lin
Member

@formath you bring up a good point. Large indices are definitely a feature we want to support in the long term. We might want to open a separate issue and discuss this.

First of all, we do plan to add sparse support for the Embedding op, where the weight can be in row_sparse format, and the gradient for the weight should be generated in row_sparse format, too. I am currently working on code refactoring and documentation, so this sparse operator is not implemented yet.

Regarding large indices up to 64 bits, this requires the first task @piiswrong brought up regarding int types in the C API. Also, the Kernel::Launch API in the backend uses 32-bit ints instead of 64-bit, which is problematic for many operators that work on ndarrays with large shapes. So the scope is bigger than just the embedding op, and it will definitely take more time to resolve.

Are you working on any industrial-scale dataset? Two ways to circumvent the 64-bit hashed-index problem come to mind (option 1 is sketched below):

  1. Rehash the indices into around 23 or 24 bits to reduce the dimensionality, which doesn't hurt much as claimed by some papers, and doesn't cause the operator to break in MXNet.
  2. Preprocess the dataset to find the number of unique features and map them to contiguous indices instead.

@formath what are your thoughts on this?
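
A minimal sketch of option 1 above, with a hypothetical bucketize helper: fold the 64-bit hashed ids into a fixed-size bucket space that the existing Embedding op can index (23-24 bits is what the comment suggests; 20 is used here just to keep the example small):

```python
import mxnet as mx

HASH_BITS = 20                    # the comment above suggests ~23-24 bits in practice
VOCAB = 1 << HASH_BITS

def bucketize(raw_ids):
    """Fold 64-bit hashed feature ids into [0, VOCAB) so Embedding can index them."""
    return mx.nd.array([i % VOCAB for i in raw_ids])

raw_ids = [14695981039346656037, 1099511628211, 40503]   # example uint64-style hashes
weight = mx.nd.random.uniform(shape=(VOCAB, 16))

vectors = mx.nd.Embedding(data=bucketize(raw_ids), weight=weight,
                          input_dim=VOCAB, output_dim=16)
print(vectors.shape)              # (3, 16)
```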

@formath
Contributor

formath commented Sep 14, 2017

@eric-haibin-lin Both are OK. But that does not solve the efficiency problem when the embedding matrix has several million or even billions of rows, because of the lack of sparse updates. Those problems are the primary limits on using MXNet in industry. The sparse tensor support developed recently is big progress. I think it, and the features that build on it, should be assigned a higher priority.

@eric-haibin-lin
Member

@formath Yes, I'll work on the sparse embedding operator to support at least millions of features after I am done with the basic documentation for sparse. We do have a few sparse optimizers like SGD and Adam. Ftrl and Adagrad are coming in #7720 and #7903.

@szha
Member

szha commented Sep 21, 2017

It would be easier if this issue were converted to a GitHub project so that progress on each item can be tracked.

@szha
Member

szha commented Sep 23, 2017

I have the impression that many ops don't respect grad_req.

@szha
Member

szha commented Sep 27, 2017

Many examples are outdated or don't follow the style standard. Duplicates of the same or similar examples (the most popular being the MNIST dataset) are everywhere.

@szha
Member

szha commented Sep 29, 2017

Certain convolution layouts are not supported on CPU even though the API claims they are (e.g. NWC, NHWC, NDHWC).
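
A quick probe of the mismatch (a sketch; the exact error type depends on the version, hence the broad except):

```python
import mxnet as mx
from mxnet import gluon

x = mx.nd.random.uniform(shape=(1, 8, 3))           # NWC: (batch, width, channel)

try:
    conv = gluon.nn.Conv1D(channels=4, kernel_size=3, layout='NWC')
    conv.initialize()
    print(conv(x).shape)
except Exception as err:                             # MXNetError or an assertion
    # The docs list NWC, but the CPU Convolution kernel rejects it.
    print('NWC rejected on CPU:', err)
```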

@eric-haibin-lin
Member

All examples should be runnable. We should have a checklist for these.

@szha
Member

szha commented Oct 2, 2017

#2944 may have other open issues.

@taliesinb
Contributor

@szha I'm wondering the same thing: the Convolution op explicitly does not support "NWC", for example, but gluon mentions "NWC" in the docs. Searching the codebase shows that string only occurs in the high-level docs, so are the gluon docs simply wrong here?

@Godricly
Contributor

Godricly commented Feb 5, 2018

@szha I met the same issue as @taliesinb did using conv1d in mxnet (mxnet-cu80 1.0.0.post2). The documentation does not match the conv1d behavior.
