[Discussion] MXNet 2.0 Roadmap (was: APIs that might be a good idea to break in 2.0) #9686
Do we have sufficient automated testing to catch accidental lapses? If not, can we have a volunteer to work on writing these automated test cases? How do we track this task?
Refactors of the cpp-package and other C++ APIs. I would like that.
@sandeep-krishnamurthy Please tag this: API Change, Call for Contribution, Roadmap.
kvstore should not be a public API.
We should merge the element-wise ops with the broadcast ops and dispatch to the appropriate implementation based only on shape, so that Symbol and NDArray +, -, *, / are consistent. A sketch of the current duplication follows below.
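A minimal sketch of the duplication being described, using the existing NDArray ops (the merged, shape-dispatching op itself does not exist yet):

```python
import mxnet as mx

a = mx.nd.ones((2, 3))
b = mx.nd.ones((1, 3))

# Today there are two op families for the same arithmetic:
out = mx.nd.broadcast_add(a, b)                    # broadcasts, result shape (2, 3)
# mx.nd.elemwise_add(a, b)                         # would fail: shapes must match exactly
out2 = mx.nd.elemwise_add(a, mx.nd.ones((2, 3)))   # only identical shapes allowed

# The proposal: a single add operator that checks the input shapes and picks
# the element-wise kernel when they match and the broadcast kernel otherwise,
# so that '+' behaves identically for Symbol and NDArray.
```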
contrib.ctc_loss should be promoted to a fully supported operator.
fix_gamma=False should be the default for mxnet.symbol.BatchNorm (see the sketch below).
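For context, a short sketch of the current behaviour (fix_gamma defaults to True, which silently holds the scale parameter at 1 during training):

```python
import mxnet as mx

data = mx.sym.Variable('data')

bn_default = mx.sym.BatchNorm(data)                    # fix_gamma=True today: gamma frozen at 1
bn_learned = mx.sym.BatchNorm(data, fix_gamma=False)   # the behaviour proposed as the new default
```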
Gluon RNN layer parameters are currently saved through unfused cells, causing names like "_unfused.0.l_cell.weight". This caused trouble in #11482 when I removed the unfused cells. The workaround is to override the _collect_params_with_prefix function to add the prefix. In 2.0, we should:
#12197: use integer types for indices instead of float.
Taking a brief look at the data iterators, they are split between the mx.io module and the mx.image module, and there does not seem to be any method to the split (correct me if I am wrong). For instance, mx.io.ImageRecordIter and mx.image.ImageIter both load image data yet live in different modules, as in the sketch below. Is there any specific reason for this kind of design? It might be good to take another look and reorganize this, even if it leads to breaking a few APIs. There is similar functionality in the Gluon interface too (which I am not including in this discussion).
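A minimal sketch of the split in question (paths and shapes below are placeholders); both iterators consume RecordIO image files yet live in different modules:

```python
import mxnet as mx

# Defined in mx.io, although it is image-specific:
rec_iter = mx.io.ImageRecordIter(
    path_imgrec='data/train.rec',
    data_shape=(3, 224, 224),
    batch_size=32,
)

# Defined in mx.image, with heavily overlapping functionality:
img_iter = mx.image.ImageIter(
    batch_size=32,
    data_shape=(3, 224, 224),
    path_imgrec='data/train.rec',
)
```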
What is the proposed change here? Is the plan to remove
@anirudhacharya Yes, because the dataset interface has a
I can see your concern, but the iterators included in [...]. The same story applies to all the image transformation functions provided in [...]
@zhreshold But ImageRecordIter and ImageRecordUInt8Iter, which are image specific, are defined under mx.io. With regard to image transforms, I was thinking the symbolic interface should also have something similar to the interface available in the GluonCV transforms (https://gluon-cv.mxnet.io/api/data.transforms.html), which is very intuitive and not cluttered, because we have users who have gone to production using MXNet's symbolic interface. We can discuss this in person; that will be better.
I would like to remove the reliance on the topological ordering of inputs in communication between the frontend and the backend: #15362. The cleanest solution is to change the C API to pass dictionaries instead of lists to the backend (and get dictionaries back), as sketched below.
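A frontend-level sketch of the coupling in question; the dict form of Symbol.bind already matches inputs by name, while the list form only works because it mirrors the symbol's topological input order:

```python
import mxnet as mx

a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
c = a + b

# List form: correctness silently depends on topological input ordering.
ex_list = c.bind(mx.cpu(), args=[mx.nd.ones((2,)), mx.nd.ones((2,))])

# Dict form: inputs are matched by name, independent of any ordering.
ex_dict = c.bind(mx.cpu(), args={'a': mx.nd.ones((2,)), 'b': mx.nd.ones((2,))})

print(ex_list.forward()[0], ex_dict.forward()[0])
```

The proposal is to make the dictionary style the only contract between frontend and backend at the C API level.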
Remove deprecated operators from the code base (or at least hide them from users). Some operators, such as [...]
I would like to drop Amalgamation and instead have a dynamic operator registry. Imagine that you could register your operators in a set of yaml files that do the same job as NNVM_REGISTER. Before build time you could configure which operators to compile and produce a very lean library, similar to what Amalgamation does but cleaner and on a per-operator basis, with a codegen step that parses the operator registry, and that also skips compiling the training code if you just want inference. This would make "MXNet lite builds" possible. Would this be desirable?
I have some suggestions down here:
from mxnet.gluon import estimator
model = estimator.linearregression
model = estimator.logisticregression
model = estimator.ridgeregression
model = estimator.lassoregression
model = estimator.knearestneighbors
model = estimator.kmeansclustering
model = estimator.svm
...etc. These classical ML algorithms work better than DL for some specific tasks, and many users want such ML algorithms with GPU support, so that would be quite awesome (a rough sketch of what such an interface could look like follows below).
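A purely hypothetical sketch of what one such estimator could look like; neither this class nor any mxnet.gluon.estimator classical-ML models exist today, and all names below are illustrative:

```python
import mxnet as mx
from mxnet import autograd, gluon

class LinearRegression:
    """Scikit-learn-style linear regression trained with SGD (illustration only)."""

    def __init__(self, ctx=None, epochs=50, lr=0.1, batch_size=32):
        self.ctx = ctx or mx.cpu()
        self.epochs, self.lr, self.batch_size = epochs, lr, batch_size
        self.net = gluon.nn.Dense(1)

    def fit(self, X, y):
        dataset = gluon.data.ArrayDataset(mx.nd.array(X), mx.nd.array(y))
        loader = gluon.data.DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        self.net.initialize(ctx=self.ctx)
        trainer = gluon.Trainer(self.net.collect_params(), 'sgd',
                                {'learning_rate': self.lr})
        loss_fn = gluon.loss.L2Loss()
        for _ in range(self.epochs):
            for xb, yb in loader:
                xb, yb = xb.as_in_context(self.ctx), yb.as_in_context(self.ctx)
                with autograd.record():
                    loss = loss_fn(self.net(xb), yb)
                loss.backward()
                trainer.step(xb.shape[0])
        return self

    def predict(self, X):
        return self.net(mx.nd.array(X, ctx=self.ctx))

# Hypothetical usage: model = LinearRegression(ctx=mx.gpu()).fit(X_train, y_train)
```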
The reason I am so worried about the website is that it is important: the more we show to the user directly, the better their understanding can be. (For instance, me! When I opened the website for the first time, it was very difficult to find good tutorials and examples; instead I had to rely on GitHub and ask in the forum separately.)
So why are we telling users how to use it if it is so dangerous and not recommended? That's a lot to take in, I know.
I think we should provide a user-friendly, thread-safe inference API for deploying in C++, Java, etc. We can focus on the naive engine for inference, since it is very hard to refactor the threaded engine to be thread-safe. A good and easy-to-use executor should have the following properties:
Now we have [...]
Sounds like refactoring the execution engine with TBB and adding some buffering mechanism?
Agreed, the entire API (at the C API level) should be designed to be entirely thread-safe for all requests, whether it's inference or training. This includes parallel calls from different threads, that is, no locking or sticky threads.
Could we get rid of all the different pre-processor statements in the codebase that evolved due to the different accelerators (USE_CUDA, USE_TVM, USE_MKLDNN, etc.) and fully replace them with the accelerator API from @samskalicky? This would greatly improve maintainability. In terms of operator definitions, we could use ONNX as the standard (or derive from it if it's not sufficient). At the moment, there is a tight coupling between the operator definitions and the accelerator choice.
No, we cannot. They don't serve the same purpose.
I don't believe we should. ONNX is only good for model exchange and not much else. Also, the community has already reached consensus to move towards NumPy, so it's probably not a good idea to get married to ONNX.
@marcoabreu We can definitely remove some of these pre-processor statements with the accelerator API (MKLDNN), but not all of them, as @szha points out. USE_CUDA needs to stay since GPU support is embedded pretty tightly; we might be able to push it out into an accelerator library, but not in time for 2.0. I agree with @szha that ONNX is not the way to do this. We need to keep our operator registration in NNVM for now. What we could separate out are the operator definitions (NNVM registration) from the compute functions (infer shape/type, fcompute, etc.). But again, I think we should take this slowly: enable actual accelerators first, then see if it makes sense for TensorRT/MKLDNN, and then maybe GPU. I would like to see the accelerator API (or a first pass at it) as part of 2.0, though. Is this feasible from your perspective, @szha?
@samskalicky I'm not sure about the accelerator API. It seems that the existing subgraph API, in combination with 1) better third-party operator support and 2) exposing graph partitioning as an API, should be able to serve the same goal as the accelerator API. Those items are useful in other contexts too and deserve more attention, so I'd suggest those as an alternative to a specialized API just for accelerators.
@szha I agree the third-party operator support could be more useful to the broader community, and I have been continually working on it in my spare time. I would be interested in collaborating with others to move this along faster; should we consider that as part of the 1.6 release? But after discussing with @zheng-da today, the subgraph API plus operator support does not serve the same goal as the accelerator API. Some additional external APIs (such as external subgraph properties, support for compiled binaries for subgraphs, or binding accelerators to subgraphs) would be needed to serve the same goal.
I think we're talking about the same thing (i.e. item 2 in my last response).
Maybe I misunderstood item 2; I assumed it meant better APIs for partitioning. Could you clarify what you mean by item 2? Do you mean third-party graph partitioning (similar to third-party operator support)?
(1) #10840: add einsum, since it is useful and could simplify linear algebra ops (it is now supported by TensorFlow and PyTorch); a brief illustration follows below.
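A small illustration of what the requested operator does, written with NumPy since MXNet did not yet ship einsum at the time: a batched matrix multiplication expressed as a single subscript string.

```python
import numpy as np

a = np.random.rand(8, 3, 4)   # batch of 8 matrices, each 3x4
b = np.random.rand(8, 4, 5)   # batch of 8 matrices, each 4x5

c = np.einsum('bij,bjk->bik', a, b)   # one call covers the whole contraction
assert c.shape == (8, 3, 5)
assert np.allclose(c, a @ b)          # matches the batched matmul result
```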
I would like to bring up one issue with the profiler: currently, there is a flag [...]. I have an issue about this: #15658. I think maybe we should remove this flag in 2.0 to avoid confusion?
Julia-related issue
Expect more image operations: adjust_colors (not random), rotate, and more.
Remove the deprecated [...]
We need to fix this issue as well: #16216. It's a breaking change.
Are there any plans to move the training logic (dataset handling, distributed training, etc.) into the core, to avoid having all of that logic in the frontend languages?
Let's start a discussion here about the roadmap towards MXNet 2.0. We are looking for:
If you have any item that you'd like to propose to have in the roadmap, please do:
Given that this would be a major release, we'd have the opportunity to make backward incompatible changes. This would allow us to visit some topics that require large changes such as dropping support for python2, transitioning fully to cmake, making the tensor library numpy-compatible, or even new programming models.
Now that we decided to follow semantic versioning for releases, it would be a good idea to coordinate features and API changes to make the best use of the next major release. Thus, I propose that we use this issue to track the APIs we'd like to change in the next major version.
The candidates I've collected so far:
download (see Exp backoff for downloads, #9671)

Once there are more of such requests, I will try to organize these API-breaking requests better.