[Discussion] MXNet 2.0 Roadmap (was: APIs that might be a good idea to break in 2.0) #9686
Do we have sufficient automated testing to catch accidental lapses? If not, can we have a volunteer to work on writing these automated test cases? How do we track this task?
Refactors of the cpp-package and other C++ APIs. I would like that.
@sandeep-krishnamurthy Please tag this: API Change, Call for Contribution, Roadmap.
kvstore should not be a public API.
We should merge the element-wise ops with the broadcast ops and dispatch to the appropriate implementation based only on shape, so that Symbol and NDArray +, -, *, / are consistent. A sketch of the current duplication follows below.
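A minimal sketch of the duplication being described, using the existing NDArray ops (the merged, shape-dispatching op itself does not exist yet):

```python
import mxnet as mx

a = mx.nd.ones((2, 3))
b = mx.nd.ones((1, 3))

# Today there are two op families for the same arithmetic:
out = mx.nd.broadcast_add(a, b)                    # broadcasts, result shape (2, 3)
# mx.nd.elemwise_add(a, b)                         # would fail: shapes must match exactly
out2 = mx.nd.elemwise_add(a, mx.nd.ones((2, 3)))   # only identical shapes allowed

# The proposal: a single add operator that checks the input shapes and picks
# the element-wise kernel when they match and the broadcast kernel otherwise,
# so that '+' behaves identically for Symbol and NDArray.
```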
contrib.ctc_loss should be promoted to a fully supported operator.
fix_gamma=False should be the default for mxnet.symbol.BatchNorm (see the sketch below).
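For context, a short sketch of the current behaviour (fix_gamma defaults to True, which silently holds the scale parameter at 1 during training):

```python
import mxnet as mx

data = mx.sym.Variable('data')

bn_default = mx.sym.BatchNorm(data)                    # fix_gamma=True today: gamma frozen at 1
bn_learned = mx.sym.BatchNorm(data, fix_gamma=False)   # the behaviour proposed as the new default
```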
Gluon RNN layer parameters are currently saved through unfused cells, causing names like "_unfused.0.l_cell.weight". This caused trouble in #11482 when I removed the unfused cells. The workaround is to override the _collect_params_with_prefix function to add the prefix. In 2.0, we should:
#12197: use integer types for indices instead of float.
Taking a brief look at the data iterators, they are split between the mx.io module and the mx.image module, and there does not seem to be any method to the split (correct me if I am wrong). For instance, mx.io.ImageRecordIter and mx.image.ImageIter both load image data yet live in different modules, as in the sketch below. Is there any specific reason for this kind of design? It might be good to take another look and reorganize this, even if it leads to breaking a few APIs. There is similar functionality in the Gluon interface too (which I am not including in this discussion).
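A minimal sketch of the split in question (paths and shapes below are placeholders); both iterators consume RecordIO image files yet live in different modules:

```python
import mxnet as mx

# Defined in mx.io, although it is image-specific:
rec_iter = mx.io.ImageRecordIter(
    path_imgrec='data/train.rec',
    data_shape=(3, 224, 224),
    batch_size=32,
)

# Defined in mx.image, with heavily overlapping functionality:
img_iter = mx.image.ImageIter(
    batch_size=32,
    data_shape=(3, 224, 224),
    path_imgrec='data/train.rec',
)
```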
What is the proposed change here? Is the plan to remove
@anirudhacharya Yes, because the dataset interface has a
I can see your concern, but the iterators included in [...]. The same story applies to all the image transformation functions provided in [...]
@zhreshold But ImageRecordIter and ImageRecordUInt8Iter, which are image specific, are defined under mx.io. With regard to image transforms, I was thinking the symbolic interface should also have something similar to the interface available in the GluonCV transforms (https://gluon-cv.mxnet.io/api/data.transforms.html), which is very intuitive and not cluttered, because we have users who have gone to production using MXNet's symbolic interface. We can discuss this in person; that will be better.
I would like to remove the reliance on the topological ordering of inputs in communication between the frontend and the backend: #15362. The cleanest solution is to change the C API to pass dictionaries instead of lists to the backend (and get dictionaries back), as sketched below.
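A frontend-level sketch of the coupling in question; the dict form of Symbol.bind already matches inputs by name, while the list form only works because it mirrors the symbol's topological input order:

```python
import mxnet as mx

a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
c = a + b

# List form: correctness silently depends on topological input ordering.
ex_list = c.bind(mx.cpu(), args=[mx.nd.ones((2,)), mx.nd.ones((2,))])

# Dict form: inputs are matched by name, independent of any ordering.
ex_dict = c.bind(mx.cpu(), args={'a': mx.nd.ones((2,)), 'b': mx.nd.ones((2,))})

print(ex_list.forward()[0], ex_dict.forward()[0])
```

The proposal is to make the dictionary style the only contract between frontend and backend at the C API level.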
Remove deprecated operators from the code base (or at least hide them from users). Some operators, such as [...]
I would like to drop Amalgamation and instead have a dynamic operator registry. Imagine that you could register your operators in a set of yaml files that do the same job as NNVM_REGISTER. Before build time you could configure which operators to compile and produce a very lean library, similar to what Amalgamation does but cleaner and on a per-operator basis, with a codegen step that parses the operator registry, and that also skips compiling the training code if you just want inference. This would make "MXNet lite builds" possible. Would this be desirable?
I have some suggestions down here:
from mxnet.gluon import estimator
model = estimator.linearregression
model = estimator.logisticregression
model = estimator.ridgeregression
model = estimator.lassoregression
model = estimator.knearestneighbors
model = estimator.kmeansclustering
model = estimator.svm
...etc. These classical ML algorithms work better than DL for some specific tasks, and many users want such ML algorithms with GPU support, so that would be quite awesome (a rough sketch of what such an interface could look like follows below).
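A purely hypothetical sketch of what one such estimator could look like; neither this class nor any mxnet.gluon.estimator classical-ML models exist today, and all names below are illustrative:

```python
import mxnet as mx
from mxnet import autograd, gluon

class LinearRegression:
    """Scikit-learn-style linear regression trained with SGD (illustration only)."""

    def __init__(self, ctx=None, epochs=50, lr=0.1, batch_size=32):
        self.ctx = ctx or mx.cpu()
        self.epochs, self.lr, self.batch_size = epochs, lr, batch_size
        self.net = gluon.nn.Dense(1)

    def fit(self, X, y):
        dataset = gluon.data.ArrayDataset(mx.nd.array(X), mx.nd.array(y))
        loader = gluon.data.DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        self.net.initialize(ctx=self.ctx)
        trainer = gluon.Trainer(self.net.collect_params(), 'sgd',
                                {'learning_rate': self.lr})
        loss_fn = gluon.loss.L2Loss()
        for _ in range(self.epochs):
            for xb, yb in loader:
                xb, yb = xb.as_in_context(self.ctx), yb.as_in_context(self.ctx)
                with autograd.record():
                    loss = loss_fn(self.net(xb), yb)
                loss.backward()
                trainer.step(xb.shape[0])
        return self

    def predict(self, X):
        return self.net(mx.nd.array(X, ctx=self.ctx))

# Hypothetical usage: model = LinearRegression(ctx=mx.gpu()).fit(X_train, y_train)
```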
The reason I am so worried about the website is that it is important: the more we show to the user directly, the better their understanding can be. (For instance, me! When I opened the website for the first time, it was very difficult to find good tutorials and examples; instead I had to rely on GitHub and ask in the forum separately.)
So why are we telling users how to use it if it is so dangerous and not recommended? That's a lot to take in, I know.
I think we should provide a user-friendly, thread-safe inference API for deploying in C++, Java, etc. We can focus on the naive engine for inference, since it is very hard to refactor the threaded engine to be thread-safe. A good and easy-to-use executor should have the following properties:
Now we have [...]
Sounds like refactoring the execution engine with TBB and adding some buffering mechanism?
Agreed, the entire API (at the C API level) should be designed to be entirely thread-safe for all requests, whether it's inference or training. This includes parallel calls from different threads, that is, no locking or sticky threads.
Could we get rid of all the different pre-processor statements in the codebase that evolved due to the different accelerators (USE_CUDA, USE_TVM, USE_MKLDNN, etc.) and fully replace them with the accelerator API from @samskalicky? This would greatly improve maintainability. In terms of operator definitions, we could use ONNX as the standard (or derive from it if it's not sufficient). At the moment, there is a tight coupling between the operator definitions and the accelerator choice.
No, we cannot. They don't serve the same purpose.
I don't believe we should. ONNX is only good for model exchange and not much else. Also, the community has already reached consensus to move towards NumPy, so it's probably not a good idea to get married to ONNX.
@marcoabreu We can definitely remove some of these pre-processor statements with the accelerator API (MKLDNN), but not all of them, as @szha points out. USE_CUDA needs to stay since GPU support is embedded pretty tightly; we might be able to push it out into an accelerator library, but not in time for 2.0. I agree with @szha that ONNX is not the way to do this. We need to keep our operator registration in NNVM for now. What we could separate out are the operator definitions (NNVM registration) from the compute functions (infer shape/type, fcompute, etc.). But again, I think we should take this slowly: enable actual accelerators first, then see if it makes sense for TensorRT/MKLDNN, and then maybe GPU. I would like to see the accelerator API (or a first pass at it) as part of 2.0, though. Is this feasible from your perspective, @szha?
@samskalicky I'm not sure about the accelerator API. It seems that the existing subgraph API, in combination with 1) better third-party operator support and 2) exposing graph partitioning as an API, should be able to serve the same goal as the accelerator API. Those items are useful in other contexts too and deserve more attention, so I'd suggest those as an alternative to a specialized API just for accelerators.
@szha I agree the third-party operator support could be more useful to the broader community, and I have been continually working on it in my spare time. I would be interested in collaborating with others to move this along faster; should we consider that as part of the 1.6 release? But after discussing with @zheng-da today, the subgraph API plus operator support does not serve the same goal as the accelerator API. Some additional external APIs (such as external subgraph properties, support for compiled binaries for subgraphs, or binding accelerators to subgraphs) would be needed to serve the same goal.
I think we're talking about the same thing (i.e. item 2 in my last response).
Maybe I misunderstood item 2; I assumed it meant better APIs for partitioning. Could you clarify what you mean by item 2? Do you mean third-party graph partitioning (similar to third-party operator support)?
(1) #10840: add einsum, since it is useful and could simplify linear algebra ops (it is now supported by TensorFlow and PyTorch); a brief illustration follows below.
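A small illustration of what the requested operator does, written with NumPy since MXNet did not yet ship einsum at the time: a batched matrix multiplication expressed as a single subscript string.

```python
import numpy as np

a = np.random.rand(8, 3, 4)   # batch of 8 matrices, each 3x4
b = np.random.rand(8, 4, 5)   # batch of 8 matrices, each 4x5

c = np.einsum('bij,bjk->bik', a, b)   # one call covers the whole contraction
assert c.shape == (8, 3, 5)
assert np.allclose(c, a @ b)          # matches the batched matmul result
```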
I would like to bring up one issue with the profiler: currently, there is a flag [...]. I have an issue about this: #15658. I think maybe we should remove this flag in 2.0 to avoid confusion?
Julia-related issue
Expect more image operations: adjust_colors (not random), rotate, and more.
Remove the deprecated [...]
We need to fix this issue as well: #16216. It's a breaking change.
Are there any plans to move the training logic (dataset handling, distributed training, etc.) into the core, to avoid having all of that logic in the frontend languages?
Let's start a discussion here about the roadmap towards MXNet 2.0. We are looking for:
If you have any item that you'd like to propose to have in the roadmap, please do:
Given that this would be a major release, we'd have the opportunity to make backward incompatible changes. This would allow us to visit some topics that require large changes such as dropping support for python2, transitioning fully to cmake, making the tensor library numpy-compatible, or even new programming models.
Now that we decided to follow semantic versioning for releases, it would be a good idea to coordinate features and API changes to make the best use of the next major release. Thus, I propose that we use this issue to track the APIs we'd like to change in the next major version.
The candidates I've collected so far:
download (see Exp backoff for downloads, #9671)

Once there are more of such requests, I will try to organize these API-breaking requests better.