Releases: apache/mxnet
v2.0.0.alpha.rc1
v2.0.0 Alpha RC1
v2.0.0.alpha.rc0
v2.0.0 Alpha RC0
Apache MXNet (incubating) 1.7.0 Release
New features
MXNet Extensions: custom operators, partitioning, and graph passes
Adds support for extending MXNet with custom operators, partitioning strategies, and graph passes. These are all implemented in a library that can be compiled separately from the MXNet codebase and dynamically loaded at runtime into any prebuilt installation of MXNet.
fix for number of inputs/outputs for backward custom ops (#17069)
Enhancements for custom subgraph op (#17194)
Disable flaky test_custom_op_fork (#17481)
fix custom op makefile (#17516)
Update CustomOp doc with changes for GPU support (#17486)
[WIP] MXNet Extensions enhancements (#17885) (#18128)
Dynamic subgraph property (#17034)
Dynamic subgraph property doc (#17585)
[1.7] Backport MXNet Extension PRs (#17623, #17569, #17762) #18063 (#18069)
OpPerf utility enabled in the binary distribution
[OpPerf] Add Neural network loss ops (#17482)
[OpPerf] Fixes the issue when you pass NDArray to run_perf_test (#17508)
[OpPerf] Fix markdown for native profile and add profile param in function desc (#17494)
[OpPerf] Add Indexing ops (#16253)
[OpPerf] Implement remaining random sampling ops (#17502)
[OpPerf] Implement remaining GEMM ops (#17501)
[OpPerf] Implement all linalg ops (#17528)
[OpPerf] Fixed native output ordering, added warmup & runs command line args (#17571)
[OpPerf] Add norm, cast ops, remaining optimizer ops (#17542)
[Large Tensor] Fixed Embedding op (#17599)
[OpPerf] Fixed Python profiler bug (#17642)
MKL-DNN
MKL-DNN as the default CPU backend in binary distribution
Branding change to DNNL
Upgrade MKL-DNN dependency to v1.1 (#16823)
Support bfloat16 datatype
Add bfloat16 floating-point format support based on AMP (#17265)
New operators
[New Op] Add deformable conv v2 (#16341)
Add MXNet Ops for fast multihead attention (#16408)
Support boolean elemwise/broadcast binary add, multiply and true_divide (#16728)
add gammaln, erf, erfinv (#16811)
add aligned roi introduced in Detectron2 (#16619)
Implement atleast_1d/2d/3d (#17099)
Interleaved MHA for CPU path (#17138)
Lamb optimizer update (#16715)
Quantized Embedding (#16691)
Add gelu fuse ops (#18082) (#18092)
Feature improvements
NumPy-compatible interface (experimental)
[NumPy] NumPy support for linalg.inv (#16730)
add numpy op nan_to_num (#16717)
[Numpy] Add sampling method for bernoulli (#16638)
Fix numpy-compatible mean output type for integer inputs (#16792)
[Numpy] Fix collect_params().zero_grad() in gluon numpy interface (#16716)
[Numpy][Operator] 'where' Implementation in MXNet (#16829)
[Numpy] Random.normal() with backward (#16330)
Add OP diag [numpy] (#16786)
Mixed precision binary op backward (use in) for numpy (#16791)
add numpy op diagflat [numpy] (#16813)
add op bitwise_or [numpy] (#16801)
[Numpy] Implementation npx.{sample}n (#16876)
[Numpy] Add NumPy support for np.linalg.det and np.linalg.slogdet (#16800)
Op Unravel_index PR [Numpy] (#16862)
[Numpy] Fix imperative basic indexing in numpy (#16902)
[Numpy] Basic indexing in symbolic interface of DeepNumpy (#16621)
[Numpy] add op full_like, c++ impl, fix zeros_like, ones_like type inference (#16804)
[Numpy] Implement numpy operator 'average' (#16720)
[Bugfix] [Numpy] Add kAddTo and kNullOp to Transpose (#16979)
set rtol = 1e-2 and atol = 1e-4 when dtype == np.float32 in test_numpy_op.py:test_np_linalg_solve (#17025)
Op_Diagonal [Numpy] (#16989)
numpy bincount (#16965)
[numpy] add op bitwise_not (#16947)
[Numpy ]Modify np.random.shuffle to enable inplace by default (#17133)
[numpy] fix argsort typo (#17150)
[numpy] add op round (#17175)
[numpy]Add op delete (#17023)
[numpy] add op flipud, fliplr (#17192)
[CI] Re-enable testing with numpy 1.18 (#17200)
[Numpy] Add broadcast_to scalar case (#17233)
[Numpy] Random.gamma() implemented (#16152)
[Numpy] add row_stack (=vstack) (#17171)
[Numpy] Add infra for performing constraint check (#17272)
porting numpy-compatible hstack to master and add dstack for interoperability (#17030)
adding asnumpy() to output of gather(implicitly called) to fix gather test in large vector and tensor tests (#17290)
[numpy] add op random.exponential (#17280)
[NumPy] Add NumPy support for norm (#17014)
[numpy]add op random.lognormal (#17415)
Add numpy random weibull operator (#17505)
[numpy] Add np.random.pareto and np.random.power (#17517)
[Numpy] Add sort op (#17393)
[numpy]implement exponential backward (#17401)
[Numpy] Where operator scalar version (#17249)
[numpy] add op matmul (#16990)
[numpy]add op random.logistic, random.gumbel (#17302)
[numpy][Do Not Review]add op insert (#16865)
[numpy] add op random.rayleigh (#17541)
[numpy] add fallback ops (#17609)
[numpy] add op pad (#17328)
[numpy] add op fabs, sometrue, round (#17619)
Add arange_like to npx (#16883)
try to move shape_array to npx (#16897)
support np.argsort (#16949)
np.broadcast_to extension (#17358)
support bitwise_and (#16861)
fix np.argmax/argmin output data type (#17476)
add op random.beta (#17390)
add op isnan isinf (#17535)
array_split pr (#17032)
Mixed data type binary ops (#16699)
randn implemented (#17141)
refactor and reduce float types for some functions, also add bitwise_xor (#16827)
any/all (#17087)
amax (#17176)
fix format (#17100)
add op empty_like, add nan_to_num to dispatch (#17169)
handle array_like fill_value for np.full; add unit test coverage (#17245)
add np.amin (#17538)
add npx.gather_nd (#17477)
add np.random.chisquare (#17524)
add polyval (#17416)
add isposinf isneginf isfinite (#17563)
Support broadcast assign for npi_boolean_mask_assign_tensor (#17131)
Implement Weibull backward (#17590)
support np.dsplit, fix some error msgs and corner cases for hsplit and vsplit, add interoperability tests for h/v/dsplit (#17478)
add np.product (#17489)
Implement np.random.pareto backward (#17607)
add np.ediff1d (#17624)
more support for boolean indexing and assign (#18352)
Fix einsum gradient (#18482)
[v1.7.x] Backport PRs of numpy features (#18653)
[v1.7.x] backport mixed type binary ops to v1.7.x (#18649)
revise activations (#18700)
Large tensor support
[Large Tensor] Add support to Random Sample & Pdf ops (#17445)
[Large Tensor] Add LT support for NN optimizers and 1 activation function (#17444)
[Large Tensor] Fixed SoftmaxActivation op (#17634)
[Large Tensor] Fixed col2im op (#17622)
[Large Tensor] Fixed Spatial Transformer op (#17617)
[Large Tensor] Fix ravel_multi_index op (#17644)
Sparse int64 Large tensor support (#16898)
Re-Enabling Large Tensor Nightly on GPU (#16164)
enabling build stage gpu_int64 to enable large tensor nightly runs (#17546)
MKL-DNN enhancement
MKLDNN FC : Add error info when mkldnn fc bias dimension is wrong (#16692)
[MKLDNN] support mkldnn gelu (#16710)
[MKLDNN] Fix int8 convolution/fc bias overflow (#16734)
[MKLDNN] use dim_t instead of int in slice/transpose operators (#16737)
Mkldnn fullyConnect bwd bug fix (#16890)
Revert Mkldnn fullyConnect bwd bug fix (#16890) (#16907)
[MKLDNN] Use MKLDNNRun (#16772)
[MKLDNN] mkldnn RNN operator enhancement (#17075)
[MKLDNN] enable MaxPooling with full pooling convention (#16860)
update mkldnn to v1.1.2 (#17165)
improve mkldnn doc (#17198)
[MKLDNN] Fix _copyto (#17173)
[MKLDNN] Support channel wise quantization for FullyConnected (#17187)
fixed seed for mkldnn test (#17386)
add mkldnn softmax backward (#17170)
cmake: copy dnnl headers to include/mkldnn (#17647)
[mkldnn]Mkldnn bn opt backport from master to 1.7x (#18009)
[v1.x] Update 3rdparty/mkldnn remote URL and pin to v1.3 (#17972) (#18033)
[v1.x] backport #17900 [MKLDNN] support using any format in pooling backward (#18067)
Static link MKL-DNN library (#16731)
Add large tensor nightly tests for MKL-DNN operators (#16184)
[MKL-DNN] Enable and Optimization for s8 eltwise_add (#16931)
[MKL-DNN] Enhance Quantization Method (#17161)
Static Build and CD for mxnet-cu102/mxnet-cu102mkl (#17074)
MKL-DNN RNN backward path enhancement (#17183)
cmake: check USE_OPENMP and pass proper MKL-DNN build flags (#17356)
update mkl to 2020.0 (#17355)
Enable MKL-DNN by default in pip packages (#16899)
Enable MKL-DNN FullyConnected backward (#17318)
Softmax primitive cache and in-place computation (#17152)
boolean_mask_assign with start_axis (#16886)
use identity_with_cast (#16913)
change error tolerance for bf16 bn (#18110)
[v1.x] Backport #17689 and #17884 to v1.x branch (#18064)
refactor codes and add an option to skip/check weight's version to reduce overhead (#17707) (#18039)
[v1.x] Backport #17702 and #17872 to v1.x branch (#18038)
TensorRT integration
Update TensorRT tutorial to build-from-source. (#14860)
Minor fix, use RAII for TensorRT builder and network object (#17189)
Quantization
Add silent option to quantization script (#17094)
Profiler
Implemented final two binary ops, added default params for functionality (#17407)
Implement remaining nn_activation ops in opperf (#17475)
Implement all miscellaneous ops (#17511)
Implement remaining nn_basic ops in opperf (#17456)
ONNX
Fix memory leak reported by ASAN in NNVM to ONNX conversion (#15516)
ONNX export: Gather (#15995)
ONNX export: Slice op - Handle None value for ends (#14942)
New models
[Model] Implement Neural Collaborative Filtering with MXNet (#16689)
Further optimization for NCF model (#17148)
HMM Model (#17120)
Operator improvements
Faster GPU NMS operator (#16542)
[MXNET-1421] Added (CuDNN)BatchNorm operator to the list of mirrored operators (#16022)
dynamic custom operator support (#15921)
Multi Precision Lamb Update operator (#16885)
Add im2col and col2im operator (#16502)
Quantized Elemwise Mul Operator (#17147)
Enhancements for MXTensor for custom operators (#17204)
Enabling large tensor support for binary broadcast operators (#16755)
Fix operators lying about their number of inputs (#17049)
[W...
Apache MXNet (incubating) 1.6.0
Deprecation of Python 2
The MXNet community voted to no longer support Python 2 in future releases of MXNet. Therefore, the MXNet 1.6 release is the last MXNet release to support Python 2.
New features
NumPy compatible interface and using TVM to generate operators
NumPy has long been established as the standard math library in Python, the most prevalent language for the deep learning community. With this library as the cornerstone, Python now has the largest ecosystem and community for scientific computing. The popularity of NumPy comes from its flexibility and generality.
In #14253, the MXNet community reached consensus on moving towards a NumPy-compatible programming experience and committed to a major endeavor on providing NumPy-compatible operators.
The primary goal of the projects below is to provide the equivalent usability and expressiveness of NumPy in MXNet to facilitate Deep Learning model development, which not only helps existing deep learning practitioners but also provides people in the existing NumPy community with a shortcut for getting started in Deep Learning. The efforts towards this goal would also help a secondary goal, which is to enable the existing NumPy ecosystem to utilize GPUs and accelerators to speed up large scale computation.
- Infra to use tvm write op kernels (#15550)
- fix boolean_mask for 0-size output (#15731)
- fix tvm cmake (#15781)
- Numpy-compatible Infra (#15581)
- [MXNET-1206] Support NDArray indexing with None and Ellipsis (#13143)
- numpy-compatible sum (#15810)
- [Numpy] Numpy compatible slicing (#15798)
- Numpy Tensordot and Dot Operator (#15820)
- numpy linspace (#15852)
- tvm infra for op attrs (#15854)
- Port several np ops to master (#15867)
- numpy-compatible split upstream (#15841)
- Numpy-compatible concatenate upstream (#15894)
- Numpy-compatible stack upstream (#15842)
- [Numpy] Numpy behavior random.uniform() (#15858)
- Tvm broadcast backward (#15938)
- np elemwise unary ops upstream (#15831)
- [Numpy] random.randint() implemented (#15956)
- Refines NDArray indexing and adds numpy ndarray indexing [READY FOR REVIEW] (#15942)
- Port ops from np branch (#16018)
- numpy-compatible cumsum upstream (#15924)
- NumPy-compatible infrastructure on Gluon (#16024)
- [OP] Support range as advanced index for ndarrays (#16047)
- Numpy compatible max min (#16046)
- NumPy-compatible Mean, Std and Var (#16014)
- Add fluent methods mean, std, var for ndarray (#16077)
- numpy multinomial op (#15878)
- add numpy operator remainder (#16080)
- [Numpy] Random.choice implemented (#16089)
- Fix sample.normal shape inference
- Numpy add numpy op indices (#15837)
- [Numpy] Numpy copysign (#15851)
- numpy operator ravel, derive from reshape (#16016)
- Add array_function
- Improved error messages
- Fix np.choice
- add exception check for numpy reshape (#16180)
- [Numpy] Numpy behavior normal distribution (#16109)
- fix multinomial bug on gpu (#16204)
- [Numpy] Differentiable svd (#15795)
- add epsilon to sum(pvalue) upperbound (#16211)
- np compatible vstack (#15850)
- Numpy add numpy op roll (#15902)
- add numpy compatible trace (#16008)
- add numpy op hanning, hamming, blackman (#15815)
- [Numpy]flip (#15819)
- numpy operator around (#16126)
- numpy operator arctan2 (#15890)
- numpy operator nonzero (#15838)
- numpy operator hypot (#15901)
- tvm numpy operator deg2rad && rad2deg (#16015)
- numpy op unique
- try to fix bug
- fix memory bug and disable some test
- fix according to review
- Numpy operators: lcm, tril, identity and take (#16264)
- [numpy] Cosmetic improvement on mxnet.numpy builtin op signature in documentation (#16305)
- Disable Pylint false error in numpy_op_signature (#16370)
- boolean_mask_assign operator for future boolean indexing (#16361)
- Implements ldexp. (#15845)
- Numpy Operators: Inner, Outer, vdot (#15846)
- Numpy det and slogdet operators (#15861)
- Fix random op signature
- fix choice signature
- add raise test for shape
- Add boolean ndarray (#15940)
- global numpy shape flag (#16335)
- numpy-compatible histogram (#16266)
- [Numpy] Numpy compatible dstack (#15871)
- numpy eye op (#16132)
- Numpy compatible vsplit; minor changes to split (#15983)
- add numpy op logspace (#15825)
- add numpy op bitwise_xor, hsplit, moveaxis, rot90 (#16257)
- Fix optimizer bug for np attribute (#16494)
- Tests of NumPy interoperability (#16469)
- improve unary and binary operator handling and refactor tests (#16423)
- [DOC] Fix numpy op doc (#16504)
- [Numpy] More numpy dispatch tests (#16426)
- [Numpy] einsum (#15911)
- Add test pipeline for USE_TVM_OP=OFF on Unix (#16450)
- Numpy dispatch test of ...... (#16422)
- setup and concatenate, copy, expand_dims, expm1 (#16493)
- add sum for boolean type in mainline (#16436)
- [Numpy] SVD outputs tuple (#16530)
- numpy op doc: max, min, prod (#16506)
- add interface for rand
- Fix numpy bugs (#16537)
- pickler override for np ndarrays (#16561)
- [numpy]op test in new pattern (#16556)
- Enforce adding documentation for builtin numpy operators (#16575)
- [Numpy] Support N_D(N>=3) batch_dot (#16586)
- [Numpy] Loading numpy-incompatible NDArray in numpy-compatible mode (#16597)
- Fix index overflow bug in einsum (#16589)
- add npx reshape (#16640)
- add type switch to weight tensor (#16543)
- numpy doc enhancement (#16637)
- Infra for tvm op runtime dispatch (#16100)
- [NumPy][Operator] NumPy operator may_share_memory and shares_memory (#16533)
- [Numpy] Numpy operator diff (#15906)
- Miscellaneous fix for several numpy issues (#16664)
- [Numpy] implement np.column_stack (#16594)
- [numpy] add numpy operator : append (#16564)
- Backport of #16711, #16737, #16408 to 1.6 branch (#16763)
- Backport to 1.6 (#16773, #16781, #16783, #16716, #16699, #16728, #16769, #16792) (#16832)
- [Backport][v1.6.x] Fix the wrong result of sum, mean, argmin, argmax when inputs contain inf or nan (#16884)
- Backport of #16827, #16791 and #16888 to 1.6 branch (#16901)
- port shape op to 1.6.x (#16912)
- [Numpy] Fix imperative basic indexing in numpy (#16902) (#16919)
- Backport #16895, #16922, #16878, #16979 and #16900 to 1.6 (#17029)
Graph optimizations
Pointwise fusion for GPU
DL models, besides compute-intensive operations like convolutions and fully connected layers, feature many simple pointwise (aka elementwise) operations, such as elementwise addition. The performance of those operations is fully memory-bandwidth bound and so limits the speedups available from newer GPU hardware, which typically has a high compute-to-memory-bandwidth ratio. When several such operations are chained one after another, the result is a series of unnecessary stores and loads as well as potentially increased memory usage to store the intermediate results. Pointwise fusion alleviates those problems by just-in-time generation of fused operators, which do not store intermediate results in memory, improving both performance and memory usage.
- Pointwise fusion for GPU (#15167)
- Backport #16798, #16836 and #16838 to 1.6 (#16874)
- Add support for boolean inputs to FusedOp (#16796) (#16892)
- Workaround problem with fusion in CUDA 9 (#17028) (#17035)
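The idea behind pointwise fusion can be illustrated with a small pure-Python sketch. This is not MXNet's implementation, which generates fused GPU kernels just in time; the helper names `unfused` and `fused` are hypothetical and only contrast storing every intermediate result with computing the whole chain in a single pass.

```python
import math


def unfused(xs, ys):
    """Run each pointwise op as its own pass, storing every intermediate."""
    buffers = 0
    added = [x + y for x, y in zip(xs, ys)]   # pass 1: store x + y
    buffers += 1
    scaled = [2.0 * a for a in added]          # pass 2: load, scale, store
    buffers += 1
    out = [math.tanh(s) for s in scaled]       # pass 3: load, activate, store
    buffers += 1
    return out, buffers


def fused(xs, ys):
    """Compute the same chain in one pass; intermediates stay in locals."""
    out = [math.tanh(2.0 * (x + y)) for x, y in zip(xs, ys)]
    return out, 1  # only the final output buffer is written


xs = [0.0, 0.5, -1.0]
ys = [1.0, -0.5, 0.25]
unfused_out, unfused_buffers = unfused(xs, ys)
fused_out, fused_buffers = fused(xs, ys)
```

Both versions produce the same values, but the fused version writes one buffer instead of three, which is the saving in memory traffic that the description above refers to.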
Eliminate common subexpressions
- Eliminate common expressions (#15657)
Default MKLDNN Subgraph fusion
- [MKLDNN] Enable subgraph backend mkldnn by default. (#15518)
New operators
- [OP] Add a new arange_like operator to contrib (#15400)
- PDF operators for each distribution for which we have a random sampler (plus also the PDF of the Dirichlet). Supports probabilities and log-probabilities, as well as gradients. (#14617)
- Group Normalization (#14959)
- Add RROIAlign (#16017)
- Add fast implementation of LARS (#16122)
- Round and sign straight-through-estimators C operators. (#16373)
- New ops for RCNN + old ops improvements for RCNN (#16215)
- Comparison ops implemented using mshadow (#16414)
- Add mask target generator operator for Mask-RCNN (#16268)
- Move MRCNNMaskTarget op to contrib (#16486)
- Mxnet allclose (#14443)
- Aggregated adamw update (#16398)
- Make mrcnn_mask_target arg mask_size a 2d tuple (#16567)
- Dgl ops 2 (#16416)
- Lamb optimizer update (#16715)
- [OP] changing data type of 't' to int in lamb_update_phase1 (#16903)
- Multi Precision Lamb Update operator (#16885)
- Interleaved MHA for CPU path (#17138) (#17211)
Feature improvements
Automatic Mixed Precision
- [AMP] Move topk from FP16_FP32_FUNCS to FP32_FUNCS (#15342)
- Conversion from FP32 model to Mixed Precision model (#15118)
- Update fp16 docs: Block.cast is inplace (#15458)
- FP16 Support for C Predict API (#15245)
- Add AMP Conversion support for BucketingModule (#15528)
Gluon Fit API
- Fixing build for gluon estimator test, including libtvm in pack libs (#16148)
- [Estimator] handle composite metrics in estimator (#16676)
- [Estimator] refactor estimator to allow overriding evaluate/fit of a batch (#16678)
- [Estimator] refactor estimator and clarify docs (#16694)
- [Gluon] Improve estimator usability and fix logging logic (#16810) (#16846)
- Backport Gluon estimator changes to 1.6 (#17048)
- fix parameter names in the estimator api (#17051) (#17162)
MKLDNN
- Upgrade MKL-DNN submodule to v0.20 release (#15422)
- Fix quantized concat when inputs are mixed int8 and uint8 (#15693)
- [MKLDNN]Enhance Quantization APIs and Tutorial (#15448)
- Add quantization support for GluonCV (#15754)
- add int8 bn mkldnn implementation and test (#15664)
- [Quantization]support exclude operators while quantization (...
Apache MXNet (incubating) 1.5.1 patch release
Apache MXNet (incubating) 1.5.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache MXNet (incubating) 1.5.0 are advised to upgrade. You can install Apache MXNet (incubating) 1.5.1 from the usual place. Please review these Release Notes to learn about the bug fixes.
Bug-fixes
- add deconv in TRT subgraph (#15666) (#16043)
- Update TRT tutorial with new APIs (#16044)
- Fix _copy_to on MKLDNN backend (#15637) (#15803)
- Benchmark doc fix (#15769) (#16029)
- remove Julia cat image for license issue (#15964) (#16026)
- added check for empty params file and unknown param (not arg/aux) (#15917)
- fix license issues (#15806) (#15860)
- prevent TRT_Logger to be destroyed before TRT engine (#14898) (#15877)
- [MXNET-1086] added sub and mul to ONNX->TensorRT conversion (#15344) (#15875)
- handle fix_gamma in tensorrt subgraph conversion correctly (#15645) (#15874)
- fix LinearRegressionOutput with empty label (#15620) (#15873)
- [v1.5.x] [MKLDNN] Independent gradients requests check with respect to weights… (#15805)
- fix dropout mask output (#15697) (#15804)
- fix fp32 flatten issue (#15351) (#15802)
- Clojure package remove source images (#15828)
- changed constructor args (#15601) (#15827)
- Add MKLDNN 4c layout to fix gluoncv se_resnext101_64x4d (#15692) (#15801)
- Fix the bug of MXEnginePushAsyncND and MXEnginePushSyncND (#15751) (#15792)
How to build MXNet
Please follow the instructions at https://mxnet.incubator.apache.org/get_started
List of submodules used by Apache MXNet (Incubating) and when they were updated last
Name | Commit-id | Last update in MXNet | Last update in module |
---|---|---|---|
dlpack | 10892ac | Oct 30, 2017 | Aug 12, 2019 |
dmlc-core | 3943914 | May 14, 2019 | Sep 2, 2019 |
googletest | eb9225c | Jan 14, 2019 | Aug 29, 2019 |
mkldnn | 41bee20 | May 14, 2019 | Aug 27, 2019 |
mshadow | 1d79ecf | May 13, 2019 | Aug 4, 2019 |
nvidia_cub | c3cceac | Feb 16, 2018 | Jul 17, 2019 |
onnx-tensorrt | 1e209e5 | Jan 3, 2019 | Aug 22, 2019 |
openmp | 37c7212 | Nov 14, 2017 | Aug 28, 2019 |
ps-lite | 8a76389 | Apr 25, 2018 | Sep 2, 2019 |
tvm | 21935dc | May 21, 2019 | Sep 2, 2019 |
Apache MXNet (incubating) 1.5.0
New Features
Automatic Mixed Precision (experimental)
Training Deep Learning networks is a very computationally intensive task. Novel model architectures tend to have increasing numbers of layers and parameters, which slow down training. Fortunately, software optimizations and new generations of training hardware make it a feasible task.
However, most of the hardware and software optimization opportunities exist in exploiting lower precision (e.g. FP16) to, for example, utilize Tensor Cores available on new Volta and Turing GPUs. While training in FP16 showed great success in image classification tasks, other more complicated neural networks typically stayed in FP32 due to difficulties in applying the FP16 training guidelines.
That is where AMP (Automatic Mixed Precision) comes into play. It automatically applies the guidelines of FP16 training, using FP16 precision where it provides the most benefit, while conservatively keeping operations that are unsafe in FP16 in full FP32 precision. To learn more about AMP, check out this tutorial.
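To see why some operations are considered unsafe in FP16, the following standalone sketch simulates half and single precision with Python's `struct` module. It is only an illustration of the numerical issue, not AMP code; the helpers `to_fp16`, `to_fp32`, and `accumulate` are hypothetical names introduced here.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(count, cast):
    """Add 1.0 `count` times, rounding the running total after every step."""
    total = 0.0
    for _ in range(count):
        total = cast(total + 1.0)
    return total


fp16_sum = accumulate(4096, to_fp16)   # stalls once the spacing between values exceeds 1
fp32_sum = accumulate(4096, to_fp32)   # exact
tiny_grad_fp16 = to_fp16(1e-8)         # too small for FP16, flushes to zero
```

The FP16 running sum stops growing at 2048 because the next representable value is 2050, while the FP32 sum reaches 4096. Long reductions and very small gradient values are therefore typical candidates for staying in FP32.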
MKL-DNN Reduced precision inference and RNN API support
Two advanced features, fused computation and reduced-precision kernels, were introduced in a recent version of MKL-DNN. These features can significantly speed up inference on CPU for a broad range of deep learning topologies. The MXNet MKL-DNN backend provides optimized implementations for various operators covering a broad range of applications including image classification, object detection, and natural language processing. Refer to the MKL-DNN operator documentation for more information.
Dynamic Shape (experimental)
MXNet now supports Dynamic Shape in both imperative and symbolic mode. MXNet used to require that operators statically infer the output shapes from the input shapes. However, there exist some operators that don't meet this requirement. Examples are:
- while_loop: its output size depends on the number of iterations in the loop.
- boolean indexing: its output size depends on the value of the input data.
- many operators can be extended to take a shape symbol as input and the shape symbol can determine the output shape of these operators (with this extension, the symbol interface of MXNet can fully support shape).
To support dynamic shape and such operators, we have modified the MXNet backend. MXNet now supports operators with dynamic shape such as contrib.while_loop, contrib.cond, and mxnet.ndarray.contrib.boolean_mask.
Note: Currently dynamic shape does not work with Gluon deferred initialization.
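The boolean indexing case can be sketched in plain Python to show why static shape inference is not possible. The `boolean_mask` function below is a hypothetical list-based stand-in written for illustration, not the MXNet operator: two calls with identically shaped inputs return outputs of different sizes.

```python
def boolean_mask(data, mask):
    """Keep the rows of `data` whose mask entry is truthy."""
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    return [row for row, keep in zip(data, mask) if keep]


data = [[1, 2], [3, 4], [5, 6]]

# Same input shapes (3 rows, 3 mask entries) in both calls,
# but the output size depends on the mask values.
out_a = boolean_mask(data, [1, 0, 1])
out_b = boolean_mask(data, [0, 0, 1])
```

Here `out_a` has two rows and `out_b` has one, so the output shape cannot be determined from the input shapes alone.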
Large Tensor Support
Currently, MXNet supports a maximum tensor size of around 4 billion (2^32) elements. This is because uint32_t is used as the default data type for tensor size, as well as for variable indexing.
This limitation has created many problems when larger tensors are used in the model.
A naive solution to this problem is to replace all uint32_t in the MXNet backend source code with int64_t.
This solution is not viable, however, for three reasons. First, many data structures use uint32_t as the data type for their members.
Unnecessarily replacing these variables with int64_t will increase memory consumption, causing another limitation. Second, MXNet has many submodule dependencies.
Updating the variable types in the MXNet repository is not enough. We also need to make sure that the different libraries, such as MKLDNN, MShadow, etc., support the int64_t integer data type.
Third, many front end APIs assume an unsigned 32-bit integer interface. Only updating the interface in C/C++ will cause all the language bindings to fail.
Therefore, we need a systematic approach to enhance MXNet to support large tensors.
You can now enable large tensor support by setting the following build flag to 1: USE_INT64_TENSOR_SIZE = 1. Note that this is set to 0 by default.
For more details please refer to the design document.
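The size limit is simple arithmetic, which the following sketch works through. The shape and the helper names `num_elements` and `as_uint32` are assumptions chosen for illustration; the code only models how an element count behaves when stored in an unsigned 32-bit integer versus a signed 64-bit one.

```python
UINT32_COUNT = 2 ** 32   # number of distinct uint32_t values
INT64_MAX = 2 ** 63 - 1  # largest int64_t value


def num_elements(shape):
    """Total number of elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def as_uint32(value):
    """Value after truncation to an unsigned 32-bit integer (wraps modulo 2**32)."""
    return value % UINT32_COUNT


shape = (50000, 100000)
true_size = num_elements(shape)        # 5_000_000_000 elements
wrapped_size = as_uint32(true_size)    # what a uint32_t size field would hold
fits_in_int64 = true_size <= INT64_MAX
```

A 50000 x 100000 tensor has 5,000,000,000 elements, which wraps to 705,032,704 in a 32-bit size field but fits comfortably in a 64-bit one.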
Dependency Update
MXNet has added support for CUDA 10, CUDA 10.1, cuDNN 7.5, NCCL 2.4.2, and NumPy 1.16.0.
These updates are available through PyPI packages and when building from source. Refer to the installation guide for more details.
Gluon Fit API (experimental)
Training a model in Gluon requires users to write the training loop. This is useful because of its imperative nature; however, repeating the same boilerplate code across multiple models can become tedious.
The training loop can also be overwhelming to some users new to deep learning. We have introduced an Estimator and Fit API to simplify writing the training loop.
Note: this feature is still experimental. For more details, refer to the design document.
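The general pattern of moving the training loop behind a `fit` call can be sketched without any framework. The `TinyEstimator` class below is a hypothetical stand-in written for illustration; it is not the Gluon Estimator and makes no claim about its signature. It fits a single weight to samples of y = 2x with plain SGD.

```python
class TinyEstimator:
    """Hypothetical stand-in that owns the training loop (not the MXNet class)."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate
        self.history = []  # mean squared error per epoch

    def fit(self, train_data, epochs):
        """Run the loop the user would otherwise write by hand."""
        for _ in range(epochs):
            total_loss = 0.0
            for x, y in train_data:
                error = self.weight * x - y          # forward pass
                total_loss += error * error
                gradient = 2.0 * error * x           # d(error^2)/d(weight)
                self.weight -= self.learning_rate * gradient
            self.history.append(total_loss / len(train_data))
        return self


train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
estimator = TinyEstimator(weight=0.0, learning_rate=0.05).fit(train_data, epochs=20)
```

The caller supplies only the data and the number of epochs; the per-batch forward pass, gradient step, and metric bookkeeping live in one reusable place.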
New Operators
- split_v2 (#13687)
- Gradient multiplier (contrib) operator (#13632)
- Image normalize operator - GPU support, 3D/4D inputs (#13802)
- Image ToTensor operator - GPU support, 3D/4D inputs (#13837)
- Add Gluon Transformer Crop (#14259)
- GELU (#14449)
- AdamW operator (Fixing Weight Decay Regularization in Adam) (#13728)
- [MXNET-1382] Add the index_array operator (#14638)
- add an operator for computing the likelihood of a Hawkes self-exciting process (#14683)
- Add numpy linspace (#14927)
Feature Improvements
Operators
- make ROIAlign support position-sensitive pooling (#13088)
- Add erfinv operator for calculating inverse error function (#13811)
- Added optional parameters to BilinearResize2D to do relative scaling (#13985)
- MXNET-1295 Adding integer index support to Sequence* family of operators. (#13880)
- Export resize and support batch size (#14014)
- CUDNN dropout (#13896)
- Relaxing type requirements for slice_like op (#14097)
- Relaxing type requirements for reshape_like op (#14325)
- Parallelize CPU version and add GPU version of boolean_mask op (#14090)
- Add NHWC layout support to Pooling (cpu, gpu cuda, gpu cuDNN) (#13749)
- Multi-precision AdamW update op (#14171)
- [op] add back support for scalar type rescale_grad argument for adamw_update/mp_adamw_update (#14221)
- move choose_element_0index to operator (#14273)
- Optimize NMS (#14290)
- Optimize NMS part 2 (#14352)
- add background class in box_nms (#14058)
- Use cudnn for dropout by default (#14278)
- In-place updates for Nadam, Adadelta, Adamax and SGLD (#13960)
- Aggregate SGD (#13346)
- Add proper exception message for negative shape in array creation routines (#14362)
- Support multi-threading for Custom Operator (#14363)
- moveaxis operator now accepts negative indices and sequence of ints as well. (#14321)
- Support SyncBatchNorm5D (#14542)
- Add nd.power and sym.pow (#14606)
- Change RNN OP to stateful (#14476)
- Add imresize and copyMakeBorder to mx.image (#13357)
- add ctx for rand_ndarray and rand_sparse_ndarray (#14966)
- Add cpu implementation for Deformable PSROIPooling (#14886)
- Add warning for fp16 inputs with MXNET_SAFE_ACCUMULATION=0 (#15046)
- Safe LayerNorm (#15002)
- use MXNET_SAFE_ACCUMULATION for softmax accumulator (#15037)
- LayerNorm acceleration on GPU (#14935)
- Add matrix inversion operator in linalg (#14963)
- implementation for equivalence of tf.moments (#14842)
- Use env var to enforce safe accumulation in ReduceAxesCompute (#14830)
- [MXNet-1211] Factor and "Like" modes in BilinearResize2D operator (#13226)
- added extraction/generation of diagonal and triangular matrices to linalg (#14501)
- [Mxnet-1397] Support symbolic api for requantize and dequantize (#14749)
- [MXNET-978] Support higher order gradient for log. (#14992)
- Add cpu implementation for Deformable Convolution (#14879)
MKLDNN
- Feature/mkldnn static (#13628)
- Feature/mkldnn static 2 (#13503)
- support mkl log when dtype is fp32 or fp64 (#13150)
- Add reshape op supported by MKL-DNN (#12980)
- Move the debug output message into MXNET_MKLDNN_DEBUG (#13662)
- Integrate MKLDNN Conv1d and support 3d layout (#13530)
- Making MKL-DNN default on MXNet master (#13681)
- Add mkldnn OP for slice (#13730)
- mkldnn s8 conv API change for master (#13903)
- [MKLDNN] Enable signed int8 support for convolution. (#13697)
- add mkldnn softmax_output (#13699)
- MKLDNN based Quantized FullyConnected Operator and its fusion (#14128)
- Fix entropy for uint8 (#14150)
- Update MKL-DNN to v0.18 release (was: fix the Dense layer issue) (#13668)
- [MKL-DNN] Enable s8 support for inner product and 3d input with flatten=false (#14466)
- Optimize transpose operator with MKL-DNN (#14545)
- [MKLDNN] Remove repeat parts in MKLDNN.md (#14995)
- [MKLDNN] Enable more convolution + activation fusion (#14819)
- Update MKL-DNN submodule to v0.19 (#14783)
- Add mkldnn_version.h to pip package (#14899)
- [MKLDNN] add quantized sum (#14614)
- [MKLDNN]Refactor requantize to speed up execution (#14608)
- [MKLDNN]Add quantized relu (#14604)
- Add MKLDNN headers to pip package (#14339)
- add symbolic link to mkldnn header files in include (#14300)
- disable default MKLDNN for cross compilation (#13893)
- Update MKLDNN_README.md (#13653)
- [Quantization] Support zero-size tensor input for quantization flow (#15031)
- Support 3D input for MKL-DNN softmax operator (#14818)
- Add primitive cache for MKL-DNN sum(elemwise_add operator (#14914)
- Fix reshape to add in-place back (#14903)
- [int8] Add MobileNetV2_1.0 & Re...
Apache MXNet (incubating) 1.4.1
Apache MXNet (incubating) 1.4.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache MXNet (incubating) 1.4.0 are advised to upgrade. You can install Apache MXNet (incubating) 1.4.1 from the usual place. Please review these Release Notes to learn about the bug fixes.
Bug-fixes
- Java bug-fix cherry pick (#14834)
- Use DEFAULT macro in C APIs (#14767) (#14789)
- Set idx2name for Optimizer object (#14703) (#14772)
- Add pin_device_id option to Gluon DataLoader (#14136) (#14771)
- Tidy up storage allocation and deallocation (#14480) (#14768)
- Add MXEnginePushAsync and MXEnginePushSync C APIs (#14615) (#14770)
- Less cudaGet/SetDevice calls in Gluon execution (#13764)
- Fix nightly build of 1.4.x (#14556)
- Memory fixes. Resolves #10867, and resolves #14080 (#14372) (#14586)
- Fixes for data links (#14526)
- Backport of Windows CI Fixes (#14420)
Apache MXNet (incubating) 1.4.0
MXNet Change Log
1.4.0
- New Features
- New Operators
- Feature improvements
- Frontend API updates
- Language API updates
- Performance benchmarks and improvements
- Bug fixes
- Licensing updates
- Improvements
- Deprecations
- Other
- How to build MXNet
- List of submodules used by Apache MXNet (Incubating) and when they were updated last
New Features
Java Inference API
Model inference is often managed in a production ecosystem using primarily Java/Scala tools and frameworks. This release seeks to alleviate the need for software engineers to write custom MXNet wrappers to fit their production environment.
Inference on a trained model has a couple of common use cases:
- Real-time or Online Inference - tasks that require immediate feedback, such as fraud detection
- Batch or Offline Inference - tasks that don't require immediate feedback, such as use cases where you have massive amounts of data and want to run inference or pre-compute inference results
Real-time Inference is often performed and deployed on popular web frameworks such as Tomcat, Netty, Jetty, etc., all of which use Java. Batch Inference is often performed on big data platforms such as Spark using Scala or Java.
With this project, we had the following goals:
- Build a new set of APIs that are Java friendly, compatible with Java 7+, and easy to use for inference.
- Lower the barrier to entry of consuming MXNet for production use cases.
More details can be found at the Java Inference API document.
Julia API
MXNet.jl is the Julia package of Apache MXNet. MXNet.jl brings flexible and efficient GPU computing and state-of-the-art deep learning to Julia. Feature highlights include:
- Efficient tensor/matrix computation across multiple devices, including multiple CPUs, GPUs and distributed server nodes.
- Flexible symbolic manipulation for composing and constructing state-of-the-art deep learning models.
Control Flow Operators (experimental)
Today we observe more and more dynamic neural network models, especially in the fields of natural language processing and graph analysis. The dynamics in these models come from multiple sources, including:
- Models are expressed with control flow, such as conditions and loops.
- NDArrays in a model may have dynamic shapes, meaning the NDArrays of a model or some of the NDArrays have different shapes for different batches.
- Models may want to use more dynamic data structures, such as lists or dictionaries.
It's natural to express dynamic models in frameworks with an imperative programming interface (e.g., Gluon, PyTorch, TensorFlow Eager). In this kind of interface, developers can use Python control flows, or NDArrays with any shape at any moment, or use Python lists and dictionaries to store data as they want. The problem with this approach is that it is highly dependent on the originating front-end programming language (mainly Python). A model implemented in one language can only run in the same language.
A common use case is that machine learning scientists want to develop their models in Python, whereas engineers who deploy the models usually have to use a different "production" language (e.g., Java or C). Gluon tries to close the gap between the model development and production deployment. Machine learning scientists design and implement their models in Python with the imperative interface, and then Gluon converts the implementations from imperative to symbolic by invoking hybridize() for model exporting.
The goal of this project is to enhance Gluon to turn a dynamic neural network into a static computation graph. The dynamic control flows are expressed by control flow operators with Gluon hybridization, and these are exported for deployment.
More information can be found at Optimize dynamic neural network models with control flow operators
MXNet Horovod Integration
Apache MXNet now supports distributed training using the Horovod framework. Horovod is an open source distributed framework created at Uber. It leverages efficient inter-GPU communication to distribute and aggregate model parameters across multiple workers, thus allowing efficient use of network bandwidth and scaling of training of deep learning models. To learn more about MXNet-Horovod integration, check out this blog.
SVRG Optimization
SVRG stands for Stochastic Variance Reduced Gradient, which was first introduced in the paper Accelerating Stochastic Gradient Descent using Predictive Variance Reduction in 2013. It is an optimization technique that complements SGD.
SGD is known for large scale optimization, but it suffers from slow convergence asymptotically due to the inherent variance. SGD approximates the full gradient using a small batch of samples which introduces variance. In order to converge faster, SGD often needs to start with a smaller learning rate.
SVRG remedies the slow convergence problem by keeping a version of the estimated weights that is close to the optimal parameters, and by maintaining the average of the full gradient over a full pass of the data. The average of the full gradients over all data is calculated with respect to the parameters from the last mth epoch. SVRG has provable guarantees for strongly convex smooth functions; a detailed proof can be found in section 3 of the paper. SVRG uses a different update rule than SGD: the gradient w.r.t. the current parameters, minus the gradient w.r.t. the parameters from the last mth epoch, plus the average of the gradients over all data.
Key Characteristics of SVRG:
- Explicit variance reduction
- Ability to use relatively large learning rate compared to SGD, which leads to faster convergence.
More details can be found at SVRG Optimization in MXNet Python Module
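The update rule described above can be sketched in plain Python on a toy one-dimensional least-squares problem. This is an illustrative sketch only, not the MXNet Python module: the data, the learning rate, and the names grad_i, full_grad, and svrg are assumptions made for the example.

```python
import random

# Toy problem (hypothetical data): minimize (1/n) * sum_i 0.5 * (w * x_i - y_i)^2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2, so the optimum is w = 2.0


def grad_i(w, i):
    """Gradient of the i-th sample's loss with respect to w."""
    return (w * xs[i] - ys[i]) * xs[i]


def full_grad(w):
    """Average gradient over all samples."""
    return sum(grad_i(w, i) for i in range(len(xs))) / len(xs)


def svrg(w0, lr=0.02, epochs=30, inner_steps=8, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        w_snapshot = w              # weights saved at the start of the pass
        mu = full_grad(w_snapshot)  # average gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.randrange(len(xs))
            # SVRG step: current gradient, minus snapshot gradient,
            # plus the average gradient over all data
            g = grad_i(w, i) - grad_i(w_snapshot, i) + mu
            w -= lr * g
    return w


w_final = svrg(w0=0.0)
```

Because the snapshot term cancels most of the per-sample noise, the correction shrinks as w approaches the snapshot, which is what allows a fixed, relatively large learning rate in this toy setting.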
Subgraph API (experimental)
MXNet can integrate with many different kinds of backend libraries, including TVM, MKLDNN, TensorRT, Intel nGraph and more. In general, these backends support a limited number of operators, so running computation in a model usually involves an interaction between backend-supported operators and MXNet operators. These backend libraries share some common requirements:
- TVM, MKLDNN and nGraph use customized data formats. Interaction between these backends and MXNet requires data format conversion.
- TVM, MKLDNN, TensorRT and nGraph fuse operators.
Integration with these backends should happen at the granularity of subgraphs instead of at the granularity of operators. To fuse operators, we need to divide a graph into subgraphs so that the operators in a subgraph can be fused into a single operator. To handle customized data formats, we should partition a computation graph into subgraphs as well. Each subgraph contains only TVM, MKLDNN or nGraph operators. In this way, MXNet converts data formats only when entering such a subgraph, and the operators inside a subgraph handle format conversion themselves if necessary. This makes interaction of TVM and MKLDNN with MXNet much easier. Neither the MXNet executor nor the MXNet operators need to deal with customized data formats.
Even though invoking these libraries from MXNet requires similar steps, the partitioning rule and the subgraph execution of these backends can be different. As such, we define the following interface for backends to customize graph partitioning and subgraph execution inside an operator.
More details can be found at PR 12157 and Subgraph API.
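As a rough illustration of partitioning at subgraph granularity, the sketch below splits a linear sequence of operator names into maximal runs that a backend supports. It is a toy model, not the Subgraph API: real graphs are not linear chains, and the supported-operator set and the partition function are hypothetical.

```python
from itertools import groupby

# Hypothetical set of operators that some backend can execute and fuse.
BACKEND_SUPPORTED = {"Convolution", "BatchNorm", "Activation"}


def partition(ops, supported=BACKEND_SUPPORTED):
    """Split a linear operator sequence into maximal runs.

    Each run is either handled entirely by the backend (a "subgraph")
    or entirely by the framework's own operators.
    Returns a list of (is_backend_subgraph, [op names]) tuples.
    """
    return [
        (on_backend, list(run))
        for on_backend, run in groupby(ops, key=lambda op: op in supported)
    ]


model = ["Convolution", "BatchNorm", "Activation", "Flatten",
         "FullyConnected", "Activation", "SoftmaxOutput"]
parts = partition(model)
```

In this model, data format conversion would only be needed at the boundaries between runs, rather than around every individual operator.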
JVM Memory Management
The MXNet S...
Apache MXNet (incubating) 1.3.1
MXNet Change Log
1.3.1
Bug fixes
- [MXNET-953] Fix oob memory read (v1.3.x) / #13118
Simple bugfix addressing an out-of-bounds memory read.
- [MXNET-969] Fix buffer overflow in RNNOp (v1.3.x) / #13119
This fixes a buffer overflow detected by ASAN.
- CudnnFind() usage improvements (v1.3.x) / #13123
This PR improves MXNet's use of cudnnFind() to address a few issues:
  - With the gluon imperative style, cudnnFind() is called during forward(), and so might have its timings perturbed by other GPU activity (including potentially other cudnnFind() calls).
  - With some CUDA driver versions, care is needed to ensure that the large I/O and workspace cudaMallocs() performed by cudnnFind() are immediately released and available to MXNet.
  - cudnnFind() makes both conv I/O and workspace allocations that must be covered by the GPU global memory headroom defined by MXNET_GPU_MEM_POOL_RESERVE. Per issue #12662, large convolutions can result in out-of-memory errors, even when MXNet's storage allocator has free memory in its pool.
This PR addresses these issues, providing the following benefits:
  - Consistent algo choice for a given convolution type in a model, both for instances in the same GPU and in other GPUs in a multi-GPU training setting.
  - Consistent algo choice from run to run, based on eliminating sources of interference of the cudnnFind() timing process.
  - Consistent model global memory footprint, both because of the consistent algo choice (algos can have markedly different workspace requirements) and changes to MXNet's use of cudaMalloc.
  - Increased training performance based on being able to consistently run with models that approach the GPU's full global memory footprint.
  - Adds a unittest for and solves issue #12662.
- [MXNET-922] Fix memleak in profiler (v1.3.x) / #13120
Fix a memleak reported locally by ASAN during a normal inference test.
- Fix lazy record io when used with dataloader and multi_worker > 0 (v1.3.x) / #13124
Fixes multi_worker data loader when record file is used. The MXRecordIO instance needs to acquire a new file handle after fork to be safely manipulated simultaneously. This fix also safely voids the previous temporary fixes #12093 and #11370.
- fixed symbols naming in RNNCell, LSTMCell, GRUCell (v1.3.x) / #13158
This fixes #12783, by assigning all nodes in hybrid_forward a unique name. Some operations were in fact performed without attaching the appropriate (time) prefix to the name, which makes serialized graphs non-deserializable.
- Fixed __setattr__ method of _MXClassPropertyMetaClass (v1.3.x) / #13157
Fixed the __setattr__ method.
- allow foreach on input with 0 length (v1.3.x) / #13151
Fix #12470. With this change, outs shape can be inferred correctly.
- Infer dtype in SymbolBlock import from input symbol (v1.3.x) / #13117
Fix for issue #11849.
Currently, Gluon SymbolBlock cannot import any symbol with a type other than fp32. All the parameters are created as FP32, leading to failure in importing the params when they are of type fp16, fp64, etc.
In this PR, we infer the type of the symbol being imported and create the SymbolBlock parameters with that inferred type.
Added the tests.
Documentation fixes
- Document the newly added env variable (v1.3.x) / #13156
Document the env variable MXNET_ENFORCE_DETERMINISM, added in PR #12992.
- fix broken links (v1.3.x) / #13155
This PR fixes broken links on the website.
- fix broken Python IO API docs (v1.3.x) / #13154
Fixes #12854: Data Iterators documentation is broken.
This PR manually specifies members of the IO module so that the docs will render as expected. This is a workaround in the docs to deal with a bug introduced in the Python code/structure since v1.3.0. See the comments for more info.
This PR also fixes another issue that may or may not be related. Cross references to same-named entities like name, shape, or type confuse Sphinx, and it seems to just link to whatever it last dealt with that has the same name, and not the current module. To fix this you have to be very specific. Don't use type, use np.type if that's what you want. Otherwise you might end up with mxnet.kvstore.KVStore.type. This is a known Sphinx issue, so it might be something we have to deal with for the time being.
This is important for any future modules: they should recognize this issue and make efforts to map the params and other elements.
- add/update infer_range docs (v1.3.x) / #13153
This PR adds or updates the docs for the infer_range feature.
  - Clarifies the param in the C op docs
  - Clarifies the param in the Scala symbol docs
  - Adds the param for the Scala ndarray docs
  - Adds the param for the Python symbol docs
  - Adds the param for the Python ndarray docs
Other Improvements
- [MXNET-1179] Enforce deterministic algorithms in convolution layers (v1.3.x) / #13152
Some of the CUDNN convolution algorithms are non-deterministic (see issue #11341). This PR adds an env variable to enforce determinism in the convolution operators. If set to true, only deterministic CUDNN algorithms will be used. If no deterministic algorithm is available, MXNet will error out.
Submodule updates
- update mshadow (v1.3.x) / #13122
Update mshadow for omp acceleration when nvcc is not present
Known issues
The test test_operator.test_dropout has issues and has been disabled on the branch:
- Disable flaky test test_operator.test_dropout (v1.3.x) / #13200
For more information and examples, see full release notes
Apache MXNet (incubating) 1.3.0
MXNet Change Log
1.3.0
New Features - Gluon RNN layers are now HybridBlocks
- In this release, Gluon RNN layers such as gluon.rnn.RNN, gluon.rnn.LSTM, and gluon.rnn.GRU become HybridBlocks as part of the gluon.rnn improvements project (#11482).
- This is the result of newly available fused RNN operators added for CPU: LSTM (#10104), vanilla RNN (#11399), GRU (#10311).
- Many dynamic networks that are based on Gluon RNN layers can now be completely hybridized, exported, and used in the inference APIs in other language bindings such as R, Scala, etc.
MKL-DNN improvements
- Introducing more functionality support for MKL-DNN as follows:
New Features - Gluon Model Zoo Pre-trained Models
- Gluon Vision Model Zoo now provides MobileNetV2 pre-trained models (#10879) in addition to AlexNet, DenseNet, Inception V3, MobileNetV1, ResNet V1 and V2, SqueezeNet 1.0 and 1.1, and VGG pretrained models.
- Updated pre-trained models provide state-of-the-art performance on all resnetv1, resnetv2, and vgg16, vgg19, vgg16_bn, vgg19_bn models (#11327 #11860 #11830).
New Features - Clojure package (experimental)
- MXNet now supports the Clojure programming language. The MXNet Clojure package brings flexible and efficient GPU computing and state-of-the-art deep learning to Clojure. It enables you to write seamless tensor/matrix computation with multiple GPUs in Clojure. It also lets you construct and customize state-of-the-art deep learning models in Clojure, and apply them to tasks such as image classification and data science challenges. (#11205)
- Check out examples and API documentation here.
New Features - Synchronized Cross-GPU Batch Norm (experimental)
- Gluon now supports Synchronized Batch Normalization (#11502).
- This enables stable training on large-scale networks with high memory consumption such as FCN for image segmentation.
New Features - Sparse Tensor Support for Gluon (experimental)
- Sparse gradient support is added to gluon.nn.Embedding. Set sparse_grad=True to enable when constructing the Embedding block. (#10924)
- Gluon Parameter now supports "row_sparse" storage type, which reduces communication cost and memory consumption for multi-GPU training for large models. gluon.contrib.nn.SparseEmbedding is an example empowered by this. (#11001, #11429)
- Gluon HybridBlock now supports hybridization with sparse operators (#11306).
New Features - Control flow operators (experimental)
- This is the first step towards optimizing dynamic neural networks with variable computation graphs, by adding symbolic and imperative control flow operators. Proposal.
- New operators introduced: foreach(#11531), while_loop(#11566), cond(#11760).
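The looping behaviour of foreach can be modelled with an ordinary Python loop. The sketch below is a reference model of the semantics only (a body applied to each slice of the input while loop states are threaded through); it is not the MXNet operator, and the running_sum body is a made-up example.

```python
def foreach(body, data, init_states):
    """Reference model: run body on each element of data in order,
    threading the loop states from one iteration to the next."""
    states = init_states
    outputs = []
    for item in data:
        out, states = body(item, states)
        outputs.append(out)
    return outputs, states


def running_sum(item, states):
    """Example body: emit the running total and carry it as state."""
    total = states[0] + item
    return total, [total]


outs, final_states = foreach(running_sum, [1, 2, 3, 4], [0])
empty_outs, empty_states = foreach(running_sum, [], [0])
```

Expressing the loop as a single operator with an explicit body and explicit states is what lets the same computation be captured in a static graph instead of depending on the front-end language's for statement.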
New Features - Scala API Improvements (experimental)
- Improvements to MXNet Scala API usability(#10660, #10787, #10991)
- Symbol.api and NDArray.api provide a new set of functions that have complete definitions for all arguments.
- Please see this Type safe API design document for more details.
New Features - Rounding GPU Memory Pool for dynamic networks with variable-length inputs and outputs (experimental)
- MXNet now supports a new memory pool type for GPU memory (#11041).
- Unlike the default memory pool, which requires an exact size match to reuse released memory chunks, this new memory pool uses exponential-linear rounding so that similar-sized memory chunks can all be reused, which is more suitable for workloads with dynamic-shape inputs and outputs. Set environment variable MXNET_GPU_MEM_POOL_TYPE=Round to enable.
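One way to picture exponential-linear rounding is a function that maps a requested size to a bucket size: powers of two for small requests, multiples of a fixed step for large ones. The sketch below is an assumption-laden illustration, not MXNet's allocator; the function name round_size and the cutoff parameter and its default are hypothetical.

```python
def round_size(nbytes, linear_cutoff_log2=24):
    """Round an allocation size up to a bucket size.

    Up to the cutoff (2 ** linear_cutoff_log2 bytes), sizes are rounded
    up to the next power of two (exponential buckets); above it, to the
    next multiple of the cutoff (linear buckets).
    The cutoff value is an assumption made for this sketch.
    """
    if nbytes <= 0:
        raise ValueError("size must be positive")
    cutoff = 1 << linear_cutoff_log2
    if nbytes <= cutoff:
        return 1 << (nbytes - 1).bit_length()
    return -(-nbytes // cutoff) * cutoff  # ceiling division, then scale


# Two slightly different requests land in the same bucket, so a chunk
# released by one can be reused by the other.
small_a = round_size(1000)
small_b = round_size(1020)
large = round_size(5000, linear_cutoff_log2=12)
```

With exact-size matching, the 1000-byte and 1020-byte requests would never share a chunk; with rounding they both map to the same bucket, at the cost of some internal fragmentation.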
New Features - Topology-aware AllReduce (experimental)
- This feature uses trees to perform the Reduce and Broadcast. It uses the idea of minimum spanning trees to build a binary tree Reduce communication pattern. This topology-aware approach reduces the limitations for single-machine communication shown by methods like parameter server and NCCL ring reduction. It is an experimental feature (#11591).
- Paper followed for implementation: Optimal message scheduling for aggregation.
- Set environment variable MXNET_KVSTORE_USETREE=1 to enable.
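The binary tree Reduce pattern itself can be sketched in a few lines of Python. This toy version only shows the pairwise reduction order and counts communication rounds; it does not model link topology or how MXNet chooses its trees, and the function name tree_reduce is hypothetical.

```python
def tree_reduce(values):
    """Sum per-device values with a binary-tree pattern.

    Returns (total, rounds), where rounds is the number of communication
    steps; the pairs within one round are independent of each other.
    """
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine neighbours pairwise: (0, 1), (2, 3), ...
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd device out is carried to the next round
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
```

For eight devices the reduction finishes in three rounds, compared with seven sequential additions if a single device accumulated every value in turn.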
New Features - Export MXNet models to ONNX format (experimental)
- With this feature, MXNet models can now be exported to ONNX format (#11213). Currently, MXNet supports ONNX v1.2.1. API documentation.
- Check out this tutorial, which shows how to use the MXNet-to-ONNX exporter APIs. The exporter produces an ONNX protobuf so that those models can be imported in other frameworks for inference.
New Features - TensorRT Runtime Integration (experimental)
- TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MXNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference, and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches.
- This release introduces runtime integration of TensorRT into MXNet, in order to accelerate inference. (#11325)
- Currently, it is in the contrib package.
New Examples - Scala
- Refurbished Scala examples with improved API, documentation and CI test coverage. (#11753, #11621)
- Now all Scala examples have:
- No bugs blocking execution midway
- A good README to start with
- Type-safe API usage inside
- Monitoring in CI on each PR run
Maintenance - Flaky Tests improvement effort
Maintenance - MXNet Model Backwards Compatibility Checker
- This tool (#11626) helps in ensuring consistency and sanity while performing inference on the latest version of MXNet using models trained on older versions of MXNet.
- This tool will help in detecting issues earlier in the development cycle which break backwards compatibility on MXNet and would contribute towards ensuring a healthy and stable release of MXNet.
Maintenance - Integrated testing for "the Straight Dope"
- "Deep Learning - The Straight Dope" is a deep learning book based on Apache MXNet Gluon, with contributions from many Gluon users.
- Now the testing of this book is integrated in the nightly tests.
Bug-fixes
- Fix gperftools/jemalloc and lapack warning bug. (#11110)
- Fix mkldnn performance regression + improve test logging (#11262)
- Fix row_sparse_param.save() (#11266)
- Fix trainer init_kvstore (#11266)
- Fix axis Bug in MKLDNN Softmax (#11335)
- Fix 'AttributeError: '_thread._local' object has no attribute 'value'' on distributed processing applications (#11332)
- Fix recordfile dataset with multi worker (#11370)
- Manually check node existence in CachedOp (#11545)
- Javadoc fix (#11239)
- Fix bugs in MKLDNN operators to handle the kAddTo request (#11129)
- Fix InferStorage for sparse fallback in FullyConnected (#11498)
- Fix batchnorm problem with sparse matrices when fix_gamma=True (#11656)
- Fix rnn layer save (#11776)
- Fix BucketSentenceIter bug related to #11430 (#11580)
- Fix for _backward_softsign activation (#11827)
- Fix a bug in CachedOp. (#11675)
- Fix quantization divide by zero errors (#11833)
- Refactor R optimizers to fix memory leak (#11374)
- Avoid use of troublesome cudnnFind() results when grad_req='add' (#11338)
- Fix shared memory with gluon dataloader, add option pin_memory (#11908)
- Fix quantized graph pass bug (#11937)
- Fix MXPredReshape in the c_predict_api (#11493)
- Fix the topk regression issue (#12197)
- Fix image-classification example and add missing optimizers w/ momentum support (#11826)