13 Jul 06:14

FDecaYed

v23.06.00 (v1.0) Latest


What’s Changed

New Features

Added support for row-slicing as a parallel strategy
Added support for data-parallel as a parallel strategy
Allow mix-matching data-parallel, table-parallel, row-slicing, and column-slicing. Refer to User Guide for more details.
Added IntegerLookup layer that supports on-the-fly vocabulary building, on both CPU and GPU
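As a rough illustration of what "on-the-fly vocabulary building" means, the sketch below assigns a new contiguous index to every key the first time it is seen. It is a pure-Python toy and not the library's IntegerLookup implementation; the class name `OnTheFlyVocabulary`, its `lookup` method, and the choice to start indices at 0 are all assumptions made for this example.

```python
class OnTheFlyVocabulary:
    """Toy lookup that grows its vocabulary as new keys arrive.

    Hypothetical helper for illustration only; not part of the library.
    """

    def __init__(self):
        self._index_of = {}  # raw key -> contiguous index

    def lookup(self, keys):
        """Return one index per key, adding unseen keys to the vocabulary."""
        indices = []
        for key in keys:
            if key not in self._index_of:
                # First time this key is seen: give it the next free index.
                self._index_of[key] = len(self._index_of)
            indices.append(self._index_of[key])
        return indices

    def __len__(self):
        return len(self._index_of)


vocab = OnTheFlyVocabulary()
first = vocab.lookup([1007, 42, 1007])   # 1007 and 42 are new keys
second = vocab.lookup([42, 9000])        # 42 is known, 9000 is new
```

No vocabulary file has to be prepared ahead of time in this scheme: repeated keys map to the same index, and the vocabulary size grows only when an unseen key appears.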

Breaking Changes

Added NVIDIA cuCollections as submodule for GPU hash map support
Added support for TensorFlow 2.12. Note that this change breaks the build with TF 2.09 and earlier.

Improvements

Improved package import

Bug Fixes

Fixed input offset overflow caused by automatic table concatenation
Fixed potential graph mismatch problems in broadcast

Full Changelog: v23.03.00...v23.06.00

19 Apr 13:13

FDecaYed

v23.03.00

What’s Changed

New Features

Added support for the NVIDIA Hopper™ architecture family (compute capability 9.0).
Added support for the Keras Model.fit API.
Added support for Horovod callbacks when using hybrid data/model parallelism.

Breaking Changes

Added support for TensorFlow 2.12. Note that this change breaks the build with TF 2.09 and earlier.
Horovod 0.27 or later is now required.

Improvements

Improved unit tests

Bug Fixes

Use tf.shape for graph mode support by @edknv in #6

New Contributors

@edknv made their first contribution in #6

Full Changelog: v0.3...v23.03.00

Contributors

edknv

13 Feb 06:13

FDecaYed

v0.3

What’s Changed

New Features

CUDA 12 support
Automatic concatenation of multiple embedding tables for greatly improved speed
Support model parallelism with user-defined custom Keras layers through the DistributedEmbedding wrapper
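The table concatenation feature listed above can be pictured as stacking several tables into one and shifting each table's row ids by the number of rows that precede it. The sketch below is a plain-Python illustration under that assumption, not the library's code; `table_offsets`, `shift_ids`, and the int32 guard are hypothetical names and choices introduced here.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # assumed id width for this illustration


def table_offsets(table_sizes):
    """Return the starting row of each table inside the concatenated table."""
    return [0] + list(accumulate(table_sizes))[:-1]


def shift_ids(ids_per_table, table_sizes):
    """Map per-table row ids to row ids in the concatenated table."""
    offsets = table_offsets(table_sizes)
    total_rows = sum(table_sizes)
    # Shifted ids can reach total_rows - 1, so check that they still fit.
    if total_rows - 1 > INT32_MAX:
        raise OverflowError("concatenated table is too large for int32 ids")
    return [[i + off for i in ids] for ids, off in zip(ids_per_table, offsets)]


sizes = [10, 5, 20]                                  # rows in three tables
offsets = table_offsets(sizes)                       # start row of each table
shifted = shift_ids([[0, 9], [4], [0, 19]], sizes)   # ids into the big table
```

A single lookup into the combined table can then replace one lookup per table. The trade-off is that the shifted ids grow with the total row count, which is why this sketch checks the id width explicitly.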

Improvements

Support cases where the number of workers is greater than the number of tables.
In corner cases where different slices of a table are placed on the same worker, they are now merged into a single slice.
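One way to picture the slice merging described above is as a pass over the placement list that joins adjacent slices of the same table on the same worker. The sketch below assumes slices are contiguous column ranges with an exclusive end; that representation, and the `merge_slices` helper itself, are assumptions for illustration rather than the library's internals.

```python
def merge_slices(placements):
    """Merge adjacent column ranges of the same table on the same worker.

    placements: list of (worker, table, start_col, end_col), end exclusive.
    Hypothetical helper for illustration only.
    """
    merged = []
    # Sorting groups entries by worker, then table, then start column.
    for worker, table, start, end in sorted(placements):
        last = merged[-1] if merged else None
        if last and last[0] == worker and last[1] == table and last[3] == start:
            # Same worker and table, and the ranges touch: extend the slice.
            merged[-1] = (worker, table, last[2], end)
        else:
            merged.append((worker, table, start, end))
    return merged


placements = [
    (0, "user", 0, 8),
    (0, "user", 8, 16),   # second slice of "user" lands on worker 0 as well
    (1, "item", 0, 16),
]
result = merge_slices(placements)
```

Slices of the same table that sit on different workers are left untouched, since merging only makes sense when both pieces already live on one device.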

Breaking Changes

Moved submodule from CUB to NVIDIA Thrust for better compatibility

Bug Fixes

Better error handling in set_weight() when weights are not initialized
Better error handling when the global batch size is not divisible by the number of workers
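The divisibility requirement above comes from splitting one global batch evenly across workers. A minimal sketch of such a check, with a hypothetical function name and error message that are not taken from the library, might look like this:

```python
def per_worker_batch_size(global_batch_size, num_workers):
    """Return the per-worker batch size, or raise if the split is uneven.

    Hypothetical helper for illustration only.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if global_batch_size % num_workers != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible "
            f"by number of workers {num_workers}"
        )
    return global_batch_size // num_workers


local = per_worker_batch_size(1024, 8)
```

Failing early with a descriptive message is usually easier to debug than a shape mismatch that surfaces later in a collective operation.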

Full Changelog: v0.2...v0.3

09 Feb 08:06

FDecaYed

v0.2

What’s Changed

Breaking Changes

Added NVIDIA CUB as a new submodule dependency

New Features

SparseTensor is now supported as embedding input, in addition to dense Tensor and RaggedTensor.
Added support and an example for the Keras model.fit() API through a custom train_step() function

Improvements

Improved embedding lookup speed when input is multi-hot with combiner.
Improved embedding lookup speed when input is one-hot, regardless of its combiner and format (Tensor, SparseTensor, or RaggedTensor).
Added support for data-parallel input, CPU embedding, and the TF native embedding API as options in the benchmark

Bug Fixes

Fixed build with TensorFlow 2.10+
Fixed a bug where the batch dimension could be None at an early stage in graph mode

Full Changelog: v0.1...v0.2

09 Feb 07:40

FDecaYed

v0.1 Pre-release


Initial release
