Release Notes

What's New in Version 3.0.1

  • DLRM Inference Benchmark: We've added two detailed Jupyter notebooks that illustrate how to train, deploy, and benchmark a DLRM model with HugeCTR. The inference notebook demonstrates how to create the Triton and HugeCTR backend configs, prepare the inference data, and deploy on Triton Inference Server a model trained in the companion notebook. It also shows how to benchmark throughput and latency with the Triton Performance Analyzer (a rough client-side sketch follows this list). For more details, check out our HugeCTR inference repository.
  • FP16-Specific Optimization in More Dense Layers: We've optimized the DotProduct, ELU, and Sigmoid layers with __half2 vectorized loads and stores so that they make better use of device memory bandwidth. Most layers are now optimized this way, with the exception of the MultiCross, FmOrder2, ReduceSum, and Multiply layers.
  • More Finely Tunable Synthetic Data Generator: Our new data generator can produce uniformly distributed datasets in addition to power-law based ones. Instead of specifying a total vocabulary_size and max_nnz, you can now specify this information per categorical feature. See our user guide for the updated usage.
  • Decreased Memory Demands of Trained Model Exportation: To prevent out-of-memory errors when saving a trained model that includes a very large embedding table, we've reduced the amount of memory allocated by the related functions.
  • CUDA Graph Compatible Dropout Layer: The HugeCTR Dropout layer now uses cuDNN by default so that it can be used together with CUDA Graph. In the previous version, using Dropout implicitly turned off CUDA Graph.
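
For a quick, client-side view of the benchmarking described in the DLRM inference bullet above, the sketch below times repeated requests against a HugeCTR model served by Triton using the official tritonclient Python package. The model name (dlrm), the input/output tensor names (DES, CATCOLUMN, ROWINDEX, OUTPUT0), and the feature dimensions are illustrative assumptions; take the real values from the backend configs the notebook generates, and rely on the Triton Performance Analyzer for rigorous throughput and latency measurements.

```python
# Hypothetical latency/throughput probe for a HugeCTR model served by Triton.
# Tensor names, dtypes, and shapes below are assumptions; check your generated
# Triton/HugeCTR backend config for the actual values.
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size = 64
dense_dim = 13          # assumed number of dense features per sample
num_cat = 26            # assumed number of categorical features per sample

dense = np.random.rand(1, batch_size * dense_dim).astype(np.float32)
keys = np.random.randint(0, 10000, size=(1, batch_size * num_cat), dtype=np.int64)
row_index = np.arange(batch_size * num_cat + 1, dtype=np.int32).reshape(1, -1)

inputs = [
    httpclient.InferInput("DES", list(dense.shape), "FP32"),
    httpclient.InferInput("CATCOLUMN", list(keys.shape), "INT64"),
    httpclient.InferInput("ROWINDEX", list(row_index.shape), "INT32"),
]
inputs[0].set_data_from_numpy(dense)
inputs[1].set_data_from_numpy(keys)
inputs[2].set_data_from_numpy(row_index)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

latencies = []
for _ in range(100):
    start = time.perf_counter()
    client.infer("dlrm", inputs=inputs, outputs=outputs)
    latencies.append(time.perf_counter() - start)

print(f"p50 latency: {1e3 * np.percentile(latencies, 50):.2f} ms, "
      f"throughput: {batch_size / np.mean(latencies):.0f} samples/s")
```

A timing loop like this only gives a quick sanity check; concurrency sweeps and detailed latency breakdowns are what the Triton Performance Analyzer is for.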

What’s New in Version 3.0

  • Inference Support: To streamline the recommender system workflow, we’ve implemented a custom HugeCTR backend on the NVIDIA Triton Inference Server. The HugeCTR backend leverages the embedding cache and parameter server to efficiently manage embeddings of different sizes and models in a hierarchical manner. For additional information, see our inference repository.

  • New High-Level API: You can now also construct and train your models through the Python interface with our new high-level API. See our preview example code to see how it works; an abbreviated, hedged sketch also follows this list.

  • FP16 Support in More Layers: All the layers except MultiCross support mixed precision mode. We’ve also optimized some of the FP16 layer implementations based on vectorized loads and stores.

  • Enhanced TensorFlow Embedding Plugin: Our embedding plugin now supports LocalizedSlotSparseEmbeddingHash mode. With this enhancement, the DNN model no longer needs to be split into two parts since it now connects with the embedding op through MirroredStrategy within the embedding layer.

  • Extended Model Oversubscription: We’ve extended the model oversubscription feature to support LocalizedSlotSparseEmbeddingHash and LocalizedSlotSparseEmbeddingHashOneHot.

  • Epoch-Based Training Enhancement: The num_epochs option in the Solver clause can now be used with the Raw dataset format.

  • Deprecation of the eval_batches Parameter: The eval_batches parameter has been deprecated and replaced with the max_eval_batches and max_eval_samples parameters. In epoch mode, these parameters control the maximum number of evaluations. An error message will appear when attempting to use the eval_batches parameter.

  • MultiplyLayer Renamed: To clarify what the MultiplyLayer does, it was renamed to WeightMultiplyLayer.

  • Optimized Initialization Time: HugeCTR’s initialization time, which includes the GEMM algorithm search and parameter initialization, was significantly reduced.

  • Sample Enhancements: Our samples now rely upon the Criteo 1TB Click Logs dataset instead of the Kaggle Display Advertising Challenge dataset. Our preprocessing scripts (Perl, Pandas, and NVTabular) have also been unified and simplified.

  • Configurable DataReader Workers: You can now specify the number of data reader workers, which run in parallel, with the num_workers parameter. Its default value is 12. However, if you are using the Parquet data reader, you can't configure num_workers since it always corresponds to the number of active GPUs.
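
To give a flavor of the high-level API mentioned above, here is an abbreviated sketch of the construct/compile/fit flow; the layer definitions are elided. The class and argument names (CreateSolver, DataReaderParams, CreateOptimizer, Model, and so on) are assumptions based on how the API stabilized in later releases and may differ from the 3.0 preview; the preview example code in the repository is the authoritative reference. Note max_eval_batches replacing eval_batches and the num_workers data reader setting, both described above.

```python
# Abbreviated sketch of the Python high-level API; names and signatures are
# assumptions based on later HugeCTR releases and may not match the 3.0 preview.
import hugectr

solver = hugectr.CreateSolver(
    max_eval_batches=300,   # replaces the deprecated eval_batches parameter
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0]],            # one node, GPU 0
    repeat_dataset=True,
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./file_list.txt"],
    eval_source="./file_list_test.txt",
    check_type=hugectr.Check_t.Sum,
    num_workers=12,         # configurable data reader workers (default: 12)
)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

model = hugectr.Model(solver, reader, optimizer)
# ... model.add(hugectr.Input(...)), model.add(hugectr.SparseEmbedding(...)),
# and model.add(hugectr.DenseLayer(...)) calls would define the network here ...
model.compile()
model.summary()
model.fit(max_iter=10000, display=200, eval_interval=1000)
```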

What's New in Version 2.3

  • New Python Interface: To enhance interoperability with NVTabular and other Python-based libraries, we're introducing a new Python interface for HugeCTR (a brief sketch of the interface follows this list).

  • HugeCTR Embedding with TensorFlow: To help users easily integrate HugeCTR's optimized embedding into their TensorFlow workflow, we now offer the HugeCTR embedding layer as a TensorFlow plugin. To better understand how to install, use, and verify it, see our Jupyter notebook tutorial. It also demonstrates how you can create a new Keras layer, EmbeddingLayer, based on the hugectr.py helper code that we provide.

  • Model Oversubscription: To enable models with embedding tables that exceed a single GPU's memory limit, we've added a new model oversubscription feature, giving you the ability to load a subset of an embedding table onto the GPU in a coarse-grained, on-demand manner during the training stage.

  • TF32 Support: We've added support for TensorFloat-32 (TF32), a new math mode for third-generation Tensor Cores on the Ampere architecture. TF32 uses the same 10-bit mantissa as FP16 to ensure accuracy while providing the same range as FP32 with an 8-bit exponent. Since TF32 is an internal data type that accelerates FP32 GEMM computations with Tensor Cores, you can simply turn it on with a newly added configuration option. For additional information, see Solver.

  • Enhanced AUC Implementation: To enhance the performance of our AUC computation on multi-node environments, we've redesigned our AUC implementation to improve how the computational load gets distributed across nodes.

  • Epoch-Based Training: In addition to the max_iter parameter, you can now set the num_epochs parameter in the Solver clause within the configuration file. Currently, this mode can only be used with the Norm dataset format and its corresponding file lists. All dataset formats will be supported in the future.

  • New Multi-Node Training Tutorial: To better support multi-node training use cases, we've added a new step-by-step tutorial.

  • Power Law Distribution Support with Data Generator: Because of the increased need for generating random datasets whose categorical features follow a power-law distribution, we've revised our data generation tool to support this use case (a stand-alone illustration of such a long-tail distribution follows this list). For additional information, refer to the --long-tail description [here](../docs/hugectr_user_guide.md#Generating Synthetic Data and Benchmarks).

  • Multi-GPU Preprocessing Script for Criteo Samples: Multiple GPUs can now be used when preparing the dataset for our samples. For additional information, see how preprocess_nvt.py is used to preprocess the Criteo dataset for DCN, DeepFM, and W&D samples.
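
For the new Python interface noted at the top of this list, the outline below shows the general driving loop. It is a rough sketch: the helper and method names (solver_parser_helper, Session, start_data_reading, train) are recalled from sample scripts of this era and may differ in detail in your installed version, and the model architecture itself still comes from the JSON configuration file.

```python
# Hedged outline of driving training through the 2.3 Python interface.
# Names and signatures are assumptions recalled from sample scripts; the model
# architecture is still described by the JSON configuration file.
from hugectr import Session, solver_parser_helper

solver_config = solver_parser_helper(
    batchsize=16384,
    batchsize_eval=16384,
    vvgpu=[[0]],                 # one node, GPU 0
    use_mixed_precision=False,
    repeat_dataset=True,
)
sess = Session(solver_config, "./dcn.json")  # path to your model's JSON config
sess.start_data_reading()
for i in range(10000):
    sess.train()
```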
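The following stand-alone snippet only illustrates what a long-tail (power-law) categorical distribution looks like, independent of HugeCTR's data generator; the exponent and vocabulary size are arbitrary choices, and the actual --long-tail behavior is documented in the user guide linked above.

```python
# Stand-alone illustration of a long-tail (power-law) categorical distribution,
# similar in spirit to what the --long-tail option produces. The exponent and
# vocabulary size are arbitrary; they are not HugeCTR defaults.
import numpy as np

vocab_size = 10_000
alpha = 1.2                                  # power-law exponent (assumed)
ranks = np.arange(1, vocab_size + 1)
probs = ranks.astype(np.float64) ** -alpha
probs /= probs.sum()

rng = np.random.default_rng(0)
keys = rng.choice(vocab_size, size=1_000_000, p=probs)

top_100_share = np.isin(keys, np.arange(100)).mean()
print(f"The 100 most frequent keys cover {top_100_share:.1%} of all samples")
```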

Known Issues

  • Since the automatic plan file generator isn't able to handle systems that contain one GPU, you must manually create a JSON plan file with the following parameters and rename it using the name listed in the HugeCTR configuration file: {"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]} (a small helper for writing this file follows this list).

  • If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: RuntimeWarning: divide by zero encountered in true_divide. This is an erroneous warning message and should be ignored.

  • The current plan file generator doesn't support systems where NVSwitch or a full peer-to-peer connection between all nodes is unavailable.

  • Users need to set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable (for example, export CUDA_DEVICE_ORDER=PCI_BUS_ID) to ensure that the CUDA runtime and driver have a consistent GPU numbering.

  • LocalizedSlotSparseEmbeddingOneHot only supports single-node machines in which all the GPUs are fully connected, such as via NVSwitch.

  • HugeCTR version 3.0 crashes when running the DLRM sample on DGX-2 due to a CUDA Graph issue. To run the sample on DGX-2, disable CUDA Graph by setting the cuda_graph parameter to false, even though it degrades performance slightly. This issue doesn't exist when using the DGX A100.

  • The HugeCTR embedding TensorFlow plugin only works with single-node machines.

  • The HugeCTR embedding TensorFlow plugin assumes that the input keys are of type int64 and that its output is of type float.

  • If the number of samples in a dataset is not divisible by the batch size in epoch mode (when num_epochs is used instead of max_iter), the few remaining samples are truncated. If the training dataset is large enough, the impact is negligible. If you want to minimize the number of wasted samples, try adjusting the number of data reader workers. For example, with a file list source, set the num_workers parameter to a divisor of the number of data files in the file list.

  • The MultiCross layer doesn't support mixed precision mode yet.
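
For the single-GPU plan file described in the first known issue, a small helper like the following can write it out; the output file name below is a placeholder and must be renamed to match the plan file name referenced in your HugeCTR configuration file.

```python
# Writes the single-GPU all2all plan file described in the first known issue.
# The output file name below is a placeholder; rename it to match the plan file
# name listed in your HugeCTR configuration file.
import json

plan = {
    "type": "all2all",
    "num_gpus": 1,
    "main_gpu": 0,
    "num_steps": 1,
    "num_chunks": 1,
    "plan": [[0, 0]],
    "chunks": [1],
}

with open("all2all_plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```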