
Merlin: HugeCTR V3.9 (Merlin 22.08)

@minseokl released this 23 Aug 05:33

What's New in Version 3.9

  • Updates to 3G Embedding:

    • Sparse Operation Kit (SOK) is updated to use the HugeCTR 3G embedding as a developer preview feature.
      For more information, refer to the Python programs in the sparse_operation_kit/experiment/benchmark/dlrm directory of the repository on GitHub.
    • Dynamic embedding table mode is added.
      The mode is based on cuCollections with some functionality enhancements.
      A dynamic embedding table grows its size when the table is full so that you no longer need to configure the memory usage information for embedding.
      For more information, refer to the embedding_storage/dynamic_embedding_storage directory of the repository on GitHub.
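      The following is a minimal Python sketch of the idea, not the actual cuCollections-based implementation: a table that doubles its capacity whenever it fills up, so no memory budget has to be configured up front.

        import numpy as np

        class DynamicEmbeddingTable:
            """Toy growable table (illustrative only, not the HugeCTR code)."""

            def __init__(self, emb_vec_size, initial_capacity=4):
                self.emb_vec_size = emb_vec_size
                self.capacity = initial_capacity
                self.slots = np.zeros((initial_capacity, emb_vec_size), dtype=np.float32)
                self.key_to_slot = {}

            def lookup(self, key):
                # Insert unseen keys; double the backing storage when it is full.
                if key not in self.key_to_slot:
                    if len(self.key_to_slot) == self.capacity:
                        self.capacity *= 2
                        grown = np.zeros((self.capacity, self.emb_vec_size), dtype=np.float32)
                        grown[: self.slots.shape[0]] = self.slots
                        self.slots = grown
                    slot = len(self.key_to_slot)
                    self.key_to_slot[key] = slot
                    self.slots[slot] = np.random.randn(self.emb_vec_size).astype(np.float32)
                return self.slots[self.key_to_slot[key]]

        table = DynamicEmbeddingTable(emb_vec_size=8)
        vectors = [table.lookup(k) for k in (7, 42, 7, 1001, 13, 99)]  # grows past 4 keys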
  • Enhancements to the HPS Plugin for TensorFlow:
    This release includes improvements to the interoperability of SOK and HPS.
    The plugin now supports a sparse lookup layer (see the sketch below).
    The documentation for the HPS plugin is also enhanced.
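    A minimal sketch of the new sparse lookup layer follows. The configuration file, model name, and argument names are assumptions based on the HPS plugin examples; verify them against the documentation for your release.

      import tensorflow as tf
      import hierarchical_parameter_server as hps

      # One-time HPS initialization from a parameter-server JSON configuration
      # ("hps_conf.json" and all sizes below are illustrative).
      hps.Init(global_batch_size=1024, ps_config_file="hps_conf.json")

      sparse_lookup = hps.SparseLookupLayer(
          model_name="demo_model",  # must match a model in the HPS configuration
          table_id=0,
          emb_vec_size=16,
          sp_combiner="mean",       # reduces each row's embeddings, in the
      )                             # spirit of tf.nn.embedding_lookup_sparse

      # The layer consumes a tf.sparse.SparseTensor of keys per sample.
      sp_ids = tf.sparse.from_dense(tf.constant([[7, 42, 9], [13, 5, 0]], dtype=tf.int64))
      embeddings = sparse_lookup(sp_ids=sp_ids)  # expected shape: [2, 16]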

  • Enhancements to the HPS Backend for Triton Inference Server:
    This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server.
    The enhancement enables deploying a TensorFlow model with large embedding tables with Triton by leveraging HPS.
    For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository on GitHub.

  • New Multi-Node Tutorial:
    The multi-node training tutorial is new.
    It shows how to use HugeCTR to train a model with multiple nodes and is based on our most recent Docker container.
    The tutorial should be useful to users who do not have a cluster with a job scheduler, such as Slurm Workload Manager, installed.
    The update addresses an issue that was first reported in GitHub issue 305.

  • Support Offline Inference for MMoE:
    This release includes MMoE offline inference where both per-class AUC and average AUC are provided.
    When the model has more than one class AUC, the output includes a line like the following example:

    [HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: {0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136
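
    The macro-averaging AUC is the unweighted mean of the per-class AUCs, which you can verify from the log line above; the trailing digits differ only because the per-class values in the log are rounded for display.

      # Reproduce the macro-averaging AUC from the log line above.
      per_class_auc = [0.482141, 0.440781]
      macro_auc = sum(per_class_auc) / len(per_class_auc)
      print(macro_auc)  # 0.461461, matching the logged 0.46146124... up to rounding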
  • Enhancements to the API for the HPS Database Backend:
    This release includes several enhancements to the API for the DatabaseBackend class.
    For more information, see database_backend.hpp and the header files for other database backends in the HugeCTR/include/hps directory of the repository.
    The enhancements are as follows:

    • You can now specify a maximum time budget, in nanoseconds, for queries so that you can build an application that must operate within strict latency limits.
      Fetch queries return execution control to the caller if the time budget is exhausted.
      The unprocessed entries are indicated to the caller through a callback function (see the sketch after this list).
    • The dump and load_dump methods are new.
      These methods support saving and loading embedding tables from disk.
      The methods support a custom binary format and the RocksDB SST table file format.
      These methods enable you to import and export embedding table data between your custom tools and HugeCTR.
    • The find_tables method is new.
      The method enables you to discover all table data that is currently stored for a model in a DatabaseBackend instance.
      A new overloaded method for evict is added that can process the results from find_tables to quickly and simply drop all the stored information that is related to a model.
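
    The interface itself is C++ (see database_backend.hpp); the following Python sketch only illustrates the time-budget contract described in the first item above, and every name in it is hypothetical.

      import time

      def fetch_with_budget(table, keys, time_budget_ns, on_miss):
          # Hypothetical illustration of the contract: stop once the budget
          # is exhausted and report unprocessed keys through a callback.
          deadline = time.monotonic_ns() + time_budget_ns
          results = {}
          for i, key in enumerate(keys):
              if time.monotonic_ns() >= deadline:
                  for unprocessed in keys[i:]:
                      on_miss(unprocessed)  # caller decides how to handle leftovers
                  break
              results[key] = table.get(key)
          return results

      table = {k: k * 0.5 for k in range(100_000)}
      skipped = []
      partial = fetch_with_budget(table, list(range(100_000)),
                                  time_budget_ns=50_000, on_miss=skipped.append)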
  • Documentation Enhancements:

    • The documentation for the max_all_to_all_bandwidth parameter of the HybridEmbeddingParam class is clarified to indicate that the bandwidth unit is per-GPU.
      Previously, the unit was not specified.
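      To make the clarified unit concrete: because the parameter is per GPU, a node-wide aggregate must be divided by the GPU count. The numbers below are illustrative, not measured values.

        # max_all_to_all_bandwidth is a per-GPU figure, so divide any
        # node-wide aggregate by the GPU count (illustrative numbers).
        num_gpus = 8
        aggregate_all_to_all_bandwidth = 1.52e12
        max_all_to_all_bandwidth = aggregate_all_to_all_bandwidth / num_gpus  # 1.9e11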
  • Issues Fixed:

    • Hybrid embedding with IB_NVLINK as the communication_type of the HybridEmbeddingParam class is fixed in this release.
    • Training performance is reduced by a GPU routine that checks whether an input key is out of the embedding table's range.
      If you can guarantee that the input keys fit within the specified workspace_size_per_gpu_in_mb, you can disable the routine and restore the training performance by setting the environment variable HUGECTR_DISABLE_OVERFLOW_CHECK=1 (see the sketch after this list).
    • Engineering discovered and fixed a correctness issue with the Softmax layer.
    • Engineering removed an inline profiler that was rarely used or updated. This change relates to GitHub issue 340.
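
    One way to apply the overflow-check workaround above is to set the variable in the training script before HugeCTR is loaded; the variable name comes from this note, while the surrounding pattern is just an illustration.

      import os

      # Disable the per-key overflow check. Only safe when every input key is
      # guaranteed to fit the configured workspace_size_per_gpu_in_mb.
      os.environ["HUGECTR_DISABLE_OVERFLOW_CHECK"] = "1"

      import hugectr  # import after setting the variable so training picks it up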
  • Known Issues:

    • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.
      If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

        --shm-size=1g --ulimit memlock=-1

      See also the NCCL known issue and the GitHub issue.

    • The startup of KafkaProducers succeeds even if the target Kafka broker is unresponsive.
      To avoid data loss in conjunction with streaming model updates from Kafka, make sure that a sufficient number of Kafka brokers are up, operating properly, and reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be greater than or equal to the number of data reader workers.
      Otherwise, different workers are mapped to the same file and data loading does not progress as expected (see the sketch after this list).

    • Joint loss training with a regularizer is not supported.
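
    A quick way to guard against the file-list pitfall above: the first line of a HugeCTR file list holds the file count, so it can be compared with the configured number of reader workers before training. The file name and worker count below are illustrative.

      # Ensure there are at least as many data files as data reader workers.
      with open("file_list.txt") as f:
          num_files = int(f.readline().strip())

      num_workers = 12  # must match the data reader configuration
      assert num_files >= num_workers, (
          f"{num_files} files cannot keep {num_workers} reader workers busy; "
          "reduce the worker count or add data files"
      )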