HugeCTR is a GPU-accelerated recommender framework that distributes training across multiple GPUs and nodes to estimate Click-Through Rates (CTRs). It supports model-parallel embedding tables and data-parallel neural networks, along with variants such as Wide and Deep Learning (WDL), Deep Cross Network (DCN), DeepFM, and Deep Learning Recommendation Model (DLRM). HugeCTR is a component of NVIDIA Merlin Open Beta, which is used to build large-scale deep learning recommender systems. For additional information, see the HugeCTR User Guide.
Design Goals:
- Fast: HugeCTR is a speed-of-light CTR model framework that can outperform popular recommender systems such as TensorFlow (TF).
- Efficient: HugeCTR provides the essentials so that you can efficiently train your CTR model.
- Easy: Regardless of whether you are a data scientist or machine learning practitioner, we've made it easy for anybody to use HugeCTR.
HugeCTR supports a variety of features, including the following:
- multi-node training
- mixed precision training
- SGD optimizer and learning rate scheduling
- model oversubscription
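As an illustration of the learning rate scheduling feature listed above, CTR training commonly pairs SGD with a linear warmup followed by a linear decay. The sketch below is a generic, framework-agnostic illustration of that pattern in plain Python; the function and parameter names are illustrative and are not HugeCTR's API:

```python
def lr_at_step(step, base_lr=0.001, warmup_steps=1000,
               decay_start=5000, decay_steps=10000):
    """Illustrative warmup-then-decay schedule; not HugeCTR's actual API."""
    if step < warmup_steps:
        # Linear warmup from near 0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Hold at the base learning rate.
        return base_lr
    # Linear decay toward 0 over decay_steps iterations.
    decayed = 1.0 - (step - decay_start) / decay_steps
    return max(base_lr * decayed, 0.0)

early = lr_at_step(0)        # small value during warmup
plateau = lr_at_step(2000)   # equals base_lr
late = lr_at_step(10000)     # decayed below base_lr
```

In real HugeCTR configurations the warmup and decay parameters are specified alongside the SGD optimizer settings; the exact knobs depend on the version you use.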
To learn about our latest enhancements, see our release notes.
If you'd like to quickly train a model using the Python interface, follow these steps:
- Start an NGC container with your local host directory (/your/host/dir) mounted by running the following command:

  ```shell
  docker run --runtime=nvidia --rm -v /your/host/dir:/your/container/dir -w /your/container/dir -it -u $(id -u):$(id -g) nvcr.io/nvidia/merlin/merlin-training:0.5
  ```

  NOTE: The /your/host/dir directory is visible inside the container as /your/container/dir, which is also your starting directory.
- Activate the merlin conda environment by running the following command:

  ```shell
  source activate merlin
  ```
- Inside the container, copy the DCN configuration file (dcn.json) to your mounted directory (/your/container/dir). This configuration file specifies the DCN model architecture and its optimizer. When you train through the Python interface, the solver clause within the configuration file is not used; the solver settings are provided in Python instead.
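To give a sense of what such a configuration file contains, the fragment below sketches the legacy HugeCTR JSON layout with its solver, optimizer, and layers clauses. The specific field values and layer types shown here are illustrative examples, not a drop-in dcn.json, and the exact schema varies by HugeCTR version:

```json
{
  "solver": {
    "batchsize": 16384,
    "gpu": [0]
  },
  "optimizer": {
    "type": "SGD",
    "sgd_hparam": { "learning_rate": 0.001 }
  },
  "layers": [
    { "name": "data", "type": "Data" },
    { "name": "sparse_embedding1", "type": "DistributedSlotSparseEmbeddingHash" },
    { "name": "multicross1", "type": "MultiCross" }
  ]
}
```

With the Python interface, only the layers (and optimizer) clauses are read from this file; the solver clause above would be ignored.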
- Generate a synthetic dataset based on the configuration file by running the following command:

  ```shell
  ./data_generator --config-file dcn.json --voc-size-array 39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543 --distribution powerlaw --alpha -1.2
  ```
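The --distribution powerlaw flag produces long-tailed category IDs, as seen in real CTR logs where a few "head" categories dominate. One plausible reading of such a distribution, with sampling weights proportional to (i + 1) ** alpha, can be sketched in plain Python; this is an illustration of the shape of the data, not the generator's actual sampling code:

```python
import random

def powerlaw_ids(voc_size, alpha, n, seed=0):
    # Unnormalized weight w_i proportional to (i + 1) ** alpha.
    # With alpha < 0, low ids receive most of the probability mass,
    # mimicking long-tailed category frequencies.
    rng = random.Random(seed)
    weights = [(i + 1) ** alpha for i in range(voc_size)]
    return rng.choices(range(voc_size), weights=weights, k=n)

ids = powerlaw_ids(voc_size=1000, alpha=-1.2, n=10000)
# Fraction of samples landing in the ten most frequent "head" ids.
head_share = sum(1 for i in ids if i < 10) / len(ids)
```

With alpha = -1.2, roughly half of the samples fall into the first ten ids, which is why embedding-table access patterns for such data are heavily skewed.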
- Write a simple Python script using the hugectr module as shown here:

  ```python
  # train.py
  import sys
  import hugectr
  from mpi4py import MPI

  def train(json_config_file):
      solver_config = hugectr.solver_parser_helper(batchsize = 16384,
                                                   batchsize_eval = 16384,
                                                   vvgpu = [[0,1,2,3,4,5,6,7]],
                                                   repeat_dataset = True)
      sess = hugectr.Session(solver_config, json_config_file)
      sess.start_data_reading()
      for i in range(10000):
          sess.train()
          if (i % 100 == 0):
              loss = sess.get_current_loss()
              print("[HUGECTR][INFO] iter: {}; loss: {}".format(i, loss))

  if __name__ == "__main__":
      json_config_file = sys.argv[1]
      train(json_config_file)
  ```

  NOTE: Update the vvgpu (the active GPUs), batchsize, and batchsize_eval parameters according to your GPU system.
- Train the model by running the following command:

  ```shell
  python train.py dcn.json
  ```
For additional information, see the HugeCTR User Guide.
If you encounter any issues or have questions, please file an issue here so that we can provide you with the necessary resolutions and answers. To further advance the Merlin/HugeCTR roadmap, we encourage you to share the details of your recommender system pipeline through this survey.