18 Oct 01:21

TroyGarden

bdb27b5

v1.4.0-rc1 Pre-release

Pre-release

release cut for v1.4.0
in-sync with fbgemm-gpu release v1.4.0
in-sync with pytorch 2.9

Assets 2

13 Sep 22:45

TroyGarden

v1.3.0

79538d7

v1.3.0 Latest

Latest

New Features

New Flavors of Training Pipelines

Fused SDD: A new pipeline optimization schema that overlaps optimizer with embedding lookup. Training QPS gain is observed for models with heavy optimizer (e.g., Shampoo opt). [#2916, #2933]
2D Sharding support: common SDD train pipeline now supports 2D sharding schema. [#2929]
PostProc module support in train pipeline. [#2939, #2978, #2982, #2999]

Delta Tracker and Delta Store

ModelDeltaTracker is a utility for tracking and retrieving unique IDs and their corresponding embeddings or states from embedding modules in model using Torchrec. [#3056, #3060, #3064, ...]
It's particularly useful for:

Identifying which embedding rows were accessed during model execution
Retrieving the latest delta or unique rows for a model
Computing top-k changed embeddings
Supporting streaming updated embeddings between systems during online training

Resharding API

TorchRec Resharding API provides a new capability to reshard the embedding tables during training. It can be used for use cases such as manual tuning of the sharding plans during training, and provides resharding capability for Dynamic Resharding. It enables resharding of the existing sharded embedding tables based on a newer sharding plan. Resharding API accepts the changing shards compared to the current sharding plan. [#2911, #2912, #2944, #3053, ...]

Resharding API supports Table-Wise (TW) and Column-Wise (CW) resharding
Optimizer support includes SGD and Adagrad (with Row-wise Adagrad for TW)
Provides a highly performant API, tested on up to 128 GPUs across 16 nodes with NVIDIA A100 80GB GPUs, achieving an average resharding downtime of approximately 200 milliseconds for around 100GB of total data.
Achieved 0.1% average downtime per reshard compared to total training time for DLRM ~100GB model.

Prototyping KVZCH (Key-Value Zero-Collision Hashing)

Extend current TBE: There is considerable effort and expertise which has gone toward enabling performance optimized TBE for accessing HBM as well as host DRAM. We want to leverage such capabilities, and extend on top of TBE.
Abstract out the details of the backend memory: The memory we use could be SSD, Remote memory tiers through back end, or remote memory through front end. We want to enable all such capabilities, without adding backend specific logic to the TBE code.

KV TBE Design document [#2942]
KVZCH embedding lookup module [#2922]

MPZCH (Multi-Probe Zero-Collision Hashing) [#3089]

We are introducing a novel Multi-Probe Zero Collision Hash (MPZCH) solution based on multi-round linear probing to address the long-standing hash collision problem in sparse embedding lookup. The proposed solution is general, highly performant, scalable and simple.
A fast CUDA kernel is developed to map input sparse features to indices/slots with minimum chance of collision with others under a given budget. Eviction or fallback may happen when a collision occurs. Mapped indices and eviction information are returned for the downstream embedding lookup and optimizer states update. The process only takes a couple of milliseconds per batch at training. A CPU kernel was introduced to provide good performance in the inference environment.
A row-wise sharded ManagedCollisionModule (MCH) module is added as a part of TorchRec library that enables seamless integration with large scale distributed model training in production. No extra limit was applied for model scaling and the training throughput regression is little-to-none.
The solution has been adopted and tested by various product models with multi-billion hash size across retrieval and ranking. Promising results were observed from both offline and online experiments.

Change Log

ITEP (in-training emebedding pruning) changes [#2902, #2934, #2986, #3002]
stride_per_key_per_rank related changes [#2959, #3112, #3120, #3124]
benchmark refactoring and unification [#3107, #3104, #3082, #3096, ...]
input kjt validator [#2963]

compatability

fbgemm-gpu==1.3.0
torch==2.8.0

Assets 2

13 Sep 17:47

TroyGarden

v1.3.0-rc3

79538d7

v1.3.0-rc3 Pre-release

Pre-release

Wheel build test: passed
Binary validation: passed
CPU CI test: passed
GPU CI test: passed

Assets 2

13 Sep 16:35

TroyGarden

v1.3.0-rc2

bc8ef9c

v1.3.0-rc2 Pre-release

Pre-release

bump the torchrec version, pin the torch version

Assets 2

12 Sep 17:02

TroyGarden

v1.3.0-rc1

1f19023

v1.3.0-rc1 Pre-release

Pre-release

align with fbgemm release cut around 6/28

Assets 2

06 Jun 17:00

TroyGarden

v1.2.0

440b1c6

v1.2.0

New Features

TensorDict support for EBC and EC

an EBC/EC module can now take in TensorDict as the data input in alternative to KeyedJaggedTensor: #2581 #2596

Customized Embedding Lookup Kernel Support

NVIDIA dynamicemb package depends on an old TorchRec release (r0.7) plus a PR(#2533), refactor TorchRec embedding lookup structures to be easy to plug in a customized emb-lookup kernel: #2887 #2891

Prototype of Dynamic Sharding

Add initial dynamic sharding API and test. This current version supports EBC, TW, and Sharded Tensor. And other variants beyond those configurations (e.g. CW, RW, DTensor etc..): #2852 #2875 #2877 #2863

TorchRec 2D Parallel for EmbeddingCollection

Adding support for EmbeddingCollection modules in 2D parallel. This supports all sharding types that are supported for EC. #2737

Changelog

Support MCH for semi-sync (assuming no eviction): #2753
Multi forward MCH eviction fix: #2836
Fix RW Support and checkpointing: #2890

Assets 2

06 Jun 07:01

TroyGarden

v1.2.0-rc3

16f6124

v1.2.0-rc3 Pre-release

Pre-release

revert #2876 and update the binary validation script

Assets 2

06 Jun 07:00

TroyGarden

v1.2.0-rc2

bc831a1

v1.2.0-rc2 Pre-release

Pre-release

version number change

Assets 2

06 Jun 00:13

TroyGarden

v1.2.0-rc1

9f0bd7e

v1.2.0-rc1 Pre-release

Pre-release

first release candidate of v1.2.0

Assets 2

30 Jan 19:33

PaulZhang12

v1.1.0-rc4

2c5f6ee

v1.1.0

New Features

Grid-based Sharding For EBC

This is a form of CW sharding and then TWRW sharding the respective CW shards. One of the key changes is how the metadata from sharding placements is constructed in grid sharding. We leverage the concept of per_node from TWRW and combine it with the permutations and concatenation required in CW. Pull Request #2445

Re-shardable Hash Zch

Fully reshardable ZCH: we can handle any value which is in common with default value (768) so WS 1,2,4,8,16,24,32,48,64,96,128, etc. and go up and down. Pull Request #2538

TorchRec 2D Parallel

In this diff we introduce a new parallelism strategy for scaling recommendation model training called 2D parallel. In this case, we scale model parallel through data parallel, hence, the 2D name. Our new entry point, DMPCollection, subclasses DMP and is meant to be a drop in replacement to integrate 2D parallelism in distributed training. By setting the total number of GPUs to train across and the number of GPUs to locally shard across (aka one replication group), users can train their models in the same training loop but now over a larger number of GPUs. The current implementation shards the model such that, for a given shard, its replicated shards lie on the ranks within the node. This significantly improves the performance of the all-reduce communication (parameter sync) by utilizing intra-node bandwidth. Under this scheme the supported sharding types are RW, CW, and GRID. TWRW is not supported due to no longer being able to take advantage of the intra node bandwidth in the 2D scheme. Pull Request #2554

Changelog

torch.compile compatibility support: #2381, #2475, #2583

torch.export module support: #2388, #2390, #2393,

DTensor improvement: #2585, #2626

Assets 2

Releases: meta-pytorch/torchrec

v1.4.0-rc1

Uh oh!

v1.3.0

New Features

New Flavors of Training Pipelines

Delta Tracker and Delta Store

Resharding API

Prototyping KVZCH (Key-Value Zero-Collision Hashing)

MPZCH (Multi-Probe Zero-Collision Hashing) [#3089]

Change Log

compatability

Uh oh!

v1.3.0-rc3

Uh oh!

v1.3.0-rc2

Uh oh!

v1.3.0-rc1

Uh oh!

v1.2.0

New Features

TensorDict support for EBC and EC

Customized Embedding Lookup Kernel Support

Prototype of Dynamic Sharding

TorchRec 2D Parallel for EmbeddingCollection

Changelog

Uh oh!

v1.2.0-rc3

Uh oh!

v1.2.0-rc2

Uh oh!

v1.2.0-rc1

Uh oh!

v1.1.0

New Features

Grid-based Sharding For EBC

Re-shardable Hash Zch

TorchRec 2D Parallel

Changelog

Uh oh!