🚀 Feature
Motivation
To avoid pitfalls with shuffling and sharding of datapipes in distributed training environments.
To ensure a consistent experience of TorchData-based datasets across domains.
Pitch
TorchText datasets return datapipes. To perform distributed training, users would typically apply a sharding filter to split the data across ranks. Furthermore, to make sure that data is not shuffled only within the corresponding shards, it is important that the sharding filter is applied after shuffling. As the investigations from TorchVision showed, this ordering is not always obvious to users and can lead to suboptimal results if not done properly.
We could do this by simply wrapping the datapipe at the very end of dataset construction (see the sketch below).
When users want to shuffle the dataset, they would simply set shuffle=True in the DataLoader. Furthermore, since the sharding filter is already applied, users do not have to apply it explicitly when doing distributed training.
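A minimal sketch of what this could look like, assuming a recent torch/torchdata where the functional .shuffle()/.sharding_filter() datapipe methods are available and the DataLoader can toggle the shuffle setting for datapipes; the build_dataset helper and its sample data below are purely illustrative, not the actual TorchText implementation:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def build_dataset(split="train"):
    # Illustrative stand-in for a TorchText dataset builder.
    samples = [("a sentence", 0), ("another sentence", 1),
               ("one more sentence", 0), ("and another one", 1)]
    dp = IterableWrapper(samples)
    # ... dataset-specific parsing / transforms would go here ...
    # Wrap at the very end: shuffle first, then shard, so every rank draws
    # from the globally shuffled stream instead of shuffling within its shard.
    return dp.shuffle().sharding_filter()

dp = build_dataset()
# The DataLoader controls whether shuffling actually happens: shuffle=True
# enables the shuffler in the datapipe graph, shuffle=False disables it.
# The sharding filter is already in place, so distributed/multi-worker users
# do not have to add it themselves.
loader = DataLoader(dp, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)
```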
Alternatives
Keep the dataset implementations as they are and educate users (through tutorials and documentation) to perform shuffling before sharding.
Additional context
In addition to making sure that shuffling is always done before sharding, this also comes with the benefit that shuffling can happen before the datapipe contains heavy objects (like images), since the shuffle datapipe internally keeps a buffer of the items it shuffles. Hence, for vision datasets, this is more than just a convenience/helper utility.
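As an illustration (not actual TorchVision code; decode_image and the path list are hypothetical), placing the shuffle step before decoding keeps the shuffle buffer full of cheap items:

```python
from torchdata.datapipes.iter import IterableWrapper

def decode_image(path):
    # Hypothetical decoder; in practice this would read and decode image bytes.
    return ("decoded", path)

paths = [f"img_{i:05d}.jpg" for i in range(10_000)]
dp = IterableWrapper(paths)
dp = dp.shuffle(buffer_size=1000)  # buffer holds lightweight paths, not decoded images
dp = dp.sharding_filter()          # shard the already-shuffled stream across ranks/workers
dp = dp.map(decode_image)          # heavy objects are only created after shuffling
```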
If shuffling and sharding are done internally, it would mean that we must document the usage so that users do not apply shuffling and sharding again.
We also want to ensure that users have a similar experience across domains and hence should have consistent solutions for common pitfalls.
cc: @NicolasHug, @ejguan, @kevinchn, @Nayef211, @abhinavarora, @VirgileHlav
Thanks for agreeing to make these changes @parmeet, I think this will positively impact users' experience with datapipes in the long run.
Considering the release and branch cut dates are approaching (fast!), please don't hesitate to let me know if you'd like me to help by submitting PRs for the datapipes, or for the docs.