
Regarding adding shuffling and sharding datapipes to in-built datasets #1727

parmeet opened this issue May 17, 2022 · 1 comment

parmeet (Contributor) commented May 17, 2022

🚀 Feature

Motivation

  • To avoid pitfalls with shuffling and sharding of datapipes in distributed training environments.
  • To ensure a consistent experience of TorchData-based datasets across domains.

Pitch

TorchText datasets return datapipes. To perform distributed training, users would typically apply a sharding filter so that the data is sharded across ranks. Furthermore, to make sure that data is not shuffled only within the corresponding shards, it is important that the sharding filter is applied after shuffling. As per the investigations from TorchVision, this ordering is not always obvious to users and can lead to suboptimal results if not done properly.
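
For illustration, a minimal sketch of the pitfall (for contrast only, mirroring the pseudocode style below): if the sharding filter is applied before shuffling, each rank or worker only ever shuffles within its own shard.

dp = ...                      # some source datapipe
dp = dp.sharding_filter()     # sharding applied first: each rank/worker sees only its shard
dp = dp.shuffle()             # shuffle now only reorders samples within that shard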

We could enforce the correct order by simply wrapping the datapipe at the very end:

def MyDataSet(...):
    dp = ...
    # Attach a shuffle datapipe but leave it disabled by default;
    # DataLoader(shuffle=True) can re-enable it at load time.
    dp = dp.shuffle().set_shuffle(False)
    # Apply the sharding filter after shuffling so each rank/worker
    # receives a distinct shard of the already-shuffled stream.
    dp = dp.sharding_filter()
    return dp

When users want to shuffle the dataset, they would simply set shuffle=True in the DataLoader. Furthermore, since the sharding filter is already applied, users do not have to call it explicitly when doing distributed training.
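
For example, a minimal usage sketch under this proposal (the dataset factory MyDataSet and its arguments are placeholders from the snippet above):

from torch.utils.data import DataLoader

dp = MyDataSet(...)  # datapipe with shuffle + sharding_filter already attached
# shuffle=True re-enables the internal shuffle datapipe at load time;
# leaving it False keeps shuffling disabled, as set by set_shuffle(False).
loader = DataLoader(dp, batch_size=32, shuffle=True, num_workers=2)
for batch in loader:
    ...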

Alternatives

Keep the dataset implementations as they are and educate users (via tutorials and documentation) to perform shuffling before sharding.

Additional context

  • In addition to making sure that shuffling is always done before sharding, this also comes with the benefit that shuffling can be done before the datapipe contains heavy objects (like images), since the shuffle datapipe creates an internal buffer to shuffle the corresponding data items (see the sketch after this list). Hence, for vision datasets, this is more than just a convenience/helper utility.
  • If shuffling and sharding are done internally, we must document the usage so that users do not apply shuffling and sharding again.
  • We also want to ensure that users have similar experiences across domains and hence should have consistent solutions for common pitfalls.
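
A minimal sketch of the buffering point above, assuming a vision-style pipeline built from torchdata datapipes (FileLister is a real datapipe; decode_image is a hypothetical decoding step):

from torchdata.datapipes.iter import FileLister

dp = FileLister(root="images/", masks="*.jpg")  # yields lightweight path strings
dp = dp.shuffle()                               # shuffle buffer holds paths, not decoded images
dp = dp.sharding_filter()                       # shard after shuffling
dp = dp.map(decode_image)                       # hypothetical heavy decoding step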

cc: @NicolasHug , @ejguan , @kevinchn , @Nayef211 , @abhinavarora , @VirgileHlav

NicolasHug (Member) commented May 17, 2022

Thanks for agreeing to make these changes @parmeet, I think this will positively impact users' experience with datapipes in the long run.

Considering the release and branch cut dates are approaching (fast!), please don't hesitate to let me know if you'd like me to help by submitting PRs for the datapipes, or for the docs.
