
Regarding adding shuffling and sharding datapipes to in-built datasets #1727

parmeet opened this issue May 17, 2022 · 1 comment

parmeet (Contributor) commented May 17, 2022

🚀 Feature

Motivation

  • To avoid pitfalls with shuffling and sharding of datapipes in distributed training environments.
  • To ensure a consistent experience of TorchData-based datasets across domains.

Pitch

TorchText datasets return datapipes. To perform distributed training, users would typically apply a sharding filter so that the data is sharded across ranks. Furthermore, to make sure that data is not shuffled only within the corresponding shards, it is important that the sharding filter is applied after shuffling. As per the investigations from TorchVision, this ordering is not always obvious to users and can lead to suboptimal results if not done properly.
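
For illustration, a minimal sketch of the pitfall (for contrast only, mirroring the pseudocode style below): if the sharding filter is applied before shuffling, each rank or worker only ever shuffles within its own shard.

dp = ...                      # some source datapipe
dp = dp.sharding_filter()     # sharding applied first: each rank/worker sees only its shard
dp = dp.shuffle()             # shuffle now only reorders samples within that shard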

We could enforce the correct order by simply wrapping the datapipe at the very end:

def MyDataSet(...):
    dp = ...
    # Attach a shuffle datapipe but leave it disabled by default;
    # DataLoader(shuffle=True) can re-enable it at load time.
    dp = dp.shuffle().set_shuffle(False)
    # Apply the sharding filter after shuffling so each rank/worker
    # receives a distinct shard of the already-shuffled stream.
    dp = dp.sharding_filter()
    return dp

When users want to shuffle the dataset, they would simply set shuffle=True in the DataLoader. Furthermore, since the sharding filter is already applied, users do not have to call it explicitly when doing distributed training.
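
For example, a minimal usage sketch under this proposal (the dataset factory MyDataSet and its arguments are placeholders from the snippet above):

from torch.utils.data import DataLoader

dp = MyDataSet(...)  # datapipe with shuffle + sharding_filter already attached
# shuffle=True re-enables the internal shuffle datapipe at load time;
# leaving it False keeps shuffling disabled, as set by set_shuffle(False).
loader = DataLoader(dp, batch_size=32, shuffle=True, num_workers=2)
for batch in loader:
    ...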

Alternatives

Keep the dataset implementations as they are and educate users (via tutorials and documentation) to perform shuffling before sharding.

Additional context

  • In addition to making sure that shuffling is always done before sharding, this also comes with the benefit that shuffling can be done before the datapipe contains heavy objects (like images), since the shuffle datapipe creates an internal buffer to shuffle the corresponding data items (see the sketch after this list). Hence, for vision datasets, this is more than just a convenience/helper utility.
  • If shuffling and sharding are done internally, we must document the usage so that users do not apply shuffling and sharding again.
  • We also want to ensure that users have similar experiences across domains and hence should have consistent solutions for common pitfalls.
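
A minimal sketch of the buffering point above, assuming a vision-style pipeline built from torchdata datapipes (FileLister is a real datapipe; decode_image is a hypothetical decoding step):

from torchdata.datapipes.iter import FileLister

dp = FileLister(root="images/", masks="*.jpg")  # yields lightweight path strings
dp = dp.shuffle()                               # shuffle buffer holds paths, not decoded images
dp = dp.sharding_filter()                       # shard after shuffling
dp = dp.map(decode_image)                       # hypothetical heavy decoding step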

cc: @NicolasHug , @ejguan , @kevinchn , @Nayef211 , @abhinavarora , @VirgileHlav

NicolasHug (Member) commented May 17, 2022

Thanks for agreeing to make these changes @parmeet, I think this will positively impact users' experience with datapipes in the long run.

Considering the release and branch cut dates are approaching (fast!), please don't hesitate to let me know if you'd like me to help by submitting PRs for the datapipes, or for the docs.
