🚀 Feature

Give the option to turn off the automatic distributed data sampling when using StreamingDataset.
Motivation
When I use StreamingDataset in a DDP environment, the dataset length of StreamingDataset always seems to be original_len / world_size.
But I want the different processes (with different local_ranks) to share exactly the same StreamingDataset, without any data splitting.
Pitch
How can I stop the automatic data distribution when using StreamingDataset in DDP? Could you provide a setting for this? Or could you explain why the distribution can't be turned off?
Alternatives
Additional context
Hey @ygtxr1997. You can override the distributed env on the dataset; it is inferred automatically from torch.
What is your use case?
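A minimal sketch of what that override could look like, assuming litdata's internal `_DistributedEnv` helper in `litdata.utilities.env` (the import path, the constructor signature, and the `input_dir` below are assumptions — check your installed litdata version):

```python
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv  # assumed internal helper

# Hypothetical input_dir; replace with the location of your optimized dataset.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-dataset")

# Pretend this process is a single-rank run, so no per-rank splitting
# happens and every rank sees the full dataset.
dataset.distributed_env = _DistributedEnv(world_size=1, global_rank=0, num_nodes=1)

print(len(dataset))  # now original_len instead of original_len / world_size
```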
I think overriding the litdata class should be a way to solve the issue above.
Initially, I wanted to use litdata to optimize my dataset, which consists of ~500k small files (each about 200 KB). All of these files are stored on a remote storage server. However, unlike other image datasets, my dataloader needs to read from two distinct files per sample, and the gap between the two file indices varies in [20, 50], like a sliding window. For instance, the file indices in a data batch (batch_size=4) could look like this:
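(The concrete batch from the original post is not shown; the pairs below are purely illustrative, with an index gap in [20, 50] per sample:)

```python
# Purely illustrative: each sample reads two files whose indices
# differ by 20-50, and the window slides across consecutive samples.
batch = [
    (100, 125),
    (101, 131),
    (102, 148),
    (103, 123),
]
```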
According to your usage example and the distributed data loading illustration GIF, litdata does not seem well suited to this kind of random-read access pattern, am I right? Maybe the performance depends on how the original files are merged into a litdata chunk. Keeping the original file order (from small index to large index) could lead to faster loading, since the two files of a sample would then likely land in the same chunk, but could this ordering affect the learning of deep models?
Therefore, I don't know whether litdata could help speed up data loading in my case.
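For what it's worth, here is a minimal sketch of order-preserving optimization with litdata's `optimize` (the directories, the `read_file` helper, and the `chunk_bytes` value are hypothetical; with ~200 KB files and 64 MB chunks, two files whose indices differ by at most 50 would usually fall in the same chunk):

```python
import os
from litdata import optimize

def read_file(filepath):
    # Hypothetical helper: one sample per original file.
    with open(filepath, "rb") as f:
        return {"path": filepath, "data": f.read()}

if __name__ == "__main__":
    # Keep the original small-to-large index order so that the two files a
    # sample needs (index gap in [20, 50]) usually end up in the same chunk.
    input_dir = "/data/raw"  # hypothetical local mirror of the remote files
    files = sorted(os.listdir(input_dir))
    optimize(
        fn=read_file,
        inputs=[os.path.join(input_dir, f) for f in files],
        output_dir="/data/optimized",  # hypothetical
        chunk_bytes="64MB",            # ~320 files of 200 KB per chunk
        num_workers=4,
    )
```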