data duplicate when setting num_works > 1 with streaming data #3423
Comments
Hi ! Thanks for reporting :) When using a PyTorch data loader with num_workers > 1 on a streaming dataset, each worker ends up iterating over the same stream, which is why the data is duplicated. We can probably fix this in datasets.
Do you have any plans to fix the problem?
Isn’t that somehow a bug on PyTorch's side? (Just asking because this behavior seems quite general and maybe not what would be intended)
From PyTorch's documentation here:
It looks like an intended behavior from PyTorch. As suggested in the docstring of the IterableDataset class, we could pass a worker_init_fn to the DataLoader (or use get_worker_info() inside __iter__) so that each worker only yields its own shard of the stream. However, while this solution works, I'm worried that many users simply don't know about this parameter and just start their training with duplicate data without knowing it. That's why I'm more in favor of integrating the check on the worker id directly in datasets.
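For concreteness, here is a minimal sketch of the get_worker_info()-based splitting that PyTorch's IterableDataset docstring suggests. The class name and the modulo sharding below are illustrative assumptions, not the fix that was later merged into datasets:

import torch
from torch.utils.data import IterableDataset, DataLoader


class ShardedIterableDataset(IterableDataset):
    # Toy stream of n integers, split across DataLoader workers.
    def __init__(self, n=8):
        self.n = n

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            # Single-process data loading: yield everything.
            yield from range(self.n)
        else:
            # Each worker yields a disjoint slice of the stream.
            for i in range(self.n):
                if i % worker_info.num_workers == worker_info.id:
                    yield i


if __name__ == "__main__":
    loader = DataLoader(ShardedIterableDataset(), num_workers=2)
    print(sorted(int(t) for t in loader))  # every example appears exactly once

With this kind of split, each example is yielded by exactly one worker instead of by all of them.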
Fixed by #4375
Thanks!
Hi there @lhoestq @cloudyuyuyu
If the worker_info.id is unique per process it should work fine. Could you check that they're unique? The code to get the worker_info in each worker is torch.utils.data.get_worker_info().
test.py

import torch
from torch.utils.data import IterableDataset, DataLoader


class MyIterableDataset(IterableDataset):
    def __iter__(self):
        # Print the worker info seen inside each DataLoader worker.
        worker_info = torch.utils.data.get_worker_info()
        print(worker_info)
        return iter(range(3))


if __name__ == '__main__':
    dataset = MyIterableDataset()
    dataloader = DataLoader(dataset, num_workers=1)
    for i in dataloader:
        print(i)

$ python3 -m torch.distributed.launch \
    --nproc_per_node=2 test.py

WorkerInfo(id=0, num_workers=1, seed=5545685212307804959, dataset=<__main__.MyIterableDataset object at 0x7f92648cf6a0>)
WorkerInfo(id=0, num_workers=1, seed=3174108029709729025, dataset=<__main__.MyIterableDataset object at 0x7f19ab961670>)
tensor([0])
tensor([1])
tensor([2])
tensor([0])
tensor([1])
tensor([2])

@lhoestq they are not unique
It looks like a bug from PyTorch, no? How can we know which data should go in which process when using DDP? I guess we need to check the rank of the DDP process as well, not just the worker id.
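As a rough sketch of that kind of check, the DDP process rank could be combined with the DataLoader worker id into a single shard index. This assumes torch.distributed has been initialized (e.g. by torch.distributed.launch) and is only an illustration, not the code that ended up in datasets:

import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class DDPShardedIterableDataset(IterableDataset):
    # Yields only the examples that belong to this (process rank, worker id) pair.
    def __init__(self, n=16):
        self.n = n

    def __iter__(self):
        initialized = dist.is_available() and dist.is_initialized()
        rank = dist.get_rank() if initialized else 0
        world_size = dist.get_world_size() if initialized else 1
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Global shard index over all DDP processes and all DataLoader workers.
        shard = rank * num_workers + worker_id
        num_shards = world_size * num_workers
        for i in range(self.n):
            if i % num_shards == shard:
                yield i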
Never mind. After reading the code,
I'm re-opening this one since I think it should be supported by datasets.
Hmm, actually let me open a new issue on DDP - the original post was for a single node.
Describe the bug
The data is repeated num_workers times when we call load_dataset with streaming=True and set num_workers > 1 when constructing the DataLoader.
Steps to reproduce the bug
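The reproduction snippet did not survive here, so this is a minimal sketch of the reported setup. It assumes a datasets release whose streaming IterableDataset can be passed straight to a torch DataLoader via with_format("torch"); the dataset name is only a placeholder:

from datasets import load_dataset
from torch.utils.data import DataLoader

NUM_OF_WORKERS = 16

if __name__ == "__main__":
    # Stream the dataset instead of downloading it (dataset name is only an example).
    stream = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

    # With num_workers > 1, every worker iterated over the same stream,
    # so each example came back NUM_OF_WORKERS times.
    dataloader = DataLoader(stream.with_format("torch"), num_workers=NUM_OF_WORKERS)
    for example in dataloader:
        ...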
Expected results
The data should not be duplicated.
Actual results
The data is duplicated: with NUM_OF_WORKERS = 16, each example is yielded 16 times.
Environment info
datasets version: 1.14.0