about dataloader through prepare() #2316
```python
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(24)), shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)
```

This will return (output interleaved across the 5 processes):

```
tensor([3], device='cuda:3')
tensor([8], device='cuda:3')
tensor([13], device='cuda:3')
tensor([18], device='cuda:3')
tensor([23], device='cuda:3')
tensor([2], device='cuda:2')
tensor([7], device='cuda:2')
tensor([12], device='cuda:2')
tensor([17], device='cuda:2')
tensor([22], device='cuda:2')
tensor([4], device='cuda:4')
tensor([9], device='cuda:4')
tensor([14], device='cuda:4')
tensor([19], device='cuda:4')
tensor([0], device='cuda:4')
tensor([0], device='cuda:0')
tensor([5], device='cuda:0')
tensor([10], device='cuda:0')
tensor([15], device='cuda:0')
tensor([20], device='cuda:0')
tensor([1], device='cuda:1')
tensor([6], device='cuda:1')
tensor([11], device='cuda:1')
tensor([16], device='cuda:1')
tensor([21], device='cuda:1')
```
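The sharding pattern above can be reproduced with a simplified model (a hypothetical sketch, not accelerate's actual implementation): batches are dealt round-robin across processes, and the dataset is padded by wrapping around to the start so every process sees the same number of batches.

```python
import math

def shard_indices(dataset_len, num_processes, process_index, batch_size=1):
    # Simplified model of accelerate's sharded dataloader: each step
    # consumes num_processes * batch_size samples, dealt round-robin,
    # and the dataset is padded by wrapping to the start so all
    # processes receive an equal number of batches.
    per_step = num_processes * batch_size
    num_steps = math.ceil(dataset_len / per_step)
    padded_len = num_steps * per_step
    indices = [i % dataset_len for i in range(padded_len)]
    out = []
    for step in range(num_steps):
        start = step * per_step + process_index * batch_size
        out.extend(indices[start:start + batch_size])
    return out

shard_indices(24, 5, 4)  # [4, 9, 14, 19, 0] -- the wrapped 0 is the duplicate seen on cuda:4
```

With 24 samples across 5 processes, process 4's last index wraps to 0, matching the duplicated `tensor([0], device='cuda:4')` in the output above.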
It might, but the impact will be very low since only a small part of the data will be duplicated. The maximum number of duplicated samples is the number of processes, which is very small compared to the size of the dataset. If you really don't want duplicated data, the easiest way is to make sure that the number of samples is divisible by the number of processes.
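One way to follow that suggestion, sketched here as a hypothetical helper (not part of accelerate's API), is to trim the dataset down to the nearest multiple of the process count before building the DataLoader:

```python
def trim_to_multiple(data, num_processes):
    # Drop the trailing remainder so len(data) is divisible by
    # num_processes; with an evenly divisible dataset, prepare()
    # has nothing to pad and no samples are duplicated.
    usable = (len(data) // num_processes) * num_processes
    return data[:usable]

trim_to_multiple(list(range(24)), 5)  # keeps the first 20 samples
```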
Thanks a lot! By the way, the maximum number of duplicated samples is #processes * #batch_size - 1, rather than #processes, right? Just like the following example: with 3 processes, 10 data points, and batch_size=3, I got 9 duplicated data points. If #processes * #batch_size - 1 is not very small compared to the dataset size (like in this case), would shuffle=True and drop_last=True be an alternative solution?
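Under the padding model where the dataset is rounded up to a multiple of #processes * #batch_size, the duplicate count can be computed directly (a simplified sketch; the exact count may differ by accelerate version):

```python
import math

def num_duplicates(dataset_len, num_processes, batch_size):
    # Padded length rounds up to a multiple of num_processes * batch_size,
    # so at most num_processes * batch_size - 1 samples are duplicated.
    per_step = num_processes * batch_size
    padded = math.ceil(dataset_len / per_step) * per_step
    return padded - dataset_len

num_duplicates(10, 3, 3)  # 8 under this model, consistent with the 3*3-1 bound
num_duplicates(24, 5, 1)  # 1, matching the single wrapped sample in the first example
```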
Yes, generally that's what we recommend doing, and then during validation we drop the extra samples during
I think passing drop_last=True to the DataLoader may cause problems after prepare(). As shown below, the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], rather than [0, 1, 2, 3, 4, 5, 6, 7].
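A plausible explanation, sketched as a plain-Python model rather than accelerate's actual code path: drop_last=True discards the final incomplete batch before the data is sharded, so the tail samples are never seen by any process instead of being padded.

```python
def batches_with_drop_last(dataset_len, batch_size):
    # drop_last=True keeps only full batches; the final partial batch
    # (and every sample in it) is silently discarded.
    num_batches = dataset_len // batch_size
    return [list(range(i * batch_size, (i + 1) * batch_size))
            for i in range(num_batches)]

batches_with_drop_last(9, 2)  # [[0, 1], [2, 3], [4, 5], [6, 7]] -- sample 8 is lost
```

Gathering these batches yields [0..7] with sample 8 missing, which matches the behavior reported above.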
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Can someone please take a look at this?
In the tutorial, it is mentioned that:

> Some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers.

So if my training dataset size is not divisible by the number of GPUs, the dataloader after prepare() will include duplicated data during training? Won't this affect model performance (loss, etc.), since it effectively trains on extra copies of some samples? If it causes a large difference, is there a method to exclude these duplicated samples when calculating the loss?