about dataloader through prepare() #2316

Closed
shliu0 opened this issue Jan 9, 2024 · 7 comments

shliu0 commented Jan 9, 2024

In the tutorial, it is mentioned that "Some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers." So if my train dataset size cannot be divided evenly by the number of GPUs, the dataloader returned by prepare() will include duplicated data during training? Won't this affect model performance (loss, etc.), since it effectively adds data to the training set? If it causes a large difference, is there a way to exclude these duplicated samples from the loss calculation?

SunMarc (Member) commented Jan 9, 2024

So if my train dataset size cannot be divided evenly by the number of GPUs, the dataloader returned by prepare() will include duplicated data during training?

Yes, it will include duplicated data. With 5 processes and 24 datapoints, you can see in the following example that datapoint 0 is duplicated on processes 0 and 4:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(24)), shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

# will return

tensor([3], device='cuda:3')
tensor([8], device='cuda:3')
tensor([13], device='cuda:3')
tensor([18], device='cuda:3')
tensor([23], device='cuda:3')
tensor([2], device='cuda:2')
tensor([7], device='cuda:2')
tensor([12], device='cuda:2')
tensor([17], device='cuda:2')
tensor([22], device='cuda:2')
tensor([4], device='cuda:4')
tensor([9], device='cuda:4')
tensor([14], device='cuda:4')
tensor([19], device='cuda:4')
tensor([0], device='cuda:4')
tensor([0], device='cuda:0')
tensor([5], device='cuda:0')
tensor([10], device='cuda:0')
tensor([15], device='cuda:0')
tensor([20], device='cuda:0')
tensor([1], device='cuda:1')
tensor([6], device='cuda:1')
tensor([11], device='cuda:1')
tensor([16], device='cuda:1')
tensor([21], device='cuda:1')

Won't this affect model performance (loss, etc.), since it effectively adds data to the training set?

It might, but the impact will be very low since only a small portion of the data is duplicated. The maximum number of duplicated samples is the number of processes, which is very small compared to the size of the dataset. If you really don't want any duplicated data, the easiest way is to make sure the number of samples is a multiple of the number of processes.
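
If it helps, here is a minimal sketch of that workaround (my own illustration, not from the thread; it assumes an indexable dataset and the default even-batch padding): trim the dataset to a multiple of the global batch size before calling prepare(), so there is nothing left to pad.

from accelerate import Accelerator
from torch.utils.data import DataLoader, Subset

accelerator = Accelerator()
batch_size = 1
data = list(range(24))                                   # toy dataset

# Keep only the largest prefix whose length is a multiple of the global batch size.
global_batch = accelerator.num_processes * batch_size
usable = (len(data) // global_batch) * global_batch
dataloader = DataLoader(Subset(data, list(range(usable))), shuffle=False, batch_size=batch_size)
dataloader = accelerator.prepare(dataloader)             # no samples are duplicated now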

shliu0 (Author) commented Jan 10, 2024

Thanks a lot! By the way, the maximum number of duplicated samples is #processes * #batch_size - 1, rather than #processes, right? As in the following example: with 3 processes, 10 datapoints and batch_size=3, I got 8 duplicated datapoints. If #processes * #batch_size - 1 is not very small compared to the dataset size (like in this case), would setting shuffle=True and drop_last=True be an alternative solution?

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(10)), shuffle=False, batch_size=3)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

# will return 
tensor([0, 1, 2], device='cuda:0')
tensor([9, 0, 1], device='cuda:0')
tensor([3, 4, 5], device='cuda:1')
tensor([2, 3, 4], device='cuda:1')
tensor([6, 7, 8], device='cuda:2')
tensor([5, 6, 7], device='cuda:2')
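
For reference, a quick back-of-the-envelope check of the numbers above (a sketch, assuming the default behaviour where every process is padded to full batches):

num_processes, batch_size, dataset_len = 3, 3, 10
global_batch = num_processes * batch_size           # 9 samples consumed per step
steps = -(-dataset_len // global_batch)             # ceil(10 / 9) = 2 steps
duplicated = steps * global_batch - dataset_len     # 18 - 10 = 8 duplicated samples
print(duplicated)                                   # worst case is global_batch - 1 = 8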

muellerzr (Collaborator) commented
Yes, generally that's what we recommend doing, and then during validation we drop the extra samples with gather_for_metrics for an accurate calculation.
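
A rough sketch of that recommended split (my own illustration with toy data; gather_for_metrics is the Accelerate API mentioned above): shuffle and drop_last=True on the training loader so no sample is duplicated, and gather_for_metrics on the evaluation loader so the padded samples are removed before computing metrics.

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Training: shuffle and drop the last incomplete batch, so nothing gets duplicated.
train_loader = DataLoader(list(range(100)), shuffle=True, batch_size=8, drop_last=True)
# Evaluation: keep every sample; prepare() may pad the last batch with duplicates.
eval_loader = DataLoader(list(range(17)), shuffle=False, batch_size=8)
train_loader, eval_loader = accelerator.prepare(train_loader, eval_loader)

for batch in eval_loader:
    preds = batch  # stand-in for model(batch)
    # gather_for_metrics gathers across processes and drops the duplicated samples
    # from the final batch, so each datapoint is counted exactly once.
    all_preds = accelerator.gather_for_metrics(preds)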

shliu0 (Author) commented Jan 10, 2024

I think passing drop_last=True to the DataLoader may cause problems after prepare(). In the following example, the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], rather than [0, 1, 2, 3, 4, 5, 6, 7]:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(17)), shuffle=False, batch_size=3, drop_last=True)
dataloader = accelerator.prepare(dataloader)
for epoch in range(1):
    for batch in dataloader:
        print(f"epoch-{epoch},{batch}")
        all_batch = accelerator.gather_for_metrics(batch)
        if accelerator.is_main_process:
            print(f"epoch-{epoch},{all_batch}")
    accelerator.wait_for_everyone()

# will return
epoch-0,tensor([0, 1, 2], device='cuda:0')
epoch-0,tensor([3, 4, 5], device='cuda:1')
epoch-0,tensor([6, 7, 8], device='cuda:2')
epoch-0,tensor([0, 1, 2, 3, 4, 5, 6, 7], device='cuda:0')


github-actions bot commented Feb 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

shliu0 (Author) commented Mar 6, 2024

Can someone please take a look at this?

muellerzr reopened this Mar 6, 2024
muellerzr self-assigned this Mar 6, 2024
github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
