about dataloader through prepare() #2316

Closed
shliu0 opened this issue Jan 9, 2024 · 7 comments

shliu0 commented Jan 9, 2024

In the tutorial, it is mentioned that "Some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers." So if my train dataset size cannot be divided evenly by the number of GPUs, the dataloader returned by prepare() will include duplicated data during training? Won't this affect model performance (loss, etc.), since it effectively adds data to the training set? If it causes a large difference, is there a way to exclude these duplicated samples from the loss calculation?

SunMarc (Member) commented Jan 9, 2024

So if my train dataset size cannot be divided evenly by the number of GPUs, the dataloader returned by prepare() will include duplicated data during training?

Yes, it will include duplicated data. With 5 processes and 24 datapoints, you can see in the following example that datapoint 0 is duplicated on processes 0 and 4:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(24)), shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

# will return

tensor([3], device='cuda:3')
tensor([8], device='cuda:3')
tensor([13], device='cuda:3')
tensor([18], device='cuda:3')
tensor([23], device='cuda:3')
tensor([2], device='cuda:2')
tensor([7], device='cuda:2')
tensor([12], device='cuda:2')
tensor([17], device='cuda:2')
tensor([22], device='cuda:2')
tensor([4], device='cuda:4')
tensor([9], device='cuda:4')
tensor([14], device='cuda:4')
tensor([19], device='cuda:4')
tensor([0], device='cuda:4')
tensor([0], device='cuda:0')
tensor([5], device='cuda:0')
tensor([10], device='cuda:0')
tensor([15], device='cuda:0')
tensor([20], device='cuda:0')
tensor([1], device='cuda:1')
tensor([6], device='cuda:1')
tensor([11], device='cuda:1')
tensor([16], device='cuda:1')
tensor([21], device='cuda:1')

Won't this affect model performance (loss, etc.), since it effectively adds data to the training set?

It might, but the impact will be very low since only a small portion of the data is duplicated. The maximum number of duplicated samples is the number of processes, which is very small compared to the size of the dataset. If you really don't want any duplicated data, the easiest way is to make sure the number of samples is a multiple of the number of processes.
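
If it helps, here is a minimal sketch of that workaround (my own illustration, not from the thread; it assumes an indexable dataset and the default even-batch padding): trim the dataset to a multiple of the global batch size before calling prepare(), so there is nothing left to pad.

from accelerate import Accelerator
from torch.utils.data import DataLoader, Subset

accelerator = Accelerator()
batch_size = 1
data = list(range(24))                                   # toy dataset

# Keep only the largest prefix whose length is a multiple of the global batch size.
global_batch = accelerator.num_processes * batch_size
usable = (len(data) // global_batch) * global_batch
dataloader = DataLoader(Subset(data, list(range(usable))), shuffle=False, batch_size=batch_size)
dataloader = accelerator.prepare(dataloader)             # no samples are duplicated now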

shliu0 (Author) commented Jan 10, 2024

Thanks a lot! By the way, the maximum number of duplicated samples is #processes * #batch_size - 1, rather than #processes, right? As in the following example: with 3 processes, 10 datapoints and batch_size=3, I got 8 duplicated datapoints. If #processes * #batch_size - 1 is not very small compared to the dataset size (like in this case), would setting shuffle=True and drop_last=True be an alternative solution?

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(10)), shuffle=False, batch_size=3)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

# will return 
tensor([0, 1, 2], device='cuda:0')
tensor([9, 0, 1], device='cuda:0')
tensor([3, 4, 5], device='cuda:1')
tensor([2, 3, 4], device='cuda:1')
tensor([6, 7, 8], device='cuda:2')
tensor([5, 6, 7], device='cuda:2')
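
For reference, a quick back-of-the-envelope check of the numbers above (a sketch, assuming the default behaviour where every process is padded to full batches):

num_processes, batch_size, dataset_len = 3, 3, 10
global_batch = num_processes * batch_size           # 9 samples consumed per step
steps = -(-dataset_len // global_batch)             # ceil(10 / 9) = 2 steps
duplicated = steps * global_batch - dataset_len     # 18 - 10 = 8 duplicated samples
print(duplicated)                                   # worst case is global_batch - 1 = 8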

muellerzr (Collaborator) commented
Yes, generally that's what we recommend doing, and then during validation we drop the extra samples with gather_for_metrics for an accurate calculation.
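
A rough sketch of that recommended split (my own illustration with toy data; gather_for_metrics is the Accelerate API mentioned above): shuffle and drop_last=True on the training loader so no sample is duplicated, and gather_for_metrics on the evaluation loader so the padded samples are removed before computing metrics.

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Training: shuffle and drop the last incomplete batch, so nothing gets duplicated.
train_loader = DataLoader(list(range(100)), shuffle=True, batch_size=8, drop_last=True)
# Evaluation: keep every sample; prepare() may pad the last batch with duplicates.
eval_loader = DataLoader(list(range(17)), shuffle=False, batch_size=8)
train_loader, eval_loader = accelerator.prepare(train_loader, eval_loader)

for batch in eval_loader:
    preds = batch  # stand-in for model(batch)
    # gather_for_metrics gathers across processes and drops the duplicated samples
    # from the final batch, so each datapoint is counted exactly once.
    all_preds = accelerator.gather_for_metrics(preds)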

shliu0 (Author) commented Jan 10, 2024

I think passing drop_last=True to the DataLoader may cause problems after prepare(). In the following example, the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], rather than [0, 1, 2, 3, 4, 5, 6, 7]:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(17)), shuffle=False, batch_size=3, drop_last=True)
dataloader = accelerator.prepare(dataloader)
for epoch in range(1):
    for batch in dataloader:
        print(f"epoch-{epoch},{batch}")
        all_batch = accelerator.gather_for_metrics(batch)
        if accelerator.is_main_process:
            print(f"epoch-{epoch},{all_batch}")
    accelerator.wait_for_everyone()

# will return
epoch-0,tensor([0, 1, 2], device='cuda:0')
epoch-0,tensor([3, 4, 5], device='cuda:1')
epoch-0,tensor([6, 7, 8], device='cuda:2')
epoch-0,tensor([0, 1, 2, 3, 4, 5, 6, 7], device='cuda:0')


github-actions bot commented Feb 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

shliu0 (Author) commented Mar 6, 2024

Can someone please take a look at this?

muellerzr reopened this Mar 6, 2024
muellerzr self-assigned this Mar 6, 2024
github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
