Slower multi-GPU training with DynamicBucketingSampler vs BucketingSampler #857
Unanswered
david20181 asked this question in Q&A
-
Thank you David, very cool observation!! I did not realize that. I'll look into adapting the implementation to follow your suggestion.
-
@david20181 Have you solved this problem?
-
This issue has been addressed in #1341
-
When I switch from BucketingSampler to DynamicBucketingSampler, training time increases for multi-GPU training. It seems that the two samplers draw from the duration buckets differently. Specifically, I've observed that when I use BucketingSampler, the GPUs get batches from the same duration bucket at each step. I see different behavior for DynamicBucketingSampler: the GPUs seem to be getting batches from different duration buckets. As a result, some batches finish more quickly than others, and GPUs sit idle while the slower batches finish.

Would it be possible for DynamicBucketingSampler to have the same behavior as BucketingSampler, i.e. that for multi-GPU training we draw from the same duration bucket at each step?
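The suggestion in the question can be made concrete with a toy sketch. Assuming every rank builds an identical bucket layout and shares the same seed, a rank-independent RNG is enough to make all GPUs pick the same duration bucket at each step, so per-step batch durations stay comparable across ranks. The `SyncedBucketSampler` class and its interface below are hypothetical, made up for illustration; they are not Lhotse's API.

```python
# Hypothetical sketch: all ranks choose the same duration bucket per step,
# and only differ in which items they take from that bucket.
import random
from typing import Iterator, List, Sequence, Tuple

Cut = Tuple[str, float]  # (utterance_id, duration_in_seconds)


class SyncedBucketSampler:
    def __init__(
        self,
        buckets: Sequence[List[Cut]],  # cuts pre-sorted into duration buckets
        batch_size: int,
        rank: int,
        world_size: int,
        seed: int = 0,
    ):
        self.buckets = [list(b) for b in buckets]
        self.batch_size = batch_size
        self.rank = rank
        self.world_size = world_size
        self.seed = seed

    def __iter__(self) -> Iterator[List[Cut]]:
        # The RNG depends only on the seed, never on the rank, so every rank
        # draws the identical sequence of bucket indices.
        rng = random.Random(self.seed)
        cursors = [0] * len(self.buckets)
        while True:
            # Buckets that still hold a full step's worth of data for all ranks.
            step = self.batch_size * self.world_size
            active = [i for i, b in enumerate(self.buckets)
                      if cursors[i] + step <= len(b)]
            if not active:
                return
            bucket_idx = rng.choice(active)  # identical choice on every rank
            start = cursors[bucket_idx] + self.rank * self.batch_size
            yield self.buckets[bucket_idx][start:start + self.batch_size]
            cursors[bucket_idx] += step


# Example: two toy buckets, two "GPUs" stepping in lockstep over the same buckets.
short = [(f"short-{i}", 3.0) for i in range(8)]
long = [(f"long-{i}", 12.0) for i in range(8)]
for rank in range(2):
    sampler = SyncedBucketSampler([short, long], batch_size=2, rank=rank, world_size=2)
    print(rank, [[c[0] for c in batch] for batch in sampler])
```

Lhotse's real samplers have different internals and constructor arguments, so this is only meant to make the "same bucket per step" idea concrete; per the reply above, the actual change landed in #1341.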