
Distributed pretraining dataset question #22

Open
sangmichaelxie opened this issue Jun 4, 2022 · 3 comments

Comments

@sangmichaelxie

In the above line, the global_rank is set to 0 for all workers, meaning that the function will return the same file_index for all the workers. If world_size = 8, then it seems like this code is reading every 8th file and skipping the files in between. Can you explain why this is done? Thanks.
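For context, here is a minimal, hypothetical sketch of the round-robin file sharding pattern being described (the names `get_file_index` and `num_files` are illustrative, not the repository's actual API). Hard-coding the rank to 0 makes every worker compute the identical file sequence, so with `world_size = 8` all ranks read files 0, 8, 16, ... and the files in between are never touched:

```python
# Hypothetical sketch of round-robin file sharding; not the repository's code.
def get_file_index(shard_idx, global_rank, world_size, num_files):
    # Each rank is meant to own every world_size-th file, offset by its rank.
    return (shard_idx * world_size + global_rank) % num_files

world_size, num_files = 8, 32

for rank in range(world_size):
    # The line under discussion forces global_rank to 0, so the real rank is ignored.
    indices = [get_file_index(s, global_rank=0, world_size=world_size, num_files=num_files)
               for s in range(4)]
    print(f"rank {rank}: reads files {indices}")
    # Every rank prints [0, 8, 16, 24]; files 1-7, 9-15, ... are skipped.
```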

@peteriz
Contributor

peteriz commented Jun 9, 2022

Hi @sangmichaelxie, you're right, that line shouldn't be there. Thanks for letting us know!

peteriz added a commit that referenced this issue Jun 9, 2022
@Xinpeng-Wang

Hi @peteriz, there seems to be an issue if the line global_rank = 0 is deleted. With different workers reading different shards, the total number of iterations per epoch differs between workers, so at the end of an epoch the synchronization hangs and training gets stuck. With global_rank = 0 the issue disappears, since the torch data sampler then gives every worker the same amount of data, but that brings back the problem @sangmichaelxie described: only every 8th file is read.
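To illustrate the hang described above (the numbers are made up, not measured from this repository): DistributedDataParallel runs a gradient all-reduce every step, and every rank must join it, so if one rank's shard yields fewer batches it exits its epoch loop early and the remaining ranks block inside the collective:

```python
# Illustrative only: made-up file sizes showing how per-rank shards end up
# with different batch counts when each rank reads its own files.
file_sizes = [1000] * 15 + [250]      # samples per file; the last file is short
world_size, batch_size = 8, 32

for rank in range(world_size):
    shard = file_sizes[rank::world_size]           # round-robin file assignment
    num_batches = sum(shard) // batch_size
    print(f"rank {rank}: {num_batches} batches per epoch")
# rank 7 gets fewer batches; once it finishes its epoch it stops joining the
# per-step all-reduce, and the other ranks wait on that collective indefinitely.
```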

@leoozy

leoozy commented Jul 12, 2022

Hello, if you simply fix this error by setting global_rank = 0, then 7/8 of the dataset is never trained on in an epoch. And if all workers read the same data file, multi-GPU training becomes meaningless, because every process works on the same data. Could you fix the hang in the current code? @peteriz
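One possible direction, sketched under assumptions rather than taken from a committed fix: shard the files by the real rank so the whole corpus is covered each epoch, then have the ranks agree on the minimum per-rank batch count so every worker stops at the same step and the epoch-end synchronization cannot hang. The helper names below are hypothetical.

```python
# Hypothetical sketch, not the repository's implementation.
import torch
import torch.distributed as dist

def shard_files(file_list, global_rank, world_size):
    # Use the real rank: rank r reads files r, r + world_size, r + 2*world_size, ...
    # so together the ranks cover every file once per epoch.
    return file_list[global_rank::world_size]

def equalized_num_batches(local_num_batches):
    # All ranks agree on the smallest per-rank batch count, so every worker
    # runs the same number of steps and the collectives stay in lock-step.
    n = torch.tensor(local_num_batches, dtype=torch.long)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return int(n.item())
```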
