Distributed pretraining dataset question #22
Hi @sangmichaelxie, you're right, that line should be there. Thanks for letting us know!
Hi @peteriz, there seems to be an issue if the line global_rank = 0 is deleted. With different workers reading different shards, the total number of iterations per worker in an epoch differs, so at the end of an epoch the workers fail to synchronize and training gets stuck. With global_rank = 0 the issue disappears, since the torch data sampler then gives each worker the same amount of data. But this brings back the problem @sangmichaelxie described: only every 8th file is read and the files in between are skipped.
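A minimal sketch of one common workaround for the hang described above (this is not code from the repo, and the names are hypothetical): have all ranks agree on the minimum number of batches any rank can produce, and stop every rank at that count, so collective operations never wait on a rank that has already exhausted its shard.

```python
import torch
import torch.distributed as dist

def min_batches_across_ranks(local_num_batches: int, device: torch.device) -> int:
    """Return the smallest per-rank batch count so every rank runs the same
    number of iterations. Assumes torch.distributed is already initialized;
    with the NCCL backend the tensor must live on the rank's CUDA device."""
    t = torch.tensor([local_num_batches], dtype=torch.long, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # take the MIN over all ranks
    return int(t.item())

# Hypothetical usage inside a training loop:
# num_batches = min_batches_across_ranks(len(local_loader), device)
# for step, batch in zip(range(num_batches), local_loader):
#     ...
```

This trades a small amount of data at the end of each epoch for deterministic epoch boundaries across ranks; the skipped tail can be shuffled back in on the next epoch.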
Hello, if you simply fix this error by setting global_rank = 0, then 7/8 of the dataset is never trained on in an epoch. And if all workers work on the same data file, multi-GPU training becomes meaningless, because all processes work on the same data. Could you fix the hanging problem in the current code? @peteriz
academic-budget-bert/pretraining/dataset/distributed_pretraining_dataset.py
Line 280 in ea00083
In the above line, the global_rank is set to 0 for all workers, meaning that the function will return the same file_index for all the workers. If world_size = 8, then it seems like this code is reading every 8th file and skipping the files in between. Can you explain why this is done? Thanks.
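For clarity, a minimal sketch of the shard-selection arithmetic being discussed (this is a hypothetical reconstruction, not the exact function at line 280):

```python
def file_index_for(index: int, global_rank: int, world_size: int, num_files: int) -> int:
    """Map a per-worker file counter to a data file, striding by world_size."""
    return (index * world_size + global_rank) % num_files

# With world_size = 8 and global_rank forced to 0, every rank maps
# index = 0, 1, 2, ... to files 0, 8, 16, ..., so all ranks read the same
# files and 7 out of every 8 files are skipped within an epoch.
print([file_index_for(i, global_rank=0, world_size=8, num_files=64) for i in range(4)])
# -> [0, 8, 16, 24]
```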