
Distributed pretraining dataset question #22

Open
sangmichaelxie opened this issue Jun 4, 2022 · 3 comments

Comments

@sangmichaelxie

In the above line, the global_rank is set to 0 for all workers, meaning that the function will return the same file_index for all the workers. If world_size = 8, then it seems like this code is reading every 8th file and skipping the files in between. Can you explain why this is done? Thanks.
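For context, here is a minimal, hypothetical sketch of the round-robin file sharding pattern being described (the names `get_file_index` and `num_files` are illustrative, not the repository's actual API). Hard-coding the rank to 0 makes every worker compute the identical file sequence, so with `world_size = 8` all ranks read files 0, 8, 16, ... and the files in between are never touched:

```python
# Hypothetical sketch of round-robin file sharding; not the repository's code.
def get_file_index(shard_idx, global_rank, world_size, num_files):
    # Each rank is meant to own every world_size-th file, offset by its rank.
    return (shard_idx * world_size + global_rank) % num_files

world_size, num_files = 8, 32

for rank in range(world_size):
    # The line under discussion forces global_rank to 0, so the real rank is ignored.
    indices = [get_file_index(s, global_rank=0, world_size=world_size, num_files=num_files)
               for s in range(4)]
    print(f"rank {rank}: reads files {indices}")
    # Every rank prints [0, 8, 16, 24]; files 1-7, 9-15, ... are skipped.
```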

@peteriz
Contributor

peteriz commented Jun 9, 2022

Hi @sangmichaelxie, you're right, that line shouldn't be there. Thanks for letting us know!

peteriz added a commit that referenced this issue Jun 9, 2022
@Xinpeng-Wang

Hi @peteriz, there seems to be an issue if the line global_rank = 0 is deleted. With different workers reading different shards, the total number of iterations per epoch differs between workers, so at the end of an epoch the synchronization hangs and training gets stuck. With global_rank = 0 the issue disappears, since the torch data sampler then gives every worker the same amount of data, but that brings back the problem @sangmichaelxie described: only every 8th file is read.
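To illustrate the hang described above (the numbers are made up, not measured from this repository): DistributedDataParallel runs a gradient all-reduce every step, and every rank must join it, so if one rank's shard yields fewer batches it exits its epoch loop early and the remaining ranks block inside the collective:

```python
# Illustrative only: made-up file sizes showing how per-rank shards end up
# with different batch counts when each rank reads its own files.
file_sizes = [1000] * 15 + [250]      # samples per file; the last file is short
world_size, batch_size = 8, 32

for rank in range(world_size):
    shard = file_sizes[rank::world_size]           # round-robin file assignment
    num_batches = sum(shard) // batch_size
    print(f"rank {rank}: {num_batches} batches per epoch")
# rank 7 gets fewer batches; once it finishes its epoch it stops joining the
# per-step all-reduce, and the other ranks wait on that collective indefinitely.
```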

@leoozy

leoozy commented Jul 12, 2022

Hello, if you simply fix this error by setting global_rank = 0, then 7/8 of the dataset is never trained on in an epoch. And if all workers read the same data file, multi-GPU training becomes meaningless, because every process works on the same data. Could you fix the hang in the current code? @peteriz
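One possible direction, sketched under assumptions rather than taken from a committed fix: shard the files by the real rank so the whole corpus is covered each epoch, then have the ranks agree on the minimum per-rank batch count so every worker stops at the same step and the epoch-end synchronization cannot hang. The helper names below are hypothetical.

```python
# Hypothetical sketch, not the repository's implementation.
import torch
import torch.distributed as dist

def shard_files(file_list, global_rank, world_size):
    # Use the real rank: rank r reads files r, r + world_size, r + 2*world_size, ...
    # so together the ranks cover every file once per epoch.
    return file_list[global_rank::world_size]

def equalized_num_batches(local_num_batches):
    # All ranks agree on the smallest per-rank batch count, so every worker
    # runs the same number of steps and the collectives stay in lock-step.
    n = torch.tensor(local_num_batches, dtype=torch.long)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return int(n.item())
```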
