Slow dataloading every N batches, where N=num_threads #252

Open
exnx opened this issue Apr 29, 2021 · 2 comments

exnx commented Apr 29, 2021

I’m trying to train a 3D ResNet model for classification, but the training time shows an odd pattern. I have 28 CPUs and use 28 num_workers for data loading. Every 28th iteration takes about a minute, while iterations 1-27 take close to 0 seconds. I tried other values of num_workers and the pattern is the same: every num_workers-th iteration stalls for orders of magnitude longer, while the iterations in between are very fast. Is anybody familiar with this kind of behavior?

I know the data loading is what stalls, because the training code reports total batch time and data loading time separately. My guess is that the dataloader waits for all the workers / threads to finish before proceeding. Does anybody have a remedy for this?
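
For illustration, here is a toy model of the pattern described above. It is only a sketch under assumptions that are not from the training code: batch i is produced by worker i % num_workers, every worker needs the same `load_time` per batch, workers never pause between batches (any prefetch limit is ignored), and batches are consumed in order. The numbers (4 workers, 60 s per batch, 1 s per training step) are made up.

```python
def simulate_waits(num_workers, load_time, step_time, num_batches):
    """Return how long the training loop waits for each batch (toy model)."""
    waits = []
    now = 0.0  # wall-clock time as seen by the training loop
    for i in range(num_batches):
        # Worker i % num_workers finishes its (i // num_workers + 1)-th batch here.
        ready_at = (i // num_workers + 1) * load_time
        wait = max(0.0, ready_at - now)  # zero if the batch is already waiting
        waits.append(wait)
        now += wait + step_time
    return waits


waits = simulate_waits(num_workers=4, load_time=60.0, step_time=1.0, num_batches=12)
print([round(w) for w in waits])
# prints [60, 0, 0, 0, 56, 0, 0, 0, 56, 0, 0, 0]
```

In this model the periodic stall does not need any explicit "wait for all workers" step. It appears whenever one worker takes longer to build a batch than the training loop takes to consume num_workers batches, because all workers then finish at about the same moment.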

@guilhermesurek

Hello @exnx, I do not have an answer, but I can share what I went through.

First I tried to understand the CPU/GPU training time for a fixed batch size. Second, I tried to understand the data loading time by raising and lowering num_workers. Then I looked at how training and data loading time vary with batch_size.
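
A minimal sketch of how such measurements could be taken, splitting wall-clock time into "waiting for data" and "training step". The names `profile_loader`, `train_step`, and `slow_loader` are hypothetical. `loader` can be any iterable, and the fake loader below only sleeps to stand in for real data loading. On a GPU the step would also need to synchronize (for example with `torch.cuda.synchronize()`) before the timer is read, otherwise asynchronous kernels make the step look faster than it is.

```python
import time


def profile_loader(loader, train_step, max_batches=50):
    """Return (data_time, step_time, batches) measured over up to max_batches."""
    data_time = 0.0
    step_time = 0.0
    batches = 0
    end = time.perf_counter()
    for batch in loader:
        start = time.perf_counter()
        data_time += start - end   # time blocked waiting on the loader
        train_step(batch)
        end = time.perf_counter()
        step_time += end - start   # time spent in the training step
        batches += 1
        if batches >= max_batches:
            break
    return data_time, step_time, batches


def slow_loader(n, delay):
    """Fake loader: yields n dummy batches, each after `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_time, step_time, batches = profile_loader(slow_loader(5, 0.02), lambda b: None)
print(f"data {data_time:.2f}s, step {step_time:.2f}s over {batches} batches")
```

Running this once per num_workers or batch_size setting gives the comparison described above.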

Generally, your goal is to maximize the share of time spent training, since compute is the most limited resource. To do this, you need to track your CPU / GPU utilization (%). The more workers you have, the more input your CPU / GPU gets without waiting for data to load (when you say the 28th iteration took 1 min, that is the workers loading the data). However, RAM starts to become a problem unless you have a lot of it. So you will have to balance batch_size and num_workers to reach this goal with the resources you have, along with any other goals that depend on batch_size.
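
A back-of-the-envelope sketch of that balance. Everything here is an assumption for illustration: the helper names are hypothetical, the 60 s per-batch load time is one reading of the "about a min" stall with 28 workers, and the 1 s step time, batch size of 32, and 3x16x112x112 float32 clip shape are made up. The prefetch factor of 2 is the default of PyTorch's `DataLoader`.

```python
import math


def workers_to_keep_up(load_time_per_batch, step_time_per_batch):
    """Workers needed so batches are produced as fast as they are consumed."""
    return math.ceil(load_time_per_batch / step_time_per_batch)


def prefetch_ram_bytes(num_workers, prefetch_factor, batch_size, bytes_per_sample):
    """Rough upper bound on memory held by batches queued ahead of training."""
    return num_workers * prefetch_factor * batch_size * bytes_per_sample


# Hypothetical clip: 3 channels x 16 frames x 112 x 112 pixels, float32 (4 bytes).
bytes_per_sample = 3 * 16 * 112 * 112 * 4

needed = workers_to_keep_up(load_time_per_batch=60.0, step_time_per_batch=1.0)
ram = prefetch_ram_bytes(num_workers=28, prefetch_factor=2,
                         batch_size=32, bytes_per_sample=bytes_per_sample)
print(f"workers needed: {needed}, queued batches: {ram / 2**30:.1f} GiB")
```

With these made-up numbers, more workers than cores would be needed to keep up, which suggests that reducing the per-batch loading cost may matter more than adding workers.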

PS: I think you should have at least 4 workers per CPU.

exnx commented May 2, 2021

Interesting, thanks for the thoughtful response.

So I am using a cluster at school and have 28 CPUs (or cores?) available. I've been using 1 worker per CPU, which I thought was optimal. I can try more, but 4 per CPU sounds super high! I tried fewer workers, but more seems faster overall. I am using 2 GPUs at the moment.
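
One way to check how many CPUs the job can actually use, as a sketch. `usable_cpus` is a hypothetical helper name. `os.sched_getaffinity` exists only on Linux, so the code falls back to `os.cpu_count()` elsewhere. On a shared cluster the two numbers can differ when the scheduler pins the job to a subset of the node's cores.

```python
import os


def usable_cpus():
    """Number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1            # fallback: all logical CPUs


print("logical CPUs on the node:", os.cpu_count())
print("CPUs available to this job:", usable_cpus())
```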
