Suggested Config.py settings for a DGX-1 #5
When do you expect the multi-GPU implementation to be ready? 99% of researchers and AI users run multiple NVIDIA GPUs in a single system for research, testing, and quick training before they pull in the big guns, i.e. grid supercomputers. I am not sure why your team did not consider a multi-GPU implementation first. It would have made your code very efficient at using multiple GPUs, simply by letting the user select the number of GPUs to use: 1, 2, 3, 4, or 8.
@developeralgo8888 We don't have an ETA yet, but we are working on it. As @ifrosio mentioned, a naive multi-GPU implementation does not improve the convergence rate and may cause instabilities. A naive data-parallelism implementation (which I believe is what the 99% of researchers you mention are using) would put more pressure on GA3C's bottleneck, the CPU-GPU communication, and would therefore yield no return. Feel free to implement the code you are suggesting (it shouldn't be more than two lines, as you said), but it is very unlikely to improve performance.
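For concreteness, here is a rough sketch of the naive data parallelism being discussed, assuming TensorFlow 1.x (the framework GA3C targets). The model, names, and values below are illustrative stand-ins, not GA3C's actual code:

```python
import tensorflow as tf  # TF 1.x style, matching the era of GA3C

NUM_GPUS = 4  # illustrative; not a GA3C setting

def tower_loss(states):
    # Stand-in for the real network: one dense layer over flattened frames.
    logits = tf.layers.dense(states, 4, name='policy')
    return tf.reduce_mean(tf.square(logits))

states = tf.placeholder(tf.float32, [None, 84 * 84 * 4])
shards = tf.split(states, NUM_GPUS, axis=0)

optimizer = tf.train.RMSPropOptimizer(1e-4)
tower_grads = []
for i in range(NUM_GPUS):
    # Replicate the model on each GPU, reusing one set of variables.
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        tower_grads.append(optimizer.compute_gradients(tower_loss(shards[i])))

# Average per-tower gradients each step. This cross-device gather adds
# exactly the kind of traffic on the CPU-GPU link described above.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    averaged.append((tf.add_n(grads) / float(NUM_GPUS), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)
```

Every step now has to gather and average gradients across devices, so a saturated CPU-GPU link only gets busier.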
After running _train.sh with the default Config.py on a DGX-1 for about an hour, I see that CPU usage stays fairly constant at about 15%, and a single GPU is used at about 40%.
The settings in Config.py are unchanged: DYNAMIC_SETTINGS = True. The number of trainers varies between 2 and 6, the number of predictors between 1 and 2, and the number of agents between 34 and 39. I would have expected them to grow to use the available CPU resources.
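One experiment would be to disable the dynamic adjustment and pin the counts by hand. A sketch of the relevant Config.py knobs follows; the setting names come from the discussion above, and the values are guesses to try on a DGX-1, not maintainer recommendations:

```python
# Hypothetical Config.py overrides for a DGX-1 (2x 20-core Xeons, 8 GPUs).
# Values are starting points to experiment with, not tuned recommendations.
DYNAMIC_SETTINGS = False  # pin the counts instead of letting them adapt
AGENTS = 64               # leave some of the 80 hardware threads for trainers
PREDICTORS = 4
TRAINERS = 4
```

Comparing CPU and GPU utilization under these fixed counts against the dynamic defaults would show whether the adjustment logic, rather than the workload itself, is what caps utilization at 15%.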