GPU memory and multi-GPU mode #22
Comments
I have the same question. I can't run it on a GTX 2080 Ti with 12 GB.
Hi! Sorry for my late reply, and sorry for the inconvenience this might have caused you. I used a Quadro P5000 GPU to train and evaluate the models; it has ~16 GB. However, for me, training uses only ~4-5 GB of memory. For the OOM errors during training, could you tell me whether they happen at some point during training or straight after starting? In the former case, what could be happening is that a batch unluckily samples graphs with many nodes and edges. One possible workaround would be to look at the collate_fn method in the dataloader, and prevent such batches from being fully fed to the model. One way to do that would be by replacing the DataLoader definition line (line 76 in
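For illustration only, here is a minimal sketch of what such a collate_fn could look like. This is not the repository's code; the node budget and all names are hypothetical, and it simply caps the total number of nodes per batch so that one unlucky sample of large graphs cannot blow up GPU memory:

```python
from torch.utils.data import DataLoader
from torch_geometric.data import Batch

MAX_NODES_PER_BATCH = 20000  # hypothetical budget; tune to your GPU


def capped_collate(data_list):
    """Keep adding graphs to the batch until the node budget is exceeded."""
    kept, total_nodes = [], 0
    for data in data_list:
        if kept and total_nodes + data.num_nodes > MAX_NODES_PER_BATCH:
            break  # drop the remaining graphs from this batch
        kept.append(data)
        total_nodes += data.num_nodes
    return Batch.from_data_list(kept)


# Hypothetical usage:
# loader = DataLoader(dataset, batch_size=8, collate_fn=capped_collate)
```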
As for inference, I have made a major update to the code. Inference should now comfortably run on GPUs with under 10 GB of memory, so it'd be great if you could confirm whether you are still experiencing these issues after pulling from the repo. As for multi-GPU support @fguney, it is not currently planned. It turns out that implementing it is not very straightforward, since the interactions between PyTorch Lightning and PyTorch Geometric's multi-GPU functionality are a bit messy. Since in principle the model should be trainable on smaller GPUs (hopefully with the solution above), I'd like to avoid going into it. But if it's the only way, then I guess I could do it! Best, Guillem
Thank you very much for your reply:
Hi @Newdxz. It seems like you run out of GPU memory when the CNN embeddings are being stored. I believe this is happening because the batch size being used for the CNN is too large. Can you try setting dataset_params.img_batch_size to a smaller value?
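As a rough illustration of why this parameter matters (a sketch only; the function and variable names below are hypothetical, not the repository's identifiers): the CNN processes image crops in chunks of img_batch_size, so peak GPU memory scales with that chunk size.

```python
import torch


def embed_in_chunks(cnn, crops, img_batch_size=50):
    """Compute CNN embeddings chunk by chunk so peak GPU memory stays bounded."""
    cnn.eval()
    outputs = []
    with torch.no_grad():
        for start in range(0, crops.size(0), img_batch_size):
            chunk = crops[start:start + img_batch_size].cuda()
            outputs.append(cnn(chunk).cpu())  # move each chunk's embeddings off the GPU
    return torch.cat(outputs)
```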
Thank you for your reply. When I set dataset_params.img_batch_size=3000, the problem still exists, and when I change img_batch_size to 50, the problem still exists. Is this related to the previous steps? Thank you.
Can you please send me a screenshot (no copy-paste, please) of your entire output when you set img_batch_size to 50? I want to see how the configuration gets printed out. Thanks!
I see! I believe the problem was that the config was being loaded from the checkpoint rather than from your config file/command-line options. It should be fixed now. Could you please pull again and run the same command?
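A hedged sketch of the kind of fix described above (all names hypothetical, not the repository's actual code): when resuming from a checkpoint, user-supplied options should override the values stored with it rather than being silently discarded.

```python
def merge_configs(checkpoint_config: dict, user_config: dict) -> dict:
    """Let user-supplied options win over values stored in the checkpoint."""
    merged = dict(checkpoint_config)
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value
    return merged


# Hypothetical usage:
# config = merge_configs(ckpt_cfg, {'dataset_params': {'img_batch_size': 50}})
```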
Thanks, with your help it works normally now. Thank you again.
Hi,
what kind of GPU did you use to train the model? I cannot run the code (neither training nor evaluation) on a 2-GPU machine (12 GB each) without running into memory problems. Is multi-GPU support planned?
Thanks!