Abnormal GPU 0 mem consumption when calling torch.cuda.synchronize() at creating envs #1702

Open
hnyu opened this issue Sep 15, 2024 · 0 comments
hnyu commented Sep 15, 2024

I noticed that after PR #1692, when training a job that is already GPU-memory intensive, GPU 0 uses much more memory than normal.
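My guess at the cause (not verified): torch.cuda.synchronize() called without a device argument lazily initializes CUDA on the current device, which defaults to cuda:0 in any env worker process that never calls torch.cuda.set_device(). Each worker would then hold a CUDA context on GPU 0 even if it only ever computes on another GPU. A minimal sketch of the two patterns; the function names here are illustrative, not our actual env-creation code:

```python
import torch

# Hypothetical worker setup; names are illustrative only.
def init_env_worker(worker_gpu: int):
    # A bare synchronize() lazily initializes CUDA on the current device,
    # which is cuda:0 unless set_device() was called first -- so every
    # worker process ends up holding a context (hundreds of MiB) on GPU 0,
    # regardless of which GPU its env actually uses.
    torch.cuda.synchronize()

def init_env_worker_scoped(worker_gpu: int):
    # Pinning the process to its own GPU and synchronizing only that device
    # keeps GPU 0 untouched by the other workers.
    torch.cuda.set_device(worker_gpu)
    torch.cuda.synchronize(worker_gpu)
```

If that is what is happening, the extra ~800 MiB "C" entries per worker in the first table below would just be default-device CUDA contexts.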

For example, for Hobot agent training with 2 envs per GPU and 5 eval envs, I checked GPU 0 memory usage after the training actually started.

With torch.cuda.synchronize() (20 GB total on GPU 0):
| 0 N/A N/A 1586336 C+G python 4539MiB |
| 0 N/A N/A 1586384 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586385 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586552 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586553 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1586554 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586555 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587277 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1587279 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587558 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587765 C /usr/bin/python 823MiB |
| 0 N/A N/A 1588504 C+G /usr/bin/python 2385MiB |
| 0 N/A N/A 1589074 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1589550 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589685 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589839 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589973 C+G /usr/bin/python 941MiB |

Without torch.cuda.synchronize() (10 GB total on GPU 0):
| 0 N/A N/A 1581927 C+G python 4539MiB |
| 0 N/A N/A 1581985 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581986 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581987 C /usr/bin/python 825MiB |
| 0 N/A N/A 1582144 G /usr/bin/python 118MiB |
| 0 N/A N/A 1582679 G /usr/bin/python 118MiB |
| 0 N/A N/A 1583963 C+G /usr/bin/python 2369MiB |
| 0 N/A N/A 1584240 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584399 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584532 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584663 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584795 G /usr/bin/python 118MiB |
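
For reference, the per-process numbers above are from nvidia-smi; the same check can be scripted, e.g. with the nvidia-ml-py (pynvml) bindings. This is just a convenience sketch, not part of the training code:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# List compute ("C") processes and their memory usage on GPU 0.
for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    mem_mib = p.usedGpuMemory // (1024 * 1024) if p.usedGpuMemory else 0
    print(f"pid={p.pid} used={mem_mib}MiB")

pynvml.nvmlShutdown()
```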

This becomes a serious problem if we want to increase the number of envs per GPU, because it triggers CUDA out-of-memory errors.
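
A possible mitigation (untested, just a sketch): guard the synchronize call so it never initializes CUDA as a side effect and only ever touches the device the caller actually uses:

```python
import torch

def maybe_synchronize(device):
    # Skip if this process has not initialized CUDA yet, so the call never
    # creates a CUDA context (on GPU 0 or anywhere else) as a side effect.
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        # Synchronize only the given device, not the process-wide default.
        torch.cuda.synchronize(device)
```

Whether this preserves whatever PR #1692 needed the synchronization for is a separate question.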
