Fix dmoe tests GPU OOM #1216

Merged (1 commit) on May 16, 2024

snarayan21
Contributor

GPU memory does not appear to be freed between tests, and the torch dmoe tests in particular are hitting GPU OOM in the private repo. Calling torch.cuda.empty_cache() does not help because the memory appears to be non-releasable (see below), so some object(s) holding a lot of memory are apparently still alive between tests. I tried to dig into the memory leak but decided to just cut some extraneous test cases instead. This PR brings public foundry in line with the corresponding changes in private.

Result of calling print(torch.cuda.memory_summary()) right before the last dmoe test:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Active memory         | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Requested memory      | 255744 B   | 320295 KiB |   6884 MiB |   6884 MiB |
|       from large pool |      0 B   | 315536 KiB |   6137 MiB |   6137 MiB |
|       from small pool | 255744 B   |  15564 KiB |    747 MiB |    746 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   4096 KiB | 542720 KiB |   3188 MiB |   3184 MiB |
|       from large pool |      0 KiB | 536576 KiB |   3000 MiB |   3000 MiB |
|       from small pool |   4096 KiB |  16384 KiB |    188 MiB |    184 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   3789 KiB | 223737 KiB |   2031 MiB |   2027 MiB |
|       from large pool |      0 KiB | 221040 KiB |   1234 MiB |   1234 MiB |
|       from small pool |   3789 KiB |   8126 KiB |    796 MiB |    792 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     217    |     304    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     301    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| Active allocs         |     217    |     305    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     302    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |      11    |     180    |     178    |
|       from large pool |       0    |       3    |      86    |      86    |
|       from small pool |       2    |       8    |      94    |      92    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      33    |      40    |    9083    |    9050    |
|       from large pool |       0    |       2    |      95    |      95    |
|       from small pool |      33    |      40    |    8988    |    8955    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
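For anyone who wants to keep chasing the leak, here is a minimal sketch of how the numbers above could be compared before and after a single test. It is not part of this PR. The helper names `snapshot` and `memory_growth` are hypothetical; the stat keys are the ones returned by `torch.cuda.memory_stats()`, where `inactive_split_bytes` is the value that `memory_summary()` labels "Non-releasable memory".

```python
import gc

# Keys as reported by torch.cuda.memory_stats(). "inactive_split_bytes" is
# the value memory_summary() prints as "Non-releasable memory".
TRACKED_KEYS = (
    "allocated_bytes.all.current",
    "reserved_bytes.all.current",
    "inactive_split_bytes.all.current",
)


def snapshot(device=None):
    """Collect garbage, empty the CUDA cache, and return the tracked stats.

    Hypothetical helper: returns zeros when CUDA is unavailable.
    """
    import torch  # imported lazily so memory_growth() works without torch

    gc.collect()
    if not torch.cuda.is_available():
        return {key: 0 for key in TRACKED_KEYS}
    torch.cuda.empty_cache()
    stats = torch.cuda.memory_stats(device)
    return {key: stats.get(key, 0) for key in TRACKED_KEYS}


def memory_growth(before, after):
    """Return the per-key growth in bytes between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in TRACKED_KEYS}
```

Taking `snapshot()` at the start and end of each test (for example from an autouse pytest fixture) and logging `memory_growth(before, after)` should show which test leaves allocated or reserved bytes behind, assuming the leak is visible to the caching allocator at all.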

@snarayan21 snarayan21 merged commit 3a15082 into main May 16, 2024
9 checks passed
@mvpatel2000 mvpatel2000 deleted the saaketh/test_oom_fix branch May 16, 2024 19:23