Fix dmoe tests GPU OOM #1216

snarayan21 · 2024-05-16T18:36:07Z

There seems to be an issue with GPU memory not being freed between tests; specifically, the torch dmoe tests are causing GPU OOM in private. Calling torch.cuda.empty_cache() does not help, since this appears to be non-releasable memory (see below), so apparently some object(s) that take up a lot of memory are still being held between tests. I tried to dig into the memory leak but decided to just cut out some extraneous test cases. This PR brings public foundry in line with the changes in private here
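For anyone who wants to pick the investigation back up, here is a minimal sketch of a helper that could be called between tests to see whether memory is actually released. It is not part of this PR: the function names `cuda_memory_snapshot` and `release_and_snapshot` are hypothetical, and the guarded import is only there so the snippet also loads on a machine without PyTorch or without a GPU.

```python
import gc

try:
    import torch
except ImportError:  # assumption: let the sketch load where PyTorch is absent
    torch = None


def cuda_memory_snapshot(device=None):
    """Return current CUDA allocator counters in bytes, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    stats = torch.cuda.memory_stats(device)
    return {
        # Bytes currently occupied by live tensors.
        "allocated": torch.cuda.memory_allocated(device),
        # Bytes held by the caching allocator, in use or not.
        "reserved": torch.cuda.memory_reserved(device),
        # Reported as "Non-releasable memory" by memory_summary().
        "non_releasable": stats.get("inactive_split_bytes.all.current", 0),
    }


def release_and_snapshot(device=None):
    """Collect unreachable Python objects, empty the CUDA cache, then snapshot."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return cuda_memory_snapshot(device)


if __name__ == "__main__":
    print(release_and_snapshot())
```

Comparing the snapshot taken after one test with the snapshot taken after the next would show whether "allocated" keeps growing, which would point at live references rather than allocator caching.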

Result of calling print(torch.cuda.memory_summary()) right before the last dmoe test:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Active memory         | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Requested memory      | 255744 B   | 320295 KiB |   6884 MiB |   6884 MiB |
|       from large pool |      0 B   | 315536 KiB |   6137 MiB |   6137 MiB |
|       from small pool | 255744 B   |  15564 KiB |    747 MiB |    746 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   4096 KiB | 542720 KiB |   3188 MiB |   3184 MiB |
|       from large pool |      0 KiB | 536576 KiB |   3000 MiB |   3000 MiB |
|       from small pool |   4096 KiB |  16384 KiB |    188 MiB |    184 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   3789 KiB | 223737 KiB |   2031 MiB |   2027 MiB |
|       from large pool |      0 KiB | 221040 KiB |   1234 MiB |   1234 MiB |
|       from small pool |   3789 KiB |   8126 KiB |    796 MiB |    792 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     217    |     304    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     301    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| Active allocs         |     217    |     305    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     302    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |      11    |     180    |     178    |
|       from large pool |       0    |       3    |      86    |      86    |
|       from small pool |       2    |       8    |      94    |      92    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      33    |      40    |    9083    |    9050    |
|       from large pool |       0    |       2    |      95    |      95    |
|       from small pool |      33    |      40    |    8988    |    8955    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
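As a reading aid for the Cur Usage column, the sketch below redoes the unit conversion by hand. The three input numbers are copied from the summary above; the observation that reserved minus allocated matches the non-releasable figure is arithmetic on this one snapshot, not a claim about how the allocator defines that metric in general.

```python
KIB = 1024

# Cur Usage values copied from the memory summary above.
allocated_bytes = 314368     # "Allocated memory", reported in bytes
reserved_kib = 4096          # "GPU reserved memory", reported in KiB
non_releasable_kib = 3789    # "Non-releasable memory", reported in KiB

# Convert the allocated figure to KiB so the rows are comparable.
allocated_kib = allocated_bytes / KIB

# Memory the allocator holds but that no live tensor occupies.
unused_reserved_kib = reserved_kib - allocated_kib

print(f"allocated:            {allocated_kib:.0f} KiB")
print(f"reserved - allocated: {unused_reserved_kib:.0f} KiB")
print(f"non-releasable:       {non_releasable_kib} KiB")
```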

torch dmoe tests gpu oom

3b469a2

snarayan21 requested review from mvpatel2000 and dakinggg May 16, 2024 18:36

mvpatel2000 approved these changes May 16, 2024

snarayan21 merged commit 3a15082 into main May 16, 2024
9 checks passed

mvpatel2000 deleted the saaketh/test_oom_fix branch May 16, 2024 19:23
