
CUDA memory leak #25

Open
linminhtoo opened this issue May 18, 2022 · 8 comments

@linminhtoo

linminhtoo commented May 18, 2022

Hello authors and anyone else who has tried dMaSIF,

do you ever face a CUDA memory leak issue?

I'm training it on a dataset of protein surface patches on an RTX 2080Ti, and the GPU VRAM usage creeps up over time. Even with a small batch size, VRAM usage starts at ~3 GB but slowly climbs to 8 GB after roughly 10 epochs, even though I call torch.cuda.empty_cache() 10 times per epoch.
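For reference, this is roughly how I watch the usage between epochs. It is only a minimal sketch; num_epochs, train_one_epoch, model and dataloader are placeholders for my actual training code:

import torch

for epoch in range(num_epochs):
    train_one_epoch(model, dataloader)  # placeholder for the real training loop
    torch.cuda.empty_cache()  # release cached blocks back to the driver
    allocated_gb = torch.cuda.memory_allocated() / 1024**3  # memory held by live tensors
    reserved_gb = torch.cuda.memory_reserved() / 1024**3    # memory held by the caching allocator
    print(f"epoch {epoch}: allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")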

This is the exact error message:

python3: /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

I have not done any profiling yet, as it is not particularly trivial... but I may not have much of a choice. I suspect the leak comes from PyKeOps; perhaps it holds onto VRAM and does not release it, even with tensor.detach(). Honestly, I don't know how well written PyKeOps is from an engineering standpoint. It was quite difficult to even install it properly and get it to see CUDA, which doesn't inspire confidence, and C++ libraries can have issues with memory management.

I also notice that changing the batch size doesn't change the training speed (in items per second), which is suspicious, since increasing the batch size usually increases throughput.

It would be extremely helpful if anyone has found a workaround for this. Currently, I'm using a bash script that runs the experiment N times (say 10), so even if one run crashes halfway, the next can resume from the last checkpoint. It works and I can train for many epochs (50+), but it is inconvenient and very hacky, to say the least.
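The same restart logic could also live in a small Python wrapper instead of bash. A rough sketch along these lines, where train.py and --resume are placeholders for the actual entry point and checkpoint-resuming flag:

import subprocess

MAX_RESTARTS = 10

for attempt in range(MAX_RESTARTS):
    # placeholder command; resuming from the latest checkpoint is handled inside train.py
    result = subprocess.run(["python3", "train.py", "--resume"])
    if result.returncode == 0:
        break  # training finished normally
    print(f"run {attempt} crashed with exit code {result.returncode}, restarting from checkpoint")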

I truly believe works like this are a step in the right direction for modelling proteins, their properties and interactions. As someone with a chemistry and biology background, this makes much more sense to me than learning naively from 1D string sequences and hoping the model somehow generalises. I hope we can work together to improve this work.

@jeanfeydy
Collaborator

Hi @linminhtoo,

Thanks for your interest in our work!
Could you tell us what version of KeOps you are using?
You can see this with:

import pykeops
print(pykeops.__version__)
print(pykeops.__file__)

There was indeed a memory leak in KeOps v1.5 that was fixed in mid-March with KeOps v2.0, which should also be much easier to install. If you have found another memory leak and it is caused by KeOps, we will try to fix it for v2.1.
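If you do end up upgrading, here is a quick sanity check that the new install is the one actually being picked up; test_torch_bindings has been available in recent releases, so adjust if your version differs:

import pykeops

print(pykeops.__version__)       # should report 2.0 or later
pykeops.test_torch_bindings()    # compiles and runs a tiny reduction to check the CUDA setup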

With respect to your comments:

  • The PyTorch profiler is great, especially the Chrome trace export. If you run your code and upload your .json traces somewhere, I can certainly have a look to see if I catch anything suspicious.
  • Please understand that developing KeOps is highly technical work that we have to do on top of writing scientific papers, etc. There is no big company behind this, just regular academic researchers doing their best to push the field forward. Being able to train advanced networks for protein science on a single gaming GPU (instead of a Google-scale cluster) is already incredible... So sharp edges are to be expected, unfortunately, even if we work hard to smooth them out. Strange C++ bugs are the price to pay for being at the cutting edge of research, just as you would encounter lots of “undocumented behaviour” when designing new experiments in a wet lab :-)

Best regards,
Jean

@jeanfeydy jeanfeydy self-assigned this May 18, 2022
@linminhtoo
Author

linminhtoo commented May 23, 2022

Dear Jean,

Thank you so much for your reply. First of all, I sincerely apologise for the wording and tone of my message; I did not mean to discredit the utility of PyKeOps in any way. It was written out of frustration with several things (generally, reproducing various repos) not working out the way I had expected, and looking back, it was too harsh. Coming from an academic background myself (I worked on retrosynthesis at MIT), I completely understand the struggle of balancing time between research and engineering. It is a shame that there is not a higher level of support and funding for such groundbreaking work. Please keep up the great work!

I have checked my version and it is indeed v2.0, so I believe this is a new memory leak. My team at our startup is still very much interested in using dMaSIF to learn good representations of proteins, so I am committed to following this through. I will re-run one of my experiments with the PyTorch Profiler and send you the .json trace as soon as possible. If there are any specific arguments or settings I should use so that the Profiler fully traces PyKeOps, please let me know.

Thanks again,
Min Htoo

@linminhtoo
Author

linminhtoo commented Jul 5, 2022

Hello @jeanfeydy, sorry for the delay; I was busy with other projects at work, but I finally got around to running the torch profiler.

I'm not sure what the best way to do this is, but I followed the instructions in this section: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-profiler-to-analyze-long-running-jobs

from functools import partial

import torch
from torch.profiler import profile, ProfilerActivity

def trace_handler(p, current_epoch: int = 0):
    # adapted from the PyTorch documentation; model_path is defined elsewhere in my script
    output = p.key_averages().table(sort_by="cuda_memory_usage", row_limit=10)
    print(output)
    p.export_chrome_trace(str(model_path / f"{current_epoch}trace_{p.step_num}.json"))

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    schedule=torch.profiler.schedule(wait=1, warmup=20, active=1),
    on_trace_ready=partial(trace_handler, current_epoch=epoch),
) as p:
    for batch_data in dataloader:
        # ... prepare the data, move it to the GPU, etc.
        preds = model(batch_data)
        p.step()
I wrap the context manager around my training loop, so for each batch I call p.step() immediately after preds = model(batch_data).

I realized that even with a small value for the "active" parameter, saving the Chrome trace takes a very long time and the resulting file is enormous (>1 GB), or the program simply runs out of CPU RAM. So I could only get two kinds of traces:

  1. active = 1 (monitoring just one batch's forward pass), please see: https://drive.google.com/file/d/1dmFcyF6RJgZXWHhgbYIKuhDDOr37Xfjs/view?usp=sharing
  2. active = 5 (5 batches, but the file is massive at 600 MB), please see: https://drive.google.com/file/d/1H3O_s4pLddmi_8LgtR8ue9iFdocZM3p4/view?usp=sharing

Please let me know if these traces are useful at all, and how else I can provide the information needed to pinpoint the memory leak. Thank you!

@linminhtoo
Author

Hi @jeanfeydy, hope you have been doing well. It's been a while since I heard back from you, so I just wanted to give this a bump. Cheers.

@pearl-rabbit

pearl-rabbit commented Mar 27, 2023

Hello, do you have any further findings regarding this issue, or have you resolved it? I also encountered this error at runtime, but I don't know how to resolve it.

@linminhtoo
Author

> Hello, do you have any further findings regarding this issue, or have you resolved it? I also encountered this error at runtime, but I don't know how to resolve it.

Hi, unfortunately we moved on to a different project and I didn't manage to find a fix. The nasty but working solution was to simply restart training from the last checkpoint whenever it crashed due to OOM.

@rubenalv

@jeanfeydy, I wonder if you found the issue here? A workaround is to add torch.cuda.empty_cache() after each train/val/test phase within each epoch, but memory still creeps up by a few MB per iteration when gradients are stored. The cache is retaining something, and it is not fully released after loss.backward(). So far, that workaround plus linminhtoo's restart approach is doing the job, but it's not the best way forward.
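For what it's worth, this is roughly how I check what the allocator is holding; just a sketch of standard PyTorch calls, dropped in after each train/val/test loop:

import torch

# after each train/val/test phase within an epoch:
torch.cuda.empty_cache()  # the workaround: return cached blocks to the driver
print(f"{torch.cuda.memory_allocated() / 1024**2:.1f} MB held by live tensors")
print(f"{torch.cuda.memory_reserved() / 1024**2:.1f} MB reserved by the caching allocator")
print(torch.cuda.memory_summary(abbreviated=True))  # breakdown of what is still retained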

@jeanfeydy
Collaborator

jeanfeydy commented Nov 9, 2023

Hi all,

@joanglaunes just fixed a small but steady memory leak in KeOps (getkeops/keops#342). The fix has been merged in the main branch and will be part of our next release. I suspect that this was the problem here. If you see an improvement, please let me know and we will finally be able to close this issue :-)
