[tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error #5740
Comments
@danwexler I just want to make sure there are no memory leaks in your preprocessing code: I assume you have disposed the tensors within the stack array? Can you show the `tf.memory()` output before and after the inference?
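For example, something along these lines, where the model is assumed to already be loaded (just a sketch, names are illustrative):

```js
const tf = require('@tensorflow/tfjs-node-gpu');

// Sketch of the pattern in question: stack per-frame tensors into a batch,
// dispose the individual frame tensors, and compare tf.memory() before and
// after inference. `model` is assumed to be an already-loaded model.
function predictBatch(model, frameTensors) {
  const batch = tf.stack(frameTensors);      // [N, H, W, 3]
  frameTensors.forEach((t) => t.dispose());  // per-frame tensors no longer needed

  console.log('before predict:', tf.memory());
  const out = model.predict(batch);
  batch.dispose();
  console.log('after predict:', tf.memory());
  return out; // caller disposes after reading the data back
}
```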
Yes, apologies, I was just mocking the real function in the bug report. As I said, I print `tf.memory()` after each frame.

Here's a typical output from `tf.memory()`. The allocated memory is the core upscale layer model, after warmup/predict.
@danwexler The other thing I want to confirm: are you using a tfjs model or a TF saved model for inference?
I'm using a pretrained super-resolution model loaded from a cached version of the Idealo ESRGAN. The model is currently loaded from Unpkg at this location:
IOW, this is not a TFJS-provided model from TFHub, and I do believe it is a TF saved model. Please correct me if I'm wrong, as I did not do the original training. I feel very much like I need to understand more about the internals of how models work in order to understand this issue.

I believe these are the model files: gans.zip

Looking at this model file, it seems to be a Keras 2.4.0 model converted using the TFJS Converter v2.0.1.post1.
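For reference, a rough sketch of how such a converted model gets loaded and warmed up (the URL is a placeholder, not the real Unpkg path; whether `loadLayersModel` or `loadGraphModel` applies depends on the converter's output format):

```js
const tf = require('@tensorflow/tfjs-node-gpu');

// The URL below is a placeholder, not the actual Unpkg location.
// A Keras model converted with the TFJS converter is loaded with
// loadLayersModel (tfjs_layers_model output) or loadGraphModel
// (tfjs_graph_model output), depending on how it was converted.
async function loadSuperresModel() {
  const model = await tf.loadLayersModel('https://unpkg.com/<package>/model.json');
  // Warm up once so the weights are uploaded to the GPU before real frames arrive.
  tf.tidy(() => model.predict(tf.zeros([1, 64, 64, 3])));
  return model;
}
```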
FYI, this is all part of an unannounced product in development that allows you to run TFJS models both locally in the browser and at scale on a dedicated cluster of cloud VMs. So I do run this code both in the browser and in Node on the cloud cluster.
@danwexler Are you using CUDA 11.2? I believe TF 2.5.0+ requires at least 11.2. It seems this problem is fixed in the upcoming TF 2.7.0.
Understood. Good info. Unfortunately, 11.2 is not available using the default Google Kubernetes Engine (GKE) nvidia-driver-installer DaemonSet. I've upgraded to using the

I believe there is a way to install a different driver than the one installed on GKE based on the backplane version. Do you know of any documentation or instructions on how to upgrade the CUDA version on a GKE VM via the standard nvidia-driver-installer DaemonSet?

This is not a blocking issue for me during development; I'll be testing workarounds while I wait for the TF 2.7.0 release. However, it would be great if there was a way to reuse existing allocations rather than re-allocating the same large tensors for data pre-fetch and
@danwexler Engine reset would de-allocate all your weight tensors for the model; you would need to recreate and upload them to the GPU again, and I am not sure it will improve GPU memory fragmentation.
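To illustrate that trade-off, a sketch (with `loadSuperresModel` as a hypothetical reload helper):

```js
const tf = require('@tensorflow/tfjs-node-gpu');

// tf.engine().reset() releases every tensor held by the backend, including
// the model's weights, so the model has to be reloaded (and its weights
// re-uploaded to the GPU) afterwards. `loadSuperresModel` is hypothetical.
async function hardReset() {
  tf.engine().reset();                      // drops all tensors, weights included
  const model = await loadSuperresModel();  // must reload from scratch
  return model;
}
```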
Using tfjs-node-gpu on a GKE cluster running on an n1-highmem-8 with an NVIDIA P4 or V100 GPU fails when the cuda_malloc_async allocator is set using `TF_GPU_ALLOCATOR`.
System information
Describe the current behavior
The app is a video filter that loads and applies a super-resolution layer model to each frame in a video file, batching N frames together into a Tensor4D to scale up the resolution by 4x. I run `tf.memory()` after each frame to ensure that I am not leaking any tensors. After processing slightly more than 100 1280x720 frames correctly, TF bails out, dumps the memory allocations, and displays an out-of-memory message.

However, when I do set `TF_GPU_ALLOCATOR=cuda_malloc_async`, my normally correct startup process fails with the CUDA device attribute error referenced in the title (see the attached err.log).
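For reference, a minimal sketch of how the allocator gets selected in this kind of tfjs-node-gpu process (script and file names are placeholders):

```js
// The env var is read by the TF runtime when the GPU device is created,
// so it is set before the native binding is loaded. Equivalent shell form:
//   TF_GPU_ALLOCATOR=cuda_malloc_async node upscale.js
process.env.TF_GPU_ALLOCATOR = 'cuda_malloc_async';
const tf = require('@tensorflow/tfjs-node-gpu');

tf.ready().then(() => console.log('backend:', tf.getBackend()));
```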
Describe the expected behavior
My primary issue is being able to use `model.predict()` on several hundred video frames, grouped together into batches, without running out of memory. I have eliminated any tensor leaks according to `tf.memory()`, so I'm not sure what to try next. I have seen discussions mentioning `tf.engine().startScope()`/`endScope()`, and I can also try `dispose()`ing my model every N frames and re-loading it, or even `tf.engine().reset()` every N frames, but these seem like band-aids for internal TFJS issues.

I do not explicitly allocate any TF variables within my code, so I do not expect `tf.disposeVariables()` to help. Is it possible that the model allocates variables internally that would benefit from running `tf.disposeVariables()` every frame?

I repeat the same allocation pattern for each video frame batch, but I cannot find any way of re-using the existing Tensors to avoid fragmentation.
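A minimal sketch of the per-batch scoping I could try, using `tf.tidy()` as the function-scoped equivalent of startScope/endScope (model and helper names are placeholders, not the actual application code):

```js
const tf = require('@tensorflow/tfjs-node-gpu');

// Wrap each batch in tf.tidy() so every intermediate tensor created while
// building the batch is released as soon as the batch completes.
// `loadBatchAsTensor4d` is a hypothetical helper (see the repro sketch below).
async function processBatch(superresModel, frameFiles) {
  const before = tf.memory().numTensors;

  const output = tf.tidy(() => {
    const t4d = loadBatchAsTensor4d(frameFiles);
    return superresModel.predict(t4d); // returned tensor survives tidy()
  });

  const frames = await output.array(); // read the result back from the GPU
  output.dispose();                    // then release the output tensor

  console.log('tensors leaked this batch:', tf.memory().numTensors - before);
  return frames;
}
```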
Standalone code to reproduce the issue
Producing repro code is possible, but would be a significant effort. If there are no simple answers to this issue, I will take the necessary time to mock up a repro.
Basically, I start by decoding frames into separate files using ffmpeg. Then the processing loop pre-fetches the next batch of N frames (N is typically 1-10) into a Tensor4D by loading the individual frames, roughly as sketched below:
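A rough sketch of that pre-fetch step, with assumed file handling and normalization (not the actual application code):

```js
const fs = require('fs');
const tf = require('@tensorflow/tfjs-node-gpu');

// Load N frame image files and stack them into a [N, H, W, 3] Tensor4D.
// The 0-1 normalization is an assumption about the model's expected input.
function loadBatchAsTensor4d(frameFiles) {
  return tf.tidy(() => {
    const frames = frameFiles.map((file) => {
      const bytes = fs.readFileSync(file);
      return tf.node.decodeImage(bytes, 3).toFloat().div(255); // [H, W, 3]
    });
    // Per-frame tensors are intermediates and are released when tidy() exits.
    return tf.stack(frames);
  });
}
```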
Once pre-fetched, processing is just:
superresModel.predict(t4d)
Once the output batch is finished, I extract the individual frames and save them back to new output files, roughly as sketched below:
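A rough sketch of that save step, again with assumed names and post-processing (the 0-255 rescale assumes the model outputs values in [0, 1]):

```js
const fs = require('fs');
const tf = require('@tensorflow/tfjs-node-gpu');

// Unstack the [N, H, W, 3] prediction into individual frames and write each
// one out as a PNG. Output scaling depends on the model's output range.
async function saveBatch(outputT4d, outFiles) {
  const frames = tf.unstack(outputT4d); // N tensors of shape [H, W, 3]
  for (let i = 0; i < frames.length; i++) {
    const frame = frames[i].mul(255).clipByValue(0, 255).cast('int32');
    const png = await tf.node.encodePng(frame);
    fs.writeFileSync(outFiles[i], png);
    frame.dispose();
    frames[i].dispose();
  }
}
```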
After all batches are finished, I just call ffmpeg again to re-encode the output frame files.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
err.log
nvidia_smi.log