High CUDA Memory Usage in ONNX Runtime with Inconsistent Memory Release #2069
Comments
Hi, the code you provided doesn't explain how you got the chart in your issue, and what is "sample number" in this case?
Hi @IlyasMoutawwakil, the code I provided shows how I'm loading the model and performing inference. I've also included a graph of the GPU memory consumed as inference progresses; I recorded the GPU usage after each inference with the following code:

```python
import subprocess

def get_used_memory_mib(pid):
    # Query per-process GPU memory usage through nvidia-smi.
    result = subprocess.run(
        ['nvidia-smi', '--query-compute-apps=pid,gpu_name,used_memory',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        print("Failed to run nvidia-smi:", result.stderr)
        return None
    gpu_processes = result.stdout.strip().split('\n')
    for process in gpu_processes:
        process_info = process.split(', ')
        process_pid = process_info[0]
        if process_pid == str(pid):
            used_memory_mib = int(process_info[2])
            return used_memory_mib
    return None
```

The graph demonstrates that the model, which I converted to ONNX using the `save_pretrained()` method, does not release memory when it encounters a shorter input sequence after processing a longer one, whereas the PyTorch model releases memory in such cases. I've also plotted another graph showing the input shape (batch size, sequence length).
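For completeness, this is roughly how I invoke the helper during the run; `model` and `batches` below stand in for the loading code already shown, and each batch has a different (batch size, sequence length) shape:

```python
import os

pid = os.getpid()
memory_trace = []

# `model` and `batches` come from the loading code above (not repeated here).
for batch in batches:
    _ = model(**batch)
    memory_trace.append(get_used_memory_mib(pid))

# memory_trace is what the GPU-memory graph in this issue plots per sample.
```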
And I assume "sample number" is supposed to mean sequence length? Edit: okay, thanks, I see the updated graph.
Thanks, @IlyasMoutawwakil. Do you think this is normal behavior for ONNX Runtime?
Must be an issue in the onnxruntime memory arena/allocation, please open an issue there 🤗
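For anyone hitting this in the meantime, here is a sketch of the onnxruntime arena knobs worth trying. `arena_extend_strategy` and the `memory.enable_memory_arena_shrinkage` run-config entry are documented onnxruntime settings, but whether they resolve this particular case is an assumption, and the model path is a placeholder:

```python
import onnxruntime as ort

# Grow the CUDA arena by only what each request needs instead of doubling it.
cuda_options = {"arena_extend_strategy": "kSameAsRequested"}
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to the exported model
    providers=[("CUDAExecutionProvider", cuda_options)],
)

# Ask ORT to shrink the arena of the first CUDA device after each run.
run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# `inputs` is the usual feed dict of input names to arrays.
outputs = session.run(None, inputs, run_options=run_options)
```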
System Info
Who can help?
@JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
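A minimal sketch of the loop that triggers the behavior, assuming a model exported through optimum; the checkpoint name and task class below are placeholders, and `get_used_memory_mib` is the nvidia-smi helper shown in the comments above:

```python
import os

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Placeholder checkpoint; any exported encoder model should show the pattern.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, export=True, provider="CUDAExecutionProvider"
)

pid = os.getpid()
# Run a long sequence first, then a short one: the CUDA memory grabbed for
# the long sequence is not released when the short one runs.
for seq_len in (512, 16):
    inputs = tokenizer(
        "hello " * seq_len, return_tensors="pt",
        truncation=True, max_length=seq_len,
    ).to("cuda")
    _ = model(**inputs)
    print(seq_len, get_used_memory_mib(pid), "MiB")
```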
Expected behavior
I expect the CUDA memory to decrease and be released after processing smaller inputs, optimizing memory usage for subsequent inputs.