High CUDA Memory Usage in ONNX Runtime with Inconsistent Memory Release #2069
Comments
Hi, the code you provided doesn't explain how you got the chart in your issue, and what is "sample number" in this case?
Hi @IlyasMoutawwakil, the code I provided shows how I'm loading the model and performing inference. I've also included a graph of the GPU memory consumed as inference progresses; I recorded the GPU usage after each inference with the following code:

```python
import subprocess

def get_used_memory_mib(pid):
    # Query per-process GPU memory usage through nvidia-smi.
    result = subprocess.run(
        ['nvidia-smi', '--query-compute-apps=pid,gpu_name,used_memory',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        print("Failed to run nvidia-smi:", result.stderr)
        return None
    gpu_processes = result.stdout.strip().split('\n')
    for process in gpu_processes:
        process_info = process.split(', ')
        process_pid = process_info[0]
        if process_pid == str(pid):
            used_memory_mib = int(process_info[2])
            return used_memory_mib
    return None
```

The graph demonstrates that the model, which I converted to ONNX using the `save_pretrained()` method, does not release memory when it encounters a shorter input sequence after processing a longer one, whereas the PyTorch model releases memory in such cases. I've also plotted another graph showing the input shape (batch size, sequence length).
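For completeness, this is roughly how I invoke the helper during the run; `model` and `batches` below stand in for the loading code already shown, and each batch has a different (batch size, sequence length) shape:

```python
import os

pid = os.getpid()
memory_trace = []

# `model` and `batches` come from the loading code above (not repeated here).
for batch in batches:
    _ = model(**batch)
    memory_trace.append(get_used_memory_mib(pid))

# memory_trace is what the GPU-memory graph in this issue plots per sample.
```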
And I assume "sample number" is supposed to mean sequence length? Edit: okay, thanks, I see the updated graph.
Thanks, @IlyasMoutawwakil. Do you think this is normal behavior for ONNX Runtime?
Must be an issue in the onnxruntime memory arena/allocation, please open an issue there 🤗
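For anyone hitting this in the meantime, here is a sketch of the onnxruntime arena knobs worth trying. `arena_extend_strategy` and the `memory.enable_memory_arena_shrinkage` run-config entry are documented onnxruntime settings, but whether they resolve this particular case is an assumption, and the model path is a placeholder:

```python
import onnxruntime as ort

# Grow the CUDA arena by only what each request needs instead of doubling it.
cuda_options = {"arena_extend_strategy": "kSameAsRequested"}
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to the exported model
    providers=[("CUDAExecutionProvider", cuda_options)],
)

# Ask ORT to shrink the arena of the first CUDA device after each run.
run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# `inputs` is the usual feed dict of input names to arrays.
outputs = session.run(None, inputs, run_options=run_options)
```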
System Info
Who can help?
@JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
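A minimal sketch of the loop that triggers the behavior, assuming a model exported through optimum; the checkpoint name and task class below are placeholders, and `get_used_memory_mib` is the nvidia-smi helper shown in the comments above:

```python
import os

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Placeholder checkpoint; any exported encoder model should show the pattern.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, export=True, provider="CUDAExecutionProvider"
)

pid = os.getpid()
# Run a long sequence first, then a short one: the CUDA memory grabbed for
# the long sequence is not released when the short one runs.
for seq_len in (512, 16):
    inputs = tokenizer(
        "hello " * seq_len, return_tensors="pt",
        truncation=True, max_length=seq_len,
    ).to("cuda")
    _ = model(**inputs)
    print(seq_len, get_used_memory_mib(pid), "MiB")
```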
Expected behavior
I expect the CUDA memory to decrease and be released after processing smaller inputs, optimizing memory usage for subsequent inputs.