Is there a way to terminate vllm.LLM and release the GPU memory #1908

sfc-gh-zhwang · 2023-12-04T00:12:27Z

After below code, is there an api(maybe like llm.terminate) to kill llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

The text was updated successfully, but these errors were encountered:

SuperBruceJia · 2023-12-04T00:28:55Z

After below code, is there an api(maybe like llm.terminate) to kill llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

Please check the codes below. It works.

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully delete the llm pipeline and free the GPU memory!")

Best regards,

Shuyue
Dec. 3rd, 2023

hijkzzz · 2023-12-04T01:30:28Z

mark

deepbrain · 2024-02-08T22:21:10Z

Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations?

huylenguyen · 2024-02-24T23:26:53Z

Are there any updates on this? the above code does not work for me either

puddingfjz · 2024-03-01T15:18:19Z

+1

puddingfjz · 2024-03-01T16:31:13Z

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release in when using a single worker.
Can anybody explain why this is the case?

shyringo · 2024-04-23T09:46:04Z

+1

shyringo · 2024-04-24T07:51:57Z

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release in when using a single worker. Can anybody explain why this is the case?

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.

But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down ray in convenience for using vllm.LLM() for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Anyway, even if it works, it is just a temporary solution and this issue still needs fixing.

shyringo · 2024-04-24T09:45:05Z

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.

But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down ray in convenience for using vllm.LLM() for the next model. After this, I succeeded in using vllm.LLM() again for the next model.
        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()
Anyway, even if it works, it is just a temporary solution and this issue still needs fixing.

update:
the following code would work better, without the possible dead lock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

ticoneva · 2024-04-25T10:32:10Z

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

rbao2018 · 2024-05-04T12:29:44Z

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

thx a lot

mmoskal · 2024-05-08T18:03:34Z

vLLM seems to hang to the first allocated LLM() instance. It does not hang to later instances. Maybe that helps with diagnosing the issue?

from vllm import LLM


def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f"  --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")


def gc_problem():
    show_memory_usage()
    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()

root@c09a058c2d5b:/workspaces/aici/py/vllm# python tests/core/block/e2e/gc_problem.py |grep -v INFO
cuda memory: 0MB
  --> after gc: 0MB
loading llm0
cuda memory: 368MB
  --> after gc: 368MB
loading llm1
cuda memory: 912MB
  --> after gc: 368MB
loading llm2
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
cuda memory: 961MB
  --> after gc: 368MB
root@c09a058c2d5b:/workspaces/aici/py/vllm#

The llm1 consumes more than llm0 but you can see that the allocated memory stays at llm0 level.

yudataguy · 2024-05-09T12:13:39Z

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

Tried this including ray.shutdown() but the memory is not released on my end, any other suggestion?

shyringo · 2024-05-09T12:27:25Z

Tried this including ray.shutdown() but the memory is not released on my end, any other suggestion?

could try the "del llm.llm_engine.model_executor" in the following code instead:

update: the following code would work better, without the possible dead lock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

yudataguy · 2024-05-11T01:46:11Z

Tried this including ray.shutdown() but the memory is not released on my end, any other suggestion?

could try the "del llm.llm_engine.model_executor" in the following code instead:
update: the following code would work better, without the possible dead lock warning.
        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

did that as well, still no change in gpu memory allocation. Not sure how to go further

zheyang0825 · 2024-05-11T02:15:38Z

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

We tried this in version 0.4.2, but GPU memory did not released.

shyringo · 2024-05-11T06:46:43Z

did that as well, still no change in gpu memory allocation. Not sure how to go further

Then I do not have a clue either. Meanwhile, I should add an information: the vllm version with which I succeeded with the above code was 0.4.0.post1

mnoukhov · 2024-05-11T16:43:14Z

@zheyang0825 does adding this lines at the end make it work?

torch.distributed.destroy_process_group()

yudataguy · 2024-05-12T01:00:34Z

did that as well, still no change in gpu memory allocation. Not sure how to go further

Then I do not have a clue either. Meanwhile, I should add an information: the vllm version with which I succeeded with the above code was 0.4.0.post1

tried on 0.4.0.post1 and method worked, not sure what changed in the latest version that's not releasing the memory, possible bug?

GurvanR · 2024-05-13T14:22:55Z

Hello ! so if I'm not wrong, no one achieved to release memory on vllm 0.4.2 yet ?

njhill · 2024-05-13T14:30:03Z

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

GurvanR · 2024-05-14T09:11:43Z

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I updated vllm yesterday and still have the problem, I'm using those lines :

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()

Misterrendal · 2024-05-14T14:47:45Z

This code is worked for me

vllm==0.4.0.post1

        import gc
        import ray
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        print('service stopping ..')
        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        destroy_model_parallel()

        del model.llm_engine.model_executor.driver_worker
        del model

        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()

        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        print("service stopped")

cassanof · 2024-05-16T05:14:50Z

There should be a built-in way! We cannot keep writing code that breaks on the next minor release :(

youkaichao · 2024-05-16T05:26:40Z

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and might be prone to deadlocks .

I would say, the most stable way to terminate vLLM is to shut down the process.

Vincent-Li-9701 · 2024-05-20T20:37:40Z

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I encountered this issue when TP = 8. I'm doing this in a iterative manner since I need to run the embedding model after the generative model so there are so loading / offloading. The first iteration is fine but the second iteration the instantiation of vllm ray server hangs.

cassanof · 2024-05-21T06:14:05Z

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and might be prone to deadlocks .

I would say, the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

DuZKai · 2024-05-31T06:11:00Z

I don't know if anyone can currently clear memory correctly, but in version 0.4.2, I applied the above code that failed to clear memory correctly. I can only use a slightly extreme method of creating a new process before calling and closing the process after calling to roughly solve the problem:

from multiprocessing import Process, set_start_method
set_start_method('spawn', force=True)
def vllm_texts(model_path):
    prompts=""
    sampling_params = SamplingParams(max_tokens=512)
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, sampling_params)

...
print(torch.cuda.memory_summary())
p = Process(target=vllm_texts, args=(model_path))
p.start()
p.join()
if p.is_alive():
    p.terminate()
p.close()
print(torch.cuda.memory_summary())
...

I still hope there is a way in the future to correctly and perfectly clear memory

SuperBruceJia · 2024-06-11T01:56:26Z

While I am using multiple GPUs to serve a LLM (tensor_parallel_size > 1), the GPUs' memory is not released, except the first GPU (cuda:0).

ywang96 · 2024-06-19T06:38:47Z

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and might be prone to deadlocks .
I would say, the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow!

Given how much wanted this feature seems to be, I will bring this back to the team to discuss! If multi-gpu instance is prone to deadlocks, then perhaps we can at least start with single-gpu instances. Everyone on the maintainer team does have limited bandwidth and we have a lot of things to work on, so contributions are very welcomed as always!

kota-iizuka · 2024-07-19T06:12:43Z

I tried inferring multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

If you want to unload a model with TP=1 and then load a model with TP=1, the process described in Is there a way to terminate vllm.LLM and release the GPU memory #1908 (comment) will work without any problems.
If you want to unload a model with TP=1 and then load a model with TP=2, or vice versa, in addition to the code above, you need to do export VLLM_WORKER_MULTIPROC_METHOD=spawn in [Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess. #6152 (comment), and run vllm.distributed.parallel_state.destroy_distributed_environment().
- There is an issue where the warning message [Bug]: Producer process has been terminated before all shared CUDA tensors released (v 0.5.0 post1, v 0.4.3) #6025 appears after execution.
The above does not work to unload and reload the model with TP=2. I tried ray.shutdown() but it did not work.

I use this function in a pipeline where I explain images with VLM and then summarize them with LLM. I hope that this kind of processing will be officially provided and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

xansar · 2024-07-29T08:25:35Z

I tried inferring multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

If you want to unload a model with TP=1 and then load a model with TP=1, the process described in Is there a way to terminate vllm.LLM and release the GPU memory #1908 (comment) will work without any problems.

If you want to unload a model with TP=1 and then load a model with TP=2, or vice versa, in addition to the code above, you need to do export VLLM_WORKER_MULTIPROC_METHOD=spawn in [Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess. #6152 (comment), and run vllm.distributed.parallel_state.destroy_distributed_environment().

There is an issue where the warning message [Bug]: Producer process has been terminated before all shared CUDA tensors released (v 0.5.0 post1, v 0.4.3) #6025 appears after execution.

The above does not work to unload and reload the model with TP=2. I tried ray.shutdown() but it did not work.

I use this function in a pipeline where I explain images with VLM and then summarize them with LLM. I hope that this kind of processing will be officially provided and become common.
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

It works for me.

bhavnicksm · 2024-08-05T17:02:20Z

I tried inferring multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

If you want to unload a model with TP=1 and then load a model with TP=1, the process described in Is there a way to terminate vllm.LLM and release the GPU memory #1908 (comment) will work without any problems.

If you want to unload a model with TP=1 and then load a model with TP=2, or vice versa, in addition to the code above, you need to do export VLLM_WORKER_MULTIPROC_METHOD=spawn in [Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess. #6152 (comment), and run vllm.distributed.parallel_state.destroy_distributed_environment().

There is an issue where the warning message [Bug]: Producer process has been terminated before all shared CUDA tensors released (v 0.5.0 post1, v 0.4.3) #6025 appears after execution.

The above does not work to unload and reload the model with TP=2. I tried ray.shutdown() but it did not work.

I use this function in a pipeline where I explain images with VLM and then summarize them with LLM. I hope that this kind of processing will be officially provided and become common.
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

With TP=1, I am able to unload the model without difficulty with the method described above, but the re-loading fails with a esoteric error, like:

File "/home/bhavnick/fd/workspace/vllm-api/modules/llm/generator.py", line 139, in load
    engine = AsyncLLMEngine.from_engine_args(args)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized
    backend = backend or torch.distributed.get_backend(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1074, in get_backend
    return Backend(not_none(pg_store)[0])
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/utils/_typing_utils.py", line 12, in not_none
    raise TypeError("Invariant encountered: value was None when it should not be")
TypeError: Invariant encountered: value was None when it should not be

Some differences I can think of is I am using the AysncLLMEngine instead of the synchronous version, but the unload works just the same.

bhavnicksm · 2024-08-12T15:12:59Z

I tried inferring multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

If you want to unload a model with TP=1 and then load a model with TP=1, the process described in Is there a way to terminate vllm.LLM and release the GPU memory #1908 (comment) will work without any problems.

If you want to unload a model with TP=1 and then load a model with TP=2, or vice versa, in addition to the code above, you need to do export VLLM_WORKER_MULTIPROC_METHOD=spawn in [Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess. #6152 (comment), and run vllm.distributed.parallel_state.destroy_distributed_environment().

There is an issue where the warning message [Bug]: Producer process has been terminated before all shared CUDA tensors released (v 0.5.0 post1, v 0.4.3) #6025 appears after execution.

The above does not work to unload and reload the model with TP=2. I tried ray.shutdown() but it did not work.

I use this function in a pipeline where I explain images with VLM and then summarize them with LLM. I hope that this kind of processing will be officially provided and become common.
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

With TP=1, I am able to unload the model without difficulty with the method described above, but the re-loading fails with a esoteric error, like:

File "/home/bhavnick/fd/workspace/vllm-api/modules/llm/generator.py", line 139, in load
    engine = AsyncLLMEngine.from_engine_args(args)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized
    backend = backend or torch.distributed.get_backend(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1074, in get_backend
    return Backend(not_none(pg_store)[0])
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/utils/_typing_utils.py", line 12, in not_none
    raise TypeError("Invariant encountered: value was None when it should not be")
TypeError: Invariant encountered: value was None when it should not be

Some differences I can think of is I am using the AysncLLMEngine instead of the synchronous version, but the unload works just the same.

UPDATE: Works with ray at TP=2 but for some reason not with multiprocessing with either fork or spawn. Also, only unload works with ray but not reload right now.

jvlinsta · 2024-08-12T15:41:18Z

@bhavnicksm I can reproduce the same error you have.

jvlinsta · 2024-08-12T17:45:46Z

When debugging, I found that when using 'spawn' the main GPU used (if using PCI_BUS) would still keep some small amount of memory allocated, indicating that the clean up is unsuccessful.

When I then check all available GPUs that have zero memory allocated and export CUDA_VISIBLE_DEVICES, I am able to progress into changing into another model until another issue hits that seemingly indicates the workers distributed state is still somewhere in memory.

    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 250, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 874, in init_distributed_environment
    assert _WORLD.world_size == torch.distributed.get_world_size(), (
AssertionError: world group already initialized with a different world **size```

jvlinsta · 2024-08-12T21:00:19Z

Another update:

The problem is the global _WORLD variable that stays None. When overriding the backend to be 'nccl' hardcoded at vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized, the whole state goes in deadlock:

    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 470, in from_engine_args
    engine = cls(
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 250, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 969, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 154, in __init__
    self.pynccl_comm = PyNcclCommunicator(
  File "/home/ubuntu/vllm/vllm/distributed/device_communicators/pynccl.py", line 74, in __init__
    dist.broadcast(tensor, src=ranks[0], group=group)
  File "/opt/conda/envs/llm2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llm2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast

hammer-wang · 2024-10-14T22:14:23Z

When using TP>1 for the first model, it seems there's no working method that can successfully release the GPU memory. I've tried all scripts in this thread and none worked.

njhill · 2024-10-14T22:58:08Z

@hammer-wang what version of vLLM are you using, which distributed backend (ray or multiprocessing) and how are you running the server?

sadanand1120 · 2024-10-31T23:17:04Z

Got it working on vllm 0.5.x
Adding these two flags to my LLM call:

distributed_executor_backend='mp'
disable_custom_all_reduce=True

and

def vllm_cleanup(llm):
    del llm.llm_engine.model_executor
    del llm
    destroy_model_parallel()
    destroy_distributed_environment()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    if not is_cpu():
        torch.cuda.empty_cache()
    ray.shutdown()

using this method after llm.generate works for me

SuperBruceJia · 2024-11-07T02:13:58Z

This setup works perfectly on my end when using multiple GPUs! FYI, I am using pip install vllm==0.6.3.post1:

import gc
import contextlib

import ray
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_model_parallel,
    destroy_distributed_environment,
)


llm = LLM(
    model=save_dir,
    tokenizer=model_name,
    dtype='bfloat16',
    # Acknowledgement: Benjamin Kitor
    # https://github.com/vllm-project/vllm/issues/2794
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    # Note: We add this only to save the GPU Memories!
    max_model_len=max_model_len_vllm,
    disable_custom_all_reduce=True,
    enable_lora=False,
)

# Delete the llm object and free the memory
destroy_model_parallel()
destroy_distributed_environment()
del llm.llm_engine.model_executor
del llm
with contextlib.suppress(AssertionError):
    torch.distributed.destroy_process_group()
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()
print("Successfully delete the llm pipeline and free the GPU memory.")

sfc-gh-zhwang · 2024-12-07T06:28:00Z

why is del llm.llm_engine.model_executor needed?

saattrupdan mentioned this issue Jan 23, 2024

fix: Clear memory after benchmarking with vLLM ScandEval/ScandEval#172

Merged

mnoukhov mentioned this issue Mar 28, 2024

unload the model #3281

Open

shyringo mentioned this issue Apr 24, 2024

vllm hangs when reinitializing ray #1058

Closed

shyringo mentioned this issue Apr 30, 2024

Vllm RayWoker process hangs when use llm engine #2050

Closed

yudataguy mentioned this issue May 8, 2024

[Usage]: Out of Memory w/ multiple models #4678

Closed

kzawora-intel mentioned this issue Jun 7, 2024

Helper scripts for profiling HabanaAI/vllm-fork#53

Closed

lizzzcai mentioned this issue Jun 13, 2024

[Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491

Open

wheresmyhair mentioned this issue Jun 18, 2024

[Feature] vllm inferencer and memory safe vllm inferencer OptimalScale/LMFlow#860

Merged

danmcp mentioned this issue Jul 5, 2024

vllm serving with multiple gpus isn't shutting down properly instructlab/instructlab#1601

Closed

R-C101 mentioned this issue Jul 23, 2024

[Bug]: CUDA OOM error when loading another model after exiting the first one. #6682

Closed

spigo900 mentioned this issue Oct 15, 2024

GPU memory not freed after deleting an InferenceSetup ChiSym/genparse#84

Closed

e7217 mentioned this issue Nov 28, 2024

fix: refactor method to properly release vllm instance resources Marker-Inc-Korea/AutoRAG#1012

Merged

butsugiri mentioned this issue Dec 6, 2024

Upgrade vLLM from v0.5 series to v0.6 sbintuitions/flexeval#103

Merged

NathanHB mentioned this issue Jan 30, 2025

Adds vllm as backend huggingface/smolagents#337

Open

robertgshaw2-redhat closed this as completed Feb 8, 2025

Is there a way to terminate vllm.LLM and release the GPU memory #1908

Is there a way to terminate vllm.LLM and release the GPU memory #1908

Comments

sfc-gh-zhwang commented Dec 4, 2023

SuperBruceJia commented Dec 4, 2023

hijkzzz commented Dec 4, 2023

deepbrain commented Feb 8, 2024

huylenguyen commented Feb 24, 2024

puddingfjz commented Mar 1, 2024

puddingfjz commented Mar 1, 2024

shyringo commented Apr 23, 2024

shyringo commented Apr 24, 2024

shyringo commented Apr 24, 2024 • edited Loading

ticoneva commented Apr 25, 2024

rbao2018 commented May 4, 2024

mmoskal commented May 8, 2024

yudataguy commented May 9, 2024

shyringo commented May 9, 2024

yudataguy commented May 11, 2024

zheyang0825 commented May 11, 2024 • edited Loading

shyringo commented May 11, 2024

mnoukhov commented May 11, 2024

yudataguy commented May 12, 2024

GurvanR commented May 13, 2024

njhill commented May 13, 2024

GurvanR commented May 14, 2024

Misterrendal commented May 14, 2024

cassanof commented May 16, 2024

youkaichao commented May 16, 2024

Vincent-Li-9701 commented May 20, 2024

cassanof commented May 21, 2024 • edited Loading

DuZKai commented May 31, 2024

SuperBruceJia commented Jun 11, 2024 • edited Loading

ywang96 commented Jun 19, 2024

kota-iizuka commented Jul 19, 2024 • edited Loading

xansar commented Jul 29, 2024

bhavnicksm commented Aug 5, 2024 • edited Loading

bhavnicksm commented Aug 12, 2024

jvlinsta commented Aug 12, 2024

jvlinsta commented Aug 12, 2024 • edited Loading

jvlinsta commented Aug 12, 2024

hammer-wang commented Oct 14, 2024

njhill commented Oct 14, 2024

sadanand1120 commented Oct 31, 2024

SuperBruceJia commented Nov 7, 2024 • edited Loading

sfc-gh-zhwang commented Dec 7, 2024

shyringo commented Apr 24, 2024 •

edited

Loading

zheyang0825 commented May 11, 2024 •

edited

Loading

cassanof commented May 21, 2024 •

edited

Loading

SuperBruceJia commented Jun 11, 2024 •

edited

Loading

kota-iizuka commented Jul 19, 2024 •

edited

Loading

bhavnicksm commented Aug 5, 2024 •

edited

Loading

jvlinsta commented Aug 12, 2024 •

edited

Loading

SuperBruceJia commented Nov 7, 2024 •

edited

Loading