
Is there a way to terminate vllm.LLM and release the GPU memory #1908

Open · sfc-gh-zhwang opened this issue Dec 4, 2023 · 40 comments

Comments

@sfc-gh-zhwang (Contributor)

After the code below, is there an API (maybe something like llm.terminate) to kill the llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")  # construct the engine (model name is only a placeholder)
outputs = llm.generate(prompts, sampling_params)
@SuperBruceJia

After the code below, is there an API (maybe something like llm.terminate) to kill the llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")  # construct the engine (model name is only a placeholder)
outputs = llm.generate(prompts, sampling_params)

Please check the code below. It works.

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully delete the llm pipeline and free the GPU memory!")

Best regards,

Shuyue
Dec. 3rd, 2023

@hijkzzz

hijkzzz commented Dec 4, 2023

mark

@deepbrain

Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations?

@huylenguyen

Are there any updates on this? The above code does not work for me either.

@puddingfjz

+1

@puddingfjz

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker.
Can anybody explain why this is the case?
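
For reference, a minimal end-to-end sketch of that single-worker cleanup (a sketch only, assuming a vLLM version where llm.llm_engine still exposes driver_worker directly; facebook/opt-125m is just a placeholder model):

import gc

import torch
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # single worker, tensor_parallel_size=1
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

# Drop the worker holding the weights and KV cache, then the engine wrapper itself.
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
print(f"allocated after cleanup: {torch.cuda.memory_allocated() // 1024 ** 2} MB")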

@shyringo

+1

@shyringo

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker. Can anybody explain why this is the case?

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.

But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down Ray so that vllm.LLM() could be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Anyway, even if it works, it is just a temporary solution and this issue still needs fixing.

@shyringo

shyringo commented Apr 24, 2024

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.

But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down Ray so that vllm.LLM() could be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Anyway, even if it works, it is just a temporary solution and this issue still needs fixing.

Update: the following code works better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

@ticoneva

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

@rbao2018

rbao2018 commented May 4, 2024

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

thx a lot

@mmoskal (Contributor)

mmoskal commented May 8, 2024

vLLM seems to hold on to the first allocated LLM() instance. It does not hold on to later instances. Maybe that helps with diagnosing the issue?

from vllm import LLM


def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f"  --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")


def gc_problem():
    show_memory_usage()
    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()
root@c09a058c2d5b:/workspaces/aici/py/vllm# python tests/core/block/e2e/gc_problem.py |grep -v INFO
cuda memory: 0MB
  --> after gc: 0MB
loading llm0
cuda memory: 368MB
  --> after gc: 368MB
loading llm1
cuda memory: 912MB
  --> after gc: 368MB
loading llm2
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
cuda memory: 961MB
  --> after gc: 368MB
root@c09a058c2d5b:/workspaces/aici/py/vllm# 

llm1 consumes more memory than llm0, but you can see that the allocated memory stays at the llm0 level.

@yudataguy

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

@shyringo

shyringo commented May 9, 2024

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

You could try the "del llm.llm_engine.model_executor" in the following code instead:

Update: the following code works better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

@yudataguy

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

You could try the "del llm.llm_engine.model_executor" in the following code instead:

Update: the following code works better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Did that as well; still no change in GPU memory allocation. Not sure how to proceed further.

@zheyang0825

zheyang0825 commented May 11, 2024

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

We tried this in version 0.4.2, but the GPU memory was not released.

@shyringo

Did that as well; still no change in GPU memory allocation. Not sure how to proceed further.

Then I do not have a clue either. Meanwhile, I should add one piece of information: the vLLM version with which the above code succeeded for me was 0.4.0.post1.

@mnoukhov

@zheyang0825 does adding this line at the end make it work?

torch.distributed.destroy_process_group()         

@yudataguy

Did that as well; still no change in GPU memory allocation. Not sure how to proceed further.

Then I do not have a clue either. Meanwhile, I should add one piece of information: the vLLM version with which the above code succeeded for me was 0.4.0.post1.

Tried on 0.4.0.post1 and the method worked. Not sure what changed in the latest version such that it no longer releases the memory; possible bug?

@GurvanR

GurvanR commented May 13, 2024

Hello! So, if I'm not wrong, no one has managed to release the memory on vLLM 0.4.2 yet?

@njhill (Member)

njhill commented May 13, 2024

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

@GurvanR

GurvanR commented May 14, 2024

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I updated vLLM yesterday and still have the problem. I'm using these lines:

import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()

@Misterrendal

This code worked for me (vllm==0.4.0.post1):

        import gc
        import ray
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        print('service stopping ..')
        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        destroy_model_parallel()

        del model.llm_engine.model_executor.driver_worker
        del model

        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()

        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        print("service stopped")

@cassanof (Contributor)

There should be a built-in way! We cannot keep writing code that breaks on the next minor release :(

@youkaichao (Member)

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks.

I would say the most stable way to terminate vLLM is to shut down the process.
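
For example, one way to follow that advice (a sketch only; the helper name and the use of a 'spawn' context are my own choices, not a vLLM API) is to run each LLM in a short-lived subprocess, so the OS reclaims all GPU memory when the child exits:

import multiprocessing as mp

def run_inference(model_name, prompts, queue):
    # Importing vLLM inside the child keeps CUDA initialization out of the parent process.
    from vllm import LLM, SamplingParams
    llm = LLM(model=model_name)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    queue.put([o.outputs[0].text for o in outputs])

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    p = ctx.Process(target=run_inference,
                    args=("facebook/opt-125m", ["Hello, my name is"], queue))
    p.start()
    results = queue.get()
    p.join()  # all GPU memory is reclaimed when the child process exits
    print(results)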

@Vincent-Li-9701

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I encountered this issue with TP = 8. I'm doing this in an iterative manner since I need to run the embedding model after the generative model, so there is some loading/offloading. The first iteration is fine, but in the second iteration the instantiation of the vLLM Ray server hangs.

@cassanof (Contributor)

cassanof commented May 21, 2024

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks.

I would say the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

@DuZKai

DuZKai commented May 31, 2024

I don't know if anyone can currently clear the memory correctly, but in version 0.4.2 the code above failed to clear the memory for me. I can only use a slightly extreme method, creating a new process for the call and closing the process afterwards, to roughly solve the problem:

from multiprocessing import Process, set_start_method

import torch
from vllm import LLM, SamplingParams

set_start_method('spawn', force=True)

def vllm_texts(model_path):
    prompts = ""
    sampling_params = SamplingParams(max_tokens=512)
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, sampling_params)

...
print(torch.cuda.memory_summary())
p = Process(target=vllm_texts, args=(model_path,))  # args must be a tuple
p.start()
p.join()
if p.is_alive():
    p.terminate()
p.close()
print(torch.cuda.memory_summary())
...

I still hope there will be a way in the future to clear the memory correctly and completely.

@SuperBruceJia

SuperBruceJia commented Jun 11, 2024

While I am using multiple GPUs to serve an LLM (tensor_parallel_size > 1), the GPUs' memory is not released, except on the first GPU (cuda:0).

(screenshot of GPU memory usage)
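
As a side note, a quick way to check this per GPU (a sketch; torch.cuda.mem_get_info reports device-wide free/total memory, so it also reflects memory held by worker processes):

import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {(total - free) // 1024 ** 2} MB in use of {total // 1024 ** 2} MB")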

@ywang96 (Member)

ywang96 commented Jun 19, 2024

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks.
I would say the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow!

Given how much this feature seems to be wanted, I will bring it back to the team to discuss! If multi-GPU instances are prone to deadlocks, then perhaps we can at least start with single-GPU instances. Everyone on the maintainer team has limited bandwidth and we have a lot of things to work on, so contributions are very welcome as always!

@kota-iizuka (Contributor)

kota-iizuka commented Jul 19, 2024

I tried running inference with multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

I use this approach in a pipeline where I describe images with a VLM and then summarize them with an LLM. I hope that this kind of processing will be officially supported and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

@xansar

xansar commented Jul 29, 2024

I tried running inference with multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

I use this approach in a pipeline where I describe images with a VLM and then summarize them with an LLM. I hope that this kind of processing will be officially supported and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

It works for me.

@bhavnicksm

bhavnicksm commented Aug 5, 2024

I tried running inference with multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

I use this approach in a pipeline where I describe images with a VLM and then summarize them with an LLM. I hope that this kind of processing will be officially supported and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

With TP=1, I am able to unload the model without difficulty using the method described above, but the re-loading fails with an esoteric error like:

File "/home/bhavnick/fd/workspace/vllm-api/modules/llm/generator.py", line 139, in load
    engine = AsyncLLMEngine.from_engine_args(args)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized
    backend = backend or torch.distributed.get_backend(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1074, in get_backend
    return Backend(not_none(pg_store)[0])
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/utils/_typing_utils.py", line 12, in not_none
    raise TypeError("Invariant encountered: value was None when it should not be")
TypeError: Invariant encountered: value was None when it should not be

One difference I can think of is that I am using the AsyncLLMEngine instead of the synchronous version, but the unload works just the same.

@bhavnicksm

I tried running inference with multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.

I use this approach in a pipeline where I describe images with a VLM and then summarize them with an LLM. I hope that this kind of processing will be officially supported and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()

With TP=1, I am able to unload the model without difficulty using the method described above, but the re-loading fails with an esoteric error like:

File "/home/bhavnick/fd/workspace/vllm-api/modules/llm/generator.py", line 139, in load
    engine = AsyncLLMEngine.from_engine_args(args)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized
    backend = backend or torch.distributed.get_backend(
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1074, in get_backend
    return Backend(not_none(pg_store)[0])
  File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/utils/_typing_utils.py", line 12, in not_none
    raise TypeError("Invariant encountered: value was None when it should not be")
TypeError: Invariant encountered: value was None when it should not be

One difference I can think of is that I am using the AsyncLLMEngine instead of the synchronous version, but the unload works just the same.

UPDATE: it works with Ray at TP=2 but, for some reason, not with multiprocessing (either fork or spawn). Also, with Ray only the unload works right now, not the reload.
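
For reference, since the async engine wraps the underlying LLMEngine as .engine (per the "self.engine = self._init_engine(...)" line in the traceback above) rather than exposing .llm_engine, an equivalent construct/teardown would look roughly like this (a sketch, not a verified recipe; the model name and TP size are placeholders):

import gc

import torch
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                              destroy_model_parallel)

# Build the async engine (placeholder model and TP size).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m", tensor_parallel_size=1))

# ... serve requests with engine.generate(...) ...

# Teardown mirrors the synchronous recipe, but through .engine instead of .llm_engine.
destroy_model_parallel()
destroy_distributed_environment()
del engine.engine.model_executor
del engine
gc.collect()
torch.cuda.empty_cache()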

@jvlinsta

@bhavnicksm I can reproduce the same error you have.

@jvlinsta

jvlinsta commented Aug 12, 2024

When debugging, I found that when using 'spawn', the main GPU (when using PCI_BUS ordering) would still keep a small amount of memory allocated, indicating that the cleanup is unsuccessful.

When I then check all available GPUs that have zero memory allocated and export CUDA_VISIBLE_DEVICES accordingly (a sketch of this step follows the traceback below), I am able to progress to loading another model, until another issue hits that seemingly indicates the workers' distributed state is still somewhere in memory:

    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 250, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 874, in init_distributed_environment
    assert _WORLD.world_size == torch.distributed.get_world_size(), (
AssertionError: world group already initialized with a different world size
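
For reference, the GPU-selection step mentioned above could look roughly like this (a sketch; it uses the third-party pynvml bindings so the parent process never initializes CUDA, and the 512 MB "free" threshold is an arbitrary assumption):

import os

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
free_gpus = []
for i in range(pynvml.nvmlDeviceGetCount()):
    info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    if info.used < 512 * 1024 ** 2:  # treat GPUs with <512 MB used as free
        free_gpus.append(str(i))
pynvml.nvmlShutdown()

# Must be set before CUDA is initialized in the process that loads the next model.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(free_gpus)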

@jvlinsta

Another update:

The problem is the global _WORLD variable, which stays None. When hardcoding the backend to 'nccl' at vllm/distributed/parallel_state.py, line 920, in ensure_model_parallel_initialized, the whole state goes into deadlock:

    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 470, in from_engine_args
    engine = cls(
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 250, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/ubuntu/vllm/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 969, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/home/ubuntu/vllm/vllm/distributed/parallel_state.py", line 154, in __init__
    self.pynccl_comm = PyNcclCommunicator(
  File "/home/ubuntu/vllm/vllm/distributed/device_communicators/pynccl.py", line 74, in __init__
    dist.broadcast(tensor, src=ranks[0], group=group)
  File "/opt/conda/envs/llm2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llm2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast

@hammer-wang

When using TP>1 for the first model, it seems there's no working method that can successfully release the GPU memory. I've tried all scripts in this thread and none worked.

@njhill (Member)

njhill commented Oct 14, 2024

@hammer-wang what version of vLLM are you using, which distributed backend (ray or multiprocessing) and how are you running the server?

@sadanand1120

Got it working on vLLM 0.5.x by adding these two flags to my LLM call:

distributed_executor_backend='mp'
disable_custom_all_reduce=True

and

import contextlib
import gc

import ray
import torch
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                              destroy_model_parallel)
from vllm.utils import is_cpu  # is_cpu() lives in vllm.utils in the 0.5.x releases

def vllm_cleanup(llm):
    del llm.llm_engine.model_executor
    del llm
    destroy_model_parallel()
    destroy_distributed_environment()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    if not is_cpu():
        torch.cuda.empty_cache()
    ray.shutdown()

Calling this function after llm.generate works for me.
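
For context, the constructor call with those two flags would look something like this (model name and TP size are placeholders):

from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",          # placeholder model
    tensor_parallel_size=2,             # placeholder TP size
    distributed_executor_backend="mp",  # multiprocessing instead of Ray
    disable_custom_all_reduce=True,
)
# ... llm.generate(...) ...
# vllm_cleanup(llm)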
