ray OOM in tensor parallel #322
Comments
Hi @liulfy, it's because we allocate 4 GB of CPU memory per GPU. Adding swap_space=1 when initializing LLM will solve the problem. |
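A rough sketch of the arithmetic behind this reply, assuming each GPU worker reserves its own CPU swap buffer so the total grows with the tensor-parallel size. The helper names `total_swap_gib` and `build_llm` are hypothetical and the model path is a placeholder; vLLM is imported lazily so the arithmetic runs without it installed.

```python
def total_swap_gib(swap_space_gib, tensor_parallel_size):
    """CPU memory reserved for swap across all GPU workers, in GiB.

    Assumption: every worker reserves `swap_space_gib` independently.
    """
    return swap_space_gib * tensor_parallel_size


def build_llm(model_path, tensor_parallel_size, swap_space_gib=1):
    """Create a vLLM engine with a smaller per-GPU swap space.

    vLLM is imported here, not at module level, so the helper above
    can be used on a machine without vLLM.
    """
    from vllm import LLM

    return LLM(
        model=model_path,
        tensor_parallel_size=tensor_parallel_size,
        swap_space=swap_space_gib,  # GiB of CPU swap per GPU
    )


if __name__ == "__main__":
    # 4 GiB per GPU on 4 GPUs versus 1 GiB per GPU on 4 GPUs
    print(total_swap_gib(4, 4), "GiB")
    print(total_swap_gib(1, 4), "GiB")
```

With 4 GiB per GPU and four GPUs this gives 16 GiB, which matches the "16.00 GiB out of the 32.00 GiB" warning in the original report. Lowering swap_space to 1 would bring the reservation down to 4 GiB, though later replies suggest this alone may not be enough.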
@WoosukKwon Thank you for your answer! I tried setting swap_space, but the problem is not solved. My machine has 32 GB of CPU memory, and I use 4 A100 40GB GPUs. |
I hit the same problem. Model:
free -h
Initializing an LLM engine with config:
This is the error:
I guess vLLM allocates more memory for the model than its physical size. Is there a formula for calculating the memory size? |
Me too. |
Disable the Ray memory monitor, as described in the related issue ray-project/ray#10895. |
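One way to apply this suggestion, sketched under the assumption that the environment variable named in Ray's own OOM message (RAY_memory_monitor_refresh_ms) is the switch meant here. The helper name `disable_ray_memory_monitor` is hypothetical, and the variable has to be set before Ray starts.

```python
import os


def disable_ray_memory_monitor(env=None):
    """Set RAY_memory_monitor_refresh_ms=0 in the given environment mapping.

    Defaults to os.environ. Call this before Ray is started (for example,
    before constructing a tensor-parallel vLLM engine), otherwise the
    setting has no effect on the already running raylet.
    """
    env = os.environ if env is None else env
    env["RAY_memory_monitor_refresh_ms"] = "0"
    return env


if __name__ == "__main__":
    disable_ray_memory_monitor()
    print(os.environ["RAY_memory_monitor_refresh_ms"])
```

This only stops Ray from killing workers; it does not reduce memory use, so a node that is truly out of memory can still fail in other ways.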
This does not work for me. I set NCCL_DEBUG=INFO and my log is as follows:
|
hi, we're having the same issue. Has anyone found a solution for this yet? |
Same issue here, but I suspect it has nothing to do with Ray |
mark |
mark, I have the same problem |
same problem here |
I'm having the same issue. |
same here, mark |
In my humble opinion, vllm/vllm/model_executor/models/llama.py Lines 336 to 339 in bbbf865
This loop needs some CPU memory per GPU device to load a checkpoint file. For @liulfy's case, |
Indeed, after sharding my model's checkpoints to small pieces, |
I know that there is no way to partially load a large checkpoint file at code level. Any ideas on how vLLM can solve these problems? |
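A back-of-the-envelope estimate of the effect described above, assuming (as this comment suggests, and not verified against the vLLM source) that every GPU worker holds one whole checkpoint file in CPU memory at a time. The helper names and the file patterns are hypothetical.

```python
from pathlib import Path


def largest_shard_bytes(checkpoint_dir, patterns=("*.bin", "*.safetensors", "*.pt")):
    """Size in bytes of the largest checkpoint file in a directory (0 if none)."""
    sizes = [
        path.stat().st_size
        for pattern in patterns
        for path in Path(checkpoint_dir).glob(pattern)
    ]
    return max(sizes, default=0)


def estimated_peak_load_bytes(checkpoint_dir, tensor_parallel_size):
    """Rough peak CPU memory during loading.

    Assumption: each of the `tensor_parallel_size` workers loads the
    largest shard fully into CPU memory at the same moment.
    """
    return largest_shard_bytes(checkpoint_dir) * tensor_parallel_size


if __name__ == "__main__":
    import sys

    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    peak = estimated_peak_load_bytes(directory, tensor_parallel_size=4)
    print(f"estimated peak: {peak / 2**30:.2f} GiB")
```

Under this assumption, re-saving the checkpoint in smaller shards lowers the estimate proportionally, which would be consistent with the observation that sharding into small pieces helped.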
Same issue here. Has anyone fixed it yet? |
same here, mark |
I hit the same issue and figured out how to fix it. I have created PR #1395. |
@boydfd It seems this did not fix the issue for me. My OOM does not occur when loading the model; I get it after running for several days. |
Maybe you can share more info? |
Same issue here. I've found some info that may help: 1. It works fine when --tensor-parallel-size is 1, that is, without Ray; the CPU memory usage stays constant. Model: llama-7b |
It seems that if you lower --max_model_len, it will start. |
If anybody runs vLLM on Triton server |
I wrote the same answer in issue #721. Can you try this?
|
I encountered the same OOM error message and
|
I resolved my case by |
Updated to move to ROCm 6.3 and post the issue with saving Tunable Ops due to PyTorch bug.
In my case, I can deploy the vLLM service on a single GPU, but when I use multiple GPUs, I hit the Ray OOM error. Could you please help solve this problem?
My model is yahma/llama-7b-hf.
My transformers version is 4.28.0.
My CUDA version is 11.4.
2023-06-30 09:24:53,455 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-06-30 09:24:53,459 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-06-30 09:24:53,584 INFO worker.py:1636 -- Started a local Ray instance.
INFO 06-30 09:24:54 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 06-30 09:24:54 config.py:131] Possibly too large swap space. 16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space.
/opt/app/yahma-llama-lora
Exception in thread ray_print_logs:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 900, in print_logs
global_worker_stdstream_dispatcher.emit(data)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/ray_logging.py", line 264, in emit
handle(data)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1788, in print_to_stdstream
print_worker_logs(batch, sink)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1950, in print_worker_logs
restore_tqdm()
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1973, in restore_tqdm
tqdm_ray.instance().unhide_bars()
File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 344, in instance
_manager = _BarManager()
File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 256, in __init__
self.should_colorize = not ray.widgets.util.in_notebook()
File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 205, in in_notebook
shell = _get_ipython_shell_name()
File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 194, in _get_ipython_shell_name
import IPython
File "/usr/local/lib/python3.8/dist-packages/IPython/__init__.py", line 30, in <module>
raise ImportError(
ImportError:
IPython 8.13+ supports Python 3.9 and above, following NEP 29.
IPython 8.0-8.12 supports Python 3.8 and above, following NEP 29.
When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
Python 3.3 and 3.4 were supported up to IPython 6.x.
Python 3.5 was supported with IPython 7.0 to 7.9.
Python 3.6 was supported with IPython up to 7.16.
Python 3.7 was still supported with the 7.x branch.
See IPython README.rst file for more information:
Traceback (most recent call last):
File "", line 1, in
File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
engine = cls(*engine_configs, distributed_init_method, devices,
File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
self._init_cache()
File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
num_blocks = self._run_workers(
File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
all_outputs = ray.get(all_outputs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.30.192.36, ID: 17400c6c9eee3bc1384c172eecd4e1ecf2992cbc7f50cb27d2dc60d7) where the task (task ID: ffffffffffffffff283e91f20257d747969124a201000000, name=Worker.__init__, pid=26332, memory used=4.54GB) was running was 31.27GB / 32.00GB (0.977298), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 10.30.192.36. To see the logs of the worker, use ray logs worker-cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4*out -ip 10.30.192.36. Top 10 memory users:
PID MEM(GB) COMMAND
26333 4.60 ray::Worker.__init__
26332 4.54 ray::Worker.__init__
26331 4.51 ray::Worker.__init__
26330 4.47 ray::Worker.__init__
25044 0.23 python
25099 0.19 /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
25340 0.06 ray::IDLE
25174 0.06 /usr/bin/python /usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 -...
25310 0.06 /usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
25349 0.05 ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.
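The check that Ray reports in this message can be reproduced in a few lines. This is only a sketch of the comparison described in the log, not Ray's implementation; the helper name `exceeds_ray_threshold` is hypothetical, and the 0.95 fallback is taken from the threshold quoted in the message.

```python
import os


def exceeds_ray_threshold(used_gb, total_gb, threshold=None):
    """Return True if node memory usage is above the kill threshold.

    If no threshold is given, read RAY_memory_usage_threshold from the
    environment and fall back to 0.95, the value quoted in the log above.
    """
    if threshold is None:
        threshold = float(os.environ.get("RAY_memory_usage_threshold", "0.95"))
    return used_gb / total_gb > threshold


if __name__ == "__main__":
    # Figures from the error message: 31.27GB used out of 32.00GB
    print(exceeds_ray_threshold(31.27, 32.00, threshold=0.95))
```

For the reported node, 31.27 / 32.00 is about 0.977, which is above 0.95, so the most recently scheduled worker was killed.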