Taking too much memory on multiple GPUs #872

Closed · krishanu-deloitte opened this issue Aug 25, 2023 · 5 comments

krishanu-deloitte commented Aug 25, 2023

I am trying to load a Llama 13B model on a machine with 4 × 16 GB V100 GPUs (64 GB of GPU memory combined), 64 GB of system memory, and 16 CPUs.
This is the command I am using:

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          trust_remote_code=True,
          tensor_parallel_size=4,
          dtype="float16")

However, I am running into an OutOfMemoryError:

2023-08-25 12:31:27,423	WARNING utils.py:597 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 19.2 to 19.
2023-08-25 12:31:27,634	INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 08-25 12:31:29 llm_engine.py:70] Initializing an LLM engine with config: model='meta-llama/Llama-2-13b-chat-hf', tokenizer='meta-llama/Llama-2-13b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 08-25 12:31:29 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[4], line 1
----> 1 llm = LLM(model = '/home/jovyan/data-chatbot/models/pretrained_models/meta-llama-Llama-2-13b-chat-hf', 
      2           trust_remote_code=True, 
      3           tensor_parallel_size = 4,
      4           dtype = "float16"
      5          )

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:66, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, seed, **kwargs)
     55     kwargs["disable_log_stats"] = True
     56 engine_args = EngineArgs(
     57     model=model,
     58     tokenizer=tokenizer,
   (...)
     64     **kwargs,
     65 )
---> 66 self.llm_engine = LLMEngine.from_engine_args(engine_args)
     67 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:220, in LLMEngine.from_engine_args(cls, engine_args)
    217 distributed_init_method, placement_group = initialize_cluster(
    218     parallel_config)
    219 # Create the LLM engine.
--> 220 engine = cls(*engine_configs,
    221              distributed_init_method,
    222              placement_group,
    223              log_stats=not engine_args.disable_log_stats)
    224 return engine

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:99, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, distributed_init_method, placement_group, log_stats)
     97 # Create the parallel GPU workers.
     98 if self.parallel_config.worker_use_ray:
---> 99     self._init_workers_ray(placement_group)
    100 else:
    101     self._init_workers(distributed_init_method)

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:170, in LLMEngine._init_workers_ray(self, placement_group)
    160 scheduler_config = copy.deepcopy(self.scheduler_config)
    161 self._run_workers("init_worker",
    162                   get_all_outputs=True,
    163                   worker_init_fn=lambda: Worker(
   (...)
    168                       None,
    169                   ))
--> 170 self._run_workers(
    171     "init_model",
    172     get_all_outputs=True,
    173 )

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:474, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
    471     all_outputs.append(output)
    473 if self.parallel_config.worker_use_ray:
--> 474     all_outputs = ray.get(all_outputs)
    476 if get_all_outputs:
    477     return all_outputs

File /opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     21 @wraps(fn)
     22 def auto_init_wrapper(*args, **kwargs):
     23     auto_init_ray()
---> 24     return fn(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102         return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/worker.py:2526, in get(object_refs, timeout)
   2524             raise value.as_instanceof_cause()
   2525         else:
-> 2526             raise value
   2528 if is_individual_id:
   2529     values = values[0]

OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 100.64.142.67, ID: ef44dd1ed28a0968c267bdedabdbe98d9abe9f0391ae8fb1bbdd7d13) where the task (actor ID: 543be1633c4a9cd670a4818e01000000, name=RayWorker.__init__, pid=14193, memory used=6.80GB) was running was 76.00GB / 76.80GB (0.989536), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2fa51644f263f52d4207ded74d7db946099bc7c548c0fe6d5828000a) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 100.64.142.67`. To see the logs of the worker, use `ray logs worker-2fa51644f263f52d4207ded74d7db946099bc7c548c0fe6d5828000a*out -ip 100.64.142.67. Top 10 memory users:
PID	MEM(GB)	COMMAND
14192	6.85	ray::RayWorker.execute_method
14191	6.82	ray::RayWorker.execute_method
14193	6.80	ray::RayWorker.execute_method
14190	6.75	ray::RayWorker.execute_method
13516	0.32	/opt/conda/bin/python3 -m ipykernel_launcher -f /home/jovyan/.local/share/jupyter/runtime/kernel-aaf...
13589	0.12	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
188	0.08	/opt/conda/bin/python3 /opt/conda/bin/jupyter-lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-brow...
13758	0.07	/opt/conda/bin/python3 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-a...
13649	0.07	/opt/conda/bin/python3 /opt/conda/lib/python3.10/site-packages/ray/dashboard/dashboard.py --host=127...
13648	0.05	/opt/conda/bin/python3 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

The funny thing is that when I try to run the same model on a single 40 GB A100 GPU, it runs without any issue.

Can anyone tell me what's going on?

Any help is appreciated.
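
Note that the Ray error above is about host (CPU) RAM, not GPU memory: the node was at 76.00 GB of 76.80 GB, with the four RayWorker processes using about 6.8 GB each, which suggests the workers are staging model state in system memory while loading. Below is a minimal sketch of the memory-related engine arguments that may help, assuming vLLM's swap_space (CPU swap space per GPU in GiB, default 4) and gpu_memory_utilization (fraction of GPU memory reserved, default 0.9) behave as documented; the values are placeholders to tune.

from vllm import LLM

# swap_space and gpu_memory_utilization are standard vLLM engine arguments;
# the specific values here are assumptions to experiment with, not a confirmed fix.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          trust_remote_code=True,
          tensor_parallel_size=4,
          dtype="float16",
          swap_space=1,                 # GiB of CPU swap per GPU (default 4); lower it to spare host RAM
          gpu_memory_utilization=0.85)  # fraction of each GPU given to weights and KV cache

The RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms variables mentioned in the log only change when Ray kills a worker; they do not reduce the memory actually consumed.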

horiacristescu commented Aug 27, 2023

I have the same problem: it works with 2 GPUs and tensor_parallel_size=2, but gives OOM with 4.
I just want to run more requests in parallel in batch mode using more GPUs.

Same model, Llama-2-13b.
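
If the goal is only to run more requests in parallel rather than to fit a larger model, one workaround sketch is to stay on the 2-GPU configuration that works and pass the whole batch to a single generate() call, so vLLM schedules the requests concurrently; the prompts and sampling values below are placeholders.

from vllm import LLM, SamplingParams

# A sketch assuming the tensor_parallel_size=2 setup that works; the batch
# contents and sampling parameters are placeholders.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          tensor_parallel_size=2,
          dtype="float16")

prompts = [f"Summarize document {i}." for i in range(64)]   # placeholder batch
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(prompts, params)   # the whole batch is scheduled together
for out in outputs:
    print(out.outputs[0].text)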

caonann commented Sep 6, 2023

I have the same issue too.

@tresiwald

Same here when using four GPUs, any solution?

@smallmocha

It's a bug to be fixed, see #322.

boydfd (Contributor) commented Oct 17, 2023

I ran into the same issue and figured out how to fix it. I have already created PR #1395.

hmellor closed this as completed Mar 13, 2024