Taking too much memory on multiple GPUs #872

Closed · krishanu-deloitte opened this issue Aug 25, 2023 · 5 comments

krishanu-deloitte commented Aug 25, 2023

I am trying to load a Llama 13B model on a machine with 4 × 16 GB V100 GPUs (64 GB of GPU memory combined), 64 GB of system memory, and 16 CPUs.
This is the command I am using:

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          trust_remote_code=True,
          tensor_parallel_size=4,
          dtype="float16")

However, I am running into an OutOfMemoryError:

2023-08-25 12:31:27,423	WARNING utils.py:597 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 19.2 to 19.
2023-08-25 12:31:27,634	INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 08-25 12:31:29 llm_engine.py:70] Initializing an LLM engine with config: model='meta-llama/Llama-2-13b-chat-hf', tokenizer='meta-llama/Llama-2-13b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 08-25 12:31:29 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[4], line 1
----> 1 llm = LLM(model = '/home/jovyan/data-chatbot/models/pretrained_models/meta-llama-Llama-2-13b-chat-hf', 
      2           trust_remote_code=True, 
      3           tensor_parallel_size = 4,
      4           dtype = "float16"
      5          )

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:66, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, seed, **kwargs)
     55     kwargs["disable_log_stats"] = True
     56 engine_args = EngineArgs(
     57     model=model,
     58     tokenizer=tokenizer,
   (...)
     64     **kwargs,
     65 )
---> 66 self.llm_engine = LLMEngine.from_engine_args(engine_args)
     67 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:220, in LLMEngine.from_engine_args(cls, engine_args)
    217 distributed_init_method, placement_group = initialize_cluster(
    218     parallel_config)
    219 # Create the LLM engine.
--> 220 engine = cls(*engine_configs,
    221              distributed_init_method,
    222              placement_group,
    223              log_stats=not engine_args.disable_log_stats)
    224 return engine

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:99, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, distributed_init_method, placement_group, log_stats)
     97 # Create the parallel GPU workers.
     98 if self.parallel_config.worker_use_ray:
---> 99     self._init_workers_ray(placement_group)
    100 else:
    101     self._init_workers(distributed_init_method)

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:170, in LLMEngine._init_workers_ray(self, placement_group)
    160 scheduler_config = copy.deepcopy(self.scheduler_config)
    161 self._run_workers("init_worker",
    162                   get_all_outputs=True,
    163                   worker_init_fn=lambda: Worker(
   (...)
    168                       None,
    169                   ))
--> 170 self._run_workers(
    171     "init_model",
    172     get_all_outputs=True,
    173 )

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:474, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
    471     all_outputs.append(output)
    473 if self.parallel_config.worker_use_ray:
--> 474     all_outputs = ray.get(all_outputs)
    476 if get_all_outputs:
    477     return all_outputs

File /opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     21 @wraps(fn)
     22 def auto_init_wrapper(*args, **kwargs):
     23     auto_init_ray()
---> 24     return fn(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102         return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/worker.py:2526, in get(object_refs, timeout)
   2524             raise value.as_instanceof_cause()
   2525         else:
-> 2526             raise value
   2528 if is_individual_id:
   2529     values = values[0]

OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 100.64.142.67, ID: ef44dd1ed28a0968c267bdedabdbe98d9abe9f0391ae8fb1bbdd7d13) where the task (actor ID: 543be1633c4a9cd670a4818e01000000, name=RayWorker.__init__, pid=14193, memory used=6.80GB) was running was 76.00GB / 76.80GB (0.989536), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2fa51644f263f52d4207ded74d7db946099bc7c548c0fe6d5828000a) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 100.64.142.67`. To see the logs of the worker, use `ray logs worker-2fa51644f263f52d4207ded74d7db946099bc7c548c0fe6d5828000a*out -ip 100.64.142.67. Top 10 memory users:
PID	MEM(GB)	COMMAND
14192	6.85	ray::RayWorker.execute_method
14191	6.82	ray::RayWorker.execute_method
14193	6.80	ray::RayWorker.execute_method
14190	6.75	ray::RayWorker.execute_method
13516	0.32	/opt/conda/bin/python3 -m ipykernel_launcher -f /home/jovyan/.local/share/jupyter/runtime/kernel-aaf...
13589	0.12	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
188	0.08	/opt/conda/bin/python3 /opt/conda/bin/jupyter-lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-brow...
13758	0.07	/opt/conda/bin/python3 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-a...
13649	0.07	/opt/conda/bin/python3 /opt/conda/lib/python3.10/site-packages/ray/dashboard/dashboard.py --host=127...
13648	0.05	/opt/conda/bin/python3 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

The funny thing is that when I try to run the same model on a single 40 GB A100 GPU, it runs without any issue.

Can anyone tell me what's going on?

Any help is appreciated.
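
Note that the Ray error above is about host (CPU) RAM, not GPU memory: the node was at 76.00 GB of 76.80 GB, with the four RayWorker processes using about 6.8 GB each, which suggests the workers are staging model state in system memory while loading. Below is a minimal sketch of the memory-related engine arguments that may help, assuming vLLM's swap_space (CPU swap space per GPU in GiB, default 4) and gpu_memory_utilization (fraction of GPU memory reserved, default 0.9) behave as documented; the values are placeholders to tune.

from vllm import LLM

# swap_space and gpu_memory_utilization are standard vLLM engine arguments;
# the specific values here are assumptions to experiment with, not a confirmed fix.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          trust_remote_code=True,
          tensor_parallel_size=4,
          dtype="float16",
          swap_space=1,                 # GiB of CPU swap per GPU (default 4); lower it to spare host RAM
          gpu_memory_utilization=0.85)  # fraction of each GPU given to weights and KV cache

The RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms variables mentioned in the log only change when Ray kills a worker; they do not reduce the memory actually consumed.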

horiacristescu commented Aug 27, 2023

I have the same problem: it works with 2 GPUs and tensor_parallel_size=2, but gives OOM with 4.
I just want to run more requests in parallel in batch mode using more GPUs.

Same model, Llama-2-13b.
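
If the goal is only to run more requests in parallel rather than to fit a larger model, one workaround sketch is to stay on the 2-GPU configuration that works and pass the whole batch to a single generate() call, so vLLM schedules the requests concurrently; the prompts and sampling values below are placeholders.

from vllm import LLM, SamplingParams

# A sketch assuming the tensor_parallel_size=2 setup that works; the batch
# contents and sampling parameters are placeholders.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          tensor_parallel_size=2,
          dtype="float16")

prompts = [f"Summarize document {i}." for i in range(64)]   # placeholder batch
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(prompts, params)   # the whole batch is scheduled together
for out in outputs:
    print(out.outputs[0].text)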

caonann commented Sep 6, 2023

I have the same issue too.

@tresiwald

Same here when using four GPUs, any solution?

@smallmocha

It's a bug to be fixed, see #322.

boydfd (Contributor) commented Oct 17, 2023

I ran into the same issue and figured out how to fix it. I have already created PR #1395.

hmellor closed this as completed Mar 13, 2024