
fix RAM OOM when loading large models in tensor parallel mode. #1395

Merged
merged 2 commits into vllm-project:main from the parallel-batch-load branch on Nov 21, 2023

Conversation

boydfd
Contributor

@boydfd boydfd commented Oct 17, 2023

Fixes bugs #322 and #872.
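
For context, the fix bounds peak host RAM during startup by not letting every tensor-parallel worker load the checkpoint into CPU memory at the same time: the load_model calls are dispatched to the Ray workers in batches, so peak RAM is roughly batch size × checkpoint size instead of tensor_parallel_size × checkpoint size. A rough sketch of the idea (illustrative only; run_workers_in_batches and its arguments are placeholder names, not the exact code in this PR):

import ray

def run_workers_in_batches(workers, method, max_concurrent_workers=None):
    # Call `method` on the Ray workers in groups of `max_concurrent_workers`
    # instead of all at once, so only that many workers hold the full
    # checkpoint in host RAM at any time.
    step = max_concurrent_workers or len(workers)  # no limit -> run all at once
    results = []
    for i in range(0, len(workers), step):
        batch = workers[i:i + step]
        futures = [worker.execute_method.remote(method) for worker in batch]
        # Wait for this batch to finish (and free its temporary CPU buffers)
        # before starting the next batch.
        results.extend(ray.get(futures))
    return results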

Test log:

(vllm-boydfd) root:~/projects# python -m vllm.entrypoints.api_server --model /root/WizardLM--WizardCoder-15B-V1.0/ --tensor-parallel-size 8
2023-10-17 19:13:33,431 INFO worker.py:1642 -- Started a local Ray instance.
INFO 10-17 19:13:34 llm_engine.py:72] Initializing an LLM engine with config: model='/root/WizardLM--WizardCoder-15B-V1.0/', tokenizer='/root/WizardLM--WizardCoder-15B-V1.0/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=8, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/entrypoints/api_server.py", line 74, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 487, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 190, in _init_workers_ray
    self._run_workers(
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 730, in _run_workers
    all_outputs.extend(self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 711, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/ray/_private/worker.py", line 2549, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xxxxxxxxx, ID: xxxxxx) where the task (actor ID: xxxxx, name=RayWorker.__init__, pid=1080909, memory used=29.64GB) was running was 242.20GB / 251.71GB (0.962216), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: xxxxx) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xxxxxx`. To see the logs of the worker, use `ray logs worker-xxxxx*out -ip xxxxxx. Top 10 memory users:
PID     MEM(GB) COMMAND
1080909 29.64   ray::RayWorker.execute_method
1080908 29.61   ray::RayWorker.execute_method
1080907 29.57   ray::RayWorker.execute_method
1080906 29.53   ray::RayWorker.execute_method
1080905 29.50   ray::RayWorker.execute_method
1080904 29.47   ray::RayWorker.execute_method
1080903 29.43   ray::RayWorker.execute_method
1080902 29.40   ray::RayWorker.execute_method
1078425 0.27    /workspace/miniconda/envs/vllm-boydfd/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server -...
1078324 0.25    python -m vllm.entrypoints.api_server --model /root/WizardLM--WizardCoder-15B-V1.0/ --tensor-paralle...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
^C
(vllm-boydfd) root:~/projects# python -m vllm.entrypoints.api_server --model /root/WizardLM--WizardCoder-15B-V1.0/ --tensor-parallel-size 8 --tensor-parallel-model-load-batch-size 2
2023-10-17 19:16:32,655 INFO worker.py:1642 -- Started a local Ray instance.
INFO 10-17 19:16:33 llm_engine.py:72] Initializing an LLM engine with config: model='/root/WizardLM--WizardCoder-15B-V1.0/', tokenizer='/root/WizardLM--WizardCoder-15B-V1.0/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=8, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 10-17 19:21:29 llm_engine.py:218] # GPU blocks: 31424, # CPU blocks: 13107
INFO:     Started server process [1082377]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Memory usage during model loading (screenshot):

@williamjeong2

This PR seems like it hasn't been reviewed yet. Just wanted to bring it to attention in case it slipped through the cracks.

If anyone has some time to take a look and provide feedback, that would be great. I believe getting more eyes on this would really help in enhancing the quality of the vLLM project.

@jaywongs

@boydfd Hey, thanks for your work!
I tried this PR, but it didn't work.
I'm using vLLM to run inference on CodeLlama-34B-AWQ across 2×A10 GPUs, with 24 GB of GPU memory per device and 376 GB of CPU memory in total.

Error info:
SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Member

@zhuohan123 zhuohan123 left a comment


Thank you for your contribution! The code in general looks good. Left some small comments on naming.

Review threads (outdated, resolved): vllm/config.py, vllm/engine/arg_utils.py (×2), vllm/engine/llm_engine.py
@boydfd boydfd force-pushed the parallel-batch-load branch 2 times, most recently from d0c6bbf to 1422828 on October 31, 2023 02:05
…t: model_load_batch_size, max_parallel_loading_workers -> max_parallel_loading_workers, batch_size -> max_concurrent_workers.
@boydfd boydfd force-pushed the parallel-batch-load branch from 1422828 to 5ee5bcd on October 31, 2023 02:06
@boydfd
Contributor Author

boydfd commented Oct 31, 2023

@boydfd Hey, thanks for your work! I tried this PR, but it didn't work. I'm using vLLM to run inference on CodeLlama-34B-AWQ across 2×A10 GPUs, with 24 GB of GPU memory per device and 376 GB of CPU memory in total.

Error info: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

@jaywongs
some quick questions:

  1. how many --tensor-parallel-size do you set?
  2. how many --tensor-parallel-model-load-batch-size do you set?
  3. can you post your error log? since 376 GB RAM is totally enough for loading 34B model, I can't imagine why this happened

@boydfd
Contributor Author

boydfd commented Oct 31, 2023

Thank you for your contribution! The code in general looks good. Left some small comments on naming.

I have already updated all the naming mentioned in the comments in another commit.

@boydfd boydfd requested a review from zhuohan123 October 31, 2023 02:46
@jaywongs

@boydfd Hey, thanks for your work! I tried this PR, but it didn't work. I'm using vLLM to run inference on CodeLlama-34B-AWQ across 2×A10 GPUs, with 24 GB of GPU memory per device and 376 GB of CPU memory in total.
Error info: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

@jaywongs A few quick questions:

  1. What did you set --tensor-parallel-size to?
  2. What did you set --tensor-parallel-model-load-batch-size to?
  3. Can you post your full error log? 376 GB of RAM should be more than enough to load a 34B model, so I can't imagine why this happened.

Thank you for your reply:

  1. tensor-parallel-size is set to 2
  2. for tensor-parallel-model-load-batch-size I tried 1, 2, 4, and 8
  3. the full error log is here:
    [2023-10-30 14:14:08] time="2023-10-30T06:14:08Z" level=info msg="create process: /bin/sh, command: ["/bin/sh","-c","python3 -m vllm.entrypoints.openai.api_server --model /dtc-llm/models --host 0.0.0.0 --port 5000 --max-num-batched-tokens 16384 --dtype=float16 --quantization awq --served-model-name Phind-CodeLlama-34B-v2-AWQ --tensor-parallel-size 2 --tensor-parallel-model-load-batch-size 8"]"
    [2023-10-30 14:14:10] WARNING 10-30 06:14:10 config.py:351] Casting torch.bfloat16 to torch.float16.
    [2023-10-30 14:14:12] 2023-10-30 06:14:12,055 WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67043328 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
    [2023-10-30 14:14:12] 2023-10-30 06:14:12,170 INFO worker.py:1621 -- Started a local Ray instance.
    [2023-10-30 14:14:13] INFO 10-30 06:14:13 llm_engine.py:72] Initializing an LLM engine with config: model='/dtc-llm/models', tokenizer='/dtc-llm/models', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, seed=0)
    [2023-10-30 14:14:19] (RayWorker pid=2584) /usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:546: UserWarning: Can't initialize NVML
    [2023-10-30 14:14:19] (RayWorker pid=2584) warnings.warn("Can't initialize NVML")
    [2023-10-30 14:14:35] (RayWorker pid=2584) *** SIGBUS received at time=1698646475 on cpu 12 ***
    [2023-10-30 14:14:35] (RayWorker pid=2584) PC: @ 0x7feba0b85b28 (unknown) c10::function_ref<>::callback_fn<>()
    [2023-10-30 14:14:35] (RayWorker pid=2584) @ 0x7fee8080d520 166537088 (unknown)
    [2023-10-30 14:14:35] (RayWorker pid=2583) /usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:546: UserWarning: Can't initialize NVML
    [2023-10-30 14:14:35] (RayWorker pid=2583) warnings.warn("Can't initialize NVML")
    [2023-10-30 14:14:35] (RayWorker pid=2583) @ 0x7f92b6124520 (unknown) (unknown)
    [2023-10-30 14:14:35] (RayWorker pid=2584) @ 0x7feb9c84b8cd (unknown) at::TensorIteratorBase::serial_for_each()
    [2023-10-30 14:14:35] (RayWorker pid=2584) [2023-10-30 06:14:35,419 E 2584 2584] logging.cc:361: *** SIGBUS received at time=1698646475 on cpu 12 ***
    [2023-10-30 14:14:35] (RayWorker pid=2584) [2023-10-30 06:14:35,419 E 2584 2584] logging.cc:361: PC: @ 0x7feba0b85b28 (unknown) c10::function_ref<>::callback_fn<>()
    [2023-10-30 14:14:35] (RayWorker pid=2584) [2023-10-30 06:14:35,419 E 2584 2584] logging.cc:361: @ 0x7fee8080d520 166537088 (unknown)
    [2023-10-30 14:14:35] (RayWorker pid=2584) [2023-10-30 06:14:35,419 E 2584 2584] logging.cc:361: @ 0x7feb9c84b8cd (unknown) at::TensorIteratorBase::serial_for_each()
    [2023-10-30 14:14:35] (RayWorker pid=2584) Fatal Python error: Bus error
    [2023-10-30 14:14:35] (RayWorker pid=2584)
    [2023-10-30 14:14:35] (RayWorker pid=2584) Stack (most recent call first):
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/root/vllm/vllm/model_executor/models/llama.py", line 411 in load_weights
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/root/vllm/vllm/model_executor/model_loader.py", line 103 in get_model
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/root/vllm/vllm/worker/worker.py", line 72 in load_model
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/root/vllm/vllm/engine/ray_utils.py", line 32 in execute_method
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464 in _resume_span
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/usr/local/lib/python3.10/dist-packages/ray/_private/function_manager.py", line 726 in actor_method_executor
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 779 in main_loop
    [2023-10-30 14:14:35] (RayWorker pid=2584) File "/usr/local/lib/python3.10/dist-packages/ray/_private/workers/default_worker.py", line 264 in
    [2023-10-30 14:14:35] (RayWorker pid=2584)
    [2023-10-30 14:14:35] (RayWorker pid=2584) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, grpc._cython.cygrpc, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, pyarrow._json, PIL._imaging (total: 104)
    [2023-10-30 14:14:35] (RayWorker pid=2583) [2023-10-30 06:14:35,419 E 2583 2583] logging.cc:361: @ 0x7f92b6124520 (unknown) (unknown)
    [2023-10-30 14:14:35] 2023-10-30 06:14:35,520 WARNING worker.py:2037 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff3d43680d3ad4f61efd15d19401000000 Worker ID: 803d573862007b4ba1bf498a76cecb880855462175719fa19f4267d7 Node ID: f3f4cb95f16e75a2ce9e15f0339bf3db6efce0a726e0e3edae85e2e3 Worker IP address: 172.16.103.199 Worker port: 46271 Worker PID: 2583 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
    [2023-10-30 14:14:35] Traceback (most recent call last):
    [2023-10-30 14:14:35] File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    [2023-10-30 14:14:35] return _run_code(code, main_globals, None,
    [2023-10-30 14:14:35] File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    [2023-10-30 14:14:35] exec(code, run_globals)
    [2023-10-30 14:14:35] File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 614, in
    [2023-10-30 14:14:35] engine = AsyncLLMEngine.from_engine_args(engine_args)
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/async_llm_engine.py", line 487, in from_engine_args
    [2023-10-30 14:14:35] engine = cls(engine_args.worker_use_ray,
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/async_llm_engine.py", line 270, in init
    [2023-10-30 14:14:35] self.engine = self._init_engine(*args, **kwargs)
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    [2023-10-30 14:14:35] return engine_class(*args, **kwargs)
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/llm_engine.py", line 108, in init
    [2023-10-30 14:14:35] self._init_workers_ray(placement_group)
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/llm_engine.py", line 190, in _init_workers_ray
    [2023-10-30 14:14:35] self._run_workers(
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/llm_engine.py", line 735, in _run_workers
    [2023-10-30 14:14:35] self._run_workers_in_batch(workers, method, *args, **kwargs))
    [2023-10-30 14:14:35] File "/root/vllm/vllm/engine/llm_engine.py", line 712, in _run_workers_in_batch
    [2023-10-30 14:14:35] all_outputs = ray.get(all_outputs)
    [2023-10-30 14:14:35] File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    [2023-10-30 14:14:35] return fn(*args, **kwargs)
    [2023-10-30 14:14:35] File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    [2023-10-30 14:14:35] return func(*args, **kwargs)
    [2023-10-30 14:14:35] File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2526, in get
    [2023-10-30 14:14:35] raise value
    [2023-10-30 14:14:35] ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    [2023-10-30 14:14:35] class_name: RayWorker
    [2023-10-30 14:14:35] actor_id: 3d43680d3ad4f61efd15d19401000000
    [2023-10-30 14:14:35] pid: 2583
    [2023-10-30 14:14:35] namespace: 7d0e9878-2219-4b03-ac0b-e18bc81dd12e
    [2023-10-30 14:14:35] ip: 172.16.103.199
    [2023-10-30 14:14:35] The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
    [2023-10-30 14:14:35] 2023-10-30 06:14:35,574 WARNING worker.py:2037 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff0d32bcb41d5677999734e27101000000 Worker ID: 5c3fcbef4daf379d364d76f1997500121ca7f44ebf0bf018bd326739 Node ID: f3f4cb95f16e75a2ce9e15f0339bf3db6efce0a726e0e3edae85e2e3 Worker IP address: 172.16.103.199 Worker port: 32769 Worker PID: 2584 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

@boydfd
Contributor Author

boydfd commented Nov 1, 2023

@jaywongs Since you only have 2 GPUs and set --tensor-parallel-size to 2, you should set --tensor-parallel-model-load-batch-size to 1, which means the model will be loaded on one worker at a time.

I still don't know why the OOM happened. Could you try these two things to help track down the root cause:

  1. Can you monitor memory usage while the code runs to see how it increases? The expected behavior is that one RayWorker process uses a lot of RAM and then releases it, and only after that does the next RayWorker process start using a lot of RAM.
  2. Can you write a small Python script that allocates a lot of RAM, to test how much memory your Python process is actually allowed to allocate? (See the sketch at the end of this comment.)

One last question: do you run your code in Kubernetes or Docker? There might be container memory limits, so you should consider setting a higher limit.
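
A minimal sketch for point 2 (my own example, not part of this PR; it assumes psutil is installed): allocate RAM in 1 GB chunks and print how far you get before an allocation fails or the process is killed.

import psutil

one_gb = 1024 ** 3
chunks = []
try:
    for i in range(300):  # try to allocate up to ~300 GB
        chunks.append(bytearray(one_gb))  # zero-filled, so the pages are actually committed
        rss = psutil.Process().memory_info().rss / one_gb
        free = psutil.virtual_memory().available / one_gb
        print(f"allocated ~{i + 1} GB, process RSS {rss:.1f} GB, available {free:.1f} GB")
except MemoryError:
    print("MemoryError: reached the allocation limit")

If the process is killed by the OOM killer instead of raising MemoryError, the last printed line tells you roughly where the limit is.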

@starlitsky2010

@boydfd How do I use your patch? Here is my code. Could you provide some tips?

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="llm/awq_llama-70b-chat-hf_awq", quantization="AWQ", tensor_parallel_size=4, max_parallel_loading_workers=4)

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

@boydfd
Contributor Author

boydfd commented Nov 12, 2023

@starlitsky2010 If you hit a RAM OOM, try setting max_parallel_loading_workers to a smaller number (like 1).
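
For example, with the script above it would look like this (assuming this PR's branch, where LLM accepts max_parallel_loading_workers):

from vllm import LLM

llm = LLM(
    model="llm/awq_llama-70b-chat-hf_awq",
    quantization="AWQ",
    tensor_parallel_size=4,
    max_parallel_loading_workers=1,  # load weights on one worker at a time
)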

Member

@zhuohan123 zhuohan123 left a comment


LGTM! Thank you for your contribution!

@zhuohan123 zhuohan123 merged commit 4bb6b67 into vllm-project:main Nov 21, 2023
2 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024
@SachitS

SachitS commented Jan 8, 2025

Hey, do we know why this was removed in recent versions? If we pass --max-parallel-loading-workers 1 in the engine args, we get:

NotImplementedError(
(vllm_cluster_thedrummer_anubis_70b_v1, pid=2323) ERROR 01-08 12:06:40 engine.py:366] NotImplementedError: max_concurrent_workers is not supported yet.
