
Running vLLM on a Ray cluster, logging stuck at loading #5052

Closed
maherr13 opened this issue May 25, 2024 · 6 comments
@maherr13

Your current environment

I have two machines with one RTX 4090 each (2×4090 total). I wanted to run a model (e.g. gpt-neox-20b) with vLLM on a Ray cluster, so I followed the documentation and set up the cluster:

on the head node:
ray start --head

on the worker node:
ray start --address=<head-node-ip>:<port>

I managed to set up the cluster, but when I run a simple inference script:

from vllm import LLM
llm = LLM("/home/administrator/nlp-deploy/models/gpt-neox-20b/", tensor_parallel_size=2, disable_custom_all_reduce=True, enforce_eager=True)

prompt = "GPT-NeoX-20B is"
output = llm.generate(prompt)
print(output)

the model gets stuck at loading.
nvidia-smi on the head and the worker node while loading:
[screenshot]

Logs when running the script:
[screenshot]

Versions:
CUDA: 12.4
Ray: 2.22

🐛 Describe the bug

I tried other workarounds mentioned in related issues, such as:

export NCCL_P2P_DISABLE=1
disable_custom_all_reduce=True
enforce_eager=True

but none of them solved it (combined roughly as in the sketch below).
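
For reference, this is roughly how the workarounds were combined. A sketch only: the model path is my local path from above, and whether NCCL_P2P_DISABLE actually reaches the Ray workers on the other machine depends on how the environment is set up there.

import os

# Assumption: NCCL reads this at initialisation time, so it has to be present
# in the environment of every worker process (on a multi-node Ray cluster that
# means exporting it before `ray start` on both machines, not only here).
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(
    "/home/administrator/nlp-deploy/models/gpt-neox-20b/",
    tensor_parallel_size=2,          # one GPU per machine, two machines
    disable_custom_all_reduce=True,  # fall back to NCCL all-reduce
    enforce_eager=True,              # skip CUDA graph capture
)
output = llm.generate("GPT-NeoX-20B is")
print(output)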

@maherr13 maherr13 added the bug Something isn't working label May 25, 2024
@maherr13
Author

@simon-mo can I get some help?

@simon-mo
Collaborator

This setup is a bit hard to debug: while the multi-node setting should work out of the box, the hang could be due to a connectivity issue between the two machines.

@maherr13
Author

Thanks for your response @simon-mo.

  • The Ray cluster can see both machines; ray status reports 2 available GPUs.
  • nvidia-smi shows that Ray has already allocated the resources needed for loading the model.
  • The firewall on each server is configured to allow any request from the other server.

What could make this setup unsuitable for the application?

@DarkLight1337
Member

We have added documentation for this situation in #5430. Please take a look.
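
(For anyone hitting the same hang: a quick sanity check in the spirit of that documentation is to run a plain torch.distributed all-reduce across the two machines, independent of vLLM and Ray, to confirm that NCCL traffic between the nodes actually works. A minimal sketch, not the exact script from the docs; the head-node IP and port are placeholders.)

# Run the same file on both machines with torchrun, e.g.
#   head node:   torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
#                  --master_addr=<head-node-ip> --master_port=29500 nccl_check.py
#   worker node: the same command with --node_rank=1
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")               # rank/world size come from torchrun env vars
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one process per GPU here
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                                     # hangs here if inter-node NCCL traffic is blocked
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")  # expect 2.0 with two ranks
dist.destroy_process_group()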

@cmwilhelm

cmwilhelm commented Aug 16, 2024

Chiming in to say that I'm seeing the same issue.

I have an instance of LLM running in a Ray actor as part of a larger batch application. It runs on a single machine. Normally, Ray aggregates the logs of the various processes to stdout/err. This works fine until vllm begins loading model weights, and then all logging ceases.

However, the application itself continues to run fine. I cannot see logs after that point, but it produces file output to s3, and other instrumentation (datadog, in my case) continues to run and report results. This appears to be purely a logging issue, and manifests during the model loading process.

Here is some example output from my Ray app that looks very similar to @maherr13's:

2024-08-16 13:44:02,020	INFO worker.py:1772 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2024-08-16 13:44:03,456	INFO runner.py:55 -- Starting batch inference server with settings:
 llm_model_config=LLMModelConfig(name_or_path='/artifacts', dtype='bfloat16', trust_remote_code=False, revision=None, tensor_parallel_size=None, max_model_len=None, fast_tokenizer=True, gpu_memory_utilization=0.9) format_config=FormatConfig(new_line_symbol='\n', system_message=None, instruction_prefix=None, generation_prompt=None, add_generation_prompt=False, chat_template=None) generate_config=GenerateConfig(max_tokens=None, max_context_length=4096, temperature=0.2, top_k=50, top_p=1.0, drop_long_outputs=False) pipeline_config=PipelineConfig(sqs_queue_name='MiseBatchJob-refine-test', target_bucket='ai2-oe-data', target_object_prefix='refine/batch_inference/dev/dd2', predictors_per_gpu=None, num_workers=1, max_num_messages_per_worker=None, num_tokenizers=4, num_gpus=1, tokenization_batch_size=3000, generation_batch_size=64, decoding_batch_size=3000, output_serializer=<OutputSerializerEnum.default: 'default'>) artifacts_s3_path=None dd_namespace='batch_inference' gpumon_interval=5 terminal_dd_flush_time=20 dummy_mode=False timestamp_ns=1723841043457176601 job_id=01000000 worker_id=01000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e
(pid=48549) 2024-08-16 13:44:03,946	WARNING compression.py:68 -- overriding existing compression handler for '.zst' timestamp_ns=1723841043946756624 job_id=01000000 worker_id=b22186fdadd710f33ce6eaa6b38e0388e8cc54716ea6ed1bc6e5b518 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=2914c373bb5b2bcc2423d80901000000 task_id=ffffffffffffffff2914c373bb5b2bcc2423d80901000000
(SQSActor pid=48549) 2024-08-16 13:44:04,083	INFO credentials.py:1147 -- Found credentials in environment variables. timestamp_ns=1723841044083466059 job_id=01000000 worker_id=b22186fdadd710f33ce6eaa6b38e0388e8cc54716ea6ed1bc6e5b518 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=2914c373bb5b2bcc2423d80901000000 task_id=ffffffffffffffff2914c373bb5b2bcc2423d80901000000
2024-08-16 13:44:04,860	INFO config.py:59 -- PyTorch version 2.4.0 available. timestamp_ns=1723841044860294009 job_id=01000000 worker_id=01000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e
(pid=53236) 2024-08-16 13:44:11,025	INFO config.py:59 -- PyTorch version 2.4.0 available. timestamp_ns=1723841051025810069 job_id=01000000 worker_id=4b00ace9ceafe77497c3563a77806cbc612582210c321ca4751fdef8 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=7884fdf2e298268c7dfe9b4301000000 task_id=ffffffffffffffff7884fdf2e298268c7dfe9b4301000000
(pid=53237) 2024-08-16 13:44:08,075	WARNING compression.py:68 -- overriding existing compression handler for '.zst' timestamp_ns=1723841048075279388 job_id=01000000 worker_id=5e0aa0f7ebeddd16cb342f22c8c7368f1de39ec57e79136b280c7362 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=43112b538178701f46c6af5101000000 task_id=ffffffffffffffff43112b538178701f46c6af5101000000
(PredictorActor pid=53236) WARNING 08-16 13:44:13 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(PredictorActor pid=53236) INFO 08-16 13:44:13 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
(PredictorActor pid=53236) INFO 08-16 13:44:13 model_runner.py:720] Starting to load model /artifacts...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.41it/s]
(PredictorActor pid=53236)
(PredictorActor pid=53236) INFO 08-16 13:44:14 model_runner.py:732] Loading model weights took 0.9381 GB
(PredictorActor pid=53236) INFO 08-16 13:44:14 gpu_executor.py:102] # GPU blocks: 221755, # CPU blocks: 21845
(PredictorActor pid=53236) INFO 08-16 13:44:17 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(PredictorActor pid=53236) INFO 08-16 13:44:17 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(PredictorActor pid=53236) INFO 08-16 13:44:28 model_runner.py:1225] Graph capturing finished in 11 secs.

You can see logs reported from other task/actor processes earlier in the flow, and then once PredictorActor (which houses the LLM instance) gets started, all logs cease.

I know this is an issue with vllm specifically, because replacing it with a dummy implementation in PredictorActor resolves all the logging issues.

My best guess here was that it has something to do with vllm's logger.py, but that is just a stab in the dark.

@cmwilhelm

Folks, I have confirmed that vllm's custom logger.py routine has the effect of silencing all other logs in ray applications.

Setting export VLLM_CONFIGURE_LOGGING=0 completely resolved the issue.
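
(If exporting the variable in the shell is awkward, for example inside a managed job, it seems it can also be set in-process as long as it happens before vllm is first imported; a small sketch:)

import os

# Assumption: vLLM checks VLLM_CONFIGURE_LOGGING when the package is imported,
# so this must run before the first `import vllm` anywhere in the process.
os.environ["VLLM_CONFIGURE_LOGGING"] = "0"

from vllm import LLM  # vLLM now leaves the existing logging configuration alone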

@rkooo567 rkooo567 added the ray anything related with ray label Aug 17, 2024
@rkooo567 rkooo567 self-assigned this Aug 17, 2024