
Running vLLM on a Ray cluster, logging stuck at loading #5052

Closed
maherr13 opened this issue May 25, 2024 · 6 comments
@maherr13

Your current environment

I have two machines with one RTX 4090 each (2×4090 total). I wanted to run a model (e.g. gpt-neox-20b) with vLLM on a Ray cluster, so I followed the documentation and set up the cluster:

on the head node:
ray start --head

on the worker node:
ray start --address=<head-node-ip>:<port>

I managed to set up the cluster, but when I run a simple inference script:

from vllm import LLM
llm = LLM("/home/administrator/nlp-deploy/models/gpt-neox-20b/", tensor_parallel_size=2, disable_custom_all_reduce=True, enforce_eager=True)

prompt = "GPT-NeoX-20B is"
output = llm.generate(prompt)
print(output)

the model gets stuck at loading.
nvidia-smi on the head and the worker node while loading:
[screenshot]

Logs when running the script:
[screenshot]

Versions:
CUDA: 12.4
Ray: 2.22

🐛 Describe the bug

I tried other workarounds mentioned in related issues, such as:

export NCCL_P2P_DISABLE=1
disable_custom_all_reduce=True
enforce_eager=True

but none of them solved it (combined roughly as in the sketch below).
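
For reference, this is roughly how the workarounds were combined. A sketch only: the model path is my local path from above, and whether NCCL_P2P_DISABLE actually reaches the Ray workers on the other machine depends on how the environment is set up there.

import os

# Assumption: NCCL reads this at initialisation time, so it has to be present
# in the environment of every worker process (on a multi-node Ray cluster that
# means exporting it before `ray start` on both machines, not only here).
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(
    "/home/administrator/nlp-deploy/models/gpt-neox-20b/",
    tensor_parallel_size=2,          # one GPU per machine, two machines
    disable_custom_all_reduce=True,  # fall back to NCCL all-reduce
    enforce_eager=True,              # skip CUDA graph capture
)
output = llm.generate("GPT-NeoX-20B is")
print(output)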

@maherr13 maherr13 added the bug Something isn't working label May 25, 2024
@maherr13
Author

@simon-mo can I get some help?

@simon-mo
Collaborator

This setup is a bit hard to debug: while the multi-node setting should work out of the box, the hang could be due to a connectivity issue between the two machines.

@maherr13
Author

Thanks for your response @simon-mo.

  • The Ray cluster can see both machines; ray status reports 2 available GPUs.
  • nvidia-smi shows that Ray has already allocated the resources needed for loading the model.
  • The firewall on each server is configured to allow any request from the other server.

What could make this setup unsuitable for the application?

@DarkLight1337
Member

We have added documentation for this situation in #5430. Please take a look.
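
(For anyone hitting the same hang: a quick sanity check in the spirit of that documentation is to run a plain torch.distributed all-reduce across the two machines, independent of vLLM and Ray, to confirm that NCCL traffic between the nodes actually works. A minimal sketch, not the exact script from the docs; the head-node IP and port are placeholders.)

# Run the same file on both machines with torchrun, e.g.
#   head node:   torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
#                  --master_addr=<head-node-ip> --master_port=29500 nccl_check.py
#   worker node: the same command with --node_rank=1
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")               # rank/world size come from torchrun env vars
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one process per GPU here
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                                     # hangs here if inter-node NCCL traffic is blocked
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")  # expect 2.0 with two ranks
dist.destroy_process_group()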

@cmwilhelm

cmwilhelm commented Aug 16, 2024

Chiming in to say that I'm seeing the same issue.

I have an instance of LLM running in a Ray actor as part of a larger batch application. It runs on a single machine. Normally, Ray aggregates the logs of the various processes to stdout/err. This works fine until vllm begins loading model weights, and then all logging ceases.

However, the application itself continues to run fine. I cannot see logs after that point, but it produces file output to s3, and other instrumentation (datadog, in my case) continues to run and report results. This appears to be purely a logging issue, and manifests during the model loading process.

Here is some example output from my Ray app that looks very similar to @maherr13's:

2024-08-16 13:44:02,020	INFO worker.py:1772 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2024-08-16 13:44:03,456	INFO runner.py:55 -- Starting batch inference server with settings:
 llm_model_config=LLMModelConfig(name_or_path='/artifacts', dtype='bfloat16', trust_remote_code=False, revision=None, tensor_parallel_size=None, max_model_len=None, fast_tokenizer=True, gpu_memory_utilization=0.9) format_config=FormatConfig(new_line_symbol='\n', system_message=None, instruction_prefix=None, generation_prompt=None, add_generation_prompt=False, chat_template=None) generate_config=GenerateConfig(max_tokens=None, max_context_length=4096, temperature=0.2, top_k=50, top_p=1.0, drop_long_outputs=False) pipeline_config=PipelineConfig(sqs_queue_name='MiseBatchJob-refine-test', target_bucket='ai2-oe-data', target_object_prefix='refine/batch_inference/dev/dd2', predictors_per_gpu=None, num_workers=1, max_num_messages_per_worker=None, num_tokenizers=4, num_gpus=1, tokenization_batch_size=3000, generation_batch_size=64, decoding_batch_size=3000, output_serializer=<OutputSerializerEnum.default: 'default'>) artifacts_s3_path=None dd_namespace='batch_inference' gpumon_interval=5 terminal_dd_flush_time=20 dummy_mode=False timestamp_ns=1723841043457176601 job_id=01000000 worker_id=01000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e
(pid=48549) 2024-08-16 13:44:03,946	WARNING compression.py:68 -- overriding existing compression handler for '.zst' timestamp_ns=1723841043946756624 job_id=01000000 worker_id=b22186fdadd710f33ce6eaa6b38e0388e8cc54716ea6ed1bc6e5b518 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=2914c373bb5b2bcc2423d80901000000 task_id=ffffffffffffffff2914c373bb5b2bcc2423d80901000000
(SQSActor pid=48549) 2024-08-16 13:44:04,083	INFO credentials.py:1147 -- Found credentials in environment variables. timestamp_ns=1723841044083466059 job_id=01000000 worker_id=b22186fdadd710f33ce6eaa6b38e0388e8cc54716ea6ed1bc6e5b518 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=2914c373bb5b2bcc2423d80901000000 task_id=ffffffffffffffff2914c373bb5b2bcc2423d80901000000
2024-08-16 13:44:04,860	INFO config.py:59 -- PyTorch version 2.4.0 available. timestamp_ns=1723841044860294009 job_id=01000000 worker_id=01000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e
(pid=53236) 2024-08-16 13:44:11,025	INFO config.py:59 -- PyTorch version 2.4.0 available. timestamp_ns=1723841051025810069 job_id=01000000 worker_id=4b00ace9ceafe77497c3563a77806cbc612582210c321ca4751fdef8 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=7884fdf2e298268c7dfe9b4301000000 task_id=ffffffffffffffff7884fdf2e298268c7dfe9b4301000000
(pid=53237) 2024-08-16 13:44:08,075	WARNING compression.py:68 -- overriding existing compression handler for '.zst' timestamp_ns=1723841048075279388 job_id=01000000 worker_id=5e0aa0f7ebeddd16cb342f22c8c7368f1de39ec57e79136b280c7362 node_id=4e48c3b3b1270435faca85c2c1f87583308ab16b680df7bfb64b033e actor_id=43112b538178701f46c6af5101000000 task_id=ffffffffffffffff43112b538178701f46c6af5101000000
(PredictorActor pid=53236) WARNING 08-16 13:44:13 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(PredictorActor pid=53236) INFO 08-16 13:44:13 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
(PredictorActor pid=53236) INFO 08-16 13:44:13 model_runner.py:720] Starting to load model /artifacts...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.41it/s]
(PredictorActor pid=53236)
(PredictorActor pid=53236) INFO 08-16 13:44:14 model_runner.py:732] Loading model weights took 0.9381 GB
(PredictorActor pid=53236) INFO 08-16 13:44:14 gpu_executor.py:102] # GPU blocks: 221755, # CPU blocks: 21845
(PredictorActor pid=53236) INFO 08-16 13:44:17 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(PredictorActor pid=53236) INFO 08-16 13:44:17 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(PredictorActor pid=53236) INFO 08-16 13:44:28 model_runner.py:1225] Graph capturing finished in 11 secs.

You can see logs reported from other task/actor processes earlier in the flow, and then once PredictorActor (which houses the LLM instance) gets started, all logs cease.

I know this is an issue with vllm specifically, because replacing it with a dummy implementation in PredictorActor resolves all the logging issues.

My best guess here was that it has something to do with vllm's logger.py, but that is just a stab in the dark.

@cmwilhelm

Folks, I have confirmed that vllm's custom logger.py routine has the effect of silencing all other logs in ray applications.

Setting export VLLM_CONFIGURE_LOGGING=0 completely resolved the issue.
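
(If exporting the variable in the shell is awkward, for example inside a managed job, it seems it can also be set in-process as long as it happens before vllm is first imported; a small sketch:)

import os

# Assumption: vLLM checks VLLM_CONFIGURE_LOGGING when the package is imported,
# so this must run before the first `import vllm` anywhere in the process.
os.environ["VLLM_CONFIGURE_LOGGING"] = "0"

from vllm import LLM  # vLLM now leaves the existing logging configuration alone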

@rkooo567 rkooo567 added the ray anything related with ray label Aug 17, 2024
@rkooo567 rkooo567 self-assigned this Aug 17, 2024