Model will be loaded on different devices when using multiple gpus. #67

Closed
baichuanzhou opened this issue Apr 26, 2024 · 6 comments

@baichuanzhou
Contributor

It appears that the model is loaded on different GPUs when num_processes is set to more than one, which causes the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Here's my command to launch:

accelerate launch --num_processes=2 -m lmms_eval --model llava   --model_args pretrained="xxx,conv_template=xxx"   --tasks gqa,vqav2,scienceqa,textvqa --batch_size 1 --log_samples --log_samples_suffix xxx --output_path ./logs/

I found a temporary fix by installing a previous version:
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@bf4c78b7e405e2ca29bf76f579371382fec3dd02
With this version, multi-GPU inference works fine.

@kcz358
Collaborator

kcz358 commented Apr 28, 2024

May I ask at which line of the inference this error occurred?

@baichuanzhou
Contributor Author

baichuanzhou commented May 7, 2024

Sorry for the delay.

Here is one error message:

[lmms_eval/models/llava.py:386] ERROR Error Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution) in generating.
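
For what it's worth, this class of failure can be reproduced with a minimal PyTorch snippet that has nothing to do with lmms_eval (purely illustrative, and it needs a machine with at least two GPUs): a layer whose weights sit on one device receives an input that lives on another.

import torch

# Illustrative only: weights on cuda:0, input on cuda:1.
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to("cuda:0")
x = torch.randn(1, 3, 224, 224, device="cuda:1")
conv(x)  # RuntimeError: Expected all tensors to be on the same device, ...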

@kcz358
Collaborator

kcz358 commented May 7, 2024

You might also want to try setting device_map=auto in your model_args when you do multi-processing:

--model_args pretrained=xxx,conv_template=xxx,device_map=auto

@baichuanzhou
Contributor Author

baichuanzhou commented May 7, 2024

Setting device_map to auto didn't do the trick. Here's my command:

srun -p xxx --gres=gpu:4 accelerate launch --num_processes=4 --main_process_port 19500 -m lmms_eval --model llava   --model_args pretrained="xxx,conv_template=xxx,device_map=auto"   --task textvqa_val,vizwiz_vqa_val,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_hermes2_llama3_merged_data_v1.1_anyres_tune_vit --output_path ./logs/ #

One difference I noticed between evaluating with v0.1.2 and with bf4c78b7e405e2ca29bf76f579371382fec3dd02 was this logger output:
v0.1.2: [lmms_eval/models/llava.py:124] INFO Using single device: cuda
bf4c78b7e405e2ca29bf76f579371382fec3dd02: [lmms_eval/models/llava.py:104] INFO Using 4 devices with data parallelism

Line 104 appears to be here.
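
For context, those two log lines suggest a device-selection pattern roughly like the following (a sketch built on Hugging Face Accelerate, not the actual lmms_eval source; names are illustrative):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
if accelerator.num_processes > 1:
    # Data parallelism: each process keeps a full copy of the model on its
    # own GPU, so the model must live on exactly one device per process.
    device = torch.device(f"cuda:{accelerator.local_process_index}")
else:
    # Single process: one device, or device_map=auto to shard across GPUs.
    device = torch.device("cuda")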

@kcz358
Collaborator

kcz358 commented May 7, 2024

Sorry, my bad.

You should set device_map="" when using multiple processes. Set device_map=auto only when you use num_processes=1.
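
For reference, the two launch variants would then look roughly like this (placeholders kept as xxx; the task list and other flags are illustrative, not prescriptive):

Multi-process data parallelism (one full model copy per GPU):
accelerate launch --num_processes=4 -m lmms_eval --model llava --model_args pretrained=xxx,conv_template=xxx,device_map="" --tasks textvqa_val --batch_size 1 --output_path ./logs/

Single process, model sharded across GPUs:
accelerate launch --num_processes=1 -m lmms_eval --model llava --model_args pretrained=xxx,conv_template=xxx,device_map=auto --tasks textvqa_val --batch_size 1 --output_path ./logs/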

@baichuanzhou
Contributor Author

Thanks. Now it works!
