
Please add a version that can run with 2/4/8-GPU tensor parallelism #231

Closed
aabbccddwasd opened this issue Sep 19, 2024 · 27 comments

@aabbccddwasd

When using the officially released quantized models, I found that tensor parallelism does not work.
The cause is that intermediate_size is 29568: divided by the group size (128) it gives 231 groups, and 231 is not divisible by 2, 4, or 8. This triggers an error in vLLM and makes tensor parallelism impossible.

Please either quantize with a different group size so that intermediate_size / group_size is divisible by 2, 4, and 8, or slightly modify the model so that intermediate_size becomes 29696 as in Qwen2.5; that would allow tensor parallelism to work normally with group-size-128 quantization.

If neither of these is feasible, please explain how to run these quantized models with tensor parallelism.
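
For reference, a quick arithmetic check of the constraint described above (a minimal sketch; the group size of 128 is taken from the released GPTQ config):

intermediate_size = 29568
group_size = 128
groups = intermediate_size // group_size          # 231
for tp in (2, 4, 8):
    print(tp, groups % tp == 0)                   # prints False for 2, 4 and 8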

@qingwu11

Same problem here; hoping for a fix.

@osoctz

osoctz commented Sep 21, 2024

Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
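
For reference, a minimal sketch of that config edit (the local checkpoint path is illustrative; note that, as discussed below, this shrinks the model rather than padding it):

import json

cfg_path = "Qwen2-VL-72B-Instruct-GPTQ-Int4/config.json"   # illustrative local path
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["intermediate_size"] = 29184    # 29184 / 128 = 228 groups, divisible by 2 and 4
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)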

@bash99

bash99 commented Sep 21, 2024

Same problem here; hoping for a fix.

@bash99

bash99 commented Sep 21, 2024

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd

The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html
Doesn't your change effectively just drop a small number of parameters?

I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2.
This way it runs and the results look normal.

@aabbccddwasd
Author

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
>
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.

Isn't padding also ignoring parameters? Could we instead add meaningless empty parameters here?

@aabbccddwasd
Author

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
>
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.

I read the docs, and I think padding means adding parameters. Did you perhaps misstate it?

@bash99

bash99 commented Sep 22, 2024

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.
>
> I read the docs, and I think padding means adding parameters. Did you perhaps misstate it?

The original value is 29568, right? The docs zero-pad it to 29696, while your change shrinks it to 29184.
So that really does mean fewer parameters, i.e. a tiny fraction of the weights is ignored?

@aabbccddwasd
Author

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.
>
> I read the docs, and I think padding means adding parameters. Did you perhaps misstate it?
>
> The original value is 29568, right? The docs zero-pad it to 29696, while your change shrinks it to 29184. So that really does mean fewer parameters, i.e. a tiny fraction of the weights is ignored?

> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2.

Shouldn't that sentence then read:
"I'm on 2x V100-32G, which is enough for me, so I only added 128, giving 29696; that is 232 groups, a multiple of 2."?

@QwertyJack

intermediate_size=29440 runs; intermediate_size=29568 fails to start:

RuntimeError: start (14848) + length (14848) exceeds dimension size (29568).

@QwertyJack

> intermediate_size=29440 runs; intermediate_size=29568 fails to start:
>
> RuntimeError: start (14848) + length (14848) exceeds dimension size (29568).

To add:

  1. Shrink or pad? Shrinking runs directly, while padding requires re-quantization; in my tests, shrinking does not seem to noticeably hurt output quality.
  2. Shrink or pad by how much? It depends on tp: intermediate_size must be an integer multiple of 128 × tp. So for two GPUs (tp=2) shrink or pad by 128, and for eight GPUs (tp=8) shrink by 896 or pad by 128 (see the sketch after this comment).

In any case, I still hope the team releases a version padded to 29696, since the calibration dataset has not been published.
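
A small sketch of that rule (group size 128 assumed, as in the discussion above); it prints the nearest intermediate sizes you could shrink or pad to for each tensor-parallel degree:

ORIG = 29568
GROUP = 128

for tp in (2, 4, 8):
    step = GROUP * tp                        # each rank must get whole 128-wide groups
    shrink = (ORIG // step) * step           # largest valid size not above ORIG
    pad = shrink + step                      # smallest valid size above ORIG
    print(f"tp={tp}: shrink to {shrink} or pad to {pad}")

For tp=2/4/8 this gives 29440/29184/28672 for shrinking and 29696 for padding, matching the values reported in this thread.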

@whitesay

Same problem here. I'm deploying on 8x 3090, and by this logic it seems I can only shrink it to 28672.

@whitesay

With 8 GPUs, shrinking to 28672 works, but increasing to 29696 gives a similar "exceeds dimension size" error.

@bash99

bash99 commented Sep 23, 2024

> Shouldn't that sentence then read: "I'm on 2x V100-32G, which is enough for me, so I only added 128, giving 29696; that is 232 groups, a multiple of 2."?

QwertyJack explained it clearly above: without padding plus re-quantization you cannot add, and shrinking = ignoring a small number of parameters.

@QwertyJack

What a coincidence: this issue number is also #231 (29568 / 128 = 231).

@Cherryjingyao

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
>
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.

What command are you running? After editing config.json I ran

VLLM_WORKER_MULTIPROC_METHOD=spawn cuda_visible_devices=0,2 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2

and still ran out of GPU memory. Each card has 40 GB.

@osoctz

osoctz commented Sep 23, 2024

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
>
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.

Yes, I'm on 4 GPUs, --tensor-parallel-size 4.

@liuyanyi

Tensor parallelism has to wait for the team to update the weights. If you need it urgently, you can try pipeline parallelism first: the PR submitted today has already been merged, so vLLM's main branch works.

@bash99

bash99 commented Sep 23, 2024

> Changing intermediate_size in config.json to 29184 works for now; I'm not yet sure what the side effects are. @aabbccddwasd
>
> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html Doesn't your change effectively just drop a small number of parameters?
> I'm on 2x V100-32G, which is enough for me, so I only subtracted 128, giving 29440; that is 230 groups, a multiple of 2. This way it runs and the results look normal.
>
> What command are you running? After editing config.json I ran VLLM_WORKER_MULTIPROC_METHOD=spawn cuda_visible_devices=0,2 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2 and still ran out of GPU memory. Each card has 40 GB.

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-72B-Instruct-GPTQ-Int4 --model Qwen2-VL-72B-Instruct-GPTQ-Int4 --port 7865 --dtype half --trust-remote-code --kv-cache-dtype fp8 -q gptq --disable-log-requests --gpu-memory-utilization 0.998 --max-model-len 8192 --enforce_eager -tp 2

V100-32G x 2

@niaoyu

niaoyu commented Sep 23, 2024

> Tensor parallelism has to wait for the team to update the weights. If you need it urgently, you can try pipeline parallelism first: the PR submitted today has already been merged, so vLLM's main branch works.

Thanks so much!
By the way, if I want to use 4x A10 to run Qwen2-VL-72B-Instruct-GPTQ-Int4, is the following command correct?

 python3 -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
      --pipeline-parallel-size 4 \
      --gpu-memory-utilization 0.95 \
      --max-num-seqs 16 \
      --max-model-len 4096 \
      --tokenizer-mode auto \
      --disable-log-requests

Also, could you tell us the proper version of transformers? There seems to be a bug in the latest version, as mentioned in vllm-project/vllm#7905 (comment), so we should not use the latest transformers library:

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

But I see you mentioned "qwen2 vl need the latest transformer library" in vllm-project/vllm#8696, so has that bug been fixed?

@aabbccddwasd
Author

So when will the tensor-parallel-capable version be uploaded?

@liuyanyi

> Thanks so much!
> By the way, if I want to use 4x A10 to run Qwen2-VL-72B-Instruct-GPTQ-Int4, is the following command correct?
> […]
> But I see you mentioned "qwen2 vl need the latest transformer library" in vllm-project/vllm#8696, so has that bug been fixed?

It's not clear in the PR; you should use the specific pinned version, not the latest.

@kq-chen
Collaborator

kq-chen commented Sep 24, 2024

Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. To utilize the new checkpoints, please download them again from Hugging Face.

You can use the following command to perform inference on the quantized 72B model with vLLM tensor parallelism:

Server:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --served-model-name qwen2vl \
  --model Qwen/Qwen2-VL-72B-Instruct-AWQ \
  --tensor-parallel-size 4 \
  --max_num_seqs 16

Client:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "qwen2vl",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'
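
For anyone who prefers Python over curl, here is a minimal client sketch using the openai package (assuming the server above is reachable at the default http://localhost:8000 and serves the model name qwen2vl):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is not checked unless --api-key is set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2vl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
)
print(response.choices[0].message.content)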

kq-chen closed this as completed on Sep 24, 2024
@NaiveYan

> Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. […]

Any plan for ModelScope?

@aabbccddwasd
Author

Thanks! Qwen is the best, forever!

@kq-chen
Collaborator

kq-chen commented Sep 25, 2024

@NaiveYan The 72B AWQ/GPTQ checkpoints have been updated on ModelScope.

@YChengxin

> Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. […]

Would you also consider updating the unquantized models (e.g. Qwen2.5-72b-instruct) to intermediate_size 29696? That would make it easy for everyone to fine-tune them and then quantize for accelerated deployment. ღ( ´・ᴗ・` )

@kq-chen
Collaborator

kq-chen commented Oct 1, 2024

@YChengxin you can pad the checkpoint with the following code snippet:

import os

import torch
from torch.nn import functional as F
from transformers import Qwen2VLForConditionalGeneration

def fix_dim(
    model_path: str,
    output_path: str,
    src_dim: int = 29568,
    tar_dim: int = 29696,
):
    pad_size = tar_dim - src_dim
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype='auto', device_map='auto')
    sd = model.state_dict()
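    # Only the MLP projections carry the intermediate dimension; all other tensors are left untouched.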
    for i, k in enumerate(sd):
        v = sd[k]
        if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
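            # up_proj / gate_proj weights have shape (intermediate, hidden): interleave a zero row
            # after each of the first pad_size rows, keep the remaining rows, so rows grow from src_dim to tar_dim.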
            prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(src_dim*2, -1)[:pad_size*2]
            new_v = torch.cat([prev_v, v[pad_size:]], dim=0)
            sd[k] = new_v
        elif 'mlp.down_proj.weight' in k:
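            # down_proj weights have shape (hidden, intermediate): same interleaved zero padding,
            # applied along the column dimension, growing it from src_dim to tar_dim.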
            prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(v.shape[0], src_dim*2)[:, :pad_size*2]
            new_v = torch.cat([prev_v, v[:, pad_size:]], dim=1)
            sd[k] = new_v
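    # Save only the padded state dict; config.json's intermediate_size will likely also need to be set to tar_dim before reuse.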
    os.makedirs(output_path, exist_ok=True)
    torch.save(sd, f"{output_path}/pytorch_model.bin")
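
A hypothetical invocation of the snippet above (paths are illustrative; you would still need to copy config.json, the tokenizer files, etc. next to the saved weights, and update intermediate_size in the config, before quantizing or serving the padded checkpoint):

fix_dim(
    model_path="Qwen/Qwen2-VL-72B-Instruct",        # illustrative source checkpoint
    output_path="./qwen2-vl-72b-instruct-padded",   # illustrative output directory
)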
