Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.
/ vllm-gptq Public archive
forked from vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

License

Notifications You must be signed in to change notification settings

QwenLM/vllm-gptq

 
 

Repository files navigation

Important

The End for QwenLM/vllm-gptq

Since December 2023, vllm has supported 4-bit GPTQ, followed by 8-bit GPTQ support since March 2024. Additionally, vllm now includes Marlin and MoE support.

This repository has fulfilled its role. We recommend transitioning to the original vllm for Qwen models to take advantage of the latest features and ongoing improvements.

vLLM

本仓库是基于vLLM(版本0.2.2)进行修改的一个分支,主要为了支持Qwen系列大语言模型的GPTQ量化推理。

This repo is a fork of vLLM(Version: 0.2.2), which supports the GPTQ model inference of Qwen large language models.

新增功能

该版本vLLM跟官方0.22版本的主要区别在于增加GPTQ int4量化模型支持。我们在Qwen-72B-Chat上测试了量化模型性能,结果如下表。

The features we added is to support GPTQ int4 quantization. We test on the Qwen-72B and the test performance is shown in the table.

context length generate length tokens/s tokens/s tokens/s tokens/s tokens/s tokens/s tokens/s tokens/s
tp=8 tp=8 tp=4 tp=4 tp=2 tp=2 tp=1 tp=1
fp16 a16w16 int4 a16w4 fp16 a16w16 int4 a16w4 fp16 a16w16 int4 a16w4 fp16 a16w16 int4 a16w4
1 2k 26.42 27.68 24.98 27.19 17.39 20.76 - 14.63
6k 2k 24.93 25.98 22.76 24.56 - 18.07 - -
14k 2k 22.67 22.87 19.38 19.28 - 14.51 - -
30k 2k 19.95 19.87 17.05 16.93 - - - -

如何开始

安装

为了安装vLLM,你必须满足以下要求:

To install vLLM, you must meet the below requirements.

  • torch >= 2.0
  • cuda 11.8 or 12.1

目前,我们仅支持源码安装。

You can install vLLM from source.

如果你使用cuda 12.1和torch 2.1,你可以使用以下方法安装

If you use cuda 12.2 and torch 2.1, you can install vLLM by

   git clone https://github.com/QwenLM/vllm-gptq.git
   cd vllm-gptq
   pip install -e .

其他情况下,安装可能较为复杂。一个可能的方式是,安装对应版本的cuda和PyTorch后,删除requirements.txt的torch依赖,并删除pyproject.toml,再尝试执行pip install -e .

In other cases, installation may be complicated. One possible way is to install the corresponding versions of CUDA and PyTorch, **delete the torch dependencies in Requirements.txt, delete pyproject.toml, and then try to execute pip install -e.

如何使用

我们在此仅介绍如何运行Qwen的量化模型。

We only introduce how to run Qwen's quantized model.

关于Qen量化模型的示例代码,代码目录在tests/qwen/。

Regarding the example code of Qwen quantized model, the code directory is in tests/qwen/.

注意:当前本仓库仅支持Int4量化模型。Int8量化模型将在后续支持。

Note: The current warehouse only supports Int4 quantized model. Int8 quantization will be supported in near future.

批处理调用模型

注意:运行以下代码,需要先进入对应的目录:tests/qwen/。

Note: To run the following code, you need to enter the directory 'tests/qwen/' first.

from vllm_wrapper import vLLMWrapper

if __name__ == '__main__':
    model = "Qwen/Qwen-72B-Chat-Int4"

    vllm_model = vLLMWrapper(model,
                             quantization = 'gptq',
                             dtype="float16",
                             tensor_parallel_size=1)

    response, history = vllm_model.chat(query="你好",
                                        history=None)
    print(response)
    response, history = vllm_model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。",
                                        history=history)
    print(response)
    response, history = vllm_model.chat(query="给这个故事起一个标题",
                                        history=history)
    print(response)

API方式调用模型

除去安装vLLM外,以API方式调用模型需要额外安装fastchat

In addition to installing vLLM, you should install FastChat.

pip install fschat
启动Server

step 1. 启动控制器

step 1. Launch the controller

    python -m fastchat.serve.controller

step 2. 启动模型worker

step 2. Launch the model worker

    python -m fastchat.serve.vllm_worker --model-path $model_path --tensor-parallel-size 1 --trust-remote-code

step 3. 启动服务器

step 3. Launch the openai api server

   python -m fastchat.serve.openai_api_server --host localhost --port 8000
API调用

step 1. 安装openai-python

step 1. install openai-python

pip install --upgrade openai

step 2. 调用接口

step 2. Query APIs

import openai
# to get proper authentication, make sure to use a valid key that's listed in
# the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "qwen"
call_args = {
    'temperature': 1.0,
    'top_p': 1.0,
    'top_k': -1,
    'max_tokens': 2048, # output-len
    'presence_penalty': 1.0,
    'frequency_penalty': 0.0,
}
# create a chat completion
completion = openai.ChatCompletion.create(
  model=model,
  messages=[{"role": "user", "content": "Hello! What is your name?"}],
  **call_args
)
# print the completion
print(completion.choices[0].message.content)

About

A high-throughput and memory-efficient inference and serving engine for LLMs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 82.9%
  • Cuda 15.8%
  • Other 1.3%