api_calls_vllm_en
For more detailed information about the OPENAI API, see: https://platform.openai.com/docs/api-reference
This is a simple server demo implemented with FastAPI that mimics the style of the OPENAI API. You can use this API demo to quickly build personal websites backed by Chinese large language models, as well as other interesting web demos.
This implementation uses vLLM to deploy the LLM backend service. It currently does not support loading LoRA models, CPU-only deployment, or 8-bit inference.
Install dependencies
pip install fastapi uvicorn shortuuid vllm fschat
Start script
python scripts/openai_server_demo/openai_api_server_vllm.py --model /path/to/base_model --tokenizer-mode slow --served-model-name chinese-llama-alpaca-2
Parameter explanation
- --model {base_model}: Directory that holds the full (merged) weights and configuration files of the Chinese-Alpaca-2 model.
- --tokenizer {tokenizer_path}: Directory that holds the corresponding tokenizer. If this parameter is not provided, its default value is the same as --model.
- --tokenizer-mode {tokenizer-mode}: The mode of the tokenizer. When using models based on LLaMA/Llama-2, set this to slow.
- --tensor-parallel-size {tensor_parallel_size}: The number of GPUs used. The default is 1.
- --served-model-name {served-model-name}: The model name used in the API. If using the Chinese Alpaca-2 series models, the model name must include chinese-llama-alpaca-2.
- --host {host_name}: The host name of the deployed service. The default value is localhost.
- --port {port}: The port number of the deployed service. The default value is 8000 (a readiness-check sketch follows this list).
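Loading the full model weights can take a while, so the service is not ready the moment the start script is launched. If you script the deployment, a minimal Python sketch like the one below (purely illustrative, assuming the default localhost host and port 8000 from the parameters above) waits until the port accepts TCP connections before you start sending requests:

import socket
import time

def wait_for_server(host="localhost", port=8000, timeout=300):
    """Illustrative helper (not part of the repo's scripts): poll until the
    API server's port accepts TCP connections, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(2)  # the server is likely still loading the model weights
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")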
For the Chinese translation of "completion", Professor Li Hongyi renders it as "text completion": https://www.youtube.com/watch?v=yiY4nPOzJEg
This is the most basic API interface: it takes a prompt as input and returns the text completion generated by the large language model.
The API demo has a built-in prompt template, and the prompt is inserted into the instruction template, so the input here should read more like an instruction than a dialogue turn.
Request command:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"prompt": "Tell me where the capital of China is"
}'
JSON response body:
{
"id": "cmpl-41234d71fa034ec3ae90bbf6b5be7",
"object": "text_completion",
"created": 1690870733,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"text": "The capital of China is Beijing."
}
]
}
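The same request can also be sent from Python instead of curl. Below is a minimal sketch using the third-party requests library; it assumes the server started above is reachable at http://localhost:8000 and simply mirrors the fields of the curl example:

import requests

# Same payload as the curl example above: one instruction-style prompt.
payload = {
    "model": "chinese-llama-alpaca-2",
    "prompt": "Tell me where the capital of China is",
}

response = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
response.raise_for_status()

# The generated text is under choices[0].text, as in the JSON response shown above.
print(response.json()["choices"][0]["text"])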
Request command:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"prompt": "Tell me what are the advantages and disadvantages of China and the United States respectively",
"max_tokens": 90,
"temperature": 0.7,
"top_k": 40
}'
JSON response body:
{
"id": "cmpl-ceca9906bf0a429989e850368cc3f893",
"object": "text_completion",
"created": 1690870952,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"text": "The advantage of China is its rich culture and history, while the advantage of the United States is its advanced technology and economic system."
}
]
}
For more details about decoding strategies, refer to https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d, which elaborates on the three decoding strategies used with LLaMA: greedy decoding, random sampling, and beam search. These decoding strategies underlie the advanced parameters below, such as top_k, top_p, and temperature.
- prompt: The prompt for generating the text completion.
- max_tokens: The token length of the newly generated text.
- temperature: Sampling temperature, chosen between 0 and 2. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more deterministic. The higher the temperature, the greater the randomness of the sampling-based decoding.
- use_beam_search: Use beam search. The default is False, i.e., the random sampling strategy is enabled.
- n: The number of output sequences. The default is 1.
- best_of: When the search strategy is beam search, this parameter is the number of beams used. The default is the same as n.
- top_k: In random sampling, the top_k highest-probability tokens are kept as candidate tokens for sampling.
- top_p: In random sampling, the smallest set of highest-probability tokens whose cumulative probability exceeds top_p is kept as candidate tokens; the lower the top_p, the smaller the candidate set and the less random the output. For example, with top_p set to 0.6, if the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth and fifth tokens are filtered out and only the first three are kept as candidates (see the sketch after this list).
- presence_penalty: Presence penalty, ranging from -2 to 2, with a default of 0. A value greater than 0 encourages the model to use new tokens, while a value less than 0 encourages repetition.
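To make the top_p example above concrete, the following self-contained sketch (an illustration of the filtering rule only, not the server's actual implementation) applies the cumulative-probability cutoff to the five example probabilities from the list:

# Reproduce the top_p = 0.6 example from the parameter list above.
def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their cumulative
    probability first reaches top_p; the rest are filtered out."""
    kept, cumulative = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        cumulative += p
        if cumulative >= top_p:  # cutoff reached, stop keeping tokens
            break
    return kept

probs = [0.23, 0.20, 0.18, 0.11, 0.10]
print(top_p_filter(probs, 0.6))  # [0.23, 0.2, 0.18] -> cumulative 0.61 >= 0.6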
The chat interface supports multi-turn dialogue.
Request command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"messages": [
{"role": "user","message": "Tell me some stories about Hangzhou"}
]
}'
JSON response body:
{
"id": "cmpl-8fc1b6356cf64681a41a8739445a8cf8",
"object": "chat.completion",
"created": 1690872695,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Alright, do you have any particular preferences about Hangzhou?"
}
}
]
}
Request command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"messages": [
{"role": "user","message": "Tell me some stories about Hangzhou"},
{"role": "assistant","message": "Alright, do you have any particular preferences about Hangzhou?"},
{"role": "user","message": "I'm more interested in West Lake, can you tell me about it?"}
],
"repetition_penalty": 1.0
}'
JSON response body:
{
"id": "cmpl-02bf36497d3543c980ca2ae8cc4feb63",
"object": "chat.completion",
"created": 1690872676,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Yes, West Lake is one of the most famous attractions in Hangzhou, it's considered as a 'Paradise on Earth'."
}
}
]
}
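The multi-turn flow above can also be driven from Python. A minimal sketch with the requests library (assuming the server at http://localhost:8000 and the request format shown in the curl examples, where each turn carries a "message" field and replies come back under "content") keeps the running message list and appends each assistant reply before the next turn:

import requests

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "chinese-llama-alpaca-2"

def chat(messages):
    """Illustrative helper (not part of the repo): send the running message
    list and return the assistant's reply text."""
    response = requests.post(API_URL, json={"model": MODEL, "messages": messages}, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# First turn, mirroring the first curl example above.
messages = [{"role": "user", "message": "Tell me some stories about Hangzhou"}]
reply = chat(messages)
print("assistant:", reply)

# Second turn: append the previous reply so the model sees the full history.
messages.append({"role": "assistant", "message": reply})
messages.append({"role": "user", "message": "I'm more interested in West Lake, can you tell me about it?"})
print("assistant:", chat(messages))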
- prompt: The prompt for generating the completion.
- max_tokens: The token length of the newly generated text.
- temperature: Sampling temperature, chosen between 0 and 2. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more deterministic. The higher the temperature, the greater the randomness of the sampling-based decoding.
- use_beam_search: Use beam search. The default is False, i.e., the random sampling strategy is enabled.
- n: The number of output sequences. The default is 1.
- best_of: When the search strategy is beam search, this parameter is the number of beams used. The default is the same as n.
- top_k: In random sampling, the top_k highest-probability tokens are kept as candidate tokens for sampling.
- top_p: In random sampling, the smallest set of highest-probability tokens whose cumulative probability exceeds top_p is kept as candidate tokens; the lower the top_p, the smaller the candidate set and the less random the output. For example, with top_p set to 0.6, if the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth and fifth tokens are filtered out and only the first three are kept as candidates.
- presence_penalty: Presence penalty, ranging from -2 to 2, with a default of 0. A value greater than 0 encourages the model to use new tokens, while a value less than 0 encourages repetition (an example request using these fields follows this list).
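As a final illustration of how these parameters plug into a request, here is a hedged Python sketch of a single chat call that sets a few of the documented sampling fields (the field names come from the list above; the values are arbitrary examples, not recommendations):

import requests

payload = {
    "model": "chinese-llama-alpaca-2",
    "messages": [{"role": "user", "message": "Tell me some stories about Hangzhou"}],
    # Sampling parameters documented above; the values are only examples.
    "max_tokens": 128,
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "presence_penalty": 0,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])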