api_calls_vllm_en
For more detailed information about the OPENAI API, see: https://platform.openai.com/docs/api-reference
This is a simple server demo implemented with FastAPI that mimics the style of the OPENAI API. You can use this API demo to quickly build personal websites backed by Chinese large language models, as well as other interesting web demos.
This implementation uses vLLM to deploy the LLM backend service. It currently does not support loading LoRA models, CPU-only deployment, or 8-bit inference.
Install dependencies
pip install fastapi uvicorn shortuuid vllm fschat
Start script
python scripts/openai_server_demo/openai_api_server_vllm.py --model /path/to/base_model --tokenizer-mode slow --served-model-name chinese-llama-alpaca-2
Parameter explanation
- --model {base_model}: Directory that holds the full (merged) weights and configuration files of the Chinese-Alpaca-2 model.
- --tokenizer {tokenizer_path}: Directory that holds the corresponding tokenizer. If this parameter is not provided, its default value is the same as --model.
- --tokenizer-mode {tokenizer-mode}: The mode of the tokenizer. When using models based on LLaMA/Llama-2, set this to slow.
- --tensor-parallel-size {tensor_parallel_size}: The number of GPUs used. The default is 1.
- --served-model-name {served-model-name}: The model name used in the API. If using the Chinese Alpaca-2 series models, the model name must include chinese-llama-alpaca-2.
- --host {host_name}: The host name of the deployed service. The default value is localhost.
- --port {port}: The port number of the deployed service. The default value is 8000 (a readiness-check sketch follows this list).
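Loading the full model weights can take a while, so the service is not ready the moment the start script is launched. If you script the deployment, a minimal Python sketch like the one below (purely illustrative, assuming the default localhost host and port 8000 from the parameters above) waits until the port accepts TCP connections before you start sending requests:

import socket
import time

def wait_for_server(host="localhost", port=8000, timeout=300):
    """Illustrative helper (not part of the repo's scripts): poll until the
    API server's port accepts TCP connections, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(2)  # the server is likely still loading the model weights
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")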
For the Chinese translation of "completion", Professor Li Hongyi renders it as "text completion": https://www.youtube.com/watch?v=yiY4nPOzJEg
This is the most basic API interface: it takes a prompt as input and returns the text completion generated by the large language model.
The API demo has a built-in prompt template, and the prompt is inserted into the instruction template, so the input here should read more like an instruction than a dialogue turn.
Request command:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"prompt": "Tell me where the capital of China is"
}'
JSON response body:
{
"id": "cmpl-41234d71fa034ec3ae90bbf6b5be7",
"object": "text_completion",
"created": 1690870733,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"text": "The capital of China is Beijing."
}
]
}
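The same request can also be sent from Python instead of curl. Below is a minimal sketch using the third-party requests library; it assumes the server started above is reachable at http://localhost:8000 and simply mirrors the fields of the curl example:

import requests

# Same payload as the curl example above: one instruction-style prompt.
payload = {
    "model": "chinese-llama-alpaca-2",
    "prompt": "Tell me where the capital of China is",
}

response = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
response.raise_for_status()

# The generated text is under choices[0].text, as in the JSON response shown above.
print(response.json()["choices"][0]["text"])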
Request command:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"prompt": "Tell me what are the advantages and disadvantages of China and the United States respectively",
"max_tokens": 90,
"temperature": 0.7,
"top_k": 40
}'
JSON response body:
{
"id": "cmpl-ceca9906bf0a429989e850368cc3f893",
"object": "text_completion",
"created": 1690870952,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"text": "The advantage of China is its rich culture and history, while the advantage of the United States is its advanced technology and economic system."
}
]
}
For more details about decoding strategies, refer to https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d, which elaborates on the three decoding strategies used with LLaMA: greedy decoding, random sampling, and beam search. These decoding strategies underlie the advanced parameters below, such as top_k, top_p, and temperature.
- prompt: The prompt for generating the text completion.
- max_tokens: The token length of the newly generated text.
- temperature: Sampling temperature, chosen between 0 and 2. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more deterministic. The higher the temperature, the greater the randomness of the sampling-based decoding.
- use_beam_search: Use beam search. The default is False, i.e., the random sampling strategy is enabled.
- n: The number of output sequences. The default is 1.
- best_of: When the search strategy is beam search, this parameter is the number of beams used. The default is the same as n.
- top_k: In random sampling, the top_k highest-probability tokens are kept as candidate tokens for sampling.
- top_p: In random sampling, the smallest set of highest-probability tokens whose cumulative probability exceeds top_p is kept as candidate tokens; the lower the top_p, the smaller the candidate set and the less random the output. For example, with top_p set to 0.6, if the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth and fifth tokens are filtered out and only the first three are kept as candidates (see the sketch after this list).
- presence_penalty: Presence penalty, ranging from -2 to 2, with a default of 0. A value greater than 0 encourages the model to use new tokens, while a value less than 0 encourages repetition.
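To make the top_p example above concrete, the following self-contained sketch (an illustration of the filtering rule only, not the server's actual implementation) applies the cumulative-probability cutoff to the five example probabilities from the list:

# Reproduce the top_p = 0.6 example from the parameter list above.
def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their cumulative
    probability first reaches top_p; the rest are filtered out."""
    kept, cumulative = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        cumulative += p
        if cumulative >= top_p:  # cutoff reached, stop keeping tokens
            break
    return kept

probs = [0.23, 0.20, 0.18, 0.11, 0.10]
print(top_p_filter(probs, 0.6))  # [0.23, 0.2, 0.18] -> cumulative 0.61 >= 0.6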
The chat interface supports multi-turn dialogue.
Request command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"messages": [
{"role": "user","message": "Tell me some stories about Hangzhou"}
]
}'
JSON response body:
{
"id": "cmpl-8fc1b6356cf64681a41a8739445a8cf8",
"object": "chat.completion",
"created": 1690872695,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Alright, do you have any particular preferences about Hangzhou?"
}
}
]
}
Request command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "chinese-llama-alpaca-2",
"messages": [
{"role": "user","message": "Tell me some stories about Hangzhou"},
{"role": "assistant","message": "Alright, do you have any particular preferences about Hangzhou?"},
{"role": "user","message": "I'm more interested in West Lake, can you tell me about it?"}
],
"repetition_penalty": 1.0
}'
JSON response body:
{
"id": "cmpl-02bf36497d3543c980ca2ae8cc4feb63",
"object": "chat.completion",
"created": 1690872676,
"model": "chinese-llama-alpaca-2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Yes, West Lake is one of the most famous attractions in Hangzhou, it's considered as a 'Paradise on Earth'."
}
}
]
}
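The multi-turn flow above can also be driven from Python. A minimal sketch with the requests library (assuming the server at http://localhost:8000 and the request format shown in the curl examples, where each turn carries a "message" field and replies come back under "content") keeps the running message list and appends each assistant reply before the next turn:

import requests

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "chinese-llama-alpaca-2"

def chat(messages):
    """Illustrative helper (not part of the repo): send the running message
    list and return the assistant's reply text."""
    response = requests.post(API_URL, json={"model": MODEL, "messages": messages}, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# First turn, mirroring the first curl example above.
messages = [{"role": "user", "message": "Tell me some stories about Hangzhou"}]
reply = chat(messages)
print("assistant:", reply)

# Second turn: append the previous reply so the model sees the full history.
messages.append({"role": "assistant", "message": reply})
messages.append({"role": "user", "message": "I'm more interested in West Lake, can you tell me about it?"})
print("assistant:", chat(messages))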
- prompt: The prompt for generating the completion.
- max_tokens: The token length of the newly generated text.
- temperature: Sampling temperature, chosen between 0 and 2. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more deterministic. The higher the temperature, the greater the randomness of the sampling-based decoding.
- use_beam_search: Use beam search. The default is False, i.e., the random sampling strategy is enabled.
- n: The number of output sequences. The default is 1.
- best_of: When the search strategy is beam search, this parameter is the number of beams used. The default is the same as n.
- top_k: In random sampling, the top_k highest-probability tokens are kept as candidate tokens for sampling.
- top_p: In random sampling, the smallest set of highest-probability tokens whose cumulative probability exceeds top_p is kept as candidate tokens; the lower the top_p, the smaller the candidate set and the less random the output. For example, with top_p set to 0.6, if the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth and fifth tokens are filtered out and only the first three are kept as candidates.
- presence_penalty: Presence penalty, ranging from -2 to 2, with a default of 0. A value greater than 0 encourages the model to use new tokens, while a value less than 0 encourages repetition (an example request using these fields follows this list).
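As a final illustration of how these parameters plug into a request, here is a hedged Python sketch of a single chat call that sets a few of the documented sampling fields (the field names come from the list above; the values are arbitrary examples, not recommendations):

import requests

payload = {
    "model": "chinese-llama-alpaca-2",
    "messages": [{"role": "user", "message": "Tell me some stories about Hangzhou"}],
    # Sampling parameters documented above; the values are only examples.
    "max_tokens": 128,
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "presence_penalty": 0,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])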