api_calls_en
This document is a mirror of `scripts/openai_server_demo/README.md`, provided by @sunyuhan19981208 in a project PR. For more detailed OpenAI API information, visit: https://platform.openai.com/docs/api-reference
This is a simple server DEMO implemented with FastAPI that mimics the style of the OpenAI API. You can use this API DEMO to quickly build personal websites and other interesting web demos based on Chinese large language models.
Install dependencies:
```bash
$ pip install fastapi uvicorn shortuuid sse_starlette
```
Start the script:
```bash
$ python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --gpus 0,1
```
Parameter Explanation:
- `--base_model {base_model}`: The directory holding the full (merged) Chinese Alpaca-2 model, or the original Llama-2 model converted to HF format (in which case you also need to provide `--lora_model`). A model name from the 🤗Model Hub can also be used.
- `--lora_model {lora_model}`: The directory containing the decompressed Chinese Alpaca-2 LoRA files, or a model name from the 🤗Model Hub. If this argument is provided, make sure the value of `--base_model` is the original Llama-2 model.
- `--tokenizer_path {tokenizer_path}`: The directory where the corresponding tokenizer is stored. If it is not provided, it defaults to the value of `--lora_model`; if `--lora_model` is also not provided, it defaults to the value of `--base_model`.
- `--only_cpu`: Use only the CPU for inference.
- `--gpus {gpu_ids}`: The GPU device IDs to use; the default is 0. For multiple GPUs, separate the IDs with commas, e.g. 0,1,2.
- `--load_in_8bit` or `--load_in_4bit`: Load the model in 8-bit or 4-bit for inference, which saves GPU memory but may affect output quality.
- `--alpha {alpha}`: The coefficient for extending the context length with the NTK method, which increases the input length the model can handle. The default is 1. If you are unsure how to set it, keep the default or set it to `"auto"`.
Regarding the Chinese translation of "completion", Professor Li Hongyi renders it as "text completion": https://www.youtube.com/watch?v=yiY4nPOzJEg
This is the most basic API interface where you input a prompt and get a text completion result from the large language model.
The API DEMO has built-in prompt templates. The prompt will be inserted into the instruction template, so the input should read like an instruction rather than a conversation.
Request command:
```bash
curl http://localhost:19327/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Tell me where the capital of China is"
  }'
```
JSON response body:
```json
{
  "id": "cmpl-3watqWsbmYgbWXupsSik7s",
  "object": "text_completion",
  "created": 1686067311,
  "model": "chinese-llama-alpaca-2",
  "choices": [
    {
      "index": 0,
      "text": "The capital of China is Beijing."
    }
  ]
}
```
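If you prefer to call the endpoint from Python rather than curl, the following is a minimal client sketch. It assumes the demo server above is running locally on port 19327 and that the third-party `requests` package is installed; it is not part of the repository.

```python
import requests

# Minimal completions call against the local demo server (assumed at port 19327).
resp = requests.post(
    "http://localhost:19327/v1/completions",
    json={"prompt": "Tell me where the capital of China is"},
)
resp.raise_for_status()
# The generated text sits in choices[0].text, mirroring the response shown above.
print(resp.json()["choices"][0]["text"])
```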
Request command:
```bash
curl http://localhost:19327/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Tell me what are the pros and cons of China and the United States respectively",
    "max_tokens": 90,
    "temperature": 0.7,
    "num_beams": 4,
    "top_k": 40
  }'
```
JSON response body:
```json
{
  "id": "cmpl-PvVwfMq2MVWHCBKiyYJfKM",
  "object": "text_completion",
  "created": 1686149471,
  "model": "chinese-llama-alpaca-2",
  "choices": [
    {
      "index": 0,
      "text": "The advantages of China are its rich culture and history, while the advantages of the United States are its advanced technology and economic system."
    }
  ]
}
```
For more detailed information about decoding strategies, refer to https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d. The article explains in detail the three decoding strategies used by LLaMA: greedy decoding, random sampling, and beam search, which underlie advanced parameters such as top_k, top_p, temperature, and num_beams.
- `prompt`: The prompt used to generate the text completion.
- `max_tokens`: The maximum number of new tokens to generate.
- `temperature`: The sampling temperature, chosen between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more deterministic. The higher the temperature, the more likely random sampling is used for decoding (see the request sketch after this list for how these parameters combine).
- `num_beams`: When the search strategy is beam search, this is the number of beams used; with num_beams=1, beam search reduces to greedy decoding.
- `top_k`: In random sampling, only the top_k highest-probability tokens are kept as candidate tokens.
- `top_p`: In random sampling, candidate tokens are restricted to the smallest set of highest-probability tokens whose cumulative probability exceeds top_p; the lower the value, the less random the output. For example, when top_p is set to 0.6 and the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth token is filtered out and only the first three are sampled as candidates.
- `repetition_penalty`: Repetition penalty. For more details, see https://arxiv.org/pdf/1909.05858.pdf.
- `do_sample`: Enable the random sampling strategy. The default is true.
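To illustrate how these decoding parameters combine in practice, here is a sketch that sends two requests: a near-deterministic one using beam search with a low temperature, and a more exploratory one using random sampling constrained by top_k and top_p. The parameter values are illustrative, not recommendations, and the sketch again assumes the local demo server and the `requests` package.

```python
import requests

URL = "http://localhost:19327/v1/completions"  # local demo server from above

# Near-deterministic decoding: beam search with a low temperature.
deterministic = {
    "prompt": "Tell me where the capital of China is",
    "max_tokens": 64,
    "temperature": 0.2,
    "num_beams": 4,
}

# More exploratory decoding: random sampling constrained by top_k / top_p.
exploratory = {
    "prompt": "Tell me some stories about Hangzhou",
    "max_tokens": 128,
    "temperature": 0.9,
    "do_sample": True,
    "top_k": 40,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}

for payload in (deterministic, exploratory):
    text = requests.post(URL, json=payload).json()["choices"][0]["text"]
    print(text)
```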
The chat interface supports multi-turn dialogues.
Request command:
```bash
curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "Tell me some stories about Hangzhou"}
    ],
    "repetition_penalty": 1.0
  }'
```
JSON response body:
```json
{
  "id": "chatcmpl-5L99pYoW2ov5ra44Ghwupt",
  "object": "chat.completion",
  "created": 1686143170,
  "model": "chinese-llama-alpaca-2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "user",
        "content": "Tell me some stories about Hangzhou"
      }
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "Sure, do you have any particular preferences about Hangzhou?"
      }
    }
  ]
}
```
Request command:
```bash
curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "Tell me some stories about Hangzhou"},
      {"role": "assistant","content": "Sure, do you have any particular preferences about Hangzhou?"},
      {"role": "user","content": "I particularly like the West Lake, could you tell me about it?"}
    ],
    "repetition_penalty": 1.0
  }'
```
JSON response body:
```json
{
  "id": "chatcmpl-hmvrQNPGYTcLtmYruPJbv6",
  "object": "chat.completion",
  "created": 1686143439,
  "model": "chinese-llama-alpaca-2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "user",
        "content": "Tell me some stories about Hangzhou"
      }
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "Sure, do you have any particular preferences about Hangzhou?"
      }
    },
    {
      "index": 2,
      "message": {
        "role": "user",
        "content": "I particularly like the West Lake, could you tell me about it?"
      }
    },
    {
      "index": 3,
      "message": {
        "role": "assistant",
        "content": "Yes, West Lake is one of the most famous attractions in Hangzhou, it is known as 'Paradise on Earth'. <\\s>"
      }
    }
  ]
}
```
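To drive a multi-turn dialogue programmatically, keep a local `messages` list and append each assistant reply before sending the next user turn. The sketch below assumes the response format shown above, where the newest assistant message is the last entry in `choices`, and again uses the `requests` package against the local demo server.

```python
import requests

URL = "http://localhost:19327/v1/chat/completions"  # local demo server from above

def chat(messages):
    resp = requests.post(URL, json={"messages": messages, "repetition_penalty": 1.0})
    resp.raise_for_status()
    # In the response format shown above, the newest assistant message
    # is the last entry of "choices".
    return resp.json()["choices"][-1]["message"]

messages = [{"role": "user", "content": "Tell me some stories about Hangzhou"}]
messages.append(chat(messages))  # keep the assistant turn so the next request has full context
messages.append({"role": "user",
                 "content": "I particularly like the West Lake, could you tell me about it?"})
print(chat(messages)["content"])
```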
- `prompt`: The prompt used to generate the text completion.
- `max_tokens`: The maximum number of new tokens to generate.
- `temperature`: The sampling temperature, chosen between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more deterministic. The higher the temperature, the more likely random sampling is used for decoding.
- `num_beams`: When the search strategy is beam search, this is the number of beams used; with num_beams=1, beam search reduces to greedy decoding.
- `top_k`: In random sampling, only the top_k highest-probability tokens are kept as candidate tokens.
- `top_p`: In random sampling, candidate tokens are restricted to the smallest set of highest-probability tokens whose cumulative probability exceeds top_p; the lower the value, the less random the output. For example, when top_p is set to 0.6 and the probabilities of the top 5 tokens are [0.23, 0.20, 0.18, 0.11, 0.10], the cumulative probability of the first three tokens is 0.61, so the fourth token is filtered out and only the first three are sampled as candidates.
- `repetition_penalty`: Repetition penalty. For more details, see https://arxiv.org/pdf/1909.05858.pdf.
- `do_sample`: Enable the random sampling strategy. The default is true.
- `stream`: When set to true, data is streamed back as Server-Sent Events in the OpenAI style; the default is false (a streaming client sketch follows this list).
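The streaming sketch below shows one way to consume the stream from Python, assuming the server emits OpenAI-style Server-Sent Events (lines prefixed with `data: `, optionally terminated by `data: [DONE]`). The exact chunk layout may differ, so the sketch simply prints each decoded chunk.

```python
import json
import requests

URL = "http://localhost:19327/v1/chat/completions"  # local demo server from above

payload = {
    "messages": [{"role": "user", "content": "Tell me some stories about Hangzhou"}],
    "stream": True,
}

with requests.post(URL, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data: "):   # skip comments / keep-alive lines
            continue
        chunk = line[len("data: "):]
        if chunk.strip() == "[DONE]":        # some servers signal the end this way
            break
        # Print the raw chunk; its exact structure depends on the server.
        print(json.loads(chunk))
```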
Text embeddings have many uses, including but not limited to question answering over large documents, summarizing the content of a book, and finding the stored memory most similar to the current user input for a large language model.
Request command:
```bash
curl http://localhost:19327/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The weather is really nice today"
  }'
```
JSON response body:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.003643923671916127,
        -0.0072653163224458694,
        0.0075545101426541805,
        ....,
        0.0045851171016693115
      ],
      "index": 0
    }
  ],
  "model": "chinese-llama-alpaca-2"
}
```
The length of the embedding vector is the same as the hidden size of the model used. For example, when using the 7B model, the length of the embedding is 4096.
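As one example of the similarity-search use case mentioned above, the sketch below embeds a query and a couple of placeholder candidate texts, then ranks the candidates by cosine similarity. It assumes the local demo server and the `requests` package; the texts are purely illustrative.

```python
import math
import requests

URL = "http://localhost:19327/v1/embeddings"  # local demo server from above

def embed(text):
    resp = requests.post(URL, json={"input": text})
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = embed("The weather is really nice today")
candidates = ["It is sunny and warm outside", "The stock market fell sharply"]
# Rank candidates by how close their embeddings are to the query embedding.
ranked = sorted(candidates, key=lambda t: cosine(query, embed(t)), reverse=True)
print(ranked[0])
```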