OpenAI API Wrapper #735
Comments
Hello @BloodSucker99, I am not sure that's possible on the server side since models have different prompt formats. |
Yeah, I kind of suspected that doing it on the server side would not be possible :( Any chance you/anyone would be interested in building a middleman for this? A Python wrapper that just sits in the middle would be cool |
I had some time and started working on something. I will share the first version here. I would love to get feedback if you are willing to try it out. |
I'd love to try it out. Also, is it possible to communicate over Discord? It would make things much easier :) |
Okay, I rushed out the first version. It is in a package I started called easyllm. GitHub: https://github.com/philschmid/easyllm The documentation also includes examples for streaming.

Example

Install EasyLLM via pip:

pip install easyllm

Then import and start using the clients:

from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt
# helper to build llama2 prompt
huggingface.prompt_builder = build_llama2_prompt
response = huggingface.ChatCompletion.create(
model="meta-llama/Llama-2-70b-chat-hf",
messages=[
{"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
{"role": "user", "content": "What is the sun?"},
],
temperature=0.9,
top_p=0.6,
max_tokens=256,
)
print(response) |
This is interesting. Could you give me an example of connecting this to the TGI API? There is a model space, but that would be loading the model again, right? So instead, if I am using TGI, which already has the model loaded, how would I use its API here and get an OpenAI API out? |
No, it's a client. How would you add a wrapper when you don't know the prompt format on the server side? |
Hi! I am a bit confused about what you mean by "you don't know the prompt format on the server side". So there is a wrapper made by langchain: I was kind of thinking this way but for OpenAI, if that makes sense. |
I think what @philschmid meant is: how are you supposed to send a final, fully formed input sequence? So the missing step is going from the OpenAI-style list of messages to the model-specific prompt string, which is needed for good results with https://huggingface.co/meta-llama/Llama-2-7b-chat-hf for instance (don't quote me on the prompt, I did it from memory) |
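To make that concrete, here is a rough sketch of that serialization step. The Llama-2 template below is reproduced from memory, as in the comment above, so check the model card for the exact format; the helper name is purely illustrative.

```python
# Illustrative only: flatten OpenAI-style chat messages into a single
# Llama-2 chat prompt string (template from memory, not authoritative).
def messages_to_llama2_prompt(messages):
    prompt = ""
    system = ""
    for message in messages:
        if message["role"] == "system":
            system = f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n"
        elif message["role"] == "user":
            # The system prompt is folded into the first user turn.
            prompt += f"<s>[INST] {system}{message['content']} [/INST]"
            system = ""
        elif message["role"] == "assistant":
            prompt += f" {message['content']} </s>"
    return prompt

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sun?"},
]
print(messages_to_llama2_prompt(messages))
```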
I agree with @Narsil's point. Some people or projects don't use OpenAI-style prompts. Eventually, all messages will be merged into a single string for input to the LLM, limiting flexibility. One possible solution is to create an API template on the server side, allowing users to define their preferred API. However, implementing this approach might require a substantial amount of work and could potentially introduce bugs. I have a question: why is the TGI API slightly different from the TGI client SDK? For instance, the parameter 'detail' is ignored in the TGI client source code. Shouldn't they be exactly the same? |
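As an illustration of that idea (not an actual TGI feature), such a server-side template could be as simple as a user-supplied Jinja template that the server renders over the incoming messages:

```python
# Sketch of a user-defined, server-side prompt template (simplified, not the
# exact Llama-2 format; just illustrating the "API template" idea above).
from jinja2 import Template

# The server operator would supply this alongside the model they deploy.
llama2_template = Template(
    "{% for m in messages %}"
    "{% if m.role == 'system' %}<<SYS>>\n{{ m.content }}\n<</SYS>>\n\n"
    "{% elif m.role == 'user' %}<s>[INST] {{ m.content }} [/INST]"
    "{% elif m.role == 'assistant' %} {{ m.content }} </s>{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sun?"},
]
print(llama2_template.render(messages=messages))
```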
I'm not sure what you are referring to. The Python client could be slightly out of sync with the server, but that's not intentional. |
That's definitely an option, which I would group with guidance and token healing if we were to do it; they seem to serve the same purpose: extending the querying API in a user-defined way (for both the server operator and the actual querying user). |
I've implemented a small wrapper around the chat completions for llama2. |
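For anyone wondering what such a wrapper might look like, below is a minimal sketch (not the actual wrapper referenced above): a FastAPI app that accepts OpenAI-style chat-completion requests, flattens the messages with a Llama-2-style prompt builder, and forwards them to a TGI server assumed to be running at http://localhost:8080.

```python
# Minimal sketch of an OpenAI-style chat-completions shim in front of TGI.
# The endpoint URL and prompt template here are assumptions for illustration.
import time
import uuid

import requests
from fastapi import FastAPI
from pydantic import BaseModel

TGI_URL = "http://localhost:8080"  # assumed local TGI endpoint
app = FastAPI()


class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 256
    temperature: float = 0.7


def to_llama2_prompt(messages):
    # Same serialization idea as the sketch earlier in the thread.
    prompt = ""
    system = ""
    for m in messages:
        if m["role"] == "system":
            system = f"<<SYS>>\n{m['content']}\n<</SYS>>\n\n"
        elif m["role"] == "user":
            prompt += f"<s>[INST] {system}{m['content']} [/INST]"
            system = ""
        elif m["role"] == "assistant":
            prompt += f" {m['content']} </s>"
    return prompt


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # TGI's native /generate endpoint takes a single flattened prompt string.
    resp = requests.post(
        f"{TGI_URL}/generate",
        json={
            "inputs": to_llama2_prompt(req.messages),
            "parameters": {
                "max_new_tokens": req.max_tokens,
                "temperature": req.temperature,
            },
        },
        timeout=120,
    )
    text = resp.json()["generated_text"]
    # Return a response shaped like OpenAI's chat completion object.
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```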
Here is what I'm referring to. The request parameters are slightly different from the ones in the API. It's OK, but why is 'detail' manually set to True here? |
In case this is helpful, llama.cpp does this via api_like_OAI.py. This PR would update that script to use fastchat's conversation.py to handle the serialization problem discussed upthread. |
And here is fastchat's own version of this: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py |
I would love to have this supported. |
LiteLLM has support for TGI: https://docs.litellm.ai/docs/providers/huggingface#text-generation-interface-tgi---llms |
Thanks for mentioning us @abhinavkulkarni. Hey @Narsil @jcushman @zfang, this is the basic code:
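Going by the LiteLLM docs linked above, the basic call looks roughly like this; a sketch, with the model ID and api_base below as placeholders rather than a real deployment:

```python
# Rough sketch of a LiteLLM call routed to a TGI endpoint.
# The "huggingface/" prefix selects the Hugging Face/TGI provider;
# model ID and api_base are placeholder assumptions.
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "What is the sun?"}],
    api_base="http://localhost:8080",  # your TGI server
)
print(response)
```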
We also handle prompt formatting - https://docs.litellm.ai/docs/providers/huggingface#models-with-prompt-formatting based on the lmsys/fastchat implementation. But you can override this with your own changes if necessary - https://docs.litellm.ai/docs/providers/huggingface#custom-prompt-templates |
Hi @krrishdholakia, Thanks for the info. Instead of a client, I actually need a middle service, because I'm trying to host an API server for the chatbot arena https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model I can use vLLM to host a service and provide an OpenAI-compatible API, but it's quite a bit slower than TGI. It pains me that TGI doesn't support this. I will probably need to hack a FastChat service to redirect calls to TGI. Regards, Felix |
@zfang we have an open-source proxy you can fork and run this through - https://github.com/BerriAI/liteLLM-proxy would it be helpful if we exposed a cli command to deploy this through?
|
LiteLLM has developed an OpenAI wrapper for TGI (and for lots of other model-serving frameworks). Here are more details: https://docs.litellm.ai/docs/proxy_server

You can set it up as follows. Set up a local TGI endpoint first:

$ text-generation-launcher \
    --model-id abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
    --trust-remote-code --port 8080 \
    --max-input-length 5376 --max-total-tokens 6144 --max-batch-prefill-tokens 6144 \
    --quantize awq

I have a LiteLLM proxy server on top of that:

$ litellm \
    --model huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
    --api_base http://localhost:8080

I am able to successfully obtain responses from the openai.ChatCompletion.create endpoint as follows:

>>> import openai
>>> openai.api_key = "xyz"
>>> openai.api_base = "http://0.0.0.0:8000"
>>> model = "huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq"
>>> completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": "How are you?"}])
>>> print(completion)
{
"object": "chat.completion",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": "I'm fine, thanks. I'm glad to hear that.\n\nI'm",
"role": "assistant",
"logprobs": -18.19830319
}
}
],
"id": "chatcmpl-7f8f5312-893a-4dab-aff5-3a97a354c2be",
"created": 1695869575.316254,
"model": "abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
"usage": {
"prompt_tokens": 4,
"completion_tokens": 15,
"total_tokens": 19
}
} |
@zfang @paulcx I implemented this feature in the Apache-2.0-licensed forked project, directly in Rust. https://github.com/Preemo-Inc/text-generation-inference |
Hey @michaelfeil - is TGI closed-source now? I can't find other info on this |
We added a restriction in 1.0 which means you cannot use it as a cloud provider as-is without getting a license from us. More details here: |
Can I use this to wrap the official inference API as published by hf? |
@abhinavkulkarni hi, How to run a |
@yanmengxiang1: Please install |
@abhinavkulkarni yes, I know it. Should I use something |
Hey @yanmengxiang1: Run TGI at port
You now have an OpenAI-compatible API endpoint at port |
@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server |
It seems this feature is going to be deprecated? So how future-proof is it to build an application around it? |
Hey @LarsHill, the LiteLLM community is discussing the best approach right now - BerriAI/litellm#648 (comment) That gives some context, but it might not be ideal. So we're trying to understand what works best (how do you provide a consistent experience + an easy way to set up configs, etc.). DM'ing you to understand what a good experience here looks like. |
The text-generation-webui project has made huge progress on supporting other providers by including extensions. You can serve an OpenAI-compatible API using these commands:

# clone the repo, then cd into it
# install deps:
!pip install -q -r requirements.txt --upgrade
# install extensions (openai...)
!pip install -q -r extensions/openai/requirements.txt --upgrade
# download your model (this way allows you to download large models):
!python download-model.py https://huggingface.co/TheBloke/SauerkrautLM-UNA-SOLAR-Instruct-GPTQ
# this one works better for MemGPT
# serve your model (check the name of the downloaded file/directory):
!python server.py --model TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ --n-gpu-layers 24 --n_ctx 2048 --api --nowebui --extensions openai
# or download a specific file (if using GGUF models):
!python download-model.py https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF --specific-file dolphin-2.7-mixtral-8x7b.Q2_K.gguf

Your server should be up and running on port 5000:

!curl http://0.0.0.0:5000/v1/completions -H "Content-Type: application/json" -d '{ "prompt": "This is a cake recipe:\n\n1.","max_tokens": 200, "temperature": 1, "top_p": 0.9, "seed": 10 }'

This way allows you to run any model (even those that aren't available as Ollama docker images), without hitting Hugging Face's API, including large models (>= 10 GB) and models that don't have an Inference API. No |
Increasingly, LLM software is standardizing around the use of OpenAI-esque compatible endpoints. Some examples:

* [OpenLLM](https://github.com/bentoml/OpenLLM) (commonly used to self-host/deploy various LLMs in enterprises)
* [Huggingface TGI](huggingface/text-generation-inference#735) (and, by extension, [AWS SageMaker](https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/))
* [Ollama](https://github.com/ollama/ollama) (commonly used for running LLMs locally, useful for local testing)

All of these projects either have OpenAI-compatible API endpoints already, or are actively building out support for it.

On strat we are regularly working with enterprise customers that self-host their own specific-model LLM via one of these methods, and wish for Cody to consume an OpenAI endpoint (understanding some specific model is on the other side and that Cody should optimize for / target that specific model.)

Since Cody needs to tailor to a specific model (prompt generation, stop sequences, context limits, timeouts, etc.) and handle other provider-specific nuances, it is insufficient to simply expect that a customer-provided OpenAI compatible endpoint is in fact 1:1 compatible with e.g. GPT-3.5 or GPT-4. We need to be able to configure/tune many of these aspects to the specific provider/model, even though it presents as an OpenAI endpoint.

In response to these needs, I am working on adding an 'OpenAI-compatible' provider proper: the ability for a Sourcegraph enterprise instance to advertise that although it is connected to an OpenAI compatible endpoint, there is in fact a specific model on the other side (starting with Starchat and Starcoder) and that Cody should target that configuration.

The _first step_ of this work is this change. After this change, an existing (current-version) Sourcegraph enterprise instance can configure an OpenAI endpoint for completions via the site config such as:

```
"cody.enabled": true,
"completions": {
  "provider": "openai",
  "accessToken": "asdf",
  "endpoint": "http://openllm.foobar.com:3000",
  "completionModel": "gpt-4",
  "chatModel": "gpt-4",
  "fastChatModel": "gpt-4",
},
```

The `gpt-4` model parameters will be sent to the OpenAI-compatible endpoint specified, but will otherwise be unused today.

Users may then specify in their VS Code configuration that Cody should treat the LLM on the other side as if it were e.g. Starchat:

```
"cody.autocomplete.advanced.provider": "experimental-openaicompatible",
"cody.autocomplete.advanced.model": "starchat-16b-beta",
"cody.autocomplete.advanced.timeout.multiline": 10000,
"cody.autocomplete.advanced.timeout.singleline": 10000,
```

In the future, we will make it possible to configure the above options via the Sourcegraph site configuration instead of each user needing to configure it in their VS Code settings explicitly.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This functionality is now supported in TGI with the introduction of the Messages API and can be used like:

from openai import OpenAI
# init the client but point it to TGI
client = OpenAI(
base_url="http://localhost:3000/v1",
api_key="-"
)
chat_completion = client.chat.completions.create(
model="tgi",
messages=[
{"role": "system", "content": "You are a helpful assistant." },
{"role": "user", "content": "What is deep learning?"}
],
stream=False
)
print(chat_completion)

Please see the docs here for more details: https://huggingface.co/docs/text-generation-inference/messages_api |
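For streaming, the same client can be used with `stream=True`; a quick sketch, assuming the same local TGI endpoint as above:

```python
# Streaming variant against the same assumed local TGI Messages API endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")

stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; the final chunk's content may be None.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
```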
Feature request
Hi,
I was wondering if it would be possible to have an OpenAI-based API.
Motivation
Many projects have been built around the OpenAI API, something similar to what vLLM and a few other inference servers have. If TGI can have this, we can just swap the base URL for any project such as aider and many more, and use them without any hassle of changing the code.
https://github.com/paul-gauthier/aider
https://github.com/AntonOsika/gpt-engineer
https://github.com/Significant-Gravitas/Auto-GPT
And many more.
For reference, vLLM has a wrapper and text-generation-webui has one too.
Your contribution
discuss.