
OpenAI API Wrapper #735

Closed
Tracked by #390
Ichigo3766 opened this issue Jul 28, 2023 · 35 comments

Comments

@Ichigo3766

Feature request

Hi,

I was wondering if it would be possible to have an OpenAI-compatible API.

Motivation

Many projects have been built around the OpenAI API, similar to what vLLM and a few other inference servers provide. If TGI offered this, we could just swap the base URL in projects such as aider and many more and use them without the hassle of changing any code.

https://github.com/paul-gauthier/aider
https://github.com/AntonOsika/gpt-engineer
https://github.com/Significant-Gravitas/Auto-GPT

And many more.

For reference, vLLM has such a wrapper, and text-generation-webui has one too.

Your contribution

discuss.

@philschmid
Member

Hello @BloodSucker99, I am not sure that's possible on the server side, since models use different prompt formats.
So it might make sense to implement this on the client side, converting the OpenAI schema (a list of dicts) into a single prompt.
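
For illustration, a minimal sketch of what such a client-side shim could look like (not a finished implementation): it takes a caller-supplied prompt builder, flattens the messages, and posts to TGI's /generate route. The payload shape follows TGI's REST API, while build_prompt and the base URL are placeholders.

import requests

def chat_completion(messages, build_prompt, base_url="http://localhost:8080", **params):
    # Flatten OpenAI-style messages into a single prompt using a model-specific template
    prompt = build_prompt(messages)
    # TGI's /generate route expects {"inputs": ..., "parameters": {...}}
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, **params},
    }
    r = requests.post(f"{base_url}/generate", json=payload, timeout=60)
    r.raise_for_status()
    return r.json()["generated_text"]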

@Ichigo3766
Author

Yeah, I kind of suspected that doing it on the server side would not be possible :(

Any chance you/anyone would be interested in building a middleman for this? A Python wrapper that just sits in the middle would be cool.

@philschmid
Member

I had some time and started working on something. I will share the first version here. I would love to get feedback if you are willing to try it out.

@Ichigo3766
Author

I'd love to try it out. Would it also be possible to communicate over Discord? That would make things much easier :)

@philschmid
Member

Okay, I rushed out the first version. It is in a package I started called easyllm.

Github: https://github.com/philschmid/easyllm
Documentation: https://philschmid.github.io/easyllm/

The documentation also includes examples for streaming.

Example

Install EasyLLM via pip:

pip install easyllm

Then import and start using the clients:

from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt

# helper to build llama2 prompt
huggingface.prompt_builder = build_llama2_prompt

response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
        {"role": "user", "content": "What is the sun?"},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=256,
)

print(response)

@Ichigo3766
Author

Ichigo3766 commented Jul 29, 2023

This is interesting. Could you give me an example of connecting this to the TGI API? There is a model space, but that would be loading the model again, right? So instead, if I am using TGI, which already has the model loaded, how would I use its API here and get an OpenAI API out?

@philschmid
Member

No, it's a client. How would you add a wrapper when you don't know the prompt format on the server side?
It might be possible to write a different server.rs which implements common templating and lets you define what you want when starting it, but that's a lot of work.

@Ichigo3766
Author

Hi! I am a bit confused about what you mean by "you don't know the prompt format on the server side". There is a wrapper made by langchain:
https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/huggingface_text_gen_inference.py

I was kind of thinking of something like this, but for the OpenAI API, if that makes sense.

@Narsil
Collaborator

Narsil commented Jul 31, 2023

"you don't know the prompt format on the server side"

I think what @philschmid meant is: how are you supposed to build the final, fully formed token sequence?
TGI doesn't know how models were trained/fine-tuned, so it doesn't know what a system prompt or user_prompt is. It expects a single full string, which is what the langchain wrapper sends.

So the missing step is going from

messages=[
    {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
    {"role": "user", "content": "What is the sun?"},
],

To:

[[SYS]\nYou are a helpful assistant speaking like a pirate. argh[/SYS] What is the sun <s>

Which is needed for good results with https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, for instance (don't quote me on the prompt, I wrote it from memory).
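
For concreteness, a rough sketch of that serialization step for the Llama 2 chat format (based on the published template; double-check the special tokens against the model card before relying on it). It could be passed as the prompt builder to a client-side shim like the one sketched earlier in this thread:

def build_llama2_prompt(messages):
    # Serialize OpenAI-style messages into a single Llama 2 chat prompt string
    system = ""
    prompt = ""
    for msg in messages:
        if msg["role"] == "system":
            # the system prompt is folded into the first user turn
            system = f"<<SYS>>\n{msg['content']}\n<</SYS>>\n\n"
        elif msg["role"] == "user":
            prompt += f"<s>[INST] {system}{msg['content']} [/INST]"
            system = ""
        elif msg["role"] == "assistant":
            prompt += f" {msg['content']} </s>"
    return prompt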

@paulcx

paulcx commented Aug 3, 2023

I agree with @Narsil's point. Some people or projects don't use OpenAI-style prompts. Eventually, all messages will be merged into a single string for input to the LLM, limiting flexibility. One possible solution is to create an API template on the server side, allowing users to define their preferred API. However, implementing this approach might require a substantial amount of work and could potentially introduce bugs.

I have a question: Why is the TGI API slightly different from the TGI client SDK? For instance, the parameter 'detail' is ignored in the TGI client source code. Shouldn't they be exactly the same?

@Narsil
Collaborator

Narsil commented Aug 3, 2023

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. The Python client and the server could be slightly out of sync in places, but that's not intentional.

@Narsil
Collaborator

Narsil commented Aug 3, 2023

One possible solution is to create an API template on the server side

That's definitely an option, one I would liken to guidance and token healing if we were to do it; they seem to serve the same purpose: extending the querying API in a user-defined way (both for the server operator and the actual querying user).

@viniciusarruda

I've implemented a small wrapper around chat completions for Llama 2.
The easyllm package from @philschmid seems good; I've compared it with my Llama 2 implementation and it gives the same result!

@paulcx

paulcx commented Aug 4, 2023

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. There could be some slight out sync between the Python client and the server, but that's not intentional.

Here is what I'm referring to. The request parameters are slightly different from the ones in the API. That's okay, but why is 'detail' manually set to True here?

@jcushman

jcushman commented Sep 8, 2023

In case this is helpful, llama.cpp does this via api_like_OAI.py. This PR would update that script to use fastchat's conversation.py to handle the serialization problem discussed upthread.

@zfang

zfang commented Sep 20, 2023

I would love to have this supported.

@abhinavkulkarni
Contributor

LiteLLM has support for TGI: https://docs.litellm.ai/docs/providers/huggingface#text-generation-interface-tgi---llms

@krrishdholakia

Thanks for mentioning us @abhinavkulkarni

Hey @Narsil @jcushman @zfang
Happy to help here.

This is the basic code:

import os 
from litellm import completion 

# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key" 

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. Call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference endpoints
response = completion(
    model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
    messages=messages,
    api_base="https://my-endpoint.huggingface.cloud",
)

print(response)

We also handle prompt formatting - https://docs.litellm.ai/docs/providers/huggingface#models-with-prompt-formatting based on the lmsys/fastchat implementation.

But you can override this with your own templates if necessary - https://docs.litellm.ai/docs/providers/huggingface#custom-prompt-templates
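
Per the custom-prompt-template docs linked above, registering your own template looks roughly like this; the Llama-2-style role markers below are illustrative, so check the exact keyword arguments against the current litellm docs:

import litellm

# Illustrative custom chat template for a TGI-hosted Llama 2 chat model
litellm.register_prompt_template(
    model="huggingface/meta-llama/Llama-2-70b-chat-hf",
    roles={
        "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n[/INST]\n"},
        "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
        "assistant": {"pre_message": "", "post_message": "\n"},
    },
)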

@zfang

zfang commented Sep 24, 2023

Hi @krrishdholakia,

Thanks for the info. Rather than a client, what I actually need is a middle service, because I'm trying to host an API server for the chatbot arena: https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model

I can use vLLM to host a service and provide an OpenAI-compatible API, but it's quite a bit slower than TGI. It pains me that TGI doesn't support this. I will probably need to hack a FastChat service to redirect calls to TGI.

Regards,

Felix

@krrishdholakia

krrishdholakia commented Sep 24, 2023

@zfang we have an open-source proxy you can fork and run this through - https://github.com/BerriAI/liteLLM-proxy

Would it be helpful if we exposed a CLI command to deploy this?

litellm --deploy

@abhinavkulkarni
Contributor

abhinavkulkarni commented Sep 28, 2023

LiteLLM has developed an OpenAI wrapper for TGI (and for lots of other model-serving frameworks).

Here are more details: https://docs.litellm.ai/docs/proxy_server

You can set it up as follows:

Set up a local TGI endpoint first:

$ text-generation-launcher \
  --model-id abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --trust-remote-code --port 8080 \
  --max-input-length 5376 --max-total-tokens 6144 --max-batch-prefill-tokens 6144 \
  --quantize awq

Then run a LiteLLM proxy server on top of it:

$ litellm \
  --model huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --api_base http://localhost:8080

I am able to successfully obtain responses from the openai.ChatCompletion.create endpoint as follows:

>>> import openai
>>> openai.api_key = "xyz"
>>> openai.api_base = "http://0.0.0.0:8000"
>>> model = "huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq"
>>> completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": "How are you?"}])
>>> print(completion)
{
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "I'm fine, thanks. I'm glad to hear that.\n\nI'm",
        "role": "assistant",
        "logprobs": -18.19830319
      }
    }
  ],
  "id": "chatcmpl-7f8f5312-893a-4dab-aff5-3a97a354c2be",
  "created": 1695869575.316254,
  "model": "abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 15,
    "total_tokens": 19
  }
}

@michaelfeil

@zfang @paulcx I implemented this feature directly in Rust in the Apache-2.0 licensed fork of the project.

https://github.com/Preemo-Inc/text-generation-inference

@krrishdholakia

Hey @michaelfeil - is TGI closed-source now? I can't find other info on this.

@Narsil
Collaborator

Narsil commented Oct 4, 2023

We added a restriction in 1.0 which means you cannot use it as a cloud provider as-is without getting a license from us.
Most likely it doesn't change anything for you.

More details here:
#744

@adrianog

@zfang @paulcx I implemented this feature on the Apache-2.0 licenced forked project directly in Rust.

https://github.com/Preemo-Inc/text-generation-inference

Can I use this to wrap the official Inference API as published by HF?
I can't seem to find an example of how to create models using the HF Inference API from LlamaIndex.

@batindfa

batindfa commented Oct 20, 2023

@abhinavkulkarni hi, how do I run a LiteLLM proxy server?
I run litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://0.0.0.0:8080/generate on the Linux command line, but it returns bash: litellm: command not found.

@abhinavkulkarni
Contributor

@yanmengxiang1: Please install litellm using pip.

@batindfa

@abhinavkulkarni yes, I know. Should I use something like Flask to wrap this TGI?

@abhinavkulkarni
Contributor

Hey @yanmengxiang1:

Run TGI at port 8080. Then run litellm so that it points to TGI:

litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://localhost:8080 --port 8000

You now have an OpenAI-compatible API endpoint at port 8000.

@krrishdholakia

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

@LarsHill

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

@krrishdholakia

krrishdholakia commented Nov 1, 2023

Hey @LarsHill the LiteLLM community is discussing the best approach right now - BerriAI/litellm#648 (comment)

Some context
We'd initially planned on the Docker container being an easier replacement (consistent environment + easier to deploy), but it might not be ideal. So we're trying to understand what works best (how do we provide a consistent experience + an easy way to set up configs, etc.).

DM'ing you to understand what a good experience here looks like.

@bitsnaps

bitsnaps commented Feb 6, 2024

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

text-generation-webui has made huge progress on supporting other providers via extensions; you can serve an OpenAI-compatible API using these commands:

# clone the repo, then cd into it

# install deps:
!pip install -q -r requirements.txt --upgrade
# install extensions (openai...)
!pip install -q -r extensions/openai/requirements.txt --upgrade

# download your model (using this script allows you to download large models):
!python download-model.py https://huggingface.co/TheBloke/SauerkrautLM-UNA-SOLAR-Instruct-GPTQ 
# this one works better for MemGPT

# serve your model (check the name of the downloaded file/directory):
!python server.py --model TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ --n-gpu-layers 24 --n_ctx 2048 --api --nowebui --extensions openai 

# or download a specific file (if using GGUF models):
!python download-model.py https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF  --specific-file dolphin-2.7-mixtral-8x7b.Q2_K.gguf

Your server should be up and running on port 5000 (by default):

!curl http://0.0.0.0:5000/v1/completions -H "Content-Type: application/json" -d '{ "prompt": "This is a cake recipe:\n\n1.","max_tokens": 200, "temperature": 1,  "top_p": 0.9, "seed": 10 }'

This approach lets you run any model (even ones that aren't available as Ollama docker images) without hitting Hugging Face's API, including large models (>= 10 GB) and models that have no Inference API. Neither litellm nor ollama is required.

slimsag added a commit to sourcegraph/cody that referenced this issue Feb 20, 2024
Increasingly, LLM software is standardizing around the use of OpenAI-esque
compatible endpoints. Some examples:

* [OpenLLM](https://github.com/bentoml/OpenLLM) (commonly used to self-host/deploy various LLMs in enterprises)
* [Huggingface TGI](huggingface/text-generation-inference#735) (and, by extension, [AWS SageMaker](https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/))
* [Ollama](https://github.com/ollama/ollama) (commonly used for running LLMs locally, useful for local testing)

All of these projects either have OpenAI-compatible API endpoints already,
or are actively building out support for it. On strat we are regularly
working with enterprise customers that self-host their own specific-model
LLM via one of these methods, and wish for Cody to consume an OpenAI
endpoint (understanding some specific model is on the other side and that
Cody should optimize for / target that specific model.)

Since Cody needs to tailor to a specific model (prompt generation, stop
sequences, context limits, timeouts, etc.) and handle other provider-specific
nuances, it is insufficient to simply expect that a customer-provided OpenAI
compatible endpoint is in fact 1:1 compatible with e.g. GPT-3.5 or GPT-4.
We need to be able to configure/tune many of these aspects to the specific
provider/model, even though it presents as an OpenAI endpoint.

In response to these needs, I am working on adding an 'OpenAI-compatible'
provider proper: the ability for a Sourcegraph enterprise instance to
advertise that although it is connected to an OpenAI compatible endpoint,
there is in fact a specific model on the other side (starting with Starchat
and Starcoder) and that Cody should target that configuration. The _first
step_ of this work is this change.

After this change, an existing (current-version) Sourcegraph enterprise
instance can configure an OpenAI endpoint for completions via the site
config such as:

```
  "cody.enabled": true,
  "completions": {
    "provider": "openai",
    "accessToken": "asdf",
    "endpoint": "http://openllm.foobar.com:3000",
    "completionModel": "gpt-4",
    "chatModel": "gpt-4",
    "fastChatModel": "gpt-4",
  },
```

The `gpt-4` model parameters will be sent to the OpenAI-compatible endpoint
specified, but will otherwise be unused today. Users may then specify in
their VS Code configuration that Cody should treat the LLM on the other
side as if it were e.g. Starchat:

```
    "cody.autocomplete.advanced.provider": "experimental-openaicompatible",
    "cody.autocomplete.advanced.model": "starchat-16b-beta",
    "cody.autocomplete.advanced.timeout.multiline": 10000,
    "cody.autocomplete.advanced.timeout.singleline": 10000,
```

In the future, we will make it possible to configure the above options
via the Sourcegraph site configuration instead of each user needing to
configure it in their VS Code settings explicitly.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
@drbh
Collaborator

drbh commented Mar 6, 2024

This functionality is now supported in TGI with the introduction of the Messages API and can be used like this:

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
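
For streaming, a minimal variant of the same call (reusing the client above, and assuming the endpoint forwards OpenAI-style event chunks as described in the streaming section of the docs):

stream = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True,
)

for chunk in stream:
    # each chunk carries an incremental delta; the final chunk's content may be None
    print(chunk.choices[0].delta.content or "", end="")
print()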

Please see the docs here for more details https://huggingface.co/docs/text-generation-inference/messages_api

@drbh drbh closed this as completed Mar 6, 2024
slimsag added a commit to sourcegraph/cody that referenced this issue Mar 26, 2024