[Feature] Dynamic Model Loading and Model Endpoint in FastAPI #17

MillionthOdin16 · 2023-04-04T01:38:54Z

I'd like to propose a future feature I think would add useful flexibility for users of the completions/embeddings API. I'm suggesting the ability to dynamically load models based on calls to the FastAPI endpoint.

The concept is as follows:

Have a predefined location for model files (e.g., a models folder within the project) and allow users to specify an additional model folder if needed.
When the API starts, it checks the designated model folders and populates the available models dynamically.
Users can query the available models through a GET request to the /v1/engines endpoint, which would return a list of models and their statuses.
Users can then specify the desired model when making inference requests.
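
A minimal sketch of the discovery step described above, using only the standard library. The function names, the `.bin`/`.gguf` suffix filter, and the exact response fields are assumptions made for illustration; the payload is only loosely modeled on an OpenAI-style listing and is not the project's actual implementation.

```python
from pathlib import Path

# Assumption: model files are recognised by extension.
MODEL_SUFFIXES = {".bin", ".gguf"}


def discover_models(*folders):
    """Return {model_id: path} for every model file found in the given folders."""
    found = {}
    for folder in folders:
        root = Path(folder)
        if not root.is_dir():
            continue  # a missing optional folder is skipped, not treated as an error
        for path in sorted(root.iterdir()):
            if path.is_file() and path.suffix in MODEL_SUFFIXES:
                # The first folder wins when two folders contain the same file name.
                found.setdefault(path.stem, path)
    return found


def list_engines(models, loaded=()):
    """Build a listing payload; `ready` reports whether a model is already loaded."""
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "engine", "ready": name in loaded}
            for name in sorted(models)
        ],
    }
```

A GET handler for /v1/engines could return `list_engines(discover_models("models", extra_folder))`, and an inference request could then look up the requested model id in the same dictionary (`extra_folder` is a hypothetical setting for the user-supplied folder).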

This dynamic model loading feature would align with the behavior of the OpenAI spec for models and model status. It would offer users the flexibility to easily choose and use different models without having to make manual changes to the project or configs.

This is a suggestion for later, but I wanted to raise it now so we can plan for it if we decide to implement it.

Let me know your thoughts :)


0xdevalias · 2023-04-11T01:29:21Z

Potentially related:

Add /models/{model} endpoint #38

jmtatsch · 2023-04-12T01:30:59Z

@abetlen requested a list of prompt formats for various models

Alpaca:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
List 3 ingredients for the following recipe: Spaghetti Bolognese

### Response:

Vicuna:

### Human:
List 3 ingredients for Spaghetti Bolognese.

### Assistant:

as discussed in ggml-org/llama.cpp#302 (comment)

Koala:

BEGINNING OF CONVERSATION: USER: Hello! GPT:Hi! How can I help you?</s>USER: What is the largest animal on earth? GPT:

source: https://github.com/young-geng/EasyLM/blob/main/docs/koala.md

Open Assistant: (no llama.cpp support yet)

<|prefix_begin|>You are a large language model that wants to be helpful<|prefix_end|><|prompter|>What is red and round?<|endoftext|><|assistant|>Hmm, a red balloon?<|endoftext|><|prompter|>No, smaller<|endoftext|><|assistant|>

Source: https://github.com/LAION-AI/Open-Assistant/blob/8818d5515a5d889332d051b7989091648c017c20/model/MESSAGE_AND_TOKEN_FORMAT.md

MillionthOdin16 · 2023-04-13T00:01:28Z

@abetlen

Here's something interesting from Vicuna that I just saw. I can definitely see the challenge in trying to adapt to all these different input formats. This seems like an extensible format that might help; I'm not sure where you currently are on it.

https://github.com/lm-sys/FastChat/blob/00d9e6675bdff60be6603ffff9313b1d797d2e3e/fastchat/conversation.py#L83-L112

Edit:
I actually don't know if they're using FastAPI 😂 Now that I look at it more closely, it looks very similar.

abetlen · 2023-04-13T01:10:43Z

@jmtatsch @MillionthOdin16 thank you!

I still have a few questions on the best way to implement this and would appreciate any input.

The basic features would allow you to:

Specify a config file in whatever format is easiest for pydantic to parse
Specify one or more models to load with their paths, default llama.cpp parameters, and an alias.
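
A rough sketch of what such a config could look like, written with standard-library dataclasses and JSON so it runs without extra dependencies. The field names (`path`, `alias`, `params`), the JSON layout, and the example file paths are all hypothetical; a pydantic model with the same fields would be a natural substitute.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    path: str                                   # location of the model file
    alias: str                                  # name clients use in requests
    params: dict = field(default_factory=dict)  # default llama.cpp parameters


def parse_config(text):
    """Parse a JSON config string into {alias: ModelConfig}."""
    raw = json.loads(text)
    configs = {}
    for entry in raw["models"]:
        cfg = ModelConfig(
            path=entry["path"],
            alias=entry["alias"],
            params=entry.get("params", {}),
        )
        if cfg.alias in configs:
            raise ValueError(f"duplicate alias: {cfg.alias}")
        configs[cfg.alias] = cfg
    return configs


# Hypothetical example config; the paths are placeholders.
EXAMPLE = """
{"models": [
  {"path": "models/vicuna-7b.bin", "alias": "vicuna", "params": {"n_ctx": 2048}},
  {"path": "models/alpaca-7b.bin", "alias": "alpaca"}
]}
"""
```

Keying the result by alias means a request's model name can be resolved with a single dictionary lookup, and duplicate aliases are rejected when the config is read rather than at request time.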

The part I'm still scratching my head on is the chat models:

The request passes in a list of messages
Turning chat messages -> prompt is model dependent
Only some models (Vicuna, maybe gpt4all) can handle chat correctly.

I guess the solution would be to have some way to specify these pre-defined models and custom prompt serialisation functions for each.
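
One way this idea could be sketched is a registry that maps a model name to a function turning the messages list into a prompt. The registry, the function names, and the handling of system messages are assumptions for illustration; the Vicuna and Koala layouts follow the examples posted earlier in this thread.

```python
def serialize_vicuna(messages):
    """'### Human:' / '### Assistant:' turns, ending with an open assistant turn."""
    headers = {"user": "### Human:", "assistant": "### Assistant:"}
    parts = []
    for m in messages:
        if m["role"] == "system":
            parts.append(m["content"] + "\n")  # assumption: system text goes in as-is
        else:
            parts.append(f"{headers[m['role']]}\n{m['content']}\n")
    parts.append("### Assistant:\n")
    return "\n".join(parts)


def serialize_koala(messages):
    """Single-line 'USER: ... GPT:' format with </s> closing each reply."""
    prompt = "BEGINNING OF CONVERSATION: "
    for m in messages:
        if m["role"] == "user":
            prompt += f"USER: {m['content']} GPT:"
        elif m["role"] == "assistant":
            prompt += f"{m['content']}</s>"
    return prompt


# Hypothetical registry: model alias -> prompt serialisation function.
SERIALIZERS = {"vicuna": serialize_vicuna, "koala": serialize_koala}


def messages_to_prompt(model, messages):
    """Look up the serializer for a model and apply it to the chat messages."""
    if model not in SERIALIZERS:
        raise ValueError(f"no chat serializer registered for {model!r}")
    return SERIALIZERS[model](messages)
```

Supporting a new model then only requires adding one function and one registry entry, and a model without an entry fails with a clear error rather than producing a badly formatted prompt.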

docmeth02 · 2023-04-13T19:31:34Z

> I guess the solution would be to have some way to specify these pre-defined models and custom prompt serialisation functions for each.

Hi!
The way I implemented this on a local copy is that I added a method called generate_completion_prompts to llama_cpp.Llama that returns the PROMPT string and the PROMPT_STOP list.

That way you can override the prompt generation from the outside, and you could provide a list of model-specific implementations to handle the message history and prompt generation on a per-model basis :)
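
A sketch of how that override pattern might look. The `PromptBuilder` class is a stand-in invented for this example and is not the real `llama_cpp.Llama`; the method name comes from the comment above, while its signature, the default format, and the stop strings are assumptions. The Alpaca template is the one posted earlier in this thread.

```python
class PromptBuilder:
    """Stand-in for the model class; NOT the real llama_cpp.Llama."""

    def generate_completion_prompts(self, messages):
        """Default: 'role: content' lines, stopping before the next user turn."""
        lines = [f"{m['role']}: {m['content']}" for m in messages]
        prompt = "\n".join(lines) + "\nassistant:"
        return prompt, ["\nuser:"]


class AlpacaPromptBuilder(PromptBuilder):
    """Model-specific override using the Alpaca instruction format."""

    HEADER = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    )

    def generate_completion_prompts(self, messages):
        # Assumption: only the latest message is used as the instruction.
        instruction = messages[-1]["content"]
        prompt = (
            f"{self.HEADER}\n\n### Instruction:\n{instruction}\n\n### Response:\n"
        )
        # Assumed stop string: halt if the model starts a new instruction block.
        return prompt, ["### Instruction:"]
```

Returning the stop list together with the prompt keeps both model-specific pieces in one place, so the calling code does not need to know which format a given model expects.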

abetlen · 2023-12-22T22:31:44Z

Implemented in #931

abetlen added the enhancement and server labels Apr 4, 2023

This was referenced Apr 7, 2023

Investigate model aliasing #39

Closed

Fix v1/chat/completions Gibberish API Responses #41

Closed

0xdevalias mentioned this issue Apr 11, 2023

Add /models/{model} endpoint #38

Closed

abetlen mentioned this issue Apr 11, 2023

Implement chat continuation #68

Closed

abetlen pinned this issue Apr 16, 2023

abetlen mentioned this issue Apr 16, 2023

Openplayground Suport #80

Closed

abetlen referenced this issue Apr 17, 2023

Update chat prompt

6208751

gjmulder added the high-priority label May 23, 2023

abetlen unpinned this issue Jul 18, 2023

abetlen closed this as completed Dec 22, 2023
