Adding OpenAI Compatible RESTful API #317

Merged: 27 commits merged into microsoft:main on Feb 2, 2024

Conversation

@PawanOsman (Contributor) commented Nov 18, 2023:

Hey everyone,
I just pushed a draft PR where I've added an OpenAI-compatible RESTful API to DeepSpeed-MII. This update is about making our tool more flexible and user-friendly, especially for those looking to integrate with OpenAI's ecosystem.

Also, I added another API for normal text generation. This new API is super user-friendly and supports streaming responses.

I'm still working on it, so it's not final yet; it may contain bugs and errors.

Any thoughts or feedback are welcome!

Fixes: #316

This pull request introduces two RESTful API servers:

  1. OpenAI Compatible RESTful API Server
  2. Text Generation RESTful API Server

OpenAI Compatible RESTful API

This server provides an OpenAI-compatible API for text and chat completions.

Running the Server

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Key Features and Arguments

  • --chat-template: Sets the chat template (can be a file path or the template content).
  • --response-role: Defines the role of the responder (e.g., "assistant"); applies only to requests where add_generation_prompt is true.
  • --api-keys: Enables API key authentication; accepts a comma-separated list of keys (see the sketch after this list).
  • --ssl: Enables SSL for secure communication.
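As a minimal sketch of how a client could authenticate once --api-keys is enabled (the host, port, and key below are placeholders, not values taken from this PR):

from openai import OpenAI

# Hypothetical values: the key must match one of the comma-separated
# entries passed via --api-keys when the server was started.
client = OpenAI(
    base_url="http://localhost:3000/v1",  # placeholder host and port
    api_key="my-secret-key",              # placeholder key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)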

Text Generation RESTful API

A simpler API focused on text generation, both in streaming and non-streaming formats. Suitable for applications that require straightforward text generation capabilities.

Running the Server

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Common Features for Both APIs

  • CORS middleware configuration for secure cross-origin requests.
  • --load-balancer: When you run the MII instance separately, you can set the load balancer host and port (e.g., "localhost:50050").

Separately Running MII Instance

You can start the MII instance separately and then connect the servers to the MII instance load balancer.

Running MII Instance

mii.serve("mistralai/Mistral-7B-Instruct-v0.1", deployment_name="deepspeed-mii")

Connecting to the Load Balancer

For the OpenAI Compatible Server:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --load-balance "localhost:50050"

For the Text Generation Server:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --load-balance "localhost:50050"

@PawanOsman (Contributor, Author):

@microsoft-github-policy-service agree

@aliozts commented Nov 20, 2023:

Hi! First of all, thank you for such a feature; it will be extremely useful for deploying models with MII while utilizing the generic classes from LangChain, LlamaIndex, etc. One thing that I would like to ask: what do you think about utilizing chat templates?

vLLM also has a related PR for this, if it helps.

@PawanOsman (Contributor, Author):

Hi! First of all, thank you for such a feature; it will be extremely useful for deploying models with MII while utilizing the generic classes from LangChain, LlamaIndex, etc. One thing that I would like to ask: what do you think about utilizing chat templates?

vLLM also has a related PR for this, if it helps.

Looks like it is better to use the transformers chat templates instead of FastChat's chat templates. Thanks for the suggestion, I forgot about the transformers chat templates 😅 I will update it to use them instead.

@Tostino commented Dec 2, 2023:

@PawanOsman I had to do a bit more work on this since it was mentioned, and there is a newer PR that was merged yesterday into vLLM adding HF chat template support: vllm-project/vllm#1756

Just mentioning it because there were some decisions that had to be made to support it, and it'd be best if inference servers stayed as compatible as they can. E.g. supporting request.add_generation_prompt, which isn't part of the OpenAI API but which HF chat templates offer as a configuration knob (default it to True to maintain compatibility with the official chat/completions endpoint).
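For reference, this is roughly how that knob looks on the transformers side (a minimal sketch, not code from this PR; the model name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [{"role": "user", "content": "Say this is a test!"}]

# add_generation_prompt=True appends the tokens that cue the model to answer
# as the assistant, matching the behavior of the official chat/completions endpoint.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)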

@mrwyattii (Contributor):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

@PawanOsman (Contributor, Author):

@PawanOsman I had to do a bit more work on this since it was mentioned, and there is a newer PR that was merged yesterday into vLLM adding HF chat template support: vllm-project/vllm#1756

Just mentioning it because there were some decisions that had to be made to support it, and it'd be best if inference servers stayed as compatible as they can. E.g. supporting request.add_generation_prompt, which isn't part of the OpenAI API but which HF chat templates offer as a configuration knob (default it to True to maintain compatibility with the official chat/completions endpoint).

Thanks for mentioning that 🙏

@PawanOsman (Contributor, Author):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Been really tied up lately, but I'm pushing to get this sorted out as quickly as possible. Will update you soon on the progress

@PawanOsman marked this pull request as ready for review December 15, 2023 12:09
@nani1149:

Does it support tensor parallelism? I am trying to load a 70B Llama model and the server crashes because it runs out of memory.

@PawanOsman (Contributor, Author):

Does it support tensor parallelism? I am trying to load a 70B Llama model and the server crashes because it runs out of memory.

Thanks, I just added --tensor-parallel and --replica-num args.

You can set the tensor parallel size like below:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --tensor-parallel 2

@PawanOsman (Contributor, Author):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Hi @mrwyattii, this PR is ready for review

@nani1149:

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

@PawanOsman (Contributor, Author) commented Dec 18, 2023:

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

This is an OpenAI-compatible API server, so you can run it with:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then use it with the OpenAI client libraries or directly via HTTP requests.
The API key can be anything as long as the --api-keys argument was not set in the server command.

Example:

curl http://ip:port/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.1",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

Or using the Python library:

from openai import OpenAI

client = OpenAI(
    base_url="http://ip:port/v1",
    api_key="none",  # any value works when --api-keys is not set
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "How do I output all files in a directory using Python?",
        },
    ],
)
print(completion.choices[0].message.content)

It supports both text and chat completion requests.
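For example, a plain text completion through the same client might look like this (a sketch assuming the server mirrors OpenAI's /v1/completions endpoint; host, port, and key are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://ip:port/v1", api_key="none")  # placeholders

# Text completion rather than chat completion.
completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="DeepSpeed is ",
    max_tokens=128,
    temperature=0.7,
)
print(completion.choices[0].text)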

@mrwyattii (Contributor):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Hi @mrwyattii, this PR is ready for review

@PawanOsman I'll review this today/tomorrow and share any feedback. Thank you for the contribution!

@nani1149:

Can you please provide an example of inference with the text generation API server?

@PawanOsman (Contributor, Author):

Can you please provide an example of inference with the text generation API server?

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

You can run the Text Generation RESTful API server using this command:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then you can use it by sending an HTTP request.

Client Usage Example:

curl http://ip:port/generate \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": "Deepspeed is ",
     "max_tokens": 256,
     "temperature": 0.7
   }'
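The same request from Python, which may be closer to what a separate application in Kubernetes would send (a minimal sketch using the requests library; host and port are placeholders):

import requests

# POST the same JSON payload as the curl example above.
response = requests.post(
    "http://ip:port/generate",  # placeholder host and port
    json={
        "prompt": "Deepspeed is ",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json())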

@mrwyattii (Contributor) left a review:

Thank you for all this work @PawanOsman! I left a few comments/suggestions.

It would be great if we could add unit tests and documentation. Documentation can go on the repo landing page or in a mii/entrypoints/README.md that we link from the landing page. I can take care of that if you do not have the time (you have already done a lot!)

I think this is a great addition to DeepSpeed-MII and I would like to replace the existing RESTful API with the implementation you provide here and better integrate it with the rest of the MII code. However, I don't want to delay merging this. I can work on some refactoring and replacing the other RESTful API in a future PR.

Review threads (resolved): mii/entrypoints/api_server.py (2), mii/entrypoints/data_models.py (1), mii/entrypoints/openai_api_server.py (2, one outdated)
PawanOsman and others added 7 commits December 20, 2023 21:37 (several co-authored by Michael Wyatt <mrwyattii@gmail.com>)
@PawanOsman (Contributor, Author):

Thank you for all this work @PawanOsman! I left a few comments/suggestions.

It would be great if we could add unit tests and documentation. Documentation can go on the repo landing page or in a mii/entrypoints/README.md that we link from the landing page. I can take care of that if you do not have the time (you have already done a lot!)

I think this is a great addition to DeepSpeed-MII and I would like to replace the existing RESTful API with the implementation you provide here and better integrate it with the rest of the MII code. However, I don't want to delay merging this. I can work on some refactoring and replacing the other RESTful API in a future PR.

Thanks for the review and your feedback! I'm currently short on time and don't want to delay things. If you could take on the unit tests and documentation, that would be great.

@mrwyattii (Contributor):

@PawanOsman Can you please run formatting on your branch and then I can merge this? Sorry for the delay here!

pre-commit run --all-files

@mrwyattii merged commit 816cbd2 into microsoft:main on Feb 2, 2024
2 of 3 checks passed
@gangooteli:

Can you please provide an example of inference with the text generation API server?

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

You can run the Text Generation RESTful API server using this command:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then you can use it by sending an HTTP request.

Client Usage Example:

curl http://ip:port/generate \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": "Deepspeed is ",
     "max_tokens": 256,
     "temperature": 0.7
   }'

I am getting the error below while running this:

root@b6d65533ab32:/# python3 -m mii.entrypoints.api_server     --model "mistralai/Mistral-7B-Instruct-v0.1"     --port 3000     --host 0.0.0.0
[2024-02-13 22:30:07,088] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-13 22:30:07,798] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
[2024-02-13 22:30:07,798] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mii/entrypoints/api_server.py", line 190, in <module>
    mii.serve(args.model,
  File "/usr/local/lib/python3.10/dist-packages/mii/api.py", line 124, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/deepspeed-mii/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/usr/local/lib/python3.10/dist-packages/mii/backend/server.py", line 44, in __init__
    mii_config.generate_replica_configs()
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 302, in generate_replica_configs
    replica_pool = _allocate_devices(self.hostfile,
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 350, in _allocate_devices
    raise ValueError(
ValueError: Only able to place 0 replicas, but 1 replicas were requested.
root@b6d65533ab32:/# 

Also, when running with the option --load-balance "0.0.0.0:50050", the API starts.

But I get an Internal Server Error when running a curl command:
root@ai:~# curl http://0.0.0.0:8080/generate -H "Content-Type: application/json" -d '{
"prompt": "Deepspeed is ",
"max_tokens": 256,
"temperature": 0.7
}'
Internal Server Error

root@b6d65533ab32:/# python3 -m mii.entrypoints.api_server     --model "mistralai/Mistral-7B-Instruct-v0.1"     --port 8080     --host 0.0.0.0     --load-balance "0.0.0.0:50050"
[2024-02-13 22:28:10,898] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:     Started server process [883]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)



INFO:     172.17.0.1:33688 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/mii/entrypoints/api_server.py", line 98, in generate
    responseData = await stub.GeneratorReply(requestData)
  File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:50050: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:50050: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-02-13T22:28:14.61881733+00:00"}"
>
^CINFO:     Shutting down

Please help

@freQuensy23-coder:

I am hitting the same "Only able to place 0 replicas" problem.

@SeungminHeo:

I think the do_sample parameter needs to be configurable per request; some LLM models are trained to use greedy sampling.
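To illustrate the suggestion, a per-request override might look like the following (purely hypothetical; the merged /generate endpoint may not accept this field):

import requests

# "do_sample" here is the field being proposed, not necessarily one the
# current API supports; host and port are placeholders.
response = requests.post(
    "http://ip:port/generate",
    json={
        "prompt": "Deepspeed is ",
        "max_tokens": 256,
        "do_sample": False,  # ask for greedy decoding on this request
    },
)
print(response.json())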
