Adding OpenAI Compatible RESTful API #317

Merged: 27 commits merged into microsoft:main on Feb 2, 2024

Conversation

@PawanOsman (Contributor) commented Nov 18, 2023:

Hey everyone,
I just pushed a draft PR where I've added an OpenAI-compatible RESTful API to DeepSpeed-MII. This update is about making our tool more flexible and user-friendly, especially for those looking to integrate with OpenAI's ecosystem.

Also, I added another API for normal text generation. This new API is super user-friendly and supports streaming responses.

I'm still working on it, so it's not final yet; it may contain bugs and errors.

Any thoughts or feedback are welcome!

Fixes: #316

This pull request introduces two RESTful API servers:

  1. OpenAI Compatible RESTful API Server
  2. Text Generation RESTful API Server

OpenAI Compatible RESTful API

This server provides an OpenAI-compatible API for text and chat completions.

Running the Server

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Key Features and Arguments

  • --chat-template: Sets the chat template (can be a file path or the template content).
  • --response-role: Defines the role of the responder (e.g., "assistant"); applies only to requests where add_generation_prompt is true.
  • --api-keys: Enables API key authentication; accepts a comma-separated list of keys (see the sketch after this list).
  • --ssl: Enables SSL for secure communication.
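As a minimal sketch of how a client could authenticate once --api-keys is enabled (the host, port, and key below are placeholders, not values taken from this PR):

from openai import OpenAI

# Hypothetical values: the key must match one of the comma-separated
# entries passed via --api-keys when the server was started.
client = OpenAI(
    base_url="http://localhost:3000/v1",  # placeholder host and port
    api_key="my-secret-key",              # placeholder key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)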

Text Generation RESTful API

A simpler API focused on text generation, both in streaming and non-streaming formats. Suitable for applications that require straightforward text generation capabilities.

Running the Server

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Common Features for Both APIs

  • CORS middleware configuration for secure cross-origin requests.
  • --load-balancer: When you run the MII instance separately, you can set the load balancer host and port (e.g., "localhost:50050").

Separately Running MII Instance

You can start the MII instance separately and then connect the servers to the MII instance load balancer.

Running MII Instance

mii.serve("mistralai/Mistral-7B-Instruct-v0.1", deployment_name="deepspeed-mii")

Connecting to the Load Balancer

For the OpenAI Compatible Server:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --load-balance "localhost:50050"

For the Text Generation Server:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --load-balance "localhost:50050"

@PawanOsman (Contributor, Author):

@microsoft-github-policy-service agree

@aliozts commented Nov 20, 2023:

Hi! First of all, thank you for such a feature; it will be extremely useful for deploying models with MII while utilizing the generic classes from LangChain, LlamaIndex, etc. One thing that I would like to ask: what do you think about utilizing chat templates?

vLLM also has a related PR for this, if it helps.

@PawanOsman (Contributor, Author):

Hi! First of all, thank you for such a feature; it will be extremely useful for deploying models with MII while utilizing the generic classes from LangChain, LlamaIndex, etc. One thing that I would like to ask: what do you think about utilizing chat templates?

vLLM also has a related PR for this, if it helps.

Looks like it is better to use the transformers chat templates instead of FastChat's chat templates. Thanks for the suggestion, I forgot about the transformers chat templates 😅 I will update it to use them instead.

@Tostino commented Dec 2, 2023:

@PawanOsman I had to do a bit more work on this since it was mentioned, and there is a newer PR that was merged yesterday into vLLM adding HF chat template support: vllm-project/vllm#1756

Just mentioning it because there were some decisions that had to be made to support it, and it'd be best if inference servers stayed as compatible as they can. E.g. supporting request.add_generation_prompt, which isn't part of the OpenAI API but which HF chat templates offer as a configuration knob (default it to True to maintain compatibility with the official chat/completions endpoint).
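For reference, this is roughly how that knob looks on the transformers side (a minimal sketch, not code from this PR; the model name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [{"role": "user", "content": "Say this is a test!"}]

# add_generation_prompt=True appends the tokens that cue the model to answer
# as the assistant, matching the behavior of the official chat/completions endpoint.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)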

@mrwyattii (Contributor):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

@PawanOsman (Contributor, Author):

@PawanOsman I had to do a bit more work on this since it was mentioned, and there is a newer PR that was merged yesterday into vLLM adding HF chat template support: vllm-project/vllm#1756

Just mentioning it because there were some decisions that had to be made to support it, and it'd be best if inference servers stayed as compatible as they can. E.g. supporting request.add_generation_prompt, which isn't part of the OpenAI API but which HF chat templates offer as a configuration knob (default it to True to maintain compatibility with the official chat/completions endpoint).

Thanks for mentioning that 🙏

@PawanOsman (Contributor, Author):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Been really tied up lately, but I'm pushing to get this sorted out as quickly as possible. Will update you soon on the progress

@PawanOsman marked this pull request as ready for review December 15, 2023 12:09
@nani1149:

Does it support tensor parallelism? I am trying to load a 70B Llama model and the server crashes because it runs out of memory.

@PawanOsman (Contributor, Author):

Does it support tensor parallelism? I am trying to load a 70B Llama model and the server crashes because it runs out of memory.

Thanks, I just added --tensor-parallel and --replica-num args.

You can set the tensor parallel size like below:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --tensor-parallel 2

@PawanOsman (Contributor, Author):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Hi @mrwyattii, this PR is ready for review

@nani1149:

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

@PawanOsman (Contributor, Author) commented Dec 18, 2023:

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

This is an OpenAI-compatible API server, so you can run it with:

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then use it with the OpenAI client libraries or directly via HTTP requests.
The API key can be anything as long as the --api-keys argument was not set in the server command.

Example:

curl http://ip:port/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.1",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

Or using the Python library:

from openai import OpenAI

client = OpenAI(
    base_url="http://ip:port/v1",
    api_key="none",  # any value works when --api-keys is not set
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "How do I output all files in a directory using Python?",
        },
    ],
)
print(completion.choices[0].message.content)

It supports both text and chat completion requests.
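For example, a plain text completion through the same client might look like this (a sketch assuming the server mirrors OpenAI's /v1/completions endpoint; host, port, and key are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://ip:port/v1", api_key="none")  # placeholders

# Text completion rather than chat completion.
completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="DeepSpeed is ",
    max_tokens=128,
    temperature=0.7,
)
print(completion.choices[0].text)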

@mrwyattii (Contributor):

Hi @PawanOsman thank you for this amazing contribution! Can you let me know when it is ready for review and I will work with you to get it merged? (It is currently still marked as a "draft")

Hi @mrwyattii, this PR is ready for review

@PawanOsman I'll review this today/tomorrow and share any feedback. Thank you for the contribution!

@nani1149:

Can you please provide an example of inference with the text generation API server?

@PawanOsman (Contributor, Author):

Can you please provide an example of inference with the text generation API server?

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

You can run the Text Generation RESTful API server using this command:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then you can use it by sending an HTTP request.

Client Usage Example:

curl http://ip:port/generate \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": "Deepspeed is ",
     "max_tokens": 256,
     "temperature": 0.7
   }'
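The same request from Python, which may be closer to what a separate application in Kubernetes would send (a minimal sketch using the requests library; host and port are placeholders):

import requests

# POST the same JSON payload as the curl example above.
response = requests.post(
    "http://ip:port/generate",  # placeholder host and port
    json={
        "prompt": "Deepspeed is ",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json())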

@mrwyattii (Contributor) left a review:

Thank you for all this work @PawanOsman! I left a few comments/suggestions.

It would be great if we could add unit tests and documentation. Documentation can go on the repo landing page or in a mii/entrypoints/README.md that we link from the landing page. I can take care of that if you do not have the time (you have already done a lot!)

I think this is a great addition to DeepSpeed-MII and I would like to replace the existing RESTful API with the implementation you provide here and better integrate it with the rest of the MII code. However, I don't want to delay merging this. I can work on some refactoring and replacing the other RESTful API in a future PR.

Review threads (resolved): mii/entrypoints/api_server.py (2), mii/entrypoints/data_models.py (1), mii/entrypoints/openai_api_server.py (2, one outdated)
PawanOsman and others added 7 commits December 20, 2023 21:37 (several co-authored by Michael Wyatt <mrwyattii@gmail.com>)
@PawanOsman (Contributor, Author):

Thank you for all this work @PawanOsman! I left a few comments/suggestions.

It would be great if we could add unit tests and documentation. Documentation can go on the repo landing page or in a mii/entrypoints/README.md that we link from the landing page. I can take care of that if you do not have the time (you have already done a lot!)

I think this is a great addition to DeepSpeed-MII and I would like to replace the existing RESTful API with the implementation you provide here and better integrate it with the rest of the MII code. However, I don't want to delay merging this. I can work on some refactoring and replacing the other RESTful API in a future PR.

Thanks for the review and your feedback! I'm currently short on time and don't want to delay things. If you could take on the unit tests and documentation, that would be great.

@mrwyattii (Contributor):

@PawanOsman Can you please run formatting on your branch and then I can merge this? Sorry for the delay here!

pre-commit run --all-files

@mrwyattii merged commit 816cbd2 into microsoft:main on Feb 2, 2024
2 of 3 checks passed
@gangooteli:

Can you please provide an example of inference with the text generation API server?

Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application.

You can run the Text Generation RESTful API server using this command:

python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0

Then you can use it by sending an HTTP request.

Client Usage Example:

curl http://ip:port/generate \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": "Deepspeed is ",
     "max_tokens": 256,
     "temperature": 0.7
   }'

I am getting the error below while running this:

root@b6d65533ab32:/# python3 -m mii.entrypoints.api_server     --model "mistralai/Mistral-7B-Instruct-v0.1"     --port 3000     --host 0.0.0.0
[2024-02-13 22:30:07,088] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-13 22:30:07,798] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
[2024-02-13 22:30:07,798] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mii/entrypoints/api_server.py", line 190, in <module>
    mii.serve(args.model,
  File "/usr/local/lib/python3.10/dist-packages/mii/api.py", line 124, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/deepspeed-mii/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/usr/local/lib/python3.10/dist-packages/mii/backend/server.py", line 44, in __init__
    mii_config.generate_replica_configs()
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 302, in generate_replica_configs
    replica_pool = _allocate_devices(self.hostfile,
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 350, in _allocate_devices
    raise ValueError(
ValueError: Only able to place 0 replicas, but 1 replicas were requested.
root@b6d65533ab32:/# 

Also, when running with the option --load-balance "0.0.0.0:50050", the API starts.

But I get an Internal Server Error when running a curl command:
root@ai:~# curl http://0.0.0.0:8080/generate -H "Content-Type: application/json" -d '{
"prompt": "Deepspeed is ",
"max_tokens": 256,
"temperature": 0.7
}'
Internal Server Error

root@b6d65533ab32:/# python3 -m mii.entrypoints.api_server     --model "mistralai/Mistral-7B-Instruct-v0.1"     --port 8080     --host 0.0.0.0     --load-balance "0.0.0.0:50050"
[2024-02-13 22:28:10,898] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:     Started server process [883]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)



INFO:     172.17.0.1:33688 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/mii/entrypoints/api_server.py", line 98, in generate
    responseData = await stub.GeneratorReply(requestData)
  File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:50050: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:50050: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-02-13T22:28:14.61881733+00:00"}"
>
^CINFO:     Shutting down

Please help

@freQuensy23-coder:

I am hitting the same "Only able to place 0 replicas" problem.

@SeungminHeo:

I think the do_sample parameter needs to be configurable per request; some LLM models are trained to use greedy sampling.
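To illustrate the suggestion, a per-request override might look like the following (purely hypothetical; the merged /generate endpoint may not accept this field):

import requests

# "do_sample" here is the field being proposed, not necessarily one the
# current API supports; host and port are placeholders.
response = requests.post(
    "http://ip:port/generate",
    json={
        "prompt": "Deepspeed is ",
        "max_tokens": 256,
        "do_sample": False,  # ask for greedy decoding on this request
    },
)
print(response.json())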
