feat: OpenAI Responses API #1989
Conversation
Status as of April 18

I've moved the tests for the OpenAI Responses API over to `tests/verifications/openai_api/test_response.py`.

Testing

To test this locally, ensure you have
Then, run the specific provider verification tests you want. Examples that I've run are below, along with their pass/fail rate.

OpenAI - 100% Passing

This hits the hosted API on openai.com directly, i.e. not using Llama Stack. This is to ensure our verification tests are passing against an implementation besides Llama Stack.
OpenAI via Llama Stack - 100% Passing

This uses Llama Stack's implementation of the OpenAI Responses API, handing the actual inference off to OpenAI as the backend inference provider. An important distinction here is that Llama Stack is doing all of the Responses API handling, and we're only using the Chat Completions API against the backend inference provider, which in this case is OpenAI. This ensures we're handling the nuances of the Responses API properly by isolating any differences introduced by using other inference providers or models.
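As a rough sketch of what this comparison looks like from the client side (not code from this PR; the Llama Stack base URL and the `TEST_TARGET` switch are assumptions for illustration), the same Responses call can target either the hosted API or a Llama Stack server:

```python
# Illustrative sketch only -- not code from this PR. The Llama Stack base_url
# and the TEST_TARGET environment variable are assumptions for illustration.
import os
from openai import OpenAI

if os.environ.get("TEST_TARGET") == "llama-stack":
    # Llama Stack handles the Responses API itself and forwards inference to
    # the backend provider over the Chat Completions API.
    client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
else:
    # Hit the hosted API on openai.com directly (uses OPENAI_API_KEY).
    client = OpenAI()

response = client.responses.create(model="gpt-4o", input="Say hello.")
print(response.output_text)
```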
together.ai via Llama Stack - 100% Passing

Now the tests get more interesting, as we start testing against Llama 3.3 and Llama 4 model variants using a backend inference provider that has no native OpenAI Responses API support.
Fireworks AI via Llama Stack - 77% Passing

This also uses Llama 3.3 and Llama 4 models. The web search tool tests are failing with this one, which appears to be an issue with how tool calls are being constructed by the backend inference service.
Untested and/or not working yet
Signed-off-by: Ben Browning <bbrownin@redhat.com>
This extracts out a helper to convert previous responses into messages and to convert OpenAI choices (from a chat completion response) into output messages for the OpenAI Responses output (see the sketch after this commit list). Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
This moves the OpenAI Responses API tests under tests/verifications/openai_api/test_response.py and starts to wire them up to our verification suite, so that we can test multiple providers as well as OpenAI directly for the Responses API. Signed-off-by: Ben Browning <bbrownin@redhat.com>
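For intuition only, here is a hypothetical sketch of such a choice-to-output-message conversion. It is not the actual helper from this PR; the dict field names simply follow the public OpenAI Responses output-message shape.

```python
# Hypothetical sketch, not the helper added in this PR. Converts one chat
# completion "choice" dict into a Responses-style output message; field names
# follow the public OpenAI Responses output-message shape.
def choice_to_output_message(choice: dict) -> dict:
    message = choice["message"]
    return {
        "type": "message",
        "role": message.get("role", "assistant"),
        "status": "completed",
        "content": [
            {
                "type": "output_text",
                "text": message.get("content") or "",
                "annotations": [],
            }
        ],
    }
```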
I verified that:

worked for all three models. There was a small issue around tool calling, which I fixed.
Given that this represents a major advance (even though the implementation is not complete or robust) in terms of what the Stack can offer, I am landing it :) And also want to avoid conflicts!! :)
Nice, thanks for the fixes and landing this! I'll work on setting up some nightly testing for this across lots of our providers so we have a baseline to track and expand compatibility with over time.
I ran the new OpenAI Responses API verification tests from
It will take me a bit longer to get some nightly testing set up for this and to get some manual test results using the ollama and remote-vllm providers. But the changes look good across the hosted providers generally so far.
client-side support for the in-progress API surface: llamastack/llama-stack#1989
MCP can work transparently via server-side execution. But custom tools need to come back as a tool call to the client, right?
Yes, custom tools will come back to the client as a tool call, i.e. https://platform.openai.com/docs/guides/function-calling?api-mode=responses. We will have a decision to make (or expose the decision to the person running the Llama Stack server) as to which (if any) MCP tools we invoke automatically via server-side execution and which we send back to the client. In other words, we may choose to allow the person configuring a Stack to extend the list of "builtin" tools that automatically get called from an OpenAI Responses API point of view to include some MCP tools.
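To sketch what "coming back to the client as a tool call" looks like in practice, here is a hypothetical client-side round trip based on the public Responses API function-calling flow, not on this PR's code; the base URL and model name are placeholders.

```python
# Hypothetical sketch of the client-side function-call round trip. The base_url
# and model are placeholder assumptions, not confirmed by this PR.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    input="What's the weather in Paris?",
    tools=tools,
)

# Custom tools are not executed server-side; they come back as function_call
# output items that the client must run itself and then answer.
for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        result = f"Sunny in {args['city']}"  # pretend tool execution
        followup = client.responses.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            previous_response_id=response.id,
            input=[{
                "type": "function_call_output",
                "call_id": item.call_id,
                "output": result,
            }],
            tools=tools,
        )
        print(followup.output_text)
```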
# What does this PR do?

This provides an initial [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) implementation. The API is not yet complete, and this is more a proof-of-concept to show how we can store responses in our key-value stores and use them to support Responses API concepts like `previous_response_id`.

## Test Plan

I've added a new `tests/integration/openai_responses/test_openai_responses.py` as part of test-driven development for this new API. I'm only testing this locally with the remote-vllm provider for now, but it should work with any of our inference providers, since the only API it requires from the inference provider is the `openai_chat_completion` endpoint.

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```

```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
  tests/integration/openai_responses/test_openai_responses.py \
  --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
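To make the `previous_response_id` concept concrete, here is a minimal client-side sketch (not part of this PR) that chains two responses; the Llama Stack base URL path is an assumption.

```python
# Minimal sketch, not from this PR: chains two Responses API calls via
# previous_response_id. The base_url path is an assumption about how a
# Llama Stack server exposes its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",  # assumed Llama Stack endpoint
    api_key="none",  # local server; the key is unused
)

first = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Name a city in France.",
)

# The second request references the first, so the server can replay the stored
# conversation history before appending the new input.
followup = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="What river runs through it?",
    previous_response_id=first.id,
)
print(followup.output_text)
```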