
Conversation

@wuhang2014 (Contributor) commented Sep 7, 2025

Purpose

Currently, all MCP sessions specified by a user request are initialized before the generation process starts, regardless of whether the tools are actually used.
This PR implements lazy initialization for MCP sessions: a session with the tool server is created only right before the corresponding tool is called.
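
For illustration, a minimal sketch of the lazy-initialization idea follows; the class name and the tool-server API below are assumptions for this description, not the PR's actual code:

from contextlib import AsyncExitStack

class LazyToolSessions:
    """Hypothetical sketch: create MCP sessions only on first tool use."""

    def __init__(self, tool_server):
        self._tool_server = tool_server          # assumed tool-server handle
        self._sessions: dict[str, object] = {}
        self._exit_stack = AsyncExitStack()

    async def get_session(self, tool_name: str):
        # Connect to the tool server only the first time this tool is called.
        if tool_name not in self._sessions:
            session_cm = self._tool_server.new_session(tool_name)   # assumed API
            self._sessions[tool_name] = await self._exit_stack.enter_async_context(session_cm)
        return self._sessions[tool_name]

    async def aclose(self):
        # Tear down only the sessions that were actually opened.
        await self._exit_stack.aclose()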

Test Plan

With a Python MCP tool server, send two requests that declare a code_interpreter tool, using the Responses API in background mode. One request does not use the tool, while the other uses the Python tool.

We expect the first request to create no MCP session with the tool server, while the second request creates a session with the tool server some time after the background task for the request is created.

import subprocess
import sys
from fastmcp import FastMCP

# Create FastMCP server instance
mcp = FastMCP("python")

@mcp.tool()
def python(code: str) -> str:
    """
    Execute Python code and return the output.

    Args:
        code: Python code to execute

    Returns:
        The output from executing the Python code
    """
    try:
        # Execute the Python code and capture output
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=30  # 30 second timeout for safety
        )

        if result.returncode == 0:
            return result.stdout if result.stdout else "Code executed successfully with no output"
        else:
            return f"Error: {result.stderr}"

    except subprocess.TimeoutExpired:
        return "Error: Code execution timed out (30 seconds)"
    except Exception as e:
        return f"Error: {str(e)}"

if __name__ == "__main__":
    # Run with SSE transport
    mcp.run(transport="sse", port=9999)
VLLM_ENABLE_RESPONSES_API_STORE=1 vllm serve /home/models/gpt-oss-20b -dp 8 --enable-expert-parallel --enforce-eager --tool-server 127.0.0.1:9999
import asyncio
import openai

MODEL = "/home/models/gpt-oss-20b"
client = openai.AsyncOpenAI(api_key="my_key", base_url="http://127.0.0.1:8000/v1")

async def main_background_lazy_init_mcp():
    inputs = [
        "tell me a story about a cat in 20 words",
        "Multiply 64548*15151 using python code interpreter."
    ]

    for prompt in inputs:
        print("="*36)
        print("="*36)
        print(prompt)
        print("="*36)
        response = await client.responses.create(
            model=MODEL,
            input=prompt,
            tools=[{
                "type": "code_interpreter",
                "container": {
                    "type": "auto"
                }
            }],
            stream=False,
            background=True,
        )

        print(f"{response=}")
        print("+"*36)
        retries = 0
        max_retries = 30
        while retries < max_retries:
            response = await client.responses.retrieve(response.id)
            print(f"{response=}")
            if response.status == "completed":
                break
            await asyncio.sleep(1)  # yield to the event loop while polling
            retries += 1

if __name__ == "__main__":
    asyncio.run(main_background_lazy_init_mcp())

Test Result

mcp server log

# python mock_mcp_server.py 


╭─ FastMCP 2.0 ──────────────────────────────────────────────────────────────╮
│                                                                            │
│        _ __ ___ ______           __  __  _____________    ____    ____     │
│       _ __ ___ / ____/___ ______/ /_/  |/  / ____/ __ \  |___ \  / __ \    │
│      _ __ ___ / /_  / __ `/ ___/ __/ /|_/ / /   / /_/ /  ___/ / / / / /    │
│     _ __ ___ / __/ / /_/ (__  ) /_/ /  / / /___/ ____/  /  __/_/ /_/ /     │
│    _ __ ___ /_/    \__,_/____/\__/_/  /_/\____/_/      /_____(_)____/      │
│                                                                            │
│                                                                            │
│                                                                            │
│    🖥️  Server name:     python                                              │
│    📦 Transport:       SSE                                                 │
│    🔗 Server URL:      http://127.0.0.1:9999/sse                           │
│                                                                            │
│    📚 Docs:            https://gofastmcp.com                               │
│    🚀 Deploy:          https://fastmcp.cloud                               │
│                                                                            │
│    🏎️  FastMCP version: 2.11.3                                              │
│    🤝 MCP version:     1.13.1                                              │
│                                                                            │
╰────────────────────────────────────────────────────────────────────────────╯


[09/08/25 08:57:40] INFO     Starting MCP server 'python' with transport 'sse' on http://127.0.0.1:9999/sse                                                                    server.py:1522
/home/wuhang/.venv/lib/python3.12/site-packages/websockets/legacy/__init__.py:6: DeprecationWarning: websockets.legacy is deprecated; see https://websockets.readthedocs.io/en/stable/howto/upgrade.html for upgrade instructions
  warnings.warn(  # deprecated in 14.0 - 2024-11-09
/home/wuhang/.venv/lib/python3.12/site-packages/uvicorn/protocols/websockets/websockets_impl.py:17: DeprecationWarning: websockets.server.WebSocketServerProtocol is deprecated
  from websockets.server import WebSocketServerProtocol
INFO:     Started server process [3876463]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:9999 (Press CTRL+C to quit)
INFO:     127.0.0.1:38216 - "GET /sse HTTP/1.1" 200 OK
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted
INFO:     127.0.0.1:38224 - "POST /messages/?session_id=3eef3774ccb74a0a9231ec89fcfcd740 HTTP/1.1" 202 Accepted

client log

# python openai_responses_client.py 
====================================
====================================
tell me a story about a cat in 20 words
====================================
response=Response(id='resp_e1d66c73a1ab4cbda09b42114a78cc38', created_at=1757321872.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130928, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='queued', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None)
++++++++++++++++++++++++++++++++++++
response=Response(id='resp_e1d66c73a1ab4cbda09b42114a78cc38', created_at=1757321872.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130928, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='queued', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None)
... (identical responses with status='queued' repeated while polling; omitted for brevity) ...
response=Response(id='resp_e1d66c73a1ab4cbda09b42114a78cc38', created_at=1757321872.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[ResponseReasoningItem(id='rs_82079fb4036246d082a328d4c427e348', summary=[], type='reasoning', content=[Content(text='The user asks: "tell me a story about a cat in 20 words". So presumably a short story of exactly 20 words. We need to output it. Keep exactly 20 words. The content: a cat story. Provide 20 words. Let\'s craft. Count.\n\nWe need 20 words. Let\'s craft: "Milo, a shy tuxedo cat, discovered hidden garden. He chased butterflies, befriended shy hedgehog, and returned home triumphant."\n\nCount words:\n\nMilo(1), a(2), shy(3), tuxedo(4), cat,(5), discovered(6), hidden(7), garden.(8) He(9) chased(10), butterflies,(11) befriended(12), shy(13), hedgehog,(14) and(15) returned(16) home(17) triumphant.(18)\n\nThat\'s 18 words. Need 2 more words. Add "in" "moonlight" maybe after home. Let\'s rephrase: "Milo,..." but currently 18 words. Let\'s add "under" and "moonlight" as words: "Milo, a shy tuxedo cat, discovered hidden garden. He chased butterflies, befriended shy hedgehog, and returned home triumphant under moonlight." Count again.\n\nMilo(1)\na(2)\nshy(3)\ntuxedo(4)\ncat,(5)\ndiscovered(6)\nhidden(7)\ngarden.(8)\nHe(9)\nchased(10)\nbutterflies,(11)\nbefriended(12)\nshy(13)\nhedgehog,(14)\nand(15)\nreturned(16)\nhome(17)\ntriumphant(18)\nunder(19)\nmoonlight(20).\n\nYes 20. Good. That seems fine. User asked for story about a cat in 20 words. Provide output.', type='reasoning_text')], encrypted_content=None, status=None), ResponseOutputMessage(id='msg_d6404d23fef74a78add9cf01a4f95d1f', content=[ResponseOutputText(annotations=[], text='Milo, a shy tuxedo cat, discovered hidden garden. He chased butterflies, befriended shy hedgehog, and returned home triumphant under moonlight.', type='output_text', logprobs=None)], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130928, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=144, input_tokens_details=InputTokensDetails(cached_tokens=128), output_tokens=434, output_tokens_details=OutputTokensDetails(reasoning_tokens=391, tool_output_tokens=0), total_tokens=578), user=None)
====================================
====================================
Multiply 64548*15151 using python code interpreter.
====================================
response=Response(id='resp_9ddfe579778149ce85d6045a544a5554', created_at=1757321893.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130927, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='queued', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None)
++++++++++++++++++++++++++++++++++++
response=Response(id='resp_9ddfe579778149ce85d6045a544a5554', created_at=1757321893.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130927, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='queued', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None)
... (identical responses with status='queued' repeated while polling; omitted for brevity) ...
response=Response(id='resp_9ddfe579778149ce85d6045a544a5554', created_at=1757321893.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='/home/models/gpt-oss-20b', object='response', output=[ResponseReasoningItem(id='rs_bf4b873e5cb740cd9baa95b5703d1f3e', summary=[], type='reasoning', content=[Content(text='We need to multiply 64548 * 15151 using python. Compute.', type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_d67410f92f0f430fb853ca5fa35dbe1c', summary=[], type='reasoning', content=[Content(text='64548 * 15151', type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_5479d468175146ce82ea94ad67ec1cb9', summary=[], type='reasoning', content=[Content(text="We need to provide the result. Let's compute manually? Use python.", type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_8a1321c1d27f47409f3189bca8a480c6', summary=[], type='reasoning', content=[Content(text='64548 * 15151', type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_74bb57d2afe54bba8b0df23073ee4df4', summary=[], type='reasoning', content=[Content(text="It didn't output. I forgot to print. Let's do it.", type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_f2e8f9eba09549fca5114c2e8b81c7db', summary=[], type='reasoning', content=[Content(text='print(64548 * 15151)', type='reasoning_text')], encrypted_content=None, status=None), ResponseReasoningItem(id='rs_09f44cf3bbfc4bc297b98a3c0e1cb737', summary=[], type='reasoning', content=[Content(text='Therefore answer is 977,966,748. Provide the result.', type='reasoning_text')], encrypted_content=None, status=None), ResponseOutputMessage(id='msg_95e69c0aa9d24c519b484df6d3d3b29f', content=[ResponseOutputText(annotations=[], text='The product of\u202f64548\u202fand\u202f15151\u202fis:\n\n\\[\n64548 \\times 15151 = 977\\,966\\,748\n\\]', type='output_text', logprobs=None)], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[CodeInterpreter(container=CodeInterpreterContainerCodeInterpreterToolAuto(type='auto', file_ids=None), type='code_interpreter')], top_p=1.0, background=True, max_output_tokens=130768, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=902, input_tokens_details=InputTokensDetails(cached_tokens=656), output_tokens=165, output_tokens_details=OutputTokensDetails(reasoning_tokens=85, tool_output_tokens=51), total_tokens=1067), user=None)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@wuhang2014 wuhang2014 requested a review from aarnphm as a code owner September 7, 2025 04:30
@mergify mergify bot added frontend gpt-oss Related to GPT-OSS models labels Sep 7, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the tool session initialization to be lazy by making ConversationContext an async context manager. This is a good change that centralizes resource management. The implementation uses AsyncExitStack within HarmonyContext to correctly manage the lifecycle of tool sessions. However, I've identified a critical issue in call_tool where a coroutine is passed as an argument instead of its awaited result, which would lead to a runtime error. My review includes a fix for this issue.
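
For context, the bug class the bot describes looks roughly like the following; the function names here are hypothetical, not the PR's code:

async def get_session():
    ...   # returns a coroutine object until awaited

async def call_tool_buggy(tool, args):
    # BUG: passes the coroutine object itself instead of the session it produces.
    return await tool.invoke(get_session(), args)

async def call_tool_fixed(tool, args):
    session = await get_session()   # await first, then pass the result
    return await tool.invoke(session, args)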

@mergify bot commented Sep 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wuhang2014.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 9, 2025
@heheda12345 (Collaborator)

Also CC @yeqcharlotte

@yeqcharlotte (Collaborator)

thanks for the change and the test plan @wuhang2014. could you rebase it after #23386? could you turn more of the test plan into unit tests? also maybe double check that the e2e aime25 with-tool eval score still makes sense. cc: @alecsolder @lacora

@yeqcharlotte (Collaborator) left a comment

please rebase and add the unit test coverage

@github-project-automation github-project-automation bot moved this from To Triage to In progress in gpt-oss Issues & Enhancements Sep 14, 2025
self._reference_count += 1
if self._async_exit_stack is None:
    assert (self._reference_count == 1
            ), "Reference count of exit stack should be "

nit: can you make the assertion message more meaningful?
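
For illustration, a hedged sketch of the reference-counted exit-stack pattern the snippet appears to implement, with a more descriptive assertion message; all names are assumptions rather than the PR's code:

from contextlib import AsyncExitStack

class RefCountedExitStack:
    """Hypothetical sketch: share one AsyncExitStack across tool sessions."""

    def __init__(self):
        self._reference_count = 0
        self._async_exit_stack: AsyncExitStack | None = None

    async def acquire(self) -> AsyncExitStack:
        self._reference_count += 1
        if self._async_exit_stack is None:
            # A more descriptive assertion message, as the review suggests.
            assert self._reference_count == 1, (
                "exit stack should only be created for the first holder, but "
                f"the reference count is already {self._reference_count}"
            )
            self._async_exit_stack = AsyncExitStack()
        return self._async_exit_stack

    async def release(self) -> None:
        self._reference_count -= 1
        if self._reference_count == 0 and self._async_exit_stack is not None:
            await self._async_exit_stack.aclose()
            self._async_exit_stack = None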

@alecsolder (Contributor)

I like a lot of what is going on here, and the implementation looks a lot cleaner than before, but a couple comments:

First, I think we should absolutely only init connections for tools that are enabled in the request, not for all built-in tools registered to the tool server. This PR definitely solves that.

However I think this lazy init should be enabled as an option, and not by default.

I think there are a few different reasons why we could want the option for the connections to happen before any generation steps have taken place:

  • Failing fast before any generation has started:
    • In case there are issues with the MCP server itself, it is valuable to be able to quickly fail a request.
  • Where in the state of the request the exception gets thrown:
    • If the MCP server is unavailable, we will now get an exception after generation. Instead of being able to raise a very obvious exception that fails the entire request, we should respond with what the model did generate plus an error saying we can't connect to the tool server. Ideally that would be something like responding with a ResponseFunctionWebSearch that has a failed status, though that is not a great place for an error message.
  • Future MCP Tool support:
    • Currently, the tool server is static and is initialized on startup. It pulls tool descriptions once on startup and uses them for all requests afterwards. In the future, as we continue to support more Responses API features and add things like the MCP Tool, we would need to call list_tools before every request (with options for caching), which means we would need to establish the SSE connection at the start of a request anyway.
  • More complex tools + MCP servers:
    • As the tools that are being used get more complicated, the MCP servers will play more of an active role in a Responses API request. For example:
      • Auth headers are very commonly required to use MCP servers, and validating auth headers should likely happen before generation to fail fast.
      • Sometimes tools take some time to initialize. It can take a minute for a container to fully start up so that code can be executed in it safely. This can be done when the initial connection to the MCP server is established. When it is instead done after the first generation happens, we occupy KV cache for that generation for an additional minute compared to doing it beforehand.

TLDR and my actual suggestions:

  • Keep both init strategies as options
    • Maybe it could be implemented as having two tool server classes, one that is lazy and one that is not lazy, which could be selected when running serve
  • Fix up the original strategy to only establish connections for tools enabled on the request (in request.tools)
  • Add unit tests to verify the expected Responses API results when establishing an SSE connection to the MCP server fails at tool call time (a sketch of such a test follows below)
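
A hedged sketch of such a unit test, assuming pytest-asyncio and the hypothetical LazyToolSessions sketch from the PR description above; the mock and test names are not vLLM's actual test API:

from unittest.mock import MagicMock

import pytest

@pytest.mark.asyncio
async def test_tool_call_fails_when_sse_connection_fails():
    tool_server = MagicMock()
    # Simulate the SSE connection failing only when the session is created.
    tool_server.new_session.side_effect = ConnectionError("SSE connect failed")

    context = LazyToolSessions(tool_server)
    with pytest.raises(ConnectionError):
        await context.get_session("python")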

@heheda12345 (Collaborator)

@alecsolder Thanks!
Agree that we can have an option on whether to do lazy initialization. But I still feel that we should enable lazy initialization by default.

What about adding a config to control whether to enable lazy initialization in this PR and enable lazy initialization by default? We can discuss the better default choice without blocking this PR.

Maybe it could be implemented as having two tool server classes

I don't think we need it. We can first initialize the tool server, and if lazy initialization is disabled, create the connections to the MCP tool server using the same API as lazy initialization.
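
Sketched under the same assumed names as above: one session API, and a flag that decides whether the connections are opened eagerly at request start or lazily at tool-call time:

async def prepare_tool_sessions(context, request_tools, lazy_init: bool):
    if lazy_init:
        return   # sessions will be opened on the first actual tool call
    # Eager path: open sessions now, but only for tools enabled on the request.
    for tool_name in request_tools:
        await context.get_session(tool_name)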

Fix up the original strategy to only establish connections for tools enabled on the request (in request.tools)

I agree

Add unit tests to verify the expected Responses API results where establishing a SSE connection to the MCP sever fails at tool call time

I agree

Discussion about which option is better:

Failing fast before any generation has started:

With lazy initialization, we can still serve simple requests that don't need tool calls. To my understanding, simple requests will still enable all tools because there isn't a layer to detect that a request is simple and disable the tools. And if you do want fast failing, maybe we can check the health of the MCP server every few minutes, without affecting the TTFT of all requests.
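
The periodic health check could look roughly like this; the ping() probe and the healthy flag are assumptions, not an existing vLLM API:

import asyncio

async def mcp_health_check_loop(tool_server, interval_s: float = 300.0):
    while True:
        try:
            await tool_server.ping()      # assumed lightweight health probe
            tool_server.healthy = True
        except Exception as exc:
            tool_server.healthy = False   # assumed flag consulted on new requests
            print(f"MCP tool server unhealthy: {exc}")
        await asyncio.sleep(interval_s)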

Where in the state of the request the exception gets thrown:

I think we should return the ResponseReasoningItems and a ResponseFunctionWebSearch with status set to failed in this case. Lazy initialization doesn't block this behavior.

Future MCP Tool support

The result of list_tools needs to be cached as it doesn't change very frequently. We don't need to check it for every request.
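
A minimal sketch of such a cache, assuming the standard MCP session.list_tools() call; the class name and TTL are illustrative:

import time

class ToolListCache:
    def __init__(self, ttl_s: float = 600.0):
        self._ttl_s = ttl_s
        self._cached = None
        self._fetched_at = 0.0

    async def get(self, session):
        # Refresh only when the cached listing is older than the TTL.
        if self._cached is None or time.monotonic() - self._fetched_at > self._ttl_s:
            self._cached = await session.list_tools()
            self._fetched_at = time.monotonic()
        return self._cached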

More complex tools + MCP servers:

Auth headers: Can you explain it a bit more? Is it something like we need to pass an EXA_API_KEY to use the browser tool? Do all MCP tools share the same auth result?

Sometimes tools take some time to initialize: That's a good problem. Do we need to overlap the initialization with the first round of generation?

@alecsolder
Copy link
Contributor

@alecsolder Thanks! Agree that we can have an option on whether to do lazy initialization. But I still feel that we should enable lazy initialization by default.

What about adding a config to control whether to enable lazy initialization in this PR and enable lazy initialization by default? We can discuss the better default choice without blocking this PR.

Sounds good!

Maybe it could be implemented as having two tool server classes

I don't think we need it. We can first initialize the tool server, and if lazy initialization is disabled, create the connections to the MCP tool server using the same API as lazy initialization.

Works for me

Failing fast before any generation has started:

With lazy initialization, we can still serve simple requests that don't need tool calls. To my understanding, simple requests will still enable all tools because there isn't a layer to detect that a request is simple and disable the tools. And if you do want fast failing, maybe we can check the health of the MCP server every few minutes, without affecting the TTFT of all requests.

I think this is a good strategy, sounds more than good enough for now.

Where in the state of the request the exception gets thrown:

I think we should return the ResponseReasoningItems and a ResponseFunctionWebSearch with status set to failed in this case. Lazy initialization doesn't block this behavior.

Yeah agreed, I think that part would just need to be implemented as part of this PR

Future MCP Tool support

The result of list_tools needs to be cached as it doesn't change very frequently. We don't need to check it for every request.

Yep, understood. I've seen it be a bit annoying for development purposes because now, if I want to iterate on my MCP server prompts, I have to restart vLLM every single time. But this is definitely more of an issue for the "remote MCP" feature than for the built-in tools that are currently supported.

OpenAI gives some information on how the list_tools request is cached here. It is actually emitted as an event, and is cached on a request-by-request basis, which makes sense for an enterprise but is definitely a bit much here. Just mentioning it so we can consider it for the future!

More complex tools + MCP servers:

Auth headers: Can you explain it a bit more? Is it something like we need to pass an EXA_API_KEY to use the browser tool? Do all MCP tools share the same auth result?

There is good info on auth headers here: basically, if you define headers on the tool object, they are passed through to the MCP server. It is like if I were using Stripe, I'd want to define my stripe_api_key as part of my Responses API request. That is how Stripe hosts one MCP service everyone can use.
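
For reference, OpenAI's hosted MCP tool in the Responses API accepts per-request headers roughly like this; the server URL, model name, and key handling are illustrative:

import os
from openai import OpenAI

client = OpenAI()   # illustrative; a vLLM deployment would be targeted via base_url
response = client.responses.create(
    model="gpt-oss-20b",   # placeholder model name
    input="Create a payment link for $20",
    tools=[{
        "type": "mcp",
        "server_label": "stripe",
        "server_url": "https://mcp.stripe.com",   # illustrative server URL
        # Per-request auth header forwarded to the MCP server.
        "headers": {"Authorization": f"Bearer {os.environ['STRIPE_API_KEY']}"},
    }],
)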

Sometimes tools take some time to initialize: That's a good problem. Do we need to overlap the initialization with the first round of generation?

I think this would be ideal!
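
A hedged sketch of that overlap, with assumed engine and context names: start opening the sessions concurrently with the first generation step and only await the task when a tool call is actually emitted.

import asyncio

async def generate_with_overlapped_init(engine, context, request):
    # Kick off session initialization without blocking the first decode step.
    init_task = asyncio.create_task(context.init_tool_sessions(request.tools))  # assumed API
    output = await engine.generate_step(request)        # assumed engine API
    if output.needs_tool_call:                           # assumed flag on the output
        await init_task                                  # sessions ready before the call
        output = await context.call_tool(output)
    return output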

@heheda12345 (Collaborator)

@wuhang2014
In short, can you help to do the following in this PR?

  1. Make lazy init optional and enable it by default. If lazy init is disabled, only establish connections for tools enabled on the request (in request.tools). One possible approach is to first initialize the tool server, and if lazy initialization is disabled, create the connections to the MCP tool server using the same API as lazy initialization.
  2. Handle the failure of lazy initialization. One possible way is to return the generated text plus a ResponseFunctionWebSearch with status=failed (a rough sketch follows below). And add the necessary unit tests for this.
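
A rough sketch of that failure handling, with assumed names; the dict literal stands in for the actual ResponseFunctionWebSearch output item:

async def call_tool_or_mark_failed(context, tool_call):
    try:
        session = await context.get_session(tool_call.tool_name)
        return await session.call_tool(tool_call.name, tool_call.arguments)
    except ConnectionError:
        # Could not reach the MCP server at tool-call time: keep what was
        # generated and surface a failed tool-call item instead of erroring out.
        return {"type": "web_search_call", "id": tool_call.id, "status": "failed"}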

@heheda12345 (Collaborator)

@alecsolder can you help to create issues to track the TODOs that are important for you?

@wuhang2014 (Contributor, Author)

thanks for the change and the test plan @wuhang2014. could you rebase it after #23386? could you turn more of the test plan into unit tests? also maybe double check that the e2e aime25 with-tool eval score still makes sense. cc: @alecsolder @lacora

I will rebase it and add unit tests in a few days

@wuhang2014 (Contributor, Author)

@wuhang2014 In short, can you help to do the following in this PR?

  1. Make lazy init optional and enable it by default. If lazy init is disabled, only establish connections for tools enabled on the request (in request.tools). One possible approach is to first initialize the tool server, and if lazy initialization is disabled, create the connections to the MCP tool server using the same API as lazy initialization.
  2. Handle the failure of lazy initialization. One possible way is to return the generated text plus a ResponseFunctionWebSearch with status=failed. And add the necessary unit tests for this.

Yes, I can do this in a few days.

@mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wuhang2014.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
@mergify mergify bot removed the needs-rebase label Oct 15, 2025
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
@mergify bot commented Oct 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wuhang2014.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 23, 2025

Labels

frontend gpt-oss Related to GPT-OSS models needs-rebase

Projects

Status: In progress

4 participants