
remote-vllm not working with builtin::websearch tools #1277

@luis5tb

Description


System Info

llama-stack from the master branch

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

The following error appears on the Llama Stack server when using the remote-vllm provider:

Traceback (most recent call last):
  File "/app/llama-stack-source/llama_stack/distribution/server/server.py", line 208, in sse_generator
    async for item in event_gen:
  File "/app/llama-stack-source/llama_stack/providers/inline/agents/meta_reference/agents.py", line 165, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/app/llama-stack-source/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 197, in create_and_execute_turn
    async for chunk in self.run(
  File "/app/llama-stack-source/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 389, in run
    async for res in self._run(
  File "/app/llama-stack-source/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 631, in _run
    async for chunk in await self.inference_api.chat_completion(
  File "/app/llama-stack-source/llama_stack/distribution/routers/routers.py", line 191, in <genexpr>
    return (chunk async for chunk in await provider.chat_completion(**params))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 327, in _stream_chat_completion
    async for chunk in res:
  File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 170, in _process_vllm_chat_completion_stream_response
    choice = chunk.choices[0]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range
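
For context, the failure comes from indexing chunk.choices[0] without first checking whether the list is empty. Below is a minimal sketch of the kind of guard that would avoid the IndexError; process_stream is a placeholder name, not the actual provider code, and the assumption is that vLLM's OpenAI-compatible stream can emit chunks with an empty choices list (for example a trailing usage-only chunk) when tools are involved:

async def process_stream(stream):
    # Placeholder sketch: skip streamed chunks that carry no choices before
    # indexing choices[0], instead of assuming every chunk has at least one.
    async for chunk in stream:
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        yield choice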

Steps to reproduce:

Build a container image with the following command:
$ llama stack build --config build.yaml --image-type container --image-name vllm-tools

where build.yaml contains:

version: '2'
distribution_spec:
  description: Use (an external) vLLM server for running LLM inference
  providers:
    inference:
    - remote::vllm
    - inline::sentence-transformers
    vector_io:
    - inline::faiss
    - remote::chromadb
    safety:
    - inline::llama-guard
    agents:
    - inline::meta-reference
    eval:
    - inline::meta-reference
    datasetio:
    - remote::huggingface
    - inline::localfs
    scoring:
    - inline::basic
    - inline::llm-as-judge
    - inline::braintrust
    telemetry:
    - inline::meta-reference
    tool_runtime:
    - remote::tavily-search
    - inline::code-interpreter
    - inline::rag-runtime
    - remote::model-context-protocol
  container_image: registry.access.redhat.com/ubi9
image_type: container

Run it with the resulting run.yaml and the appropriate environment variables set:

$ podman run --security-opt label=disable -it --network host \
  -v ~/.llama/distributions/vllm-tools/vllm-tools-run.yaml:/app/config.yaml \
  -v ~/toolbox_utils/llama-stack:/app/llama-stack-source \
  --env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
  --env VLLM_API_TOKEN=$VLLM_API_TOKEN \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=$VLLM_URL \
  --env TAVILY_SEARCH_API_KEY=$TAVILY_SEARCH_API_KEY \
  --entrypoint='["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]' localhost/vllm-tools:dev

Then test it through the agent with the following Python code:

>>> from llama_stack_client.lib.agents.agent import Agent
>>> from llama_stack_client.lib.agents.event_logger import EventLogger
>>> from llama_stack_client.types.agent_create_params import AgentConfig
>>> from termcolor import cprint
>>> from llama_stack_client import LlamaStackClient
>>> 
>>> def create_client(llamastack_server_endpoint):
...     client = LlamaStackClient(
...         base_url=llamastack_server_endpoint)
...     return client
...
>>> client = create_client("http://localhost:5001")
>>> model_id = "granite-3-8b-instruct" 
>>> agent_config = AgentConfig(                                
...     model=model_id,                                        
...     instructions="You are a helpful assistant",
...     toolgroups=["builtin::websearch"],                     
...     input_shields=[],                                      
...     output_shields=[],                                     
...     enable_session_persistence=False,                      
... )                                                         
>>> agent = Agent(client, agent_config)                        
>>> user_prompts = [                                           
...     "Hello",                                               
...     "Which teams played in the NBA western conference finals of 2024",
... ]                                                          
>>> session_id = agent.create_session("test-session")
>>> for prompt in user_prompts:
...     cprint(f"User> {prompt}", "green")                     
...     response = agent.create_turn(                          
...         messages=[                                         
...             {                                              
...                 "role": "user",                            
...                 "content": prompt,                         
...             }                                              
...         ],                                                 
...         session_id=session_id,                             
...     )                                                      
...     for log in EventLogger().log(response):                
...         log.print()                                        
...                                                            
User> Hello                                                    
inference> Traceback (most recent call last):                  
  File "<stdin>", line 12, in <module>                         
  File "/home/ltomasbo/toolbox_utils/llama-stack/stack-feb25/lib64/python3.12/site-packages/llama_stack_client/lib/agents/event_logger.py", line 163, in log
    for chunk in event_generator:                              
                 ^^^^^^^^^^^^^^^                               
  File "/home/ltomasbo/toolbox_utils/llama-stack/stack-feb25/lib64/python3.12/site-packages/llama_stack_client/lib/agents/agent.py", line 165, in _create_turn_streaming
    tool_calls = self._get_tool_calls(chunk)                   
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^                   
  File "/home/ltomasbo/toolbox_utils/llama-stack/stack-feb25/lib64/python3.12/site-packages/llama_stack_client/lib/agents/agent.py", line 61, in _get_tool_calls
    if chunk.event.payload.event_type not in {"turn_complete", "turn_awaiting_input"}:                                         
       ^^^^^^^^^^^^^^^^^^^                                     
AttributeError: 'NoneType' object has no attribute 'payload'
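
As a client-side workaround sketch (not a fix for the server bug), the streaming response can be iterated directly and chunks without an event skipped, instead of handing everything to EventLogger. This assumes each streamed chunk exposes an optional event attribute, which is what the AttributeError above suggests:

>>> response = agent.create_turn(
...     messages=[{"role": "user", "content": prompt}],
...     session_id=session_id,
... )
>>> for chunk in response:
...     event = getattr(chunk, "event", None)
...     if event is None:
...         # Server-side errors (like the IndexError above) arrive without
...         # an event payload; print them instead of crashing.
...         print("chunk without event:", chunk)
...         continue
...     print(event.payload.event_type)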

Error logs

(Same traceback as in the bug description above.)

Expected behavior

The tool should be called and the agent/LLM should process its response.
