Conversation

@zuxin666
Contributor

@zuxin666 zuxin666 commented Apr 25, 2025

Description

This PR adds support for xLAM-2 models in vLLM's tool calling feature. The xLAM tool parser is designed to support models that generate tool calls in various JSON formats, including Salesforce's Llama-xLAM and Qwen-xLAM models.

Key highlights:

  1. Implemented xLAMToolParser class that can detect function calls in multiple output styles:

    • Direct JSON arrays
    • JSON within <think>...</think> tags
    • JSON within code blocks
    • JSON within [TOOL_CALLS] tags
    • JSON within <tool_call>...</tool_call> tags
  2. Added support for both streaming and non-streaming modes for tool calls

  3. Implemented robust JSON parsing with fallback mechanisms to handle various output formats

  4. Added support for parallel function calls with effective separation of text content from tool calls
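The wrapper formats listed above can be unified before JSON parsing. The sketch below is illustrative only (not the PR's actual xLAMToolParser implementation, which also handles streaming state and more edge cases): strip any reasoning block, unwrap the known delimiters, then attempt a strict JSON parse.

```python
import json
import re


def extract_tool_call_json(text: str):
    """Unwrap known xLAM output formats and parse a list of tool calls.

    Illustrative sketch only; returns None if no valid JSON is found.
    """
    # Drop a <think>...</think> reasoning block if present.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    # Unwrap the other formats the model may emit.
    for pattern in (
            r"`{3}(?:json)?\s*(.*?)\s*`{3}",        # fenced code block
            r"\[TOOL_CALLS\]\s*(\[.*\])",           # [TOOL_CALLS] prefix
            r"<tool_call>\s*(.*?)\s*</tool_call>",  # tag-wrapped
    ):
        m = re.search(pattern, text, re.DOTALL)
        if m:
            text = m.group(1).strip()
            break
    try:
        calls = json.loads(text)
        # Normalize a single call object into a one-element list.
        return calls if isinstance(calls, list) else [calls]
    except json.JSONDecodeError:
        return None
```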

Supported Models

  • Salesforce Llama-xLAM models: Salesforce/Llama-xLAM-2-8B-fc-r, Salesforce/Llama-xLAM-2-70B-fc-r
  • Qwen-xLAM models: Salesforce/xLAM-1B-fc-r, Salesforce/xLAM-3B-fc-r, Salesforce/xLAM-32B-fc-r

Fix

Enhances vLLM's tool calling capability by adding support for the xLAM-2 model family.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation frontend tool-calling labels Apr 25, 2025
@mgoin mgoin requested review from mgoin and russellb April 25, 2025 21:00
Member

@mgoin mgoin left a comment


Looks reasonable to me, thanks for the clear code and testing. Just a few comments. It would be nice to add a dedicated format example to examples/offline_inference or online_serving

@zuxin666
Contributor Author

Hi @mgoin , would you mind taking a look again for this PR? Thank you!

@dhaneshsabane

@zuxin666

Tried this parser on my hosted Salesforce/xLAM-32B-fc-r as a custom parser plugin for vLLM and faced the following error:

llm-1  | INFO:     172.18.0.1:37142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
llm-1  | INFO 05-05 09:22:04 [async_llm.py:228] Added request chatcmpl-e543138e0c3647f197935cbc69e5234d.
llm-1  | Error in streaming tool calls
llm-1  | Traceback (most recent call last):
llm-1  |   File "/xlam_tool_parser.py", line 230, in extract_tool_calls_streaming
llm-1  |     function_name = current_tool_call.get("name")
llm-1  |                     ^^^^^^^^^^^^^^^^^^^^^
llm-1  | AttributeError: 'list' object has no attribute 'get'

A potential bug?
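The traceback suggests the streaming parser sometimes held the whole parsed array where a single call object was expected, so .get("name") blew up on a list. A hedged sketch of the kind of guard that avoids this (the function name and structure are illustrative, not the fix that actually landed):

```python
def get_function_name(parsed):
    """Return the name of the current tool call, tolerating either a
    single call object or a full array of calls (illustrative sketch)."""
    # The model may emit [{"name": ..., "arguments": ...}, ...]; if the
    # streaming parser hands us the whole list, look at the last entry.
    if isinstance(parsed, list):
        if not parsed:
            return None
        parsed = parsed[-1]
    if isinstance(parsed, dict):
        return parsed.get("name")
    return None
```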

@zuxin666
Contributor Author

zuxin666 commented May 8, 2025

Hi @dhaneshsabane , thanks for catching this. I have fixed the streaming issue. I used the following test script to test our xLAM models, and it works well:

Serving:

vllm serve Salesforce/Llama-xLAM-2-8b-fc-r --enable-auto-tool-choice --tool-call-parser xlam

Testing scripts:

import json
import time

from openai import OpenAI

# Connect to vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")


# Define tool functions
def get_weather(location: str, unit: str):
    return f"Weather in {location} is 22 degrees {unit}."


def calculate_expression(expression: str):
    try:
        # eval() is fine for a local demo but should not be used on
        # untrusted input in production.
        result = eval(expression)
        return f"The result of {expression} is {result}"
    except Exception:
        return f"Could not calculate {expression}"


def search_info(query: str):
    return f"Search results for '{query}': Found multiple relevant documents."


def translate_text(text: str, target_language: str):
    return f"Translation of '{text}' to {target_language}: [translated content]"


# Define tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g., 'San Francisco, CA'"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location", "unit"]
        }
    }
}, {
    "type": "function",
    "function": {
        "name": "calculate_expression",
        "description": "Calculate a mathematical expression",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Mathematical expression to evaluate"
                }
            },
            "required": ["expression"]
        }
    }
}, {
    "type": "function",
    "function": {
        "name": "search_info",
        "description": "Search for information on a topic",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    }
}, {
    "type": "function",
    "function": {
        "name": "translate_text",
        "description": "Translate text to another language",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {
                    "type": "string",
                    "description": "Text to translate"
                },
                "target_language": {
                    "type": "string",
                    "description": "Target language for translation"
                }
            },
            "required": ["text", "target_language"]
        }
    }
}]

# Map of function names to implementations
tool_functions = {
    "get_weather": get_weather,
    "calculate_expression": calculate_expression,
    "search_info": search_info,
    "translate_text": translate_text
}


def process_stream(response, tool_functions):
    """Process a streaming response with possible tool calls"""
    function_name = None
    function_args = ""
    function_id = None

    print("\n--- Stream Output ---")
    for chunk in response:
        # Handle tool calls in the stream
        if chunk.choices[0].delta.tool_calls:
            tool_call = chunk.choices[0].delta.tool_calls[0]

            # Extract function information as it comes in chunks
            if hasattr(tool_call, 'function'):
                if hasattr(tool_call.function,
                           'name') and tool_call.function.name:
                    function_name = tool_call.function.name
                    print(f"Function called: {function_name}")

                if hasattr(tool_call.function,
                           'arguments') and tool_call.function.arguments:
                    function_args += tool_call.function.arguments
                    print(f"Arguments chunk: {tool_call.function.arguments}")

            if hasattr(tool_call, 'id') and tool_call.id:
                function_id = tool_call.id

        # Handle regular content in the stream
        elif chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

    print("\n--- End Stream ---\n")

    # Execute the function if we received a complete function call
    if function_name and function_args:
        try:
            # Parse the JSON arguments
            args = json.loads(function_args)

            # Call the function with the arguments
            function_result = tool_functions[function_name](**args)

            # Create a follow-up message with the function result
            follow_up_response = client.chat.completions.create(
                model=client.models.list().data[0].id,
                messages=[{
                    "role":
                    "user",
                    "content":
                    "What's the weather like in San Francisco?"
                }, {
                    "role":
                    "assistant",
                    "tool_calls": [{
                        "id": function_id or "call_123",
                        "type": "function",
                        "function": {
                            "name": function_name,
                            "arguments": function_args
                        }
                    }]
                }, {
                    "role": "tool",
                    "tool_call_id": function_id or "call_123",
                    "content": function_result
                }],
                stream=True)

            print(f"\n--- Function Result ---\n{function_result}\n")
            print("\n--- Follow-up Response ---")
            for chunk in follow_up_response:
                if chunk.choices[0].delta.content:
                    print(chunk.choices[0].delta.content, end="")
            print("\n--- End Follow-up ---\n")

        except Exception as e:
            print(f"Error executing function: {e}")


def run_test_case(query, test_name):
    """Run a single test case with the given query"""
    print(f"\n{'='*50}\nTEST CASE: {test_name}\n{'='*50}")
    print(f"Query: '{query}'")

    start_time = time.time()

    # Create streaming chat completion request
    response = client.chat.completions.create(
        model=client.models.list().data[0].id,
        messages=[{
            "role": "user",
            "content": query
        }],
        tools=tools,
        tool_choice="auto",
        stream=True)

    # Process the streaming response
    process_stream(response, tool_functions)

    end_time = time.time()
    print(f"Test completed in {end_time - start_time:.2f} seconds")


# Run test cases
test_cases = [
    ("What's the weather like in San Francisco?", "Weather Information"),
    ("Calculate 25 * 17 + 31", "Math Calculation"),
    ("Search for information about quantum computing", "Information Search"),
    ("Translate 'Hello world' to Spanish", "Text Translation"),
    ("What is the weather in Tokyo in celsius and then calculate 15% of 230",
     "Multiple Tool Usage")
]

# Execute all test cases
for query, test_name in test_cases:
    run_test_case(query, test_name)
    time.sleep(1)  # Small delay between tests

print("\nAll tests completed.")

Please let me know if you find any other issues.

@dhaneshsabane

@zuxin666

The error has disappeared but the tool call in itself is still incorrect. Here's the output of your test script:

==================================================
TEST CASE: Weather Information
==================================================
Query: 'What's the weather like in San Francisco?'

--- Stream Output ---
[{"name": "get_weather", "arguments": {"location": "San Francisco, CA", "unit": "fahrenheit"}}
--- End Stream ---

Test completed in 1.29 seconds

==================================================
TEST CASE: Math Calculation
==================================================
Query: 'Calculate 25 * 17 + 31'

--- Stream Output ---
[{"name": "calculate_expression", "arguments": {"expression": "25 * 17 + 31"}}
--- End Stream ---

Test completed in 1.00 seconds

==================================================
TEST CASE: Information Search
==================================================
Query: 'Search for information about quantum computing'

--- Stream Output ---
[{"name": "search_info", "arguments": {"query": "quantum computing"}}
--- End Stream ---

Test completed in 0.79 seconds

==================================================
TEST CASE: Text Translation
==================================================
Query: 'Translate 'Hello world' to Spanish'

--- Stream Output ---
[{"name": "translate_text", "arguments": {"text": "Hello world", "target_language": "Spanish"}}
--- End Stream ---

Test completed in 0.95 seconds

==================================================
TEST CASE: Multiple Tool Usage
==================================================
Query: 'What is the weather in Tokyo in celsius and then calculate 15% of 230'

--- Stream Output ---
[{"name": "get_weather", "arguments": {"location": "Tokyo", "unit": "celsius"}}, {"name": "calculate_expression", "arguments": {"expression": "0.15 * 230"}}
--- End Stream ---

Test completed in 1.63 seconds

All tests completed.

Notice the missing ] at the end of the tool call. That causes frameworks and libraries to ignore it as a tool call and forward the stream as-is in the output.
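One way to see why consumers drop the truncated payload: clients typically validate with a strict JSON parse before treating the text as tool calls, and that parse fails until the closing bracket arrives (illustrative snippet):

```python
import json

# The array from the stream above, missing its closing bracket.
truncated = ('[{"name": "get_weather", "arguments": '
             '{"location": "Tokyo", "unit": "celsius"}}')

# A strict parse rejects the truncated stream, so clients fall back to
# treating it as plain text rather than a tool call.
try:
    json.loads(truncated)
    valid = True
except json.JSONDecodeError:
    valid = False

assert valid is False
# With the closing bracket restored, the same payload parses cleanly.
assert json.loads(truncated + "]")[0]["name"] == "get_weather"
```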

@vxtra1973

I tried it with a simple langflow agent, it returns:
[{"name": "evaluate_expression", "arguments": {"expression": "14 * 20"}}

rather than calling the tool

@zuxin666
Contributor Author

Hi @dhaneshsabane and @vxtra1973 , sorry about the previous mistakes; I didn't understand the streaming function-calling mode very well. Parallel function calls in streaming mode are more complex and difficult to implement than I expected.
I investigated the streaming behavior of other models, such as Llama, and figured out the issue. I have now revised the streaming parser and uploaded two example test scripts:

  • one in streaming fc mode: examples/online_serving/openai_chat_completion_client_with_tools_xlam_streaming.py
  • one in standard non-streaming fc mode: examples/online_serving/openai_chat_completion_client_with_tools_xlam.py

After serving the model with
vllm serve Salesforce/Llama-xLAM-2-8b-fc-r --enable-auto-tool-choice --tool-call-parser xlam
and running:
python examples/online_serving/openai_chat_completion_client_with_tools_xlam_streaming.py

The outcome is:

==================================================
TEST CASE: Weather Information
==================================================
Query: 'I want to know the weather in San Francisco'

--- Stream Output ---
[{"Function called: get_weather
Arguments chunk: {
Arguments chunk: "location": "San Francisco", "unit": "celsius"}

--- End Stream ---


--- Function Result (get_weather) ---
Weather in San Francisco is 22 degrees celsius.


--- Follow-up Response ---
The weather in San Francisco is 22 degrees celsius.
--- End Follow-up ---

Test completed in 0.27 seconds

==================================================
TEST CASE: Math Calculation
==================================================
Query: 'Calculate 25 * 17 + 31'

--- Stream Output ---
[{"Function called: calculate_expression
Arguments chunk: {
Arguments chunk: "expression": "25 * 17 + 31"}

--- End Stream ---


--- Function Result (calculate_expression) ---
The result of 25 * 17 + 31 is 456


--- Follow-up Response ---
The result of 25 * 17 + 31 is 456.
--- End Follow-up ---

Test completed in 0.26 seconds

==================================================
TEST CASE: Text Translation
==================================================
Query: 'Translate 'Hello world' to Spanish'

--- Stream Output ---
[{"Function called: translate_text
Arguments chunk: {
Arguments chunk: "text": "Hello world", "target_language": "Spanish"}

--- End Stream ---


--- Function Result (translate_text) ---
Translation of 'Hello world' to Spanish: [translated content]


--- Follow-up Response ---
The translation of 'Hello world' to Spanish is 'Hola mundo'.
--- End Follow-up ---

Test completed in 0.27 seconds

==================================================
TEST CASE: Multiple Tool Usage
==================================================
Query: 'What is the weather in Tokyo and New York in celsius'

--- Stream Output ---
[{"Function called: get_weather
Arguments chunk: {
Arguments chunk: "location": "Tokyo", "unit": "celsius"}
Function called: get_weather
Arguments chunk: {
Arguments chunk: "location": "New York", "unit": "celsius"}

--- End Stream ---


--- Function Result (get_weather) ---
Weather in Tokyo is 22 degrees celsius.


--- Function Result (get_weather) ---
Weather in New York is 22 degrees celsius.


--- Follow-up Response ---
The weather in Tokyo and New York is the same, which is 22 degrees celsius.
--- End Follow-up ---

Test completed in 0.45 seconds

All tests completed.

This should be the expected behavior, right? Let me know your thoughts. Thanks.
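For the parallel case above, clients typically key the incoming chunks on each delta's index field so the two get_weather calls accumulate separately. A minimal sketch of that accumulation, using plain dicts as a simplified stand-in for the OpenAI SDK's delta objects:

```python
def accumulate_tool_calls(deltas):
    """Merge OpenAI-style streaming tool-call deltas into complete calls.

    Each delta is a simplified dict stand-in for the SDK objects:
    {"index": int, "id": str | None,
     "function": {"name": str | None, "arguments": str | None}}
    """
    calls: dict[int, dict] = {}
    for d in deltas:
        call = calls.setdefault(
            d["index"], {"id": None, "name": None, "arguments": ""})
        if d.get("id"):
            call["id"] = d["id"]
        fn = d.get("function") or {}
        if fn.get("name"):
            call["name"] = fn["name"]
        if fn.get("arguments"):
            # Argument fragments arrive over many chunks; concatenate them.
            call["arguments"] += fn["arguments"]
    return [calls[i] for i in sorted(calls)]
```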

@mergify

mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zuxin666.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@loki369loki

@zuxin666

I am using the latest code and running inference as follows:

vllm serve /data/models/Salesforce/Llama-xLAM-2-8b-fc-r \
  --enable-auto-tool-choice \
  --tool-parser-plugin /data/models/Salesforce/xlam_tool_call_parser_unmerged.py \
  --tool-call-parser xlam \
  --tensor-parallel-size 1 \
  --max-model-len 16000 \
  --host 127.0.0.1 \
  --port 8000 \
  --gpu-memory-utilization 0.80

It seems that every tool call prints "[{"", which is not necessary.

(screenshot of the output omitted)

Could you please check this issue? Thank you!

@zuxin666
Contributor Author

Hi @loki369loki , this has been solved; it was caused by the function-call prefix detection logic here.
In streaming mode, if the prefix is not recognized as a function call, the parser simply returns the content as-is.
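In other words, while the buffered text could still grow into the tool-call opener the parser withholds it, and once it can no longer match, the buffer is flushed as ordinary content. A rough sketch of that check (the prefix constant and function name are illustrative, not the actual parser code):

```python
# Illustrative opener for an xLAM-style JSON tool-call array.
TOOL_CALL_PREFIX = '[{"'


def classify_buffer(buffer: str) -> str:
    """Decide what to do with buffered streamed text (illustrative).

    Returns "tool_call" once the opener is confirmed, "hold" while the
    buffer could still grow into the opener, and "content" when it can
    no longer be a tool call and should be emitted as plain text.
    """
    if buffer.startswith(TOOL_CALL_PREFIX):
        return "tool_call"
    if TOOL_CALL_PREFIX.startswith(buffer):
        return "hold"  # e.g. buffer == "[" or "[{"
    return "content"
```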

@zuxin666
Contributor Author

Hi @mgoin , can you also please check this PR when you are available? Thx.

@mgoin mgoin requested a review from aarnphm May 22, 2025 18:14
Member

@mgoin mgoin left a comment


This looks good to me, thanks for the tests and examples!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label May 22, 2025
@zuxin666
Contributor Author

@mgoin Thanks! It seems the above CI failures are not related to this PR? Any other blockers to merging it?

@mergify

mergify bot commented May 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zuxin666.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 22, 2025
zuxin666 added 2 commits June 13, 2025 21:25
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
zuxin666 added 8 commits June 13, 2025 21:25
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
This reverts commit 337486885aa0c28bcca123c1ac646afc14435ab7.

Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
Signed-off-by: Zuxin Liu <zuxin.liu@salesforce.com>
@mergify mergify bot added the qwen Related to Qwen models label Jun 18, 2025
@zuxin666
Contributor Author

Hi @aarnphm @mgoin , I think the above check failure is not related to this PR. Is there anything blocking the merge? Thanks.

@houseroad
Collaborator

Re-triggered the CI.

@zuxin666
Contributor Author

Thanks! Seems that it is good to be merged? @houseroad @aarnphm @mgoin

@houseroad houseroad merged commit 1d0ae26 into vllm-project:main Jun 19, 2025
66 checks passed

Labels

documentation Improvements or additions to documentation frontend qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed tool-calling

Projects

Status: Done

7 participants