Tool call performs worse on v2.2.0 as compared to latest #2413

Open

varad0309 opened this issue Aug 13, 2024 · 6 comments
@varad0309

System Info

gpu=0
num_gpus=1
model=meta-llama/Meta-Llama-3.1-8B-Instruct
# $token, $volume, $max_concurrent_request, $max_total_token,
# $max_input_length, and $wsr must also be set in the environment.
docker run -d \
  --gpus "\"device=$gpu\"" \
  --shm-size 16g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8082:80 \
  -v $volume:/data \
  --name Meta-Llama-3.1-8B \
  ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e \
  --model-id $model \
  --max-concurrent-requests $max_concurrent_request \
  --max-total-tokens $max_total_token \
  --max-input-length $max_input_length \
  --waiting-served-ratio $wsr \
  --num-shard $num_gpus \
  --dtype bfloat16

OS: Ubuntu Linux
Model: meta-llama/Meta-Llama-3.1-8B-Instruct / meta-llama/Meta-Llama-3-8B-Instruct
Hardware: A100 80G
Version with issue: v2.2.0
Compared with: latest
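
To confirm the container is serving before running the benchmark, a quick liveness probe can help. A minimal sketch in Python (the port follows the -p 8082:80 mapping above; it assumes TGI's /health route, which returns 200 once the model is loaded):

import requests  # pip install requests

# Probe the mapped port; HTTP 200 means the model is loaded and serving.
resp = requests.get("http://127.0.0.1:8082/health", timeout=5)
print("ready" if resp.status_code == 200 else f"not ready: HTTP {resp.status_code}")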

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Launch the Docker container.
  2. Run the following:
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="http://127.0.0.1:8082/v1",
    api_key="_",
)

# messages, tools, and max_tokens are supplied by the benchmark harness.
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
)

predictions = chat_completion.choices[0].message.tool_calls
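
For reference, messages, tools, and max_tokens above follow the standard OpenAI chat-completions schema. A minimal hypothetical stand-in (the search_hotel schema mirrors the first example further down; it is illustrative, not the benchmark's actual definition):

messages = [
    {"role": "user", "content": "Find me a hotel in Paris from 2022-05-01 to 2022-05-10."}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_hotel",
            "description": "Search for hotels at a destination for given dates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "destination": {"type": "string"},
                    "check_in_date": {"type": "string"},
                    "check_out_date": {"type": "string"},
                },
                "required": ["destination", "check_in_date", "check_out_date"],
            },
        },
    }
]
max_tokens = 256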

Expected behavior

Hey @drbh @ErikKaum, did you try benchmarking the tool-calling performance of v2.2.0 against latest? I am getting dramatically worse performance on v2.2.0 compared to previous versions on some tool-call benchmarks I have created. Just changing the version causes the performance of meta-llama/Meta-Llama-3-8B-Instruct to drop from 0.66 to 0.08 on the same script and data. Of course, I can't evaluate Llama-3.1 on previous versions, but its performance is similarly close to 0.

@drbh
Collaborator

drbh commented Aug 13, 2024

Hi @varad0309, thanks for opening this issue. v2.2.0 was released ~3 weeks ago, and TGI has since had some bug fixes and improvements that are available on latest. Specifically, a fix for a tool-related bug was merged yesterday (#2406), which would likely improve tool-calling responses.

We will be publishing a newer release in the coming weeks, and it should include these fixes along with many other improvements! For now I'd recommend using latest or a pinned commit to ensure you are on a version with the tool fixes. Thanks again!

@drbh drbh self-assigned this Aug 13, 2024
@varad0309
Author

varad0309 commented Aug 13, 2024

@drbh thanks for the quick reply. I did try a commit from a few hours back (more specifically, sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e). The problem still persists.

My observation: the list of available tools is still not getting passed appropriately.

@drbh
Collaborator

drbh commented Aug 13, 2024

Oh, apologies, I must have misunderstood the issue. It sounds like tool responses have regressed starting at version 2.2.0 and onwards? Would you be able to share an example of the input and the expected output? Additionally, do you know when the tools last worked as you expected (a version, or best case the last good commit sha)? Thanks!

@varad0309
Author

Sure, here are a few examples. Unfortunately, I don't know the exact version at which it starts breaking. The versions I am comparing are via Docker images:

  1. (this works) version 1 => ghcr.io/huggingface/text-generation-inference:latest
  2. (this doesn't) version 2 => ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e

Examples:

Ground truth: [{'name': 'search_hotel', 'arguments': {'destination': 'Paris', 'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10'}}]
Version 1: [Function(arguments={'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10', 'location': 'Paris', 'num_guests': 1, 'num_rooms': 1}, name='search_hotel', description=None)]
Version 2: [Function(arguments={'number': 7}, name='find_hotels', description=None)]

Ground truth: [{'name': 'roll_dice', 'arguments': {'sides': 6, 'quantity': 1}}]
Version 1: [Function(arguments={'quantity': 1, 'sides': 6}, name='roll_dice', description=None)]
Version 2: [Function(arguments={'artist': 'tools', 'genre': 'RNG Tools'}, name='random.randint', description=None)]

Ground truth: [{'name': 'calculate_fuel_cost', 'arguments': {'distance': 500, 'fuel_price': 1.2, 'fuel_efficiency': 10}}]
Version 1: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_cost', description=None)]
Version 2: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_consumption', description=None)]
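
The drop from 0.66 to 0.08 mentioned above can be reproduced with any exact-match scorer over examples like these. A rough sketch, assuming a prediction counts as correct only when both the function name and the argument dict match the ground truth exactly (an assumption about the benchmark, not its actual scoring code):

# Score a batch of predicted tool calls against ground-truth calls.
def exact_match_score(predictions, ground_truth):
    hits = 0
    for pred, gold in zip(predictions, ground_truth):
        # A hit requires the same function name and identical arguments.
        if pred["name"] == gold["name"] and pred["arguments"] == gold["arguments"]:
            hits += 1
    return hits / len(ground_truth)

Under this scoring, all three "Version 2" outputs above are misses: two call the wrong function outright, and the third matches on arguments but not on the function name.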

@drbh
Collaborator

drbh commented Aug 28, 2024

Hi @varad0309, I believe these issues should be resolved by the recent improvements and bug fixes to grammars and tool calling (#2463, #2454, #2391, etc.).

Would you kindly try the most recent container image, ghcr.io/huggingface/text-generation-inference:sha-8f99f16? There were some changes directly related to the tool-calling performance of meta-llama/Meta-Llama-3-8B-Instruct, so I believe this should improve your use case. Thank you!

@varad0309
Author

| Turn | Tool type | sha-1cebccc | sha-21187c2 | sha-8f99f16 |
| --- | --- | --- | --- | --- |
| Single turn | Irrelevant | 0.02 | 0.27 | 0.27 |
| Single turn | Chat | 0.64 | 0.28 | 0.28 |
| Multi turn | Irrelevant | 0.0 | 0.18 | 0.18 |
| Multi turn | Chat | 0.08 | 0.04 | 0.04 |

@drbh thanks for working on this!! I just ran some tests on different versions (older to newer, left to right in the table above) on a benchmark, with temperature 0, using the above OpenAI chat completion script on meta-llama/Meta-Llama-3-8B-Instruct.

Function-calling performance still seems to be dropping, though the model's ability to filter out irrelevant tools has gotten pretty good (the benchmark is BFCL-style).
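
For completeness, temperature 0 in these runs was set via the same chat-completions call as in the reproduction script; a sketch of the one-line change:

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
    temperature=0,  # greedy decoding so runs are comparable across versions
)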
