Tool call performs worse on v2.2.0 as compared to latest #2413

Open

varad0309 opened this issue Aug 13, 2024 · 6 comments
@varad0309

System Info

gpu=0
num_gpus=1
model=meta-llama/Meta-Llama-3.1-8B-Instruct
# $token, $volume, $max_concurrent_request, $max_total_token,
# $max_input_length, and $wsr must also be set in the environment.
docker run -d \
  --gpus "\"device=$gpu\"" \
  --shm-size 16g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8082:80 \
  -v $volume:/data \
  --name Meta-Llama-3.1-8B \
  ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e \
  --model-id $model \
  --max-concurrent-requests $max_concurrent_request \
  --max-total-tokens $max_total_token \
  --max-input-length $max_input_length \
  --waiting-served-ratio $wsr \
  --num-shard $num_gpus \
  --dtype bfloat16

OS: Ubuntu Linux
Model: meta-llama/Meta-Llama-3.1-8B-Instruct / meta-llama/Meta-Llama-3-8B-Instruct
Hardware: A100 80G
Version with issue: v2.2.0
Compared with: latest
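
To confirm the container is serving before running the benchmark, a quick liveness probe can help. A minimal sketch in Python (the port follows the -p 8082:80 mapping above; it assumes TGI's /health route, which returns 200 once the model is loaded):

import requests  # pip install requests

# Probe the mapped port; HTTP 200 means the model is loaded and serving.
resp = requests.get("http://127.0.0.1:8082/health", timeout=5)
print("ready" if resp.status_code == 200 else f"not ready: HTTP {resp.status_code}")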

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Launch the Docker container.
  2. Run the following:
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="http://127.0.0.1:8082/v1",
    api_key="_",
)

# messages, tools, and max_tokens are supplied by the benchmark harness.
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
)

predictions = chat_completion.choices[0].message.tool_calls
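
For reference, messages, tools, and max_tokens above follow the standard OpenAI chat-completions schema. A minimal hypothetical stand-in (the search_hotel schema mirrors the first example further down; it is illustrative, not the benchmark's actual definition):

messages = [
    {"role": "user", "content": "Find me a hotel in Paris from 2022-05-01 to 2022-05-10."}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_hotel",
            "description": "Search for hotels at a destination for given dates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "destination": {"type": "string"},
                    "check_in_date": {"type": "string"},
                    "check_out_date": {"type": "string"},
                },
                "required": ["destination", "check_in_date", "check_out_date"],
            },
        },
    }
]
max_tokens = 256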

Expected behavior

Hey @drbh @ErikKaum, did you try benchmarking the tool-calling performance of v2.2.0 against latest? I am getting dramatically worse performance on v2.2.0 compared to previous versions on some tool-call benchmarks I have created. Just changing the version causes the performance of meta-llama/Meta-Llama-3-8B-Instruct to drop from 0.66 to 0.08 on the same script and data. Of course, I can't evaluate Llama-3.1 on previous versions, but its performance is similarly close to 0.

@drbh
Collaborator

drbh commented Aug 13, 2024

Hi @varad0309, thanks for opening this issue. v2.2.0 was released ~3 weeks ago, and TGI has since had some bug fixes and improvements that are available on latest. Specifically, a fix for a tool-related bug was merged yesterday (#2406), which would likely improve tool-calling responses.

We will be publishing a newer release in the coming weeks, and it should include these fixes along with many other improvements! For now I'd recommend using latest or a pinned commit to ensure you are on a version with the tool fixes. Thanks again!

@drbh drbh self-assigned this Aug 13, 2024
@varad0309
Author

varad0309 commented Aug 13, 2024

@drbh thanks for the quick reply. I did try a commit from a few hours back (more specifically, sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e). The problem still persists.

My observation: the list of available tools is still not getting passed appropriately.

@drbh
Collaborator

drbh commented Aug 13, 2024

Oh, apologies, I must have misunderstood the issue. It sounds like tool responses have regressed starting at version 2.2.0 and onwards? Would you be able to share an example of the input and the expected output? Additionally, do you know when the tools last worked as you expected (a version, or best case the last good commit sha)? Thanks!

@varad0309
Author

Sure, here are a few examples. Unfortunately, I don't know the exact version at which it starts breaking. The versions I am comparing are via Docker images:

  1. (this works) version 1 => ghcr.io/huggingface/text-generation-inference:latest
  2. (this doesn't) version 2 => ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e

Examples:

Ground truth: [{'name': 'search_hotel', 'arguments': {'destination': 'Paris', 'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10'}}]
Version 1: [Function(arguments={'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10', 'location': 'Paris', 'num_guests': 1, 'num_rooms': 1}, name='search_hotel', description=None)]
Version 2: [Function(arguments={'number': 7}, name='find_hotels', description=None)]

Ground truth: [{'name': 'roll_dice', 'arguments': {'sides': 6, 'quantity': 1}}]
Version 1: [Function(arguments={'quantity': 1, 'sides': 6}, name='roll_dice', description=None)]
Version 2: [Function(arguments={'artist': 'tools', 'genre': 'RNG Tools'}, name='random.randint', description=None)]

Ground truth: [{'name': 'calculate_fuel_cost', 'arguments': {'distance': 500, 'fuel_price': 1.2, 'fuel_efficiency': 10}}]
Version 1: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_cost', description=None)]
Version 2: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_consumption', description=None)]
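
The drop from 0.66 to 0.08 mentioned above can be reproduced with any exact-match scorer over examples like these. A rough sketch, assuming a prediction counts as correct only when both the function name and the argument dict match the ground truth exactly (an assumption about the benchmark, not its actual scoring code):

# Score a batch of predicted tool calls against ground-truth calls.
def exact_match_score(predictions, ground_truth):
    hits = 0
    for pred, gold in zip(predictions, ground_truth):
        # A hit requires the same function name and identical arguments.
        if pred["name"] == gold["name"] and pred["arguments"] == gold["arguments"]:
            hits += 1
    return hits / len(ground_truth)

Under this scoring, all three "Version 2" outputs above are misses: two call the wrong function outright, and the third matches on arguments but not on the function name.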

@drbh
Collaborator

drbh commented Aug 28, 2024

Hi @varad0309, I believe these issues should be resolved by the recent improvements and bug fixes to grammars and tool calling (#2463, #2454, #2391, etc.).

Would you kindly try the most recent container image, ghcr.io/huggingface/text-generation-inference:sha-8f99f16? There were some changes directly related to the tool-calling performance of meta-llama/Meta-Llama-3-8B-Instruct, so I believe this should improve your use case. Thank you!

@varad0309
Author

| Turn | Tool type | sha-1cebccc | sha-21187c2 | sha-8f99f16 |
| --- | --- | --- | --- | --- |
| Single turn | Irrelevant | 0.02 | 0.27 | 0.27 |
| Single turn | Chat | 0.64 | 0.28 | 0.28 |
| Multi turn | Irrelevant | 0.0 | 0.18 | 0.18 |
| Multi turn | Chat | 0.08 | 0.04 | 0.04 |

@drbh thanks for working on this!! I just ran some tests on different versions (older to newer, left to right in the table above) on a benchmark, with temperature 0, using the above OpenAI chat completion script on meta-llama/Meta-Llama-3-8B-Instruct.

Function-calling performance still seems to be dropping, though the model's ability to filter out irrelevant tools has gotten pretty good (the benchmark is BFCL-style).
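
For completeness, temperature 0 in these runs was set via the same chat-completions call as in the reproduction script; a sketch of the one-line change:

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
    temperature=0,  # greedy decoding so runs are comparable across versions
)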
