[Bug]: Regression in predictions in v0.4.3 #5280

Closed
hibukipanim opened this issue Jun 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments

hibukipanim commented Jun 5, 2024

🐛 Describe the bug

The predictions changed between v0.4.2 and v0.4.3: both the actual tokens at temperature 0 and the logprobs.

I'll show here how to reproduce the issue with TinyLlama (I originally noticed it with Llama-3-8B-Instruct).

Running 2 vLLM servers with:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --gpu-memory-utilization 0.4

adding --port 8001 for 0.4.2 and --port 8002 for 0.4.3.

Then running this client code:

# Imports assumed from the OpenAI Python SDK (they were omitted from the original snippet):
import openai
from openai.types.chat import ChatCompletionTokenLogprob
from openai.types.chat.chat_completion import ChoiceLogprobs
from openai.types.chat.chat_completion_token_logprob import TopLogprob
from openai.types.completion_choice import Logprobs


def convert_logprobs_to_chat(legacy_logprobs: Logprobs) -> ChoiceLogprobs:
    # Convert the legacy (completion-style) logprobs of v0.4.2 into the chat-style format.
    top_logprobs = []
    for top_token, top_logprob in legacy_logprobs.top_logprobs[0].items():
        top_logprobs.append(
            TopLogprob(token=top_token, logprob=top_logprob)
        )

    chat_logprobs = ChatCompletionTokenLogprob(
        token=legacy_logprobs.tokens[0],
        logprob=legacy_logprobs.token_logprobs[0],
        top_logprobs=top_logprobs
    )

    return ChoiceLogprobs(content=[chat_logprobs])

vllms = {
    "0.4.2": "http://localhost:8001/v1",
    "0.4.3": "http://localhost:8002/v1",
}

for version, endpoint in vllms.items():
    print(f"\nvLLM {version=}, {endpoint=}")
    client = openai.OpenAI(
        base_url=endpoint,
        api_key="foo"
    )

    msgs = [{"role": "user", "content": "3**7=?"}]
    response = client.chat.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        messages=msgs,
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
        temperature=0
    )
    print(f"answer (first token): {response.choices[0].message.content}")
    if version == "0.4.2":
        legacy_logprobs = response.choices[0].logprobs
        print(f"{legacy_logprobs.top_logprobs=}")
        logprobs = convert_logprobs_to_chat(legacy_logprobs)
    else:
        logprobs = response.choices[0].logprobs
    
    top_logprobs = logprobs.content[0].top_logprobs
    print(f"top_logprobs:\t\t{top_logprobs}")
    sorted_top_logprobs = sorted(top_logprobs, key=lambda x: -x.logprob)
    print(f"sorted_top_logprobs:\t{sorted_top_logprobs}")

Getting this output:

vLLM version='0.4.2', endpoint='http://localhost:8001/v1'
answer (first token): The
legacy_logprobs.top_logprobs=[{'The': -1.3429100513458252, 'Yes': -1.6554100513458252, 'No': -2.405410051345825, '3': -2.530410051345825, 'S': -3.405410051345825}]
top_logprobs:		[TopLogprob(token='The', bytes=None, logprob=-1.3429100513458252), TopLogprob(token='Yes', bytes=None, logprob=-1.6554100513458252), TopLogprob(token='No', bytes=None, logprob=-2.405410051345825), TopLogprob(token='3', bytes=None, logprob=-2.530410051345825), TopLogprob(token='S', bytes=None, logprob=-3.405410051345825)]
sorted_top_logprobs:	[TopLogprob(token='The', bytes=None, logprob=-1.3429100513458252), TopLogprob(token='Yes', bytes=None, logprob=-1.6554100513458252), TopLogprob(token='No', bytes=None, logprob=-2.405410051345825), TopLogprob(token='3', bytes=None, logprob=-2.530410051345825), TopLogprob(token='S', bytes=None, logprob=-3.405410051345825)]

vLLM version='0.4.3', endpoint='http://localhost:8002/v1'
answer (first token): 3
top_logprobs:		[TopLogprob(token='3', bytes=[51], logprob=-0.8362228870391846), TopLogprob(token='Yes', bytes=[89, 101, 115], logprob=-2.2112228870391846), TopLogprob(token='1', bytes=[49], logprob=-2.6487228870391846), TopLogprob(token='No', bytes=[78, 111], logprob=-2.8362228870391846), TopLogprob(token='7', bytes=[55], logprob=-3.2112228870391846)]
sorted_top_logprobs:	[TopLogprob(token='3', bytes=[51], logprob=-0.8362228870391846), TopLogprob(token='Yes', bytes=[89, 101, 115], logprob=-2.2112228870391846), TopLogprob(token='1', bytes=[49], logprob=-2.6487228870391846), TopLogprob(token='No', bytes=[78, 111], logprob=-2.8362228870391846), TopLogprob(token='7', bytes=[55], logprob=-3.2112228870391846)]

Note that for 0.4.2 I'm converting the logprobs to the chat format; this is no longer needed in 0.4.3 since #5029.
Also, note that I sorted the logprobs: I noticed that since 0.4.3 they are sometimes not returned in descending order as before, though I'm not sure whether that ordering should be guaranteed (in this specific example they happened to be sorted already).

hibukipanim added the bug label on Jun 5, 2024
DarkLight1337 (Member) commented Jun 5, 2024

This might have been caused by #4688. After #5278 is merged, try setting add_special_tokens=True and see whether the original behaviour is restored.
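
If the option is exposed through the OpenAI-compatible server, passing it would presumably look something like this with the openai client (a minimal sketch; only the flag name comes from the PR discussion, and sending it via extra_body is an assumption):

import openai

client = openai.OpenAI(base_url="http://localhost:8002/v1", api_key="foo")
response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "3**7=?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
    temperature=0,
    extra_body={"add_special_tokens": True},  # assumed: vLLM-specific extra request field from #5278
)
print(response.choices[0].message.content)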

DarkLight1337 (Member) commented:

Does this issue occur using the offline LLM class?

hibukipanim (Author) commented Jun 5, 2024

Thanks! Indeed, it could be related.

The chat template for this model indeed doesn't include the BOS token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
msgs = [{"role": "user", "content": "hello"}]
print(f"{tokenizer.bos_token=}")
print(f"{tokenizer.decode(tokenizer.apply_chat_template(msgs))=}")

outputs:

tokenizer.bos_token='<s>'
tokenizer.decode(tokenizer.apply_chat_template(msgs))='<|user|>\nhello</s> \n'

Edit: however, this doesn't explain why the issue also reproduces with NousResearch/Hermes-2-Pro-Llama-3-8B, where the chat template does include the BOS token:

tokenizer.bos_token='<|begin_of_text|>'
tokenizer.decode(tokenizer.apply_chat_template(msgs))='<|begin_of_text|><|im_start|>user\nhello<|im_end|>\n'

DarkLight1337 (Member) commented:

> Does this issue occur using the offline LLM class?

To narrow down the issue, try comparing them in offline inference as mentioned above.
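
For example, an offline comparison could look roughly like this (a minimal sketch, assuming the chat template is applied manually so the LLM class sees the same prompt as the chat endpoint; run it once per installed vLLM version and compare the printed logprobs):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model)

# Build the same prompt string that the chat endpoint would render.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "3**7=?"}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model, gpu_memory_utilization=0.4)
params = SamplingParams(temperature=0, max_tokens=1, logprobs=5)
output = llm.generate([prompt], params)[0]

# First generated token and its per-token top-5 logprobs.
print(output.outputs[0].text)
print(output.outputs[0].logprobs)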

hibukipanim (Author) commented:

I couldn't reproduce it offline with the LLM() class, nor with the legacy /completions endpoint.

So it indeed points to the PR you linked about the default BOS token.
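
For reference, a /completions check along those lines might look like this (a sketch only; the hard-coded chat-templated prompt is an assumption, not the exact command that was run):

import openai

client = openai.OpenAI(base_url="http://localhost:8002/v1", api_key="foo")

# Legacy completions endpoint: send the already-templated prompt directly.
response = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    prompt="<|user|>\n3**7=?</s>\n<|assistant|>\n",  # assumed TinyLlama chat-templated prompt
    max_tokens=1,
    logprobs=5,
    temperature=0,
)
print(response.choices[0].text)
print(response.choices[0].logprobs.top_logprobs)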

Thanks! Closing the issue.
