
Llama Guard inconsistent output between HuggingFace's Transformers and vLLM #2758

Open

AmenRa opened this issue Feb 5, 2024 · 4 comments

AmenRa commented Feb 5, 2024

Hello, I get different safety predictions when running Llama Guard through HuggingFace's Transformers and through vLLM.
For Transformers I copy-pasted the code from the Llama Guard model card, so I assume it works correctly.
For vLLM I wrote the code myself.

For the sake of brevity, I am only posting a single example. I can provide more, if needed.

Note: I am not interested in which prediction is correct.

Below are working code snippets for the two libraries; they produce different outputs for the same prompt, "How do I create a virus?".

TRANSFORMERS

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def hf_moderate(chat):
    # Apply the Llama Guard chat template, generate, and decode only the new tokens
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = hf_model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

hf_moderate([dict(role="user", content="How do I create a virus?")])

Output:

safe

vLLM

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(temperature=0, top_p=1)  # greedy decoding (note: vLLM's default max_tokens is 16)
vllm_model = LLM(model=model_id)

# Render the chat template to a plain string and let vLLM tokenize it itself
chat = tokenizer.apply_chat_template([dict(role="user", content="How do I create a virus?")], tokenize=False)
output = vllm_model.generate([chat], sampling_params)

output[0].outputs[0].text

Output:

unsafe\nO3
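
(Side note on reading these outputs: Llama Guard replies with "safe", or with "unsafe" followed by the violated category code on a second line. The helper below is purely illustrative, not part of either library, for turning the raw text into a verdict.)

def parse_verdict(text):
    # Llama Guard outputs "safe", or "unsafe" followed by the violated
    # category code(s) on the next line (e.g. "O3")
    lines = text.strip().split("\n")
    is_safe = lines[0].strip() == "safe"
    categories = lines[1].split(",") if not is_safe and len(lines) > 1 else []
    return is_safe, categories

parse_verdict("safe")        # (True, [])
parse_verdict("unsafe\nO3")  # (False, ["O3"])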

Why do they generate different outputs? What am I doing wrong?
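
One thing worth checking (a minimal sketch reusing the tokenizer from above; the second tokenization only approximates what vLLM does internally with a string prompt) is whether the two paths even feed the model the same token IDs:

chat = [dict(role="user", content="How do I create a virus?")]

# Token IDs the Transformers path feeds to the model
hf_ids = tokenizer.apply_chat_template(chat)

# Rough approximation of the vLLM path: render the template to a string,
# then tokenize that string again. Re-tokenizing can add an extra BOS token
# compared to the directly tokenized template.
prompt_str = tokenizer.apply_chat_template(chat, tokenize=False)
vllm_ids = tokenizer(prompt_str).input_ids

print(hf_ids[:5])
print(vllm_ids[:5])
print(hf_ids == vllm_ids)

If the two ID lists differ, for example by an extra BOS token, that alone can flip a greedy classification. To rule tokenization out entirely, one could also pass the exact token IDs to vLLM (older releases accept a prompt_token_ids argument to LLM.generate, newer ones take a TokensPrompt), though I have not verified that this explains the difference here.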

Thanks.

Junjie-Chu commented

No idea, but vLLM's output looks better, right?

vrdn-23 (Contributor) commented Oct 10, 2024

@simon-mo @mgoin I can actually see similar issues being surfaced with the latest Llama Guard model as well. Are there any known limitations to using this model with vLLM?

simon-mo (Collaborator) commented

Hmm I am not aware of any. Debugging welcomed!

vrdn-23 (Contributor) commented Oct 14, 2024

Relevant debugging attached in this issue:
#9294
