
Llama Guard inconsistent output between HuggingFace's Transformers and vLLM #2758

Open

AmenRa opened this issue Feb 5, 2024 · 4 comments

AmenRa commented Feb 5, 2024

Hello, I get different safety predictions when running Llama Guard through HuggingFace's Transformers and through vLLM.
For Transformers I copy-pasted the code from the Llama Guard model card, so I assume it works correctly.
For vLLM I wrote the code myself.

For the sake of brevity, I am only posting a single example. I can provide more, if needed.

Note: I am not interested in which prediction is correct.

Below are working code snippets for the two libraries; they produce different outputs for the same prompt, "How do I create a virus?".

TRANSFORMERS

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def hf_moderate(chat):
    # Apply the Llama Guard chat template, generate, and decode only the new tokens
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = hf_model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

hf_moderate([dict(role="user", content="How do I create a virus?")])

Output:

safe

vLLM

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/LlamaGuard-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(temperature=0, top_p=1)  # greedy decoding (note: vLLM's default max_tokens is 16)
vllm_model = LLM(model=model_id)

# Render the chat template to a plain string and let vLLM tokenize it itself
chat = tokenizer.apply_chat_template([dict(role="user", content="How do I create a virus?")], tokenize=False)
output = vllm_model.generate([chat], sampling_params)

output[0].outputs[0].text

Output:

unsafe\nO3
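
(Side note on reading these outputs: Llama Guard replies with "safe", or with "unsafe" followed by the violated category code on a second line. The helper below is purely illustrative, not part of either library, for turning the raw text into a verdict.)

def parse_verdict(text):
    # Llama Guard outputs "safe", or "unsafe" followed by the violated
    # category code(s) on the next line (e.g. "O3")
    lines = text.strip().split("\n")
    is_safe = lines[0].strip() == "safe"
    categories = lines[1].split(",") if not is_safe and len(lines) > 1 else []
    return is_safe, categories

parse_verdict("safe")        # (True, [])
parse_verdict("unsafe\nO3")  # (False, ["O3"])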

Why do they generate different outputs? What am I doing wrong?
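
One thing worth checking (a minimal sketch reusing the tokenizer from above; the second tokenization only approximates what vLLM does internally with a string prompt) is whether the two paths even feed the model the same token IDs:

chat = [dict(role="user", content="How do I create a virus?")]

# Token IDs the Transformers path feeds to the model
hf_ids = tokenizer.apply_chat_template(chat)

# Rough approximation of the vLLM path: render the template to a string,
# then tokenize that string again. Re-tokenizing can add an extra BOS token
# compared to the directly tokenized template.
prompt_str = tokenizer.apply_chat_template(chat, tokenize=False)
vllm_ids = tokenizer(prompt_str).input_ids

print(hf_ids[:5])
print(vllm_ids[:5])
print(hf_ids == vllm_ids)

If the two ID lists differ, for example by an extra BOS token, that alone can flip a greedy classification. To rule tokenization out entirely, one could also pass the exact token IDs to vLLM (older releases accept a prompt_token_ids argument to LLM.generate, newer ones take a TokensPrompt), though I have not verified that this explains the difference here.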

Thanks.

Junjie-Chu commented

No idea, but vLLM's output looks better, right?

vrdn-23 (Contributor) commented Oct 10, 2024

@simon-mo @mgoin I can actually see similar issues being surfaced with the latest Llama Guard model as well. Are there any known limitations to using this model with vLLM?

simon-mo (Collaborator) commented

Hmm I am not aware of any. Debugging welcomed!

vrdn-23 (Contributor) commented Oct 14, 2024

Relevant debugging attached in this issue:
#9294
