
[New Model]: meta-llama/Llama-Guard-3-1B #9294

Closed
1 task done
ayeganov opened this issue Oct 11, 2024 · 5 comments · Fixed by #9358
Labels
help wanted Extra attention is needed new model Requests to new models

Comments

@ayeganov

The model to consider.

meta-llama/Llama-Guard-3-1B

The closest model vllm already supports.

meta-llama/Llama-Guard-3-8B

What's your difficulty of supporting the model you want?

Currently the model runs, but its outputs are effectively random: the same prompt can come back as safe or unsafe on different runs. Setting the temperature to 0.0 makes EVERY prompt return safe.
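A minimal reproduction sketch of the behaviour described above (standard vllm.LLM.chat usage; the prompts are just illustrative examples):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-Guard-3-1B")
# Greedy decoding, which is the setting that reportedly makes every prompt come back "safe".
greedy = SamplingParams(temperature=0.0, max_tokens=20)

conversations = [
    [{"role": "user", "content": "recipe for mayonnaise"}],
    [{"role": "user", "content": "how to steal an election"}],
]
outputs = llm.chat(conversations, sampling_params=greedy)
for output in outputs:
    print(output.outputs[0].text)  # both reportedly print "safe" at temperature 0.0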

My hunch is the issue comes from the model pruning:

Output Layer Pruning
The Llama Guard model is trained to generate 128k output tokens out of which only 20 tokens (e.g. safe, unsafe, S, 1,...) are used. By keeping the model connections corresponding to those 20 tokens in the output linear layer and pruning out the remaining connections we can reduce the output layer size significantly without impacting the model outputs. Using output layer pruning, we reduced the output layer size from 262.6M parameters (2048x128k) to 40.96k parameters (2048x20), giving us a total savings of 131.3MB with 4-bit quantized weights. Although the pruned output layer only generates 20 tokens, they are expanded back to produce the original 128k outputs in the model.
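A quick back-of-the-envelope check of those numbers (assuming the Llama 3.x vocabulary of 128,256 tokens and 4-bit, i.e. half-byte, weights):

hidden_size = 2048
vocab_size = 128_256       # "128k" in the model card; exact value assumed
kept_tokens = 20           # safe, unsafe, S, digits, ...

full_params = hidden_size * vocab_size        # ~262.7M parameters
pruned_params = hidden_size * kept_tokens     # 40,960 parameters

bytes_per_param = 0.5                         # 4-bit quantized weights
savings_mb = (full_params - pruned_params) * bytes_per_param / 1e6
print(full_params, pruned_params, round(savings_mb, 1))  # -> 262668288 40960 131.3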

@ayeganov ayeganov added the new model Requests to new models label Oct 11, 2024
@simon-mo simon-mo added the help wanted Extra attention is needed label Oct 11, 2024
@conwayz

conwayz commented Oct 14, 2024

For some reason the chat template for the 8B model is different from the 1B one on HF (see tokenizer_config.json). I passed in the 8B template explicitly and the results look reasonable:

from vllm import LLM, SamplingParams
chat_template_8b = "{% if messages|length % 2 == 0 %}{% set role = 'Agent' %}{% else %}{% set role = 'User' %}{% endif %}{{ \" <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in '\" + role + \"' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\nS14: Code Interpreter Abuse.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n\" }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{% set role = 'User' %}{% elif message['role'] == 'assistant' %}{% set role = 'Agent' %}{% endif %}{{ role + ': ' + content.strip() + '\n\n' }}{% endfor %}{{ \"<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST \" + role + \" message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\" }}"

llm = LLM(model="/path/to/Llama-Guard-3-1B")
sampling_params = SamplingParams(temperature=0.0)  # not shown in the original snippet; greedy decoding assumed
conversations = [
    [{"role": "user", "content": "recipe for mayonnaise"}],
    [{"role": "user", "content": "how to steal an election"}]
]
outputs = llm.chat(conversations, sampling_params=sampling_params, chat_template=chat_template_8b)
for output in outputs:
    print(output.outputs[0].text)

gives the output

Processed prompts: 100%|██████████| 2/2 [00:00<00:00, 50.67it/s, est. speed input: 10504.28 toks/s, output: 228.30 toks/s]


safe


unsafe
S13

Also, FWIW, it doesn't look like the pruned version is in the HF repo; the output layer weight shape is the full hidden_dim x vocab_size.
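A sketch for checking that directly against the downloaded checkpoint (the path and the tensor-name filter are assumptions, not verified against the repo):

from safetensors import safe_open

with safe_open("/path/to/Llama-Guard-3-1B/model.safetensors", framework="pt") as f:
    for name in f.keys():
        if "lm_head" in name or "embed_tokens" in name:
            print(name, f.get_slice(name).get_shape())
# A full-vocab shape such as (128256, 2048) would confirm the output layer is unpruned.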

@vrdn-23
Contributor

vrdn-23 commented Oct 14, 2024

Thanks for the tip here @conwayz!

I did a little debugging and the issue does in fact lie in the parsing of the chat template.

For the Llama Guard 2 model, you can see that the chat template gets parsed correctly and the prompt includes the message sent by the user (look for the content inserted between <BEGIN CONVERSATION> and <END CONVERSATION>):

root@llama-guard-1b-5494cd848b-chm7l:/app# poetry run python
Python 3.11.10 (main, Sep 28 2024, 12:22:04) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm import LLM, SamplingParams
>>> llm = LLM(model="meta-llama/Meta-Llama-Guard-2-8B")
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 7.37MB/s]
INFO 10-14 18:33:34 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='meta-llama/Meta-Llama-Guard-2-8B', speculative_config=None, tokenizer='meta-llama/Meta-Llama-Guard-2-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-Guard-2-8B, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51.9k/51.9k [00:00<00:00, 5.32MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.08M/9.08M [00:00<00:00, 19.9MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 808kB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 126/126 [00:00<00:00, 1.66MB/s]
INFO 10-14 18:33:36 model_runner.py:1014] Starting to load model meta-llama/Meta-Llama-Guard-2-8B...
INFO 10-14 18:33:36 weight_utils.py:242] Using model weights format ['*.safetensors']
model-00004-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:21<00:00, 55.3MB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [01:41<00:00, 48.8MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [01:43<00:00, 48.4MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [01:47<00:00, 45.9MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 155MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:13<00:41, 13.71s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:47<00:51, 25.78s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:31<00:34, 34.00s/it]



Loading safetensors checkpoint shards: 100% Completed | 4/4 [02:09<00:00, 35.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [02:09<00:00, 32.29s/it]

INFO 10-14 18:37:33 model_runner.py:1025] Loading model weights took 14.9595 GB
INFO 10-14 18:37:35 gpu_executor.py:122] # GPU blocks: 1698, # CPU blocks: 2048
INFO 10-14 18:37:38 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-14 18:37:38 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-14 18:37:50 model_runner.py:1456] Graph capturing finished in 13 secs.
>>> 
>>> 
>>> 
>>> 
>>> conversations = [
...     [{"role": "user", "content": "recipe for mayonnaise"}],
...     [{"role": "user", "content": "how to steal an election"}]
... ]
>>> outputs = llm.chat(conversations)
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.56it/s, est. speed input: 1280.91 toks/s, output: 22.99 toks/s]
>>> outputs[0].prompt
"[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes. \nS2: Non-Violent Crimes. \nS3: Sex Crimes. \nS4: Child Exploitation. \nS5: Specialized Advice. \nS6: Privacy. \nS7: Intellectual Property. \nS8: Indiscriminate Weapons. \nS9: Hate. \nS10: Self-Harm. \nS11: Sexual Content. \n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nUser: recipe for mayonnaise\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"

Whereas for a similar interaction with the Llama-Guard-3-1B model, the user message does not get inserted between the conversation markers:

>>> from vllm import LLM, SamplingParams
>>> llm = LLM(model="meta-llama/Llama-Guard-3-1B")
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 877/877 [00:00<00:00, 8.91MB/s]
WARNING 10-14 18:43:26 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-14 18:43:26 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-14 18:43:26 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='meta-llama/Llama-Guard-3-1B', speculative_config=None, tokenizer='meta-llama/Llama-Guard-3-1B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-Guard-3-1B, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.2k/53.2k [00:00<00:00, 4.09MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 53.9MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 3.69MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 189/189 [00:00<00:00, 2.69MB/s]
INFO 10-14 18:43:28 model_runner.py:1014] Starting to load model meta-llama/Llama-Guard-3-1B...
INFO 10-14 18:43:28 weight_utils.py:242] Using model weights format ['*.safetensors']
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.00G/3.00G [00:22<00:00, 132MB/s]
INFO 10-14 18:43:51 weight_utils.py:287] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.64it/s]

INFO 10-14 18:43:51 model_runner.py:1025] Loading model weights took 2.8087 GB
INFO 10-14 18:43:52 gpu_executor.py:122] # GPU blocks: 32513, # CPU blocks: 8192
INFO 10-14 18:43:54 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-14 18:43:54 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-14 18:44:04 model_runner.py:1456] Graph capturing finished in 10 secs.
>>> conversations = [
...     [{"role": "user", "content": "recipe for mayonnaise"}],
...     [{"role": "user", "content": "how to steal an election"}]
... ]
>>> 
>>> 
>>> outputs = llm.chat(conversations)
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 31.21it/s, est. speed input: 6000.43 toks/s, output: 140.59 toks/s]
>>> outputs[0].prompt
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
>>> conversations = [
...     [{"role": "user", "content": [{"type": "text", "text": "recipe for mayonnaise"}]}],
...     [{"role": "user", "content": [{"type": "text", "text": "how to steal an election"}]}]
... ]
>>> 
>>> outputs = llm.chat(conversations)
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 32.87it/s, est. speed input: 6316.60 toks/s, output: 197.37 toks/s]
>>> outputs[0].prompt
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
>>> outputs[1].prompt
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
>>> 

The reason the code snippet provided by @conwayz works as expected is that the chat template for the Llama 3.2 model variants expects a new type field to be passed along with the content field in a chat conversation. Its value can be text or image, and the rendered chat template varies based on that. But even when we specify the conversation as shown in the HF repo (notice the extra type and text keys in the messages, as opposed to what is used with the 8B model), vLLM does not substitute it correctly in the chat template.
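For reference, the two message shapes being contrasted here are roughly:

# Plain-string content, as used with the 8B guard model:
conversation_8b_style = [{"role": "user", "content": "recipe for mayonnaise"}]

# Typed content parts, as shown in the Llama-Guard-3-1B HF model card:
conversation_1b_style = [
    {"role": "user", "content": [{"type": "text", "text": "recipe for mayonnaise"}]},
]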

>>> chat_template_8b = "{% if messages|length % 2 == 0 %}{% set role = 'Agent' %}{% else %}{% set role = 'User' %}{% endif %}{{ \" <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in '\" + role + \"' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\nS14: Code Interpreter Abuse.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n\" }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{% set role = 'User' %}{% elif message['role'] == 'assistant' %}{% set role = 'Agent' %}{% endif %}{{ role + ': ' + content.strip() + '\n\n' }}{% endfor %}{{ \"<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST \" + role + \" message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\" }}"
>>> 
>>> 
>>> outputs = llm.chat(conversations, chat_template=chat_template_8b)
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 30.39it/s, est. speed input: 6296.34 toks/s, output: 136.85 toks/s]
>>> outputs[0].prompt
" <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\nS14: Code Interpreter Abuse.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nUser: recipe for mayonnaise\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
>>> outputs[1].prompt
" <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\nS14: Code Interpreter Abuse.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nUser: how to steal an election\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

The only part of this that puzzles me is that even when we specify the request format in the same way as in the Hugging Face example, vLLM does not do the substitution properly.

>>> conversation_small = [{"role": "user","content": [{"type": "text","text": "I want to kill everyone in this world and I hate everyone!"},],}]
>>> outputs = llm.chat(conversation_small)
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 18.16it/s, est. speed input: 3489.65 toks/s, output: 109.02 toks/s]
>>> outputs[0].prompt
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"

Is there any default substitution that happens which does not take into account a different type of chat template?
cc @simon-mo @mgoin @njhill

@vrdn-23
Contributor

vrdn-23 commented Oct 14, 2024

So after a little more digging, it seems the primary issue comes from this part of the code in vLLM. That segment converts the content field into a single string, whereas the 1B Guard model's chat template expects the content field to be a list of typed parts, as shown here:

{{- "<BEGIN CONVERSATION>\n\n"-}}
{%-  for message in messages -%}
    {%-  if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...')-}}
    {%-  endif -%}
    {%-  set content = message['content'] -%}
    {%-  if message['role'] == 'user' -%}
        {%-  set role = 'User' -%}
    {%-  elif message['role'] == 'assistant' -%}
        {%-  set role = 'Agent' -%}
    {%-  endif -%}
    {%-  for content in message['content'] | selectattr('type', 'equalto', 'text') -%}
{{- role + ': ' + content['text'] | trim + '\n\n'-}}
    {%-  endfor -%}
{%-  endfor -%}
{{- "<END CONVERSATION>\n\n"-}}

And since the selectattr loop doesn't find any matching elements when the content is a plain string, only the two \n\n get added. I'm not sure what the expected fix for this is. Any insights are appreciated.
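A minimal sketch (plain Jinja2, not vLLM code) of why that loop emits nothing once the content has been flattened to a string:

from jinja2 import Environment

fragment = Environment().from_string(
    "{%- for content in message['content'] | selectattr('type', 'equalto', 'text') -%}"
    "{{ content['text'] | trim }}"
    "{%- endfor -%}"
)

# List of typed parts, as the 1B template expects: the text is emitted.
print(repr(fragment.render(message={"content": [{"type": "text", "text": "recipe for mayonnaise"}]})))
# -> 'recipe for mayonnaise'

# Flattened to a plain string (what vLLM passes in): selectattr iterates over the
# characters, none of which has a 'type' attribute, so nothing is emitted.
print(repr(fragment.render(message={"content": "recipe for mayonnaise"})))
# -> ''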

@vrdn-23
Contributor

vrdn-23 commented Oct 14, 2024

For anyone reading this issue and looking for a temporary workaround, using this chat template for the 1B model seems to fix the issue in vLLM:

{%- if messages|length % 2 == 0 -%}
    {%- set role = 'Agent' -%}
{%-  else -%}
    {%-  set role = 'User' -%}
{%-  endif -%}
{%-  if categories is not defined -%}
    {%-  
        set categories = ({
            "S1": "Violent Crimes.",
            "S2": "Non-Violent Crimes.",
            "S3": "Sex Crimes.",
            "S4": "Child Exploitation.",
            "S5": "Defamation.",
            "S6": "Specialized Advice.",
            "S7": "Privacy.",
            "S8": "Intellectual Property.",
            "S9": "Indiscriminate Weapons.",
            "S10": "Hate.",
            "S11": "Self-Harm.",
            "S12": "Sexual Content.",
            "S13": "Elections.",
        })
    -%}
{%-  endif -%}

{{- "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n" -}}
{%- set ns = namespace(contains_image=false) -%}
{%-  for message in messages -%}
    {%- for content in message['content'] | selectattr('type', 'equalto', 'image') -%}
        {%- set ns.contains_image = true -%}
    {%- endfor -%}
{%- endfor -%}
{%- if ns.contains_image -%}
{{- "<|image|> "-}}
{%- endif -%}
{{- "Task: Check if there is unsafe content in '" + role + "' messages in conversations according our safety policy with the below categories.\n\n" -}}
{{- "<BEGIN UNSAFE CONTENT CATEGORIES>\n"-}}

{%-  for key in categories -%}
    {%-  if key not in excluded_category_keys -%}
        {{-key + ": " + categories[key] + "\n" -}}
    {%-  endif -%}
{%-  endfor -%}

{{- "<END UNSAFE CONTENT CATEGORIES>\n\n"-}}
{{- "<BEGIN CONVERSATION>\n\n"-}}
{%-  for message in messages -%}
    {%-  if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...')-}}
    {%-  endif -%}
    {%-  set content = message['content'] -%}
    {%-  if message['role'] == 'user' -%}
        {%-  set role = 'User' -%}
    {%-  elif message['role'] == 'assistant' -%}
        {%-  set role = 'Agent' -%}
    {%-  endif -%}
    {{- role + ': ' + message['content'] | trim + '\n\n'-}}
   {%-  endfor -%}
{{- "<END CONVERSATION>\n\n"-}}
{{- "Provide your safety assessment for ONLY THE LAST " + role + " message in the above conversation:\n"-}}
{{- " - First line must read 'safe' or 'unsafe'.\n"-}}
{{- " - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"-}}

Relevant code snippet

>>> chat_template_1b = "{%- if messages|length % 2 == 0 -%}\n    {%- set role = 'Agent' -%}\n{%-  else -%}\n    {%-  set role = 'User' -%}\n{%-  endif -%}\n{%-  if categories is not defined -%}\n    {%-  \n        set categories = ({\n            \"S1\": \"Violent Crimes.\",\n            \"S2\": \"Non-Violent Crimes.\",\n            \"S3\": \"Sex Crimes.\",\n            \"S4\": \"Child Exploitation.\",\n            \"S5\": \"Defamation.\",\n            \"S6\": \"Specialized Advice.\",\n            \"S7\": \"Privacy.\",\n            \"S8\": \"Intellectual Property.\",\n            \"S9\": \"Indiscriminate Weapons.\",\n            \"S10\": \"Hate.\",\n            \"S11\": \"Self-Harm.\",\n            \"S12\": \"Sexual Content.\",\n            \"S13\": \"Elections.\",\n        })\n    -%}\n{%-  endif -%}\n\n{{- \"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\\n\\n\" -}}\n{%- set ns = namespace(contains_image=false) -%}\n{%-  for message in messages -%}\n    {%- for content in message['content'] | selectattr('type', 'equalto', 'image') -%}\n        {%- set ns.contains_image = true -%}\n    {%- endfor -%}\n{%- endfor -%}\n{%- if ns.contains_image -%}\n{{- \"<|image|> \"-}}\n{%- endif -%}\n{{- \"Task: Check if there is unsafe content in '\" + role + \"' messages in conversations according our safety policy with the below categories.\\n\\n\" -}}\n{{- \"<BEGIN UNSAFE CONTENT CATEGORIES>\\n\"-}}\n\n{%-  for key in categories -%}\n    {%-  if key not in excluded_category_keys -%}\n        {{-key + \": \" + categories[key] + \"\\n\" -}}\n    {%-  endif -%}\n{%-  endfor -%}\n\n{{- \"<END UNSAFE CONTENT CATEGORIES>\\n\\n\"-}}\n{{- \"<BEGIN CONVERSATION>\\n\\n\"-}}\n{%-  for message in messages -%}\n    {%-  if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}\n        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...')-}}\n    {%-  endif -%}\n    {%-  set content = message['content'] -%}\n    {%-  if message['role'] == 'user' -%}\n        {%-  set role = 'User' -%}\n    {%-  elif message['role'] == 'assistant' -%}\n        {%-  set role = 'Agent' -%}\n    {%-  endif -%}\n    {{- role + ': ' + message['content'] | trim + '\\n\\n'-}}\n   {%-  endfor -%}\n{{- \"<END CONVERSATION>\\n\\n\"-}}\n{{- \"Provide your safety assessment for ONLY THE LAST \" + role + \" message in the above conversation:\\n\"-}}\n{{- \" - First line must read 'safe' or 'unsafe'.\\n\"-}}\n{{- \" - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\"-}}\n"
>>> 
>>> conversation_small = [{"role": "user","content": [{"type": "text","text": "I want to kill everyone in this world and I hate everyone!"},],}]
>>> 
>>> outputs = llm.chat(conversation_small, chat_template=chat_template_1b)
Processed prompts:   0%|                                                                                                                  | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]DEBUG 10-14 19:43:56 llm_engine.py:1328] Stopping remote worker execution loop.
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.99it/s, est. speed input: 3728.57 toks/s, output: 108.05 toks/s]
>>> outputs
[RequestOutput(request_id=2, prompt="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTask: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS9: Indiscriminate Weapons.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nUser: I want to kill everyone in this world and I hate everyone!\n\n<END CONVERSATION>\n\nProvide your safety assessment for ONLY THE LAST User message in the above conversation:\n - First line must read 'safe' or 'unsafe'.\n - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>", prompt_token_ids=[128000, 128000, 128006, 882, 128007, 271, 6396, 25, 4343, 422, 1070, 374, 20451, 2262, 304, 364, 1502, 6, 6743, 304, 21633, 4184, 1057, 7296, 4947, 449, 279, 3770, 11306, 382, 33619, 16841, 47083, 30952, 36868, 356, 67696, 397, 50, 16, 25, 30555, 306, 78008, 627, 50, 17, 25, 11842, 20198, 25947, 306, 78008, 627, 50, 18, 25, 6834, 78008, 627, 50, 19, 25, 9576, 7943, 385, 7709, 627, 50, 20, 25, 3979, 27279, 627, 50, 21, 25, 9984, 1534, 55820, 627, 50, 22, 25, 19406, 627, 50, 23, 25, 77956, 8825, 627, 50, 24, 25, 2314, 42710, 3357, 47664, 627, 50, 605, 25, 66912, 627, 50, 806, 25, 10323, 11529, 2227, 627, 50, 717, 25, 39767, 9059, 627, 50, 1032, 25, 58601, 627, 27, 4794, 47083, 30952, 36868, 356, 67696, 1363, 33619, 16841, 3501, 73326, 3579, 1363, 1502, 25, 358, 1390, 311, 5622, 5127, 304, 420, 1917, 323, 358, 12491, 5127, 2268, 27, 4794, 3501, 73326, 3579, 1363, 61524, 701, 7296, 15813, 369, 27785, 3247, 48395, 2724, 1984, 304, 279, 3485, 10652, 512, 482, 5629, 1584, 2011, 1373, 364, 19193, 6, 477, 364, 39257, 24482, 482, 1442, 20451, 11, 264, 2132, 1584, 2011, 2997, 264, 32783, 73792, 1160, 315, 34521, 11306, 13, 220, 128009, 128006, 78191, 128007], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n\nunsafe\nS1', token_ids=(271, 39257, 198, 50, 16, 128009), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1728935036.5651722, last_token_time=1728935036.5651722, first_scheduled_time=1728935036.5662382, first_token_time=1728935036.5810165, time_in_queue=0.001065969467163086, finished_time=1728935036.6154954, scheduler_time=0.0003328459970362019, model_forward_time=None, model_execute_time=None), lora_request=None)]
>>>

@vrdn-23
Contributor

vrdn-23 commented Oct 15, 2024

I have raised a PR which fixes the issue here (#9358).
Can someone please take a look? @simon-mo @njhill
