
[New Model]: Llama 3 8B Instruct #4297

Closed
K-Mistele opened this issue Apr 23, 2024 · 3 comments
Labels
new model

Comments

@K-Mistele
Contributor

The model to consider.

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

The closest model vLLM already supports.

Llama 1 & 2

What's your difficulty of supporting the model you want?

Llama 3 Instruct requires a different stop token than the one specified in the tokenizer.json file.
The tokenizer.json specifies <|end_of_text|> as the end-of-string token, which works for the base Llama 3 model, but it is not the right token for the instruct tune. The instruct tune uses <|eot_id|>.

You can see this in the inference code on the Llama 3 8B Instruct model card, where this token is added to the list of terminators:

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# HERE is where the `<|eot_id|>` token, which is not the default end-of-string token, is added to the list of terminators.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

Here is a discussion of this topic in the llama.cpp repository:
ggerganov/llama.cpp#6751
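
For illustration, here is a minimal sketch of how the extra terminator could be supplied to vLLM at request time, assuming SamplingParams accepts a stop_token_ids argument; the sampling values mirror the model card, and this is a workaround sketch rather than a confirmed fix:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Look up the instruct-tune terminator id from the tokenizer instead of hardcoding it.
tokenizer = AutoTokenizer.from_pretrained(model_id)
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

llm = LLM(model=model_id)
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
    stop_token_ids=[eot_id],  # stop on <|eot_id|> in addition to the default EOS token
)

# Build the chat prompt with the model's chat template, as on the model card.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)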

K-Mistele added the new model label on Apr 23, 2024
@agt
Contributor

agt commented Apr 23, 2024

This will be resolved via #4182, which has been merged and will be released with 0.4.1.

In the meantime, #4180 has suggestions for workarounds, including manually editing config.json to set the correct stop token.
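
As an illustrative aside (not from #4180 itself), the token id that workaround would point config.json at can be looked up rather than hardcoded:

from transformers import AutoTokenizer

# The id printed here is the value the config.json workaround would use as eos_token_id.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tokenizer.convert_tokens_to_ids("<|eot_id|>"))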

@JPonsa

JPonsa commented May 24, 2024

@agt is this still open? Is Llama 3 still not supported in vLLM?

@hmellor
Collaborator

hmellor commented May 31, 2024

Llama-3 is supported.

hmellor closed this as completed on May 31, 2024