
Llama3 finetuning and generation: Double begin_of_text, no eot_id #1682

Open · sanderland opened this issue Aug 20, 2024 · 9 comments
Labels: bug (Something isn't working)

Comments

@sanderland (Contributor) commented Aug 20, 2024

Bug description

When finetuning Llama3, the encoded data starts with a duplicated <|begin_of_text|> token and does not end with an <|eot_id|>.

Seems related to #1565, but may be more widespread across models.

Going by the example, which downloads the alpaca finance dataset:

litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
  --config configs/llama31-8b.yaml \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --data.mask_prompt True \
  --data.prompt_style llama3 \
  --data.val_split_fraction 0.05
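
For reproduction, a minimal sketch of what my_custom_dataset.json might contain (an assumption here: litgpt's JSON loader takes Alpaca-style records with instruction/input/output keys; the exact file isn't included in this issue):

import json

# Assumed Alpaca-style record shape for --data JSON (instruction/input/output).
records = [
    {
        "instruction": "Recommend a movie for me to watch during the weekend and explain the reason.",
        "input": "",
        "output": "...",
    }
]
with open("my_custom_dataset.json", "w") as f:
    json.dump(records, f, indent=2)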

Then, adding this to full.py along with support for skip_special_tokens=False:

        if fabric.global_rank == 0 and state["iter_num"] == 1:
            non_pad_ids = input_ids[0][input_ids[0] != 0] # assume pad token id is 0
            fabric.print(f"First row of input ids with total shape {input_ids.shape}: {non_pad_ids}")
            fabric.print(f"Detokenized: {tokenizer.decode(non_pad_ids, skip_special_tokens=False)}")

gives

First row of input ids with total shape torch.Size([4, 765]): tensor([128000, 128000, 128006,   9125, 128007,    271,   264, [...] 459,   9341,     13]
Detokenized: <|begin_of_text|><|begin_of_text|><|start_header_id|> [..] accurate valuation of an investment.
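
A minimal sketch of a check for both symptoms on one encoded row (assuming the Llama 3 ids 128000 for <|begin_of_text|> and 128009 for <|eot_id|>, and pad id 0 as in the debug print above):

import torch

def check_row(input_ids: torch.Tensor, bos_id: int = 128000, eot_id: int = 128009, pad_id: int = 0) -> None:
    # Drop padding, mirroring the debug print above (pad token id assumed 0).
    row = input_ids[input_ids != pad_id]
    # Symptom 1: two <|begin_of_text|> tokens at the start.
    if len(row) >= 2 and row[0] == bos_id and row[1] == bos_id:
        print("duplicated <|begin_of_text|> at the start")
    # Symptom 2: no <|eot_id|> anywhere in the row.
    if not (row == eot_id).any():
        print("no <|eot_id|> in the sequence")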

What operating system are you using?

Unknown

LitGPT Version

(close to) main

@sanderland added the bug ("Something isn't working") label Aug 20, 2024
@rasbt (Collaborator) commented Aug 20, 2024

Thanks for raising that. I need to investigate in the next few days.

@rasbt (Collaborator) commented Aug 21, 2024

When you mentioned

> (close to) main

could you check the version? I'm asking because I don't think that skip_special_tokens is a valid argument.

@sanderland (Contributor, Author)

> When you mentioned "(close to) main", could you check the version? I'm asking because I don't think that skip_special_tokens is a valid argument.

version = "0.4.10", but when I said

> adding this to full.py along with support for skip_special_tokens=False

I meant that I added that option myself to help debug.

@rasbt (Collaborator) commented Aug 21, 2024

Ah yes, the reason I was asking is that I was getting a

TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'

and I was wondering where you applied this.

@sanderland (Contributor, Author)

You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1

@rasbt (Collaborator) commented Aug 23, 2024

Ah, thanks! I still don't understand why this wouldn't work for me; I keep getting TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'. Need to investigate more (maybe a version issue).

Anyway, I just double-checked the generate_example function. For the prompt

What food do llamas eat?

the actual prompt that is passed to the tokenizer during finetuning looks like this with the default Alpaca style:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:

and like this with the --data.prompt_style llama3 you were using:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Recommend a movie for me to watch during the weekend and explain the reason.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

So that part at least looks all ok to me.
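
For reference, the llama3-style prompt above can be reproduced with plain string formatting (a minimal sketch, not necessarily litgpt's implementation). Note that the literal <|begin_of_text|> is part of the prompt string itself, which matters for the duplication discussed below:

def llama3_style(prompt: str, system: str = "You are a helpful assistant.") -> str:
    # Mirrors the template printed above; the leading <|begin_of_text|>
    # is baked into the prompt string.
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )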

@sanderland (Contributor, Author) commented Aug 23, 2024

skip_special_tokens is a parameter in Hugging Face tokenizers, but not in litgpt; I just added the pass-through to debug.

As for your prompt being correct: that doesn't mean the result of encode() is, too:

from tokenizers import Tokenizer as HFTokenizer
processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
processor.encode("prompt").ids # [128000, 41681] = "<|begin_of_text|>" , "prompt"

That is, the tokenizer itself applies a template that adds "<|begin_of_text|>".
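
Extending that example (a sketch; it assumes the Llama 3 tokenizer also matches the special token when it appears literally in the input text, which is what would produce the duplication):

from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# The post-processor template prepends <|begin_of_text|> on its own:
processor.encode("prompt").ids
# [128000, 41681]

# So if the prompt string already contains the literal token, you get two:
processor.encode("<|begin_of_text|>prompt").ids
# expected: [128000, 128000, 41681]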

@sanderland (Contributor, Author)

This is another confusing point:
https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L91
The tokenizer has special logic to add a BOS token for llama3, but both the Hugging Face tokenizer AND the prompt template already add one. At least it checks first, so it doesn't end up with three.
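
A sketch of what that check amounts to (not litgpt's actual code): prepend bos_id only when the ids don't already start with it. This prevents a third <|begin_of_text|>, but leaves the two produced by the post-processor template plus the literal token in the prompt string:

def add_bos_once(ids: list[int], bos_id: int = 128000) -> list[int]:
    # Prepend BOS only if it isn't already there, so at most one is added here.
    if not ids or ids[0] != bos_id:
        ids = [bos_id] + ids
    return ids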

@calvintwr

Actually, I am curious how finetuning can work at all right now, given #1699.
