
tokenization mismatch #7

Open
ohhan777 opened this issue Sep 1, 2024 · 4 comments

Comments


ohhan777 commented Sep 1, 2024

Thank you for sharing the great source code. I have been trying to pretrain and fine-tune with LLaMA 3.1. While the pretraining works fine, I noticed that the following warnings occur during the fine-tuning process, preventing the model from training properly:

WARNING: tokenization mismatch: 276 vs. 272. (ignored)
WARNING: tokenization mismatch: 223 vs. 219. (ignored)
WARNING: tokenization mismatch: 131 vs. 127. (ignored)
WARNING: tokenization mismatch: 915 vs. 911. (ignored)
WARNING: tokenization mismatch: 545 vs. 541. (ignored)
WARNING: tokenization mismatch: 210 vs. 206. (ignored)
WARNING: tokenization mismatch: 177 vs. 173. (ignored)
WARNING: tokenization mismatch: 183 vs. 179. (ignored)
WARNING: tokenization mismatch: 168 vs. 164. (ignored)
WARNING: tokenization mismatch: 155 vs. 151. (ignored)
WARNING: tokenization mismatch: 117 vs. 113. (ignored)
WARNING: tokenization mismatch: 781 vs. 777. (ignored)
WARNING: tokenization mismatch: 204 vs. 200. (ignored)
WARNING: tokenization mismatch: 195 vs. 191. (ignored)
WARNING: tokenization mismatch: 107 vs. 103. (ignored)
WARNING: tokenization mismatch: 334 vs. 330. (ignored)
WARNING: tokenization mismatch: 376 vs. 372. (ignored)
WARNING: tokenization mismatch: 146 vs. 142. (ignored)
WARNING: tokenization mismatch: 121 vs. 117. (ignored)

After checking the source code, I found that in the train.py file, within the preprocess_llama_3_1() function, the cur_len value becomes 4 more than it should be due to the following line of code:

cur_len = cur_len + len(tokenizer(sep, add_special_tokens=False).input_ids)

As a result, all targets are treated as IGNORE_INDEX, and the model does not train. When I commented out this line, the issue seemed to disappear, and the training worked properly. Was this line intentionally included?
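To make the failure mode concrete, here is a minimal sketch of how an over-counted cur_len leads to the whole example being ignored. The helper name and values are illustrative, not code from train.py; LLaVA-style preprocessors mask targets up to cur_len and then discard the example when cur_len disagrees with the true token count:

```python
IGNORE_INDEX = -100  # label value the loss function skips

def mask_targets(target, cur_len, total_len):
    """Hypothetical sketch of LLaVA-style target masking.

    target:    list of label token ids for one conversation
    cur_len:   token count accumulated while walking the turns
    total_len: actual number of tokens in the sequence
    """
    # Everything past the accumulated length is masked.
    target[cur_len:] = [IGNORE_INDEX] * max(0, len(target) - cur_len)
    if cur_len != total_len:
        # This is the warning seen in the logs; the whole example is
        # then masked, so it contributes no gradient signal.
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
        target[:] = [IGNORE_INDEX] * len(target)
    return target

# With cur_len over-counted by 4 (the extra sep tokens), every label
# becomes IGNORE_INDEX and the model cannot learn from the example.
labels = list(range(10))
masked = mask_targets(labels, cur_len=14, total_len=10)
```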

@sahilqure
@ohhan777 Can you send me the logs after commenting it out?

@sahil02235
@federico1-creator This is not solved even after commenting out that line. Can you look into it?

Collaborator

federico1-creator commented Sep 18, 2024

Hi everyone, thank you for your interest in our project!

We have conducted some tests to better understand the differences in behavior between the code we're running and the tokenization mismatch issue you mentioned.
The problem is the LLaMA 3.1 tokenizer, which was updated by the Meta team.
This update creates a mismatch between the version we used during development and the one you are currently using.

To fix this issue, you can use our tokenizer, which is included in the LLaVA-MORE weights.
Specifically, I have already updated the training scripts to use the new TOKENIZER_PATH:

https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/11_pretrain_llama_31_acc_st_1.sh
https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/12_finetuning_llama_31_acc_st_1.sh
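A quick way to verify this kind of tokenizer drift is to tokenize the same prompts with both versions and compare token counts. This is a minimal sketch, not code from the repo; the helper name is hypothetical, and in practice the two callables would wrap tokenizers loaded via AutoTokenizer.from_pretrained, e.g. lambda s: tok(s, add_special_tokens=False).input_ids:

```python
def token_count_drift(tokenize_a, tokenize_b, texts):
    """Return (text, count_a, count_b) for every text where the two
    tokenizer callables disagree on the number of tokens."""
    drift = []
    for text in texts:
        count_a = len(tokenize_a(text))
        count_b = len(tokenize_b(text))
        if count_a != count_b:
            drift.append((text, count_a, count_b))
    return drift
```

A non-empty result on your fine-tuning prompts would reproduce the constant 4-token offset seen in the warnings above.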

@ohhan777 @sahilqure @sahil02235

@sahil02235
@federico1-creator Thanks for this, I will check it.
