
tokenization mismatch #7

Open
ohhan777 opened this issue Sep 1, 2024 · 4 comments

Comments


ohhan777 commented Sep 1, 2024

Thank you for sharing the great source code. I have been trying to pretrain and fine-tune with LLaMA 3.1. While the pretraining works fine, I noticed that the following warnings occur during the fine-tuning process, preventing the model from training properly:

WARNING: tokenization mismatch: 276 vs. 272. (ignored)
WARNING: tokenization mismatch: 223 vs. 219. (ignored)
WARNING: tokenization mismatch: 131 vs. 127. (ignored)
WARNING: tokenization mismatch: 915 vs. 911. (ignored)
WARNING: tokenization mismatch: 545 vs. 541. (ignored)
WARNING: tokenization mismatch: 210 vs. 206. (ignored)
WARNING: tokenization mismatch: 177 vs. 173. (ignored)
WARNING: tokenization mismatch: 183 vs. 179. (ignored)
WARNING: tokenization mismatch: 168 vs. 164. (ignored)
WARNING: tokenization mismatch: 155 vs. 151. (ignored)
WARNING: tokenization mismatch: 117 vs. 113. (ignored)
WARNING: tokenization mismatch: 781 vs. 777. (ignored)
WARNING: tokenization mismatch: 204 vs. 200. (ignored)
WARNING: tokenization mismatch: 195 vs. 191. (ignored)
WARNING: tokenization mismatch: 107 vs. 103. (ignored)
WARNING: tokenization mismatch: 334 vs. 330. (ignored)
WARNING: tokenization mismatch: 376 vs. 372. (ignored)
WARNING: tokenization mismatch: 146 vs. 142. (ignored)
WARNING: tokenization mismatch: 121 vs. 117. (ignored)

After checking the source code, I found that in the train.py file, within the preprocess_llama_3_1() function, the cur_len value becomes 4 more than it should be due to the following line of code:

cur_len = cur_len + len(tokenizer(sep, add_special_tokens=False).input_ids)

As a result, all targets are treated as IGNORE_INDEX, and the model does not train. When I commented out this line, the issue seemed to disappear, and the training worked properly. Was this line intentionally included?
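To make the failure mode concrete, here is a minimal sketch of how an over-counted cur_len leads to the whole example being ignored. The helper name and values are illustrative, not code from train.py; LLaVA-style preprocessors mask targets up to cur_len and then discard the example when cur_len disagrees with the true token count:

```python
IGNORE_INDEX = -100  # label value the loss function skips

def mask_targets(target, cur_len, total_len):
    """Hypothetical sketch of LLaVA-style target masking.

    target:    list of label token ids for one conversation
    cur_len:   token count accumulated while walking the turns
    total_len: actual number of tokens in the sequence
    """
    # Everything past the accumulated length is masked.
    target[cur_len:] = [IGNORE_INDEX] * max(0, len(target) - cur_len)
    if cur_len != total_len:
        # This is the warning seen in the logs; the whole example is
        # then masked, so it contributes no gradient signal.
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
        target[:] = [IGNORE_INDEX] * len(target)
    return target

# With cur_len over-counted by 4 (the extra sep tokens), every label
# becomes IGNORE_INDEX and the model cannot learn from the example.
labels = list(range(10))
masked = mask_targets(labels, cur_len=14, total_len=10)
```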

@sahilqure
@ohhan777 Can you send me the logs after commenting it out?

@sahil02235
@federico1-creator This is not solved even after commenting out that line. Can you look into it?

Collaborator

federico1-creator commented Sep 18, 2024

Hi everyone, thank you for your interest in our project!

We have conducted some tests to better understand the differences in behavior between the code we're running and the tokenization mismatch issue you mentioned.
The problem is the LLaMA 3.1 tokenizer, which was updated by the Meta team.
This update creates a mismatch between the version we used during development and the one you are currently using.

To fix this issue, you can use our tokenizer, which is included in the LLaVA-MORE weights.
Specifically, I have already updated the training scripts to use the new TOKENIZER_PATH:

https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/11_pretrain_llama_31_acc_st_1.sh
https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/12_finetuning_llama_31_acc_st_1.sh
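A quick way to verify this kind of tokenizer drift is to tokenize the same prompts with both versions and compare token counts. This is a minimal sketch, not code from the repo; the helper name is hypothetical, and in practice the two callables would wrap tokenizers loaded via AutoTokenizer.from_pretrained, e.g. lambda s: tok(s, add_special_tokens=False).input_ids:

```python
def token_count_drift(tokenize_a, tokenize_b, texts):
    """Return (text, count_a, count_b) for every text where the two
    tokenizer callables disagree on the number of tokens."""
    drift = []
    for text in texts:
        count_a = len(tokenize_a(text))
        count_b = len(tokenize_b(text))
        if count_a != count_b:
            drift.append((text, count_a, count_b))
    return drift
```

A non-empty result on your fine-tuning prompts would reproduce the constant 4-token offset seen in the warnings above.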

@ohhan777 @sahilqure @sahil02235

@sahil02235
@federico1-creator Thanks for this, I will check it.
