fix: use eos token in target tensor for instruction-tuning #3945
Conversation
# Merge input_ids and target_ids by concatenating them together.
# We remove the left padding from both input_ids and target_ids before concatenating them.
for input_id_sample, target_id_sample in zip(input_ids, target_ids):
    input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
    target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
-   target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, pad_tensor), dim=-1)
+   target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
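The merge step in the diff can be sketched with plain Python lists in place of torch tensors. `PAD_ID`, `EOS_ID`, and this `remove_left_padding` are illustrative stand-ins, not the project's actual values or helper:

```python
# Hypothetical token ids for illustration only.
PAD_ID = 0
EOS_ID = 2

def remove_left_padding(ids, pad_id=PAD_ID):
    """Strip leading pad tokens from a single sample."""
    i = 0
    while i < len(ids) and ids[i] == pad_id:
        i += 1
    return ids[i:]

def merge_sample(input_ids, target_ids):
    """Concatenate input and target, appending EOS (not PAD) to the target."""
    inp = remove_left_padding(input_ids)
    tgt = remove_left_padding(target_ids)
    tgt = tgt + [EOS_ID]  # the fix: always terminate the target with EOS
    return inp + tgt

merged = merge_sample([PAD_ID, PAD_ID, 5, 6], [PAD_ID, 7, 8])
# merged == [5, 6, 7, 8, 2]
```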
@geoffreyangus Just for my edification. Is PAD token not needed at all?
@alexsherstinsky It should always be the EOS token! Most models don't have a pad token, so we set pad to eos, and appending the "pad token" was effectively appending EOS. But for models whose tokenizers already have distinct eos and pad tokens, that behavior is wrong, since we're always supposed to append an eos token at the end.
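A minimal sketch of why appending the pad token only happened to work when pad == eos. The token ids and `TokenizerStub` class are made up for the example:

```python
class TokenizerStub:
    """Stand-in for a HF-style tokenizer exposing pad/eos ids."""
    def __init__(self, eos_token_id, pad_token_id):
        self.eos_token_id = eos_token_id
        self.pad_token_id = pad_token_id

# Common case: no real pad token, so pad is set to eos.
pad_equals_eos = TokenizerStub(eos_token_id=2, pad_token_id=2)
# Gemma-style case: distinct pad and eos tokens.
distinct_tokens = TokenizerStub(eos_token_id=1, pad_token_id=0)

def old_final_token(tok):
    return tok.pad_token_id  # pre-fix behavior: append the "pad" token

def new_final_token(tok):
    return tok.eos_token_id  # post-fix behavior: always append eos

# When pad == eos, the old behavior was accidentally correct...
assert old_final_token(pad_equals_eos) == new_final_token(pad_equals_eos)
# ...but with distinct tokens, the bug surfaces.
assert old_final_token(distinct_tokens) != new_final_token(distinct_tokens)
```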
that's correct!
Got it -- indeed, all my notebooks have tokenizer.pad_token = tokenizer.eos_token :-). Thanks for the clarification!
Left my comments, but otherwise LGTM!
tests/integration_tests/test_llm.py
Outdated
Is this supposed to be 612 or 621? And did you intend to leave these print statements?
It's now 621 -- the token count per epoch was incremented because we replaced all final PAD tokens with EOS tokens, and PAD tokens are ignored by accounting: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/accounting/used_tokens.py#L55
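The accounting effect can be sketched as follows. This `count_used_tokens` is an illustrative stand-in (with a made-up pad id) that mirrors the described behavior of excluding PAD tokens, not the actual `used_tokens.py` implementation:

```python
PAD_ID = 0  # hypothetical pad token id
EOS_ID = 2  # hypothetical eos token id

def count_used_tokens(ids, pad_id=PAD_ID):
    """Count tokens for accounting; pad tokens are excluded."""
    return sum(1 for t in ids if t != pad_id)

before = [5, 6, 7, PAD_ID]   # old behavior: target ends in PAD (not counted)
after  = [5, 6, 7, EOS_ID]   # new behavior: final PAD replaced by EOS (counted)

# Each sample now contributes one more token per epoch.
assert count_used_tokens(after) == count_used_tokens(before) + 1
```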
LGTM! Thanks!
Prior to this change, we used the PAD token at the end of the target tensor. This was okay because many of the newer LLMs were trained with pad token == eos token. With Gemma, however, there is a separate eos token. The issue is that a model fine-tuned this way never learns to produce an eos token, so generation never stops. We now use the eos token during fine-tuning so that LLMs are guaranteed to learn how to stop during the generation step.