
fix: use eos token in target tensor for instruction-tuning #3945

Merged: 4 commits merged into master from fix-eos-token on Feb 27, 2024

Conversation

geoffreyangus (Contributor)

Prior to this change, we appended the pad token to the end of the target tensor. That was fine because many recent LLMs are trained with pad token == eos token. Gemma, however, has a separate eos token, so a model fine-tuned this way never learns to emit eos and generation never stops. We now append the eos token during fine-tuning so that LLMs are guaranteed to learn how to stop during generation.
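
To make the failure mode concrete, here is a hedged sketch (not code from this PR; the checkpoint name and generation arguments are illustrative assumptions): Hugging Face generate() stops early only when the model emits eos_token_id, so a model that never learned to produce eos runs to max_new_tokens every time.

# Hedged sketch, not part of this PR: why a model that never emits EOS won't stop.
# The checkpoint name and generation arguments are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

inputs = tokenizer("Summarize: Ludwig is a declarative ML framework.", return_tensors="pt")
# generate() terminates early only if the model produces eos_token_id;
# otherwise it runs all the way to max_new_tokens.
outputs = model.generate(**inputs, max_new_tokens=128, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))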


  # Merge input_ids and target_ids by concatenating them together.
  # We remove the left padding from both input_ids and target_ids before concatenating them.
  for input_id_sample, target_id_sample in zip(input_ids, target_ids):
      input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
      target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
-     target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, pad_tensor), dim=-1)
+     target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
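
For context, here is a minimal sketch of how the eos_tensor appended above might be constructed (an illustrative assumption; the actual tensor construction in Ludwig is not shown in this excerpt):

# Hedged sketch: building the single-element EOS tensor appended to each target.
# The exact construction in Ludwig may differ; this is illustrative only.
import torch

eos_tensor = torch.tensor([tokenizer.eos_token_id], dtype=torch.int64)
# The tensor it replaces used the pad id, which equals the eos id for many
# models but not for Gemma:
pad_tensor = torch.tensor([tokenizer.pad_token_id], dtype=torch.int64)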
Collaborator

@geoffreyangus Just for my edification. Is PAD token not needed at all?

Contributor

@alexsherstinsky It should always be the EOS token! Most models don't have a pad token, so we set pad to the eos token, and appending the "pad token" was effectively appending EOS. But for models whose tokenizers already define both an eos token and a pad token that differ, this was wrong, since we're always supposed to append an eos token at the end.
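
For example (a hedged illustration; the model names are assumptions and token ids can differ across checkpoint revisions), Gemma's tokenizer defines distinct pad and eos tokens, while many Llama-style tokenizers define no pad token at all and rely on the common pad == eos workaround:

# Hedged illustration: inspecting pad vs. eos ids for two tokenizers.
# Model names are assumptions; ids may differ across checkpoint revisions.
from transformers import AutoTokenizer

gemma_tok = AutoTokenizer.from_pretrained("google/gemma-2b")
print(gemma_tok.pad_token_id, gemma_tok.eos_token_id)  # distinct ids for Gemma

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(llama_tok.pad_token_id)               # None: no pad token defined
llama_tok.pad_token = llama_tok.eos_token   # common workaround: pad == eos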

Contributor Author

that's correct!

Collaborator

Got it -- indeed, all my notebooks have tokenizer.pad_token = tokenizer.eos_token :-) -- thanks for the clarification!


github-actions bot commented Feb 27, 2024

Unit Test Results

 4 files ±0      4 suites ±0    9m 29s ⏱️ (-17m 37s)
12 tests (-2,972): 9 passed (-2,962), 3 skipped (-9), 0 failed (-1)
40 runs (-2,960): 28 passed (-2,953), 12 skipped (-6), 0 failed (-1)

Results for commit eaac1e4. Comparison against base commit d347063.

♻️ This comment has been updated with latest results.

@Infernaught (Contributor) left a comment

Left my comments, but otherwise LGTM!

Contributor

Is this supposed to be 612 or 621? And did you intend to leave these print statements?

Contributor Author

It's now 621 -- the token count per epoch was incremented by 1 because we replaced all final PAD tokens with EOS tokens (PAD tokens are ignored by accounting: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/accounting/used_tokens.py#L55).
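
A minimal sketch of that accounting idea (the function name and signature here are assumptions, not the exact code in used_tokens.py): padding positions are excluded from the count, so an appended EOS token is counted while an appended PAD token is not.

# Hedged sketch of the accounting idea (not the exact Ludwig implementation):
# padding is excluded, so replacing a trailing PAD with EOS adds one counted token.
import torch

def count_used_tokens(target_ids: torch.Tensor, pad_token_id: int) -> int:
    # Count every position that is not padding.
    return int((target_ids != pad_token_id).sum().item())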

@alexsherstinsky (Collaborator) left a comment

LGTM! Thanks!

@geoffreyangus merged commit 021a099 into master on Feb 27, 2024
15 of 18 checks passed
@geoffreyangus deleted the fix-eos-token branch on February 27, 2024 at 21:48