fix: use eos token in target tensor for instruction-tuning #3945
Conversation
# Merge input_ids and target_ids by concatenating them together.
# We remove the left padding from both input_ids and target_ids before concatenating them.
for input_id_sample, target_id_sample in zip(input_ids, target_ids):
    input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
    target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
-   target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, pad_tensor), dim=-1)
+   target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
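The merge step in the diff can be sketched with plain Python lists in place of torch tensors. `PAD_ID`, `EOS_ID`, and this `remove_left_padding` are illustrative stand-ins, not the project's actual values or helper:

```python
# Hypothetical token ids for illustration only.
PAD_ID = 0
EOS_ID = 2

def remove_left_padding(ids, pad_id=PAD_ID):
    """Strip leading pad tokens from a single sample."""
    i = 0
    while i < len(ids) and ids[i] == pad_id:
        i += 1
    return ids[i:]

def merge_sample(input_ids, target_ids):
    """Concatenate input and target, appending EOS (not PAD) to the target."""
    inp = remove_left_padding(input_ids)
    tgt = remove_left_padding(target_ids)
    tgt = tgt + [EOS_ID]  # the fix: always terminate the target with EOS
    return inp + tgt

merged = merge_sample([PAD_ID, PAD_ID, 5, 6], [PAD_ID, 7, 8])
# merged == [5, 6, 7, 8, 2]
```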
@geoffreyangus Just for my edification. Is PAD token not needed at all?
@alexsherstinsky It should always be the EOS token! Most models don't have a pad token, so we set pad to eos, and appending the "pad token" was effectively appending EOS. But for models whose tokenizers already have distinct eos and pad tokens, that behavior is wrong, since we're always supposed to append an eos token at the end.
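A minimal sketch of why appending the pad token only happened to work when pad == eos. The token ids and `TokenizerStub` class are made up for the example:

```python
class TokenizerStub:
    """Stand-in for a HF-style tokenizer exposing pad/eos ids."""
    def __init__(self, eos_token_id, pad_token_id):
        self.eos_token_id = eos_token_id
        self.pad_token_id = pad_token_id

# Common case: no real pad token, so pad is set to eos.
pad_equals_eos = TokenizerStub(eos_token_id=2, pad_token_id=2)
# Gemma-style case: distinct pad and eos tokens.
distinct_tokens = TokenizerStub(eos_token_id=1, pad_token_id=0)

def old_final_token(tok):
    return tok.pad_token_id  # pre-fix behavior: append the "pad" token

def new_final_token(tok):
    return tok.eos_token_id  # post-fix behavior: always append eos

# When pad == eos, the old behavior was accidentally correct...
assert old_final_token(pad_equals_eos) == new_final_token(pad_equals_eos)
# ...but with distinct tokens, the bug surfaces.
assert old_final_token(distinct_tokens) != new_final_token(distinct_tokens)
```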
that's correct!
Got it -- indeed, all my notebooks have tokenizer.pad_token = tokenizer.eos_token :-). Thanks for the clarification!
Left my comments, but otherwise LGTM!
tests/integration_tests/test_llm.py
Outdated
Is this supposed to be 612 or 621? And did you intend to leave these print statements?
It's now 621 -- the token count per epoch was incremented because we replaced all final PAD tokens with EOS tokens, and PAD tokens are ignored by accounting: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/accounting/used_tokens.py#L55
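The accounting effect can be sketched as follows. This `count_used_tokens` is an illustrative stand-in (with a made-up pad id) that mirrors the described behavior of excluding PAD tokens, not the actual `used_tokens.py` implementation:

```python
PAD_ID = 0  # hypothetical pad token id
EOS_ID = 2  # hypothetical eos token id

def count_used_tokens(ids, pad_id=PAD_ID):
    """Count tokens for accounting; pad tokens are excluded."""
    return sum(1 for t in ids if t != pad_id)

before = [5, 6, 7, PAD_ID]   # old behavior: target ends in PAD (not counted)
after  = [5, 6, 7, EOS_ID]   # new behavior: final PAD replaced by EOS (counted)

# Each sample now contributes one more token per epoch.
assert count_used_tokens(after) == count_used_tokens(before) + 1
```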
LGTM! Thanks!
Prior to this change, we used the PAD token at the end of the target tensor. This was okay because many of the newer LLMs were trained with pad token == eos token. With Gemma, however, there is a separate eos token. The issue is that a model fine-tuned this way never learns to produce an eos token, so generation never stops. We now use the eos token during fine-tuning so that LLMs are guaranteed to learn how to stop during the generation step.