
chore(deps): upgrade trl and transformers #448

Merged
merged 11 commits, Feb 13, 2025

Conversation

@willmj commented Jan 23, 2025 (Collaborator)

Description of the change

As of trl v0.13.0, requirements.txt no longer restricts the transformers version.
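One way to see this locally is to inspect the installed trl distribution's declared requirements with the standard library (a sketch, not part of this PR; trl may not be installed in every environment):

```python
from importlib import metadata

def dependency_specs(package):
    """Return a distribution's declared dependency specifiers, or None if not installed."""
    try:
        return metadata.requires(package)
    except metadata.PackageNotFoundError:
        return None

specs = dependency_specs("trl")
if specs is None:
    print("trl is not installed here")
else:
    # In trl v0.13.0 the transformers entry should carry no hard upper pin.
    print([s for s in specs if s.startswith("transformers")])
```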

Related issue number

More context

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the chore label Jan 23, 2025
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
pyproject.toml Outdated
@@ -33,7 +33,7 @@ dependencies = [
"sentencepiece>=0.1.99,<0.3",
"tokenizers>=0.13.3,<1.0",
"tqdm>=4.66.2,<5.0",
"trl>=0.9.3,<0.12",
"trl>=0.9.3,<1.0",
Collaborator

@willmj do we want to probably lock the version to higher than one which supports packing for pretokenized datasets which we enabled along with this patch.

Collaborator Author

@dushyantbehl I thought the latest version was 0.13.0, so this would include that. Feel free to make a suggested change if I'm misunderstanding.

Collaborator

Suggested change
"trl>=0.9.3,<1.0",
"trl>=0.13.0,<1.0",

@dushyantbehl is this your suggestion?
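The difference between the two specifiers can be checked with a small pure-Python sketch (simple dotted release versions only; real dependency resolution uses the packaging library's PEP 440 rules):

```python
def parse(version):
    # "0.13.0" -> (0, 13, 0); sufficient for plain release versions
    return tuple(int(part) for part in version.split("."))

def satisfies(version, lower, upper):
    # Half-open range, i.e. lower <= version < upper,
    # matching a ">=lower,<upper" specifier.
    return parse(lower) <= parse(version) < parse(upper)

for v in ["0.9.3", "0.12.1", "0.13.0", "0.15.0"]:
    print(v, satisfies(v, "0.13.0", "1.0.0"))
```

Under the suggested `>=0.13.0,<1.0` pin, 0.9.3 and 0.12.1 are excluded while 0.13.0 and later 0.x releases are allowed.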

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@Abhishek-TAMU commented Jan 23, 2025 (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

@willmj commented Jan 23, 2025 (Collaborator, Author)

Upgrading to transformers >=4.46 depends on fms-acceleration #98; we also have to update SFTConfig correctly.

@dushyantbehl (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

Correct @Abhishek-TAMU, that's what I wanted to highlight too. Thanks.

@dushyantbehl commented Jan 23, 2025 (Collaborator)

Upgrading to transformers >=4.46 depends on fms-acceleration #98; we also have to update SFTConfig correctly.

I see. Good to note both here; we can update this later, once our repo works with the new transformers. Thanks.

@dushyantbehl (Collaborator)

@anhuong thanks for the quick fix on foundation-model-stack/fms-acceleration#123

Does this mean we can go ahead and move forward with this change? cc @willmj @fabianlim

@Abhishek-TAMU (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

Does this mean we can go ahead and move forward with this change?

Since both the transformers and trl versions are being updated, it would also be good to run one automated e2e test (for all models) to confirm everything works with the upgraded versions. @willmj

willmj and others added 4 commits February 10, 2025 11:59
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@Abhishek-TAMU (Collaborator)

@willmj

All 41 failed tests would be completely resolved by these 3 changes:

1- Setting gradient_accumulation_steps = 1 here in test_sft_trainer.py.

Since the training dataset has 10 samples and batch_size is 4, with gradient_accumulation_steps = 4 not all optimizer steps run; training does not complete even 1 of the 5 epochs, so no training_log JSON file is generated (hence the not-found errors in the unit tests).

2- Same reasoning, to change gradient_accumulation_steps = 1 in test_launch_script.py

3- Use _get_checkpoint_path to get the last checkpoint instead of hardcoding checkpoint-5, as not every tuning run produces that checkpoint. Something like this:

import os
import re

def _get_checkpoint_path(dir_path):
    # Collect subdirectories named like "checkpoint-<step>".
    checkpoint_dirs = [
        d for d in os.listdir(dir_path)
        if os.path.isdir(os.path.join(dir_path, d)) and re.match(r'^checkpoint-\d+$', d)
    ]
    # Sort numerically by step and return the latest checkpoint.
    checkpoint_dirs.sort(key=lambda name: int(name.split('-')[-1]))
    return os.path.join(dir_path, checkpoint_dirs[-1])
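The optimizer-step arithmetic behind fixes 1 and 2 can be sketched as follows (a simplification; transformers' exact step accounting may differ at the margins):

```python
import math

def optimizer_steps_per_epoch(num_samples, batch_size, grad_accum):
    # Micro-batches seen per epoch, then how many of them
    # accumulate into a single optimizer step.
    micro_batches = math.ceil(num_samples / batch_size)
    return math.ceil(micro_batches / grad_accum)

# Test setup from the comment above: 10 samples, batch size 4.
print(optimizer_steps_per_epoch(10, 4, 4))  # 1 optimizer step per epoch
print(optimizer_steps_per_epoch(10, 4, 1))  # 3 optimizer steps per epoch
```

With gradient_accumulation_steps = 4 the whole epoch collapses into a single optimizer step, which is why dropping it to 1 restores the expected training logs.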

Co-authored-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj (Collaborator, Author)

willmj commented Feb 12, 2025

The new release of fms-acceleration unblocks transformers 4.46. Should we go ahead and merge?
@anhuong @dushyantbehl @Abhishek-TAMU

@dushyantbehl (Collaborator)

@willmj sure, let's go ahead. Do you want to merge the change that enables packing for pretokenized datasets as well, or should we keep it separate?
I would prefer they be merged as separate PRs, but around the same time.

@dushyantbehl commented Feb 13, 2025 (Collaborator)

@willmj I applied the test-case fixes suggested by @Abhishek-TAMU above and have opened a PR reverting the packing restrictions for pretokenized datasets: #468.
It is based on your PR. I added a few unit tests for packing with pretokenized datasets, which currently fail; I am working on fixing them, but @Abhishek-TAMU's suggestions above work for this PR.

@anhuong anhuong changed the title chore(deps): revert trl restriction chore(deps): upgrade trl and transformers Feb 13, 2025
@anhuong commented Feb 13, 2025 (Collaborator)

Running a few tuning tests with the updated versions, but the changes look good to me. Once my tests finish, I will note the results in this PR.

@anhuong commented Feb 13, 2025 (Collaborator)

Full fine-tuning and LoRA tuning ran fine on a minimal example. We will have to test how these changes affect train_runtime and model quality in the next release though @Abhishek-TAMU

anhuong previously approved these changes Feb 13, 2025

@anhuong commented Feb 13, 2025 (Collaborator)

Hmm, but the changes from main broke the unit tests, so I will investigate...

Signed-off-by: Anh Uong <anh.uong@ibm.com>
@anhuong commented Feb 13, 2025 (Collaborator)

A new version of trl, v0.15.0, was released 3 hours ago and release notes are not yet available. The upgrade broke the unit tests, so I am setting the upper limit to below 0.15.0 for now.

Noting the error here: training failed in the training loop when running trainer.train():

0%|          | 0/15 [00:00<?, ?it/s]ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/tuning/sft_trainer.py", line 676, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/tuning/sft_trainer.py", line 420, in train
    trainer.train(resume_from_checkpoint)
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 2171, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 3675, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 453, in compute_loss
    accuracy = compute_token_accuracy(shift_logits, shift_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/trl/trainer/utils.py", line 1664, in compute_token_accuracy
    correct_predictions = (predictions == labels) & mask
                           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (72) must match the size of tensor b (64) at non-singleton dimension 1
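The RuntimeError comes from an elementwise comparison of tensors whose sequence dimensions differ (72 shifted-logit positions vs 64 label positions). A minimal sketch of the broadcasting rule that rejects this (not trl code, just the rule PyTorch applies):

```python
def can_broadcast(shape_a, shape_b):
    # PyTorch/NumPy rule: trailing dimensions must be equal,
    # or one of them must be 1.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(can_broadcast((4, 72), (4, 64)))  # False -> the RuntimeError above
print(can_broadcast((4, 64), (4, 64)))  # True
```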

@Abhishek-TAMU left a comment (Collaborator)

Changes to upgrade the TRL and Transformers versions LGTM! Thanks for the PR @willmj.
Thanks @anhuong for testing it with a minimal example.

@willmj willmj merged commit 2f033c7 into foundation-model-stack:main Feb 13, 2025
9 checks passed