
chore(deps): upgrade trl and transformers #448

Merged
merged 11 commits, Feb 13, 2025

Conversation

@willmj commented Jan 23, 2025 (Collaborator)

Description of the change

As of trl v0.13.0, requirements.txt no longer restricts the transformers version.
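One way to see this locally is to inspect the installed trl distribution's declared requirements with the standard library (a sketch, not part of this PR; trl may not be installed in every environment):

```python
from importlib import metadata

def dependency_specs(package):
    """Return a distribution's declared dependency specifiers, or None if not installed."""
    try:
        return metadata.requires(package)
    except metadata.PackageNotFoundError:
        return None

specs = dependency_specs("trl")
if specs is None:
    print("trl is not installed here")
else:
    # In trl v0.13.0 the transformers entry should carry no hard upper pin.
    print([s for s in specs if s.startswith("transformers")])
```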

Related issue number

More context

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the chore label Jan 23, 2025
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
pyproject.toml Outdated
@@ -33,7 +33,7 @@ dependencies = [
"sentencepiece>=0.1.99,<0.3",
"tokenizers>=0.13.3,<1.0",
"tqdm>=4.66.2,<5.0",
"trl>=0.9.3,<0.12",
"trl>=0.9.3,<1.0",
Collaborator

@willmj do we want to probably lock the version to higher than one which supports packing for pretokenized datasets which we enabled along with this patch.

Collaborator Author

@dushyantbehl I thought the latest version was 0.13.0, so this would include that. Feel free to make a suggested change if I'm misunderstanding.

Collaborator

Suggested change
"trl>=0.9.3,<1.0",
"trl>=0.13.0,<1.0",

@dushyantbehl is this your suggestion?
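The difference between the two specifiers can be checked with a small pure-Python sketch (simple dotted release versions only; real dependency resolution uses the packaging library's PEP 440 rules):

```python
def parse(version):
    # "0.13.0" -> (0, 13, 0); sufficient for plain release versions
    return tuple(int(part) for part in version.split("."))

def satisfies(version, lower, upper):
    # Half-open range, i.e. lower <= version < upper,
    # matching a ">=lower,<upper" specifier.
    return parse(lower) <= parse(version) < parse(upper)

for v in ["0.9.3", "0.12.1", "0.13.0", "0.15.0"]:
    print(v, satisfies(v, "0.13.0", "1.0.0"))
```

Under the suggested `>=0.13.0,<1.0` pin, 0.9.3 and 0.12.1 are excluded while 0.13.0 and later 0.x releases are allowed.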

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@Abhishek-TAMU commented Jan 23, 2025 (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

@willmj commented Jan 23, 2025 (Collaborator, Author)

Upgrading to transformers >=4.46 depends on fms-acceleration #98; we also have to update SFTConfig correctly.

@dushyantbehl (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

Correct @Abhishek-TAMU, that's what I wanted to highlight too. Thanks.

@dushyantbehl commented Jan 23, 2025 (Collaborator)

Upgrading to transformers >=4.46 depends on fms-acceleration #98; we also have to update SFTConfig correctly.

I see. Good to note both here; we can update this later, once our repo works with the new transformers. Thanks.

@dushyantbehl (Collaborator)

@anhuong thanks for the quick fix on foundation-model-stack/fms-acceleration#123

Does this mean we can go ahead and move forward with this change? cc @willmj @fabianlim

@Abhishek-TAMU (Collaborator)

If we want the TRL version to be >=0.12, I think we first need to ensure fms-hf-tuning is compatible with transformers >=4.46, which it currently is not, as per this issue. The latest TRL release sets a minimum transformers version of 4.46 in its requirements.txt.

Does this mean we can go ahead and move forward with this change?

Since both the transformers and trl versions are being updated, it would also be good to run one automated e2e test (for all models) to confirm everything works with the upgraded versions. @willmj

willmj and others added 4 commits February 10, 2025 11:59
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@Abhishek-TAMU (Collaborator)

@willmj

All 41 failed tests would be completely resolved by these 3 changes:

1- Setting gradient_accumulation_steps = 1 here in test_sft_trainer.py.

Since the training dataset has 10 samples and batch_size is 4, with gradient_accumulation_steps = 4 not all optimizer steps run; training does not complete even 1 of the 5 epochs, so no training_log JSON file is generated (hence the not-found errors in the unit tests).

2- Same reasoning, to change gradient_accumulation_steps = 1 in test_launch_script.py

3- Use _get_checkpoint_path to get the last checkpoint instead of hardcoding checkpoint-5, as not every tuning run produces that checkpoint. Something like this:

import os
import re

def _get_checkpoint_path(dir_path):
    # Collect subdirectories named like "checkpoint-<step>".
    checkpoint_dirs = [
        d for d in os.listdir(dir_path)
        if os.path.isdir(os.path.join(dir_path, d)) and re.match(r'^checkpoint-\d+$', d)
    ]
    # Sort numerically by step and return the latest checkpoint.
    checkpoint_dirs.sort(key=lambda name: int(name.split('-')[-1]))
    return os.path.join(dir_path, checkpoint_dirs[-1])
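The optimizer-step arithmetic behind fixes 1 and 2 can be sketched as follows (a simplification; transformers' exact step accounting may differ at the margins):

```python
import math

def optimizer_steps_per_epoch(num_samples, batch_size, grad_accum):
    # Micro-batches seen per epoch, then how many of them
    # accumulate into a single optimizer step.
    micro_batches = math.ceil(num_samples / batch_size)
    return math.ceil(micro_batches / grad_accum)

# Test setup from the comment above: 10 samples, batch size 4.
print(optimizer_steps_per_epoch(10, 4, 4))  # 1 optimizer step per epoch
print(optimizer_steps_per_epoch(10, 4, 1))  # 3 optimizer steps per epoch
```

With gradient_accumulation_steps = 4 the whole epoch collapses into a single optimizer step, which is why dropping it to 1 restores the expected training logs.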

Co-authored-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj (Collaborator, Author)

willmj commented Feb 12, 2025

The new release of fms-acceleration unblocks transformers 4.46. Should we go ahead and merge?
@anhuong @dushyantbehl @Abhishek-TAMU

@dushyantbehl (Collaborator)

@willmj sure, let's go ahead. Do you want to merge the change that enables packing for pretokenized datasets as well, or should we keep it separate?
I would prefer they be merged as separate PRs, but around the same time.

@dushyantbehl commented Feb 13, 2025 (Collaborator)

@willmj I applied the test-case fixes suggested by @Abhishek-TAMU above and have opened a PR reverting the packing restrictions for pretokenized datasets: #468.
It is based on your PR. I added a few unit tests for packing with pretokenized datasets, which currently fail; I am working on fixing them, but @Abhishek-TAMU's suggestions above work for this PR.

@anhuong anhuong changed the title chore(deps): revert trl restriction chore(deps): upgrade trl and transformers Feb 13, 2025
@anhuong commented Feb 13, 2025 (Collaborator)

Running a few tuning tests with the updated versions, but the changes look good to me. Once my tests finish, I will note the results in this PR.

@anhuong commented Feb 13, 2025 (Collaborator)

Full fine-tuning and LoRA tuning ran fine on a minimal example. We will have to test how these changes affect train_runtime and model quality in the next release though @Abhishek-TAMU

anhuong previously approved these changes Feb 13, 2025

@anhuong commented Feb 13, 2025 (Collaborator)

Hmm, but the changes from main broke the unit tests, so I will investigate...

Signed-off-by: Anh Uong <anh.uong@ibm.com>
@anhuong commented Feb 13, 2025 (Collaborator)

A new version of trl, v0.15.0, was released 3 hours ago and release notes are not yet available. The upgrade broke the unit tests, so I am setting the upper limit to below 0.15.0 for now.

Noting the error here: training failed in the training loop when running trainer.train():

0%|          | 0/15 [00:00<?, ?it/s]ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/tuning/sft_trainer.py", line 676, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/tuning/sft_trainer.py", line 420, in train
    trainer.train(resume_from_checkpoint)
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 2171, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/transformers/trainer.py", line 3675, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 453, in compute_loss
    accuracy = compute_token_accuracy(shift_logits, shift_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/fms-hf-tuning/fms-hf-tuning/.tox/coverage/lib/python3.12/site-packages/trl/trainer/utils.py", line 1664, in compute_token_accuracy
    correct_predictions = (predictions == labels) & mask
                           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (72) must match the size of tensor b (64) at non-singleton dimension 1
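The RuntimeError comes from an elementwise comparison of tensors whose sequence dimensions differ (72 shifted-logit positions vs 64 label positions). A minimal sketch of the broadcasting rule that rejects this (not trl code, just the rule PyTorch applies):

```python
def can_broadcast(shape_a, shape_b):
    # PyTorch/NumPy rule: trailing dimensions must be equal,
    # or one of them must be 1.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(can_broadcast((4, 72), (4, 64)))  # False -> the RuntimeError above
print(can_broadcast((4, 64), (4, 64)))  # True
```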

@Abhishek-TAMU left a comment (Collaborator)

Changes to upgrade the TRL and Transformers versions LGTM! Thanks for the PR @willmj.
Thanks @anhuong for testing it with a minimal example.

@willmj willmj merged commit 2f033c7 into foundation-model-stack:main Feb 13, 2025
9 checks passed