
Compute loss inside the training_step for optimum trainer. #671

Closed
AdamLouly opened this issue Jan 5, 2023 · 5 comments
@AdamLouly (Contributor)
Feature request

I would like to request a feature that modifies the optimum trainer to compute the loss inside the training step.

Motivation

We believe that modifying the optimum training process to compute the loss inside the training step will significantly reduce memory usage. Computing the loss separately currently consumes more memory, which can be a significant issue for users working with large datasets or on systems with limited memory.

In a proof-of-concept implementation, we were able to achieve a 7.8% reduction in memory usage by including the loss calculation inside the training process.

Your contribution

We have identified two potential approaches for modifying the optimum training process to compute the loss inside the training step.

The first approach involves leveraging the loss computation function inside the Hugging Face trainer, but we have encountered some limitations related to the default loss function used in this trainer (label_smoother).

The second approach, which we have implemented in a proof of concept, involves creating a class called 'ModuleWithLoss' that overrides the forward method to compute the loss inside it. This approach has shown promising results, but we have encountered issues with eval_loop and prediction_loop breaking when using this method.

To solve this issue, we propose maintaining two separate models: a training model wrapped inside the 'ModuleWithLoss' class for all training-related tasks, and a normal default model for evaluation and inference. We believe that this approach will be effective in addressing the issues we have encountered and will ultimately lead to the successful implementation of this feature.
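As a rough illustration of the second approach, below is a minimal sketch of what such a wrapper could look like. The class name matches the proposal, but the body and the label_smoother handling are assumptions made for illustration, not the actual POC code:

```python
import torch


class ModuleWithLoss(torch.nn.Module):
    """Illustrative wrapper: compute the loss inside forward().

    Returning the loss from forward() lets the whole (model + loss)
    computation be captured as a single graph, e.g. by ORTModule.
    """

    def __init__(self, model, label_smoother=None):
        super().__init__()
        self._original_model = model
        # Assumption: mirrors the Trainer's label_smoother when label smoothing is enabled.
        self._label_smoother = label_smoother

    def forward(self, **inputs):
        if self._label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
            outputs = self._original_model(**inputs)
            return self._label_smoother(outputs, labels)
        outputs = self._original_model(**inputs)
        return outputs["loss"] if isinstance(outputs, dict) else outputs[0]
```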

We would like to know what you think about the two approaches, and whether you have any ideas for making the first approach work, since it looks much cleaner if we can get past its limitations.

If not, we would like to know whether maintaining both models is a good approach, or if you have suggestions for doing this better, as you have more context on this than our team.

For reference, this is the POC implementation: https://github.com/pengwa/optimum/pull/1

Thank you

@AdamLouly (Contributor, Author)

@JingyaHuang

@JingyaHuang JingyaHuang self-assigned this Jan 5, 2023
@pengwa commented Jan 9, 2023

To be more specific, the memory benefits are mainly for the onnxruntime ORTTrainer.

An alternative approach might be: have _inner_training_loop wrap the model + loss, without updating self.model or self.model_wrapped (since that would affect evaluation).
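A rough, hypothetical sketch of that idea (illustrative names only, not the actual ORTTrainer code), assuming a ModuleWithLoss-style wrapper like the one sketched earlier in the thread:

```python
def run_training_step(trainer, inputs):
    # Build (or reuse) a training-only (model + loss) wrapper. trainer.model and
    # trainer.model_wrapped are intentionally never reassigned, so evaluation and
    # prediction keep using the plain, unwrapped model.
    if not hasattr(trainer, "_model_with_loss"):
        trainer._model_with_loss = ModuleWithLoss(trainer.model, trainer.label_smoother)
    loss = trainer._model_with_loss(**inputs)  # forward() returns the loss directly
    loss.backward()
    return loss.detach()
```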

@JingyaHuang (Collaborator)

Hi @pengwa and @AdamLouly, in ORTTrainer the loss is computed outside the model only when applying label smoothing. (So if label_smoother is not used, the loss is the one computed inside the modeling code.)

Case 1: w/ label_smoother (extra memory usage)
https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/trainer.py#L2573

Case 2: w/o label_smoother (loss computed inside the modeling code, should not increase memory usage)
https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/trainer.py#L2578
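For context, here is a simplified paraphrase of the compute_loss logic behind the two links above (not the verbatim transformers source):

```python
# Simplified paraphrase of Trainer.compute_loss, covering the two cases linked above.
def compute_loss(self, model, inputs, return_outputs=False):
    if self.label_smoother is not None and "labels" in inputs:
        labels = inputs.pop("labels")  # Case 1: labels are held back from the model call
    else:
        labels = None                  # Case 2: labels stay in `inputs`
    outputs = model(**inputs)
    if labels is not None:
        loss = self.label_smoother(outputs, labels)  # Case 1: loss computed in the Trainer
    else:
        # Case 2: loss already computed inside the model's forward()
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    return (loss, outputs) if return_outputs else loss
```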

If I understand correctly, you want to add an extra wrapper in order to include the loss computation inside forward() when using label smoothing, so that it can be intercepted by ORTModule, is that correct?

@JingyaHuang (Collaborator)

Another possibility would be to create a wrapper that includes label smoothing directly in the Transformers Trainer. It could save some effort on maintaining ORTTrainer. Going to check if it makes sense.

@AdamLouly (Contributor, Author)

@JingyaHuang, I tested your suggestion with and without label_smoother, and these are the results:

[screenshots: memory usage results with and without label_smoother]

We can still see that the performance of the wrapper is better. Also, when we use label_smoother, memory usage is lower than when we don't.
