Compute loss inside the training_step for optimum trainer. #671
To be more specific, the memory benefits are mainly for the onnxruntime ORTTrainer. An alternative approach might be: have _inner_training_loop wrap the model + loss without updating self.model or self.model_wrapped (which would affect evaluation).
Hi @pengwa and @AdamLouly, if I understand correctly (Case 1: with the wrapper; Case 2: without), you want to add an extra wrapper in order to include the loss computation inside …
Another possibility would be to create a wrapper that includes label smoothing directly in the …
@JingyaHuang, I tested your suggestion with and without label_smoother, and these are the results: we can still see that the performance of the wrapper is better.
Feature request
I would like to request a feature that modifies the Optimum trainer to compute the loss inside the training step.
Motivation
We believe that modifying the Optimum training process to compute the loss inside the training step will significantly reduce memory usage. Computing the loss separately currently consumes more memory, which can be a significant issue for users working with large datasets or on systems with limited memory.
In a proof-of-concept implementation that we conducted, we were able to achieve a 7.8% reduction in memory usage by including the loss calculation inside the training step.
Your contribution
We have identified two potential approaches for modifying the optimum training process to compute the loss inside the training process.
The first approach involves leveraging the loss computation function inside the Hugging Face Trainer, but we have encountered limitations related to the default loss function used by that trainer (label_smoother), which is applied outside the model's forward pass.
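For context, the Hugging Face Trainer applies label smoothing to the model's logits after the forward pass rather than inside it, which is why the loss cannot simply be folded into the model. The following is a hedged, self-contained sketch of a label-smoothed loss of that general shape (the function name and exact formulation here are illustrative, not the Trainer's internal code; the real LabelSmoother also handles ignored labels such as -100):

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits: torch.Tensor,
                            labels: torch.Tensor,
                            epsilon: float = 0.1) -> torch.Tensor:
    """Illustrative label-smoothing loss: mix the one-hot NLL term
    with a uniform-prior term over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative log-likelihood of the true labels (the usual CE term).
    nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Smoothing term: average negative log-prob across all classes.
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = label_smoothed_nll_loss(logits, labels)
```

With `epsilon=0.0` this reduces to plain cross-entropy, which is one way to sanity-check such a wrapper against the unsmoothed path.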
The second approach, which we have implemented in a proof of concept, involves creating a class called 'ModuleWithLoss' that overrides the forward method to compute the loss inside it. This approach has shown promising results, but we have encountered issues with the eval_loop and prediction_loop breaking when this wrapper is used.
To solve this issue, we propose maintaining two separate models: a training model wrapped inside the 'ModuleWithLoss' class for all training-related tasks, and a normal default model for evaluation and inference. We believe that this approach will be effective in addressing the issues we have encountered and will ultimately lead to the successful implementation of this feature.
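The dual-model idea above can be sketched in plain PyTorch. This is a minimal, hypothetical illustration of the 'ModuleWithLoss' pattern, not the POC's actual code: a small linear model stands in for the Hugging Face model, and the loss function is injected rather than taken from the trainer.

```python
import torch
import torch.nn as nn

class ModuleWithLoss(nn.Module):
    """Wraps a model so the loss is computed inside forward().

    With the loss inside the graph, a graph-based backend (e.g. ONNX
    Runtime) can optimize the model and loss as one fused graph ending
    in a scalar, instead of splitting at the logits boundary.
    """
    def __init__(self, model: nn.Module, loss_fn: nn.Module):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn

    def forward(self, inputs: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        logits = self.model(inputs)
        return self.loss_fn(logits, labels)

# Training uses the wrapper; evaluation keeps the plain model.
base_model = nn.Linear(4, 3)
train_model = ModuleWithLoss(base_model, nn.CrossEntropyLoss())

x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
loss = train_model(x, y)   # scalar loss straight from forward()
loss.backward()            # gradients flow into base_model's parameters

logits = base_model(x)     # eval/inference path stays unwrapped
```

Because the wrapper holds the same underlying module, parameters are shared: updates made during training are immediately visible through the unwrapped `base_model` used for evaluation and prediction.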
We would like to know what you think about the two approaches, and whether you have any ideas for making the first approach work, since it looks much cleaner if we can get past its limitations.
If not, we would like to know whether maintaining both models is a good approach, or whether there is a better way to do this, as you have more context here than our team.
For reference, this is the POC implementation: https://github.com/pengwa/optimum/pull/1
Thank you