
Compute loss inside the training_step for optimum trainer. #671

Closed
AdamLouly opened this issue Jan 5, 2023 · 5 comments
@AdamLouly (Contributor)
Feature request

I would like to request a feature that modifies the optimum trainer to compute the loss inside the training step.

Motivation

We believe that modifying the optimum training process to compute the loss inside the training step will significantly reduce memory usage. Computing the loss separately currently consumes more memory, which can be a significant issue for users working with large datasets or on systems with limited memory.

In a proof-of-concept implementation, we were able to achieve a 7.8% reduction in memory usage by including the loss calculation inside the training process.

Your contribution

We have identified two potential approaches for modifying the optimum training process to compute the loss inside the training step.

The first approach involves leveraging the loss computation function inside the Hugging Face trainer, but we have encountered some limitations related to the default loss function used in this trainer (label_smoother).

The second approach, which we have implemented in a proof of concept, involves creating a class called 'ModuleWithLoss' that overrides the forward method to compute the loss inside it. This approach has shown promising results, but we have encountered issues with eval_loop and prediction_loop breaking when using this method.

To solve this issue, we propose maintaining two separate models: a training model wrapped inside the 'ModuleWithLoss' class for all training-related tasks, and a normal default model for evaluation and inference. We believe that this approach will be effective in addressing the issues we have encountered and will ultimately lead to the successful implementation of this feature.
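As a rough illustration of the second approach, below is a minimal sketch of what such a wrapper could look like. The class name matches the proposal, but the body and the label_smoother handling are assumptions made for illustration, not the actual POC code:

```python
import torch


class ModuleWithLoss(torch.nn.Module):
    """Illustrative wrapper: compute the loss inside forward().

    Returning the loss from forward() lets the whole (model + loss)
    computation be captured as a single graph, e.g. by ORTModule.
    """

    def __init__(self, model, label_smoother=None):
        super().__init__()
        self._original_model = model
        # Assumption: mirrors the Trainer's label_smoother when label smoothing is enabled.
        self._label_smoother = label_smoother

    def forward(self, **inputs):
        if self._label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
            outputs = self._original_model(**inputs)
            return self._label_smoother(outputs, labels)
        outputs = self._original_model(**inputs)
        return outputs["loss"] if isinstance(outputs, dict) else outputs[0]
```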

We would like to know what you think about the two approaches, and whether you have any ideas for making the first approach work, since it looks much cleaner if we can get past its limitations.

If not, we would like to know whether maintaining both models is a good approach, or if you have suggestions for doing this better, as you have more context on this than our team.

For reference, this is the POC implementation: https://github.com/pengwa/optimum/pull/1

Thank you

@AdamLouly (Contributor, Author)

@JingyaHuang

@JingyaHuang JingyaHuang self-assigned this Jan 5, 2023
@pengwa commented Jan 9, 2023

To be more specific, the memory benefits are mainly for the onnxruntime ORTTrainer.

An alternative approach might be: have _inner_training_loop wrap the model + loss, without updating self.model or self.model_wrapped (since that would affect evaluation).
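A rough, hypothetical sketch of that idea (illustrative names only, not the actual ORTTrainer code), assuming a ModuleWithLoss-style wrapper like the one sketched earlier in the thread:

```python
def run_training_step(trainer, inputs):
    # Build (or reuse) a training-only (model + loss) wrapper. trainer.model and
    # trainer.model_wrapped are intentionally never reassigned, so evaluation and
    # prediction keep using the plain, unwrapped model.
    if not hasattr(trainer, "_model_with_loss"):
        trainer._model_with_loss = ModuleWithLoss(trainer.model, trainer.label_smoother)
    loss = trainer._model_with_loss(**inputs)  # forward() returns the loss directly
    loss.backward()
    return loss.detach()
```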

@JingyaHuang (Collaborator)

Hi @pengwa and @AdamLouly, in ORTTrainer the loss is computed outside the model only when applying label smoothing. (So if label_smoother is not used, the loss is the one computed inside the modeling code.)

Case 1: w/ label_smoother (extra memory usage)
https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/trainer.py#L2573

Case 2: w/o label_smoother (loss computed inside the modeling code, should not increase memory usage)
https://github.com/huggingface/transformers/blob/25ddd91b249014d818fb2ed3d4ba856ed9a5653e/src/transformers/trainer.py#L2578
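For context, here is a simplified paraphrase of the compute_loss logic behind the two links above (not the verbatim transformers source):

```python
# Simplified paraphrase of Trainer.compute_loss, covering the two cases linked above.
def compute_loss(self, model, inputs, return_outputs=False):
    if self.label_smoother is not None and "labels" in inputs:
        labels = inputs.pop("labels")  # Case 1: labels are held back from the model call
    else:
        labels = None                  # Case 2: labels stay in `inputs`
    outputs = model(**inputs)
    if labels is not None:
        loss = self.label_smoother(outputs, labels)  # Case 1: loss computed in the Trainer
    else:
        # Case 2: loss already computed inside the model's forward()
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    return (loss, outputs) if return_outputs else loss
```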

If I understand correctly, you want to add an extra wrapper in order to include the loss computation inside forward() when using label smoothing, so that it can be intercepted by ORTModule, is that correct?

@JingyaHuang (Collaborator)

Another possibility would be to create a wrapper that includes label smoothing directly in the Transformers Trainer. It could save some effort on maintaining ORTTrainer. Going to check if it makes sense.

@AdamLouly (Contributor, Author)

@JingyaHuang, I tested your suggestion with and without label_smoother, and these are the results:

[screenshots: memory usage results with and without label_smoother]

We can still see that the performance of the wrapper is better. Also, when we use label_smoother, memory usage is lower than when we don't.
