An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus). This is the specific script I'm running, launched with torchrun for distributed training. My code also relies on private-transformers, but as the stack trace below shows, the error occurs inside HuggingFace's Trainer. I have made a quick fix inside the Trainer source code (shown below) to make my code work. However, I am opening this issue to find out whether this is a general problem that needs fixing.
Traceback (most recent call last):
  File "/work/fairness-privacy/src/train.py", line 335, in <module>
    train_helper(args, dataset['train'], dataset['validation'])
  File "/work/fairness-privacy/src/train.py", line 300, in train_helper
    model_ft = train_private(args, train_data_tok, val_data_tok)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/src/train.py", line 160, in train_private
    trainer.train(model_path=None, dev_objective="eval_accuracy")
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 401, in train
    logging_loss_scalar = self.evaluate_and_log(
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 586, in evaluate_and_log
    output = self.evaluate()
             ^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 569, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/condaenv/lib/python3.11/site-packages/transformers/trainer.py", line 3862, in prediction_loop
    losses = loss.repeat(batch_size)
             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints, but found element of type NoneType at pos 0
I am able to avoid this error when I make the following hack inside prediction_loop:
from accelerate.data_loader import DataLoaderShard

# A sharded loader reports batch_size as None here, so read total_batch_size instead.
if isinstance(dataloader, DataLoaderShard):
    batch_size = dataloader.total_batch_size
else:
    batch_size = dataloader.batch_size
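A slightly more defensive variant of the same workaround, as a sketch only. It uses duck typing instead of a type check and fails loudly when no batch size can be found. `resolve_batch_size` and the stand-in loader objects are hypothetical names for illustration, not part of transformers or accelerate; the only attributes assumed are the two used in the hack above.

```python
from types import SimpleNamespace


def resolve_batch_size(dataloader):
    """Return an integer batch size for `dataloader`, or raise if none is found.

    Hypothetical helper: prefers `total_batch_size` (as the hack above does for
    sharded loaders) and falls back to `batch_size`.
    """
    for attr in ("total_batch_size", "batch_size"):
        value = getattr(dataloader, attr, None)
        if value is not None:
            return int(value)
    raise ValueError("could not determine a batch size from the dataloader")


# Stand-ins for a sharded loader (batch_size is None) and a plain loader.
sharded = SimpleNamespace(batch_size=None, total_batch_size=32)
plain = SimpleNamespace(batch_size=8)

print(resolve_batch_size(sharded))  # 32
print(resolve_batch_size(plain))    # 8
```

Raising an error instead of passing None along would surface the problem at the point where the batch size is read, not later inside `loss.repeat`.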
Expected behavior
The expected behavior is that prediction_loop runs normally and the function that calls it (evaluate_and_log) can log the evaluation results during training. More specifically, batch_size should be an integer rather than None, as it is in this case, so that losses = loss.repeat(batch_size) inside prediction_loop can run.
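To illustrate why an integer is needed, here is a plain-Python sketch without torch. `repeat_loss` is a hypothetical stand-in for `loss.repeat(batch_size)`, and the loss values are made up. It assumes the repeated per-batch losses are later averaged, so that each batch is weighted by its size; the traceback does not show that step.

```python
def repeat_loss(loss, batch_size):
    """Hypothetical stand-in for loss.repeat(batch_size): one copy per example."""
    if not isinstance(batch_size, int):
        raise TypeError(f"batch_size must be an int, got {type(batch_size).__name__}")
    return [loss] * batch_size


# (mean loss of the batch, number of examples in the batch); illustrative values.
batches = [(0.5, 4), (1.0, 2)]

all_losses = []
for loss, batch_size in batches:
    all_losses.extend(repeat_loss(loss, batch_size))

# The mean over repeated entries equals the size-weighted mean of batch losses.
eval_loss = sum(all_losses) / len(all_losses)
weighted = sum(loss * n for loss, n in batches) / sum(n for _, n in batches)
print(abs(eval_loss - weighted) < 1e-12)  # True
```

With batch_size set to None, the stand-in raises a TypeError at the repeat step, which mirrors the failure reported above.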
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.37.2
I am executing this script using: torchrun --nnodes=1 --nproc-per-node=${N_GPUS}
Who can help?
@muellerzr @SunMarc