
Trainer having issues with DataLoaderShard when running with torchrun #31457

Closed
2 of 4 tasks
Tracked by #33345
mohummedalee opened this issue Jun 17, 2024 · 2 comments
@mohummedalee

System Info

  • transformers version: 4.37.2
  • Platform: Linux-3.10.0-1160.25.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.11.8
  • Huggingface_hub version: 0.23.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (CUDA)
  • Using distributed or parallel set-up in script?: Yes, running with torchrun --nnodes=1 --nproc-per-node=${N_GPUS}

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus), running this script with torchrun for distributed training. My code also relies on private-transformers, but as the stack trace below shows, the error happens inside Hugging Face's Trainer, and I have made a quick fix inside the Trainer source code (shown below) to get my code working. I am opening an issue here to check whether this is a general problem that needs fixing.

Traceback (most recent call last):
  File "/work/fairness-privacy/src/train.py", line 335, in <module>
    train_helper(args, dataset['train'], dataset['validation'])
  File "/work/fairness-privacy/src/train.py", line 300, in train_helper
    model_ft = train_private(args, train_data_tok, val_data_tok)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/src/train.py", line 160, in train_private
    trainer.train(model_path=None, dev_objective="eval_accuracy")
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 401, in train
    logging_loss_scalar = self.evaluate_and_log(
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 586, in evaluate_and_log
    output = self.evaluate()
             ^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 569, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/condaenv/lib/python3.11/site-packages/transformers/trainer.py", line 3862, in prediction_loop
    losses = loss.repeat(batch_size)
             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints, but found element of type NoneType at pos 0

I am executing this script using:

EPOCHS=1
BATCH_SIZE=64
EPSILON=8
MODEL_OUT="models/roberta-priv-eps_${EPSILON}_epochs_${EPOCHS}-bs_${BATCH_SIZE}"
N_GPUS=1

torchrun --nnodes=1 --nproc-per-node=${N_GPUS} src/train.py \
    --train-mode private \
    --data-path /work/fairness-privacy/twitteraae-sentiment-data-split/ \
    --epochs $EPOCHS \
    --model-out-path $MODEL_OUT \
    --tracking-interval 5000 \
    --priv-epsilon $EPSILON \
    --priv-max-grad-norm 0.1 \
    --do-eval
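
From what I can tell (my own reading, not verified against the accelerate source), batch_size comes back as None because accelerate rebuilds the prepared dataloader around a batch_sampler, which leaves the underlying DataLoader.batch_size unset while exposing the effective size as total_batch_size. A minimal sketch of where the two attributes live, assuming a single-process accelerate setup; exact behavior may vary across accelerate versions:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy dataset and dataloader, then let accelerate wrap it the same way
# Trainer does internally when it prepares the eval dataloader.
dataset = TensorDataset(torch.arange(128).float())
dataloader = Accelerator().prepare(DataLoader(dataset, batch_size=64))

print(type(dataloader).__name__)    # DataLoaderShard
print(dataloader.batch_size)        # typically None: the shard is rebuilt around a batch_sampler
print(dataloader.total_batch_size)  # 64 * number of processes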

I am able to avoid this error when I make the following hack inside prediction_loop:

from accelerate.data_loader import DataLoaderShard

# Accelerate wraps the eval dataloader in a DataLoaderShard whose
# batch_size attribute is None; use its total_batch_size instead.
if isinstance(dataloader, DataLoaderShard):
    batch_size = dataloader.total_batch_size
else:
    batch_size = dataloader.batch_size
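
A slightly more defensive variant of the same idea (my own sketch, not the upstream fix) avoids importing DataLoaderShard directly and simply prefers total_batch_size whenever the dataloader exposes it:

# Hypothetical alternative to the hack above: fall back gracefully
# instead of checking the wrapper type.
batch_size = getattr(dataloader, "total_batch_size", None)
if batch_size is None:
    batch_size = dataloader.batch_size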

Expected behavior

The expected behavior is that prediction_loop runs normally and the calling function (evaluate_and_log) can log evaluation results during training. More specifically, batch_size should be a concrete integer rather than None, so that losses = loss.repeat(batch_size) inside prediction_loop can run.
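
As a rough sketch of what a guard inside prediction_loop could look like (an assumption on my part, not a proposed patch), the repeat could fall back to the eval batch size from TrainingArguments when the dataloader reports None:

# Hypothetical guard around the failing line in prediction_loop;
# self.args.eval_batch_size is always a concrete int.
batch_size = dataloader.batch_size
if batch_size is None:
    batch_size = getattr(dataloader, "total_batch_size", self.args.eval_batch_size)
losses = loss.repeat(batch_size)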

@amyeroberts
Collaborator

Gentle ping @SunMarc @muellerzr


github-actions bot commented Oct 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
