
Trainer having issues with DataLoaderShard when running with torchrun #31457

Closed
2 of 4 tasks
Tracked by #33345
mohummedalee opened this issue Jun 17, 2024 · 2 comments
@mohummedalee

System Info

  • transformers version: 4.37.2
  • Platform: Linux-3.10.0-1160.25.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.11.8
  • Huggingface_hub version: 0.23.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (CUDA)
  • Using distributed or parallel set-up in script?: Yes, running with torchrun --nnodes=1 --nproc-per-node=${N_GPUS}

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus), running this script with torchrun for distributed training. My code also relies on private-transformers, but as the stack trace below shows, the error happens inside Hugging Face's Trainer, and I have made a quick fix inside the Trainer source code (shown below) to get my code working. I am opening an issue here to check whether this is a general problem that needs fixing.

Traceback (most recent call last):
  File "/work/fairness-privacy/src/train.py", line 335, in <module>
    train_helper(args, dataset['train'], dataset['validation'])
  File "/work/fairness-privacy/src/train.py", line 300, in train_helper
    model_ft = train_private(args, train_data_tok, val_data_tok)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/src/train.py", line 160, in train_private
    trainer.train(model_path=None, dev_objective="eval_accuracy")
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 401, in train
    logging_loss_scalar = self.evaluate_and_log(
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 586, in evaluate_and_log
    output = self.evaluate()
             ^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 569, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/condaenv/lib/python3.11/site-packages/transformers/trainer.py", line 3862, in prediction_loop
    losses = loss.repeat(batch_size)
             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints, but found element of type NoneType at pos 0

I am executing this script using:

EPOCHS=1
BATCH_SIZE=64
EPSILON=8
MODEL_OUT="models/roberta-priv-eps_${EPSILON}_epochs_${EPOCHS}-bs_${BATCH_SIZE}"
N_GPUS=1

torchrun --nnodes=1 --nproc-per-node=${N_GPUS} src/train.py \
    --train-mode private \
    --data-path /work/fairness-privacy/twitteraae-sentiment-data-split/ \
    --epochs $EPOCHS \
    --model-out-path $MODEL_OUT \
    --tracking-interval 5000 \
    --priv-epsilon $EPSILON \
    --priv-max-grad-norm 0.1 \
    --do-eval
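
From what I can tell (my own reading, not verified against the accelerate source), batch_size comes back as None because accelerate rebuilds the prepared dataloader around a batch_sampler, which leaves the underlying DataLoader.batch_size unset while exposing the effective size as total_batch_size. A minimal sketch of where the two attributes live, assuming a single-process accelerate setup; exact behavior may vary across accelerate versions:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy dataset and dataloader, then let accelerate wrap it the same way
# Trainer does internally when it prepares the eval dataloader.
dataset = TensorDataset(torch.arange(128).float())
dataloader = Accelerator().prepare(DataLoader(dataset, batch_size=64))

print(type(dataloader).__name__)    # DataLoaderShard
print(dataloader.batch_size)        # typically None: the shard is rebuilt around a batch_sampler
print(dataloader.total_batch_size)  # 64 * number of processes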

I am able to avoid this error when I make the following hack inside prediction_loop:

from accelerate.data_loader import DataLoaderShard

# Accelerate wraps the eval dataloader in a DataLoaderShard whose
# batch_size attribute is None; use its total_batch_size instead.
if isinstance(dataloader, DataLoaderShard):
    batch_size = dataloader.total_batch_size
else:
    batch_size = dataloader.batch_size
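
A slightly more defensive variant of the same idea (my own sketch, not the upstream fix) avoids importing DataLoaderShard directly and simply prefers total_batch_size whenever the dataloader exposes it:

# Hypothetical alternative to the hack above: fall back gracefully
# instead of checking the wrapper type.
batch_size = getattr(dataloader, "total_batch_size", None)
if batch_size is None:
    batch_size = dataloader.batch_size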

Expected behavior

The expected behavior is that prediction_loop runs normally and the calling function (evaluate_and_log) can log evaluation results during training. More specifically, batch_size should be a concrete integer rather than None, so that losses = loss.repeat(batch_size) inside prediction_loop can run.
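
As a rough sketch of what a guard inside prediction_loop could look like (an assumption on my part, not a proposed patch), the repeat could fall back to the eval batch size from TrainingArguments when the dataloader reports None:

# Hypothetical guard around the failing line in prediction_loop;
# self.args.eval_batch_size is always a concrete int.
batch_size = dataloader.batch_size
if batch_size is None:
    batch_size = getattr(dataloader, "total_batch_size", self.args.eval_batch_size)
losses = loss.repeat(batch_size)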

@amyeroberts
Collaborator

Gentle ping @SunMarc @muellerzr


github-actions bot commented Oct 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
