IterableDatasets result in nan loss in eval with dataloader_num_workers>=1 and multi-gpu #18608

Closed
dlwh opened this issue Aug 12, 2022 · 1 comment · Fixed by #18856

dlwh (Contributor) commented Aug 12, 2022

System Info

  • transformers version: 4.22.0.dev0
  • Platform: Linux-5.4.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: YES

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run this modified/minimized run_clm.py under DeepSpeed (or presumably any other multi-process launcher, though I didn't check).

The script works fine if you don't use multiple processes, if you change it to not use an IterableDataset, or if you set dataloader_num_workers to 0 (which is the default).
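
The attached script is the actual reproduction; purely for illustration, here is a hypothetical minimal sketch of its shape (the model, dummy data, and hyperparameters are placeholders of mine, not the real script's). On a single process the assertion passes; the nan only shows up when this is launched across multiple GPUs (e.g. via the deepspeed launcher) with dataloader_num_workers >= 1:

```python
# Hypothetical minimal sketch of the failing setup -- not the attached script.
import numpy as np
from torch.utils.data import IterableDataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, default_data_collator)


class DummyLMDataset(IterableDataset):
    """Streams a fixed number of tokenized examples (stand-in for the real data)."""

    def __init__(self, tokenizer, length=64, seq_len=128):
        self.tokenizer = tokenizer
        self.length = length
        self.seq_len = seq_len

    def __iter__(self):
        for _ in range(self.length):
            enc = self.tokenizer("hello world " * 40, truncation=True,
                                 max_length=self.seq_len, padding="max_length")
            enc["labels"] = list(enc["input_ids"])
            yield enc


def main():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    args = TrainingArguments(
        output_dir="out",
        per_device_eval_batch_size=4,
        dataloader_num_workers=1,  # the trigger: anything >= 1
    )
    trainer = Trainer(
        model=model,
        args=args,
        eval_dataset=DummyLMDataset(tokenizer),  # IterableDataset for eval
        data_collator=default_data_collator,
    )
    metrics = trainer.evaluate()
    assert np.isfinite(metrics["eval_loss"])


if __name__ == "__main__":
    main()
```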

Relevant bit of logs:

Traceback (most recent call last):
  File "run_clm.py", line 125, in <module>
    main()
  File "run_clm.py", line 116, in main
    assert np.isfinite(metrics["eval_loss"])
AssertionError

Expected behavior

The assertion shouldn't fail; or at the very least, Trainer should require dataloader_num_workers to be 0 when using multi-GPU with an IterableDataset...

The underlying issue is that Trainer wraps the dataset in an IterableDatasetShard when using multi-GPU with an IterableDataset, and [evaluation_loop](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3024-L3027) reads the num_examples property of that IterableDatasetShard. But this counter is never incremented in the main process when dataloader_num_workers > 0, because it is only updated inside the DataLoader worker processes...

I will note that evaluation_loop already goes to some trouble to track the actual number of observed examples, so unless I'm missing something, it could just always use that count instead.
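
As a standalone illustration of the worker-process pitfall (plain PyTorch, no Trainer involved; the class and names below are mine, standing in for IterableDatasetShard): a counter incremented inside __iter__ only changes the copy of the dataset held by the DataLoader worker process, so the copy held by the main process, which is the one evaluation_loop reads, never changes:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class CountingDataset(IterableDataset):
    """Counts how many examples it has yielded, like IterableDatasetShard.num_examples."""

    def __init__(self, n):
        self.n = n
        self.num_examples = 0

    def __iter__(self):
        for i in range(self.n):
            # Incremented in whichever process is iterating the dataset.
            self.num_examples += 1
            yield torch.tensor(i)


if __name__ == "__main__":
    ds = CountingDataset(8)
    # num_workers=0: iteration happens in the main process, so the counter is correct.
    list(DataLoader(ds, num_workers=0))
    print(ds.num_examples)  # 8

    ds = CountingDataset(8)
    # num_workers=1: iteration happens on a copy of the dataset inside a worker
    # process, so the main process's counter never moves -- this is the value
    # evaluation_loop ends up reading from IterableDatasetShard.
    list(DataLoader(ds, num_workers=1))
    print(ds.num_examples)  # 0
```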

sgugger (Collaborator) commented Sep 1, 2022

Thanks for flagging. The PR above should fix the issue; could you give it a quick try?
