IterableDatasets result in nan loss in eval with dataloader_num_workers>=1 and multi-gpu #18608

Closed
dlwh opened this issue Aug 12, 2022 · 1 comment · Fixed by #18856

dlwh (Contributor) commented Aug 12, 2022

System Info

  • transformers version: 4.22.0.dev0
  • Platform: Linux-5.4.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: YES

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run this modified/minimized run_clm.py under DeepSpeed (or presumably any other multi-process launcher, though I didn't check).

The script works fine if you don't use multiple processes, if you change it to not use an IterableDataset, or if you set dataloader_num_workers to 0 (which is the default).
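
The attached script is the actual reproduction; purely for illustration, here is a hypothetical minimal sketch of its shape (the model, dummy data, and hyperparameters are placeholders of mine, not the real script's). On a single process the assertion passes; the nan only shows up when this is launched across multiple GPUs (e.g. via the deepspeed launcher) with dataloader_num_workers >= 1:

```python
# Hypothetical minimal sketch of the failing setup -- not the attached script.
import numpy as np
from torch.utils.data import IterableDataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, default_data_collator)


class DummyLMDataset(IterableDataset):
    """Streams a fixed number of tokenized examples (stand-in for the real data)."""

    def __init__(self, tokenizer, length=64, seq_len=128):
        self.tokenizer = tokenizer
        self.length = length
        self.seq_len = seq_len

    def __iter__(self):
        for _ in range(self.length):
            enc = self.tokenizer("hello world " * 40, truncation=True,
                                 max_length=self.seq_len, padding="max_length")
            enc["labels"] = list(enc["input_ids"])
            yield enc


def main():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    args = TrainingArguments(
        output_dir="out",
        per_device_eval_batch_size=4,
        dataloader_num_workers=1,  # the trigger: anything >= 1
    )
    trainer = Trainer(
        model=model,
        args=args,
        eval_dataset=DummyLMDataset(tokenizer),  # IterableDataset for eval
        data_collator=default_data_collator,
    )
    metrics = trainer.evaluate()
    assert np.isfinite(metrics["eval_loss"])


if __name__ == "__main__":
    main()
```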

Relevant bit of logs:

Traceback (most recent call last):
  File "run_clm.py", line 125, in <module>
    main()
  File "run_clm.py", line 116, in main
    assert np.isfinite(metrics["eval_loss"])
AssertionError

Expected behavior

The assertion shouldn't fail; or at the very least, Trainer should require dataloader_num_workers to be 0 when using multi-GPU with an IterableDataset...

The underlying issue is that Trainer wraps the dataset in an IterableDatasetShard when using multi-GPU with an IterableDataset, and [evaluation_loop](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3024-L3027) reads the num_examples property of that IterableDatasetShard. But this counter is never incremented in the main process when dataloader_num_workers > 0, because it is only updated inside the DataLoader worker processes...

I will note that evaluation_loop already goes to some trouble to track the actual number of observed examples, so unless I'm missing something, it could just always use that count instead.
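
As a standalone illustration of the worker-process pitfall (plain PyTorch, no Trainer involved; the class and names below are mine, standing in for IterableDatasetShard): a counter incremented inside __iter__ only changes the copy of the dataset held by the DataLoader worker process, so the copy held by the main process, which is the one evaluation_loop reads, never changes:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class CountingDataset(IterableDataset):
    """Counts how many examples it has yielded, like IterableDatasetShard.num_examples."""

    def __init__(self, n):
        self.n = n
        self.num_examples = 0

    def __iter__(self):
        for i in range(self.n):
            # Incremented in whichever process is iterating the dataset.
            self.num_examples += 1
            yield torch.tensor(i)


if __name__ == "__main__":
    ds = CountingDataset(8)
    # num_workers=0: iteration happens in the main process, so the counter is correct.
    list(DataLoader(ds, num_workers=0))
    print(ds.num_examples)  # 8

    ds = CountingDataset(8)
    # num_workers=1: iteration happens on a copy of the dataset inside a worker
    # process, so the main process's counter never moves -- this is the value
    # evaluation_loop ends up reading from IterableDatasetShard.
    list(DataLoader(ds, num_workers=1))
    print(ds.num_examples)  # 0
```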

sgugger (Collaborator) commented Sep 1, 2022

Thanks for flagging. The PR above should fix the issue; could you give it a quick try?
