System Info

transformers version: 4.22.0.dev0

Who can help?

@sgugger

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Run this modified/minimized run_clm.py under DeepSpeed (or presumably any other multi-process launcher, but I didn't check).
The script works fine if you don't use multiprocessing, if you change it to not use an IterableDataset, or if you set dataloader_num_workers to 0 (which is the default).
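For concreteness, here is a minimal sketch of the shape of that script (this is not the attached reproduction; the gpt2 checkpoint, the dummy dataset, and the commented-out DeepSpeed config are placeholder assumptions):

```python
import numpy as np
from torch.utils.data import IterableDataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)


class DummyTextDataset(IterableDataset):
    """Yields a fixed number of identical tokenized causal-LM examples."""

    def __init__(self, tokenizer, num_items=64):
        self.tokenizer = tokenizer
        self.num_items = num_items

    def __iter__(self):
        enc = self.tokenizer("hello world", padding="max_length",
                             max_length=16, return_tensors="pt")
        for _ in range(self.num_items):
            yield {
                "input_ids": enc["input_ids"][0],
                "attention_mask": enc["attention_mask"][0],
                "labels": enc["input_ids"][0],
            }


def main():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    args = TrainingArguments(
        output_dir="out",
        per_device_eval_batch_size=4,
        dataloader_num_workers=2,  # > 0 is what triggers the problem
        # deepspeed="ds_config.json",  # run under a multi-process launcher
    )
    trainer = Trainer(model=model, args=args,
                      eval_dataset=DummyTextDataset(tokenizer))
    metrics = trainer.evaluate()
    # Under multiple processes plus worker processes, eval_loss comes back NaN.
    assert np.isfinite(metrics["eval_loss"])


if __name__ == "__main__":
    main()
```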
Relevant bit of logs:
```
Traceback (most recent call last):
  File "run_clm.py", line 125, in <module>
    main()
  File "run_clm.py", line 116, in main
    assert np.isfinite(metrics["eval_loss"])
AssertionError
```
Expected behavior
The assertion shouldn't fail, or at the very least Trainer should require that dataloader_num_workers is 0 when using multi-GPU and an IterableDataset.
The underlying issue is that Trainer creates IterableDatasetShards when using multi-GPU and an IterableDataset, and [evaluation_loop](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3024-L3027) looks at the num_examples property of the IterableDatasetShard, but this value is never incremented in the main training process when dataloader_num_workers > 0, because it is only updated in the worker processes.
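The mechanics can be shown without Trainer at all. A small standalone sketch (plain PyTorch, with a hypothetical CountingDataset) demonstrating that an attribute incremented inside __iter__ is only visible in the main process when num_workers is 0:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class CountingDataset(IterableDataset):
    def __init__(self, n):
        self.n = n
        self.num_examples = 0  # incremented as items are yielded

    def __iter__(self):
        for i in range(self.n):
            # With num_workers > 0 this runs in a worker process, so it
            # mutates that worker's pickled copy of the dataset, not this one.
            self.num_examples += 1
            yield torch.tensor(i)


if __name__ == "__main__":
    for workers in (0, 2):
        ds = CountingDataset(8)
        for _ in DataLoader(ds, num_workers=workers):
            pass
        # num_workers=0 -> 8 (iteration happens in the main process)
        # num_workers=2 -> 0 (updates stay in the worker copies)
        print(f"num_workers={workers}: ds.num_examples = {ds.num_examples}")
```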
I will note that evaluation_loop already goes to some trouble to track the actual number of examples it sees, so unless I'm missing something, I think one could just always use that count.