
Replace ValueError with warning in has_len() #9785

Closed
ninginthecloud opened this issue Oct 1, 2021 · 6 comments · Fixed by #9827
Assignees
Labels
feature Is an improvement or enhancement

Comments

@ninginthecloud
Contributor

🚀 Feature

Motivation

has_len() in data.py checks whether a dataloader implements the __len__ method.
When the dataloader has zero length, it raises a ValueError (see code). Recently, users reported that when a dataset has fewer samples than the number of ranks, the sampler splits the dataset unevenly, and zero batches on some ranks is sometimes unavoidable. This ValueError blocks them from a successful PL run. Fixing it also bridges a gap for Detectron2Go to better use Lightning: https://github.com/facebookresearch/d2go

Pitch

Replace the ValueError with a warning.
Sync the length of the dataloader across ranks instead of checking only the local dataset. By doing so, we can still catch the case where the user unintentionally fits an empty dataloader.
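The cross-rank check in the pitch could look roughly like the sketch below. It is a hypothetical outline: `per_rank_lengths` stands in for the local dataloader lengths after a collective gather (e.g. an all_reduce), which is abstracted away here.

```python
import warnings


def check_total_length(per_rank_lengths):
    """Sketch of the proposed check: error only when every rank is empty,
    warn when just some ranks are empty."""
    total = sum(per_rank_lengths)
    if total == 0:
        # Empty on every rank: almost certainly a user mistake.
        raise ValueError("The dataloader is empty on every rank.")
    if min(per_rank_lengths) == 0:
        # Empty on some ranks only: warn instead of blocking the run.
        warnings.warn("Some ranks received an empty dataloader.")
    return total
```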

Alternatives

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

  • Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.

@ninginthecloud ninginthecloud added the feature Is an improvement or enhancement label Oct 1, 2021
@ninginthecloud
Contributor Author

cc: @awaelchli @tchaton @ananthsub

@ninginthecloud ninginthecloud self-assigned this Oct 1, 2021
@tchaton
Contributor

tchaton commented Oct 1, 2021

Here are my suggestions.

Raise a MisconfigurationException if the total length across all ranks is 0.

Raise a warning if the total length isn't 0 but is 0 on at least one process.

However, this would very likely result in hanging during the backward pass on the all_reduce operation as Lightning doesn't support uneven batches yet.

I still believe this is something for advanced users.

One extra option would be the following.

total_length = ...  # dataloader length summed across all ranks
if total_length == 0:
    raise MisconfigurationException(...)

if total_length > 0 and local_length == 0:
    if self.datamodule.enable_zero_length_dataloader_with_ddp:
        rank_zero_warn(...)
    else:
        raise MisconfigurationException(...)

@ananthsub
Contributor

@ninginthecloud to clarify, is this only for evaluation? Or does this check apply to training too?

Uneven batches in validation also mean the user needs to take extra precautions around syncing state, such as metrics. I think we ought to include that in the warning, as D2Go handles this but others might not.

@awaelchli
Contributor

@ananthsub the check applies for training too. The has_len check happens in reset_train_dataloaders in data_loading.py.

I think we can do what @tchaton suggested to compute the length over all ranks.

@aazzolini

Hey folks, what's the status here? It seems we still have to apply an internal patch to replace this error with a warning.

@awaelchli
Contributor

The PR is ready and can be merged. A small comment is left to be addressed by @ninginthecloud :)
