🐛 Bug
In my D2Go model training, the evaluation metrics are reduced onto rank 0 and logged in the validation_epoch_end hook; the other ranks just log an empty dict.
However, since the recent change #6417, _all_gather is always run on the logged metrics. Because _all_gather is a collective call, rank 0 enters it for a key that the other ranks never log, so training gets stuck there until a NCCL timeout error kicks in.
Please reproduce using the BoringModel
To Reproduce
Use the following BoringModel and post here
Apply the following code in the validation_epoch_end hook and run with 2 GPUs:
from torch.distributed import get_rank

def validation_epoch_end(self, outputs) -> None:
    rank = get_rank()
    res = {}
    if rank == 0:
        # the real code reduces the metric across ranks; a constant stands in here
        res = {"reduced_metric": 0.1}
    # only rank 0 has the key; the other ranks log an empty dict
    self.log_dict(res)
https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing
(note: I'm not sure how to run distributed training in Colab, so I was not able to reproduce this in the notebook above; see the self-contained sketch below for a local 2-GPU run)
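For completeness, here is a minimal self-contained sketch of the reproduction. It is a sketch only: the BoringModel/RandomDataset definitions are re-implemented inline rather than taken verbatim from bug_report_model.py, and the Trainer arguments (gpus, accelerator="ddp") assume a PyTorch Lightning 1.2.x-era API with a 2-GPU machine.

import torch
from torch.utils.data import DataLoader, Dataset
from torch.distributed import get_rank
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return {"loss": self(batch).sum()}

    def validation_step(self, batch, batch_idx):
        return {"x": self(batch).sum()}

    def validation_epoch_end(self, outputs) -> None:
        res = {}
        if get_rank() == 0:
            # only rank 0 produces the reduced metric
            res = {"reduced_metric": 0.1}
        # the other ranks log an empty dict -> mismatched _all_gather -> hang
        self.log_dict(res)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    train = DataLoader(RandomDataset(32, 64), batch_size=2)
    val = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=2,
    )
    trainer.fit(BoringModel(), train, val)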
Expected behavior
Logging should not try to run _all_gather / reduce on metric keys that are missing on some ranks; a rank that logs an empty dict should not block the others until the NCCL timeout.
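A possible user-side workaround (my assumption, not something documented by Lightning for this case) is to make every rank log the same key, e.g. by broadcasting the already-reduced value from rank 0 before calling log_dict, so the collective sees matching keys on all ranks:

import torch
import torch.distributed as dist


def validation_epoch_end(self, outputs) -> None:
    # hypothetical workaround: reduce on rank 0, then broadcast the scalar so
    # every rank logs the same key and the _all_gather call is matched
    value = torch.tensor(0.0, device=self.device)
    if dist.get_rank() == 0:
        value = torch.tensor(0.1, device=self.device)  # the reduced metric
    dist.broadcast(value, src=0)
    self.log_dict({"reduced_metric": value})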
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: