dist_sync_fn seems dysfunctional #1183

Closed
yellowdolphin opened this issue Aug 13, 2022 · 2 comments · Fixed by #1301

yellowdolphin commented Aug 13, 2022

🐛 Bug

dist_sync_fn is a documented kwarg of the Metric class, and the documentation suggests passing an alternative to torch.distributed.all_gather. I suppose the intent is to allow the use of distributed contexts other than torch.distributed, for which the default works fine. This use case fails due to the following issues:

  • the passed dist_sync_fn is never called, because the hardcoded check jit_distributed_available() returns False in any other sync context (see the sketch after this list)
  • this check is assigned as the default for the distributed_available kwarg of sync() and sync_context(), but neither can be called by the user, because all metrics already wrap their compute() method at init
  • sync_context can be applied only once to a function, otherwise the inner wrapper raises TorchMetricsUserError "The Metric has already been synced."
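
For reference, jit_distributed_available amounts roughly to the following check (paraphrased here, so treat the exact body as an assumption). It only reflects the state of torch.distributed, so it returns False under torch_xla / XLA multiprocessing even though a distributed context exists:

import torch

def jit_distributed_available() -> bool:
    # True only when the default torch.distributed process group is initialized,
    # which is never the case when the distributed context comes from torch_xla.
    return torch.distributed.is_available() and torch.distributed.is_initialized()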

To Reproduce

Code for testing Accuracy and MetricCollection in a TPU context (colab): https://colab.research.google.com/drive/1MlxWSrkKKuZ3WSb9duf1c0MoO2A8jDAE?usp=sharing

Code sample

acc = torchmetrics.Accuracy(dist_sync_fn=gather_fn).to(device)
acc.update(scores.detach(), labels)
acc_value = acc.compute()
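
Here gather_fn is a user-supplied replacement for torch.distributed.all_gather. A minimal sketch of what it could look like in a torch_xla (TPU) context, assuming the same contract as the default gather_all_tensors (take a tensor and an optional process group, return a list of per-replica tensors):

import torch
import torch_xla.core.xla_model as xm

def gather_fn(tensor: torch.Tensor, group=None):
    # xm.all_gather concatenates replicas along the given dim, so add a leading
    # replica dim first, then split the result back into a list of tensors,
    # which is what torchmetrics expects from its dist_sync_fn.
    gathered = xm.all_gather(tensor.unsqueeze(0), dim=0)
    return list(gathered.unbind(dim=0))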

Expected behavior

The metric should sync automatically across processes and compute the correct value.

Environment

  • TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 0.10dev
  • Python & PyTorch Version (e.g., 1.0):
  • Any other relevant information such as OS (e.g., Linux): e.g. torch_xla.distributed

Additional context

I found that with minimal changes in metrics.py, dist_sync_fn can actually be used to run torchmetrics on TPUs (torch_xla), as in the code sample above. I could send a PR (see the fork in the colab notebook).

yellowdolphin added the bug / fix and help wanted labels on Aug 13, 2022
github-actions commented

Hi! Thanks for your contribution, great first issue!

SkafteNicki added this to the v0.11 milestone on Sep 14, 2022
Borda commented Oct 19, 2022

@justusschock, mind having a look at it? 🦦
