
Fix checkpoint warning for floats #2143

Closed
ZhaofengWu opened this issue Jun 11, 2020 · 10 comments · Fixed by #3855
Labels: bug Something isn't working · good first issue Good for newcomers · help wanted Open to be worked on

Comments

@ZhaofengWu
Contributor

ZhaofengWu commented Jun 11, 2020

#1819 added this warning when saving checkpoints: https://github.com/PyTorchLightning/pytorch-lightning/blob/bd49b07fbba09b1e7d8851ee5a1ffce3d5925e9e/pytorch_lightning/callbacks/model_checkpoint.py#L259-L263

  1. Is there a reason why native Python floats aren't supported? Returning them is quite common, especially when libraries such as scipy are involved in computing the metrics.
  2. The warning says the checkpoint was not saved, but checkpoints are in fact still saved; this check doesn't change the saving logic at all, it only emits the warning. (A minimal repro is sketched below.)
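
For illustration, a minimal sketch of the pattern that trips the warning, assuming the 0.8-era API where validation_epoch_end returns a dict (the module and values here are made up):

    import torch
    from pytorch_lightning import LightningModule

    class FloatMetricModule(LightningModule):
        # Other hooks (training_step, configure_optimizers, ...) omitted.
        def validation_epoch_end(self, outputs):
            # Pretend this came from scipy/sklearn: a native Python float.
            val_acc = 0.87
            # A float under the monitored key trips the "must be a
            # Torch.Tensor instance, checkpoint not saved" warning,
            # even though the checkpoint is still written.
            return {"val_acc": val_acc, "log": {"val_acc": val_acc}}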
@ZhaofengWu ZhaofengWu added the help wanted Open to be worked on label Jun 11, 2020
@Borda Borda added the bug Something isn't working label Jun 11, 2020
@Borda
Member

Borda commented Jun 11, 2020

@justusschock @SkafteNicki mind having a look :]

@SkafteNicki
Member

The reason it still works is that the value is converted to a torch.Tensor later on:
https://github.com/PyTorchLightning/pytorch-lightning/blob/bd49b07fbba09b1e7d8851ee5a1ffce3d5925e9e/pytorch_lightning/callbacks/model_checkpoint.py#L176-L181
so in that sense floats are supported. I guess the warning is there for two reasons:

  • If you return something other than a native float, like a dict, you get a non-informative error (RuntimeError: Could not infer dtype of dict); the warning hopefully helps you figure out what is going on (see the snippet after this list). So the only way we can guarantee that this will work is if you return a torch.Tensor.
  • The PR that implemented this tried to fix some ddp bugs. I cannot completely work out how this relates to ddp right now, but I guess that, again, torch.Tensor is the only type that guarantees this works in ddp mode.
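
Both behaviors are easy to see in isolation with plain torch, nothing Lightning-specific:

    import torch

    # A native float converts cleanly, which is why the checkpoint
    # is still saved despite the warning:
    print(torch.tensor(0.87))  # tensor(0.8700)

    # A dict does not, and the error says nothing about logging:
    try:
        torch.tensor({"acc": 0.87})
    except RuntimeError as err:
        print(err)  # Could not infer dtype of dict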

@ZhaofengWu
Contributor Author

If types such as floats are supported, shouldn't they be excluded from the check so that no warning is shown? And IMHO a warning message that says checkpoints won't be saved, when they actually are, is not desirable. One way this could look is sketched below.
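
A minimal sketch of the kind of change being proposed (a hypothetical helper, not the actual fix that later landed in #3855): coerce supported scalars to tensors, and warn accurately only for genuinely unsupported types.

    import numbers
    import warnings

    import torch

    def ensure_tensor_metric(current):
        # Floats and ints convert cleanly, so don't warn about them.
        if isinstance(current, numbers.Number):
            return torch.tensor(current)
        if not isinstance(current, torch.Tensor):
            # Warn without claiming the checkpoint was skipped.
            warnings.warn(
                f"Monitored metric has unsupported type "
                f"{type(current).__name__}; return a torch.Tensor "
                "or a plain Python number."
            )
        return current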

@Borda
Member

Borda commented Jun 11, 2020

sounds reasonable, mind drafting a PR? 🐰

@edenlightning
Contributor

@ZhaofengWu Checking in :)

@ZhaofengWu
Contributor Author

Ah sorry, I overlooked this. I might not have time to do this at the moment. Sorry :(

@Borda Borda added the good first issue Good for newcomers label Sep 15, 2020
@edenlightning edenlightning changed the title from "The metric you returned must be a Torch.Tensor instance, checkpoint not saved" to "Fix checkpoint warning for floats" Sep 22, 2020
@edenlightning edenlightning added feature Is an improvement or enhancement and removed bug Something isn't working labels Sep 22, 2020
@edenlightning edenlightning modified the milestones: 0.9.x, 1.1 Sep 22, 2020
@ddrevicky
Contributor

You can assign this to me :).

@ddrevicky
Contributor

ddrevicky commented Sep 28, 2020

Logging floats actually doesn't work with the current version.

I tried logging a metric with different dtypes (tensor, float, int, dict, list) and it only works with tensors. See this colab.

For the other types it fails even before reaching the model_checkpoint.py code, during the reduction of the per-validation-step results at the end of a validation epoch in step_result.py:

    873 def weighted_mean(result, weights):
--> 874     weights = weights.to(result.device)
    875     numerator = torch.dot(result.float(), weights.transpose(-1, 0).float())
    876     result = numerator / weights.sum().float()

AttributeError: 'list' object has no attribute 'device'

For example, if the metric logged in two validation steps is 42.0 and 43.0 (floats), the collected values to be reduced are the plain list [42.0, 43.0], which has no .device attribute. The snippet below reproduces this.
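
To make the failure concrete, here is the same reduction run standalone (weighted_mean copied in shape from the step_result.py excerpt above; the inputs are made up):

    import torch

    def weighted_mean(result, weights):
        # Same structure as the step_result.py code quoted above.
        weights = weights.to(result.device)
        numerator = torch.dot(result.float(), weights.transpose(-1, 0).float())
        return numerator / weights.sum().float()

    values = [42.0, 43.0]               # floats collected across val steps
    weights = torch.tensor([1.0, 1.0])

    try:
        weighted_mean(values, weights)  # a plain list has no .device
    except AttributeError as err:
        print(err)                      # 'list' object has no attribute 'device'

    # Converting to a tensor first is all it takes:
    print(weighted_mean(torch.tensor(values), weights))  # tensor(42.5000)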

I think this issue is connected with #3678 and #2636.

Basically, shouldn't we ensure that the metrics a user logs fulfill some basic properties? Then later in the program we could assume they are valid and avoid extra downstream checks like the one this issue is about. A sketch of the idea follows.
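
For instance, a hypothetical validate-at-log-time helper (the name and placement are made up; this sketches the idea, not Lightning's actual API):

    import numbers

    import torch

    def validate_logged_metric(name, value):
        # Reject unsupported types up front with an informative error,
        # instead of failing later in weighted_mean or torch.tensor().
        if isinstance(value, torch.Tensor):
            return value
        if isinstance(value, numbers.Number):
            return torch.tensor(float(value))
        raise TypeError(
            f"Logged metric '{name}' has unsupported type "
            f"{type(value).__name__}; log a number or a torch.Tensor."
        )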

@Borda Borda added bug Something isn't working and removed feature Is an improvement or enhancement labels Oct 2, 2020
@Borda
Member

Borda commented Oct 2, 2020

For example, if the metric logged in two validation steps is 42.0 and 43.0 (floats), the collected values to be reduced are [42.0, 43.0], which do not have a .device attribute.

then you need to make a tensor, which has those attributes... :]

@ddrevicky
Contributor

Yeah :) Just found this PR, which could fix float logging once merged.

@Borda Borda modified the milestones: 1.1, 1.0 Oct 6, 2020
NickleDave added a commit to vocalpy/vak that referenced this issue Sep 21, 2023
Makes functions in `vak.transforms.distance.functional`
return tensors so we don't cause errors when lightning
tries to convert from numpy to tensors to log.

Letting lightning do the conversion kind of works,
but it can cause a fatal error
for someone using an Apple M1 with 'mps' as the accelerator,
see https://forum.vocalpy.org/t/vak-tweetynet-with-an-apple-m1-max/78/4?u=nicholdav

I don't find any explicit statement in either the Lightning
or Torchmetrics docs that metrics should always be tensors,
and that this guarantees there won't be weird issues
(right now we get a warning on start-up that all logged scalars
should be float32, but I would expect one should be able to log
integers too?).
But from various issues I have read, it seems that should be the case
(Lightning-AI/pytorch-lightning#2143),
and I notice that torchmetrics classes tend to do things like
convert to a float tensor.
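
The shape of the change that commit describes, as a sketch (the metric below is a made-up stand-in, not actual vak code):

    import torch

    def segment_error_rate(y_pred, y_true):
        # Compute with plain Python, then hand Lightning a float32
        # tensor so it never has to convert numpy scalars itself
        # (the conversion that broke on the 'mps' accelerator).
        errors = sum(p != t for p, t in zip(y_pred, y_true))
        errors += abs(len(y_pred) - len(y_true))
        rate = errors / max(len(y_true), 1)
        return torch.tensor(rate, dtype=torch.float32)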
NickleDave added a commit to vocalpy/vak that referenced this issue Sep 21, 2023

NickleDave added a commit to vocalpy/vak that referenced this issue Sep 23, 2023

NickleDave added a commit to vocalpy/vak that referenced this issue Sep 23, 2023