TensorMetric not updated to cuda device #2274
Comments
Never mind, the issue still persists in the 0.8.1 release.
I understand now that the metrics must be moved to the appropriate device, but it's not clear what the best way to do this is, or whether it should be handled automatically by the LightningModule somehow. Currently I'm just setting …
Can metrics not be used for the loss computation?
They should be able to...
This should be possible after you call `.cuda()` / `.to(device)` on the metric. This means: once you add it as an attribute of your LightningModule, it should be updated automatically. For completeness: this happens because we automatically move the result to the metric's `_device`, which defaults to `cpu`.
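For illustration, a minimal sketch of that attribute pattern (the metric subclass and module here are made up; the import path assumes the 0.8.x layout):

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.metrics import TensorMetric  # 0.8.x-era import path (assumption)


class MAEMetric(TensorMetric):
    """Hypothetical metric, just to have something concrete to attach."""

    def __init__(self):
        super().__init__(name="mae")

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (pred - target).abs().mean()


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)
        # As an attribute, the metric should follow the module's device:
        # LitModel().cuda() is expected to update self.metric as well.
        self.metric = MAEMetric()
```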
Based on the stack trace above, though, this movement off the CPU and onto the CUDA device doesn't seem to be happening. I'll try to post a minimal example using a metric as an attribute of a LightningModule soon. The stack trace is from a module that uses the above TverskyLossMetric code, which also calls …
Unfortunately I could also reproduce this behaviour for … So this is not specific to metrics, but rather a general issue with our mixin here.
I guess this can be fixed by introducing a buffer to each metric, which is a feature we need to implement regardless of this problem if we want to accumulate stats over multiple batches. For now, simply changing the base class to register its tensors as buffers should work.
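A minimal sketch of that buffer idea (a reconstruction under assumptions, not the original snippet; it assumes a plain `torch.nn.Module`-based metric base class):

```python
import torch


class BufferedMetric(torch.nn.Module):
    """Sketch of a metric base class that keeps its state in registered buffers.

    Because `accumulated` and `num_batches` are buffers, `module.to(device)`
    / `.cuda()` move them along with the rest of the module, so the metric
    never gets stuck on the CPU.
    """

    def __init__(self):
        super().__init__()
        self.register_buffer("accumulated", torch.tensor(0.0))
        self.register_buffer("num_batches", torch.tensor(0))

    def update(self, value: torch.Tensor) -> None:
        # Accumulate stats over multiple batches, as suggested above.
        self.accumulated += value.detach()
        self.num_batches += 1

    def compute(self) -> torch.Tensor:
        return self.accumulated / self.num_batches.clamp(min=1)
```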
I think I fixed this recently in #2657
@tayden It just looks like that, but that is only the initial value there. The DeviceDtypeMixin takes care of updating the device:

```python
metric = FocalTverskyMetric(alpha=0.5, beta=0.5, gamma=1.).cuda()
```

But of course, if you assign the metric as a property of the LightningModule, then the metric will also be moved automatically to the device the LightningModule is on. I agree with @SkafteNicki: all metrics that save tensors must register them as buffers.
🐛 Bug
When tensors located on the GPU are passed through a TensorMetric's forward method, they are transferred to the CPU by that method. It seems that the metric's `cpu` device is not updated when training on a GPU; the `cpu` device appears to be hardcoded here.
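A quick way to observe this (a sketch; the `L1Metric` subclass is hypothetical and the import path assumes the 0.8.x layout):

```python
import torch
from pytorch_lightning.metrics import TensorMetric  # 0.8.x-era import path (assumption)


class L1Metric(TensorMetric):
    def __init__(self):
        super().__init__(name="l1")

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (pred - target).abs().mean()


metric = L1Metric()
pred = torch.rand(4, device="cuda")
target = torch.rand(4, device="cuda")

# The inputs live on the GPU, but the output reportedly comes back on the CPU,
# because the metric's internal device still defaults to 'cpu'.
print(metric(pred, target).device)
```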
To Reproduce
Steps to reproduce the behavior:
When training in DDP mode with 16-bit precision, these metrics throw the stack trace below. The error disappears in 32-bit mode, since it is the AMP grad scaler that asserts the loss value is a CUDA tensor.
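A rough reproduction sketch (module, metric, and data are made up; the `Trainer` arguments follow the 0.8.x API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.metrics import TensorMetric  # 0.8.x-era import path (assumption)


class L1Metric(TensorMetric):
    def __init__(self):
        super().__init__(name="l1")

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (pred - target).abs().mean()


class Repro(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)
        self.loss = L1Metric()  # metric used directly as the loss

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # With 16-bit DDP training, AMP's grad scaler asserts that the loss
        # is a CUDA tensor; the metric reportedly returns a CPU tensor instead.
        return {"loss": self.loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        ds = TensorDataset(torch.rand(64, 8), torch.rand(64, 1))
        return DataLoader(ds, batch_size=8)


trainer = pl.Trainer(gpus=1, precision=16, distributed_backend="ddp", max_epochs=1)
trainer.fit(Repro())
```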
Expected behavior
The TensorMetric should have `self._device` updated to match the current `Trainer` device during initialization.
Environment
The bug is also present on AWS p3.8xlarge instances using the same environment, but with 4 NVIDIA Tesla V100s.