
Weight tying is broken on TPUs leading to silent errors #2705

Closed
ibeltagy opened this issue Jul 25, 2020 · 6 comments · Fixed by #5441
Labels: accelerator: tpu (Tensor Processing Unit) · bug (Something isn't working) · feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

Comments

@ibeltagy
Contributor

🐛 Bug

The PyTorch/XLA documentation mentions here that weight tying should happen after moving the tensors to the XLA device; otherwise the tensors are copied and the tying is broken. This is a silent error that can easily go undetected (thanks to @matt-peters for pointing it out), and it would be good if PL guarded the user against it. Note that weight tying is pretty common in today's models, not a corner case.

Code sample

The following code snippet shows how to detect that this issue is happening and how to guard against it.

import pytorch_lightning as pl

class MyPLModel(pl.LightningModule):
    def to(self, *args, **kwargs):
        # Tied parameters share storage, so self.parameters() counts them once.
        param_count_before_moving_to_device = len(list(self.parameters()))
        super().to(*args, **kwargs)
        if self.trainer.use_tpu:
            # need to re-tie the weights after moving to XLA!
            self.tie_weights()  # a new function that the user needs to implement
        # If the move silently broke the tying, the copies now count separately,
        # the parameter count goes up, and the assert below fails.
        param_count_after_moving_to_device = len(list(self.parameters()))
        assert param_count_before_moving_to_device == param_count_after_moving_to_device
        return self
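
For context, a minimal sketch of what such a tie_weights implementation could look like for a model whose input embedding is tied to its output projection (the module and attribute names here are hypothetical, not part of any Lightning API):

import torch.nn as nn
import pytorch_lightning as pl

class TiedLMModel(pl.LightningModule):
    def __init__(self, vocab_size=1000, hidden_size=128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.tie_weights()

    def tie_weights(self):
        # Point the output projection at the embedding matrix so both use the
        # same tensor; calling this again after .to() restores the sharing
        # that the XLA move would otherwise silently break.
        self.lm_head.weight = self.embeddings.weight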
ibeltagy added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jul 25, 2020
edenlightning added the accelerator: tpu (Tensor Processing Unit) and feature (Is an improvement or enhancement) labels and removed the bug (Something isn't working) label on Jul 29, 2020
@stale

stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Oct 22, 2020
@ibeltagy
Contributor Author

@zcain117, might be worth checking. It is a subtle issue and users are likely going to miss it.

stale bot removed the won't fix (This will not be worked on) label on Oct 22, 2020
@Borda Borda added this to the 1.1.x milestone Dec 3, 2020
@Borda
Member

Borda commented Dec 3, 2020

@ibeltagy is the problem still on master? 🐰
cc: @zcain117 @lezwon

@ibeltagy
Contributor Author

ibeltagy commented Dec 3, 2020

Most probably yes. It needs a small API change to support a LightningModule.tie_weights function that gets called from LightningModule.to. I don't know if there's a way to fix it without such an API change.
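
For illustration, the proposed change amounts to something like the sketch below; the hook name follows the suggestion in this thread, not any released Lightning API:

import pytorch_lightning as pl

class WeightTyingLightningModule(pl.LightningModule):
    # Hypothetical base class showing where the hook would be invoked.

    def tie_weights(self):
        # No-op by default; models that tie weights override this to
        # re-establish parameter sharing after every device move.
        pass

    def to(self, *args, **kwargs):
        out = super().to(*args, **kwargs)
        self.tie_weights()
        return out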

@Borda
Member

Borda commented Dec 21, 2020

@zcain117 @lezwon ^^

@lezwon
Contributor

lezwon commented Dec 22, 2020

@Borda will pick this up :) 👍

@Borda Borda modified the milestones: 1.1.x, 1.2 Dec 30, 2020
Borda added the bug (Something isn't working) label on Jan 26, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.2.x Feb 8, 2021
@Borda Borda modified the milestones: 1.2.x, 1.2 Feb 18, 2021