[fix] OSS dict load/save fix - better fix than 383 and unit test #386
Conversation
@@ -859,10 +861,16 @@ def closure_sharded(input_tensor=input_tensor):
sharded_optim_state_dict = sync_object_ranks(sharded_optim_state_dict, RECIPIENT_RANK, device)

# - cross load the states
# run one step and check that the models are still the same
this modified unit test does break on the old version
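For readers without the full test file, a hedged, generic sketch (plain torch with hypothetical model/optimizer names, not the fairscale test itself) of the round trip this test exercises: save an optimizer's state, load it into a fresh optimizer, run one step with each, and check the results still match.

```python
import copy
import torch

model_a = torch.nn.Linear(4, 4)
model_b = copy.deepcopy(model_a)

opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1, momentum=0.9)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(2, 4)

def step(model, opt):
    # one optimization step on a shared input, returning the loss
    opt.zero_grad()
    loss = model(x).sum()
    loss.backward()
    opt.step()
    return loss

step(model_a, opt_a)                          # build up some optimizer state (momentum)
opt_b.load_state_dict(opt_a.state_dict())     # cross load the states
model_b.load_state_dict(model_a.state_dict())

# run one step and check that the models are still the same
loss_a = step(model_a, opt_a)
loss_b = step(model_b, opt_b)
assert torch.allclose(loss_a, loss_b)
```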
@@ -379,6 +379,8 @@ def state_dict(self) -> Dict[str, Any]:
global_id = self.param_to_index[local_index_to_param_id[local_param_index]]
state_dict["state"][global_id] = s["state"][local_param_index]

+ # Make sure that the parameters are sorted in the state, as expected
+ state_dict["state"] = dict(sorted(state_dict["state"].items()))
The returned state dict was properly sorted under the "param_groups" key, but not under the "state" field, which followed the partitioning order. The loading code assumed it was sorted, so that would break.
PyTorch just uses the ordering from the "param_groups" key, and I was only testing loading OSS -> PyTorch and vice versa, so this was unfortunately not caught.
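A minimal, self-contained sketch (with made-up keys and values standing in for the real per-param optimizer state) of why an unsorted "state" field breaks a loader that assumes "param_groups" ordering:

```python
# made-up keys/values standing in for {global param index: per-param optimizer state}
state = {2: "exp_avg for p2", 0: "exp_avg for p0", 1: "exp_avg for p1"}  # partition order
params_in_group_order = ["p0", "p1", "p2"]                               # "param_groups" order

# a loader that assumes the "state" keys follow the "param_groups" ordering
# pairs the wrong states with the wrong parameters
broken = list(zip(params_in_group_order, state.values()))
# [('p0', 'exp_avg for p2'), ('p1', 'exp_avg for p0'), ('p2', 'exp_avg for p1')]

# the fix: re-sort "state" by its global index before returning the state dict
state = dict(sorted(state.items()))
fixed = list(zip(params_in_group_order, state.values()))
# [('p0', 'exp_avg for p0'), ('p1', 'exp_avg for p1'), ('p2', 'exp_avg for p2')]
```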
I didn't know Python's dictionaries preserve insertion order, so I just looked it up. Turns out this has been an implementation detail since CPython 3.6 and a language guarantee since 3.7, good to know! https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6
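A tiny demonstration of the property being relied on: plain dicts keep keys in insertion order, so rebuilding a dict from sorted items yields deterministically ordered keys with no OrderedDict needed.

```python
# plain dicts preserve insertion order (CPython 3.6 implementation detail, guaranteed from 3.7)
unsorted = {2: "state for param 2", 0: "state for param 0"}
ordered = dict(sorted(unsorted.items()))
assert list(ordered.keys()) == [0, 2]
```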
# Only add this state to the sharded optimizer if it owns this param
for pg in self.optim.param_groups:
    if id(param) in [id(p) for p in pg["params"]]:
This second check could mask an issue: we already checked above that this rank owns this param, so it is not needed (and potentially risky).
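A hedged illustration of the point (names and structure are made up, not the fairscale internals): once the rank-ownership check has passed, re-scanning every param group for the same parameter only duplicates the check and can quietly paper over a partitioning bug rather than surface it.

```python
def restore_param_state(param_to_rank, rank, param_groups, optim_state, param, state):
    # first (sufficient) check: does this rank own the parameter?
    if param_to_rank[param] != rank:
        return

    # the redundant second ownership check looked roughly like this and was dropped:
    # for pg in param_groups:
    #     if id(param) in [id(p) for p in pg["params"]]:
    #         optim_state[param] = state

    # ownership is already guaranteed by the check above
    optim_state[param] = state
```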
@@ -832,8 +834,8 @@ def closure_sharded(input_tensor=input_tensor):
loss_sharded_optim = cast(torch.Tensor, sharded_optimizer.step(closure=closure_sharded))

assert torch.allclose(
-     loss_ddp, loss_sharded_optim
+     loss_ddp, loss_sharded_optim, rtol=1e-3
), f"Losses differ in between Pytorch optim and OSS\nworld size {world_size}"
The rtol change is unfortunately only needed on PyTorch 1.5. Without it, on a two-GPU machine the losses become 26.404895782470703 vs 26.404342651367188 (which I assume is due to a different cast and not structurally wrong) and the assert fires.
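For reference, a quick standalone check (plain torch, not part of the test) of why the default tolerance fails on the two losses quoted above: torch.allclose passes when |a - b| <= atol + rtol * |b|, with defaults rtol=1e-5 and atol=1e-8.

```python
import torch

loss_a = torch.tensor(26.404895782470703)
loss_b = torch.tensor(26.404342651367188)

# |a - b| ~ 5.5e-4, while atol + rtol * |b| ~ 2.6e-4 with the defaults -> fails
print(torch.allclose(loss_a, loss_b))             # False
# with rtol=1e-3 the bound becomes ~2.6e-2 -> passes
print(torch.allclose(loss_a, loss_b, rtol=1e-3))  # True
```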
I think it is worth documenting the reason for 1e-3.
sorry @min-xu-ai for the revert of the previous one, I just thought this was cleaner and there was one fix left out in the cold
yeah, this seems to be much nicer.
Before submitting
What does this PR do?
Fixes #380. A better take than #383, because it fixes another issue which was not caught (#383 was not enough) and reproduces the issue in an updated unit test so that this does not happen again. Thanks again @zhengwy888
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃