[fix][minor] Change empty shard handling for OSS, do not rely on asserts #460
Conversation
```diff
@@ -597,4 +593,4 @@ def _setup_flat_buffers(self) -> None:
             else:
                 self.buckets[device][dst_rank] = bucket
         else:
-            self.buckets[device].append(torch.zeros(1, device=device))
+            self.buckets[device].append(torch.zeros(0, device=device))
```
an empty tensor is a thing; it's a better way to catch empty shards
maybe add this same line as a comment for better code readability?
hmm ok, the comment was about the change actually, but if that helps I can always write something. The `if` clause is effectively `if (no params)`, so I thought that was kind of clear (outside of this very narrow PR view)
```diff
@@ -552,8 +545,11 @@ def _broadcast_params(self) -> None:
         for device in self.buckets.keys():
             for src_rank, bucket in enumerate(self.buckets[device]):
                 global_src_rank = self.get_global_rank(self.group, src_rank)
-                last_work_handle = dist.broadcast(tensor=bucket, src=global_src_rank, group=self.group, async_op=True)
+                if bucket.numel() > 0:
```
Not sure about broadcasting something empty for all backends, and it does not make a ton of sense anyway, so just skip it.
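The comment above is the core of the fix: with `torch.zeros(0)` as the placeholder, an empty shard is genuinely empty and can be detected with `numel()`, whereas the old `torch.zeros(1)` placeholder was indistinguishable from a real one-element shard. A minimal sketch (hypothetical bucket list, not the fairscale code):

```python
import torch

# Old placeholder: looks like real data, numel() == 1.
placeholder_old = torch.zeros(1)
# New placeholder: genuinely empty, numel() == 0.
placeholder_new = torch.zeros(0)

print(placeholder_old.numel())  # 1 -> cannot be told apart from a tiny shard
print(placeholder_new.numel())  # 0 -> can be skipped before broadcast

# Mirroring the PR's check: skip empty buckets instead of broadcasting them.
buckets = [torch.randn(4), placeholder_new, torch.randn(2)]
non_empty = [b for b in buckets if b.numel() > 0]
print(len(non_empty))  # 2
```

In the actual `_broadcast_params` loop the same `bucket.numel() > 0` test guards the `dist.broadcast` call, so ranks holding an empty shard never issue a collective for it.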
```diff
@@ -140,13 +140,6 @@ def partition_parameters(self) -> List[List[dict]]:
                 param_group_rank["params"] = params
                 self._partition_parameters[rank].append(param_group_rank)
```
Changing the logic: if a rank has an empty shard, it will be taken care of bucket-wise; no more asserting, since that proved fragile on FB infra (the job can be compiled without the asserts).
Just to clarify: this happens when one of the nodes does not have any params assigned to it? That's an empty shard, right?
Very nice. Running with `python -O` is LOL. If that speeds things up, then the program is probably running too much Python code. :-)
Before submitting
What does this PR do?
If a shard is empty, do not assert; skip the broadcasting step instead. The prior issue was that if somebody ran a distributed job with `python -O`, the asserts were compiled out and the job would just hang.
Now, if some ranks are empty, they simply don't participate in the optimization step (they don't update any tensor); they can still participate in the data-parallel part, which is probably a better take.
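The `python -O` hazard mentioned above is worth seeing concretely: the `-O` flag strips `assert` statements entirely, so control flow that relies on an assert firing silently continues instead. A small stdlib-only demonstration:

```python
import subprocess
import sys

# A program whose only guard is an assert.
code = "assert False, 'empty shard'\nprint('survived')"

# Normal mode: the assert fires and the process exits non-zero.
normal = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(normal.returncode != 0)        # True: AssertionError raised

# Optimized mode (-O): asserts are compiled out, execution continues.
optimized = subprocess.run([sys.executable, "-O", "-c", code],
                           capture_output=True, text=True)
print(optimized.stdout.strip())      # survived
```

This is why the PR replaces the assert with an explicit `numel()` check: the check runs regardless of interpreter flags.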
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃