Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][OSS] adding an assert for empty shards + corresponding unit test #406

Merged
merged 4 commits into from
Feb 22, 2021

Conversation

blefaudeux
Copy link
Contributor

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #405 .

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 19, 2021
@blefaudeux blefaudeux requested review from msbaines, min-xu-ai, anj-s and Vittorio-Caggiano and removed request for min-xu-ai February 19, 2021 23:09
fairscale/optim/oss.py Outdated Show resolved Hide resolved
tests/optim/test_oss.py Outdated Show resolved Hide resolved
@blefaudeux blefaudeux force-pushed the oss_assert_empty_shards branch from d24c484 to 9d3ce35 Compare February 20, 2021 00:07
@@ -8,12 +8,12 @@

adascale_test_data = [
# "input" value is a list of input tensors for micro-batch/rank 0 and micro-batch/rank 1.
{"input": [[1.0, 0], [0, 1.0]], "expected_gain": 2.0},
{"input": [[1.0, 0], [0, 1.0]], "expected_gain": 4.0 / 3},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@min-xu-ai @mikerabbat checking with you that this is ok. Since adascale is not changed by this PR, I assumed that the current state was correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is 4.0/3 is the new value? maybe if you init the bias to 0 then the value here won't change? the original value of 2 is because we have two grads from two ranks that are completely independent. it must be the grad from the bias are point to the same direction now, hence 4 grads but 3 directions. So it is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried setting the bias to zero, but the returned expected gain is still 1.3333

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. I think the new value make sense. Keep it 4.0/3 would be good.

Copy link
Contributor

@min-xu-ai min-xu-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting change! I think changing the expected gains are fine. Perhaps a few comments would be fine.

@@ -37,7 +37,7 @@ def _test_basic_func(rank, world_size, tempfile_name, test_case, oss, model=None
_dist_init(rank, world_size, tempfile_name, backend="nccl")

if model is None:
model = Linear(2, 2, bias=False)
model = Linear(2, 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need the bias or otherwise there isn't enough params?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly

def run_test_catch_empty_shardd(rank, world_size, tempfile_name):
dist_init(rank, world_size, tempfile_name, backend="gloo")
m = torch.nn.Linear(1, 1)
with pytest.raises(AssertionError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice test

@@ -8,12 +8,12 @@

adascale_test_data = [
# "input" value is a list of input tensors for micro-batch/rank 0 and micro-batch/rank 1.
{"input": [[1.0, 0], [0, 1.0]], "expected_gain": 2.0},
{"input": [[1.0, 0], [0, 1.0]], "expected_gain": 4.0 / 3},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is 4.0/3 is the new value? maybe if you init the bias to 0 then the value here won't change? the original value of 2 is because we have two grads from two ranks that are completely independent. it must be the grad from the bias are point to the same direction now, hence 4 grads but 3 directions. So it is fine.

@min-xu-ai
Copy link
Contributor

unfortunately the unit tests are still failing. Perhaps you need to update a few more places since the golden values are used in several places.

@blefaudeux blefaudeux merged commit 279b802 into master Feb 22, 2021
@blefaudeux blefaudeux deleted the oss_assert_empty_shards branch February 22, 2021 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
4 participants