
CI: 9 tests failing with PyTorch 1.10(dev) #9280

Closed
Borda opened this issue Sep 2, 2021 Discussed in #9273 · 12 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on)
Milestone
v1.4.x

Comments

@Borda
Member

Borda commented Sep 2, 2021

Discussed in #9273

Originally posted by daniellepintz September 2, 2021
I noticed that in recent commits these 9 tests are consistently failing in the Conda 1.10 environment.
[Screenshot of the failing CI run: https://github.com/PyTorchLightning/pytorch-lightning/runs/3483032531]

FAILED tests/callbacks/test_quantization.py::test_quantization[True-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[True-False-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-False-average]
FAILED tests/models/test_amp.py::test_amp_cpus[1-bf16-None] - TypeError: __in...
FAILED tests/models/test_amp.py::test_amp_cpus[1-bf16-ddp_spawn] - TypeError:...
FAILED tests/models/test_amp.py::test_amp_cpus[2-bf16-None] - torch.multiproc...
FAILED tests/models/test_amp.py::test_amp_cpus[2-bf16-ddp_spawn] - torch.mult...
FAILED tests/models/test_amp.py::test_cpu_model_with_amp - AssertionError: Re...

Starting this thread to track this issue and try to resolve it. Does anyone have any ideas? A minimal repro sketch for the bf16 failures follows below. cc @Borda
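For the bf16 failures, here is a minimal probe sketch (not the actual test code, and assuming a 1.10-era build where torch.cpu.amp.autocast and its dtype argument exist) to check whether CPU autocast with bfloat16 still runs under the installed nightly:

import torch

# minimal check of the CPU bf16 autocast path the failing tests rely on;
# prints the installed torch version and the dtype produced under autocast
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)

print(torch.__version__, out.dtype)  # expected: torch.bfloat16

If this snippet itself raises a TypeError on the nightly, the failures are likely an upstream API change rather than something in PL.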

@Borda Borda added the "bug" and "help wanted" labels Sep 2, 2021
@Borda Borda added this to the v1.4.x milestone Sep 2, 2021
@daniellepintz
Contributor

I am a bit confused because this test seems to be testing PyTorch 1.10, but according to https://pytorch.org/docs/versions.html, v1.10 is not released yet.

@Borda
Member Author

Borda commented Sep 6, 2021

I am a bit confused because this test seems to be testing PyTorch 1.10, but according to https://pytorch.org/docs/versions.html, v1.10 is not released yet.

Correct, we are building against the dev version to avoid any big breakage when the next release comes.
It is the same as FB building against PL master, but we are more conservative 🐰

@daniellepintz
Contributor

@Borda I see, thanks!

Update: after #9345, these are the remaining failing tests (see the sketch after the list for the callback they exercise):

FAILED tests/callbacks/test_quantization.py::test_quantization[True-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[True-False-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-False-average]
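For orientation, these parametrizations exercise Lightning's QuantizationAwareTraining callback; below is a minimal sketch of how it gets attached to a Trainer, assuming the "average" in the test IDs maps to the callback's observer_type argument (the True/False flags are other parametrized options):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import QuantizationAwareTraining

# hypothetical setup mirroring what the failing test IDs suggest
qcb = QuantizationAwareTraining(observer_type="average")
trainer = pl.Trainer(max_epochs=1, callbacks=[qcb])
# trainer.fit(model, datamodule=dm)  # with the test's LightningModule / DataModule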

@daniellepintz
Contributor

daniellepintz commented Sep 7, 2021

Stack trace for test_quantization[False-True-average]:

         quant_score = torch.mean(torch.tensor([mape(qmodel(x), y) for x, y in dm.test_dataloader()]))
         # test that the test score is almost the same as with pure training
>       assert torch.allclose(org_score, quant_score, atol=0.45)
E       assert False
E        +  where False = <built-in method allclose of type object at 0x7f66d3b6a780>(tensor(0.3262), tensor(0.9988), atol=0.45)
E        +    where <built-in method allclose of type object at 0x7f66d3b6a780> = torch.allclose

tests/callbacks/test_quantization.py:56: AssertionError

It seems quant_score is higher than it should be. When I run the test locally (PyTorch 1.9), quant_score = 0.5105.

@Borda, since you worked on this, do you have any idea what might be the issue? If not, should we check with the PyTorch folks?
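For reference, the failing comparison can be replayed in isolation with the scores from the traceback above and the value I saw locally (these are copied numbers, not freshly computed ones):

import torch

org_score = torch.tensor(0.3262)        # non-quantized model score (from the traceback)
quant_score_110 = torch.tensor(0.9988)  # quantized score under the 1.10 nightly (from the traceback)
quant_score_19 = torch.tensor(0.5105)   # quantized score observed locally on PyTorch 1.9

# the check at tests/callbacks/test_quantization.py:56
print(torch.allclose(org_score, quant_score_110, atol=0.45))  # False -> assertion fails
print(torch.allclose(org_score, quant_score_19, atol=0.45))   # True  -> assertion passes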

@Borda
Member Author

Borda commented Sep 7, 2021

It seems quant_score is higher than it should be. When I run the test locally, quant_score = 0.5105.
@Borda, since you worked on this, do you have any idea what might be the issue? If not, should we check with the PyTorch folks?

As nothing changed with the PL callback, I would open a parallel discussion with the PyTorch folks, and I'll check it on the PL side.

@daniellepintz
Contributor

daniellepintz commented Sep 7, 2021

Sounds good, PyTorch issue created: pytorch/pytorch#64564

@Borda Borda changed the title from "CI: 9 tests failing in Conda 1.10 env" to "CI: 9 tests failing with PyTorch 1.10(dev)" Sep 7, 2021
@daniellepintz
Contributor

daniellepintz commented Sep 11, 2021

So as of the latest commit on master, we now have 30 tests failing in the PyTorch 1.10 env: https://github.com/PyTorchLightning/pytorch-lightning/runs/3571593663

Here is what I notice about this job:

  1. it has been failing for months, at least

  2. it seems like the failures are mostly due to changes on the PT side (since we are pulling the new PT 1.10 nightly) rather than reflecting actual issues from changes on our side. Also, the failures are constantly changing/increasing every day, due to the changing nature of the core dependency, PT 1.10. It seems unsustainable to be constantly fixing issues from this test environment.

  3. this CI job is the last thing standing in the way of a green CI. There are a few other jobs that are flaky and sometimes fail, but this one consistently fails. Some recent commits on PL master have only one CI failure (this job).

Given this, what do people think about deleting this test? If this test is important, then we should make it required, which would force us to keep it green, but I feel it is really harmful to have it constantly making our CI red, and I don't think it is sustainable to keep fixing it all the time.

cc @PyTorchLightning/core-contributors

@carmocca
Contributor

Testing nightly is useful to track these errors. However, we don't want to mark it as required, to avoid upstream changes blocking us.

Since GitHub doesn't allow having "informative" jobs, if we want to keep the CI green but not mark a nightly job as required, the only option I can think of is only enabling it when there's a release candidate available, which should be more stable.
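One way to express that idea at the test level rather than in the workflow config (a hedged sketch; the version-string patterns below are assumptions about how nightly and RC builds are tagged):

import re

import pytest
import torch

# nightly/dev builds typically look like "1.10.0.dev20210911" or "1.10.0a0+git...",
# while release candidates look like "1.10.0rc1" and final releases like "1.10.0"
_IS_NIGHTLY = bool(re.search(r"(\.dev|a0\+|\+git)", torch.__version__))

# module-level marker: skip these tests on raw nightlies, run them on RC/final builds
pytestmark = pytest.mark.skipif(_IS_NIGHTLY, reason="run only against RC or final PyTorch releases")

def test_placeholder():
    # hypothetical stand-in for the nightly-sensitive tests
    assert torch.tensor(1.0).item() == 1.0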

@Borda
Member Author

Borda commented Sep 14, 2021

Testing nightly is useful to track these errors. However, we don't want to mark it as required, to avoid upstream changes blocking us.

we do not have these nightly versions as required 🐰

@daniellepintz
Contributor

daniellepintz commented Sep 14, 2021

the only option I can think of is only enabling it when there's a release candidate available, which should be more stable

@carmocca This sounds like a good option! I just think it is very important that our CI is actually green, not red in a way where only those who know it well know that it is effectively "green". Anyone visiting the repo, or a new contributor, should see that it is green.

I think it makes a lot of sense to only enable the newest PT version when there's an RC available, which will be more stable as you said. Also, this aligns well with the new release cycle, which is synced with PyTorch's. Since we won't be releasing PL 1.5 until October 26, after PT 1.10 is released, we can disable this test until closer to the release and then focus on making it green if there are any outstanding issues.

cc @tchaton @ananthsub

@Borda
Member Author

Borda commented Oct 14, 2021

so shall we close this issue? :]

@daniellepintz
Contributor

yup! thanks for the help
