
CI: 9 tests failing with PyTorch 1.10(dev) #9280

Closed
Borda opened this issue Sep 2, 2021 Discussed in #9273 · 12 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on)
Milestone
v1.4.x

Comments

@Borda
Member

Borda commented Sep 2, 2021

Discussed in #9273

Originally posted by daniellepintz September 2, 2021
I noticed that in recent commits these 9 tests are consistently failing in the Conda 1.10 environment.
[Screenshot of the failing CI run: https://github.com/PyTorchLightning/pytorch-lightning/runs/3483032531]

FAILED tests/callbacks/test_quantization.py::test_quantization[True-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[True-False-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-False-average]
FAILED tests/models/test_amp.py::test_amp_cpus[1-bf16-None] - TypeError: __in...
FAILED tests/models/test_amp.py::test_amp_cpus[1-bf16-ddp_spawn] - TypeError:...
FAILED tests/models/test_amp.py::test_amp_cpus[2-bf16-None] - torch.multiproc...
FAILED tests/models/test_amp.py::test_amp_cpus[2-bf16-ddp_spawn] - torch.mult...
FAILED tests/models/test_amp.py::test_cpu_model_with_amp - AssertionError: Re...

Starting this thread to track this issue and try to resolve it. Does anyone have any ideas? A minimal repro sketch for the bf16 failures follows below. cc @Borda
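For the bf16 failures, here is a minimal probe sketch (not the actual test code, and assuming a 1.10-era build where torch.cpu.amp.autocast and its dtype argument exist) to check whether CPU autocast with bfloat16 still runs under the installed nightly:

import torch

# minimal check of the CPU bf16 autocast path the failing tests rely on;
# prints the installed torch version and the dtype produced under autocast
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)

print(torch.__version__, out.dtype)  # expected: torch.bfloat16

If this snippet itself raises a TypeError on the nightly, the failures are likely an upstream API change rather than something in PL.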

@Borda Borda added the "bug" and "help wanted" labels Sep 2, 2021
@Borda Borda added this to the v1.4.x milestone Sep 2, 2021
@daniellepintz
Contributor

I am a bit confused because this test seems to be testing PyTorch 1.10, but according to https://pytorch.org/docs/versions.html, v1.10 is not released yet.

@Borda
Member Author

Borda commented Sep 6, 2021

I am a bit confused because this test seems to be testing PyTorch 1.10, but according to https://pytorch.org/docs/versions.html, v1.10 is not released yet.

Correct, we are building against the dev version to avoid any big breakage when the next release comes.
It is the same as FB building against PL master, but we are more conservative 🐰

@daniellepintz
Contributor

@Borda I see, thanks!

Update: after #9345, these are the remaining failing tests (see the sketch after the list for the callback they exercise):

FAILED tests/callbacks/test_quantization.py::test_quantization[True-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[True-False-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-True-average]
FAILED tests/callbacks/test_quantization.py::test_quantization[False-False-average]
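For orientation, these parametrizations exercise Lightning's QuantizationAwareTraining callback; below is a minimal sketch of how it gets attached to a Trainer, assuming the "average" in the test IDs maps to the callback's observer_type argument (the True/False flags are other parametrized options):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import QuantizationAwareTraining

# hypothetical setup mirroring what the failing test IDs suggest
qcb = QuantizationAwareTraining(observer_type="average")
trainer = pl.Trainer(max_epochs=1, callbacks=[qcb])
# trainer.fit(model, datamodule=dm)  # with the test's LightningModule / DataModule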

@daniellepintz
Contributor

daniellepintz commented Sep 7, 2021

Stack trace for test_quantization[False-True-average]:

         quant_score = torch.mean(torch.tensor([mape(qmodel(x), y) for x, y in dm.test_dataloader()]))
         # test that the test score is almost the same as with pure training
>       assert torch.allclose(org_score, quant_score, atol=0.45)
E       assert False
E        +  where False = <built-in method allclose of type object at 0x7f66d3b6a780>(tensor(0.3262), tensor(0.9988), atol=0.45)
E        +    where <built-in method allclose of type object at 0x7f66d3b6a780> = torch.allclose

tests/callbacks/test_quantization.py:56: AssertionError

It seems quant_score is higher than it should be. When I run the test locally (PyTorch 1.9), quant_score = 0.5105.

@Borda, since you worked on this, do you have any idea what might be the issue? If not, should we check with the PyTorch folks?
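For reference, the failing comparison can be replayed in isolation with the scores from the traceback above and the value I saw locally (these are copied numbers, not freshly computed ones):

import torch

org_score = torch.tensor(0.3262)        # non-quantized model score (from the traceback)
quant_score_110 = torch.tensor(0.9988)  # quantized score under the 1.10 nightly (from the traceback)
quant_score_19 = torch.tensor(0.5105)   # quantized score observed locally on PyTorch 1.9

# the check at tests/callbacks/test_quantization.py:56
print(torch.allclose(org_score, quant_score_110, atol=0.45))  # False -> assertion fails
print(torch.allclose(org_score, quant_score_19, atol=0.45))   # True  -> assertion passes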

@Borda
Member Author

Borda commented Sep 7, 2021

It seems quant_score is higher than it should be. When I run the test locally, quant_score = 0.5105.
@Borda, since you worked on this, do you have any idea what might be the issue? If not, should we check with the PyTorch folks?

As nothing changed with the PL callback, I would open a parallel discussion with the PyTorch folks, and I'll check it on the PL side.

@daniellepintz
Contributor

daniellepintz commented Sep 7, 2021

Sounds good, PyTorch issue created: pytorch/pytorch#64564

@Borda Borda changed the title from "CI: 9 tests failing in Conda 1.10 env" to "CI: 9 tests failing with PyTorch 1.10(dev)" Sep 7, 2021
@daniellepintz
Contributor

daniellepintz commented Sep 11, 2021

So as of the latest commit on master, we now have 30 tests failing in the PyTorch 1.10 env: https://github.com/PyTorchLightning/pytorch-lightning/runs/3571593663

Here is what I notice about this job:

  1. it has been failing for months, at least

  2. it seems like the failures are mostly due to changes on the PT side (since we are pulling the new PT 1.10 nightly) rather than reflecting actual issues from changes on our side. Also, the failures are constantly changing/increasing every day, due to the changing nature of the core dependency, PT 1.10. It seems unsustainable to be constantly fixing issues from this test environment.

  3. this CI job is the last thing standing in the way of a green CI. There are a few other jobs that are flaky and sometimes fail, but this one consistently fails. Some recent commits on PL master have only one CI failure (this job).

Given this, what do people think about deleting this test? If this test is important, then we should make it required, which would force us to keep it green, but I feel it is really harmful to have it constantly making our CI red, and I don't think it is sustainable to keep fixing it all the time.

cc @PyTorchLightning/core-contributors

@carmocca
Contributor

Testing nightly is useful to track these errors. However, we don't want to mark it as required, to avoid upstream changes blocking us.

Since GitHub doesn't allow having "informative" jobs, if we want to keep the CI green but not mark a nightly job as required, the only option I can think of is only enabling it when there's a release candidate available, which should be more stable.
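One way to express that idea at the test level rather than in the workflow config (a hedged sketch; the version-string patterns below are assumptions about how nightly and RC builds are tagged):

import re

import pytest
import torch

# nightly/dev builds typically look like "1.10.0.dev20210911" or "1.10.0a0+git...",
# while release candidates look like "1.10.0rc1" and final releases like "1.10.0"
_IS_NIGHTLY = bool(re.search(r"(\.dev|a0\+|\+git)", torch.__version__))

# module-level marker: skip these tests on raw nightlies, run them on RC/final builds
pytestmark = pytest.mark.skipif(_IS_NIGHTLY, reason="run only against RC or final PyTorch releases")

def test_placeholder():
    # hypothetical stand-in for the nightly-sensitive tests
    assert torch.tensor(1.0).item() == 1.0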

@Borda
Member Author

Borda commented Sep 14, 2021

Testing nightly is useful to track these errors. However, we don't want to mark it as required, to avoid upstream changes blocking us.

we do not have these nightly versions as required 🐰

@daniellepintz
Contributor

daniellepintz commented Sep 14, 2021

the only option I can think of is only enabling it when there's a release candidate available, which should be more stable

@carmocca This sounds like a good option! I just think it is very important that our CI is actually green, not red in a way where only those who know it well know that it is effectively "green". Anyone visiting the repo, or a new contributor, should see that it is green.

I think it makes a lot of sense to only enable the newest PT version when there's an RC available, which will be more stable as you said. Also, this aligns well with the new release cycle, which is synced with PyTorch's. Since we won't be releasing PL 1.5 until October 26, after PT 1.10 is released, we can disable this test until closer to the release and then focus on making it green if there are any outstanding issues.

cc @tchaton @ananthsub

@Borda
Member Author

Borda commented Oct 14, 2021

so shall we close this issue? :]

@daniellepintz
Contributor

yup! thanks for the help
