Remove CUDA 11.7 builds; add 11.8 #7616

Merged 7 commits on May 24, 2023

ptrblck (Contributor) commented May 23, 2023

CC @atalman @malfet

cc @seemethere

This PR removes the CUDA 11.7 builds from the build matrix: pytorch/test-infra#4205

pytorch-bot bot commented May 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7616

Note: Links to docs will display an error until the docs builds have been completed.

❌ 32 New Failures

As of commit d7e420d:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pmeier (Collaborator) commented May 23, 2023

Hey @ptrblck, we are in the process of killing CircleCI in favor of GitHub Actions (#7611). Thus, your patch would be removed soon. Plus, it is missing the parts on GitHub Actions that also need to be patched. I'll send a few new commits here for future reference.

pmeier (Collaborator) commented May 23, 2023

In the future, we also need to set this on the CMake workflow as I did in 79ae3e6 / #7417.

pmeier (Collaborator) commented May 23, 2023

Linux GPU failure looks valid

Traceback (most recent call last):
  File "/work/test/test_models.py", line 705, in test_classification_model
    _assert_expected(out.cpu(), model_name, prec=prec)
  File "/work/test/test_models.py", line 155, in _assert_expected
    torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
  File "/opt/conda/envs/ci/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 50 (2.0%)
Greatest absolute difference: 5.10198974609375 at index (0, 22) (up to 0.2 allowed)
Greatest relative difference: 0.2689853608608246 at index (0, 22) (up to 0.2 allowed)

Since the CPU test is just fine, this means that our GPU output now diverges from the CUDA 11.7 one.

We already have quite a few hacks in

def test_classification_model(model_fn, dev):

@NicolasHug do you know if we had something like above before?
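For reference, the numbers in the failure message can be checked by hand. The following is a minimal sketch, assuming the documented `torch.testing.assert_close` criterion `|actual - expected| <= atol + rtol * |expected|` and assuming the reported relative difference is the absolute difference divided by `|expected|`. The helper `is_close` and the reconstructed expected magnitude are illustrative only and are not taken from `test_models.py`.

```python
def is_close(actual, expected, rtol, atol):
    """Element-wise closeness criterion, as documented for torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the failure message for index (0, 22).
abs_diff = 5.10198974609375
rel_diff = 0.2689853608608246
rtol = atol = 0.2

# Assumption: rel_diff == abs_diff / |expected|, so |expected| can be recovered.
expected_mag = abs_diff / rel_diff

# Largest absolute difference the combined tolerance would accept for this element.
allowed = atol + rtol * expected_mag

# Hypothetical actual value; the sign of the deviation is not given in the log.
actual = expected_mag + abs_diff

print(f"|expected| ~ {expected_mag:.4f}")
print(f"allowed    ~ {allowed:.4f}")
print(f"observed     {abs_diff:.4f}")
print("close?", is_close(actual, expected_mag, rtol, atol))
```

Under these assumptions the element exceeds the combined tolerance by roughly one unit, so a small bump of `rtol` or `atol` would not hide it.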

atalman (Contributor) commented May 23, 2023

@NicolasHug @pmeier should we create a separate issue for this and unblock this PR?

NicolasHug (Member) commented

I'm not sure. The failure is rather concerning, especially considering it happens on an architecture as simple as ResNet. Are we sure this isn't an upstream regression?

@atalman atalman merged commit 15b4562 into pytorch:main May 24, 2023
github-actions commented

Hey @atalman!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request May 31, 2023
Reviewed By: vmoens

Differential Revision: D46314041

fbshipit-source-id: 5e22db72c8f1d550677f1eb39d86b6ff9bdb22de

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: atalman <atalman@fb.com>