skip CPU tests on GPU GHA jobs #6970

Merged
merged 16 commits into from
Feb 9, 2023

pmeier
Collaborator

@pmeier pmeier commented Nov 22, 2022

Blocked by #6957. I've debugged why no GPU test is run until 1a2efbe. Please discard my commits and comments before that.


To save CI resources, we don't run CPU tests on GPU machines. This behavior is hardcoded to CircleCI:

vision/test/conftest.py

Lines 15 to 24 in 4a310f2

def pytest_collection_modifyitems(items):
    # This hook is called by pytest after it has collected the tests (google its name to check out its doc!)
    # We can ignore some tests as we see fit here, or add marks, such as a skip mark.
    #
    # Typically here, we try to optimize CI time. In particular, the GPU CI instances don't need to run the
    # tests that don't need CUDA, because those tests are extensively tested in the CPU CI instances already.
    # This is true for both CircleCI and the fbcode internal CI.
    # In the fbcode CI, we have an additional constraint: we try to avoid skipping tests. So instead of relying on
    # pytest.mark.skip, in fbcode we literally just remove those tests from the `items` list, and it's as if
    # these tests never existed.

With the recent push to GHA (Nova), we also need this behavior there. On a GHA runner, the environment variable GITHUB_ACTIONS is set to "true". This mirrors the CIRCLECI variable that we already handle.
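
As a rough illustration of the skip rule described in the comment above (not the actual torchvision code), the selection could be sketched as follows. `FakeTest` and `select_tests` are hypothetical names, and the CI and CUDA states are passed in as plain booleans so the sketch can be tried without pytest, torch, or a CI runner:

```python
from typing import List, NamedTuple


class FakeTest(NamedTuple):
    # Hypothetical stand-in for a collected pytest item.
    name: str
    needs_cuda: bool


def select_tests(tests: List[FakeTest], in_ci: bool, cuda_available: bool) -> List[FakeTest]:
    """Return the tests that would still run after applying the skip rule."""
    if in_ci and cuda_available:
        # GPU CI machine: CPU-only tests are already covered by the CPU jobs.
        return [t for t in tests if t.needs_cuda]
    if not cuda_available:
        # No GPU at all: tests that need CUDA cannot run.
        return [t for t in tests if not t.needs_cuda]
    # Local machine with a GPU: run everything.
    return list(tests)


tests = [FakeTest("test_resize_cpu", False), FakeTest("test_resize_cuda", True)]
gpu_ci_names = [t.name for t in select_tests(tests, in_ci=True, cuda_available=True)]
cpu_ci_names = [t.name for t in select_tests(tests, in_ci=True, cuda_available=False)]
```

In this sketch the CPU job and the GPU job together still cover every test, which is the property the change relies on.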

cc @seemethere

@pmeier
Collaborator Author

pmeier commented Nov 22, 2022

Installation of torchvision on GHA prints:

No CUDA runtime is found, using CUDA_HOME='/work/ci_env'

@pmeier
Collaborator Author

pmeier commented Nov 22, 2022

Looking at the collected env, it seems there is an issue with the CUDA setup.

Collecting environment information...
PyTorch version: 1.14.0.dev20221122
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.15 (default, Nov  4 2022, 20:59:55)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.14.252-195.483.amzn2.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

The important bits:

CUDA used to build PyTorch: 11.6
[...]
Is CUDA available: False
CUDA runtime version: 11.6.124

It seems we install the right versions, but CUDA is not available for some reason. Maybe the driver is not set up properly?

@pmeier
Collaborator Author

pmeier commented Nov 22, 2022

Ok, running

modinfo nvidia || true
nvidia-smi || true
exit 0

gives

modinfo: ERROR: Module alias nvidia not found.
/exec: line 8: nvidia-smi: command not found

This has nothing to do with torchvision. @osalpekar can you have a look?
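
For reference, a similar check can be made from Python using only the standard library. This is a minimal sketch, assuming that the presence of the tools on `PATH` is a useful first signal; `driver_tools_present` is a hypothetical helper and it does not verify that a driver is actually loaded:

```python
import shutil
from typing import Dict, Iterable


def driver_tools_present(tools: Iterable[str] = ("nvidia-smi", "modinfo")) -> Dict[str, bool]:
    """Report, for each tool name, whether an executable is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


# Print one line per tool, e.g. to include in a CI log.
for tool, found in driver_tools_present().items():
    print(f"{tool}: {'found' if found else 'not found'}")
```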

@pmeier pmeier changed the title from "enable CUDA tests on GHA" to "skip CPU tests on GPU GHA jobs" Nov 22, 2022
@@ -13,11 +13,11 @@
 import __main__  # noqa: 401


-IN_CIRCLE_CI = os.getenv("CIRCLECI", False) == "true"
+IN_OSS_CI = any(os.getenv(var) == "true" for var in ["CIRCLECI", "GITHUB_ACTIONS"])
Collaborator Author

One common convention for CI providers is to set the CI=true environment variable. We could use that here as well, but I have no idea if that interferes with Meta internal systems. Thus, to be safe, we are explicit about the CI providers here. Given that we probably don't change them that often, I think this should be fine.
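
To make the trade-off concrete, here is a small sketch comparing the two approaches. The function names are hypothetical and the environments are plain dictionaries, not real runner environments; the only assumption about the providers is that GitHub Actions sets both `CI` and `GITHUB_ACTIONS` to `"true"`:

```python
from typing import Mapping

EXPLICIT_PROVIDERS = ("CIRCLECI", "GITHUB_ACTIONS")


def in_ci_generic(environ: Mapping[str, str]) -> bool:
    """Generic convention: any system that exports CI=true counts as CI."""
    return environ.get("CI") == "true"


def in_ci_explicit(environ: Mapping[str, str]) -> bool:
    """Explicit list: only the named providers count as CI."""
    return any(environ.get(var) == "true" for var in EXPLICIT_PROVIDERS)


github_env = {"CI": "true", "GITHUB_ACTIONS": "true"}
other_env = {"CI": "true"}  # some other system that also exports CI=true

print(in_ci_generic(github_env), in_ci_explicit(github_env))  # True True
print(in_ci_generic(other_env), in_ci_explicit(other_env))    # True False
```

The explicit list ignores any system that only exports `CI=true`, at the cost of having to extend the list when a provider is added.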

@pmeier pmeier marked this pull request as ready for review February 9, 2023 14:43
@pmeier pmeier requested review from osalpekar and NicolasHug and removed request for osalpekar February 9, 2023 14:43
Member

@NicolasHug NicolasHug left a comment

Thank you @pmeier! LGTM as long as it does what we want :)

@pmeier
Collaborator Author

pmeier commented Feb 9, 2023

PR: https://github.com/pytorch/vision/actions/runs/4135402074/jobs/7147802842#step:10:35257

= 1 failed, 13404 passed, 17527 skipped, 6 xfailed, 31 warnings in 624.39s (0:10:24) =

main: https://github.com/pytorch/vision/actions/runs/4134781712/jobs/7146348614#step:10:35781

= 1 failed, 29740 passed, 1188 skipped, 9 xfailed, 246 warnings in 3832.74s (1:03:52) =

So that is roughly a 6x speed-up for the tests (4x if you look at the overall workflow, not just the tests).
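
The figures from the two pytest summary lines above can be checked with a few lines of arithmetic. The numbers are copied from those summaries; the dictionary layout and the `total_tests` helper are only for illustration:

```python
# Values copied from the pytest summary lines of the PR run and the main run.
pr_run = {"failed": 1, "passed": 13404, "skipped": 17527, "xfailed": 6, "seconds": 624.39}
main_run = {"failed": 1, "passed": 29740, "skipped": 1188, "xfailed": 9, "seconds": 3832.74}


def total_tests(run: dict) -> int:
    """Sum all outcome counts to get the number of collected tests."""
    return sum(run[key] for key in ("failed", "passed", "skipped", "xfailed"))


speedup = main_run["seconds"] / pr_run["seconds"]
print(f"test speed-up: {speedup:.1f}x")             # test speed-up: 6.1x
print(total_tests(pr_run), total_tests(main_run))   # 30938 30938
```

Both runs report the same total of 30938 tests, so the shorter runtime reflects tests being skipped rather than a change in what was collected.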

@pmeier pmeier merged commit 87ec804 into pytorch:main Feb 9, 2023
@github-actions

github-actions bot commented Feb 9, 2023

Hey @pmeier!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@pmeier pmeier deleted the gpu-ci branch February 9, 2023 15:16
facebook-github-bot pushed a commit that referenced this pull request Mar 28, 2023
Reviewed By: vmoens

Differential Revision: D44416269

fbshipit-source-id: ebffe7b7a447b70b1495cb1a614f7780219abd96