skip CPU tests on GPU GHA jobs #6970

pmeier · 2022-11-22T11:44:21Z

Blocked by #6957. Up to 1a2efbe, I was debugging why no GPU tests were run. Please disregard my commits and comments before that.

To save CI resources, we don't run CPU tests on GPU machines. This behavior is hardcoded to CircleCI:

vision/test/conftest.py

Lines 15 to 24 in 4a310f2

    
    def pytest_collection_modifyitems(items):
        # This hook is called by pytest after it has collected the tests (google its name to check out its doc!)
        # We can ignore some tests as we see fit here, or add marks, such as a skip mark.
        #
        # Typically here, we try to optimize CI time. In particular, the GPU CI instances don't need to run the
        # tests that don't need CUDA, because those tests are extensively tested in the CPU CI instances already.
        # This is true for both CircleCI and the fbcode internal CI.
        # In the fbcode CI, we have an additional constraint: we try to avoid skipping tests. So instead of relying on
        # pytest.mark.skip, in fbcode we literally just remove those tests from the `items` list, and it's as if
        # these tests never existed.

With the recent push to GHA (Nova), we also need this behavior there. The relevant environment variable is called GITHUB_ACTIONS and is set to "true" when running on a GHA runner. This mirrors the CIRCLECI variable that we already handle.
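For readers unfamiliar with the hook, here is a minimal sketch of the selection logic that the comment block above describes. It is not the actual body of `test/conftest.py`: the names `plan_for_test` and `apply_plan`, the `needs_cuda` marker name, and the handling of CUDA tests on machines without CUDA are assumptions made for illustration.

```python
def plan_for_test(needs_cuda, cuda_available, in_oss_ci, in_fbcode=False):
    """Return "run", "skip", or "drop" for a single collected test.

    Hypothetical helper: it only models the policy described in the
    conftest comment, not the real torchvision implementation.
    """
    if needs_cuda and not cuda_available:
        # Assumption: a CUDA test cannot run on a machine without CUDA.
        return "drop" if in_fbcode else "skip"
    if not needs_cuda and cuda_available and (in_oss_ci or in_fbcode):
        # CPU-only test on a GPU CI machine: already covered by the CPU jobs.
        # fbcode avoids skip marks, so the test is removed instead.
        return "drop" if in_fbcode else "skip"
    return "run"


def apply_plan(items, cuda_available, in_oss_ci, in_fbcode=False):
    """Apply the plan to pytest items; meant to be called from the hook."""
    import pytest  # imported lazily so plan_for_test has no dependencies

    kept = []
    for item in items:
        # Assumption: CUDA tests carry a marker named "needs_cuda".
        needs_cuda = item.get_closest_marker("needs_cuda") is not None
        action = plan_for_test(needs_cuda, cuda_available, in_oss_ci, in_fbcode)
        if action == "drop":
            continue
        if action == "skip":
            item.add_marker(pytest.mark.skip(reason="not selected for this CI machine"))
        kept.append(item)
    # pytest expects the list to be modified in place.
    items[:] = kept
```

Inside `pytest_collection_modifyitems(items)`, one would call `apply_plan(items, ...)` with the CUDA availability reported by PyTorch and the CI flags read from the environment.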

cc @seemethere

pmeier · 2022-11-22T13:31:23Z

Installation of torchvision on GHA prints:

No CUDA runtime is found, using CUDA_HOME='/work/ci_env'

pmeier · 2022-11-22T13:56:15Z

Looking at the collected env, it seems there is an issue with the CUDA setup.

Collecting environment information...
PyTorch version: 1.14.0.dev20221122
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.15 (default, Nov  4 2022, 20:59:55)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.14.252-195.483.amzn2.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.3.2
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

The important bits:

CUDA used to build PyTorch: 11.6
[...]
Is CUDA available: False
CUDA runtime version: 11.6.124

It seems we install the right versions, but CUDA is not available for some reason. Maybe the driver is not set up properly?
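The check I did by eye can be sketched as a small script. This is only an illustration: `parse_collect_env` and `built_with_cuda_but_unavailable` are hypothetical helpers, and the parsing assumes the `key: value` layout shown in the output above.

```python
def parse_collect_env(text):
    """Turn `key: value` lines of the collected env output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def built_with_cuda_but_unavailable(info):
    """True if PyTorch was built against CUDA but reports it as unavailable."""
    built = info.get("CUDA used to build PyTorch", "N/A")
    available = info.get("Is CUDA available", "False")
    return built not in ("N/A", "None") and available != "True"


report = """\
CUDA used to build PyTorch: 11.6
Is CUDA available: False
CUDA runtime version: 11.6.124
"""
print(built_with_cuda_but_unavailable(parse_collect_env(report)))
```

A `True` result only says that the build and the runtime disagree; it does not say why, so the driver still has to be checked separately.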

pmeier · 2022-11-22T14:12:32Z

Ok, running

modinfo nvidia || true
nvidia-smi || true
exit 0

gives

modinfo: ERROR: Module alias nvidia not found.
/exec: line 8: nvidia-smi: command not found

This has nothing to do with torchvision. @osalpekar can you have a look?

pmeier · 2022-11-22T14:29:50Z

test/common_utils.py

@@ -13,11 +13,11 @@
 import __main__  # noqa: 401


-IN_CIRCLE_CI = os.getenv("CIRCLECI", False) == "true"
+IN_OSS_CI = any(os.getenv(var) == "true" for var in ["CIRCLECI", "GITHUB_ACTIONS"])


One common convention for CI providers is to set the CI=true environment variable. We could use that here as well, but I have no idea if that interferes with Meta internal systems. Thus, to be safe, we are explicit about the CI providers here. Given that we probably don't change them that often, I think this should be fine.
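To make the difference between the two options concrete, here is a small sketch of the explicit check. The function `in_oss_ci` and its `environ` parameter are hypothetical and exist only so the behavior can be tried without touching the real environment; the PR itself uses a module-level constant.

```python
import os

# The CI providers we are explicit about; a generic CI=true is ignored on purpose.
OSS_CI_ENV_VARS = ("CIRCLECI", "GITHUB_ACTIONS")


def in_oss_ci(environ=None):
    """Return True if any known OSS CI provider variable is set to "true"."""
    environ = os.environ if environ is None else environ
    return any(environ.get(var) == "true" for var in OSS_CI_ENV_VARS)


print(in_oss_ci({"GITHUB_ACTIONS": "true"}))  # GHA runner
print(in_oss_ci({"CI": "true"}))  # some other system that only sets CI
```

With this approach, an environment that only sets `CI=true` is not treated as OSS CI, which is the conservative behavior argued for above.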

NicolasHug

Thank you @pmeier ! LGTM as long as it does what we want :)

pmeier · 2023-02-09T15:13:46Z

PR: https://github.com/pytorch/vision/actions/runs/4135402074/jobs/7147802842#step:10:35257

= 1 failed, 13404 passed, 17527 skipped, 6 xfailed, 31 warnings in 624.39s (0:10:24) =

main: https://github.com/pytorch/vision/actions/runs/4134781712/jobs/7146348614#step:10:35781

= 1 failed, 29740 passed, 1188 skipped, 9 xfailed, 246 warnings in 3832.74s (1:03:52) =

So that is roughly a 6x speed-up for the tests (4x if we look at the overall workflow rather than just the tests).
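The test-only figure can be recomputed from the two pytest summary lines above. A minimal sketch, with `duration_seconds` as a hypothetical helper; the overall workflow timings are not part of the summaries, so the 4x figure is not reproduced here.

```python
import re


def duration_seconds(summary):
    """Extract the wall-clock seconds from a pytest summary line."""
    match = re.search(r"in ([0-9.]+)s", summary)
    if match is None:
        raise ValueError(f"no duration found in: {summary!r}")
    return float(match.group(1))


pr = "= 1 failed, 13404 passed, 17527 skipped, 6 xfailed, 31 warnings in 624.39s (0:10:24) ="
main = "= 1 failed, 29740 passed, 1188 skipped, 9 xfailed, 246 warnings in 3832.74s (1:03:52) ="

speedup = duration_seconds(main) / duration_seconds(pr)
print(f"{speedup:.1f}x")  # prints 6.1x
```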

github-actions · 2023-02-09T15:16:24Z

Hey @pmeier!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

enable CUDA tests on GHA

596dc43

facebook-github-bot added the cla signed label Nov 22, 2022

pmeier added 4 commits November 22, 2022 12:44

Merge branch 'main' into gpu-ci

19195f4

debug env vars

5199577

add debug tests

336d875

more debug output

2126c0d

check if CUDA is available

56bc60a

pmeier added 2 commits November 22, 2022 14:58

try access nvidia driver

0a520ca

try modinfo

350aa70

pmeier mentioned this pull request Nov 22, 2022

Fix GHA Linux GPU Job not running tests on CUDA #6957

Closed

revert debug

cc9595a

pmeier changed the title ~~enable CUDA tests on GHA~~ skip CPU tests on GPU GHA jobs Nov 22, 2022

fix revert

1a2efbe

pmeier added 5 commits November 22, 2022 15:30

lint

5d6a57b

Merge branch 'main' into gpu-ci

7d5692f

set smoke test for CUDA

8d35989

Merge branch 'main' into gpu-ci

a2d5b48

cleanup

df22a67

pmeier marked this pull request as ready for review February 9, 2023 14:43

pmeier requested review from osalpekar and NicolasHug and removed request for osalpekar February 9, 2023 14:43

pmeier added module: tests module: ci labels Feb 9, 2023

revert unrelated

9bc64e7

NicolasHug approved these changes Feb 9, 2023

pmeier merged commit 87ec804 into pytorch:main Feb 9, 2023

pmeier deleted the gpu-ci branch February 9, 2023 15:16

facebook-github-bot pushed a commit that referenced this pull request Mar 28, 2023

[fbsync] skip CPU tests on GPU GHA jobs (#6970)

b20ef9a

Reviewed By: vmoens Differential Revision: D44416269 fbshipit-source-id: ebffe7b7a447b70b1495cb1a614f7780219abd96

pmeier mentioned this pull request Apr 6, 2023

[ROCm] Use CI env variable instead of Githubactions and CircleCI. #7501

Closed
