
Restore "pytest.mark.gpu" for RELAX tests #16741

Merged: 5 commits into apache:main on Apr 23, 2024

Conversation

@apeskov (Contributor) commented Mar 19, 2024

Unfortunately, the Relax GPU tests are currently excluded from CI. The testing script uses the flag "-m gpu" to select GPU tests, but the tests themselves carry no such mark.
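For illustration (a minimal sketch, not code from this PR): pytest's -m option selects tests by mark expression, so a test without the gpu mark is silently excluded from a `pytest -m gpu` run.

    import pytest

    @pytest.mark.gpu
    def test_with_mark():
        ...  # collected by `pytest -m gpu`

    def test_without_mark():
        ...  # deselected by `pytest -m gpu`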

@Lunderberg (Contributor) left a comment:

Thank you for finding and fixing this issue. The tvm.testing.requires_* marks all provide the pytest.mark.gpu flag, but they aren't yet used in all cases.


- pytestmark = [cudnn_enabled]
+ pytestmark = [*tvm.testing.requires_cudnn.marks()]
@Lunderberg (Contributor) commented on this change:

Nitpick: The tvm.testing.requires_* marks can be used as marks directly, rather than needing to be explicitly expanded with foo.marks(). In addition, pytestmark can be a single marker rather than a list. Between these two, this can be cleaned up as:

pytestmark = tvm.testing.requires_cudnn

@apeskov (Contributor, Author) replied:

I'm not sure that's possible. The global object pytestmark must have type Mark or List[Mark], but our TVM markers have type tvm.testing.utils.Feature. So a direct assignment like the one you suggest leads to a type-check error:

TypeError: got <tvm.testing.utils.Feature object at 0x7ff96f507340> instead of Mark

Moreover, the idea of this change was to utilise the hierarchical structure of TVM testing features. The pytest.mark.gpu mark is the root of this hierarchy, so I have to extract the markers from the current feature and all of its parents, which is exactly what the marks() method does.

But you are right that this line can be slightly simplified by removing the explicit list expansion:

pytestmark = tvm.testing.requires_cudnn.marks()
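For intuition, a much-simplified sketch of the hierarchy described above (illustrative only; the real tvm.testing.utils.Feature implementation may differ):

    import pytest

    class Feature:
        # Simplified stand-in for tvm.testing.utils.Feature.
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent

        def marks(self):
            # This feature's mark plus the marks of all ancestors, so that
            # cudnn.marks() also yields the root pytest.mark.gpu.
            own = [getattr(pytest.mark, self.name)]
            return own + (self.parent.marks() if self.parent else [])

    gpu = Feature("gpu")
    cudnn = Feature("cudnn", parent=gpu)

    pytestmark = cudnn.marks()  # [pytest.mark.cudnn, pytest.mark.gpu]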

@Lunderberg (Contributor) replied:

Good point. I thought I had made Feature a subclass of Mark, but I guess I never did. In that case, I like your simplification.

@@ -36,12 +36,8 @@ def reset_seed():

has_cudnn = tvm.get_global_func("relax.ext.cudnn", True)
@Lunderberg (Contributor) commented on this hunk:

Do we still need has_cudnn now that it isn't used in pytest.mark.skipif?
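For context, the pattern being removed looked roughly like this (an approximate reconstruction from the diff above, not the exact original code):

    import pytest
    import tvm

    # Probe for the packed function; the second argument allows it to be missing.
    has_cudnn = tvm.get_global_func("relax.ext.cudnn", True)

    # Old approach: a module-level skipif mark built from the probe.
    cudnn_enabled = pytest.mark.skipif(not has_cudnn, reason="cuDNN backend not available")
    pytestmark = [cudnn_enabled]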

@apeskov (Contributor, Author) replied:

You're absolutely right. I will remove it.

Missing pytest.mark.gpu prevents tests from launching in CI.

Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@apeskov force-pushed the ap/mark-tests-with-gpu branch from 695d95f to 91e8412 on April 18, 2024 at 11:27
@apeskov (Contributor, Author) commented Apr 18, 2024:

Hi @Lunderberg, thank you for reviewing this PR.

> The tvm.testing.requires_* marks all provide the pytest.mark.gpu flag, but they aren't yet used in all cases.

Yes, I know about this issue, and I absolutely agree that this PR contains logically incorrect test marks. Some tests are marked requires_XXX even though they don't in fact require that feature. But this incorrectness was not introduced by me. Many of the tests in the files I touched verify pattern-matching functionality. Pattern matching is pure Python code and doesn't require any CMake-enabled features such as a specific code generator or runtime support, yet these tests still carry markers like "requires_tensorrt_codegen". So I replaced one incorrect marker with another.

The reason for that was the need to enable the tests in CI. I saw four ways to do that:

  1. Make all test marks correct. Add one more Jenkins job that compiles with CUDA/cuBLAS/CUTLASS/TensorRT enabled on a virtual machine without a GPU, and launch there only the GPU-specific unit tests that don't require a real GPU. The main question is how to filter the tests that should be launched there.

  2. Same as the first, but instead of creating a new Jenkins job, integrate all of this into "cpu/pr-head". The build script would have to be changed to add all the CUDA dependencies. This introduces a semantic inconsistency: a CI job with "cpu" in its name would build GPU-specific code.

  3. Same as the second, but use "gpu/pr-head". It has the same problem as the first way: the launch script would have to be changed to add a filter with every specific marker, like -m "gpu or cudagraph or tensorcore or cutlass or ***". I doubt it's possible to keep such a command line correct; newly added tests and tvm.testing.Features would get mixed up in this list.

  4. Just keep the situation as it is, and use the "gpu" marker as a root marker for all the kinds of tests we would like to launch in the "gpu/pr-head" CI job.

I selected the fourth way as the easiest, and I realise it's an imperfect solution. If you have any other ideas on how to enable all the tests in CI while keeping the testing code correct, please point them out.
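For illustration, a hedged sketch of how the fourth option works in practice (the module name is hypothetical): a test file only needs marks whose expansion includes the root gpu mark for the CI filter -m gpu to pick it up.

    # test_codegen_cudnn.py (hypothetical module name)
    import tvm.testing

    # Expands to the cudnn mark plus all parent marks, including the root
    # pytest.mark.gpu, so `pytest -m gpu` selects every test in this module.
    pytestmark = tvm.testing.requires_cudnn.marks()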

fix
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@Lunderberg (Contributor) commented:

Thank you for the explanation of the different options. I agree that changing all markers in all locations is far too heavy a lift for any one PR, and should be avoided.

For a long-term solution, I think it would make sense to have "cpu" and "gpu" indicate which environment a test should run in, not which functionality is being tested. That way, we could distinguish between a test that compiles TensorRT code (doesn't require a GPU) and a test that executes it (does require a GPU). The former would run in the cpu/pr-head step, even though it validates functionality required for GPUs, and the latter would run in the gpu/pr-head step. However, this would require adding the compile-time support to the CI's CPU container, which is a larger step overall.
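A hedged sketch of that split (the marker semantics here are a hypothetical convention, not something TVM currently defines):

    import pytest

    @pytest.mark.cpu  # hypothetical: needs TensorRT compile-time support, no physical GPU
    def test_tensorrt_compile():
        ...

    @pytest.mark.gpu  # hypothetical: executes the compiled module, needs a physical GPU
    def test_tensorrt_execute():
        ...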

@Lunderberg (Contributor) commented:

(Aside from all my pontificating, I think this PR is ready to merge once the CI passes.)

I took a quick glance at the failing cases, and it looks like this line should be changed from def main(x: R.Tensor): to def main(x: R.Tensor([4], "int64")):.

  1. The test test_no_op_for_call_to_tir was added and passed in local testing. Due to the markers, it wasn't actually run in CI.
  2. A recent PR added well-formedness checking in TVMScript.
  3. Another recent PR added StructInfo for PrimFuncs, allowing the discrepancy between R.Tensor and T.Buffer(T.int64(4), "int64") to be identified.
  4. This PR fixed the marks, allowing the test to run in CI.

So the test worked when it was added and should have been updated in subsequent PRs, but due to the missing pytest marks, the breakage was never noticed until this PR.
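For reference, a hedged sketch of the suggested signature fix in TVMScript (the function body is assumed for illustration, not quoted from the PR):

    from tvm.script import relax as R

    @R.function
    def main(x: R.Tensor([4], "int64")) -> R.Tensor([4], "int64"):
        # Previously annotated as just `x: R.Tensor`; the explicit shape and
        # dtype now match the callee's T.Buffer(T.int64(4), "int64").
        return x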

Alexander Peskov added 3 commits April 19, 2024 14:58
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@ibsidorenko (Contributor) commented:

Hi, team!
Are we ready to merge this PR? It would be great to restore the Relax GPU tests in CI!

@apeskov (Contributor, Author) commented Apr 22, 2024:

@Lunderberg, could you please merge this one if you have no objections?

@Lunderberg (Contributor) replied:

Looks like it is still making its way through CI, but I can merge it after the CI passes.

@echuraev merged commit 11f2253 into apache:main on Apr 23, 2024. 18 checks passed.