
Restore "pytest.mark.gpu" for RELAX tests #16741

Merged: 5 commits into apache:main on Apr 23, 2024

Conversation

@apeskov (Contributor) commented Mar 19, 2024

Unfortunately, the Relax GPU tests are currently excluded from CI. The testing script uses the flag "-m gpu" to select GPU tests, but the tests themselves carry no such mark.
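For illustration (a minimal sketch, not code from this PR): pytest's -m option selects tests by mark expression, so a test without the gpu mark is silently excluded from a `pytest -m gpu` run.

    import pytest

    @pytest.mark.gpu
    def test_with_mark():
        ...  # collected by `pytest -m gpu`

    def test_without_mark():
        ...  # deselected by `pytest -m gpu`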

@Lunderberg (Contributor) left a comment:

Thank you for finding and fixing this issue. The tvm.testing.requires_* marks all provide the pytest.mark.gpu flag, but they aren't yet used in all cases.


- pytestmark = [cudnn_enabled]
+ pytestmark = [*tvm.testing.requires_cudnn.marks()]
@Lunderberg (Contributor) commented on this change:

Nitpick: The tvm.testing.requires_* marks can be used as marks directly, rather than needing to be explicitly expanded with foo.marks(). In addition, pytestmark can be a single marker rather than a list. Between these two, this can be cleaned up as:

pytestmark = tvm.testing.requires_cudnn

@apeskov (Contributor, Author) replied:

I'm not sure that's possible. The global object pytestmark must have type Mark or List[Mark], but our TVM markers have type tvm.testing.utils.Feature. So a direct assignment like the one you suggest leads to a type-check error:

TypeError: got <tvm.testing.utils.Feature object at 0x7ff96f507340> instead of Mark

Moreover, the idea of this change was to utilise the hierarchical structure of TVM testing features. The pytest.mark.gpu mark is the root of this hierarchy, so I have to extract the markers from the current feature and all of its parents, which is exactly what the marks() method does.

But you are right that this line can be slightly simplified by removing the explicit list expansion:

pytestmark = tvm.testing.requires_cudnn.marks()
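For intuition, a much-simplified sketch of the hierarchy described above (illustrative only; the real tvm.testing.utils.Feature implementation may differ):

    import pytest

    class Feature:
        # Simplified stand-in for tvm.testing.utils.Feature.
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent

        def marks(self):
            # This feature's mark plus the marks of all ancestors, so that
            # cudnn.marks() also yields the root pytest.mark.gpu.
            own = [getattr(pytest.mark, self.name)]
            return own + (self.parent.marks() if self.parent else [])

    gpu = Feature("gpu")
    cudnn = Feature("cudnn", parent=gpu)

    pytestmark = cudnn.marks()  # [pytest.mark.cudnn, pytest.mark.gpu]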

@Lunderberg (Contributor) replied:

Good point. I thought I had made Feature a subclass of Mark, but I guess I never did. In that case, I like your simplification.

@@ -36,12 +36,8 @@ def reset_seed():

has_cudnn = tvm.get_global_func("relax.ext.cudnn", True)
@Lunderberg (Contributor) commented on this hunk:

Do we still need has_cudnn now that it isn't used in pytest.mark.skipif?
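For context, the pattern being removed looked roughly like this (an approximate reconstruction from the diff above, not the exact original code):

    import pytest
    import tvm

    # Probe for the packed function; the second argument allows it to be missing.
    has_cudnn = tvm.get_global_func("relax.ext.cudnn", True)

    # Old approach: a module-level skipif mark built from the probe.
    cudnn_enabled = pytest.mark.skipif(not has_cudnn, reason="cuDNN backend not available")
    pytestmark = [cudnn_enabled]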

@apeskov (Contributor, Author) replied:

You're absolutely right. I will remove it.

Missing pytest.mark.gpu prevents tests from launching in CI.

Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@apeskov force-pushed the ap/mark-tests-with-gpu branch from 695d95f to 91e8412 on April 18, 2024 at 11:27
@apeskov (Contributor, Author) commented Apr 18, 2024:

Hi @Lunderberg, thank you for reviewing this PR.

> The tvm.testing.requires_* marks all provide the pytest.mark.gpu flag, but they aren't yet used in all cases.

Yes, I know about this issue, and I absolutely agree that this PR contains logically incorrect test marks. Some tests are marked requires_XXX even though they don't in fact require that feature. But this incorrectness was not introduced by me. Many of the tests in the files I touched verify pattern-matching functionality. Pattern matching is pure Python code and doesn't require any CMake-enabled features such as a specific code generator or runtime support, yet these tests still carry markers like "requires_tensorrt_codegen". So I replaced one incorrect marker with another.

The reason for that was the need to enable the tests in CI. I saw four ways to do that:

  1. Make all test marks correct. Add one more Jenkins job that compiles with CUDA/cuBLAS/CUTLASS/TensorRT enabled on a virtual machine without a GPU, and launch there only the GPU-specific unit tests that don't require a real GPU. The main question is how to filter the tests that should be launched there.

  2. Same as the first, but instead of creating a new Jenkins job, integrate all of this into "cpu/pr-head". The build script would have to be changed to add all the CUDA dependencies. This introduces a semantic inconsistency: a CI job with "cpu" in its name would build GPU-specific code.

  3. Same as the second, but use "gpu/pr-head". It has the same problem as the first way: the launch script would have to be changed to add a filter with every specific marker, like -m "gpu or cudagraph or tensorcore or cutlass or ***". I doubt it's possible to keep such a command line correct; newly added tests and tvm.testing.Features would get mixed up in this list.

  4. Just keep the situation as it is, and use the "gpu" marker as a root marker for all the kinds of tests we would like to launch in the "gpu/pr-head" CI job.

I selected the fourth way as the easiest, and I realise it's an imperfect solution. If you have any other ideas on how to enable all the tests in CI while keeping the testing code correct, please point them out.
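For illustration, a hedged sketch of how the fourth option works in practice (the module name is hypothetical): a test file only needs marks whose expansion includes the root gpu mark for the CI filter -m gpu to pick it up.

    # test_codegen_cudnn.py (hypothetical module name)
    import tvm.testing

    # Expands to the cudnn mark plus all parent marks, including the root
    # pytest.mark.gpu, so `pytest -m gpu` selects every test in this module.
    pytestmark = tvm.testing.requires_cudnn.marks()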

fix
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@Lunderberg (Contributor) commented:

Thank you for the explanation of the different options. I agree that changing all markers in all locations is far too heavy a lift for any one PR, and should be avoided.

For a long-term solution, I think it would make sense to have "cpu" and "gpu" indicate which environment a test should run in, not which functionality is being tested. That way, we could distinguish between a test that compiles TensorRT code (doesn't require a GPU) and a test that executes it (does require a GPU). The former would run in the cpu/pr-head step, even though it validates functionality required for GPUs, and the latter would run in the gpu/pr-head step. However, this would require adding the compile-time support to the CI's CPU container, which is a larger step overall.
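A hedged sketch of that split (the marker semantics here are a hypothetical convention, not something TVM currently defines):

    import pytest

    @pytest.mark.cpu  # hypothetical: needs TensorRT compile-time support, no physical GPU
    def test_tensorrt_compile():
        ...

    @pytest.mark.gpu  # hypothetical: executes the compiled module, needs a physical GPU
    def test_tensorrt_execute():
        ...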

@Lunderberg (Contributor) commented:

(Aside from all my pontificating, I think this PR is ready to merge once the CI passes.)

I took a quick glance at the failing cases, and it looks like this line should be changed from def main(x: R.Tensor): to def main(x: R.Tensor([4], "int64")):.

  1. The test test_no_op_for_call_to_tir was added and passed in local testing. Due to the markers, it wasn't actually run in CI.
  2. A recent PR added well-formedness checking in TVMScript.
  3. Another recent PR added StructInfo for PrimFuncs, allowing the discrepancy between R.Tensor and T.Buffer(T.int64(4), "int64") to be identified.
  4. This PR fixed the marks, allowing the test to run in CI.

So the test worked when it was added and should have been updated in subsequent PRs, but due to the missing pytest marks, the breakage was never noticed until this PR.
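For reference, a hedged sketch of the suggested signature fix in TVMScript (the function body is assumed for illustration, not quoted from the PR):

    from tvm.script import relax as R

    @R.function
    def main(x: R.Tensor([4], "int64")) -> R.Tensor([4], "int64"):
        # Previously annotated as just `x: R.Tensor`; the explicit shape and
        # dtype now match the callee's T.Buffer(T.int64(4), "int64").
        return x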

Alexander Peskov added 3 commits April 19, 2024 14:58
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
Signed-off-by: Alexander Peskov <alexander.peskov@deelvin.com>
@ibsidorenko (Contributor) commented:

Hi, team!
Are we ready to merge this PR? It would be great to restore the Relax GPU tests in CI!

@apeskov (Contributor, Author) commented Apr 22, 2024:

@Lunderberg, could you please merge this one if you have no objections?

@Lunderberg (Contributor) replied:

Looks like it is still making its way through CI, but I can merge it after the CI passes.

@echuraev merged commit 11f2253 into apache:main on Apr 23, 2024. 18 checks passed.