Conversation
Hey @DickJC123 , Thanks for submitting the PR
CI supported jobs: [centos-gpu, windows-gpu, unix-cpu, unix-gpu, miscellaneous, sanity, edge, website, centos-cpu, clang, windows-cpu]
@josephevans Could you take a look at what I've done so far, and perhaps troubleshoot why I'm seeing the error
I've encountered a test failure of test_countsketch here: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-20876/16/pipeline I see where threads might write outside the output tensor bounds, so I'm pushing a fix to this PR.
This reverts commit ae17b1f.
LGTM, Thanks!
Gentle ping for additional reviews and an eventual merge. This PR contains a few unrelated CI fixes that could help PR development generally.
To help this PR pass CI, it includes a fix to test_countsketch, resolving #10988.
Description
g5 instances are now available to the MXNet CI for testing on Ampere A10G GPUs (courtesy of PR apache/mxnet-ci#43 from @josephevans). Taking advantage of that, this PR turns on g5 instance use by adding a "Python3: Ampere-GPU" job running on the g5 to the unix-gpu pipeline. Since Ampere-architecture GPUs use reduced-mantissa-width TF32 calculations by default on float32 data, this required some minor test tolerance adjustments.

A test_report_compute_capabilities test was added that outputs the GPU compute capability to the log. This should be helpful in debugging GPU test failures generally, and here confirms proper enablement of the arch-86 A10G GPU.

Side notes:
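To illustrate why TF32 forces looser test tolerances, here is a rough sketch (plain NumPy, not MXNet code) that rounds float32 matmul inputs to TF32's 10-bit mantissa, roughly as Ampere tensor cores do by default, and measures the resulting error against a full float32 reference:

```python
import numpy as np

def quantize_to_tf32(x):
    """Round float32 values to TF32 precision (10 explicit mantissa bits).

    TF32 keeps float32's 8-bit exponent but only 10 of its 23 mantissa
    bits; rounding away the low 13 bits (round-to-nearest via the +2**12
    offset) mimics the input rounding Ampere tensor cores apply.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = (bits + np.uint32(1 << 12)) & np.uint32(0xFFFFE000)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b                                        # full float32 matmul
tf32 = quantize_to_tf32(a) @ quantize_to_tf32(b)   # inputs rounded as TF32 would

# Relative error lands well above float32's ~1e-7 epsilon, so rtol/atol
# values tuned for strict float32 math start failing on Ampere GPUs.
rel = np.linalg.norm(tf32 - ref) / np.linalg.norm(ref)
print(f"relative error: {rel:.2e}")
```

This is only a software emulation of the input rounding; the real accumulate still happens in float32 on the hardware, but it conveys the order of magnitude of tolerance relaxation the affected tests needed.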
Unrelated MXNet numpy issue #20898 ("mxnet.numpy and numpy differ regarding binary op result type with broadcast 0-dim array input") was discovered during development of this PR, and a temporary work-around is included.

I also encountered and fixed issues in how the test_rnn_layers_fp{16,32} tests are invoked. Before this PR's fix, ./tests/python/unittest/test_gluon_rnn.py::test_rnn_layers_fp16 was run on a GPU, even though the "unittest" path is supposed to be for cpu-only testing. Also, ./tests/python/gpu/test_operator_gpu.py::test_rnn_layers_fp32, invoked via import, was run on a cpu, even though the "gpu" path was used.
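The 0-dim broadcast promotion behind issue #20898 can be sketched in plain NumPy (a hedged illustration; which dtype wins depends on the NumPy version's promotion rules, and per the issue mxnet.numpy and numpy disagree on it):

```python
import numpy as np

a = np.ones(3, dtype=np.float32)  # 1-d float32 operand
b = np.array(2.0)                 # 0-dim float64 operand

out = a + b  # b broadcasts across a
# Older NumPy applies value-based casting to the 0-dim operand, yielding
# float32; NumPy 2.x (NEP 50) promotes strictly by dtype, yielding float64.
print(out.dtype, out.shape)
```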
After encountering an error in test_countsketch, I supplied a fix for an out-of-bounds write in the backward kernel. The operator is coded in the legacy operator style and should be launching its kernels into the context's stream. Instead, it currently launches kernels into the default stream and relies on cudaDeviceSynchronize calls. I stopped short of making those additional changes, feeling they should go in a separate PR filed based on user interest.
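The out-of-bounds hazard and its guard can be sketched as a NumPy pseudo-kernel (hypothetical names and a simplified count-sketch backward; the real kernel is CUDA). A GPU launch grid is rounded up to a whole number of thread blocks, so threads past the tensor's end must be masked off before they write:

```python
import numpy as np

def count_sketch_backward(grad_out, h, s):
    """Emulate a per-thread count-sketch backward kernel.

    Forward (simplified): out[h[i]] += s[i] * x[i], so the backward pass
    is grad_x[i] = s[i] * grad_out[h[i]]. The explicit bounds check below
    mirrors the `if (tid < n)` guard a CUDA kernel needs when its launch
    grid is rounded up past the tensor size.
    """
    n = h.shape[0]
    n_threads = ((n + 255) // 256) * 256  # grid rounded up to 256-thread blocks
    grad_x = np.zeros(n, dtype=grad_out.dtype)
    for tid in range(n_threads):  # one loop iteration per "thread"
        if tid >= n:              # the out-of-bounds guard; without it the
            continue              # extra threads would write past grad_x
        grad_x[tid] = s[tid] * grad_out[h[tid]]
    return grad_x
```

Without the guard, the trailing threads of the last block scribble past the output tensor, which is the class of failure observed in the CI run above.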
Checklist
Essentials
Changes
Comments