Conversation
Hey @DickJC123 , Thanks for submitting the PR
CI supported jobs: [centos-gpu, windows-gpu, unix-cpu, unix-gpu, miscellaneous, sanity, edge, website, centos-cpu, clang, windows-cpu]
@josephevans Could you take a look at what I've done so far, and perhaps troubleshoot why I'm seeing the error
I've encountered a test failure of test_countsketch here: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-20876/16/pipeline I see where threads might write outside the output tensor bounds, so I'm pushing a fix to this PR.
This reverts commit ae17b1f.
LGTM, Thanks!
Gentle ping for additional reviews and an eventual merge. This PR contains a few unrelated CI fixes that could help PR development generally.
To help this PR pass CI, it includes a fix to test_countsketch, resolving #10988.
Description
g5 instances are now available to the MXNet CI for testing on Ampere A10G GPUs (courtesy of PR apache/mxnet-ci#43 from @josephevans). Taking advantage of that, this PR turns on g5 instance use by adding a "Python3: Ampere-GPU" job running on the g5 to the unix-gpu pipeline. Since Ampere-architecture GPUs use reduced-mantissa-width TF32 calculations by default on float32 data, this required some minor test tolerance adjustments.

A test_report_compute_capabilities test was added that outputs the GPU compute capability to the log. This should be helpful in debugging GPU test failures generally, and here confirms proper enablement of the arch-86 A10G GPU.

Side notes:
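To illustrate why TF32 forces looser test tolerances, here is a rough sketch (plain NumPy, not MXNet code) that rounds float32 matmul inputs to TF32's 10-bit mantissa, roughly as Ampere tensor cores do by default, and measures the resulting error against a full float32 reference:

```python
import numpy as np

def quantize_to_tf32(x):
    """Round float32 values to TF32 precision (10 explicit mantissa bits).

    TF32 keeps float32's 8-bit exponent but only 10 of its 23 mantissa
    bits; rounding away the low 13 bits (round-to-nearest via the +2**12
    offset) mimics the input rounding Ampere tensor cores apply.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = (bits + np.uint32(1 << 12)) & np.uint32(0xFFFFE000)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b                                        # full float32 matmul
tf32 = quantize_to_tf32(a) @ quantize_to_tf32(b)   # inputs rounded as TF32 would

# Relative error lands well above float32's ~1e-7 epsilon, so rtol/atol
# values tuned for strict float32 math start failing on Ampere GPUs.
rel = np.linalg.norm(tf32 - ref) / np.linalg.norm(ref)
print(f"relative error: {rel:.2e}")
```

This is only a software emulation of the input rounding; the real accumulate still happens in float32 on the hardware, but it conveys the order of magnitude of tolerance relaxation the affected tests needed.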
Unrelated MXNet numpy issue #20898 ("mxnet.numpy and numpy differ regarding binary op result type with broadcast 0-dim array input") was discovered during development of this PR, and a temporary work-around is included.

I also encountered and fixed issues in how the test_rnn_layers_fp{16,32} tests are invoked. Before this PR's fix, ./tests/python/unittest/test_gluon_rnn.py::test_rnn_layers_fp16 was run on a GPU, even though the "unittest" path is supposed to be for cpu-only testing. Also, ./tests/python/gpu/test_operator_gpu.py::test_rnn_layers_fp32, invoked via import, was run on a cpu, even though the "gpu" path was used.
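The 0-dim broadcast promotion behind issue #20898 can be sketched in plain NumPy (a hedged illustration; which dtype wins depends on the NumPy version's promotion rules, and per the issue mxnet.numpy and numpy disagree on it):

```python
import numpy as np

a = np.ones(3, dtype=np.float32)  # 1-d float32 operand
b = np.array(2.0)                 # 0-dim float64 operand

out = a + b  # b broadcasts across a
# Older NumPy applies value-based casting to the 0-dim operand, yielding
# float32; NumPy 2.x (NEP 50) promotes strictly by dtype, yielding float64.
print(out.dtype, out.shape)
```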
After encountering an error in test_countsketch, I supplied a fix for an out-of-bounds write in the backward kernel. The operator is coded in the legacy operator style and should be launching its kernels into the context's stream. Instead, it currently launches kernels into the default stream and relies on cudaDeviceSynchronize calls. I stopped short of making those additional changes, feeling they should go in a separate PR filed based on user interest.
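The out-of-bounds hazard and its guard can be sketched as a NumPy pseudo-kernel (hypothetical names and a simplified count-sketch backward; the real kernel is CUDA). A GPU launch grid is rounded up to a whole number of thread blocks, so threads past the tensor's end must be masked off before they write:

```python
import numpy as np

def count_sketch_backward(grad_out, h, s):
    """Emulate a per-thread count-sketch backward kernel.

    Forward (simplified): out[h[i]] += s[i] * x[i], so the backward pass
    is grad_x[i] = s[i] * grad_out[h[i]]. The explicit bounds check below
    mirrors the `if (tid < n)` guard a CUDA kernel needs when its launch
    grid is rounded up past the tensor size.
    """
    n = h.shape[0]
    n_threads = ((n + 255) // 256) * 256  # grid rounded up to 256-thread blocks
    grad_x = np.zeros(n, dtype=grad_out.dtype)
    for tid in range(n_threads):  # one loop iteration per "thread"
        if tid >= n:              # the out-of-bounds guard; without it the
            continue              # extra threads would write past grad_x
        grad_x[tid] = s[tid] * grad_out[h[tid]]
    return grad_x
```

Without the guard, the trailing threads of the last block scribble past the output tensor, which is the class of failure observed in the CI run above.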
Checklist
Essentials
Changes
Comments