Cudnn conv dgrad algo filtering #14310
@mxnet-label-bot add [pr-awaiting-review]
Finished first step of this PR: 'Do an initial PR submission that includes the problem-exposing test, and with a temporary addition of that test to the tensorrt CI test suite.' CI passed, except for the one newly-introduced test (as expected):
I've now finished the second step of the PR: 'push the fix (which involves filtering out algo 3 from the results of cudnnFind when necessary).' From the CI logs for the tensorrt build (the only one that previously failed, due to its use of cudnn 7.4), now we have:
…a10, cudnn7.4.2)" This reverts commit 1cb743b.
This PR is in a good state to review. I hope you guys like the test-driven development ;-)
This PR has been ready to go for a week. Should I be concerned that there are no reviewers/assignees yet? @eric-haibin-lin @szha The idea is to protect MXNet users from issues present in cudnn versions newer than those used in the standard MXNet builds and CI.
Still looking for reviewers of this fairly small PR. @marcoabreu @larroy ?
I have pinged a few reviewers; you should get a response shortly. Please excuse the delay.
I am sorry for the late review.
I have reviewed this PR, LGTM.
Thanks for your contribution!
I have a question: will the test case trigger algo_exclusion=True?
Yes, the test case was designed to trigger the failure (see earlier post). With the fix in place, the algo_exclusion==True path is activated to ensure the problematic algo is bypassed.
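The exclusion path discussed in this exchange can be sketched roughly as follows. This is a minimal Python illustration of the idea only, not the C++ code in this PR: `AlgoPerf` and `pick_fastest_algo` are hypothetical names, and the list stands in for the results a cudnnFind call would return, assumed here to be sorted fastest first.

```python
from typing import NamedTuple, Optional, Sequence

# Enum value of CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING ("algo 3").
BWD_DATA_ALGO_FFT_TILING = 3


class AlgoPerf(NamedTuple):
    """Hypothetical stand-in for one entry of a cudnnFind result list."""
    algo: int       # algorithm enum value
    time_ms: float  # measured run time
    ok: bool        # True if the algorithm ran successfully


def pick_fastest_algo(results: Sequence[AlgoPerf],
                      algo_exclusion: bool) -> Optional[int]:
    """Return the first usable algo, skipping algo 3 when exclusion is on.

    `results` is assumed to be sorted fastest first.
    """
    for r in results:
        if not r.ok:
            continue  # this algorithm failed during the search
        if algo_exclusion and r.algo == BWD_DATA_ALGO_FFT_TILING:
            continue  # bypass the problematic algorithm
        return r.algo
    return None  # nothing usable; the caller needs a fallback
```

With exclusion enabled, a search whose fastest entry is algo 3 falls through to the next successful entry instead.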
* Add test exposing issue with conv dgrad algo 3 for some cudnn's.
* Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)
* Relax tol of new test.
* Fix for problematic conv dgrad algo 3 for some cuDNNs.
* Add algo exclusion term to cudnnFind result processing.
* Revert "Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)" This reverts commit 1cb743b.
* Trigger CI.
* Add link to cuDNN release notes.
* Trigger CI.
Description
I've learned that for cudnn versions in the range [7.3.1, 7.5.0), the Convolution dgrad algorithm 3 (CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING) may produce incorrect results for some strided convolutions. This would not generally appear in the current CI, which builds against cuda9.1 and cudnn7.1 (that is, a version from before the problem was introduced).
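The version condition above could be expressed along these lines. This is a hedged Python sketch rather than the code in this PR: the function names are hypothetical, the integer encoding follows the cuDNN 7 `CUDNN_VERSION` convention (major*1000 + minor*100 + patch), and treating every non-unit stride as affected is a conservative assumption, since the description only says "some strided convolutions".

```python
from typing import Sequence


def encode_cudnn_version(major: int, minor: int, patch: int) -> int:
    """Encode a version the way cuDNN 7's CUDNN_VERSION macro does."""
    return major * 1000 + minor * 100 + patch


def needs_dgrad_algo3_exclusion(cudnn_version: int,
                                strides: Sequence[int]) -> bool:
    """Return True if dgrad algo 3 should be avoided (hypothetical helper).

    Assumption: any stride != 1 is treated as potentially affected.
    """
    lo = encode_cudnn_version(7, 3, 1)  # inclusive lower bound
    hi = encode_cudnn_version(7, 5, 0)  # exclusive upper bound
    in_affected_range = lo <= cudnn_version < hi
    is_strided = any(s != 1 for s in strides)
    return in_affected_range and is_strided
```

For example, cudnn 7.4.2 encodes to 7402, which falls inside the affected range, while cudnn 7.1.x and 7.5.0 fall outside it.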
I've created a test to expose the problem, as well as a fix. I've noticed that there is a CI build for tensorrt that uses cuda10.0 and cudnn7.4, which should exhibit the problem. Thus, the plan is: