Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tuning] rocm 4.4 update for develop #1133

Merged
merged 7 commits into from
Sep 18, 2021
Merged

Conversation

cderb
Copy link
Contributor

@cderb cderb commented Aug 31, 2021

No description provided.

@cderb cderb added the tuning label Aug 31, 2021
@cderb cderb requested review from JehandadKhan and atamazov August 31, 2021 17:59
@codecov

This comment has been minimized.

@junliume
Copy link
Contributor

@cderb CI has some interesting issues:

[2021-08-31T19:41:23.263Z] FAILED: pdb.FindRecord(ctx): /var/jenkins/workspace/MLLibs_MIOpen_tuning-rocm-4.4/test/embed_sqlite.cpp: 81
[2021-08-31T19:41:23.263Z]
[2021-08-31T19:41:23.263Z] warning: core file may not match specified executable file.

The following tests FAILED:

[2021-08-31T19:27:42.458Z] 18 - test_embed_sqlite (Failed)
[2021-08-31T19:27:42.458Z] 58 - test_conv_embed_db (Failed)

@atamazov
Copy link
Contributor

atamazov commented Sep 1, 2021

I'll run performance tests as soon as CI passed.

@junliume
Copy link
Contributor

junliume commented Sep 5, 2021

I'll run performance tests as soon as CI passed.

CI issues are reproducible so we may need to fix them first @cderb @JehandadKhan

@atamazov
Copy link
Contributor

test_conv_embed_db is consistently failing.

@atamazov
Copy link
Contributor

I will merge recent develop and restart CI.

@atamazov
Copy link
Contributor

atamazov commented Sep 16, 2021

CI run #5 (with fix test_conv_embed_db) still shows test_conv_embed_db failure.

CI run #8 failed with filesystem::recursive_directory_iterator error during "Fp32 OpenCL gfx90a". Maybe this is related to missing gfx90a databases for OpenCL backend?

Restarted.

@atamazov
Copy link
Contributor

It seems like the filesystem::recursive_directory_iterator problem first happened a while ago, see #701 (comment).

@junliume
Copy link
Contributor

Issue #1161 logged and will keep track

@junliume junliume merged commit 068b6b1 into develop Sep 18, 2021
@atamazov
Copy link
Contributor

I am running performance tests on Vega20 right now.

@atamazov
Copy link
Contributor

[Intermediate testing results] Unfortunately I see performance & correctness problems with FP32 on gfx906_60. Analysis is in progress.

An example of correctness problem: duplicated algorithm in find-db (file: gfx906_60.HIP.fdb.txt)

128-17-17-7x1-128-17-17-128-3x0-1x1-1x1-0-NCHW-FP32-F=
    miopenConvolutionFwdAlgoImplicitGEMM:
        ConvHipImplicitGemmV4R1Fwd,0.979353,0,miopenConvolutionFwdAlgoImplicitGEMM,128x17x17x7x1x128x17x17x128xNCHWxFP32x3x0x1x1x1x1x1xF;
    miopenConvolutionFwdAlgoImplicitGEMM:
        ConvAsmImplicitGemmV4R1DynamicFwd,0.979353,0,miopenConvolutionFwdAlgoImplicitGEMM,128x17x17x7x1x128x17x17x128xNCHWxFP32x3x0x1x1x1x1x1xF;
    miopenConvolutionFwdAlgoWinograd:
        ConvBinWinogradRxSf3x2,1.11983,0,miopenConvolutionFwdAlgoWinograd,128x17x17x7x1x128x17x17x128xNCHWxFP32x3x0x1x1x1x1x1xF;
    miopenConvolutionFwdAlgoDirect:
        ConvDirectNaiveConvFwd,20.013,0,miopenConvolutionFwdAlgoDirect,128x17x17x7x1x128x17x17x128xNCHWxFP32x3x0x1x1x1x1x1xF;
    miopenConvolutionFwdAlgoGEMM:
        GemmFwdRest,9081.04,1035776,miopenConvolutionFwdAlgoGEMM,128x17x17x7x1x128x17x17x128xNCHWxFP32x3x0x1x1x1x1x1xF

Two records that refer to the same algorithm, miopenConvolutionFwdAlgoImplicitGEMM:. Notice that time is identical (0.979353); perhaps this is the implicit reason of the issue.

@atamazov
Copy link
Contributor

Unfortunately, there are some incorrect FP16 records that lead to errors like this:

MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvOclDirectFwdFused
MIOpenConvDirUni.cl:922:5: error: 'MLO_CONV_BIAS' is not defined, evaluates to 0 [-Werror,-Wundef]

Some details

Example of incorrect record:

1-48-480-3x3-16-48-480-16-1x1-1x1-1x1-0-NCHW-FP16-F=
            miopenConvolutionFwdAlgoDirect:
                        ConvOclDirectFwdFused,0.08048,0,miopenConvolutionFwdAlgoDirect,1x48x480x3x3x16x48x480x16xNCHWxFP16x1x1x1x1x1x1x1xF;
            miopenConvolutionFwdAlgoWinograd:
                        ConvBinWinogradRxSf2x3g1,0.16864,0,miopenConvolutionFwdAlgoWinograd,1x48x480x3x3x16x48x480x16xNCHWxFP16x1x1x1x1x1x1x1xF;
            miopenConvolutionFwdAlgoGEMM:
                        GemmFwdRest,8920.92,414720,miopenConvolutionFwdAlgoGEMM,1x48x480x3x3x16x48x480x16xNCHWxFP16x1x1x1x1x1x1x1xF

How it should look like:

1-48-480-3x3-16-48-480-16-1x1-1x1-1x1-0-NCHW-FP16-F=
            miopenConvolutionFwdAlgoDirect:
                        ConvOclDirectFwd,0.06528,0,miopenConvolutionFwdAlgoDirect,<unused>;
            miopenConvolutionFwdAlgoGEMM:
                        GemmFwdRest,0.990416,414720,miopenConvolutionFwdAlgoGEMM,<unused>;
            miopenConvolutionFwdAlgoWinograd:
                        ConvBinWinogradRxSf2x3g1,0.1456,0,miopenConvolutionFwdAlgoWinograd,<unused>

Notice ConvOclDirectFwdFused and huge GEMM time.

This PR must be reverted. More details available upon request.

@atamazov
Copy link
Contributor

Maybe it is enough to revert only find-db changes. I am investigating this possibility.

@atamazov
Copy link
Contributor

Maybe it is enough to revert only find-db changes. I am investigating this possibility.

No. There is some FP32 perf regression (~23%) due to missing perf-db records (many "Perf Db: record not found for..." messages).

revert-tuning-rocm-4.4-fordev - ONLY FIND-DB REVERTED - - - -
ROCm 4.0, gfx906/60 - Performance gain, based on sum of all times/directions - -
. - GPU time Wall time Aux Wall time
FP32 Default find mode 0.77 0.98 0.92
FP32 OpenCL BE Default find mode 0.77 0.96 1.02

FP16/BF16 not tested. Let's revert all.

atamazov added a commit that referenced this pull request Sep 20, 2021
atamazov added a commit that referenced this pull request Sep 20, 2021
junliume pushed a commit that referenced this pull request Sep 21, 2021
* Revert find-db changes from #1133 "[Tuning] rocm 4.4 update for develop"

* Revert "Revert find-db changes from #1133 "[Tuning] rocm 4.4 update for develop""

This reverts commit 86e7e0a.

* Revert "[Tuning] rocm 4.4 update for develop (#1133)"

This reverts commit 068b6b1.
@JehandadKhan JehandadKhan deleted the tuning-rocm-4.4-fordev branch September 23, 2021 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants