src:cpu:aarch64 restricts the conv operation to just one post-op when the src/wei/dst data type is fp16 #1689

Conversation
Hi @renato-arantes, thanks for the patch.
Hi @mgouicem, the root cause is that oneDNN conducts all fp16 post-op operations in fp32 and then downgrades to fp16 at the end, leading to a rounding error. To bypass this issue, we decided to limit the conv operation to a single post-op when the src/wei/dst data type is fp16. Here are two
I have several more if you need them.
This is the expected behavior in oneDNN, and this is exactly what benchdnn checks. I just confirmed that these two tests pass on the x64 implementation, which has that exact behavior (computing post-ops in f32).
Could you clarify how this removes the f32->f16 down-conversion? I would expect the f32->f16 conversion to be there independently of the number of post-ops, since convolution accumulates to f32 anyway.
Hi @mgouicem,
Both fail for me as this patch is for aarch64 and not for x64! You can check that my patch only changes files under Here is the
I built oneDNN from the
Hi @mgouicem,
ACL does all post-ops in f16 when demanded by
This patch doesn't remove the down-conversion, but on aarch64 it limits the number of post-op operations to just one, so the probability of a comparison failure against the ACL result due to the f32->f16 downcast is lower. Currently, I haven't seen any test with just one post-op that fails.
Sorry if my point was not clear. What I meant is that if post-ops are run in f32, and the down-conversion happens only once before writing to dst (as in the x64 implementation), then there should be no failure in benchdnn for those test cases.
It makes sense that the issue would be post-ops using f16 instead of f32. There were a few discussions on this topic, and the plan was to rely on f32 post-ops for ACL (see this PR and this PR). Was there any change with respect to this?
Hence my concern: this PR does not seem to address the actual issue (the use of f16 in post-ops), but mostly hides it (benchdnn uses a higher threshold when post-ops are used, so with one post-op not much error is compounded and the test passes).
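The compounding effect described above can be sketched numerically. The following is an illustrative numpy example, not oneDNN code: the values and the number of post-ops are arbitrary assumptions, chosen only to show that rounding to f16 after every post-op drifts further from the exact result than computing the post-op chain in f32 and down-converting once.

```python
import numpy as np

# Illustrative sketch (not oneDNN code): an f32 accumulator followed by
# eight small additive post-ops (values are arbitrary for demonstration).
acc = np.float32(0.1)               # convolution accumulator (f32)
post_ops = [np.float32(1e-4)] * 8   # hypothetical binary-add post-ops

# Strategy A: every post-op computed and stored in f16,
# so the result is rounded after each step.
a = np.float16(acc)
for p in post_ops:
    a = np.float16(a + np.float16(p))

# Strategy B: post-ops kept in f32, with a single
# down-conversion before writing to dst.
b = acc
for p in post_ops:
    b = b + p
b = np.float16(b)

exact = 0.1 + 8 * 1e-4
# Strategy A accumulates a rounding error at every step, so it ends up
# further from the exact value than strategy B.
assert abs(float(a) - exact) > abs(float(b) - exact)
```

This is why a single post-op stays within benchdnn's threshold while longer f16 chains do not: the per-step rounding errors of strategy A add up, whereas strategy B pays only one rounding step at the end.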
Discussed this PR with @milpuz01, @dzarukin and @cfRod:
The API to manage accumulation mode is now merged to the main branch. Please decide whether you want to continue working on this PR or open a new one.
Closing as stale. |
Description
When using fp16 on AArch64, there are multiple failed conv tests due to a precision problem in oneDNN's fp16 operations. oneDNN conducts all fp16 post-op operations in fp32 and then downgrades to fp16 at the end, leading to a rounding error. To resolve this issue, this patch limits the conv operation to a single post-op when the src/wei/dst data type is fp16.
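For scale, the error introduced by that final downgrade is small and bounded: a single correctly-rounded f32->f16 conversion is off by at most half an f16 ulp, i.e. a relative error of at most 2^-11 for normal values. A numpy sketch (the input value is an arbitrary example, not taken from any test):

```python
import numpy as np

# Sketch with an arbitrary example value: one f32 -> f16 down-conversion
# is correctly rounded, so its relative error is at most eps/2 = 2**-11
# (f16 machine epsilon is 2**-10).
x = np.float32(0.1008)   # hypothetical f32 result after post-ops
y = np.float16(x)        # single down-conversion before writing dst
rel_err = abs(float(y) - float(x)) / float(x)
assert rel_err <= 2.0 ** -11
```

A single conversion therefore stays well within the correctness thresholds a checker like benchdnn applies; it is the repeated rounding inside an f16 post-op chain that pushes results over the threshold.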
Checklist
General
- Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?