Layer Norm x86 SIMD Optimizations #4065
Conversation
…something strange in packing layout;
Codecov Report
@@            Coverage Diff             @@
##           master    #4065      +/-   ##
==========================================
+ Coverage   94.41%   94.43%   +0.02%
==========================================
  Files         745      748       +3
  Lines      178496   179052     +556
==========================================
+ Hits       168533   169094     +561
+ Misses       9963     9958       -5
Continue to review full report at Codecov.
- for SIMD register horizontal sum, there is a utility function in x86_usability.h (sketched below)
- for AVX/FMA multiply-add intrinsics, there is a wrapper comp_fmadd function in x86_usability.h (also sketched below)
- use size * elempack as the loop count when applicable, so you can merge multiple for-loop code blocks into one
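A minimal, self-contained sketch of what those two helpers look like. ncnn's src/layer/x86/x86_usability.h provides the real versions; the names and exact signatures below are illustrative, not copied from that header.

```cpp
#include <immintrin.h>

// Horizontal sum of the 8 float lanes of an AVX register
// (the kind of utility the first point refers to).
static inline float reduce_add_ps(__m256 x)
{
    __m128 lo = _mm256_castps256_ps128(x);
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 s = _mm_add_ps(lo, hi);              // 8 lanes -> 4
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // 4 lanes -> 2
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); // 2 lanes -> 1
    return _mm_cvtss_f32(s);
}

// Multiply-add wrapper in the spirit of comp_fmadd: the same call site
// compiles to a fused multiply-add on FMA targets and to mul+add elsewhere.
static inline __m256 comp_fmadd_ps(__m256 a, __m256 b, __m256 c)
{
#if __FMA__
    return _mm256_fmadd_ps(a, b, c);
#else
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
#endif
}
```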
I think I do not need SIMD register horizontal summation, because the length of the tensor varies. The AVX/FMA fmadd wrappers in x86_usability.h are now used. But I'm not sure how to merge multiple loop blocks into one by using size * elempack.
suppose v is data from the tensor, and a is the weight (such as alpha, beta, gamma, etc.); written out for pack1, pack4, pack8, and pack16, the layouts all store the elements as consecutive floats, so the four cases reduce to one unified pack loop (sketched below)
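A hedged sketch of that unified loop, assuming v and a share the same packed layout (the "when applicable" condition above); the function and variable names are illustrative, not taken from the PR itself.

```cpp
#include <immintrin.h>

// Elementwise scale of a packed tensor: a Mat of `size` elements with
// `elempack` lanes is just size * elempack consecutive floats, so
// pack1/pack4/pack8/pack16 all take this one code path.
static void scale_inplace(float* v, const float* a, int size, int elempack)
{
    const int count = size * elempack; // one loop count for every layout
    int i = 0;
#if __AVX__
    for (; i + 8 <= count; i += 8)
    {
        __m256 _v = _mm256_loadu_ps(v + i);
        __m256 _a = _mm256_loadu_ps(a + i);
        _mm256_storeu_ps(v + i, _mm256_mul_ps(_v, _a));
    }
#endif
    for (; i < count; i++) // scalar tail, also covers non-AVX builds
        v[i] *= a[i];
}
```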
Thanks. Now I managed to merge many cases into one.
add copyright header for new source
diff coverage is not good enough; see https://app.codecov.io/gh/Tencent/ncnn/pull/4065
I've added some test cases for 16-packed tensors. But I'm confused about the diff coverage: most files shown at https://app.codecov.io/gh/Tencent/ncnn/pull/4065 are not modified or even influenced by this PR, and I have no idea how the diff coverage is computed.
It often fails in that way.
Thanks for your contribution!
This PR provides some SIMD optimizations for LayerNorm, for both packed and unpacked tensors.
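For reference, this is the scalar computation the SIMD code vectorizes: normalize each vector to zero mean and unit variance, then apply the affine gamma/beta. A minimal sketch; parameter names follow common LayerNorm convention rather than ncnn's internal code.

```cpp
#include <cmath>

// y = (x - mean) / sqrt(var + eps) * gamma + beta, over one vector of `size` floats
static void layernorm_ref(float* x, const float* gamma, const float* beta,
                          int size, float eps)
{
    float mean = 0.f;
    for (int i = 0; i < size; i++)
        mean += x[i];
    mean /= size;

    float var = 0.f;
    for (int i = 0; i < size; i++)
        var += (x[i] - mean) * (x[i] - mean);
    var /= size;

    const float rstd = 1.f / std::sqrt(var + eps);
    for (int i = 0; i < size; i++)
        x[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
}
```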