This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[operator] Integrate oneDNN layer normalization implementation #19562

Merged
4 commits merged into apache:master from layernormmaster on Aug 24, 2021

Conversation

bartekkuncer
Contributor

@bartekkuncer commented on Nov 18, 2020

Description

The change integrates oneDNN's implementation of forward and backward propagation of Layer Normalization for axis == -1 (the default case, i.e. the last axis).

Comments

As oneDNN's LayerNorm primitive does not support an axis parameter (https://oneapi-src.github.io/oneDNN/dev_guide_layer_normalization.html), I had to adjust the tensors in MXNet before sending them to oneDNN to make it work with axis != -1. I tried two approaches:

  1. Create custom memory descriptors for tensors with adjusted shapes and strides to make layer normalization operate along a different axis.
  2. Reorder tensors before and after running the layer normalization primitive.

Both approaches turned out to be significantly slower than the current MXNet implementation.
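
For reference, approach 1 looks roughly like the sketch below (illustrative only, not the actual code from this change): the normalized axis is moved to the last logical position while the original physical layout is kept via explicit strides. The helper name and the f32 data type are assumptions.

  // Rough sketch of approach 1: present the tensor to oneDNN with the
  // normalized axis moved to the last *logical* position, keeping the original
  // row-major physical layout via explicit strides.
  #include <dnnl.hpp>

  dnnl::memory::desc DescWithAxisLast(const dnnl::memory::dims& shape, int axis) {
    // Row-major strides of the original tensor.
    dnnl::memory::dims strides(shape.size());
    dnnl::memory::dim stride = 1;
    for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
      strides[i] = stride;
      stride *= shape[i];
    }
    // Move the normalized axis to the end of both the dims and the strides.
    dnnl::memory::dims dims = shape;
    dims.push_back(dims[axis]);
    dims.erase(dims.begin() + axis);
    strides.push_back(strides[axis]);
    strides.erase(strides.begin() + axis);
    return dnnl::memory::desc(dims, dnnl::memory::data_type::f32, strides);
  }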

oneDNN's backward propagation is significantly faster than the current MXNet implementation. The forward implementation has performance similar to MXNet's generic version: depending on the shape, sometimes Marian is faster and sometimes oneDNN is. As the performance difference is significant in some of these cases, I introduced a simple heuristic (based on a large amount of benchmarking) to decide whether layer normalization should be computed by oneDNN:

  // Prefer oneDNN only when both the leading dimension and the number of
  // elements per leading-dimension slice (shape.Size() / shape[0]) reach the limit.
  auto ShapeBetterForMKLDNN = [](const mxnet::TShape& shape) {
    constexpr size_t shapeLimit = 1024;
    return shape.Size() / shape[0] >= shapeLimit && shape[0] >= shapeLimit;
  };

The above function can be found in the mkldnn_layer_norm.cc file.
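
For a concrete feel for the threshold, below is a self-contained rewrite of the same check on a plain shape vector (illustrative only; the real code operates on mxnet::TShape), evaluated on two shapes from the benchmarks:

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <numeric>
  #include <vector>

  // Same heuristic as above, restated on a plain shape vector: prefer oneDNN
  // only when both the leading dimension and the number of elements per
  // leading-dimension slice reach the limit.
  bool ShapeBetterForMKLDNN(const std::vector<int64_t>& shape) {
    constexpr int64_t shapeLimit = 1024;
    const int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1},
                                         std::multiplies<int64_t>());
    return size / shape[0] >= shapeLimit && shape[0] >= shapeLimit;
  }

  int main() {
    std::cout << ShapeBetterForMKLDNN({16384, 1024}) << '\n';  // 1 -> oneDNN
    std::cout << ShapeBetterForMKLDNN({45, 512}) << '\n';      // 0 -> generic MXNet kernel
  }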

Most recent performance numbers

ln_opperf1908clx.xlsx

@mxnet-bot

Hey @bartekkuncer, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, sanity, centos-gpu, miscellaneous, website, unix-gpu, centos-cpu, clang, windows-cpu, unix-cpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-testing label on Nov 18, 2020
@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Nov 19, 2020
@pengzhao-intel
Contributor

@bartekkuncer could you add more description in the PR to avoid confusion for the reviewers?

@bartekkuncer
Contributor Author

> @bartekkuncer could you add more description in the PR to avoid confusion for the reviewers?

@pengzhao-intel I was planning to add them as soon as I fix the tests :)

@bartekkuncer changed the title from "Integrate oneDNN layer normalization implementation" to "[WIP] Integrate oneDNN layer normalization implementation" on Nov 25, 2020
@kpuatamazon
Contributor

Hi, can we compare #19601?

@bartekkuncer
Contributor Author

bartekkuncer commented Dec 10, 2020

> Hi, can we compare #19601?

@kpuatamazon Sorry for the late response, I was waiting for the layer norm optimization in oneDNN. Below are the results I got using Marian and my oneDNN implementation. It looks like oneDNN is faster in most cases. Can you tell me what CPU you are using?

| shape | marian (28 threads) | mx w/ onednn (28 threads) | marian (4 threads) | mx w/ onednn (4 threads) |
| --- | --- | --- | --- | --- |
| 1000x5 | 0,0213 | 0,032 | 0,0272 | 0,029 |
| 1000x100 | 0,03 | 0,0859 | 0,0664 | 0,0421 |
| 300x512 | 0,0255 | 0,1505 | 0,0667 | 0,073 |
| 500x512 | 0,0305 | 0,2436 | 0,0859 | 0,0449 |
| 1000x2048 | 0,1396 | 0,0616 | 0,4103 | 0,2594 |
| 1000x3 | 0,0156 | 0,0239 | 0,018 | 0,0242 |
| 45x512 | 0,0152 | 0,038 | 0,0252 | 0,0377 |
| 1000x5x100 | 0,0652 | 0,0457 | 0,2194 | 0,117 |
| 1000x8x100 | 0,0963 | 0,0538 | 0,3147 | 0,1594 |
| 300x512x512 | 8,0384 | 7,8368 | 31,491 | 28,0296 |
| 500x512x10 | 0,6231 | 0,3557 | 3,9009 | 1,0595 |
| 1000x5x2048 | 2,1114 | 1,574 | 4,9501 | 3,8672 |
| 1000x2048x3 | 2,6274 | 1,3618 | 12,9816 | 5,8291 |
| 45x512x512 | 2,2648 | 1,7278 | 5,4215 | 4,6426 |
| 1000x5x30x200 | 3,6479 | 3,5311 | 13,3026 | 11,3592 |
| 100x100x10x300 | 3,6913 | 3,9369 | 13,4574 | 10,93 |
| 300x512x45x20 | 24,2647 | 16,6294 | 146,4946 | 76,6008 |
| 50x512x40x30 | 5,3653 | 4,0195 | 27,1334 | 15,8884 |
| 100x2048x10x10 | 6,2846 | 3,4051 | 34,9686 | 13,3237 |
| 1000x3x10x200 | 0,9976 | 0,5507 | 2,1544 | 1,0273 |
| 45x52x300x45 | 4,7193 | 3,9909 | 22,6179 | 14,0623 |
| 100x5x30x20x10 | 0,7225 | 0,3956 | 4,5843 | 1,3655 |
| 100x4x100x10x30 | 2,7491 | 1,9519 | 10,9089 | 6,3355 |
| 300x52x2x45x20 | 5,7044 | 3,9567 | 30,7142 | 15,74 |
| 500x52x3x40x30 | 13,8315 | 10,8504 | 80,604 | 46,8657 |
| 100x28x10x10x10 | 0,6829 | 0,4126 | 4,3192 | 1,2705 |
| 100x3x10x18x200 | 2,1217 | 1,752 | 5,232 | 4,2747 |
| 45x512x30x4x45 | 15,4316 | 12,9341 | 86,8337 | 54,7685 |

I built MXNet using:
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl -DUSE_CUDA=0 -DUSE_LAPACK=0 -DUSE_GPERFTOOLS=0 -DUSE_OPENCV=0 ..

@kpuatamazon
Contributor

kpuatamazon commented Dec 14, 2020

I've been using a c5.12xlarge with an Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz. I assume these are some sort of seconds?

We should at least try -march=native to see if it's just a matter of CPU support, i.e. MXNet doesn't seem to enable AVX512 by default, and one could add CPUID dispatch.
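
As a hypothetical illustration of such CPUID dispatch (not code from MXNet or this PR), GCC function multi-versioning can emit an AVX-512 clone next to a baseline one and pick between them at load time:

  // Hypothetical example of CPUID dispatch via GCC function multi-versioning:
  // the generated resolver selects the AVX-512 clone when the CPU supports it.
  __attribute__((target_clones("avx512f", "default")))
  float SumSquares(const float* x, int n) {
    float acc = 0.f;
    for (int i = 0; i < n; ++i)
      acc += x[i] * x[i];
    return acc;
  }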

We might as well reshape to two dimensions, with the axis preserved and everything else multiplied together. The problem is identical for e.g. 100x28x10x10x10 and 280000x10. Also, those are some really small channels to layer-normalize over.
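
A minimal sketch of that reshape, assuming the normalized axis is the last one (names are illustrative):

  #include <cstddef>
  #include <cstdint>
  #include <utility>
  #include <vector>

  // Collapse every dimension except the (last) normalized axis into a single
  // leading dimension, e.g. 100x28x10x10x10 -> 280000x10.
  std::pair<int64_t, int64_t> FlattenToTwoDims(const std::vector<int64_t>& shape) {
    int64_t rows = 1;
    for (std::size_t i = 0; i + 1 < shape.size(); ++i)
      rows *= shape[i];
    return {rows, shape.back()};
  }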

Also, I feel like the optimal assembly implementation would benefit from a different ordering of the input tensor to allow for pure vertical adds, whereas layer normalization is currently set up for horizontal adds. I can certainly see how a JIT would do better at e.g. 1000x3, where multiple problems share the same vector. But oddly that's where Marian is doing better.

@fhieber
Contributor

fhieber commented Dec 14, 2020

Speaking up as a 'customer' of LayerNorm here: Sockeye (and its Transformer models) cares about smaller matrix sizes for LayerNorm, i.e. typically in ranges around (y, 512) for y in range(2,100). It would be great if we could get the performance benefits of the Marian implementation into MXNet.

@kpuatamazon
Contributor

I made the sizing more systematic.

AVX512 means __attribute__((target("avx512f,avx512bw,avx512cd,avx512dq,avx512vnni")))

Inverse means the 1.f/std is computed in advance rather than dividing by std in the loop.

Overall, the Marian implementation seems to win on smaller problem sizes, including the x512 sizes from @fhieber, but lose on larger problem sizes.

Of course there are edge cases when the width is not a multiple of 16, and gcc is testing for those edge cases every time, so I see how that could be optimized.
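
For clarity, here is a scalar sketch of the "+ AVX512 + inverse" variant described above (function and parameter names are hypothetical, and the real kernels are vectorized rather than scalar):

  // Scalar sketch of the "inverse" trick: compute 1.f / std once per row and
  // multiply inside the normalization loop instead of dividing.
  #include <cmath>
  #include <cstddef>

  __attribute__((target("avx512f,avx512bw,avx512cd,avx512dq,avx512vnni")))
  void LayerNormRow(const float* x, const float* gamma, const float* beta,
                    float* out, std::size_t width, float eps) {
    float mean = 0.f;
    for (std::size_t i = 0; i < width; ++i) mean += x[i];
    mean /= width;
    float var = 0.f;
    for (std::size_t i = 0; i < width; ++i) {
      const float d = x[i] - mean;
      var += d * d;
    }
    var /= width;
    const float inv_std = 1.f / std::sqrt(var + eps);  // precomputed inverse
    for (std::size_t i = 0; i < width; ++i)
      out[i] = gamma[i] * (x[i] - mean) * inv_std + beta[i];
  }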

Shape | Marian | Marian + AVX512 | Marian + AVX512 + inverse | oneDNN
1x 3 0.0000363 0.0000370 0.0000364 0.0000453
5x 3 0.0000337 0.0000355 0.0000348 0.0000426
10x 3 0.0000346 0.0000352 0.0000344 0.0000434
20x 3 0.0000337 0.0000342 0.0000349 0.0000438
30x 3 0.0000354 0.0000376 0.0000371 0.0000434
40x 3 0.0000373 0.0000396 0.0000382 0.0000431
50x 3 0.0000381 0.0000405 0.0000393 0.0000436
60x 3 0.0000390 0.0000408 0.0000403 0.0000447
70x 3 0.0000391 0.0000376 0.0000411 0.0000440
80x 3 0.0000403 0.0000400 0.0000414 0.0000438
90x 3 0.0000399 0.0000399 0.0000392 0.0000446
100x 3 0.0000378 0.0000409 0.0000399 0.0000451
110x 3 0.0000385 0.0000414 0.0000398 0.0000446
120x 3 0.0000390 0.0000413 0.0000417 0.0000454
130x 3 0.0000389 0.0000429 0.0000420 0.0000457
140x 3 0.0000402 0.0000437 0.0000436 0.0000452
150x 3 0.0000403 0.0000442 0.0000432 0.0000461
200x 3 0.0000432 0.0000480 0.0000466 0.0000476
300x 3 0.0000476 0.0000553 0.0000533 0.0000506
400x 3 0.0000532 0.0000617 0.0000604 0.0000533
500x 3 0.0000575 0.0000694 0.0000670 0.0000570
1000x 3 0.0000826 0.0001037 0.0001029 0.0000713
2000x 3 0.0001340 0.0001730 0.0001706 0.0000988
3000x 3 0.0001818 0.0002431 0.0002402 0.0001275
4000x 3 0.0002298 0.0003116 0.0003092 0.0001554
5000x 3 0.0002777 0.0003804 0.0003814 0.0001832
16384x 3 0.0008319 0.0011868 0.0012013 0.0005041
1x 10 0.0000339 0.0000387 0.0000348 0.0000412
5x 10 0.0000336 0.0000346 0.0000345 0.0000423
10x 10 0.0000349 0.0000347 0.0000343 0.0000426
20x 10 0.0000339 0.0000369 0.0000374 0.0000423
30x 10 0.0000371 0.0000391 0.0000378 0.0000439
40x 10 0.0000378 0.0000397 0.0000395 0.0000433
50x 10 0.0000393 0.0000407 0.0000408 0.0000440
60x 10 0.0000398 0.0000411 0.0000423 0.0000444
70x 10 0.0000403 0.0000399 0.0000416 0.0000445
80x 10 0.0000411 0.0000414 0.0000398 0.0000449
90x 10 0.0000410 0.0000417 0.0000406 0.0000452
100x 10 0.0000391 0.0000424 0.0000416 0.0000457
110x 10 0.0000403 0.0000433 0.0000425 0.0000460
120x 10 0.0000404 0.0000446 0.0000433 0.0000465
130x 10 0.0000414 0.0000450 0.0000440 0.0000463
140x 10 0.0000413 0.0000460 0.0000455 0.0000467
150x 10 0.0000420 0.0000474 0.0000455 0.0000471
200x 10 0.0000457 0.0000511 0.0000497 0.0000496
300x 10 0.0000517 0.0000598 0.0000579 0.0000522
400x 10 0.0000584 0.0000697 0.0000660 0.0000560
500x 10 0.0000646 0.0000772 0.0000759 0.0000592
1000x 10 0.0000956 0.0001229 0.0001169 0.0000773
2000x 10 0.0001586 0.0002063 0.0002010 0.0001119
3000x 10 0.0002192 0.0002946 0.0002856 0.0001441
4000x 10 0.0002827 0.0003824 0.0003686 0.0001793
5000x 10 0.0003435 0.0004679 0.0004548 0.0002134
16384x 10 0.0010506 0.0015780 0.0015417 0.0006026
1x 100 0.0000348 0.0000379 0.0000405 0.0000427
5x 100 0.0000348 0.0000346 0.0000343 0.0000433
10x 100 0.0000337 0.0000350 0.0000367 0.0000436
20x 100 0.0000401 0.0000388 0.0000395 0.0000445
30x 100 0.0000411 0.0000406 0.0000393 0.0000452
40x 100 0.0000379 0.0000381 0.0000375 0.0000462
50x 100 0.0000391 0.0000399 0.0000384 0.0000466
60x 100 0.0000394 0.0000412 0.0000393 0.0000474
70x 100 0.0000422 0.0000428 0.0000408 0.0000489
80x 100 0.0000433 0.0000439 0.0000408 0.0000492
90x 100 0.0000436 0.0000445 0.0000425 0.0000500
100x 100 0.0000448 0.0000458 0.0000435 0.0000510
110x 100 0.0000467 0.0000476 0.0000448 0.0000514
120x 100 0.0000466 0.0000481 0.0000456 0.0000527
130x 100 0.0000487 0.0000501 0.0000469 0.0000529
140x 100 0.0000501 0.0000515 0.0000479 0.0000538
150x 100 0.0000503 0.0000517 0.0000494 0.0000550
200x 100 0.0000556 0.0000595 0.0000539 0.0000592
300x 100 0.0000670 0.0000708 0.0000639 0.0000678
400x 100 0.0000782 0.0000825 0.0000748 0.0000727
500x 100 0.0000898 0.0000946 0.0000857 0.0000824
1000x 100 0.0001492 0.0001591 0.0001409 0.0001263
2000x 100 0.0002653 0.0002819 0.0002536 0.0002101
3000x 100 0.0003822 0.0004043 0.0003598 0.0002898
4000x 100 0.0004926 0.0005266 0.0004686 0.0003655
5000x 100 0.0006051 0.0006524 0.0005765 0.0004505
16384x 100 0.0020228 0.0021531 0.0019262 0.0014567
1x 256 0.0000374 0.0000397 0.0000358 0.0000434
5x 256 0.0000336 0.0000409 0.0000335 0.0000434
10x 256 0.0000399 0.0000370 0.0000404 0.0000436
20x 256 0.0000423 0.0000408 0.0000400 0.0000450
30x 256 0.0000383 0.0000371 0.0000373 0.0000463
40x 256 0.0000411 0.0000394 0.0000384 0.0000469
50x 256 0.0000418 0.0000411 0.0000386 0.0000476
60x 256 0.0000431 0.0000424 0.0000407 0.0000481
70x 256 0.0000455 0.0000441 0.0000419 0.0000495
80x 256 0.0000465 0.0000456 0.0000433 0.0000496
90x 256 0.0000493 0.0000476 0.0000445 0.0000510
100x 256 0.0000502 0.0000495 0.0000457 0.0000522
110x 256 0.0000524 0.0000500 0.0000467 0.0000534
120x 256 0.0000535 0.0000517 0.0000475 0.0000535
130x 256 0.0000554 0.0000534 0.0000493 0.0000549
140x 256 0.0000573 0.0000551 0.0000512 0.0000553
150x 256 0.0000597 0.0000570 0.0000521 0.0000568
200x 256 0.0000679 0.0000639 0.0000581 0.0000631
300x 256 0.0000850 0.0000826 0.0000713 0.0000709
400x 256 0.0001040 0.0000962 0.0000854 0.0000832
500x 256 0.0001231 0.0001130 0.0001000 0.0000967
1000x 256 0.0002105 0.0001881 0.0001694 0.0001590
2000x 256 0.0003913 0.0003506 0.0003000 0.0002847
3000x 256 0.0005685 0.0005093 0.0004264 0.0004101
4000x 256 0.0007300 0.0006509 0.0005445 0.0005121
5000x 256 0.0009223 0.0008244 0.0006968 0.0006512
16384x 256 0.0036732 0.0032559 0.0029458 0.0025328
1x 512 0.0000366 0.0000392 0.0000344 0.0000438
5x 512 0.0000394 0.0000354 0.0000346 0.0000441
10x 512 0.0000408 0.0000401 0.0000381 0.0000448
20x 512 0.0000444 0.0000435 0.0000389 0.0000453
30x 512 0.0000410 0.0000403 0.0000386 0.0000474
40x 512 0.0000446 0.0000429 0.0000403 0.0000481
50x 512 0.0000476 0.0000463 0.0000428 0.0000496
60x 512 0.0000507 0.0000478 0.0000440 0.0000510
70x 512 0.0000539 0.0000502 0.0000458 0.0000518
80x 512 0.0000577 0.0000538 0.0000476 0.0000537
90x 512 0.0000602 0.0000558 0.0000504 0.0000572
100x 512 0.0000616 0.0000582 0.0000518 0.0000565
110x 512 0.0000667 0.0000607 0.0000556 0.0000600
120x 512 0.0000689 0.0000642 0.0000576 0.0000612
130x 512 0.0000735 0.0000654 0.0000587 0.0000616
140x 512 0.0000744 0.0000695 0.0000607 0.0000661
150x 512 0.0000759 0.0000695 0.0000636 0.0000653
200x 512 0.0000913 0.0000831 0.0000738 0.0000793
300x 512 0.0001299 0.0001136 0.0001037 0.0001073
400x 512 0.0001585 0.0001409 0.0001313 0.0001338
500x 512 0.0001883 0.0001598 0.0001498 0.0001580
1000x 512 0.0003369 0.0002909 0.0002665 0.0002632
2000x 512 0.0006396 0.0005457 0.0004962 0.0004790
3000x 512 0.0009528 0.0008094 0.0007397 0.0006964
4000x 512 0.0013894 0.0011007 0.0010205 0.0009181
5000x 512 0.0018264 0.0015710 0.0015030 0.0012737
16384x 512 0.0074474 0.0067132 0.0065738 0.0060986
1x 1024 0.0000347 0.0000357 0.0000356 0.0000445
5x 1024 0.0000391 0.0000385 0.0000362 0.0000450
10x 1024 0.0000383 0.0000410 0.0000393 0.0000459
20x 1024 0.0000439 0.0000408 0.0000384 0.0000493
30x 1024 0.0000489 0.0000458 0.0000417 0.0000504
40x 1024 0.0000557 0.0000516 0.0000448 0.0000527
50x 1024 0.0000587 0.0000545 0.0000490 0.0000546
60x 1024 0.0000650 0.0000586 0.0000516 0.0000610
70x 1024 0.0000707 0.0000617 0.0000563 0.0000616
80x 1024 0.0000768 0.0000683 0.0000618 0.0000659
90x 1024 0.0000834 0.0000735 0.0000695 0.0000716
100x 1024 0.0000886 0.0000759 0.0000708 0.0000757
110x 1024 0.0000949 0.0000833 0.0000779 0.0000826
120x 1024 0.0001031 0.0000882 0.0000841 0.0000858
130x 1024 0.0001088 0.0000956 0.0000887 0.0000903
140x 1024 0.0001156 0.0001010 0.0000923 0.0000969
150x 1024 0.0001152 0.0001057 0.0000978 0.0001028
200x 1024 0.0001450 0.0001328 0.0001257 0.0001313
300x 1024 0.0002082 0.0001793 0.0001721 0.0001762
400x 1024 0.0002650 0.0002286 0.0002192 0.0002239
500x 1024 0.0003157 0.0002699 0.0002596 0.0002592
1000x 1024 0.0005968 0.0005075 0.0004867 0.0004764
2000x 1024 0.0012529 0.0010099 0.0009831 0.0009186
3000x 1024 0.0021209 0.0017632 0.0018832 0.0017245
4000x 1024 0.0029924 0.0025091 0.0027886 0.0024449
5000x 1024 0.0040340 0.0034140 0.0037645 0.0033816
16384x 1024 0.0149200 0.0130993 0.0142042 0.0132403

@bartekkuncer
Contributor Author

@kpuatamazon which version of oneDNN did you use for the benchmark? There is a change boosting the performance of layer normalization that is going to be included in the oneDNN v2.1 release.

@kpuatamazon
Contributor

I used whatever was in your pull request: c4b6bce

@bartekkuncer force-pushed the layernormmaster branch 5 times, most recently from 4fe9420 to 6d0428c on January 11, 2021 at 13:15
@bartekkuncer changed the title from "[WIP] Integrate oneDNN layer normalization implementation" to "Integrate oneDNN layer normalization implementation" on Jan 11, 2021
@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Jan 11, 2021
@mxnet-bot

Jenkins CI successfully triggered: [centos-gpu]

@mseth10 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Aug 19, 2021
Review thread on src/operator/nn/mkldnn/mkldnn_layer_norm.cc (outdated, resolved)
@mseth10 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-review and pr-awaiting-testing labels on Aug 20, 2021
@mseth10 added the pr-awaiting-testing label and removed the pr-work-in-progress label on Aug 20, 2021
@mseth10 added the pr-work-in-progress label and removed the pr-awaiting-testing label on Aug 20, 2021
@szha
Member

szha commented Aug 22, 2021

@mxnet-bot run ci [website]

@mxnet-bot

Jenkins CI successfully triggered: [website]

@mseth10 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Aug 22, 2021
@akarbown merged commit 695ba2e into apache:master on Aug 24, 2021
KexinFeng pushed a commit to KexinFeng/incubator-mxnet that referenced this pull request Aug 27, 2021
…e#19562)

* [operator] Integrate oneDNN layer normalization implementation

* change sizeof(float) to mshadow_sizeof(inputs[layernorm::kBwdGamma].dtype())

* remove eps from key and unify layernorm_fwd_t/mkldnn::layer_normalization_forward

* add author
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Aug 31, 2021
…e#19562)

* [operator] Integrate oneDNN layer normalization implementation

* change sizeof(float) to mshadow_sizeof(inputs[layernorm::kBwdGamma].dtype())

* remove eps from key and unify layernorm_fwd_t/mkldnn::layer_normalization_forward

* add author
@bartekkuncer deleted the layernormmaster branch on April 17, 2023 at 12:44
Labels: MKLDNN, pr-awaiting-review