This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[operator] Integrate oneDNN layer normalization implementation #19562

Merged
4 commits merged into apache:master from layernormmaster on Aug 24, 2021

Conversation

bartekkuncer
Contributor

@bartekkuncer commented on Nov 18, 2020

Description

The change integrates oneDNN's implementation of forward and backward propagation of Layer Normalization for axis == -1 (the default case, i.e. the last axis).

Comments

As oneDNN's LayerNorm primitive does not support an axis parameter (https://oneapi-src.github.io/oneDNN/dev_guide_layer_normalization.html), I had to adjust the tensors in MXNet before sending them to oneDNN to make it work with axis != -1. I tried two approaches:

  1. Create custom memory descriptors for tensors with adjusted shapes and strides to make layer normalization operate along a different axis.
  2. Reorder tensors before and after running the layer normalization primitive.

Both approaches turned out to be significantly slower than the current MXNet implementation.
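
For reference, approach 1 looks roughly like the sketch below (illustrative only, not the actual code from this change): the normalized axis is moved to the last logical position while the original physical layout is kept via explicit strides. The helper name and the f32 data type are assumptions.

  // Rough sketch of approach 1: present the tensor to oneDNN with the
  // normalized axis moved to the last *logical* position, keeping the original
  // row-major physical layout via explicit strides.
  #include <dnnl.hpp>

  dnnl::memory::desc DescWithAxisLast(const dnnl::memory::dims& shape, int axis) {
    // Row-major strides of the original tensor.
    dnnl::memory::dims strides(shape.size());
    dnnl::memory::dim stride = 1;
    for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
      strides[i] = stride;
      stride *= shape[i];
    }
    // Move the normalized axis to the end of both the dims and the strides.
    dnnl::memory::dims dims = shape;
    dims.push_back(dims[axis]);
    dims.erase(dims.begin() + axis);
    strides.push_back(strides[axis]);
    strides.erase(strides.begin() + axis);
    return dnnl::memory::desc(dims, dnnl::memory::data_type::f32, strides);
  }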

oneDNN's backward propagation is significantly faster than the current MXNet implementation. The forward implementation has performance similar to MXNet's generic version: depending on the shape, sometimes Marian is faster and sometimes oneDNN is. As the performance difference is significant in some of these cases, I introduced a simple heuristic (based on a large amount of benchmarking) to decide whether layer normalization should be computed by oneDNN:

  // Prefer oneDNN only when both the leading dimension and the number of
  // elements per leading-dimension slice (shape.Size() / shape[0]) reach the limit.
  auto ShapeBetterForMKLDNN = [](const mxnet::TShape& shape) {
    constexpr size_t shapeLimit = 1024;
    return shape.Size() / shape[0] >= shapeLimit && shape[0] >= shapeLimit;
  };

The above function can be found in the mkldnn_layer_norm.cc file.
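
For a concrete feel for the threshold, below is a self-contained rewrite of the same check on a plain shape vector (illustrative only; the real code operates on mxnet::TShape), evaluated on two shapes from the benchmarks:

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <numeric>
  #include <vector>

  // Same heuristic as above, restated on a plain shape vector: prefer oneDNN
  // only when both the leading dimension and the number of elements per
  // leading-dimension slice reach the limit.
  bool ShapeBetterForMKLDNN(const std::vector<int64_t>& shape) {
    constexpr int64_t shapeLimit = 1024;
    const int64_t size = std::accumulate(shape.begin(), shape.end(), int64_t{1},
                                         std::multiplies<int64_t>());
    return size / shape[0] >= shapeLimit && shape[0] >= shapeLimit;
  }

  int main() {
    std::cout << ShapeBetterForMKLDNN({16384, 1024}) << '\n';  // 1 -> oneDNN
    std::cout << ShapeBetterForMKLDNN({45, 512}) << '\n';      // 0 -> generic MXNet kernel
  }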

Most recent performance numbers

ln_opperf1908clx.xlsx

@mxnet-bot

Hey @bartekkuncer, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, sanity, centos-gpu, miscellaneous, website, unix-gpu, centos-cpu, clang, windows-cpu, unix-cpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-testing label on Nov 18, 2020
@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Nov 19, 2020
@pengzhao-intel
Contributor

@bartekkuncer could you add more description in the PR to avoid confusion for the reviewers?

@bartekkuncer
Contributor Author

> @bartekkuncer could you add more description in the PR to avoid confusion for the reviewers?

@pengzhao-intel I was planning to add them as soon as I fix the tests :)

@bartekkuncer changed the title from "Integrate oneDNN layer normalization implementation" to "[WIP] Integrate oneDNN layer normalization implementation" on Nov 25, 2020
@kpuatamazon
Contributor

Hi, can we compare #19601?

@bartekkuncer
Contributor Author

bartekkuncer commented Dec 10, 2020

> Hi, can we compare #19601?

@kpuatamazon Sorry for the late response, I was waiting for the layer norm optimization in oneDNN. Below are the results I got using Marian and my oneDNN implementation. It looks like oneDNN is faster in most cases. Can you tell me what CPU you are using?

| shape | marian (28 threads) | mx w/ onednn (28 threads) | marian (4 threads) | mx w/ onednn (4 threads) |
| --- | --- | --- | --- | --- |
| 1000x5 | 0,0213 | 0,032 | 0,0272 | 0,029 |
| 1000x100 | 0,03 | 0,0859 | 0,0664 | 0,0421 |
| 300x512 | 0,0255 | 0,1505 | 0,0667 | 0,073 |
| 500x512 | 0,0305 | 0,2436 | 0,0859 | 0,0449 |
| 1000x2048 | 0,1396 | 0,0616 | 0,4103 | 0,2594 |
| 1000x3 | 0,0156 | 0,0239 | 0,018 | 0,0242 |
| 45x512 | 0,0152 | 0,038 | 0,0252 | 0,0377 |
| 1000x5x100 | 0,0652 | 0,0457 | 0,2194 | 0,117 |
| 1000x8x100 | 0,0963 | 0,0538 | 0,3147 | 0,1594 |
| 300x512x512 | 8,0384 | 7,8368 | 31,491 | 28,0296 |
| 500x512x10 | 0,6231 | 0,3557 | 3,9009 | 1,0595 |
| 1000x5x2048 | 2,1114 | 1,574 | 4,9501 | 3,8672 |
| 1000x2048x3 | 2,6274 | 1,3618 | 12,9816 | 5,8291 |
| 45x512x512 | 2,2648 | 1,7278 | 5,4215 | 4,6426 |
| 1000x5x30x200 | 3,6479 | 3,5311 | 13,3026 | 11,3592 |
| 100x100x10x300 | 3,6913 | 3,9369 | 13,4574 | 10,93 |
| 300x512x45x20 | 24,2647 | 16,6294 | 146,4946 | 76,6008 |
| 50x512x40x30 | 5,3653 | 4,0195 | 27,1334 | 15,8884 |
| 100x2048x10x10 | 6,2846 | 3,4051 | 34,9686 | 13,3237 |
| 1000x3x10x200 | 0,9976 | 0,5507 | 2,1544 | 1,0273 |
| 45x52x300x45 | 4,7193 | 3,9909 | 22,6179 | 14,0623 |
| 100x5x30x20x10 | 0,7225 | 0,3956 | 4,5843 | 1,3655 |
| 100x4x100x10x30 | 2,7491 | 1,9519 | 10,9089 | 6,3355 |
| 300x52x2x45x20 | 5,7044 | 3,9567 | 30,7142 | 15,74 |
| 500x52x3x40x30 | 13,8315 | 10,8504 | 80,604 | 46,8657 |
| 100x28x10x10x10 | 0,6829 | 0,4126 | 4,3192 | 1,2705 |
| 100x3x10x18x200 | 2,1217 | 1,752 | 5,232 | 4,2747 |
| 45x512x30x4x45 | 15,4316 | 12,9341 | 86,8337 | 54,7685 |

I built MXNet using:
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl -DUSE_CUDA=0 -DUSE_LAPACK=0 -DUSE_GPERFTOOLS=0 -DUSE_OPENCV=0 ..

@kpuatamazon
Contributor

kpuatamazon commented Dec 14, 2020

I've been using a c5.12xlarge with an Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz. I assume these are some sort of seconds?

We should at least try -march=native to see if it's just a matter of CPU support, i.e. MXNet doesn't seem to enable AVX512 by default, and one could add CPUID dispatch.
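
As a hypothetical illustration of such CPUID dispatch (not code from MXNet or this PR), GCC function multi-versioning can emit an AVX-512 clone next to a baseline one and pick between them at load time:

  // Hypothetical example of CPUID dispatch via GCC function multi-versioning:
  // the generated resolver selects the AVX-512 clone when the CPU supports it.
  __attribute__((target_clones("avx512f", "default")))
  float SumSquares(const float* x, int n) {
    float acc = 0.f;
    for (int i = 0; i < n; ++i)
      acc += x[i] * x[i];
    return acc;
  }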

We might as well reshape to two dimensions, with the axis preserved and everything else multiplied together. The problem is identical for e.g. 100x28x10x10x10 and 280000x10. Also, those are some really small channels to layer-normalize over.
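
A minimal sketch of that reshape, assuming the normalized axis is the last one (names are illustrative):

  #include <cstddef>
  #include <cstdint>
  #include <utility>
  #include <vector>

  // Collapse every dimension except the (last) normalized axis into a single
  // leading dimension, e.g. 100x28x10x10x10 -> 280000x10.
  std::pair<int64_t, int64_t> FlattenToTwoDims(const std::vector<int64_t>& shape) {
    int64_t rows = 1;
    for (std::size_t i = 0; i + 1 < shape.size(); ++i)
      rows *= shape[i];
    return {rows, shape.back()};
  }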

Also, I feel like the optimal assembly implementation would benefit from a different ordering of the input tensor to allow for pure vertical adds, whereas layer normalization is currently set up for horizontal adds. I can certainly see how a JIT would do better at e.g. 1000x3, where multiple problems share the same vector. But oddly that's where Marian is doing better.

@fhieber
Contributor

fhieber commented Dec 14, 2020

Speaking up as a 'customer' of LayerNorm here: Sockeye (and its Transformer models) cares about smaller matrix sizes for LayerNorm, i.e. typically in ranges around (y, 512) for y in range(2,100). It would be great if we could get the performance benefits of the Marian implementation into MXNet.

@kpuatamazon
Contributor

I made the sizing more systematic.

AVX512 means __attribute__((target("avx512f,avx512bw,avx512cd,avx512dq,avx512vnni")))

Inverse means the 1.f/std is computed in advance rather than dividing by std in the loop.

Overall, the Marian implementation seems to win on smaller problem sizes, including the x512 sizes from @fhieber, but lose on larger problem sizes.

Of course there are edge cases when the width is not a multiple of 16, and gcc is testing for those edge cases every time, so I see how that could be optimized.
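
For clarity, here is a scalar sketch of the "+ AVX512 + inverse" variant described above (function and parameter names are hypothetical, and the real kernels are vectorized rather than scalar):

  // Scalar sketch of the "inverse" trick: compute 1.f / std once per row and
  // multiply inside the normalization loop instead of dividing.
  #include <cmath>
  #include <cstddef>

  __attribute__((target("avx512f,avx512bw,avx512cd,avx512dq,avx512vnni")))
  void LayerNormRow(const float* x, const float* gamma, const float* beta,
                    float* out, std::size_t width, float eps) {
    float mean = 0.f;
    for (std::size_t i = 0; i < width; ++i) mean += x[i];
    mean /= width;
    float var = 0.f;
    for (std::size_t i = 0; i < width; ++i) {
      const float d = x[i] - mean;
      var += d * d;
    }
    var /= width;
    const float inv_std = 1.f / std::sqrt(var + eps);  // precomputed inverse
    for (std::size_t i = 0; i < width; ++i)
      out[i] = gamma[i] * (x[i] - mean) * inv_std + beta[i];
  }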

Shape | Marian | Marian + AVX512 | Marian + AVX512 + inverse | oneDNN
1x 3 0.0000363 0.0000370 0.0000364 0.0000453
5x 3 0.0000337 0.0000355 0.0000348 0.0000426
10x 3 0.0000346 0.0000352 0.0000344 0.0000434
20x 3 0.0000337 0.0000342 0.0000349 0.0000438
30x 3 0.0000354 0.0000376 0.0000371 0.0000434
40x 3 0.0000373 0.0000396 0.0000382 0.0000431
50x 3 0.0000381 0.0000405 0.0000393 0.0000436
60x 3 0.0000390 0.0000408 0.0000403 0.0000447
70x 3 0.0000391 0.0000376 0.0000411 0.0000440
80x 3 0.0000403 0.0000400 0.0000414 0.0000438
90x 3 0.0000399 0.0000399 0.0000392 0.0000446
100x 3 0.0000378 0.0000409 0.0000399 0.0000451
110x 3 0.0000385 0.0000414 0.0000398 0.0000446
120x 3 0.0000390 0.0000413 0.0000417 0.0000454
130x 3 0.0000389 0.0000429 0.0000420 0.0000457
140x 3 0.0000402 0.0000437 0.0000436 0.0000452
150x 3 0.0000403 0.0000442 0.0000432 0.0000461
200x 3 0.0000432 0.0000480 0.0000466 0.0000476
300x 3 0.0000476 0.0000553 0.0000533 0.0000506
400x 3 0.0000532 0.0000617 0.0000604 0.0000533
500x 3 0.0000575 0.0000694 0.0000670 0.0000570
1000x 3 0.0000826 0.0001037 0.0001029 0.0000713
2000x 3 0.0001340 0.0001730 0.0001706 0.0000988
3000x 3 0.0001818 0.0002431 0.0002402 0.0001275
4000x 3 0.0002298 0.0003116 0.0003092 0.0001554
5000x 3 0.0002777 0.0003804 0.0003814 0.0001832
16384x 3 0.0008319 0.0011868 0.0012013 0.0005041
1x 10 0.0000339 0.0000387 0.0000348 0.0000412
5x 10 0.0000336 0.0000346 0.0000345 0.0000423
10x 10 0.0000349 0.0000347 0.0000343 0.0000426
20x 10 0.0000339 0.0000369 0.0000374 0.0000423
30x 10 0.0000371 0.0000391 0.0000378 0.0000439
40x 10 0.0000378 0.0000397 0.0000395 0.0000433
50x 10 0.0000393 0.0000407 0.0000408 0.0000440
60x 10 0.0000398 0.0000411 0.0000423 0.0000444
70x 10 0.0000403 0.0000399 0.0000416 0.0000445
80x 10 0.0000411 0.0000414 0.0000398 0.0000449
90x 10 0.0000410 0.0000417 0.0000406 0.0000452
100x 10 0.0000391 0.0000424 0.0000416 0.0000457
110x 10 0.0000403 0.0000433 0.0000425 0.0000460
120x 10 0.0000404 0.0000446 0.0000433 0.0000465
130x 10 0.0000414 0.0000450 0.0000440 0.0000463
140x 10 0.0000413 0.0000460 0.0000455 0.0000467
150x 10 0.0000420 0.0000474 0.0000455 0.0000471
200x 10 0.0000457 0.0000511 0.0000497 0.0000496
300x 10 0.0000517 0.0000598 0.0000579 0.0000522
400x 10 0.0000584 0.0000697 0.0000660 0.0000560
500x 10 0.0000646 0.0000772 0.0000759 0.0000592
1000x 10 0.0000956 0.0001229 0.0001169 0.0000773
2000x 10 0.0001586 0.0002063 0.0002010 0.0001119
3000x 10 0.0002192 0.0002946 0.0002856 0.0001441
4000x 10 0.0002827 0.0003824 0.0003686 0.0001793
5000x 10 0.0003435 0.0004679 0.0004548 0.0002134
16384x 10 0.0010506 0.0015780 0.0015417 0.0006026
1x 100 0.0000348 0.0000379 0.0000405 0.0000427
5x 100 0.0000348 0.0000346 0.0000343 0.0000433
10x 100 0.0000337 0.0000350 0.0000367 0.0000436
20x 100 0.0000401 0.0000388 0.0000395 0.0000445
30x 100 0.0000411 0.0000406 0.0000393 0.0000452
40x 100 0.0000379 0.0000381 0.0000375 0.0000462
50x 100 0.0000391 0.0000399 0.0000384 0.0000466
60x 100 0.0000394 0.0000412 0.0000393 0.0000474
70x 100 0.0000422 0.0000428 0.0000408 0.0000489
80x 100 0.0000433 0.0000439 0.0000408 0.0000492
90x 100 0.0000436 0.0000445 0.0000425 0.0000500
100x 100 0.0000448 0.0000458 0.0000435 0.0000510
110x 100 0.0000467 0.0000476 0.0000448 0.0000514
120x 100 0.0000466 0.0000481 0.0000456 0.0000527
130x 100 0.0000487 0.0000501 0.0000469 0.0000529
140x 100 0.0000501 0.0000515 0.0000479 0.0000538
150x 100 0.0000503 0.0000517 0.0000494 0.0000550
200x 100 0.0000556 0.0000595 0.0000539 0.0000592
300x 100 0.0000670 0.0000708 0.0000639 0.0000678
400x 100 0.0000782 0.0000825 0.0000748 0.0000727
500x 100 0.0000898 0.0000946 0.0000857 0.0000824
1000x 100 0.0001492 0.0001591 0.0001409 0.0001263
2000x 100 0.0002653 0.0002819 0.0002536 0.0002101
3000x 100 0.0003822 0.0004043 0.0003598 0.0002898
4000x 100 0.0004926 0.0005266 0.0004686 0.0003655
5000x 100 0.0006051 0.0006524 0.0005765 0.0004505
16384x 100 0.0020228 0.0021531 0.0019262 0.0014567
1x 256 0.0000374 0.0000397 0.0000358 0.0000434
5x 256 0.0000336 0.0000409 0.0000335 0.0000434
10x 256 0.0000399 0.0000370 0.0000404 0.0000436
20x 256 0.0000423 0.0000408 0.0000400 0.0000450
30x 256 0.0000383 0.0000371 0.0000373 0.0000463
40x 256 0.0000411 0.0000394 0.0000384 0.0000469
50x 256 0.0000418 0.0000411 0.0000386 0.0000476
60x 256 0.0000431 0.0000424 0.0000407 0.0000481
70x 256 0.0000455 0.0000441 0.0000419 0.0000495
80x 256 0.0000465 0.0000456 0.0000433 0.0000496
90x 256 0.0000493 0.0000476 0.0000445 0.0000510
100x 256 0.0000502 0.0000495 0.0000457 0.0000522
110x 256 0.0000524 0.0000500 0.0000467 0.0000534
120x 256 0.0000535 0.0000517 0.0000475 0.0000535
130x 256 0.0000554 0.0000534 0.0000493 0.0000549
140x 256 0.0000573 0.0000551 0.0000512 0.0000553
150x 256 0.0000597 0.0000570 0.0000521 0.0000568
200x 256 0.0000679 0.0000639 0.0000581 0.0000631
300x 256 0.0000850 0.0000826 0.0000713 0.0000709
400x 256 0.0001040 0.0000962 0.0000854 0.0000832
500x 256 0.0001231 0.0001130 0.0001000 0.0000967
1000x 256 0.0002105 0.0001881 0.0001694 0.0001590
2000x 256 0.0003913 0.0003506 0.0003000 0.0002847
3000x 256 0.0005685 0.0005093 0.0004264 0.0004101
4000x 256 0.0007300 0.0006509 0.0005445 0.0005121
5000x 256 0.0009223 0.0008244 0.0006968 0.0006512
16384x 256 0.0036732 0.0032559 0.0029458 0.0025328
1x 512 0.0000366 0.0000392 0.0000344 0.0000438
5x 512 0.0000394 0.0000354 0.0000346 0.0000441
10x 512 0.0000408 0.0000401 0.0000381 0.0000448
20x 512 0.0000444 0.0000435 0.0000389 0.0000453
30x 512 0.0000410 0.0000403 0.0000386 0.0000474
40x 512 0.0000446 0.0000429 0.0000403 0.0000481
50x 512 0.0000476 0.0000463 0.0000428 0.0000496
60x 512 0.0000507 0.0000478 0.0000440 0.0000510
70x 512 0.0000539 0.0000502 0.0000458 0.0000518
80x 512 0.0000577 0.0000538 0.0000476 0.0000537
90x 512 0.0000602 0.0000558 0.0000504 0.0000572
100x 512 0.0000616 0.0000582 0.0000518 0.0000565
110x 512 0.0000667 0.0000607 0.0000556 0.0000600
120x 512 0.0000689 0.0000642 0.0000576 0.0000612
130x 512 0.0000735 0.0000654 0.0000587 0.0000616
140x 512 0.0000744 0.0000695 0.0000607 0.0000661
150x 512 0.0000759 0.0000695 0.0000636 0.0000653
200x 512 0.0000913 0.0000831 0.0000738 0.0000793
300x 512 0.0001299 0.0001136 0.0001037 0.0001073
400x 512 0.0001585 0.0001409 0.0001313 0.0001338
500x 512 0.0001883 0.0001598 0.0001498 0.0001580
1000x 512 0.0003369 0.0002909 0.0002665 0.0002632
2000x 512 0.0006396 0.0005457 0.0004962 0.0004790
3000x 512 0.0009528 0.0008094 0.0007397 0.0006964
4000x 512 0.0013894 0.0011007 0.0010205 0.0009181
5000x 512 0.0018264 0.0015710 0.0015030 0.0012737
16384x 512 0.0074474 0.0067132 0.0065738 0.0060986
1x 1024 0.0000347 0.0000357 0.0000356 0.0000445
5x 1024 0.0000391 0.0000385 0.0000362 0.0000450
10x 1024 0.0000383 0.0000410 0.0000393 0.0000459
20x 1024 0.0000439 0.0000408 0.0000384 0.0000493
30x 1024 0.0000489 0.0000458 0.0000417 0.0000504
40x 1024 0.0000557 0.0000516 0.0000448 0.0000527
50x 1024 0.0000587 0.0000545 0.0000490 0.0000546
60x 1024 0.0000650 0.0000586 0.0000516 0.0000610
70x 1024 0.0000707 0.0000617 0.0000563 0.0000616
80x 1024 0.0000768 0.0000683 0.0000618 0.0000659
90x 1024 0.0000834 0.0000735 0.0000695 0.0000716
100x 1024 0.0000886 0.0000759 0.0000708 0.0000757
110x 1024 0.0000949 0.0000833 0.0000779 0.0000826
120x 1024 0.0001031 0.0000882 0.0000841 0.0000858
130x 1024 0.0001088 0.0000956 0.0000887 0.0000903
140x 1024 0.0001156 0.0001010 0.0000923 0.0000969
150x 1024 0.0001152 0.0001057 0.0000978 0.0001028
200x 1024 0.0001450 0.0001328 0.0001257 0.0001313
300x 1024 0.0002082 0.0001793 0.0001721 0.0001762
400x 1024 0.0002650 0.0002286 0.0002192 0.0002239
500x 1024 0.0003157 0.0002699 0.0002596 0.0002592
1000x 1024 0.0005968 0.0005075 0.0004867 0.0004764
2000x 1024 0.0012529 0.0010099 0.0009831 0.0009186
3000x 1024 0.0021209 0.0017632 0.0018832 0.0017245
4000x 1024 0.0029924 0.0025091 0.0027886 0.0024449
5000x 1024 0.0040340 0.0034140 0.0037645 0.0033816
16384x 1024 0.0149200 0.0130993 0.0142042 0.0132403

@bartekkuncer
Contributor Author

@kpuatamazon which version of oneDNN did you use for the benchmark? There is a change boosting the performance of layer normalization that is going to be included in the oneDNN v2.1 release.

@kpuatamazon
Contributor

I used whatever was in your pull request: c4b6bce

@bartekkuncer force-pushed the layernormmaster branch 5 times, most recently from 4fe9420 to 6d0428c on January 11, 2021 at 13:15
@bartekkuncer changed the title from "[WIP] Integrate oneDNN layer normalization implementation" to "Integrate oneDNN layer normalization implementation" on Jan 11, 2021
@lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Jan 11, 2021
@mxnet-bot

Jenkins CI successfully triggered: [centos-gpu]

@mseth10 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Aug 19, 2021
Review thread on src/operator/nn/mkldnn/mkldnn_layer_norm.cc (outdated, resolved)
@mseth10 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-review and pr-awaiting-testing labels on Aug 20, 2021
@mseth10 added the pr-awaiting-testing label and removed the pr-work-in-progress label on Aug 20, 2021
@mseth10 added the pr-work-in-progress label and removed the pr-awaiting-testing label on Aug 20, 2021
@szha
Member

szha commented Aug 22, 2021

@mxnet-bot run ci [website]

@mxnet-bot

Jenkins CI successfully triggered: [website]

@mseth10 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Aug 22, 2021
@akarbown merged commit 695ba2e into apache:master on Aug 24, 2021
KexinFeng pushed a commit to KexinFeng/incubator-mxnet that referenced this pull request Aug 27, 2021
…e#19562)

* [operator] Integrate oneDNN layer normalization implementation

* change sizeof(float) to mshadow_sizeof(inputs[layernorm::kBwdGamma].dtype())

* remove eps from key and unify layernorm_fwd_t/mkldnn::layer_normalization_forward

* add author
bgawrych pushed a commit to bgawrych/incubator-mxnet that referenced this pull request Aug 31, 2021
…e#19562)

* [operator] Integrate oneDNN layer normalization implementation

* change sizeof(float) to mshadow_sizeof(inputs[layernorm::kBwdGamma].dtype())

* remove eps from key and unify layernorm_fwd_t/mkldnn::layer_normalization_forward

* add author
@bartekkuncer deleted the layernormmaster branch on April 17, 2023 at 12:44
Labels: MKLDNN, pr-awaiting-review