
Added tensorization for avx2 based gemm. #3982

Merged
2 commits merged into apache:master on Sep 25, 2019

Conversation

kimishpatel
Contributor

Summary:
Similar to the existing AVX-512 tensorization, which reduces data (1x4) and kernel (16x4) to output (1x16), this PR introduces the same reduction using AVX2 tensorization. It keeps the same API as the AVX-512 version so that a new memory layout for the weights does not have to be introduced.
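
As added context, here is a NumPy sketch of what the data:1x4 x kernel:16x4 -> output:1x16 reduction computes. It is illustrative only; the variable names are made up and this is not code from the PR.

import numpy as np

# Data is a 1x4 vector of uint8, the kernel tile is 16x4 int8.
data = np.random.randint(0, 256, size=(4,), dtype=np.uint8)
kernel = np.random.randint(-128, 128, size=(16, 4), dtype=np.int8)

# Each of the 16 output lanes is a 4-element dot product accumulated in int32.
output = (kernel.astype(np.int32) @ data.astype(np.int32)).astype(np.int32)
assert output.shape == (16,)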

Test Plan:
on avx2 machine:
python tests/python/contrib/test_gemm_avx2_acc32.py

Reviewers:

Subscribers:

Tasks:

Tags:

Thanks for contributing to TVM! Please refer to the guidelines at https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from reviewers.

@kimishpatel
Contributor Author

Depends on this PR: #3981

@anijain2305 (Contributor) left a comment


Thanks for the contribution.

I have left a couple of comments. But I have a high-level question: do we see a performance improvement over FP32 using this?

Last time I checked, FP32 VFMA is fully pipelined with a latency of 5 cycles. Int16 MADDs had a longer latency, and it was somewhat unclear whether they are pipelined. Basically, Intel engineers put a lot of effort into the FP32 hardware earlier.

vec_b_1 = ins[1].vload([8, 0], "int8x32")
vec_one = tvm.const(1, "int16x16")
pair_reduction_0 = tvm.call_llvm_intrin('int16x16',
'llvm.x86.avx2.pmadd.ub.sw',
Contributor

Align (and other places as well)

                                        pair_reduction_1, vec_one)
if index == 0:
    ib.emit(outs[0].vstore([0], quad_reduction_0))
    ib.emit(outs[0].vstore([8], quad_reduction_1))
Contributor

Why have 2 quad_reductions in assembly instead of unrolling in the schedule?

Contributor

Oh, maybe you would then have to change num_lanes, etc.

If that's the case, @yzhliu is working on making it an argument. Maybe we can use that to simplify this.

Contributor Author

@anijain2305, correct. I wanted to be able to use the same memory layout, which can be tensorized for both AVX2 and AVX-512.
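
For readers following the snippets above, here is a hedged sketch of how such an AVX2 intrinsic body can be written against the 2019-era TVM APIs visible in the diff (tvm.call_llvm_intrin, Buffer.vload/vstore, tvm.ir_builder). It is a sketch of the approach, not the exact code in this PR, and the reset-vs-accumulate handling is simplified.

# Sketch only: the 16x4 int8 kernel tile is consumed as two 8x4 halves because
# AVX2 registers hold 32 int8 lanes; each half yields 8 int32 outputs.
vec_a = ins[0].vload([0], "uint8x4")                      # 1x4 uint8 data
a_i32 = tvm.call_pure_intrin("int32", "reinterpret", vec_a)
vec_a_bcast = tvm.call_pure_intrin("int8x32", "reinterpret",
                                   a_i32.astype("int32x8"))  # broadcast the 4 data bytes
vec_one = tvm.const(1, "int16x16")
for half in range(2):
    vec_b = ins[1].vload([8 * half, 0], "int8x32")        # 8x4 int8 kernel half
    # vpmaddubsw: multiply u8 x s8 pairs and add adjacent pairs -> 16 x int16
    pair_reduction = tvm.call_llvm_intrin(
        "int16x16", "llvm.x86.avx2.pmadd.ub.sw",
        tvm.const(0, "uint32"), vec_a_bcast, vec_b)
    # vpmaddwd with a vector of ones: add adjacent int16 pairs -> 8 x int32
    quad_reduction = tvm.call_llvm_intrin(
        "int32x8", "llvm.x86.avx2.pmadd.wd",
        tvm.const(0, "uint32"), pair_reduction, vec_one)
    ib.emit(outs[0].vstore([8 * half],
                           quad_reduction + outs[0].vload([8 * half], "int32x8")))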

@kimishpatel
Contributor Author

@anijain2305, that's a good question. I haven't checked the performance of this against the FP32 model we are interested in. I will report the numbers soon. But we do get a lot of speedup on Skylake with AVX-512, and even assuming we don't gain anything from the AVX2 instructions themselves due to the latency issues you mention, there should still be some performance left from the memory-bandwidth perspective.
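
As a rough illustration of the memory-bandwidth point, with a hypothetical weight matrix shape (not a measurement from this PR):

n, k = 1024, 1024                # example weight matrix shape, assumed for illustration
fp32_weight_bytes = n * k * 4    # FP32 weights: 4 bytes per element
int8_weight_bytes = n * k * 1    # quantized int8 weights: 1 byte per element
print(fp32_weight_bytes / int8_weight_bytes)   # 4.0 -> int8 weights move 4x fewer bytes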

@anijain2305
Contributor

@anijain2305, that's a good question. I haven't checked the performance of this against the FP32 model we are interested in. I will report the numbers soon. But we do get a lot of speedup on Skylake with AVX-512, and even assuming we don't gain anything from the AVX2 instructions themselves due to the latency issues you mention, there should still be some performance left from the memory-bandwidth perspective.

OK. It might be useful to do some paper calculations to see what performance we can get theoretically: https://www.agner.org/optimize/instruction_tables.pdf
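
As a rough example of such a paper calculation, here is a small sketch. The reciprocal throughputs below are placeholders to be filled in from Agner Fog's tables for the actual target core; they are assumptions, not authoritative numbers.

# Placeholder reciprocal throughputs (cycles per instruction); look them up per core.
FMA_PS_TPUT = 0.5        # assumed: 8-lane FP32 FMA (vfmadd*ps ymm)
PMADDUBSW_TPUT = 0.5     # assumed: vpmaddubsw ymm
PMADDWD_TPUT = 0.5       # assumed: vpmaddwd ymm

fp32_macs_per_cycle = 8 / FMA_PS_TPUT                       # 8 MACs per FP32 FMA
int8_macs_per_cycle = 32 / (PMADDUBSW_TPUT + PMADDWD_TPUT)  # 32 int8 MACs per pmaddubsw+pmaddwd pair
print(fp32_macs_per_cycle, int8_macs_per_cycle)             # 16.0 vs 32.0 under these assumptions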

@kimishpatel force-pushed the tensorization_for_avx2 branch 2 times, most recently from 5c61a2c to 841398d on September 24, 2019 02:56
@kimishpatel
Contributor Author

@anijain2305, do you mind if I land this? I would like to use it for some of the stuff we are working on.

X = tvm.placeholder((m, k), name='X', dtype="uint8")
W = tvm.placeholder((n, k), name='W', dtype="int8")

#peak = 280 // This needs measurement and description of what this number is for avx2 machine.
Contributor

Please remove the comments
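
For context, the placeholders quoted above are the GEMM inputs of the test. Here is a hedged sketch (assumed, not copied from test_gemm_avx2_acc32.py) of how such placeholders are typically combined into a uint8 x int8 -> int32 GEMM with the TVM API of that era:

k_axis = tvm.reduce_axis((0, k), name="k")
# Accumulate in int32 to match the acc32 tensorization.
Y = tvm.compute((m, n),
                lambda i, j: tvm.sum(X[i, k_axis].astype("int32") *
                                     W[j, k_axis].astype("int32"),
                                     axis=k_axis),
                name="Y")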

@anijain2305
Contributor

@anijain2305, do you mind if I land this? I would like to use it for some of the stuff we are working on.

Yes, please go ahead. The changes LGTM. I would appreciate it if you could later update with the performance comparison.

@kimishpatel
Contributor Author

Thanks @anijain2305, will do so.

@kimishpatel
Contributor Author

Just as an FYI: CI is failing on the test I added, as it depends on PR #3981. I need to land that first.

* Added tensorization for avx2 based gemm.

Summary:
Tensorized the same region as avx512. Namely, it produces 16x1 int32 results by
using two sets of AVX2 instructions to perform the reduction of an 8x4 int8
kernel with 1x4 data.

Test Plan:
on avx2 machine:
python tests/python/contrib/test_gemm_avx2_acc32.py

* Fix lint errors. Removed commented out code.
@kimishpatel force-pushed the tensorization_for_avx2 branch from 841398d to c6bdd3c on September 25, 2019 00:22
@kimishpatel
Contributor Author

@anijain2305, would you mind merging this, please? I don't think I can, plus it seems that my lint changes and addressing your comments might have voided your approval.

@anijain2305
Contributor

I don't have merge permissions :)
@tqchen can you please take a look?

@tqchen tqchen merged commit 23727eb into apache:master Sep 25, 2019
@tqchen
Member

tqchen commented Sep 25, 2019

Thanks @kimishpatel @anijain2305

tqchen added a commit that referenced this pull request Sep 25, 2019
tqchen added a commit that referenced this pull request Sep 25, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019