
Added tensorization for avx2 based gemm. #3982

Merged
2 commits merged into apache:master on Sep 25, 2019

Conversation

kimishpatel
Contributor

Summary:
Similar to the existing AVX-512 tensorization, which reduces data (1x4) and kernel (16x4) to output (1x16), this PR introduces the same reduction using AVX2 tensorization. It keeps the same API as the AVX-512 version so that a new memory layout for the weights does not have to be introduced.
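
As added context, here is a NumPy sketch of what the data:1x4 x kernel:16x4 -> output:1x16 reduction computes. It is illustrative only; the variable names are made up and this is not code from the PR.

import numpy as np

# Data is a 1x4 vector of uint8, the kernel tile is 16x4 int8.
data = np.random.randint(0, 256, size=(4,), dtype=np.uint8)
kernel = np.random.randint(-128, 128, size=(16, 4), dtype=np.int8)

# Each of the 16 output lanes is a 4-element dot product accumulated in int32.
output = (kernel.astype(np.int32) @ data.astype(np.int32)).astype(np.int32)
assert output.shape == (16,)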

Test Plan:
on avx2 machine:
python tests/python/contrib/test_gemm_avx2_acc32.py

Reviewers:

Subscribers:

Tasks:

Tags:

Thanks for contributing to TVM! Please refer to the guidelines at https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from reviewers.

@kimishpatel
Contributor Author

Depends on this PR: #3981

@anijain2305 (Contributor) left a comment


Thanks for the contribution.

I have left a couple of comments. But I have a high-level question: do we see a performance improvement over FP32 using this?

Last time I checked, FP32 VFMA is fully pipelined with a latency of 5 cycles. Int16 MADDs had a longer latency, and it was somewhat unclear whether they are pipelined. Basically, Intel engineers put a lot of effort into the FP32 hardware earlier.

vec_b_1 = ins[1].vload([8, 0], "int8x32")
vec_one = tvm.const(1, "int16x16")
pair_reduction_0 = tvm.call_llvm_intrin('int16x16',
'llvm.x86.avx2.pmadd.ub.sw',
Contributor

Align (and other places as well)

                                        pair_reduction_1, vec_one)
if index == 0:
    ib.emit(outs[0].vstore([0], quad_reduction_0))
    ib.emit(outs[0].vstore([8], quad_reduction_1))
Contributor

Why have 2 quad_reductions in assembly instead of unrolling in the schedule?

Contributor

Oh, maybe you would then have to change num_lanes, etc.

If that's the case, @yzhliu is working on making it an argument. Maybe we can use that to simplify this.

Contributor Author

@anijain2305, correct. I wanted to be able to use the same memory layout, which can be tensorized for both AVX2 and AVX-512.
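
For readers following the snippets above, here is a hedged sketch of how such an AVX2 intrinsic body can be written against the 2019-era TVM APIs visible in the diff (tvm.call_llvm_intrin, Buffer.vload/vstore, tvm.ir_builder). It is a sketch of the approach, not the exact code in this PR, and the reset-vs-accumulate handling is simplified.

# Sketch only: the 16x4 int8 kernel tile is consumed as two 8x4 halves because
# AVX2 registers hold 32 int8 lanes; each half yields 8 int32 outputs.
vec_a = ins[0].vload([0], "uint8x4")                      # 1x4 uint8 data
a_i32 = tvm.call_pure_intrin("int32", "reinterpret", vec_a)
vec_a_bcast = tvm.call_pure_intrin("int8x32", "reinterpret",
                                   a_i32.astype("int32x8"))  # broadcast the 4 data bytes
vec_one = tvm.const(1, "int16x16")
for half in range(2):
    vec_b = ins[1].vload([8 * half, 0], "int8x32")        # 8x4 int8 kernel half
    # vpmaddubsw: multiply u8 x s8 pairs and add adjacent pairs -> 16 x int16
    pair_reduction = tvm.call_llvm_intrin(
        "int16x16", "llvm.x86.avx2.pmadd.ub.sw",
        tvm.const(0, "uint32"), vec_a_bcast, vec_b)
    # vpmaddwd with a vector of ones: add adjacent int16 pairs -> 8 x int32
    quad_reduction = tvm.call_llvm_intrin(
        "int32x8", "llvm.x86.avx2.pmadd.wd",
        tvm.const(0, "uint32"), pair_reduction, vec_one)
    ib.emit(outs[0].vstore([8 * half],
                           quad_reduction + outs[0].vload([8 * half], "int32x8")))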

@kimishpatel
Contributor Author

@anijain2305, that's a good question. I haven't checked the performance of this against the FP32 model we are interested in. I will report the numbers soon. But we do get a lot of speedup on Skylake with AVX-512, and even assuming we don't gain anything from the AVX2 instructions themselves due to the latency issues you mention, there should still be some performance left from the memory-bandwidth perspective.
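
As a rough illustration of the memory-bandwidth point, with a hypothetical weight matrix shape (not a measurement from this PR):

n, k = 1024, 1024                # example weight matrix shape, assumed for illustration
fp32_weight_bytes = n * k * 4    # FP32 weights: 4 bytes per element
int8_weight_bytes = n * k * 1    # quantized int8 weights: 1 byte per element
print(fp32_weight_bytes / int8_weight_bytes)   # 4.0 -> int8 weights move 4x fewer bytes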

@anijain2305
Contributor

@anijain2305, that's a good question. I haven't checked the performance of this against the FP32 model we are interested in. I will report the numbers soon. But we do get a lot of speedup on Skylake with AVX-512, and even assuming we don't gain anything from the AVX2 instructions themselves due to the latency issues you mention, there should still be some performance left from the memory-bandwidth perspective.

OK. It might be useful to do some paper calculations to see what performance we can get theoretically: https://www.agner.org/optimize/instruction_tables.pdf
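
As a rough example of such a paper calculation, here is a small sketch. The reciprocal throughputs below are placeholders to be filled in from Agner Fog's tables for the actual target core; they are assumptions, not authoritative numbers.

# Placeholder reciprocal throughputs (cycles per instruction); look them up per core.
FMA_PS_TPUT = 0.5        # assumed: 8-lane FP32 FMA (vfmadd*ps ymm)
PMADDUBSW_TPUT = 0.5     # assumed: vpmaddubsw ymm
PMADDWD_TPUT = 0.5       # assumed: vpmaddwd ymm

fp32_macs_per_cycle = 8 / FMA_PS_TPUT                       # 8 MACs per FP32 FMA
int8_macs_per_cycle = 32 / (PMADDUBSW_TPUT + PMADDWD_TPUT)  # 32 int8 MACs per pmaddubsw+pmaddwd pair
print(fp32_macs_per_cycle, int8_macs_per_cycle)             # 16.0 vs 32.0 under these assumptions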

@kimishpatel force-pushed the tensorization_for_avx2 branch 2 times, most recently from 5c61a2c to 841398d on September 24, 2019 02:56
@kimishpatel
Contributor Author

@anijain2305, do you mind if I land this? I would like to use it for some of the stuff we are working on.

X = tvm.placeholder((m, k), name='X', dtype="uint8")
W = tvm.placeholder((n, k), name='W', dtype="int8")

#peak = 280 // This needs measurement and description of what this number is for avx2 machine.
Contributor

Please remove the comments
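
For context, the placeholders quoted above are the GEMM inputs of the test. Here is a hedged sketch (assumed, not copied from test_gemm_avx2_acc32.py) of how such placeholders are typically combined into a uint8 x int8 -> int32 GEMM with the TVM API of that era:

k_axis = tvm.reduce_axis((0, k), name="k")
# Accumulate in int32 to match the acc32 tensorization.
Y = tvm.compute((m, n),
                lambda i, j: tvm.sum(X[i, k_axis].astype("int32") *
                                     W[j, k_axis].astype("int32"),
                                     axis=k_axis),
                name="Y")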

@anijain2305
Contributor

@anijain2305, do you mind if I land this? I would like to use it for some of the stuff we are working on.

Yes, please go ahead. The changes LGTM. I would appreciate it if you could later update with the performance comparison.

@kimishpatel
Contributor Author

Thanks @anijain2305, will do so.

@kimishpatel
Contributor Author

Just as an FYI: CI is failing on the test I added, as it depends on PR #3981. I need to land that first.

* Added tensorization for avx2 based gemm.

Summary:
Tensorized the same region as avx512. Namely, it produces 16x1 int32 results by
using two sets of AVX2 instructions to perform the reduction of an 8x4 int8
kernel with 1x4 data.

Test Plan:
on avx2 machine:
python tests/python/contrib/test_gemm_avx2_acc32.py

* Fix lint errors. Removed commented out code.
@kimishpatel force-pushed the tensorization_for_avx2 branch from 841398d to c6bdd3c on September 25, 2019 00:22
@kimishpatel
Contributor Author

@anijain2305, would you mind merging this, please? I don't think I can, plus it seems that my lint changes and addressing your comments might have voided your approval.

@anijain2305
Contributor

I don't have merge permissions :)
@tqchen can you please take a look?

@tqchen tqchen merged commit 23727eb into apache:master Sep 25, 2019
@tqchen
Member

tqchen commented Sep 25, 2019

Thanks @kimishpatel @anijain2305

tqchen added a commit that referenced this pull request Sep 25, 2019
tqchen added a commit that referenced this pull request Sep 25, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019