
[MetaSchedule][Hexagon] Improve vectorization for standalone elementwise op #14408

Merged 1 commit into apache:main on Mar 28, 2023

Conversation

ibsidorenko
Contributor

Motivation:
For standalone elementwise operations (add, sub, etc.), MetaSchedule generates code with poor performance on some input tensor shapes due to the lack of vector code. The current implementation cannot vectorize when the innermost loop's extent is not a multiple of the vector length.

What was done:
Core changes: the pass checks the current loop nest; if all loops are "simple" (no annotations, no thread bindings, no reduce axes), it does the following:

  1. Fuse all loops into a single loop.
  2. Split the fused loop into two parts, inner and outer, where the split factor for the inner loop equals the `max_vectorize_extent` MetaSchedule parameter.
  3. Parallelize the outer loop and vectorize the inner loop.
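The steps above can be sketched in plain Python as a minimal illustration of the index math (this is not the actual TVM pass; the vector length `128` and the helper name `fuse_and_split` are example assumptions): after fusing, the single iteration space often divides evenly by the vector length even when the original innermost extent did not.

```python
import math

def fuse_and_split(shape, max_vectorize_extent):
    """Mimic the transform: fuse all 'simple' loops into one,
    then split the fused loop by the vectorization factor.
    NOTE: illustrative sketch, not the real TVM schedule pass."""
    # Step 1: fuse all loops into a single loop.
    fused_extent = math.prod(shape)
    # Step 2: split into (outer, inner); inner extent = max_vectorize_extent.
    outer = math.ceil(fused_extent / max_vectorize_extent)
    inner = max_vectorize_extent
    # Step 3 (not modeled here) would mark `outer` parallel and
    # `inner` vectorized.
    return outer, inner

# Example shape from the measurements table; 128 is a hypothetical
# vector length (the real value comes from the MetaSchedule target).
shape = (1, 8, 56, 56, 32)
outer, inner = fuse_and_split(shape, 128)
# The fused extent (802816) divides evenly by 128, so the inner loop
# can be fully vectorized even though the original innermost extent
# (32) is smaller than the vector length.
assert outer * inner == math.prod(shape)
```

When the fused extent is not divisible by the split factor, the outer extent is rounded up and the schedule relies on predication or a tail loop; the key point is that fusing first makes full-length vector iterations the common case.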

Performance measurement:
Measurements were taken on a Qualcomm Snapdragon 888. As expected, rows 1 and 2 show a significant performance boost, while rows 3 and 4 are unchanged.

| N | op      | Dtype | Shape            | Before fix, ms | After fix, ms | Speedup |
|---|---------|-------|------------------|----------------|---------------|---------|
| 1 | add     | uint8 | 1, 8, 56, 56, 32 | 1.264          | 0.167         | 7.5x    |
| 2 | qnn.add | uint8 | 1, 8, 56, 56, 32 | 2.213          | 0.336         | 6.6x    |
| 3 | add     | int32 | 1, 8, 56, 56, 32 | 0.161          | 0.150         | 1.07x   |
| 4 | seq*    | uint8 | 1, 64, 56, 56    | 2.634          | 2.679         | 0.98x   |

seq*: a test of the op sequence qnn.conv2d + bias_add + qnn.requantize, with weights shape [256, 64, 1, 1]


tvm-bot commented Mar 27, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: metaschedule, hexagon. See #10317 for details.

Generated by tvm-bot

@ibsidorenko
Contributor Author

@tvm-bot rerun

@masahi masahi merged commit 14ddb37 into apache:main Mar 28, 2023
@ibsidorenko ibsidorenko deleted the ms-vectorizer branch March 28, 2023 12:12