
[Hexagon] Create test examples to show parallelization #12654

Merged: 4 commits into apache:main on Sep 19, 2022
Conversation

@nverke (Contributor) commented Aug 30, 2022

Background:

These tests show one way to attain speedup for Hexagon workloads through parallelism. They also show the speedup attained by various workloads in the best configuration found (outer loop split into 4). This should give users an understanding of what to do and what to expect when parallelizing Hexagon workloads.
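
For reference, here is a minimal sketch of the scheduling idea described above, assuming a simple element-wise add workload; the buffer names, dtype, and sizes are illustrative and not taken from the test code in this PR:

```python
import tvm
from tvm import te


def schedule_parallel_add(num_ops: int, split_factor: int = 4):
    """Split the outer loop by `split_factor` and mark it parallel."""
    # Hypothetical element-wise add workload; "A", "B", "C" are illustrative names.
    a = te.placeholder((num_ops,), dtype="int32", name="A")
    b = te.placeholder((num_ops,), dtype="int32", name="B")
    c = te.compute((num_ops,), lambda i: a[i] + b[i], name="C")

    sch = tvm.tir.Schedule(te.create_prim_func([a, b, c]))
    block = sch.get_block("C")
    (i,) = sch.get_loops(block)
    # Split into `split_factor` outer chunks (4 was the best configuration found)
    # and run the outer loop in parallel.
    outer, _inner = sch.split(i, factors=[split_factor, None])
    sch.parallel(outer)
    return sch
```

The tests compare a parallel schedule along these lines against the same workload run on a single thread, which is where the speedup numbers below come from.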

Here are some tables of the runtimes on various workloads and HDKs.

HVX runtimes on 8gen1 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| vrmpy | 128 | 0.00112 | 0.002242 | 0.5x |
| vrmpy | 256 | 0.001797 | 0.002505 | 0.72x |
| vrmpy | 512 | 0.003095 | 0.00309 | 1.0x |
| vrmpy | 1024 | 0.005732 | 0.00422 | 1.36x |
| vrmpy | 2048 | 0.011118 | 0.006515 | 1.71x |
| vrmpy | 4096 | 0.140888 | 0.056102 | 2.51x |
| vrmpy | 8192 | 0.265677 | 0.100795 | 2.64x |
| vrmpy | 16384 | 0.877999 | 0.286085 | 3.07x |
| vmpy | 128 | 0.001371 | 0.002348 | 0.58x |
| vmpy | 256 | 0.002236 | 0.002721 | 0.82x |
| vmpy | 512 | 0.003988 | 0.003511 | 1.14x |
| vmpy | 1024 | 0.007496 | 0.00504 | 1.49x |
| vmpy | 2048 | 0.018755 | 0.012325 | 1.52x |
| vmpy | 4096 | 0.172784 | 0.066596 | 2.59x |
| vmpy | 8192 | 0.357293 | 0.12787 | 2.79x |
| vmpy | 16384 | 1.223959 | 0.367546 | 3.33x |
| vadd | 128 | 0.001349 | 0.002355 | 0.57x |
| vadd | 256 | 0.002194 | 0.002685 | 0.82x |
| vadd | 512 | 0.003968 | 0.003543 | 1.12x |
| vadd | 1024 | 0.007491 | 0.005024 | 1.49x |
| vadd | 2048 | 0.018481 | 0.012329 | 1.5x |
| vadd | 4096 | 0.172362 | 0.067368 | 2.56x |
| vadd | 8192 | 0.353322 | 0.130838 | 2.7x |
| vadd | 16384 | 1.215925 | 0.368648 | 3.3x |

HVX runtimes on 888 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| vrmpy | 128 | 0.001157 | 0.002261 | 0.51x |
| vrmpy | 256 | 0.00188 | 0.003152 | 0.6x |
| vrmpy | 512 | 0.003201 | 0.00379 | 0.84x |
| vrmpy | 1024 | 0.005942 | 0.005273 | 1.13x |
| vrmpy | 2048 | 0.011579 | 0.009932 | 1.17x |
| vrmpy | 4096 | 0.258623 | 0.083141 | 3.11x |
| vrmpy | 8192 | 0.509286 | 0.172014 | 2.96x |
| vrmpy | 16384 | 0.979687 | 0.333655 | 2.94x |
| vmpy | 128 | 0.00142 | 0.002941 | 0.48x |
| vmpy | 256 | 0.002289 | 0.003374 | 0.68x |
| vmpy | 512 | 0.004115 | 0.004397 | 0.94x |
| vmpy | 1024 | 0.007751 | 0.006372 | 1.22x |
| vmpy | 2048 | 0.025335 | 0.01625 | 1.56x |
| vmpy | 4096 | 0.355102 | 0.116926 | 3.04x |
| vmpy | 8192 | 0.62289 | 0.214158 | 2.91x |
| vmpy | 16384 | 1.239413 | 0.40535 | 3.06x |
| vadd | 128 | 0.00141 | 0.002982 | 0.47x |
| vadd | 256 | 0.002307 | 0.003473 | 0.66x |
| vadd | 512 | 0.004115 | 0.004371 | 0.94x |
| vadd | 1024 | 0.007767 | 0.006365 | 1.22x |
| vadd | 2048 | 0.025241 | 0.01564 | 1.61x |
| vadd | 4096 | 0.357479 | 0.11579 | 3.09x |
| vadd | 8192 | 0.646734 | 0.218949 | 2.95x |
| vadd | 16384 | 1.243888 | 0.400593 | 3.11x |

Scalar runtimes on 8gen1 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| add | 128 | 0.000961 | 0.002217 | 0.43x |
| add | 256 | 0.001477 | 0.00277 | 0.53x |
| add | 512 | 0.00255 | 0.003356 | 0.76x |
| add | 1024 | 0.005959 | 0.005371 | 1.11x |
| add | 2048 | 0.015845 | 0.008645 | 1.83x |
| add | 4096 | 0.030957 | 0.014782 | 2.09x |
| add | 8192 | 0.058772 | 0.027117 | 2.17x |
| add | 16384 | 0.115732 | 0.050325 | 2.3x |
| multiply | 128 | 0.00144 | 0.002561 | 0.56x |
| multiply | 256 | 0.002457 | 0.003177 | 0.77x |
| multiply | 512 | 0.004462 | 0.004707 | 0.95x |
| multiply | 1024 | 0.009583 | 0.007716 | 1.24x |
| multiply | 2048 | 0.022285 | 0.013335 | 1.67x |
| multiply | 4096 | 0.043851 | 0.023989 | 1.83x |
| multiply | 8192 | 0.085892 | 0.047104 | 1.82x |
| multiply | 16384 | 0.152309 | 0.086548 | 1.76x |
| sub | 128 | 0.000981 | 0.002077 | 0.47x |
| sub | 256 | 0.001482 | 0.002512 | 0.59x |
| sub | 512 | 0.002549 | 0.003345 | 0.76x |
| sub | 1024 | 0.006042 | 0.005565 | 1.09x |
| sub | 2048 | 0.015846 | 0.008672 | 1.83x |
| sub | 4096 | 0.030947 | 0.014812 | 2.09x |
| sub | 8192 | 0.059008 | 0.027133 | 2.17x |
| sub | 16384 | 0.115776 | 0.049741 | 2.33x |

Scalar runtimes on 888 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| add | 128 | 0.001019 | 0.002461 | 0.41x |
| add | 256 | 0.001564 | 0.002714 | 0.58x |
| add | 512 | 0.002643 | 0.003411 | 0.77x |
| add | 1024 | 0.006376 | 0.005649 | 1.13x |
| add | 2048 | 0.016413 | 0.009117 | 1.8x |
| add | 4096 | 0.032022 | 0.015408 | 2.08x |
| add | 8192 | 0.061622 | 0.028349 | 2.17x |
| add | 16384 | 0.114904 | 0.051258 | 2.24x |
| multiply | 128 | 0.001515 | 0.002529 | 0.6x |
| multiply | 256 | 0.002533 | 0.003264 | 0.78x |
| multiply | 512 | 0.00461 | 0.004697 | 0.98x |
| multiply | 1024 | 0.010024 | 0.008115 | 1.24x |
| multiply | 2048 | 0.02271 | 0.016956 | 1.34x |
| multiply | 4096 | 0.044663 | 0.025626 | 1.74x |
| multiply | 8192 | 0.08593 | 0.050227 | 1.71x |
| multiply | 16384 | 0.156369 | 0.082328 | 1.9x |
| sub | 128 | 0.001005 | 0.002229 | 0.45x |
| sub | 256 | 0.001567 | 0.002598 | 0.6x |
| sub | 512 | 0.002646 | 0.003399 | 0.78x |
| sub | 1024 | 0.006413 | 0.006804 | 0.94x |
| sub | 2048 | 0.016408 | 0.011015 | 1.49x |
| sub | 4096 | 0.032029 | 0.017999 | 1.78x |
| sub | 8192 | 0.061626 | 0.028435 | 2.17x |
| sub | 16384 | 0.118424 | 0.052493 | 2.26x |

From the tables above we can see that a single thread runs faster for smaller numbers of operations, since the L2 cache can handle the entire request quickly. Additionally, significant performance degradation occurs once the inputs and outputs can no longer all fit in L2. Both effects should be kept in mind when parallelizing workloads.
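
As a rough back-of-the-envelope check of that explanation, the sketch below estimates the working set as the input size grows. The per-operation sizes are assumptions for illustration (128-byte HVX vectors, two inputs plus one output), not values taken from this PR:

```python
# Working-set estimate for the L2 discussion above. All sizes here are
# assumptions for illustration, not values from the PR's test code.
BYTES_PER_VECTOR = 128  # assumed HVX vector width in bytes
NUM_BUFFERS = 3         # assumed: two input buffers plus one output buffer

for num_ops in (1024, 2048, 4096, 8192):
    working_set_kb = num_ops * BYTES_PER_VECTOR * NUM_BUFFERS / 1024
    print(f"{num_ops:5d} ops -> ~{working_set_kb:.0f} KB working set")

# Under these assumptions the working set grows from ~768 KB at 2048 operations
# to ~1536 KB at 4096 operations, which lines up with where the single-thread
# runtimes jump in the HVX tables above.
```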

cc @csullivan @mehrdadh @tmoreau89

Comment on lines 163 to 165
```python
1024,  # Single thread runs faster since L2 cache can handle the entire request quickly
2048,
4096,  # Significant performance degradation once the inputs and outputs cannot all fit in L2
```
Contributor

I think these are insightful comments that should go in the PR description so that we can make sure to bring them into the commit description when this PR lands.

Contributor Author

I added a little discussion to the description


```python
def evaluate(hexagon_session, operations, expected, sch):
    shape = operations
    dtype = "float64"
```
Contributor

Why choose float64 here? Is the goal to try to avoid vectorization?

Contributor Author

Yeah, that's the idea. I know that HVX does not support float64.

@nverke (Contributor Author) commented Sep 12, 2022

@tvm-bot rerun

@adstraw (Contributor) left a comment

LGTM, in general. I think it could use a once-over to add comments. For example, it took me a minute to figure out the difference between an _operator and a _producer function. Would be great if you could parameterize the op type as it would shrink the code size considerably.

```python
# Experimentally best split factor, but all multiples of 4 perform pretty well.
# This is because there are 4 HVX units available on the device and pipelining
# works best when the parallelism is a multiple of the number of available HVX units.
split_factor = tvm.testing.parameter(4)
```
Contributor

Can you find a way to parameterize the operation type (vrmpy, vmpy, vadd) and create just one test case? There is a lot of duplicated code in the 3 test cases below.

Contributor Author

Good suggestion! I'll see if I can do this!

Contributor Author

Made the changes
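
For illustration, the kind of op-type parameterization being requested might look like the sketch below; the names, reference implementations, and parameter values are assumptions for this example rather than the exact code that landed in the PR:

```python
import tvm.testing

# Hypothetical table mapping an op name to a NumPy-style reference
# implementation used to compute the expected output (names are illustrative).
REFERENCE_IMPLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

# One parameter covers all ops, so a single test body can replace the
# near-duplicate per-op test cases.
operation = tvm.testing.parameter("add", "multiply", "sub")
operations = tvm.testing.parameter(128, 256, 512, 1024, 2048, 4096, 8192, 16384)
split_factor = tvm.testing.parameter(4)
```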

```python
    16384,
)

split_factor = tvm.testing.parameter(4)
```
Contributor

Same comment; can we parameterize the op type to avoid code duplication?

Contributor Author

Made the changes

@nverke (Contributor Author) commented Sep 13, 2022

@adstraw I believe I made the code clearer as to which functions are producers, but let me know what you think!

@adstraw (Contributor) left a comment

LGTM

@tmoreau89 (Contributor)

@nverke PR is good to go minus merge conflicts that require resolution.

@tmoreau89 (Contributor) left a comment

Thanks for the review @adstraw LGTM

@tmoreau89 merged commit da7f65d into apache:main on Sep 19, 2022
@tmoreau89 (Contributor)

Thanks @nverke and @adstraw - the PR has been merged

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Hexagon] Create test examples to show parallelization working on Hexagon workloads.

* Increase max size of tvm_rpc_android buffer size.

* Reformat tests to be parameterized.

* Comment out tests to speedup CI.