
[Hexagon] Create test examples to show parallelization #12654

Merged: 4 commits into apache:main on Sep 19, 2022
Conversation

@nverke (Contributor) commented Aug 30, 2022

Background:

These tests show one way to attain speedup for Hexagon workloads through parallelism. They also show the speedup attained by various workloads in the best configuration found (outer loop split into 4). This should give users an understanding of what to do and what to expect when parallelizing Hexagon workloads.
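
For reference, here is a minimal sketch of the scheduling idea described above, assuming a simple element-wise add workload; the buffer names, dtype, and sizes are illustrative and not taken from the test code in this PR:

```python
import tvm
from tvm import te


def schedule_parallel_add(num_ops: int, split_factor: int = 4):
    """Split the outer loop by `split_factor` and mark it parallel."""
    # Hypothetical element-wise add workload; "A", "B", "C" are illustrative names.
    a = te.placeholder((num_ops,), dtype="int32", name="A")
    b = te.placeholder((num_ops,), dtype="int32", name="B")
    c = te.compute((num_ops,), lambda i: a[i] + b[i], name="C")

    sch = tvm.tir.Schedule(te.create_prim_func([a, b, c]))
    block = sch.get_block("C")
    (i,) = sch.get_loops(block)
    # Split into `split_factor` outer chunks (4 was the best configuration found)
    # and run the outer loop in parallel.
    outer, _inner = sch.split(i, factors=[split_factor, None])
    sch.parallel(outer)
    return sch
```

The tests compare a parallel schedule along these lines against the same workload run on a single thread, which is where the speedup numbers below come from.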

Here are some tables of the runtimes on various workloads and HDKs.

HVX runtimes on 8gen1 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| vrmpy | 128 | 0.00112 | 0.002242 | 0.5x |
| vrmpy | 256 | 0.001797 | 0.002505 | 0.72x |
| vrmpy | 512 | 0.003095 | 0.00309 | 1.0x |
| vrmpy | 1024 | 0.005732 | 0.00422 | 1.36x |
| vrmpy | 2048 | 0.011118 | 0.006515 | 1.71x |
| vrmpy | 4096 | 0.140888 | 0.056102 | 2.51x |
| vrmpy | 8192 | 0.265677 | 0.100795 | 2.64x |
| vrmpy | 16384 | 0.877999 | 0.286085 | 3.07x |
| vmpy | 128 | 0.001371 | 0.002348 | 0.58x |
| vmpy | 256 | 0.002236 | 0.002721 | 0.82x |
| vmpy | 512 | 0.003988 | 0.003511 | 1.14x |
| vmpy | 1024 | 0.007496 | 0.00504 | 1.49x |
| vmpy | 2048 | 0.018755 | 0.012325 | 1.52x |
| vmpy | 4096 | 0.172784 | 0.066596 | 2.59x |
| vmpy | 8192 | 0.357293 | 0.12787 | 2.79x |
| vmpy | 16384 | 1.223959 | 0.367546 | 3.33x |
| vadd | 128 | 0.001349 | 0.002355 | 0.57x |
| vadd | 256 | 0.002194 | 0.002685 | 0.82x |
| vadd | 512 | 0.003968 | 0.003543 | 1.12x |
| vadd | 1024 | 0.007491 | 0.005024 | 1.49x |
| vadd | 2048 | 0.018481 | 0.012329 | 1.5x |
| vadd | 4096 | 0.172362 | 0.067368 | 2.56x |
| vadd | 8192 | 0.353322 | 0.130838 | 2.7x |
| vadd | 16384 | 1.215925 | 0.368648 | 3.3x |

HVX runtimes on 888 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| vrmpy | 128 | 0.001157 | 0.002261 | 0.51x |
| vrmpy | 256 | 0.00188 | 0.003152 | 0.6x |
| vrmpy | 512 | 0.003201 | 0.00379 | 0.84x |
| vrmpy | 1024 | 0.005942 | 0.005273 | 1.13x |
| vrmpy | 2048 | 0.011579 | 0.009932 | 1.17x |
| vrmpy | 4096 | 0.258623 | 0.083141 | 3.11x |
| vrmpy | 8192 | 0.509286 | 0.172014 | 2.96x |
| vrmpy | 16384 | 0.979687 | 0.333655 | 2.94x |
| vmpy | 128 | 0.00142 | 0.002941 | 0.48x |
| vmpy | 256 | 0.002289 | 0.003374 | 0.68x |
| vmpy | 512 | 0.004115 | 0.004397 | 0.94x |
| vmpy | 1024 | 0.007751 | 0.006372 | 1.22x |
| vmpy | 2048 | 0.025335 | 0.01625 | 1.56x |
| vmpy | 4096 | 0.355102 | 0.116926 | 3.04x |
| vmpy | 8192 | 0.62289 | 0.214158 | 2.91x |
| vmpy | 16384 | 1.239413 | 0.40535 | 3.06x |
| vadd | 128 | 0.00141 | 0.002982 | 0.47x |
| vadd | 256 | 0.002307 | 0.003473 | 0.66x |
| vadd | 512 | 0.004115 | 0.004371 | 0.94x |
| vadd | 1024 | 0.007767 | 0.006365 | 1.22x |
| vadd | 2048 | 0.025241 | 0.01564 | 1.61x |
| vadd | 4096 | 0.357479 | 0.11579 | 3.09x |
| vadd | 8192 | 0.646734 | 0.218949 | 2.95x |
| vadd | 16384 | 1.243888 | 0.400593 | 3.11x |

Scalar runtimes on 8gen1 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| add | 128 | 0.000961 | 0.002217 | 0.43x |
| add | 256 | 0.001477 | 0.00277 | 0.53x |
| add | 512 | 0.00255 | 0.003356 | 0.76x |
| add | 1024 | 0.005959 | 0.005371 | 1.11x |
| add | 2048 | 0.015845 | 0.008645 | 1.83x |
| add | 4096 | 0.030957 | 0.014782 | 2.09x |
| add | 8192 | 0.058772 | 0.027117 | 2.17x |
| add | 16384 | 0.115732 | 0.050325 | 2.3x |
| multiply | 128 | 0.00144 | 0.002561 | 0.56x |
| multiply | 256 | 0.002457 | 0.003177 | 0.77x |
| multiply | 512 | 0.004462 | 0.004707 | 0.95x |
| multiply | 1024 | 0.009583 | 0.007716 | 1.24x |
| multiply | 2048 | 0.022285 | 0.013335 | 1.67x |
| multiply | 4096 | 0.043851 | 0.023989 | 1.83x |
| multiply | 8192 | 0.085892 | 0.047104 | 1.82x |
| multiply | 16384 | 0.152309 | 0.086548 | 1.76x |
| sub | 128 | 0.000981 | 0.002077 | 0.47x |
| sub | 256 | 0.001482 | 0.002512 | 0.59x |
| sub | 512 | 0.002549 | 0.003345 | 0.76x |
| sub | 1024 | 0.006042 | 0.005565 | 1.09x |
| sub | 2048 | 0.015846 | 0.008672 | 1.83x |
| sub | 4096 | 0.030947 | 0.014812 | 2.09x |
| sub | 8192 | 0.059008 | 0.027133 | 2.17x |
| sub | 16384 | 0.115776 | 0.049741 | 2.33x |

Scalar runtimes on 888 HDK

| Operation | # Operations | Single Thread (ms) | Parallel (ms) | Speedup |
|-----------|--------------|--------------------|---------------|---------|
| add | 128 | 0.001019 | 0.002461 | 0.41x |
| add | 256 | 0.001564 | 0.002714 | 0.58x |
| add | 512 | 0.002643 | 0.003411 | 0.77x |
| add | 1024 | 0.006376 | 0.005649 | 1.13x |
| add | 2048 | 0.016413 | 0.009117 | 1.8x |
| add | 4096 | 0.032022 | 0.015408 | 2.08x |
| add | 8192 | 0.061622 | 0.028349 | 2.17x |
| add | 16384 | 0.114904 | 0.051258 | 2.24x |
| multiply | 128 | 0.001515 | 0.002529 | 0.6x |
| multiply | 256 | 0.002533 | 0.003264 | 0.78x |
| multiply | 512 | 0.00461 | 0.004697 | 0.98x |
| multiply | 1024 | 0.010024 | 0.008115 | 1.24x |
| multiply | 2048 | 0.02271 | 0.016956 | 1.34x |
| multiply | 4096 | 0.044663 | 0.025626 | 1.74x |
| multiply | 8192 | 0.08593 | 0.050227 | 1.71x |
| multiply | 16384 | 0.156369 | 0.082328 | 1.9x |
| sub | 128 | 0.001005 | 0.002229 | 0.45x |
| sub | 256 | 0.001567 | 0.002598 | 0.6x |
| sub | 512 | 0.002646 | 0.003399 | 0.78x |
| sub | 1024 | 0.006413 | 0.006804 | 0.94x |
| sub | 2048 | 0.016408 | 0.011015 | 1.49x |
| sub | 4096 | 0.032029 | 0.017999 | 1.78x |
| sub | 8192 | 0.061626 | 0.028435 | 2.17x |
| sub | 16384 | 0.118424 | 0.052493 | 2.26x |

From the tables above we can see that a single thread runs faster for smaller numbers of operations, since the L2 cache can handle the entire request quickly. Additionally, significant performance degradation occurs once the inputs and outputs can no longer all fit in L2. Both effects should be kept in mind when parallelizing workloads.
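
As a rough back-of-the-envelope check of that explanation, the sketch below estimates the working set as the input size grows. The per-operation sizes are assumptions for illustration (128-byte HVX vectors, two inputs plus one output), not values taken from this PR:

```python
# Working-set estimate for the L2 discussion above. All sizes here are
# assumptions for illustration, not values from the PR's test code.
BYTES_PER_VECTOR = 128  # assumed HVX vector width in bytes
NUM_BUFFERS = 3         # assumed: two input buffers plus one output buffer

for num_ops in (1024, 2048, 4096, 8192):
    working_set_kb = num_ops * BYTES_PER_VECTOR * NUM_BUFFERS / 1024
    print(f"{num_ops:5d} ops -> ~{working_set_kb:.0f} KB working set")

# Under these assumptions the working set grows from ~768 KB at 2048 operations
# to ~1536 KB at 4096 operations, which lines up with where the single-thread
# runtimes jump in the HVX tables above.
```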

cc @csullivan @mehrdadh @tmoreau89

Comment on lines 163 to 165
```python
1024,  # Single thread runs faster since L2 cache can handle the entire request quickly
2048,
4096,  # Significant performance degradation once the inputs and outputs cannot all fit in L2
```
Contributor

I think these are insightful comments that should go in the PR description so that we can make sure to bring them into the commit description when this PR lands.

Contributor Author

I added a little discussion to the description


```python
def evaluate(hexagon_session, operations, expected, sch):
    shape = operations
    dtype = "float64"
```
Contributor

Why choose float64 here? Is the goal to try to avoid vectorization?

Contributor Author

Yeah, that's the idea. I know that HVX does not support float64.

@nverke (Contributor Author) commented Sep 12, 2022

@tvm-bot rerun

@adstraw (Contributor) left a comment

LGTM, in general. I think it could use a once-over to add comments. For example, it took me a minute to figure out the difference between an _operator and a _producer function. Would be great if you could parameterize the op type as it would shrink the code size considerably.

```python
# Experimentally best split factor, but all multiples of 4 perform pretty well.
# This is because there are 4 HVX units available on the device and pipelining
# works best when the parallelism is a multiple of the number of available HVX units.
split_factor = tvm.testing.parameter(4)
```
Contributor

Can you find a way to parameterize the operation type (vrmpy, vmpy, vadd) and create just one test case? There is a lot of duplicated code in the 3 test cases below.

Contributor Author

Good suggestion! I'll see if I can do this!

Contributor Author

Made the changes
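
For illustration, the kind of op-type parameterization being requested might look like the sketch below; the names, reference implementations, and parameter values are assumptions for this example rather than the exact code that landed in the PR:

```python
import tvm.testing

# Hypothetical table mapping an op name to a NumPy-style reference
# implementation used to compute the expected output (names are illustrative).
REFERENCE_IMPLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

# One parameter covers all ops, so a single test body can replace the
# near-duplicate per-op test cases.
operation = tvm.testing.parameter("add", "multiply", "sub")
operations = tvm.testing.parameter(128, 256, 512, 1024, 2048, 4096, 8192, 16384)
split_factor = tvm.testing.parameter(4)
```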

```python
    16384,
)

split_factor = tvm.testing.parameter(4)
```
Contributor

Same comment; can we parameterize the op type to avoid code duplication?

Contributor Author

Made the changes

@nverke (Contributor Author) commented Sep 13, 2022

@adstraw I believe I made the code clearer as to which functions are producers, but let me know what you think!

@adstraw (Contributor) left a comment

LGTM

@tmoreau89 (Contributor)

@nverke PR is good to go minus merge conflicts that require resolution.

@tmoreau89 (Contributor) left a comment

Thanks for the review @adstraw LGTM

@tmoreau89 merged commit da7f65d into apache:main on Sep 19, 2022
@tmoreau89 (Contributor)

Thanks @nverke and @adstraw - the PR has been merged

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Hexagon] Create test examples to show parallelization working on Hexagon workloads.

* Increase max size of tvm_rpc_android buffer size.

* Reformat tests to be parameterized.

* Comment out tests to speedup CI.