[Hexagon] Create tests to showcase vtcm loading capabilities on Hexagon. #12667

nverke · 2022-08-31T22:25:40Z

Background

To learn how to run efficiently on Hexagon, I investigated how to properly utilize VTCM. These are the initial results of that investigation, and they should serve as a good starting point for others looking to leverage VTCM on Hexagon.

Results

Below are the results from running a simple parallel vrmpy operation in several different configurations. Each configuration is described below.

Without VTCM: This runs the vrmpy operator without loading any data into VTCM.

Basic VTCM Loads: This introduces loops to copy the data into VTCM before running the compute without any scheduling of those data copy loops.

Vec Loads: This applies the following scheduling to the data copy loops. The DDR -> VTCM loops use unroll_split = 8 and vector_split = 64. The VTCM -> DDR loop uses unroll_split = 8 and vector_split = 8.

vb, vi = sch.get_loops(block)
v = sch.fuse(vb, vi)
_, vio, vii = sch.split(v, factors=[None, unroll_split, vector_split])
sch.vectorize(vii)
sch.unroll(vio)
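
To make the effect of the fuse-and-split concrete, here is a minimal pure-Python sketch (not TVM code) of how a flat index in the fused copy loop maps onto the three loops produced by the split. The function names `split_index` and `join_index` are hypothetical; the default factors are the DDR -> VTCM values quoted above.

```python
def split_index(flat, unroll_split=8, vector_split=64):
    """Decompose a flat index into (outer, unroll, vector) loop indices."""
    outer, rem = divmod(flat, unroll_split * vector_split)
    unroll, vector = divmod(rem, vector_split)
    return outer, unroll, vector


def join_index(outer, unroll, vector, unroll_split=8, vector_split=64):
    """Inverse of split_index: rebuild the flat index from the loop indices."""
    return (outer * unroll_split + unroll) * vector_split + vector


# Element 1000 of the fused loop: outer iteration 1, unrolled step 7, vector lane 40.
example = split_index(1000)
```

Because the mapping is a bijection, the split only changes how the copy is tiled, not which elements are copied or in what order.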

Vec + Para Loads: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is always 4.

vb, vi = sch.get_loops(block)
v = sch.fuse(vb, vi)
vbo, vbi, vio, vii = sch.split(v, factors=[outer_split, None, unroll_split, vector_split])
sch.vectorize(vii)
sch.unroll(vio)
sch.parallel(vbo)
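
As a rough illustration of what the `None` factor resolves to, the sketch below computes the loop extents for a given fused extent. It is plain Python rather than TVM, `resolve_split` is a hypothetical helper, the 128 KiB buffer size is an assumed example, and it assumes exactly one `None` factor and an extent that divides evenly.

```python
def resolve_split(extent, factors):
    """Fill in the single None factor so the product of factors equals extent."""
    known = 1
    for f in factors:
        if f is not None:
            known *= f
    if extent % known != 0:
        raise ValueError("extent is not evenly divisible by the known factors")
    return [extent // known if f is None else f for f in factors]


# Assumed example: a 128 KiB int8 buffer with outer_split=4 and the
# DDR -> VTCM factors unroll_split=8, vector_split=64.
extents = resolve_split(128 * 1024, [4, None, 8, 64])
bytes_per_parallel_chunk = (128 * 1024) // extents[0]
```

With these assumed numbers each of the 4 parallel iterations would handle a contiguous 32 KiB slice of the copy.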

Pre + Vec Loads: Same as "Vec Loads" except the VTCM buffers are allocated before runtime and passed into the operator.

Pre + Vec + Para Loads: Same as "Vec + Para Loads" except the VTCM buffers are allocated before runtime and passed into the operator.

Single DMA Load: A single DMA command is used to copy all of the data over to VTCM. The VTCM buffers were preallocated for this configuration since I could not get it to work without preallocation.

Preloaded: All of the data is already loaded into VTCM before the test starts.

8Gen1 HDK

Total Vrmpy Operations	Total Transfer (MB)	Without VTCM (Gops)	Basic VTCM Loads (Gops)	Vec Loads (Gops)	Vec + Para Loads (Gops)	Pre + Vec Loads (Gops)	Pre + Vec + Para Loads (Gops)	Single DMA Load (Gops)	Preloaded (Gops)
1024	0.39	95.0256	0.345	0.5408	0.5905	44.4814	32.8886	15.0813	124.7117
2048	0.79	124.2389	0.4002	0.7063	0.8826	43.5238	47.2871	16.1339	209.2688
4096	1.57	41.5497	0.4215	0.8664	1.1977	10.9374	26.5749	18.1754	241.1628
10240	3.93	33.2139	0.4419	1.0506	1.7311	11.7886	34.0405	25.4214	370.2948
16384	6.29	20.7683	0.4195	1.0568	1.898	7.7292	22.5898	29.7011	397.4137
20480	7.86	20.2128	0.4406	1.069	1.9779	6.6829	17.7941	25.4929	338.294
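
One way to read this table is as the ratio between the "Preloaded" and "Without VTCM" columns. The short sketch below computes that ratio from the 8Gen1 rows; the values are copied from the table, while the variable and function names are only illustrative.

```python
# (total vrmpy operations, Without VTCM Gops, Preloaded Gops) from the 8Gen1 table.
ROWS_8GEN1 = [
    (1024, 95.0256, 124.7117),
    (2048, 124.2389, 209.2688),
    (4096, 41.5497, 241.1628),
    (10240, 33.2139, 370.2948),
    (16384, 20.7683, 397.4137),
    (20480, 20.2128, 338.294),
]


def preloaded_speedup(rows):
    """Map operation count to Preloaded / Without VTCM throughput ratio."""
    return {ops: preloaded / without for ops, without, preloaded in rows}


speedups = preloaded_speedup(ROWS_8GEN1)
largest_gap = max(speedups, key=speedups.get)
```

On these numbers the ratio grows from roughly 1.3x at 1024 operations to roughly 19x at 16384 operations.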

888 HDK

Total Vrmpy Operations	Total Transfer (MB)	Without VTCM (Gops)	Basic VTCM Loads (Gops)	Vec Loads (Gops)	Vec + Para Loads (Gops)	Pre + Vec Loads (Gops)	Pre + Vec + Para Loads (Gops)	Single DMA Load (Gops)	Preloaded (Gops)
1024	0.39	92.2826	0.5363	1.1438	1.3951	42.2813	37.2929	13.1085	121.4004
2048	0.79	98.8228	0.5269	1.1818	1.6554	43.9298	43.1773	14.5703	205.442
4096	1.57	22.1415	0.4095	0.988	1.5843	6.3227	16.1113	15.397	271.1367
10240	3.93	15.3377	0.4323	1.1091	1.9689	6.4958	18.68	17.9959	360.6824
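
The "Total Transfer (MB)" column appears to be consistent with 384 bytes moved per vrmpy operation, reported in decimal megabytes. That figure is an inference from the numbers, not something stated in this PR; one plausible reading is two 128-byte int8 input vectors plus one 128-byte output vector. A small sketch under that assumption:

```python
# Assumption: two 128-byte inputs plus one 128-byte output per vrmpy operation.
BYTES_PER_OP = 3 * 128


def total_transfer_mb(operations):
    """Total bytes moved for the given operation count, in decimal MB."""
    return operations * BYTES_PER_OP / 1e6


table = {
    ops: round(total_transfer_mb(ops), 2)
    for ops in (1024, 2048, 4096, 10240, 16384, 20480)
}
```

Under this assumption the computed values match the column in both tables.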

Below are the results from copying data to VTCM with various strategies. The strategies are described below.

Base: This copies the data into VTCM with a simple loop.

Unroll + Vectorize: This applies the following scheduling to the data copy loop. These tests use unroll_split = 2 and vector_split = 128.

vb = sch.get_loops(vtcm_block_a)
vbi_a, vio_a, vii_a = sch.split(vb[0], factors=[None, unroll_split, vector_split])
sch.unroll(vio_a)
sch.vectorize(vii_a)

Unroll + Vectorize + Parallel: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is 4.

vb = sch.get_loops(vtcm_block_a)
vbo_a, vbi_a, vio_a, vii_a = sch.split(vb[0], factors=[outer_split, None, unroll_split, vector_split])
sch.unroll(vio_a)
sch.vectorize(vii_a)
sch.parallel(vbo_a)
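
The two copy schedules differ only in the extra outer split, so they could be folded into one helper. The sketch below is hypothetical (the name `schedule_vtcm_copy` is not from the PR) and is written against any object exposing `get_loops`, `split`, `unroll`, `vectorize` and `parallel` in the way the snippets above use them; it deliberately does not import TVM.

```python
def schedule_vtcm_copy(sch, block, unroll_split, vector_split, outer_split=None):
    """Apply unroll + vectorize, and optionally parallel, to a copy block's first loop."""
    loop = sch.get_loops(block)[0]

    # Prepend the outer factor only when a parallel outer loop is requested.
    factors = [None, unroll_split, vector_split]
    if outer_split is not None:
        factors = [outer_split] + factors

    loops = sch.split(loop, factors=factors)

    # The last two loops are always the unrolled and vectorized ones.
    sch.unroll(loops[-2])
    sch.vectorize(loops[-1])

    if outer_split is not None:
        sch.parallel(loops[0])
```

Calling it with `unroll_split=2, vector_split=128` would correspond to "Unroll + Vectorize", and adding `outer_split=4` to "Unroll + Vectorize + Parallel".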

Single DMA: Copies the data into VTCM with a single DMA instruction.

8Gen1 HDK

Total Transfer (MB)	Base (GBps)	Unroll + Vectorize (GBps)	Unroll + Vectorize + Parallel (GBps)	Single DMA (GBps)
0.01	2.2122	15.9211	4.8287	2.2524
0.02	2.3207	26.1998	9.5082	4.6669
0.04	2.4425	38.1089	17.5147	6.4492
0.08	2.5067	48.5949	32.507	9.1469
0.16	2.5507	57.6021	55.1855	11.1598
0.31	2.7053	62.8063	83.4726	15.2878
0.62	2.9199	74.3696	114.7925	17.6438
1	2.2645	49.8653	63.8026	18.8814
2	1.1232	10.3933	29.1977	20.6719
4	1.0683	9.6105	26.5143	25.201
8	0.6814	6.1916	24.049	26.1883
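
To see where each strategy wins, the sketch below picks the fastest column for every transfer size in the 8Gen1 table. The numbers are copied from the table; the names are illustrative only.

```python
STRATEGIES = (
    "Base",
    "Unroll + Vectorize",
    "Unroll + Vectorize + Parallel",
    "Single DMA",
)

# Transfer size in MB -> GBps per strategy, copied from the 8Gen1 table.
BANDWIDTH_8GEN1 = {
    0.01: (2.2122, 15.9211, 4.8287, 2.2524),
    0.02: (2.3207, 26.1998, 9.5082, 4.6669),
    0.04: (2.4425, 38.1089, 17.5147, 6.4492),
    0.08: (2.5067, 48.5949, 32.507, 9.1469),
    0.16: (2.5507, 57.6021, 55.1855, 11.1598),
    0.31: (2.7053, 62.8063, 83.4726, 15.2878),
    0.62: (2.9199, 74.3696, 114.7925, 17.6438),
    1: (2.2645, 49.8653, 63.8026, 18.8814),
    2: (1.1232, 10.3933, 29.1977, 20.6719),
    4: (1.0683, 9.6105, 26.5143, 25.201),
    8: (0.6814, 6.1916, 24.049, 26.1883),
}


def fastest_strategy(table):
    """Map each transfer size to the name of the strategy with the highest GBps."""
    return {size: STRATEGIES[row.index(max(row))] for size, row in table.items()}


best = fastest_strategy(BANDWIDTH_8GEN1)
```

On this data, "Unroll + Vectorize" is fastest up to 0.16 MB, the parallel variant is fastest from 0.31 MB to 4 MB, and "Single DMA" is fastest only at 8 MB.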

888 HDK

Total Transfer (MB)	Base (GBps)	Unroll + Vectorize (GBps)	Unroll + Vectorize + Parallel (GBps)	Single DMA (GBps)
0.01	2.6699	12.1178	4.3369	1.8245
0.02	2.7955	24.6427	8.6658	3.4972
0.04	3.0016	35.7516	14.4496	5.0863
0.08	3.1047	37.8442	25.2964	7.2166
0.16	3.2119	55.4663	43.0918	9.4149
0.31	3.2614	61.023	65.6292	9.8254
0.62	3.4791	70.5527	111.0134	10.7716
1.0	1.5253	42.0009	45.3035	11.5082
2.0	0.7137	5.29	17.3306	13.3808
4.0	0.721	5.2936	19.2567	13.639
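
The bandwidth figures can be turned back into an approximate copy time, which is sometimes easier to reason about when budgeting a kernel. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 GBps = 10^9 bytes per second); the PR does not state which convention the tests use, and `copy_time_us` is a hypothetical helper.

```python
def copy_time_us(size_mb, bandwidth_gbps):
    """Approximate copy time in microseconds, assuming decimal MB and GBps."""
    size_bytes = size_mb * 1e6
    bytes_per_second = bandwidth_gbps * 1e9
    return size_bytes / bytes_per_second * 1e6


# 4 MB on the 888 HDK: Single DMA versus Unroll + Vectorize + Parallel.
dma_us = copy_time_us(4.0, 13.639)
parallel_us = copy_time_us(4.0, 19.2567)
```

Under that assumption a 4 MB copy on the 888 HDK takes on the order of 0.2 to 0.3 milliseconds with either of these two strategies.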

cc @csullivan @mehrdadh


tmoreau89 · 2022-08-31T22:28:41Z

Very neat work @nverke ! CC @masahi @kparzysz-quic

csullivan

👏 Amazing work characterizing the performance of VTCM and DMA @nverke.

csullivan · 2022-09-02T16:27:31Z

python/tvm/contrib/hexagon/session.py

@@ -58,7 +58,7 @@ def __init__(
        remote_kw: dict,
        session_name: str = "hexagon-rpc",
        remote_stack_size_bytes: int = 256 * 1024,  # Min size for main thread in QuRT/sim
-        rpc_receive_buffer_size_bytes: int = 5 * 1024 * 1024,  # Size for passing hexagon tests
+        rpc_receive_buffer_size_bytes: int = 1024 * 1024 * 1024,  # Size for passing hexagon tests


Left over from testing? A gigabyte for the RPC buffer can reduce the memory available for model execution.

csullivan · 2022-09-02T16:32:48Z

tests/python/contrib/test_hexagon/test_vtcm_bandwidth.py

+        A = T.match_buffer(a, size, dtype="int8", align=128, scope="global")
+        A_global_vtcm = T.match_buffer(a_v, size, dtype="int8", align=128, scope="global.vtcm")


A nice follow-up would be to support measuring the bandwidth in both directions (DDR -> VTCM and VTCM -> DDR). In tests we saw a significant perf asymmetry, and making that easily reproducible can help the QC experts debug or provide us with insights.
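
A direction-agnostic timing harness for such a follow-up might look roughly like the sketch below. It is only an illustration of the measurement shape: the two copy functions are host-side `bytearray` stand-ins, not Hexagon kernels, and every name here is hypothetical.

```python
import time


def measure_gbps(copy_fn, nbytes, repeats=10):
    """Return the best observed bandwidth in GB/s for copy_fn over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_fn()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return nbytes / max(best, 1e-9) / 1e9


# Host-side stand-ins for the two buffers; on device these would live in
# DDR ("global") and VTCM ("global.vtcm") respectively.
nbytes = 1 << 20
ddr = bytearray(nbytes)
vtcm = bytearray(nbytes)


def ddr_to_vtcm():
    vtcm[:] = ddr


def vtcm_to_ddr():
    ddr[:] = vtcm


results = {
    "ddr->vtcm": measure_gbps(ddr_to_vtcm, nbytes),
    "vtcm->ddr": measure_gbps(vtcm_to_ddr, nbytes),
}
```

Reporting both directions side by side in the same run would make any asymmetry visible without extra post-processing.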


tmoreau89 · 2022-09-13T21:23:56Z

Thank you @csullivan and @nverke for the PR; it's been merged!

…on. (apache#12667)
* [Hexagon] Increase max buffer size for tvm_rpc_android to 1GB.
* [Hexagon] Make errors more clear when unable to allocate VTCM buffers and throw an error to fail early.
* [Hexagon] Add mem_copy_DLTensor to enable directly calling DMA for mem copies.
* [Hexagon] Add new tests as examples of the performance to expect when copying data to VTCM.
* [Hexagon] Reduce rpc max size.
* [Hexagon] Fix test_parallel_hvx_load_vtcm.py test output to be human readable.
* Comment out tests that only work on 8Gen1 HDKs to get CI to pass

nverke added 4 commits August 31, 2022 15:02

[Hexagon] Increase max buffer size for tvm_rpc_android to 1GB.

5432e23

[Hexagon] Make errors more clear when unable to allocate VTCM buffers and throw an error to fail early.

b2e298c

[Hexagon] Add mem_copy_DLTensor to enable directly calling DMA for mem copies.

233fda1

[Hexagon] Add new tests as examples of the performance to expect when copying data to VTCM.

7b1bd26

github-actions bot requested review from csullivan and tmoreau89 August 31, 2022 22:28

csullivan approved these changes Sep 2, 2022


nverke added 2 commits September 2, 2022 09:49

[Hexagon] Reduce rpc max size.

c0659cc

[Hexagon] Fix test_parallel_hvx_load_vtcm.py test output to be human readable.

823527c

nverke marked this pull request as ready for review September 2, 2022 16:57

Comment out tests that only work on 8Gen1 HDKs to get CI to pass

73f3a14

github-actions bot requested a review from mehrdadh September 12, 2022 16:44

tmoreau89 merged commit 8058423 into apache:main Sep 13, 2022

AndrewZhaoLuo mentioned this pull request Oct 4, 2022

TVM v0.10.0.rc0 Release Candidate Notes #12979

Closed
