[Hexagon] Create tests to showcase vtcm loading capabilities on Hexagon. #12667
Very neat work @nverke! CC @masahi @kparzysz-quic
👏 Amazing work characterizing the performance of VTCM and DMA @nverke.
@@ -58,7 +58,7 @@ def __init__(
  remote_kw: dict,
  session_name: str = "hexagon-rpc",
  remote_stack_size_bytes: int = 256 * 1024, # Min size for main thread in QuRT/sim
- rpc_receive_buffer_size_bytes: int = 5 * 1024 * 1024, # Size for passing hexagon tests
+ rpc_receive_buffer_size_bytes: int = 1024 * 1024 * 1024, # Size for passing hexagon tests
Left over from testing? A gigabyte for the RPC buffer can reduce the memory available for model execution.
A = T.match_buffer(a, size, dtype="int8", align=128, scope="global")
A_global_vtcm = T.match_buffer(a_v, size, dtype="int8", align=128, scope="global.vtcm")
A nice follow-up would be to support measuring the bandwidth in both directions (DDR -> VTCM and VTCM -> DDR). In tests we saw a significant performance asymmetry, and making it easy to reproduce can help the QC experts debug it or give us insights.
Thank you @csullivan and @nverke for the PR; it's been merged!
…on. (apache#12667)
* [Hexagon] Increase max buffer size for tvm_rpc_android to 1GB.
* [Hexagon] Make errors more clear when unable to allocate VTCM buffers and throw an error to fail early.
* [Hexagon] Add mem_copy_DLTensor to enable directly calling DMA for mem copies.
* [Hexagon] Add new tests as examples of the performance to expect when copying data to VTCM.
* [Hexagon] Reduce rpc max size.
* [Hexagon] Fix test_parallel_hvx_load_vtcm.py test output to be human readable.
* Comment out tests that only work on 8Gen1 HDKs to get CI to pass
Background
To learn more about running efficiently on Hexagon, I investigated how to properly utilize VTCM. These are the initial results of that investigation, and they should serve as a good starting point for others looking to leverage VTCM on Hexagon.
Results
Below are the results from running a simple parallel vrmpy operation in several different configurations. Each configuration is described below.
Without VTCM: This runs the vrmpy operator without loading any data into VTCM.
Basic VTCM Loads: This introduces loops that copy the data into VTCM before running the compute, without any scheduling of those data copy loops.
Vec Loads: This applies the following scheduling to the data copy loops. For the DDR -> VTCM loops it uses unroll_split = 8 and vector_split = 64. For the VTCM -> DDR loop it uses unroll_split = 8 and vector_split = 8.
Vec + Para Loads: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is always 4.
Pre + Vec Loads: Same as "Vec Loads" except the VTCM buffers are allocated before runtime and passed into the operator.
Pre + Vec + Para Loads: Same as "Vec + Para Loads" except the VTCM buffers are allocated before runtime and passed into the operator.
Single DMA Load: A single DMA command is used to copy all of the data over to VTCM. The VTCM buffers are preallocated here because I could not get this to work without preallocation.
Preloaded: All of the data is already loaded into VTCM before the test starts.
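As an illustration of what the split factors above mean, here is a minimal pure-Python sketch of the loop nest that such a schedule produces for a flat copy. It is not the code from this PR: `split_extents` and `tiled_copy` are hypothetical helpers, the slice assignment stands in for a single vector load/store, and the comments mark which loops would be parallelized or unrolled on Hexagon. The defaults mirror the DDR -> VTCM values quoted above (unroll_split = 8, vector_split = 64).

```python
def split_extents(size, unroll_split=8, vector_split=64, outer_split=1):
    """Return (outer, serial, unroll, vector) loop extents for a flat copy.

    Hypothetical helper: the element count must be divisible by the
    product of the split factors, otherwise the nest would need a tail loop.
    """
    chunk = outer_split * unroll_split * vector_split
    if size % chunk != 0:
        raise ValueError(f"size {size} is not a multiple of {chunk}")
    return (outer_split, size // chunk, unroll_split, vector_split)


def tiled_copy(src, unroll_split=8, vector_split=64, outer_split=1):
    """Copy `src` using the loop structure implied by the split factors."""
    outer, serial, unroll, vector = split_extents(
        len(src), unroll_split, vector_split, outer_split
    )
    dst = [0] * len(src)
    for o in range(outer):           # "Para" variants: this loop runs in parallel
        for s in range(serial):      # remaining serial iterations
            for u in range(unroll):  # this loop would be fully unrolled
                base = ((o * serial + s) * unroll + u) * vector
                # One vector-wide copy; stands in for a vectorized load/store.
                dst[base:base + vector] = src[base:base + vector]
    return dst
```

For example, `split_extents(4096, outer_split=4)` splits a 4096-element copy into 4 parallel chunks, each with 2 serial iterations of 8 unrolled 64-element vector copies.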
8Gen1 HDK
888 HDK
Below are the results from copying data to VTCM with various strategies. The strategies are described below.
Base: This copies the data into VTCM with a simple loop.
Unroll + Vectorize: This unrolls and vectorizes the data copy loop. These tests use unroll_split = 2 and vector_split = 128.
Unroll + Vectorize + Parallel: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is 4.
Single DMA: Copies the data into VTCM with a single DMA instruction.
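When comparing these strategies it helps to convert a measured copy time into a bandwidth figure. The sketch below shows the arithmetic only. `bandwidth_gbps` and `time_copy` are hypothetical helpers, not part of this PR, and the host-side `time.perf_counter` timing is a stand-in assumption for whatever on-device timing the tests actually use.

```python
import time


def bandwidth_gbps(num_bytes, seconds):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) for a copy of `num_bytes`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / 1e9


def time_copy(copy_fn, src, repeats=10):
    """Return the best wall-clock time over `repeats` calls of copy_fn(src)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_fn(src)
        best = min(best, time.perf_counter() - start)
    return best
```

Timing the same helper once per direction (DDR -> VTCM and VTCM -> DDR) and reporting both numbers side by side would make any asymmetry between the two easy to spot.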
8Gen1 HDK
888 HDK
cc @csullivan @mehrdadh