[Hexagon] Create test examples to show parallelization #12654
Conversation
1024,  # Single thread runs faster since L2 cache can handle the entire request quickly
2048,
4096,  # Significant performance degradation once the inputs and outputs cannot all fit in L2
I think these are insightful comments that should go in the PR description so that we can make sure to bring them into the commit description when this PR lands.
I added a little discussion to the description
def evaluate(hexagon_session, operations, expected, sch):
    shape = operations
    dtype = "float64"
Why choose float64 here? Is the goal to try to avoid vectorization?
Yeah, that's the idea. I know that HVX does not support float64.
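For context, here is a minimal sketch of what an `evaluate` helper along these lines might do; the build flow, target version, and the assumption that `sch` is a `tvm.tir.Schedule` exported as "main" are mine, not taken from the PR. The float64 dtype keeps HVX vectorization out of the measurement so the test isolates thread-level parallelism.

```python
import numpy as np
import tvm


def evaluate(hexagon_session, operations, expected, sch):
    """Build the schedule for Hexagon, run it on the device, and check the result."""
    shape = operations
    dtype = "float64"  # HVX has no float64 support, so vectorization is avoided

    target_hexagon = tvm.target.hexagon("v68")  # assumed target version
    func = tvm.build(
        sch.mod["main"],
        target=tvm.target.Target(target_hexagon, host=target_hexagon),
    )
    module = hexagon_session.load_module(func)

    a = tvm.nd.array(
        np.random.uniform(size=shape).astype(dtype), device=hexagon_session.device
    )
    out = tvm.nd.array(np.zeros(shape, dtype=dtype), device=hexagon_session.device)
    module(a, out)  # assumes the built module exposes a single entry function
    np.testing.assert_allclose(out.numpy(), expected, rtol=1e-4)
```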
@tvm-bot rerun
LGTM, in general. I think it could use a once-over to add comments. For example, it took me a minute to figure out the difference between an _operator and a _producer function. Would be great if you could parameterize the op type, as it would shrink the code size considerably.
# Experimentally the best split factor, but all multiples of 4 perform pretty well.
# This is because there are 4 HVX units available on the device and pipelining
# works best with multiples of the number of available HVX units.
split_factor = tvm.testing.parameter(4)
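As a rough illustration of how such a split factor gets applied (the elementwise compute and names below are my own placeholders, not the PR's actual producers/operators), splitting the outer loop into `split_factor` chunks and parallelizing the outer axis with the TE schedule API looks roughly like this:

```python
import tvm
from tvm import te

# Hypothetical elementwise workload; the PR's actual compute definitions differ.
n = 4096
A = te.placeholder((n,), dtype="float64", name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

sch = te.create_schedule(B.op)
# Split the loop into 4 outer chunks (one per worker thread) and run them in
# parallel; 4 matches the number of HVX units mentioned in the comment above.
outer, inner = sch[B].split(B.op.axis[0], nparts=4)
sch[B].parallel(outer)
```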
Can you find a way to parameterize the operation type (vrmpy, vmpy, vadd) and create just one test case? There is a lot of duplicated code in the 3 test cases below.
Good suggestion! I'll see if I can do this!
Made the changes
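For reference, a sketch of how the op type could be parameterized with `tvm.testing.parameter` so one test body covers all three variants; the fixture and test names below are illustrative, not the PR's actual code:

```python
import tvm.testing

# One parameter covers the three op variants discussed above.
operation_type = tvm.testing.parameter("vrmpy", "vmpy", "vadd")
split_factor = tvm.testing.parameter(4)


def test_parallel_speedup(hexagon_session, operation_type, operations, split_factor):
    # A single test body selects the compute definition based on operation_type
    # instead of duplicating three nearly identical test cases.
    ...
```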
    16384,
)

split_factor = tvm.testing.parameter(4)
Same comment; can we parameterize the op type to avoid code duplication?
Made the changes
@adstraw I believe I made the code clearer as to which functions are producers, but let me know what you think!
LGTM
@nverke PR is good to go minus merge conflicts that require resolution.
Thanks for the review @adstraw LGTM
* [Hexagon] Create test examples to show parallelization working on Hexagon workloads.
* Increase max size of tvm_rpc_android buffer size.
* Reformat tests to be parameterized.
* Comment out tests to speed up CI.
Background:
These tests show one way to attain speedup for Hexagon workloads through parallelism. Additionally, they show the speedup attained by various workloads in the best configuration found (outer loop split into 4). This will hopefully give users an understanding of what to do and what to expect when parallelizing Hexagon workloads.
Here are some tables of the runtimes on various workloads and HDKs.
HVX runtimes on 8gen1 HDK
HVX runtimes on 888 HDK
Scalar runtimes on 8gen1 HDK
Scalar runtimes on 888 HDK
From the tables above, we can see that a single thread runs faster for smaller numbers of operations, since the L2 cache can handle the entire request quickly. Additionally, significant performance degradation occurs once the inputs and outputs cannot all fit in L2. Both of these should be kept in mind when utilizing parallelism.
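As a back-of-the-envelope illustration of the L2 observation above (the cache size and buffer count below are assumptions for the sketch, not measured values from this PR):

```python
# Rough check of whether a workload's inputs and outputs fit in L2; beyond this
# point the tables show the single-threaded advantage disappearing.
L2_BYTES = 1024 * 1024   # assumed L2 size; the real value depends on the SoC
DTYPE_BYTES = 8          # float64, as used in the tests


def fits_in_l2(operations: int, num_buffers: int = 3) -> bool:
    # num_buffers = two inputs + one output for a simple elementwise op
    return operations * DTYPE_BYTES * num_buffers <= L2_BYTES
```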
cc @csullivan @mehrdadh @tmoreau89