Add vrmpy, vdmpy, improve vmpa on Hexagon #1825
Conversation
@@ -0,0 +1,55 @@
include ../support/Makefile.inc
If you put all of the intermediate stuff into a BIN subdir (a la apps/simd_op_check), .gitignore won't need updating and the world will be a happy place.
This makefile is literally a copy/paste from another app; I only changed a few command line args used to run the app's binaries.
I totally agree that the apps need major refactoring, to use generators, etc... but I'd like to hold off on that until we can invest time in doing it consistently for all apps. What do you think?
That refactoring already happened for most apps. Copy-paste from local laplacian or bilateral grid instead. Which ones still need fixing?
Those don't support running on Android. Maybe it's just the apps that run on Android that need refactoring.
It might literally just be HelloHexagon and this one, though I thought there were more. I copied from there in order to get Android support.
Maybe I'm thinking of simd_op_check (but that one probably needs its own custom setup anyway).
Either way... I'd like to make this fix in a separate PR.
I don't think HelloHexagon should be fixed in this PR, but I'd rather not commit new code that will immediately need fixing to match the structure of the other apps.
The thing I'm concerned about is that I'm not sure what the right thing even is. I already tried to do this for HelloHexagon once, and got bogged down with complexities (the Android toolchain stuff combined with Generator things) and aborted my attempt. I have no reason to expect to be able to solve it any better for this app right now...
Don't view this comment as a deal-breaker; that said, I thought all the others had been refactored into the put-the-intermediate-stuff-in-bin/ style. I'll do another pass at some point to see what I missed.
I just re-copied HelloHexagon with Andrew's modernization (thanks for doing that!) and reimplemented the matrix multiply.
apps/hexagon_matmul/pipeline.cpp
Outdated
using namespace Halide;
using namespace Halide::ConciseCasts;

int main(int argc, char **argv) {
I'll just note that this sure looks like a Generator to me :-)
I was kind of eager to try this patch.
I see an abort on conv3x3a16, conv3x3a32, and HelloHexagon:
Internal error at /local/mnt2/ronl/Halide-upstream/Halide/src/IR.cpp:21
Condition failed: a.type() == b.type()
Add of mismatched types
Can you send me a repro case?
HelloHexagon, I think you have that one already. Is it not failing for you?
OK, I wasn't actually expecting vrmpy to be generated in HelloHexagon, because the outputs are 16 bit, not 32 bit. Unfortunately, I tried supporting 16 bit outputs from vrmpy, and while it works, it's slower than letting 2 vmpa/vdmpy be generated instead (probably because of the extra interleaving required by vrmpy).
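For readers following along, here's a minimal scalar model of the reduction that vrmpy accelerates, as described in this thread: four unsigned 8-bit products accumulated into one 32-bit lane. This is a plain-C++ sketch with illustrative names, not Halide or Hexagon API:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model (names ours) of one output lane of the vrmpy pattern:
// four u8*u8 products summed into a 32-bit accumulator.
inline uint32_t vrmpy_lane(std::array<uint8_t, 4> a, std::array<uint8_t, 4> b,
                           uint32_t acc) {
    for (int i = 0; i < 4; i++) {
        acc += uint32_t(a[i]) * uint32_t(b[i]);
    }
    return acc;
}
```

A vector vrmpy computes many such lanes per instruction, which is roughly why it can beat several separate multiply-adds when the operands line up.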
(Also, it was a bug that it was generated there at all; that's now fixed.)
Actually, I have a better idea: we should generate vrmpy for 16 bit results if the required interleave simplifies away... will implement.
The matrix multiply now looks pretty good. The inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule. Here's the inner loop:
Look at all that empty space in the later packets. I wonder if there's
anything useful that could be jammed in there. Prefetching? Some add
instructions from the next loop iteration? I assume you've already tried
different unrolling factors along k.
…On Fri, Feb 10, 2017 at 11:36 AM, Dillon Sharlet ***@***.***> wrote:
Here's the inner loop:
.Ltmp2: // Block address taken
.LBB36_10: // %"for AB.s1.k$x"
// Parent Loop BB36_8 Depth=1
// => This Inner Loop Header: Depth=2
{
r17 = add(r28, r7)
r20 = add(r28, r5)
r11 = add(r28, r6)
v17 = vmem(r10+#-4)
}
{
r21 = memw(r11 + #-4)
v21 = vmem(r10+#-3)
}
{
v7.uw += vrmpy(v17.ub,r21.ub)
r20 = memw(r20 + #-4)
v22 = vmem(r10+#-2)
}
{
v3.uw += vrmpy(v17.ub,r20.ub)
v4.uw += vrmpy(v21.ub,r20.ub)
r17 = memw(r17 + #-4)
v23 = vmem(r10+#-1)
}
{
v8.uw += vrmpy(v17.ub,r17.ub)
v6.uw += vrmpy(v22.ub,r17.ub)
r22 = memw(r28 + #-4)
v19 = vmem(r10+#0)
}
{
v16.uw += vrmpy(v17.ub,r22.ub)
v12.uw += vrmpy(v23.ub,r22.ub)
r11 = memw(r28 + r5<<#0)
v20 = vmem(r10+#1)
}
{
v1.uw += vrmpy(v22.ub,r20.ub)
v9.uw += vrmpy(v21.ub,r17.ub)
r23 = memw(r28 + r7<<#0)
v17 = vmem(r10+#2)
}
{
v5.uw += vrmpy(v23.ub,r17.ub)
v2.uw += vrmpy(v23.ub,r20.ub)
r17 = memw(r28 + #0)
v18 = vmem(r10+#3)
}
{
v13.uw += vrmpy(v23.ub,r21.ub)
v10.uw += vrmpy(v21.ub,r21.ub)
r10 = add(r10, #1024)
r25 = memw(r28 + r6<<#0)
}
{
v15.uw += vrmpy(v21.ub,r22.ub)
v14.uw += vrmpy(v22.ub,r22.ub)
r28 = add(r28, #8)
}
{
v11.uw += vrmpy(v22.ub,r21.ub)
v8.uw += vrmpy(v19.ub,r23.ub)
}
{
v12.uw += vrmpy(v18.ub,r17.ub)
v16.uw += vrmpy(v19.ub,r17.ub)
}
{
v14.uw += vrmpy(v17.ub,r17.ub)
v15.uw += vrmpy(v20.ub,r17.ub)
}
{
v6.uw += vrmpy(v17.ub,r23.ub)
v5.uw += vrmpy(v18.ub,r23.ub)
}
{
v9.uw += vrmpy(v20.ub,r23.ub)
v10.uw += vrmpy(v20.ub,r25.ub)
}
{
v11.uw += vrmpy(v17.ub,r25.ub)
v13.uw += vrmpy(v18.ub,r25.ub)
}
{
v7.uw += vrmpy(v19.ub,r25.ub)
v3.uw += vrmpy(v19.ub,r11.ub)
}
{
v4.uw += vrmpy(v20.ub,r11.ub)
v1.uw += vrmpy(v17.ub,r11.ub)
}
{
v2.uw += vrmpy(v18.ub,r11.ub)
nop
} :endloop0
Yeah, I think prefetching might be worth doing, but the prefetch directive currently doesn't work for update definitions! I started a separate thread on that issue. I did try unrolling k; it didn't seem to matter. It does increase the proportion of packets filled with vrmpy ops, but that proportion is already so high that it doesn't make much difference.
I will make the change I am thinking of to HelloHexagon in a separate PR
src/Simplify.cpp
Outdated
// of vectors. This is a bit of a hacky heuristic to avoid
// undoing the work of AlignLoads for Hexagon.
if (!op->is_concat() || new_vectors[0].type().is_scalar()) {
// Try to convert a shuffled load into a shuffle of a load.
Not that this is terribly important, but I found the language in the comment confusing. I think you are converting
shuffle({ld(ramp(a, 1, 4)), ld(ramp(a+4, 1, 4)), ld(ramp(a+8, 1, 4))}, {2, 3, 4, 5, 6, 7, 8}) into
ld(shuffle({ramp(a, 1, 12)}, {2, 3, 4, 5, 6, 7, 8})), right?
I rephrased the comment to be a bit more precise. What it's trying to do is convert a load with shuffled indices into a shuffle of a dense load. Your example is replacing a slice(concat(load, load, load)) with load(slice(concat(idx0, idx1, idx2))), which is the opposite. Also, we don't do this to concat vectors (so as not to undo the work of AlignLoads... I wish there were a more elegant way to avoid that).
I suppose the goal of this is simplification, and arguably, dense loads are the simpler thing (maybe not from the IR perspective, but definitely so for codegen), so this heuristic isn't so unreasonable, I guess.
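To make the rewrite concrete, here's a plain-C++ sketch (no Halide types; the helper names are ours) of the equivalence being relied on: a load with shuffled indices equals a shuffle of a dense load, whenever the indices fall inside a contiguous window:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Load with shuffled indices": a gather directly from the buffer.
std::vector<int> gather(const std::vector<int> &buf,
                        const std::vector<size_t> &idx) {
    std::vector<int> out;
    for (size_t i : idx) out.push_back(buf[i]);
    return out;
}

// "Shuffle of a dense load": load a contiguous window [base, base+len),
// then pick elements out of it.
std::vector<int> dense_then_shuffle(const std::vector<int> &buf, size_t base,
                                    size_t len, const std::vector<size_t> &idx) {
    std::vector<int> dense(buf.begin() + base, buf.begin() + base + len);
    std::vector<int> out;
    for (size_t i : idx) out.push_back(dense[i - base]);  // shuffle step
    return out;
}
```

Using indices {2, 3, 4, 5, 6, 7, 8} (as in the example above), both forms produce the same result; the dense form is the one that's friendlier to codegen.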
load_predicates.push_back(load->predicate);
load_indices.push_back(load->index);
always_true = always_true && is_one(load->predicate);
// Try to collapse an interleave of slices of vectors from
Thank you for this. This is excellent. When I was first thinking of implementing vrmpy and vdmpy, I was discouraged by having to check all of this in HexagonOptimize.cpp/CodeGen_Hexagon.cpp. It is definitely better that this is handled in Simplify.cpp.
My only concern: should we do the above only when the operands interleave cleanly?
I just pushed a change that finds the vrmpy and vdmpy operands differently. It was a fairly significant code change, sorry for the churn. The new code is more robust (e.g. it can find kernels with negative coefficients that were rewritten to subtracts with a positive coefficient... doh). @pranavb-ca It is an interesting question. This instruction is certainly pushing the limits of what is reasonable to expect a compiler to do vs. just exposing an intrinsic function. vdmpy and vmpa are nice because you can just use one or the other depending on whether the operands interleave cleanly. At first, I was thinking it was a no-brainer because it was 2 vshuff packets (3 vshuff, but two can be executed in the same packet) followed by vrmpy, vs. 4 multiply-adds. But it wouldn't be 4 multiply-adds, it would be 2 vmpa instructions (and even then, expression tree balancing could fit 4 multiply-adds into 3 packets). So, I guess I should limit it to only using vrmpy when it interleaves cleanly. This makes simd_op_check pretty nasty... :)
Hmm, actually, you can't use vmpa in this case, because vmpa might overflow (it would need to produce a 32 bit result from 8 bit operands to avoid overflow, like vrmpy does). So, I take it back; I think the current lowering is the fastest thing we can do. Interleaving the operands and using vrmpy is faster than the alternative (4 vmpy + 3 vadd; note there is no vmpy for an 8 bit -> 32 bit multiply accumulate). Note that this is for exactly this pattern:
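A quick worked example (plain C++, illustrative names, not Hexagon API) of the overflow point being made here: summing four u8*u8 products can exceed 16 bits, so a narrow accumulator wraps while a 32-bit one does not:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sum of four u8*u8 products with a 32-bit accumulator: always exact.
uint32_t dot4_acc32(std::array<uint8_t, 4> a, std::array<uint8_t, 4> b) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++) acc += uint32_t(a[i]) * uint32_t(b[i]);
    return acc;
}

// Same sum with a 16-bit accumulator: wraps mod 65536 in the worst case,
// since 4 * 255 * 255 = 260100 > 65535.
uint16_t dot4_acc16(std::array<uint8_t, 4> a, std::array<uint8_t, 4> b) {
    uint16_t acc = 0;
    for (int i = 0; i < 4; i++) {
        acc = uint16_t(acc + uint32_t(a[i]) * uint32_t(b[i]));
    }
    return acc;
}
```

The worst case (all inputs 255) shows the wrap: the 32-bit sum is 260100, but the 16-bit sum is 260100 mod 65536 = 63492.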
BTW, this reminds me, what does everyone think about that being the pattern for vrmpy, as opposed to:
? Both are correct, but, the former is the best thing to write when you don't have vrmpy. I wrote it the way I did because the matrix multiply would do the right thing on both the CPU and on Hexagon. |
Also note that while we generate vrmpy for 16 bit results, we only do so if the operands interleave cleanly. If they don't, we can use vmpa, and we do (there's a simd_op_check test for that).
Regarding patterns - can we just match both? I.e., we expect to add four things, where each thing is either i32(i16(a)*b) or i32(a)*b.
That pattern actually already works; I just added a simd_op_check test for it. However, we still can't use vmpa for that pattern when the operands don't interleave, either, as the overflow issue remains. I should have chosen my words more carefully: it's not that that exact pattern is too constraining to also match vmpa, it's that the 32 bit accumulation is too big to use vmpa (a much better reason). Another thing I realized: we could use the 16 bit version of vmpa here (which avoids the overflow issue), but I still think it should be faster to interleave and use vrmpy; it would require 4 16 -> 32 bit vmpa to do the equivalent work of 1 vrmpy, plus the cost of upcasting.
Actually, since that pattern is easier to read, I decided to use it for all the simd_op_check tests (but still check for the original, more precise pattern).
src/HexagonOptimize.cpp
Outdated
@@ -63,6 +63,44 @@ namespace {
// Broadcast to an unknown number of lanes, for making patterns.
Expr bc(Expr x) { return Broadcast::make(x, 0); }

// Flatten a binary operator tree into a list of operands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this is old dead code, will remove
This PR adds support for vrmpy and vdmpy, and improves vmpa support on Hexagon. It also adds a Hexagon matrix multiply app. The matrix multiply looks pretty good, the inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule.
A detailed overview of the changes is:
There are still some TODOs here that need work, but I think it would be useful as-is for now.