Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

dsharletg · 2017-02-09T23:27:21Z

This PR adds support for vrmpy and vdmpy, and improves vmpa support on Hexagon. It also adds a Hexagon matrix multiply app. The matrix multiply looks pretty good, the inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule.

A detailed overview of the changes is:

We now match expressions of the form ax + by, and choose vdmpy or vmpa depending on the shape of the vectors. If x and y interleave cleanly, we use vdmpy, otherwise we use vmpa.
We also now match expressions of the form ax + by + cz + dw and use vrmpy, even if the operands do not interleave cleanly.
Accumulating vmpa operations were a bit fragile, they should be more robust now.
Add a matrix multiply app for Hexagon.
There are a few new and improved simplifications for Shuffles and Loads. Loads with shuffled indices are simplified for all shuffles, not just interleavings, and interleavings of slices are now simplified.
Some small refactorings to disabled simd_op_check tests to make it easier to disable large blocks of tests.

There's still some TODOs here that need work, but I think it would be useful as-is for now.

steven-johnson · 2017-02-09T23:32:56Z

apps/hexagon_matmul/Makefile

@@ -0,0 +1,55 @@
+include ../support/Makefile.inc


If you put all of the intermediate stuff into a BIN subdir (a la apps/simd_op_check), .gitignore won't need updating and the world will be a happy place.

This makefile is literally a copy/paste of another app, I only changed a few command line args to the process binaries for running the app.

I totally agree that the apps need major refactoring, to use generators, etc... but I'd like to hold off on that until we can invest time in doing it consistently for all apps. What do you think?

That refactoring already happened for most apps. Copy-paste from local laplacian or bilateral grid instead. Which ones still need fixing?

Those don't support running on Android. I think that maybe it's just all the Android running apps that need refactoring.

It might literally just be HelloHexagon and this one, though I thought there were more. I copied from there in order to get Android support.

Maybe I'm thinking of simd_op_check (but that one probably needs it's own custom setup anyways).

Either way... I'd like to make this fix in a separate PR.

I don't think HelloHexagon should be fixed in this PR, but I'd rather not commit new code that will immediately need fixing to match the structure of the other apps.

The thing I'm concerned about is that I'm not sure what the right thing even is. I already tried to do this for HelloHexagon once, and got bogged down with complexities (the Android toolchain stuff combined with Generator things) and aborted my attempt. I have no reason to expect to be able to solve it any better for this app right now...

Don't view comment as a deal-breaker; that said, I thought all the others had been refactored into put-the-intermediate-stuff-in-bin/ style. I'll do another pass at some point to see what I missed.

I just re-copied HelloHexagon with Andrew's modernization (thanks for doing that!) and reimplemented the matrix multiply.

steven-johnson · 2017-02-09T23:34:35Z

apps/hexagon_matmul/pipeline.cpp

+using namespace Halide;
+using namespace Halide::ConciseCasts;
+
+int main(int argc, char **argv) {


I'll just note that this sure looks like a Generator to me :-)

i was kind of eager to try this patch .

i see an abort on conv3x3a16 and conv3x3a32 and helloHexagon:

Internal error at /local/mnt2/ronl/Halide-upstream/Halide/src/IR.cpp:21
Condition failed: a.type() == b.type()
Add of mismatched types

dsharletg · 2017-02-10T00:22:57Z

Can you send me a repro case?

ronlieb · 2017-02-10T00:29:53Z

HelloHexagon, i think you have that one already. is it not failing for you?

dsharletg · 2017-02-10T03:29:17Z

OK, I wasn't actually expecting vrmpy operation to generate in HelloHexagon, because the outputs are 16 bit, not 32 bit. Unfortunately, I tried supporting 16 bit outputs from vrmpy, and while it works, it's slower than letting 2 vmpa/vdmpy generate instead (probably because of the extra interleaving required by vrmpy).

dsharletg · 2017-02-10T03:29:31Z

(Also, it was a bug that the generated, which is now fixed.)

dsharletg · 2017-02-10T03:32:19Z

Actually, I got a better idea: we should generate vrmpy for 16 bit results, if the required interleave simplifies away... will implement.

dsharletg · 2017-02-10T17:36:08Z

The matrix multiply now looks pretty good. The inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule. Here's the inner loop:

.Ltmp2:                                 // Block address taken
.LBB36_10:                              // %"for AB.s1.k$x"
                                        //   Parent Loop BB36_8 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
    {
                    r17 = add(r28, r7)
                    r20 = add(r28, r5)
                    r11 = add(r28, r6)
                    v17 = vmem(r10+#-4)
    }
    {
                    r21 = memw(r11 + #-4)
                    v21 = vmem(r10+#-3)
    }
    {
                    v7.uw += vrmpy(v17.ub,r21.ub)
                    r20 = memw(r20 + #-4)
                    v22 = vmem(r10+#-2)
    }
    {
                    v3.uw += vrmpy(v17.ub,r20.ub)
                    v4.uw += vrmpy(v21.ub,r20.ub)
                    r17 = memw(r17 + #-4)
                    v23 = vmem(r10+#-1)
    }
    {
                    v8.uw += vrmpy(v17.ub,r17.ub)
                    v6.uw += vrmpy(v22.ub,r17.ub)
                    r22 = memw(r28 + #-4)
                    v19 = vmem(r10+#0)
    }
    {
                    v16.uw += vrmpy(v17.ub,r22.ub)
                    v12.uw += vrmpy(v23.ub,r22.ub)
                    r11 = memw(r28 + r5<<#0)
                    v20 = vmem(r10+#1)
    }
    {
                    v1.uw += vrmpy(v22.ub,r20.ub)
                    v9.uw += vrmpy(v21.ub,r17.ub)
                    r23 = memw(r28 + r7<<#0)
                    v17 = vmem(r10+#2)
    }
    {
                    v5.uw += vrmpy(v23.ub,r17.ub)
                    v2.uw += vrmpy(v23.ub,r20.ub)
                    r17 = memw(r28 + #0)
                    v18 = vmem(r10+#3)
    }
    {
                    v13.uw += vrmpy(v23.ub,r21.ub)
                    v10.uw += vrmpy(v21.ub,r21.ub)
                    r10 = add(r10, #1024)
                    r25 = memw(r28 + r6<<#0)
    }
    {
                    v15.uw += vrmpy(v21.ub,r22.ub)
                    v14.uw += vrmpy(v22.ub,r22.ub)
                    r28 = add(r28, #8)
    }
    {
                    v11.uw += vrmpy(v22.ub,r21.ub)
                    v8.uw += vrmpy(v19.ub,r23.ub)
    }
    {
                    v12.uw += vrmpy(v18.ub,r17.ub)
                    v16.uw += vrmpy(v19.ub,r17.ub)
    }
    {
                    v14.uw += vrmpy(v17.ub,r17.ub)
                    v15.uw += vrmpy(v20.ub,r17.ub)
    }
    {
                    v6.uw += vrmpy(v17.ub,r23.ub)
                    v5.uw += vrmpy(v18.ub,r23.ub)
    }
    {
                    v9.uw += vrmpy(v20.ub,r23.ub)
                    v10.uw += vrmpy(v20.ub,r25.ub)
    }
    {
                    v11.uw += vrmpy(v17.ub,r25.ub)
                    v13.uw += vrmpy(v18.ub,r25.ub)
    }
    {
                    v7.uw += vrmpy(v19.ub,r25.ub)
                    v3.uw += vrmpy(v19.ub,r11.ub)
    }
    {
                    v4.uw += vrmpy(v20.ub,r11.ub)
                    v1.uw += vrmpy(v17.ub,r11.ub)
    }
    {
                    v2.uw += vrmpy(v18.ub,r11.ub)
                    nop
    }       :endloop0

abadams · 2017-02-10T18:27:00Z

Look at all that empty space in the later packets. I wonder if there's anything useful that could be jammed in there. Prefetching? Some add instructions from the next loop iteration? I assume you've already tried different unrolling factors along k.

…

On Fri, Feb 10, 2017 at 11:36 AM, Dillon Sharlet ***@***.***> wrote: The matrix multiply now looks pretty good. The inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule. Here's the inner loop: .Ltmp2: // Block address taken .LBB36_10: // %"for AB.s1.k$x" // Parent Loop BB36_8 Depth=1 // => This Inner Loop Header: Depth=2 { r17 = add(r28, r7) r20 = add(r28, r5) r11 = add(r28, r6) v17 = vmem(r10+#-4) } { r21 = memw(r11 + #-4) v21 = vmem(r10+#-3) } { v7.uw += vrmpy(v17.ub,r21.ub) r20 = memw(r20 + #-4) v22 = vmem(r10+#-2) } { v3.uw += vrmpy(v17.ub,r20.ub) v4.uw += vrmpy(v21.ub,r20.ub) r17 = memw(r17 + #-4) v23 = vmem(r10+#-1) } { v8.uw += vrmpy(v17.ub,r17.ub) v6.uw += vrmpy(v22.ub,r17.ub) r22 = memw(r28 + #-4) v19 = vmem(r10+#0) } { v16.uw += vrmpy(v17.ub,r22.ub) v12.uw += vrmpy(v23.ub,r22.ub) r11 = memw(r28 + r5<<#0) v20 = vmem(r10+#1) } { v1.uw += vrmpy(v22.ub,r20.ub) v9.uw += vrmpy(v21.ub,r17.ub) r23 = memw(r28 + r7<<#0) v17 = vmem(r10+#2) } { v5.uw += vrmpy(v23.ub,r17.ub) v2.uw += vrmpy(v23.ub,r20.ub) r17 = memw(r28 + #0) v18 = vmem(r10+#3) } { v13.uw += vrmpy(v23.ub,r21.ub) v10.uw += vrmpy(v21.ub,r21.ub) r10 = add(r10, #1024) r25 = memw(r28 + r6<<#0) } { v15.uw += vrmpy(v21.ub,r22.ub) v14.uw += vrmpy(v22.ub,r22.ub) r28 = add(r28, #8) } { v11.uw += vrmpy(v22.ub,r21.ub) v8.uw += vrmpy(v19.ub,r23.ub) } { v12.uw += vrmpy(v18.ub,r17.ub) v16.uw += vrmpy(v19.ub,r17.ub) } { v14.uw += vrmpy(v17.ub,r17.ub) v15.uw += vrmpy(v20.ub,r17.ub) } { v6.uw += vrmpy(v17.ub,r23.ub) v5.uw += vrmpy(v18.ub,r23.ub) } { v9.uw += vrmpy(v20.ub,r23.ub) v10.uw += vrmpy(v20.ub,r25.ub) } { v11.uw += vrmpy(v17.ub,r25.ub) v13.uw += vrmpy(v18.ub,r25.ub) } { v7.uw += vrmpy(v19.ub,r25.ub) v3.uw += vrmpy(v19.ub,r11.ub) } { v4.uw += vrmpy(v20.ub,r11.ub) v1.uw += vrmpy(v17.ub,r11.ub) } { v2.uw += vrmpy(v18.ub,r11.ub) nop } :endloop0 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1825 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAfdRhKJPQUUtY0ev8g1sX0Nbo_-hsmlks5rbKAKgaJpZM4L8zWP> .

dsharletg · 2017-02-10T19:01:28Z

Yeah, I think prefetching might be worth doing, but the prefetch directive currently doesn't work for update definitions! I started a separate thread on that issue.

I did try unrolling k, it didn't seem to matter. It does increase the proportion of packets that are filled with vrmpy ops, but it's already so high, so it doesn't make much difference.

abadams · 2017-02-10T21:28:52Z

I will make the change I am thinking of to HelloHexagon in a separate PR

…

On Fri, Feb 10, 2017 at 3:22 PM, Dillon Sharlet ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In apps/hexagon_matmul/Makefile <#1825>: > @@ -0,0 +1,55 @@ +include ../support/Makefile.inc The thing I'm concerned about is that I'm not sure what the right thing even is. I already tried to do this for HelloHexagon once, and got bogged down with complexities (the Android toolchain stuff combined with Generator things) and aborted my attempt. I have no reason to expect to be able to solve it any better for this app right now... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1825>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAfdRgk3Tze2yOfNPuuYidbsgQqFMdNfks5rbNUZgaJpZM4L8zWP> .

pranavb-ca · 2017-02-10T22:27:00Z

src/Simplify.cpp

+        // of vectors. This is a bit of a hacky heuristic to avoid
+        // undoing the work of AlignLoads for Hexagon.
+        if (!op->is_concat() || new_vectors[0].type().is_scalar()) {
+            // Try to convert a shuffled load into a shuffle of a load.


Not that this is terribly important, but I found the language in the comment confusing. I think you are converting
shuffle({ld(ramp(a, 1, 4), ld(ramp(a+4, 1, 4), ld(ramp(a+8, 1, 4)}, {2, 3, 4, 5, 6, 7, 8}) into
ld(shuffle({ramp(a, 1, 12)},{2, 3, 4, 5, 6, 7, 8}), right?

I rephrased the comment to be a bit more precise. What it's trying to do is convert a load with shuffled indices into a shuffle of a dense load. Your example is replacing a slice(concat(load, load, load)) with load(slice(concat(idx0, idx1, idx2))), which is the opposite, but also, we don't do this to concat vectors (so as not to undo the work of AlignLoads... I wish there were a more elegant way to avoid that.

I suppose the goal of this is simplification, and arguably, dense loads are the simpler thing (maybe not from the IR perspective, but definitely so for codegen), so this heuristic isn't so unreasonable I guess.

pranavb-ca · 2017-02-10T22:30:47Z

src/Simplify.cpp

-                        load_predicates.push_back(load->predicate);
-                        load_indices.push_back(load->index);
-                        always_true = always_true && is_one(load->predicate);
+            // Try to collapse an interleave of slices of vectors from


Thank you for this. This is excellent. When I was first thinking of implementing vrmpy and vdmpy I had been discouraged by having to check all of this in HexagonOptimize.cpp/CodeGen_Hexagon.cpp. But, this is definitely better that it is simplified in Simplify.cpp.

pranavb-ca · 2017-02-10T22:32:29Z

Thanks for this. I am only concerned about this
"We also now match expressions of the form ax + by + cz + dw and use vrmpy, even if the operands do not interleave cleanly."

I wonder if we should do the above only whne the operands interleave cleanly?

multiply-adds.

dsharletg · 2017-02-14T00:42:35Z

I just pushed a change that finds the vrmpy and vdmpy operands differently. It was a fairly significant code change, sorry for the churn. The new code is more robust (e.g. it can find kernels with negative coefficients that were rewritten to subtracts with a positive coefficient... doh).

@pranavb-ca It is an interesting question. This instruction is certainly pushing the limits of what is reasonable to expect a compiler to do vs. just exposing an intrinsic function. vdmpy and vmpa are nice because you can just use one or the other depending on if the operands interleave cleanly or not.

At first, I was thinking it was a no brainer because it was 2 vshuff packets (3 vshuff, but two can be executed in the same packet) followed by vrmpy, vs. 4 multiply-adds. But, it wouldn't be 4 multiply-adds, it would be 2 vmpa instructions (even then, expression tree balancing could also fit 4 multiply-adds into 3 packets).

So, I guess I should limit it to only using vrmpy when it interleaves cleanly. This makes simd_op_check pretty nasty... :)

dsharletg · 2017-02-14T02:28:22Z

Hmm, actually, you can't use vmpa in this case, because vmpa might overflow (it should produce a 32 bit results from 8 bit operands if it were to avoid overflow, like vrmpy does).

So, I take it back, I think that the current lowering is the fastest thing we can do. Interleaving the operands and using vrmpy is faster than the alternative (4 vmpy + 3 vadd, note there is no vmpy for 8 bit -> 32 bit multiply accumulate). Note that this is for exactly this pattern:

i32(i16(x)*a) + i32(i16(y)*b) + i32(i16(z)*c) + i32(i16(w)*d

BTW, this reminds me, what does everyone think about that being the pattern for vrmpy, as opposed to:

i32(x)*a + i32(y)*b + i32(z)*c + i32(w)*d

? Both are correct, but, the former is the best thing to write when you don't have vrmpy. I wrote it the way I did because the matrix multiply would do the right thing on both the CPU and on Hexagon.

dsharletg · 2017-02-14T02:45:22Z

Also note that while we generate vrmpy for 16 bit results, note that we only do so if the operands interleave cleanly. If they don't, then we can use vmpa, and it does (there's a simd_op_check test for that).

abadams · 2017-02-14T18:09:24Z

Regarding patterns - can we just match both? I.e. we expect to add four things, where each thing is either i32(i16(a)*b) or i32(a)*b

dsharletg · 2017-02-14T18:25:48Z

That pattern actually already works, I just added a simd_op_check test for it. However, we still can't use vmpa for that pattern when the operands don't interleave either, as the overflow issue remains. I should have chosen my words more carefully: it's not that that exact pattern is too constraining to also match vmpa, it's that the 32 bit accumulation is too big to use vmpa (a much better reason).

Another thing I realized: we could use the 16 bit version of vmpa here (that avoids the overflow issue), but I still think it should be faster to interleave and use vrmpy, it would require 4 16 -> 32 bit vmpa to do the equivalent work of 1 vrmpy, plus the cost of upcasting.

dsharletg · 2017-02-14T18:37:50Z

Actually, since that pattern is easier to read, I decided to use it for all the simd_op_check tests (but still check for the original more precise pattern).

dsharletg · 2017-02-14T19:50:33Z

src/HexagonOptimize.cpp

@@ -63,6 +63,44 @@ namespace {
 // Broadcast to an unknown number of lanes, for making patterns.
 Expr bc(Expr x) { return Broadcast::make(x, 0); }

+// Flatten a binary operator tree into a list of operands.


Just noticed this is old dead code, will remove

dsharletg added 8 commits February 8, 2017 08:32

Add support for some vdmpy and vrmpy instructions.

b0dfbfe

Many vmpa, vdmpy, vrmpy instructions now generating, but fragile.

3e33381

Use small vectors to pass scalar multiply arguments.

9ccc954

Add simplifications for Shuffle.

9d23a89

Handle horizontal reductions in call_intrin.

ff804ca

Pass scalar multipliers via scalars instead of short vectors.

1910931

Add Hexagon matrix multiply example app.

7a4788e

Merge branch 'master' of https://github.com/halide/Halide into hvx-vrmpy

a825823

steven-johnson reviewed Feb 9, 2017

View reviewed changes

Don't reinterpret vector operands as scalars.

8fa0028

Fix a few issues with 16 bit vrmpy results.

b88a0d8

dsharletg added 6 commits February 9, 2017 19:54

Only generate vrmpy for 16 bit results when the operands interleave.

52dfb1d

Fast matrix multiply on HVX.

8882e7e

More straightforward schedule for swizzling.

7cdee74

Remove straggler Var.

056e92d

Simpler schedule, with comments.

de9e1e7

Simplify shuffles with equal vectors, not just the same vectors.

4456012

Don't simplify concatenated vector loads.

f726c8a

pranavb-ca reviewed Feb 10, 2017

View reviewed changes

Merge branch 'master' of https://github.com/halide/Halide into hvx-vrmpy

3378eb3

dsharletg added 4 commits February 10, 2017 14:42

Require less alignment on the k dimension.

b24c6f8

Better implementation of finding operands for horizontally reducing

9bb7d29

multiply-adds.

Merge branch 'master' of https://github.com/halide/Halide into hvx-vrmpy

706eb8f

Modernize hexagon_matmul app.

e82ddbd

Merge branch 'master' of https://github.com/halide/Halide into hvx-vrmpy

0f04f75

Add alternative pattern for vrmpy.

906ceba

Use more concise pattern for vrmpy tests.

9510bc0

dsharletg commented Feb 14, 2017

View reviewed changes

dsharletg added 2 commits February 14, 2017 12:09

Remove dead code, improve comment.

df26ecf

Fix mutator crash after #1844.

ff35689

dsharletg merged commit 4a99dcb into master Feb 16, 2017

dsharletg deleted the hvx-vrmpy branch February 24, 2017 00:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

dsharletg commented Feb 9, 2017 •

edited

Loading

steven-johnson Feb 9, 2017

dsharletg Feb 10, 2017

abadams Feb 10, 2017

dsharletg Feb 10, 2017

dsharletg Feb 10, 2017

abadams Feb 10, 2017

dsharletg Feb 10, 2017

steven-johnson Feb 10, 2017

dsharletg Feb 14, 2017

steven-johnson Feb 9, 2017

ronlieb Feb 10, 2017

dsharletg commented Feb 10, 2017

ronlieb commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

abadams commented Feb 10, 2017 via email

dsharletg commented Feb 10, 2017

abadams commented Feb 10, 2017 via email

pranavb-ca Feb 10, 2017

dsharletg Feb 14, 2017

pranavb-ca Feb 10, 2017

pranavb-ca commented Feb 10, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

abadams commented Feb 14, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

dsharletg Feb 14, 2017

Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

Conversation

dsharletg commented Feb 9, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dsharletg commented Feb 10, 2017

ronlieb commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

dsharletg commented Feb 10, 2017

abadams commented Feb 10, 2017 via email

dsharletg commented Feb 10, 2017

abadams commented Feb 10, 2017 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pranavb-ca commented Feb 10, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

abadams commented Feb 14, 2017

dsharletg commented Feb 14, 2017

dsharletg commented Feb 14, 2017

Choose a reason for hiding this comment

dsharletg commented Feb 9, 2017 •

edited

Loading