
Add vrmpy, vdmpy, improve vmpa on Hexagon #1825

Merged — 27 commits merged into master from hvx-vrmpy, Feb 16, 2017

Conversation

@dsharletg (Contributor) commented Feb 9, 2017

This PR adds support for vrmpy and vdmpy, and improves vmpa support on Hexagon. It also adds a Hexagon matrix multiply app. The matrix multiply looks pretty good: the inner loop has 32 vrmpy instructions in 19 packets, with all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s, which is 4-5x faster than the CPU schedule.

A detailed overview of the changes is:

  • We now match expressions of the form ax + by, and choose vdmpy or vmpa depending on the shape of the vectors. If x and y interleave cleanly, we use vdmpy, otherwise we use vmpa.
  • We also now match expressions of the form ax + by + cz + dw and use vrmpy, even if the operands do not interleave cleanly.
  • Accumulating vmpa operations were a bit fragile; they should be more robust now.
  • Add a matrix multiply app for Hexagon.
  • There are a few new and improved simplifications for Shuffles and Loads. Loads with shuffled indices are simplified for all shuffles, not just interleavings, and interleavings of slices are now simplified.
  • Some small refactorings to the disabled simd_op_check tests, to make it easier to disable large blocks of tests.

There are still some TODOs here that need work, but I think this is useful as-is for now.
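As a rough illustration of the semantics being matched (a plain-Python sketch of my own, not code from this PR; per-lane behavior only, signedness simplified):

```python
def vrmpy_lane(x4, y4):
    """One 32-bit lane of vrmpy: a 4-way dot product of unsigned
    8-bit operands, accumulated into 32 bits so it cannot overflow."""
    return sum(xi * yi for xi, yi in zip(x4, y4))

def vdmpy_lane(x2, y2):
    """One lane of vdmpy: a 2-way dot product widened to 32 bits."""
    return x2[0] * y2[0] + x2[1] * y2[1]

# The PR matches a*x + b*y (vdmpy or vmpa, depending on whether the
# operands interleave cleanly) and a*x + b*y + c*z + d*w (vrmpy).
print(vrmpy_lane([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
print(vdmpy_lane([1, 2], [3, 4]))              # 11
```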

@@ -0,0 +1,55 @@
include ../support/Makefile.inc
Contributor:

If you put all of the intermediate stuff into a BIN subdir (a la apps/simd_op_check), .gitignore won't need updating and the world will be a happy place.

Contributor Author:

This makefile is literally a copy/paste from another app; I only changed a few command line args for the binaries used to run the app.

I totally agree that the apps need major refactoring, to use generators, etc... but I'd like to hold off on that until we can invest time in doing it consistently for all apps. What do you think?

Member:

That refactoring already happened for most apps. Copy-paste from local laplacian or bilateral grid instead. Which ones still need fixing?

Contributor Author:

Those don't support running on Android. I think that maybe it's just all the Android running apps that need refactoring.

Contributor Author:

It might literally just be HelloHexagon and this one, though I thought there were more. I copied from there in order to get Android support.

Maybe I'm thinking of simd_op_check (but that one probably needs its own custom setup anyways).

Either way... I'd like to make this fix in a separate PR.

Member:

I don't think HelloHexagon should be fixed in this PR, but I'd rather not commit new code that will immediately need fixing to match the structure of the other apps.

Contributor Author:

The thing I'm concerned about is that I'm not sure what the right thing even is. I already tried to do this for HelloHexagon once, and got bogged down with complexities (the Android toolchain stuff combined with Generator things) and aborted my attempt. I have no reason to expect to be able to solve it any better for this app right now...

Contributor:

Don't view this comment as a deal-breaker; that said, I thought all the others had been refactored into the put-the-intermediate-stuff-in-bin/ style. I'll do another pass at some point to see what I missed.

Contributor Author:

I just re-copied HelloHexagon with Andrew's modernization (thanks for doing that!) and reimplemented the matrix multiply.

using namespace Halide;
using namespace Halide::ConciseCasts;

int main(int argc, char **argv) {
Contributor:

I'll just note that this sure looks like a Generator to me :-)


I was kind of eager to try this patch.

I see an abort on conv3x3a16, conv3x3a32, and HelloHexagon:

Internal error at /local/mnt2/ronl/Halide-upstream/Halide/src/IR.cpp:21
Condition failed: a.type() == b.type()
Add of mismatched types

@dsharletg (Contributor Author):

Can you send me a repro case?

@ronlieb commented Feb 10, 2017

HelloHexagon, I think you have that one already. Is it not failing for you?

@dsharletg (Contributor Author):

OK, I wasn't actually expecting a vrmpy operation to be generated in HelloHexagon, because the outputs are 16 bit, not 32 bit. Unfortunately, I tried supporting 16 bit outputs from vrmpy, and while it works, it's slower than letting 2 vmpa/vdmpy ops be generated instead (probably because of the extra interleaving required by vrmpy).

@dsharletg (Contributor Author):

(Also, it was a bug that it generated at all, which is now fixed.)

@dsharletg (Contributor Author):

Actually, I have a better idea: we should generate vrmpy for 16 bit results if the required interleave simplifies away... will implement.

@dsharletg (Contributor Author):

The matrix multiply now looks pretty good. The inner loop has 32 vrmpy instructions in 19 packets, all accumulators stored in registers. It computes a 1024 x 1024 x 1024 matrix multiply in 0.0214 s. This is 4-5x faster than the CPU schedule. Here's the inner loop:

.Ltmp2:                                 // Block address taken
.LBB36_10:                              // %"for AB.s1.k$x"
                                        //   Parent Loop BB36_8 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
    {
                    r17 = add(r28, r7)
                    r20 = add(r28, r5)
                    r11 = add(r28, r6)
                    v17 = vmem(r10+#-4)
    }
    {
                    r21 = memw(r11 + #-4)
                    v21 = vmem(r10+#-3)
    }
    {
                    v7.uw += vrmpy(v17.ub,r21.ub)
                    r20 = memw(r20 + #-4)
                    v22 = vmem(r10+#-2)
    }
    {
                    v3.uw += vrmpy(v17.ub,r20.ub)
                    v4.uw += vrmpy(v21.ub,r20.ub)
                    r17 = memw(r17 + #-4)
                    v23 = vmem(r10+#-1)
    }
    {
                    v8.uw += vrmpy(v17.ub,r17.ub)
                    v6.uw += vrmpy(v22.ub,r17.ub)
                    r22 = memw(r28 + #-4)
                    v19 = vmem(r10+#0)
    }
    {
                    v16.uw += vrmpy(v17.ub,r22.ub)
                    v12.uw += vrmpy(v23.ub,r22.ub)
                    r11 = memw(r28 + r5<<#0)
                    v20 = vmem(r10+#1)
    }
    {
                    v1.uw += vrmpy(v22.ub,r20.ub)
                    v9.uw += vrmpy(v21.ub,r17.ub)
                    r23 = memw(r28 + r7<<#0)
                    v17 = vmem(r10+#2)
    }
    {
                    v5.uw += vrmpy(v23.ub,r17.ub)
                    v2.uw += vrmpy(v23.ub,r20.ub)
                    r17 = memw(r28 + #0)
                    v18 = vmem(r10+#3)
    }
    {
                    v13.uw += vrmpy(v23.ub,r21.ub)
                    v10.uw += vrmpy(v21.ub,r21.ub)
                    r10 = add(r10, #1024)
                    r25 = memw(r28 + r6<<#0)
    }
    {
                    v15.uw += vrmpy(v21.ub,r22.ub)
                    v14.uw += vrmpy(v22.ub,r22.ub)
                    r28 = add(r28, #8)
    }
    {
                    v11.uw += vrmpy(v22.ub,r21.ub)
                    v8.uw += vrmpy(v19.ub,r23.ub)
    }
    {
                    v12.uw += vrmpy(v18.ub,r17.ub)
                    v16.uw += vrmpy(v19.ub,r17.ub)
    }
    {
                    v14.uw += vrmpy(v17.ub,r17.ub)
                    v15.uw += vrmpy(v20.ub,r17.ub)
    }
    {
                    v6.uw += vrmpy(v17.ub,r23.ub)
                    v5.uw += vrmpy(v18.ub,r23.ub)
    }
    {
                    v9.uw += vrmpy(v20.ub,r23.ub)
                    v10.uw += vrmpy(v20.ub,r25.ub)
    }
    {
                    v11.uw += vrmpy(v17.ub,r25.ub)
                    v13.uw += vrmpy(v18.ub,r25.ub)
    }
    {
                    v7.uw += vrmpy(v19.ub,r25.ub)
                    v3.uw += vrmpy(v19.ub,r11.ub)
    }
    {
                    v4.uw += vrmpy(v20.ub,r11.ub)
                    v1.uw += vrmpy(v17.ub,r11.ub)
    }
    {
                    v2.uw += vrmpy(v18.ub,r11.ub)
                    nop
    }       :endloop0
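One way to read the blocking this inner loop implies (my own pure-Python restatement, not generated code): the 32 vrmpy accumulators correspond to a register-resident output tile, and each vrmpy retires 4 multiply-adds along the reduction dimension k:

```python
# Sketch of register-blocked matrix multiply: a tile of accumulators
# stays live across the whole k loop, and the innermost sum plays the
# role of one vrmpy (4 multiply-accumulates at once). Tile sizes here
# are illustrative, not taken from the app's schedule.
def matmul_blocked(A, B, n, tile_i=4, tile_j=8, k_step=4):
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for j0 in range(0, n, tile_j):
            # tile_i * tile_j accumulators held "in registers"
            acc = [[0] * tile_j for _ in range(tile_i)]
            for k0 in range(0, n, k_step):
                for i in range(tile_i):
                    for j in range(tile_j):
                        # one vrmpy-like step: a 4-element dot product
                        acc[i][j] += sum(A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
                                         for k in range(k_step))
            for i in range(tile_i):
                for j in range(tile_j):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

With tile_i=4 and tile_j=8 this gives 32 accumulators, matching the 32 vector registers accumulated above; the real tiling lives in the app's Halide schedule.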

@abadams (Member) commented Feb 10, 2017 via email

@dsharletg (Contributor Author):

Yeah, I think prefetching might be worth doing, but the prefetch directive currently doesn't work for update definitions! I started a separate thread on that issue.

I did try unrolling k, it didn't seem to matter. It does increase the proportion of packets that are filled with vrmpy ops, but it's already so high, so it doesn't make much difference.

@abadams (Member) commented Feb 10, 2017 via email

src/Simplify.cpp Outdated
// of vectors. This is a bit of a hacky heuristic to avoid
// undoing the work of AlignLoads for Hexagon.
if (!op->is_concat() || new_vectors[0].type().is_scalar()) {
// Try to convert a shuffled load into a shuffle of a load.
Contributor:

Not that this is terribly important, but I found the language in the comment confusing. I think you are converting
shuffle({ld(ramp(a, 1, 4)), ld(ramp(a+4, 1, 4)), ld(ramp(a+8, 1, 4))}, {2, 3, 4, 5, 6, 7, 8}) into
ld(shuffle({ramp(a, 1, 12)}, {2, 3, 4, 5, 6, 7, 8})), right?

Contributor Author:

I rephrased the comment to be a bit more precise. What it's trying to do is convert a load with shuffled indices into a shuffle of a dense load. Your example is replacing a slice(concat(load, load, load)) with load(slice(concat(idx0, idx1, idx2))), which is the opposite. Also, we don't do this to concat vectors (so as not to undo the work of AlignLoads)... I wish there were a more elegant way to avoid that.

I suppose the goal of this is simplification, and arguably, dense loads are the simpler thing (maybe not from the IR perspective, but definitely so for codegen), so this heuristic isn't so unreasonable I guess.
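As a sketch of the direction of the rewrite (my restatement in Python, not the Simplify.cpp code): a load at shuffled indices becomes one dense load plus a lane shuffle, the form that is friendlier for codegen:

```python
# A "load" here is just indexing a buffer at a list of lane indices.
def load(buf, indices):
    return [buf[i] for i in indices]

def simplify_shuffled_load(buf, indices):
    """Rewrite load(shuffled indices) as shuffle(dense load)."""
    # Load the contiguous range covering all the indices once...
    base = min(indices)
    dense = load(buf, range(base, max(indices) + 1))
    # ...then shuffle the wanted lanes out of the dense vector.
    return [dense[i - base] for i in indices]

buf = list(range(100, 120))
idx = [3, 1, 4, 1, 5]
assert load(buf, idx) == simplify_shuffled_load(buf, idx)
```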

load_predicates.push_back(load->predicate);
load_indices.push_back(load->index);
always_true = always_true && is_one(load->predicate);
// Try to collapse an interleave of slices of vectors from
Contributor:

Thank you for this. This is excellent. When I was first thinking of implementing vrmpy and vdmpy, I had been discouraged by having to check all of this in HexagonOptimize.cpp/CodeGen_Hexagon.cpp. It's definitely better that this is simplified in Simplify.cpp.

@pranavb-ca (Contributor):

Thanks for this. I am only concerned about this:
"We also now match expressions of the form ax + by + cz + dw and use vrmpy, even if the operands do not interleave cleanly."

I wonder if we should do the above only when the operands interleave cleanly?

@dsharletg (Contributor Author):

I just pushed a change that finds the vrmpy and vdmpy operands differently. It was a fairly significant code change, sorry for the churn. The new code is more robust (e.g. it can find kernels with negative coefficients that were rewritten to subtracts with a positive coefficient... doh).

@pranavb-ca It is an interesting question. This instruction is certainly pushing the limits of what is reasonable to expect a compiler to do vs. just exposing an intrinsic function. vdmpy and vmpa are nice because you can just use one or the other depending on if the operands interleave cleanly or not.

At first, I was thinking it was a no brainer because it was 2 vshuff packets (3 vshuff, but two can be executed in the same packet) followed by vrmpy, vs. 4 multiply-adds. But, it wouldn't be 4 multiply-adds, it would be 2 vmpa instructions (even then, expression tree balancing could also fit 4 multiply-adds into 3 packets).

So, I guess I should limit it to only using vrmpy when it interleaves cleanly. This makes simd_op_check pretty nasty... :)

@dsharletg (Contributor Author):

Hmm, actually, you can't use vmpa in this case, because vmpa might overflow (it would have to produce a 32 bit result from 8 bit operands to avoid overflow, like vrmpy does).

So, I take it back, I think that the current lowering is the fastest thing we can do. Interleaving the operands and using vrmpy is faster than the alternative (4 vmpy + 3 vadd, note there is no vmpy for 8 bit -> 32 bit multiply accumulate). Note that this is for exactly this pattern:

i32(i16(x)*a) + i32(i16(y)*b) + i32(i16(z)*c) + i32(i16(w)*d)

BTW, this reminds me, what does everyone think about that being the pattern for vrmpy, as opposed to:

i32(x)*a + i32(y)*b + i32(z)*c + i32(w)*d

? Both are correct, but, the former is the best thing to write when you don't have vrmpy. I wrote it the way I did because the matrix multiply would do the right thing on both the CPU and on Hexagon.
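The overflow argument can be checked with a little arithmetic (mine, not from the PR): the largest 4-term dot product of unsigned 8-bit operands overflows a 16-bit accumulator, the widest result an 8-bit vmpa produces, but fits comfortably in vrmpy's 32-bit accumulator:

```python
# Largest possible 4-term u8 dot product: all eight operands at 255.
worst = 4 * 255 * 255
fits_in_16 = worst <= 2**16 - 1   # the widest result an 8-bit vmpa gives
fits_in_32 = worst <= 2**32 - 1   # vrmpy's accumulator width
print(worst, fits_in_16, fits_in_32)  # 260100 False True
```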

@dsharletg (Contributor Author):

Also note that while we do generate vrmpy for 16 bit results, we only do so if the operands interleave cleanly. If they don't, we can use vmpa instead, and we do (there's a simd_op_check test for that).

@abadams (Member) commented Feb 14, 2017

Regarding patterns - can we just match both? I.e. we expect to add four things, where each thing is either i32(i16(a)*b) or i32(a)*b

@dsharletg (Contributor Author):

That pattern actually already works, I just added a simd_op_check test for it. However, we still can't use vmpa for that pattern when the operands don't interleave either, as the overflow issue remains. I should have chosen my words more carefully: it's not that that exact pattern is too constraining to also match vmpa, it's that the 32 bit accumulation is too big to use vmpa (a much better reason).

Another thing I realized: we could use the 16 bit version of vmpa here (that avoids the overflow issue), but I still think it should be faster to interleave and use vrmpy; it would require 4 16 -> 32 bit vmpas to do the equivalent work of 1 vrmpy, plus the cost of upcasting.
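On the pattern question above, a quick check (my arithmetic, not from the PR) that the two forms agree for unsigned 8-bit operands: a u8*u8 product always fits in 16 bits, so widening to 16 bits before the multiply loses nothing relative to widening straight to 32:

```python
# Exhaustive over all u8 pairs: the 16-bit intermediate never
# truncates, so i32(i16(x)*a) and i32(x)*i32(a) give the same value.
assert all((x * a) & 0xFFFF == x * a
           for x in range(256) for a in range(256))
print(max(x * a for x in range(256) for a in range(256)))  # 65025
```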

@dsharletg (Contributor Author):

Actually, since that pattern is easier to read, I decided to use it for all the simd_op_check tests (but still check for the original more precise pattern).

@@ -63,6 +63,44 @@ namespace {
// Broadcast to an unknown number of lanes, for making patterns.
Expr bc(Expr x) { return Broadcast::make(x, 0); }

// Flatten a binary operator tree into a list of operands.
Contributor Author:

Just noticed this is old dead code, will remove

@dsharletg dsharletg merged commit 4a99dcb into master Feb 16, 2017
@dsharletg dsharletg deleted the hvx-vrmpy branch February 24, 2017 00:52