[x86] Separate vector instruction selection and CodeGen passes #6884

Open · wants to merge 65 commits into base: main

Conversation

@rootjalex (Member) commented Jul 25, 2022

Left as a draft for now because this work is unfinished, but I was hoping to get feedback.

This PR creates a separate vector-instruction-selection pass for x86, akin to HexagonOptimize. The pass is defined in X86Optimize.(h | cpp), and as many rules as possible are written via the IRMatch.h pattern-matching templates. In order to build this TRS (term-rewriting system), this PR also adds:

  • a VectorInstruction IR node, which is not restricted to element-wise operations (e.g., it supports dot_products)
  • updates to IRMatch.h, e.g. support for bitwise operations and type hints via a new typed() method (see the sketch below)
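
A rough sketch, for flavor, of how a rule in this pass reads. The v_instr pattern, the VectorInstruction::pmaddwd enum value, and the exact matcher spellings are illustrative stand-ins, not the final API:

    // Hypothetical sketch of an x86 TRS rule: match a widening pairwise
    // multiply-reduce and emit a pmaddwd-style VectorInstruction node.
    Expr visit(const VectorReduce *op) override {
        IRMatcher::Wild<0> x;
        IRMatcher::Wild<1> y;
        const int lanes = op->type.lanes();
        auto rewrite = IRMatcher::rewriter(Expr(op), op->type);
        if (rewrite(h_add(intrin(Call::widening_mul, x, y), lanes),
                    v_instr(VectorInstruction::pmaddwd, x, y))) {
            return mutate(rewrite.result);
        }
        return InstructionSelector::visit(op);
    }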

Work that still needs to be done (possibly):

This TRS addresses the issue that @abadams raised on #6878, namely that rewriting horizontal_widening_add directly into Halide IR that corresponds to the dot_product instructions is cleaner and more extensible than the implementation in that PR.

I am tagging people who might be interested, but would love feedback from anyone (especially on the points above, and anywhere in the code that says TODO / FIXME).

// TODO(rootjalex): How do we do type hints for the args?
// TODO(rootjalex): Is there a way to do basically an unrolled
// loop of the below? this is ugly.
// Supposedly C++20 will have constexpr std::transform, perhaps

Contributor:

It will likely be a long while before Halide can rely on C++20 being available.

Member Author:

Gotcha. Is there a reason why? I thought we just upgraded to C++17, so I figured C++20 wouldn't be that far off.

Member Author:

And thanks for taking a look at this!

Contributor:

Anecdotally, C++20 is getting adopted at a much slower pace than C++11 or C++17 did. (It's not clear that there's even a timetable to adopt it inside Google.) We'd have to be confident that ~all significant Halide customers are ready to make that commitment.

Member Author:

Gotcha, thanks for the explanation! I guess I'll stick to the ugly if constexpr expressions :')


static const IRNodeType _node_type = IRNodeType::VectorInstruction;

const char *get_instruction_name() const;

Member:

This should be static and accept the enum to mirror Call::get_intrinsic_name
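
A minimal sketch of the suggested signature; the VectorInstruction::Op enum name is an assumption, chosen to mirror Call::IntrinsicOp:

    // Static lookup by enum value, mirroring Call::get_intrinsic_name.
    static const char *get_instruction_name(VectorInstruction::Op op);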

e->op == op->op &&
e->args.size() == op->args.size()) {
for (size_t i = 0; result && (i < e->args.size()); i++) {
// FIXME: should we early-out? Here and in Call*

Member:

It does early out (result is in the for-loop condition).

@@ -1803,6 +1894,7 @@ inline std::ostream &operator<<(std::ostream &s, const BroadcastOp<A, B> &op) {
template<typename A, typename B>
HALIDE_ALWAYS_INLINE auto broadcast(A &&a, B lanes) noexcept -> BroadcastOp<decltype(pattern_arg(a)), decltype(pattern_arg(lanes))> {
assert_is_lvalue_if_expr<A>();
assert_is_lvalue_if_expr<B>();

Member:

Can lanes really be a concrete Expr? Maybe assert that it is not a concrete Expr instead.

HALIDE_ALWAYS_INLINE
Expr make(MatcherState &state, halide_type_t type_hint) const {
std::vector<Expr> r_args(sizeof...(Args));
// TODO(rootjalex): How do we do type hints for the args?

Member:

I believe this is done

r_args[2] = std::get<const_min(2, sizeof...(Args) - 1)>(args).make(state, {});
}

// for (int i = 0; i < sizeof...(Args); i++) {

Member:

Remove commented-out code

template<typename... Args>
std::ostream &operator<<(std::ostream &s, const VectorInstructionOp<Args...> &op) {
// TODO(rootjalex): Should we print the type?
s << "vector_instr(\"";

Member:

vector_instr -> v_instr

@@ -2080,6 +2282,39 @@ HALIDE_ALWAYS_INLINE auto cast(halide_type_t t, A &&a) noexcept -> CastOp<declty
return {t, pattern_arg(a)};
}

// A node for expressing type hints, when rules are ambiguously typed.
template<typename A>
struct TypeHint {

Member:

If possible, make the type a template parameter instead of a member (either a C++ type or a halide_type_t), because each pattern consumes stack space in methods that use rewriters.
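
A sketch of that suggestion, assuming the hinted type is split into code/bits template parameters (names illustrative):

    // Illustrative: carrying the hint in the C++ type instead of a member
    // keeps each pattern object smaller on the stack.
    template<typename A, halide_type_code_t Code, int Bits>
    struct TypeHint {
        struct pattern_tag {};
        A a;
    };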


HALIDE_ALWAYS_INLINE
Expr make(MatcherState &state, halide_type_t type_hint) const {
return a.make(state, type);

Member:

Update this to use the template type with the lanes of the type_hint field
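
Continuing the templated TypeHint sketch above (still an assumption, not the PR's code), make() would combine the templated scalar type with the lanes of the incoming hint:

    HALIDE_ALWAYS_INLINE
    Expr make(MatcherState &state, halide_type_t type_hint) const {
        // Code and bits come from the template; lanes come from the hint.
        return a.make(state, halide_type_t(Code, Bits, type_hint.lanes));
    }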

@@ -2306,6 +2541,8 @@ template<typename A>
struct IsFloat {
struct pattern_tag {};
A a;
int bits;

Member:

bits can be a template parameter instead (and also in is_int and is_uint)
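
A sketch of the suggested shape; the same change would apply to is_int and is_uint:

    // Illustrative: bits as a template parameter removes the int member.
    template<typename A, int Bits>
    struct IsFloat {
        struct pattern_tag {};
        A a;
    };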

struct IsBFloat {
struct pattern_tag {};
A a;
int bits;

Member:

bits can be a template parameter here too

return IRGraphMutator::visit(op);
}

Expr InstructionSelector::visit(const Mod *op) {

Member:

consider lowering lerp here too

}

Expr InstructionSelector::visit(const VectorReduce *op) {
return mutate(codegen->split_vector_reduce(op, Expr()));

Member:

We should pull split_vector_reduce out of CodeGen_LLVM into the InstructionSelector and have it emit VectorInstruction nodes with enum values that correspond to the things that LLVM natively supports.

This might mean the instruction selector needs to know the native vector width, and maybe how to upgrade types for arithmetic. This could be pulled out of LLVM into common support code, e.g. in CodeGen_Internal.cpp

Member:

(because currently we believe that this will cause InstructionSelector to emit call nodes to llvm intrinsics)
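
A rough sketch of how that refactor might look; the constructor parameters and method placement are assumptions, not a settled design:

    // Illustrative: split_vector_reduce moves out of CodeGen_LLVM into the
    // selector, parameterized by target info hoisted into common support
    // code (e.g. CodeGen_Internal.cpp).
    class InstructionSelector : public IRGraphMutator {
    public:
        InstructionSelector(const Target &target, int native_vector_bits);

    protected:
        // Emits VectorInstruction nodes for the reductions LLVM supports
        // natively, rather than Call nodes to llvm intrinsics.
        Expr split_vector_reduce(const VectorReduce *op, const Expr &init);
    };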

@@ -213,6 +214,12 @@ void ComputeModulusRemainder::visit(const Shuffle *op) {
result = ModulusRemainder{};
}

void ComputeModulusRemainder::visit(const VectorInstruction *op) {
internal_error << "modulus_remainder of VectorInstruction:\n"

Member:

This seems likely to trigger because ModulusRemainder is used inside codegen for alignment queries.

Member:

Copy what the VectorReduce visitor does
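
A minimal sketch of that suggestion, modeled on the Shuffle visitor shown in the hunk above:

    void ComputeModulusRemainder::visit(const VectorInstruction *op) {
        // Give up on alignment info rather than erroring, as the Shuffle
        // and VectorReduce visitors do.
        result = ModulusRemainder{};
    }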

@@ -535,6 +535,11 @@ class DerivativeBounds : public IRVisitor {
result = ConstantInterval::single_point(0);
}

void visit(const VectorInstruction *op) override {
// TODO(rootjalex): Should this be an error?

Member:

No, just remove this todo

@@ -59,6 +59,11 @@ Expr Simplify::visit(const Broadcast *op, ExprInfo *bounds) {
}
}

Expr Simplify::visit(const VectorInstruction *op, ExprInfo *bounds) {
clear_bounds_info(bounds);

Member:

This should recursively simplify the args
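
A sketch of what that could look like; VectorInstruction::make and the args field follow this PR's (unmerged) node definition, so treat the details as assumptions:

    Expr Simplify::visit(const VectorInstruction *op, ExprInfo *bounds) {
        clear_bounds_info(bounds);
        // Recursively simplify the arguments, rebuilding only on change.
        std::vector<Expr> new_args(op->args.size());
        bool changed = false;
        for (size_t i = 0; i < op->args.size(); i++) {
            new_args[i] = mutate(op->args[i], nullptr);
            changed |= !new_args[i].same_as(op->args[i]);
        }
        if (!changed) {
            return op;
        }
        return VectorInstruction::make(op->type, op->op, std::move(new_args));
    }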

namespace Halide {
namespace Internal {

#if defined(WITH_X86)

Member:

I don't think this is necessary. It's only needed if the file is going to refer to LLVM x86-specific stuff.

}

/** A top-down code optimizer that replaces Halide IR with VectorInstructions specific to x86. */
class Optimize_X86 : public InstructionSelector {

Member:

InstructionSelector_X86 for the class and file?


using IRGraphMutator::mutate;
Expr mutate(const Expr &e) override {
Expr expr = IRGraphMutator::mutate(e);

Member:

Everywhere you have IRGraphMutator:: here should probably be InstructionSelector::

}

protected:
bool should_peephole_optimize(const Type &type) {

Member:

Not convinced this is necessary or helpful. It may well be that we want to apply these rewrites to scalars too.

// as a series of rewrite-rules? lossless_cast is the hardest part.
const int lanes = op->type.lanes();

// FIXME: should we check for accumulating dot_products first?

Member:

remove this fixme

return mutate(rewrite.result);
}

// TODO: should we handle CodeGen_X86's weird 8 -> 16 bit issue here?

Member:

Should resolve this TODO one way or the other


// Fixed-point intrinsics should be lowered here.
// This is safe because this mutator is top-down.
// FIXME: Should this be default behavior of the base InstructionSelector class?

Member:

Address FIXME

}

Expr visit(const VectorReduce *op) override {
// FIXME: We need to split up VectorReduce nodes in the same way that

Member:

Remove FIXME

ret <8 x i32> %1
}

define weak_odr <8 x i32> @wmul_pmaddwd_avx2(<8 x i16> %a, <8 x i16> %b) nounwind alwaysinline {

Member:

This should be removed from x86.ll if it's still there

@rootjalex (Member Author) commented Sep 25, 2022

Just to update any watchers of this PR: I fully intend to address Andrew's review points, but there will be a delay in getting it done, due to moving and general busyness at the start of the semester.

@abadams (Member) commented Sep 28, 2022

I have discovered some suboptimal codegen on x86 which I believe would be best fixed in this PR. The following generates a single pmaddwd:

    Var x;

    Func f1, f2;
    f1(x) = cast<uint8_t>(x);
    f2(x) = cast<uint8_t>(x);

    Func g;
    g(x) = cast<int32_t>(0);
    RDom r(0, 2);
    g(x) += cast<int32_t>(f1(2 * x + r)) * f2(2 * x + r);

    g.update().atomic().vectorize(r).vectorize(x, 8);
    f1.compute_root();
    f2.compute_root();

But if you change the int32_t to a uint32_t, it generates much worse code, even though the high bit is known to be zero because it's widening from a u8. If you try to work around it by adding a uint32_t cast outside the widening mul, then the outer cast is just removed.

@rootjalex (Member Author):

> But if you change the int32_t to a uint32_t, it generates much worse code, even though the high bit is known to be zero because it's widening from a u8. If you try to work around it by adding a uint32_t cast outside the widening mul, then the outer cast is just removed.

Yep, definitely seems like a useful rule to add to this PR - it would be a bit annoying to add to the current rule set in CodeGen_X86 due to the reinterprets that would be used. I will add this rule when I am addressing the rest of the comments!
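
For illustration, the rule might look roughly like this in the new TRS, reusing the hypothetical pattern spellings from the sketch near the top (x, y, lanes, rewrite as declared there); the reinterpret is expressed as an int32 -> uint32 cast on the result, which is the part that would be awkward in CodeGen_X86's current rule set:

    // Hypothetical: a u8 -> u32 unsigned widening dot product can use the
    // signed pmaddwd, because zero-extended u8 values fit in int16 and the
    // pairwise product sum (at most 2 * 255 * 255) is a nonnegative int32,
    // so casting the result back to uint32 is lossless.
    if (rewrite(h_add(cast(UInt(32, lanes), intrin(Call::widening_mul, x, y)), lanes),
                cast(UInt(32, lanes), v_instr(VectorInstruction::pmaddwd, x, y)),
                is_uint(x, 8) && is_uint(y, 8))) {
        return mutate(rewrite.result);
    }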
