[x86] Separate vector instruction selection and CodeGen passes #6884

Open · wants to merge 65 commits into base: main

Conversation

@rootjalex (Member) commented Jul 25, 2022

Left as a draft for now because this work is unfinished, but I was hoping to get feedback.

This PR creates a separate vector-instruction-selection pass for x86, akin to HexagonOptimize. The pass is defined in X86Optimize.(h | cpp), and as many rules as possible are written via the IRMatch.h pattern-matching templates. In order to build this TRS (term-rewriting system), this PR also adds:

  • a VectorInstruction IR node, which is not restricted to element-wise operations (e.g., it supports dot_products)
  • updates to IRMatch.h, e.g. support for bitwise operations and type hints via a new typed() method (see the sketch below)
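
A rough sketch, for flavor, of how a rule in this pass reads. The v_instr pattern, the VectorInstruction::pmaddwd enum value, and the exact matcher spellings are illustrative stand-ins, not the final API:

    // Hypothetical sketch of an x86 TRS rule: match a widening pairwise
    // multiply-reduce and emit a pmaddwd-style VectorInstruction node.
    Expr visit(const VectorReduce *op) override {
        IRMatcher::Wild<0> x;
        IRMatcher::Wild<1> y;
        const int lanes = op->type.lanes();
        auto rewrite = IRMatcher::rewriter(Expr(op), op->type);
        if (rewrite(h_add(intrin(Call::widening_mul, x, y), lanes),
                    v_instr(VectorInstruction::pmaddwd, x, y))) {
            return mutate(rewrite.result);
        }
        return InstructionSelector::visit(op);
    }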

Work that still needs to be done (possibly):

This TRS addresses the issue that @abadams raised on #6878, namely that rewriting horizontal_widening_add directly into Halide IR that corresponds to the dot_product instructions is cleaner and more extensible than the implementation in that PR.

I am tagging people who might be interested, but would love feedback from anyone (especially on the points above, and anywhere in the code that says TODO / FIXME).

// TODO(rootjalex): How do we do type hints for the args?
// TODO(rootjalex): Is there a way to do basically an unrolled
// loop of the below? this is ugly.
// Supposedly C++20 will have constexpr std::transform, perhaps

Contributor:

It will likely be a long while before Halide can rely on C++20 being available.

Member Author:

Gotcha. Is there a reason why? I thought we just upgraded to C++17, so I figured C++20 wouldn't be that far off.

Member Author:

And thanks for taking a look at this!

Contributor:

Anecdotally, C++20 is getting adopted at a much slower pace than C++11 or C++17 did. (It's not clear that there's even a timetable to adopt it inside Google.) We'd have to be confident that ~all significant Halide customers are ready to make that commitment.

Member Author:

Gotcha, thanks for the explanation! I guess I'll stick to the ugly if constexpr expressions :')


static const IRNodeType _node_type = IRNodeType::VectorInstruction;

const char *get_instruction_name() const;

Member:

This should be static and accept the enum to mirror Call::get_intrinsic_name
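
A minimal sketch of the suggested signature; the VectorInstruction::Op enum name is an assumption, chosen to mirror Call::IntrinsicOp:

    // Static lookup by enum value, mirroring Call::get_intrinsic_name.
    static const char *get_instruction_name(VectorInstruction::Op op);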

e->op == op->op &&
e->args.size() == op->args.size()) {
for (size_t i = 0; result && (i < e->args.size()); i++) {
// FIXME: should we early-out? Here and in Call*

Member:

It does early out (result is in the for-loop condition).

@@ -1803,6 +1894,7 @@ inline std::ostream &operator<<(std::ostream &s, const BroadcastOp<A, B> &op) {
template<typename A, typename B>
HALIDE_ALWAYS_INLINE auto broadcast(A &&a, B lanes) noexcept -> BroadcastOp<decltype(pattern_arg(a)), decltype(pattern_arg(lanes))> {
assert_is_lvalue_if_expr<A>();
assert_is_lvalue_if_expr<B>();

Member:

Can lanes really be a concrete Expr? Maybe assert that it is not a concrete Expr instead.

HALIDE_ALWAYS_INLINE
Expr make(MatcherState &state, halide_type_t type_hint) const {
std::vector<Expr> r_args(sizeof...(Args));
// TODO(rootjalex): How do we do type hints for the args?

Member:

I believe this is done

r_args[2] = std::get<const_min(2, sizeof...(Args) - 1)>(args).make(state, {});
}

// for (int i = 0; i < sizeof...(Args); i++) {

Member:

Remove commented-out code

template<typename... Args>
std::ostream &operator<<(std::ostream &s, const VectorInstructionOp<Args...> &op) {
// TODO(rootjalex): Should we print the type?
s << "vector_instr(\"";

Member:

vector_instr -> v_instr

@@ -2080,6 +2282,39 @@ HALIDE_ALWAYS_INLINE auto cast(halide_type_t t, A &&a) noexcept -> CastOp<declty
return {t, pattern_arg(a)};
}

// A node for expressing type hints, when rules are ambiguously typed.
template<typename A>
struct TypeHint {

Member:

If possible, make the type a template parameter instead of a member (either a C++ type or a halide_type_t), because each pattern consumes stack space in methods that use rewriters.
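
A sketch of that suggestion, assuming the hinted type is split into code/bits template parameters (names illustrative):

    // Illustrative: carrying the hint in the C++ type instead of a member
    // keeps each pattern object smaller on the stack.
    template<typename A, halide_type_code_t Code, int Bits>
    struct TypeHint {
        struct pattern_tag {};
        A a;
    };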


HALIDE_ALWAYS_INLINE
Expr make(MatcherState &state, halide_type_t type_hint) const {
return a.make(state, type);

Member:

Update this to use the template type with the lanes of the type_hint field
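
Continuing the templated TypeHint sketch above (still an assumption, not the PR's code), make() would combine the templated scalar type with the lanes of the incoming hint:

    HALIDE_ALWAYS_INLINE
    Expr make(MatcherState &state, halide_type_t type_hint) const {
        // Code and bits come from the template; lanes come from the hint.
        return a.make(state, halide_type_t(Code, Bits, type_hint.lanes));
    }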

@@ -2306,6 +2541,8 @@ template<typename A>
struct IsFloat {
struct pattern_tag {};
A a;
int bits;

Member:

bits can be a template parameter instead (and also in is_int and is_uint)
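
A sketch of the suggested shape; the same change would apply to is_int and is_uint:

    // Illustrative: bits as a template parameter removes the int member.
    template<typename A, int Bits>
    struct IsFloat {
        struct pattern_tag {};
        A a;
    };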

struct IsBFloat {
struct pattern_tag {};
A a;
int bits;

Member:

bits can be a template parameter here too

return IRGraphMutator::visit(op);
}

Expr InstructionSelector::visit(const Mod *op) {

Member:

consider lowering lerp here too

}

Expr InstructionSelector::visit(const VectorReduce *op) {
return mutate(codegen->split_vector_reduce(op, Expr()));

Member:

We should pull split_vector_reduce out of CodeGen_LLVM into the InstructionSelector and have it emit VectorInstruction nodes with enum values that correspond to the things that LLVM natively supports.

This might mean the instruction selector needs to know the native vector width, and maybe how to upgrade types for arithmetic. This could be pulled out of LLVM into common support code, e.g. in CodeGen_Internal.cpp

Member:

(because currently we believe that this will cause InstructionSelector to emit call nodes to llvm intrinsics)
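
A rough sketch of how that refactor might look; the constructor parameters and method placement are assumptions, not a settled design:

    // Illustrative: split_vector_reduce moves out of CodeGen_LLVM into the
    // selector, parameterized by target info hoisted into common support
    // code (e.g. CodeGen_Internal.cpp).
    class InstructionSelector : public IRGraphMutator {
    public:
        InstructionSelector(const Target &target, int native_vector_bits);

    protected:
        // Emits VectorInstruction nodes for the reductions LLVM supports
        // natively, rather than Call nodes to llvm intrinsics.
        Expr split_vector_reduce(const VectorReduce *op, const Expr &init);
    };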

@@ -213,6 +214,12 @@ void ComputeModulusRemainder::visit(const Shuffle *op) {
result = ModulusRemainder{};
}

void ComputeModulusRemainder::visit(const VectorInstruction *op) {
internal_error << "modulus_remainder of VectorInstruction:\n"

Member:

This seems likely to trigger because ModulusRemainder is used inside codegen for alignment queries.

Member:

Copy what the VectorReduce visitor does
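
A minimal sketch of that suggestion, modeled on the Shuffle visitor shown in the hunk above:

    void ComputeModulusRemainder::visit(const VectorInstruction *op) {
        // Give up on alignment info rather than erroring, as the Shuffle
        // and VectorReduce visitors do.
        result = ModulusRemainder{};
    }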

@@ -535,6 +535,11 @@ class DerivativeBounds : public IRVisitor {
result = ConstantInterval::single_point(0);
}

void visit(const VectorInstruction *op) override {
// TODO(rootjalex): Should this be an error?

Member:

No, just remove this todo

@@ -59,6 +59,11 @@ Expr Simplify::visit(const Broadcast *op, ExprInfo *bounds) {
}
}

Expr Simplify::visit(const VectorInstruction *op, ExprInfo *bounds) {
clear_bounds_info(bounds);

Member:

This should recursively simplify the args
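
A sketch of what that could look like; VectorInstruction::make and the args field follow this PR's (unmerged) node definition, so treat the details as assumptions:

    Expr Simplify::visit(const VectorInstruction *op, ExprInfo *bounds) {
        clear_bounds_info(bounds);
        // Recursively simplify the arguments, rebuilding only on change.
        std::vector<Expr> new_args(op->args.size());
        bool changed = false;
        for (size_t i = 0; i < op->args.size(); i++) {
            new_args[i] = mutate(op->args[i], nullptr);
            changed |= !new_args[i].same_as(op->args[i]);
        }
        if (!changed) {
            return op;
        }
        return VectorInstruction::make(op->type, op->op, std::move(new_args));
    }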

namespace Halide {
namespace Internal {

#if defined(WITH_X86)

Member:

I don't think this is necessary. It's only needed if the file is going to refer to LLVM x86-specific stuff.

}

/** A top-down code optimizer that replaces Halide IR with VectorInstructions specific to x86. */
class Optimize_X86 : public InstructionSelector {

Member:

InstructionSelector_X86 for the class and file?


using IRGraphMutator::mutate;
Expr mutate(const Expr &e) override {
Expr expr = IRGraphMutator::mutate(e);

Member:

Everywhere you have IRGraphMutator:: here should probably be InstructionSelector::

}

protected:
bool should_peephole_optimize(const Type &type) {

Member:

Not convinced this is necessary or helpful. It may well be that we want to apply these rewrites to scalars too.

// as a series of rewrite-rules? lossless_cast is the hardest part.
const int lanes = op->type.lanes();

// FIXME: should we check for accumulating dot_products first?

Member:

remove this fixme

return mutate(rewrite.result);
}

// TODO: should we handle CodeGen_X86's weird 8 -> 16 bit issue here?

Member:

Should resolve this TODO one way or the other


// Fixed-point intrinsics should be lowered here.
// This is safe because this mutator is top-down.
// FIXME: Should this be default behavior of the base InstructionSelector class?

Member:

Address FIXME

}

Expr visit(const VectorReduce *op) override {
// FIXME: We need to split up VectorReduce nodes in the same way that

Member:

Remove FIXME

ret <8 x i32> %1
}

define weak_odr <8 x i32> @wmul_pmaddwd_avx2(<8 x i16> %a, <8 x i16> %b) nounwind alwaysinline {

Member:

This should be removed from x86.ll if it's still there

@rootjalex (Member Author) commented Sep 25, 2022

Just to update any watchers of this PR: I fully intend to address Andrew's review points, but there will be a delay in getting it done, due to moving and general busyness at the start of the semester.

@abadams (Member) commented Sep 28, 2022

I have discovered some suboptimal codegen on x86 which I believe would be best fixed in this PR. The following generates a single pmaddwd:

    Var x;

    Func f1, f2;
    f1(x) = cast<uint8_t>(x);
    f2(x) = cast<uint8_t>(x);

    Func g;
    g(x) = cast<int32_t>(0);
    RDom r(0, 2);
    g(x) += cast<int32_t>(f1(2 * x + r)) * f2(2 * x + r);

    g.update().atomic().vectorize(r).vectorize(x, 8);
    f1.compute_root();
    f2.compute_root();

But if you change the int32_t to a uint32_t, it generates much worse code, even though the high bit is known to be zero because it's widening from a u8. If you try to work around it by adding a uint32_t cast outside the widening mul, then the outer cast is just removed.

@rootjalex (Member Author):

> But if you change the int32_t to a uint32_t, it generates much worse code, even though the high bit is known to be zero because it's widening from a u8. If you try to work around it by adding a uint32_t cast outside the widening mul, then the outer cast is just removed.

Yep, definitely seems like a useful rule to add to this PR - it would be a bit annoying to add to the current rule set in CodeGen_X86 due to the reinterprets that would be used. I will add this rule when I am addressing the rest of the comments!
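
For illustration, the rule might look roughly like this in the new TRS, reusing the hypothetical pattern spellings from the sketch near the top (x, y, lanes, rewrite as declared there); the reinterpret is expressed as an int32 -> uint32 cast on the result, which is the part that would be awkward in CodeGen_X86's current rule set:

    // Hypothetical: a u8 -> u32 unsigned widening dot product can use the
    // signed pmaddwd, because zero-extended u8 values fit in int16 and the
    // pairwise product sum (at most 2 * 255 * 255) is a nonnegative int32,
    // so casting the result back to uint32 is lossless.
    if (rewrite(h_add(cast(UInt(32, lanes), intrin(Call::widening_mul, x, y)), lanes),
                cast(UInt(32, lanes), v_instr(VectorInstruction::pmaddwd, x, y)),
                is_uint(x, 8) && is_uint(y, 8))) {
        return mutate(rewrite.result);
    }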
