-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[x86] Separate vector instruction selection and CodeGen passes #6884
base: main
Are you sure you want to change the base?
Conversation
// TODO(rootjalex): How do we do type hints for the args? | ||
// TODO(rootjalex): Is there a way to do basically an unrolled | ||
// loop of the below? this is ugly. | ||
// Supposedly C++20 will have constexpr std::transform, perhaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will likely be a long while before Halide can rely on C++20 being available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Is there a reason why? I thought we just upgraded to C++17, so I figured C++20 wouldn't be that far off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And thanks for taking a look at this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anecdotally, C++20 is getting adopted at a much slower pace than C++11 or C++17 did. (It's not clear that there's even a timetable to adopt it inside Google.) We'd have to be confident that ~all significant Halide customers are ready to make that commitment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks for the explanation! I guess I'll stick to the ugly if constexpr
expressions :')
|
||
static const IRNodeType _node_type = IRNodeType::VectorInstruction; | ||
|
||
const char *get_instruction_name() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be static and accept the enum to mirror Call::get_intrinsic_name
e->op == op->op && | ||
e->args.size() == op->args.size()) { | ||
for (size_t i = 0; result && (i < e->args.size()); i++) { | ||
// FIXME: should we early-out? Here and in Call* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does early out (result is in the for loop condition)
@@ -1803,6 +1894,7 @@ inline std::ostream &operator<<(std::ostream &s, const BroadcastOp<A, B> &op) { | |||
template<typename A, typename B> | |||
HALIDE_ALWAYS_INLINE auto broadcast(A &&a, B lanes) noexcept -> BroadcastOp<decltype(pattern_arg(a)), decltype(pattern_arg(lanes))> { | |||
assert_is_lvalue_if_expr<A>(); | |||
assert_is_lvalue_if_expr<B>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can lanes really be a concrete Expr? Maybe assert not concrete expr instead.
HALIDE_ALWAYS_INLINE | ||
Expr make(MatcherState &state, halide_type_t type_hint) const { | ||
std::vector<Expr> r_args(sizeof...(Args)); | ||
// TODO(rootjalex): How do we do type hints for the args? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is done
r_args[2] = std::get<const_min(2, sizeof...(Args) - 1)>(args).make(state, {}); | ||
} | ||
|
||
// for (int i = 0; i < sizeof...(Args); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out code
template<typename... Args> | ||
std::ostream &operator<<(std::ostream &s, const VectorInstructionOp<Args...> &op) { | ||
// TODO(rootjalex): Should we print the type? | ||
s << "vector_instr(\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector_instr -> v_instr
@@ -2080,6 +2282,39 @@ HALIDE_ALWAYS_INLINE auto cast(halide_type_t t, A &&a) noexcept -> CastOp<declty | |||
return {t, pattern_arg(a)}; | |||
} | |||
|
|||
// A node for expressing type hints, when rules are ambiguously typed. | |||
template<typename A> | |||
struct TypeHint { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, make the type a template parameter instead of a member (either a C++ type or a halide_type_t), because each pattern consumes stack space in methods that use rewriters.
|
||
HALIDE_ALWAYS_INLINE | ||
Expr make(MatcherState &state, halide_type_t type_hint) const { | ||
return a.make(state, type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this to use the template type with the lanes of the type_hint field
@@ -2306,6 +2541,8 @@ template<typename A> | |||
struct IsFloat { | |||
struct pattern_tag {}; | |||
A a; | |||
int bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bits can be a template parameter instead (and also in is_int and is_uint)
struct IsBFloat { | ||
struct pattern_tag {}; | ||
A a; | ||
int bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bits can be template parameter
return IRGraphMutator::visit(op); | ||
} | ||
|
||
Expr InstructionSelector::visit(const Mod *op) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider lowering lerp here too
} | ||
|
||
Expr InstructionSelector::visit(const VectorReduce *op) { | ||
return mutate(codegen->split_vector_reduce(op, Expr())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should pull split_vector_reduce out of CodeGen_LLVM into the InstructionSelector and have it emit VectorInstruction nodes with enum values that correspond to the things that LLVM natively supports.
This might mean the instruction selector needs to know the native vector width, and maybe how to upgrade types for arithmetic. This could be pulled out of LLVM into common support code, e.g. in CodeGen_Internal.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(because currently we believe that this will cause InstructionSelector to emit call nodes to llvm intrinsics)
@@ -213,6 +214,12 @@ void ComputeModulusRemainder::visit(const Shuffle *op) { | |||
result = ModulusRemainder{}; | |||
} | |||
|
|||
void ComputeModulusRemainder::visit(const VectorInstruction *op) { | |||
internal_error << "modulus_remainder of VectorInstruction:\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems likely to trigger because ModulusRemainder is used inside codegen for alignment queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy what the VectorReduce visitor does
@@ -535,6 +535,11 @@ class DerivativeBounds : public IRVisitor { | |||
result = ConstantInterval::single_point(0); | |||
} | |||
|
|||
void visit(const VectorInstruction *op) override { | |||
// TODO(rootjalex): Should this be an error? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just remove this todo
@@ -59,6 +59,11 @@ Expr Simplify::visit(const Broadcast *op, ExprInfo *bounds) { | |||
} | |||
} | |||
|
|||
Expr Simplify::visit(const VectorInstruction *op, ExprInfo *bounds) { | |||
clear_bounds_info(bounds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should recursively simplify the args
namespace Halide { | ||
namespace Internal { | ||
|
||
#if defined(WITH_X86) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary. It's only necessary if the file is going to refer to llvm x86-specific stuff
} | ||
|
||
/** A top-down code optimizer that replaces Halide IR with VectorInstructions specific to x86. */ | ||
class Optimize_X86 : public InstructionSelector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InstructionSelector_X86 for the class and file?
|
||
using IRGraphMutator::mutate; | ||
Expr mutate(const Expr &e) override { | ||
Expr expr = IRGraphMutator::mutate(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere you have IRGraphMutator:: here should probably be InstructionSelector::
} | ||
|
||
protected: | ||
bool should_peephole_optimize(const Type &type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not convinced this is necessary or helpful. It may well be that we want to apply these rewrites to scalars too.
// as a series of rewrite-rules? lossless_cast is the hardest part. | ||
const int lanes = op->type.lanes(); | ||
|
||
// FIXME: should we check for accumulating dot_products first? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this fixme
return mutate(rewrite.result); | ||
} | ||
|
||
// TODO: should we handle CodeGen_X86's weird 8 -> 16 bit issue here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should resolve this TODO one way or the other
|
||
// Fixed-point intrinsics should be lowered here. | ||
// This is safe because this mutator is top-down. | ||
// FIXME: Should this be default behavior of the base InstructionSelector class? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address FIXME
} | ||
|
||
Expr visit(const VectorReduce *op) override { | ||
// FIXME: We need to split up VectorReduce nodes in the same way that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove FIXME
ret <8 x i32> %1 | ||
} | ||
|
||
define weak_odr <8 x i32> @wmul_pmaddwd_avx2(<8 x i16> %a, <8 x i16> %b) nounwind alwaysinline { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be removed from x86.ll if it's still there
Just to update any watchers of this PR: I fully intend to address Andrew's review points, but there will be a delay in getting it done, due to moving / general busy-ness with the start of the semester. |
I have discovered some suboptimal codegen on x86 which I believe would be best fixed in this PR. The following generates a single pmaddwd:
But if you change the int32_t to a uint32_t, it generates much worse code, even though the high bit is known to be zero because it's widening from a u8. If you try to work around it by adding a uint32_t cast outside the widening mul, then the outer cast is just removed. |
Yep, definitely seems like a useful rule to add to this PR - it would be a bit annoying to add to the current rule set in CodeGen_X86 due to the reinterprets that would be used. I will add this rule when I am addressing the rest of the comments! |
Left as a draft for now because this work is unfinished, but I was hoping to get feedback.This PR is intended to create a separate vector-instruction-selection pass for x86, akin to HexagonOptimize. That pass is defined in X86Optimize.(h | cpp), and as many rules as possible are written via the IRMatch.h templating. In order to successfully create this TRS, this PR also adds:
dot_product
s)typed()
methodWork that still needs to be done (possibly):
CodeGen_LLVM::codegen_vector_reduce
to split up VectorReduce nodesdot_product
pattern, and thepsadbw
from Add support for generating x86 sum-of-absolute-difference reductions #6872This TRS addresses the issue that @abadams raised on #6878, namely that rewriting
horizontal_widening_add
directly into Halide IR that corresponds to thedot_product
instructions is cleaner and more extensible than the implementation in that PR.I am tagging people that might be interested, but would love feedback from anyone (especially on the points above, and anywhere in the code that says TODO / FIXME).