-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose various integer intrinsics for Avx512F, Avx512BW, and Avx512CD #85833
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThis exposes some instructions unique to the AVX512 family of instructions making progress towards completing:
Remaining after this are the
|
const TernaryLogicInfo& TernaryLogicInfo::lookup(uint8_t control) | ||
{ | ||
// clang-format off | ||
static const TernaryLogicInfo ternaryLogicFlags[256] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table is 768 bytes and is about as small as we can make it.
The way the constants work is we have three keys:
A
:0xF0
B
:0xCC
C
:0xAA
To compute the correct control
byte, you simply perform the corresponding operation on these keys. So, if you wanted to do (A & B) ^ C
, you would compute (0xF0 & 0xCC) ^ 0xAA
or 0x6A
.
This table allows us to compute the inverse information, that is given a control, what are the operations it performs. This allows us to determine things like what operands are actually used (so we can correctly compute isRMW
) and what operations are performed and in what order (such that we can do constant folding in the future).
The total set of operations supported are:
true: AllBitsSet
false: Zero
not: ~value
and: left & right
nand: ~(left & right)
or: left | right
nor: ~(left | right)
xor: left ^ right
xnor: ~(left ^ right)
cndsel: a ? b : c; aka (B & A) | (C & ~A)
major: 0 if two+ input bits are 0
minor: 1 if two+ input bits are 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put this comment in the source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
…the JIT will never produce such nodes itself
Resolved feedback and fixed an issue where we were double encoding a |
Initial diffs from the explicit We see 9k saved bytes in There is a bit of a TP hit (up to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not super comfortable with the complexity of what happens in the importer, but I don't have a good way of simplifying and don't see anything obviously wrong anymore, so LGTM.
You may want to get a review from @BruceForstall or @kunalspathak for the emitter changes. And perhaps @EgorBo wants to look at the vectorization changes.
/// VPTERNLOGD xmm1 {k1}{z}, xmm2, xmm3/m128, imm8 | ||
/// The above native signature does not exist. We provide this additional overload for consistency with the other bitwise APIs. | ||
/// </summary> | ||
public static Vector128<sbyte> TernaryLogic(Vector128<sbyte> a, Vector128<sbyte> b, Vector128<sbyte> c, [ConstantExpected] byte control) => TernaryLogic(a, b, c, control); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's [ConstantExpected]
by the way and do we switch-expand if it's called with a variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an analyzer that flags if users aren't passing in a constant (or a value that is itself marked [ConstantExpected]
). It's not perfect, but it gets most of the job done and flags the obvious cases where we'll have to pessimize.
Yes, we fallback to the recursive call and switch expand in that. Fixing that to be done late phase (such as lowering) is desirable but requires various ABI fixes so the call is happy with the SIMD args again.
GenTree* control; | ||
|
||
control = gtNewIconNode(static_cast<uint8_t>((0xF0 | 0xCC) ^ 0xAA)); // (A | B)) ^ C | ||
xor1 = gtNewSimdTernaryLogicNode(simdType, vec1, toLowerVec1, cnsVec1, control, baseType, simdSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a sort of morph/lower pattern match optimization instead? So we can catch user patterns as well, this pattern doesn't look uncommon, ternarylogic seems to allow to fold many other pattens so would want to have a single place to do them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're catching this one early because its simple/obvious. We do want to add the morph stuff as well, but that's a much more complex/involved PR. Handling these explicit cases early will also help with JIT throughput and avoid us needing to do the much more expensive general handling logic later.
If we decide just handling it in morph is better, than I can remove these explicit cases once that support is in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some questions/comments. I assume you verified that the latest tests are executing in CI, right?
// This table is 768 bytes and is about as small as we can make it. | ||
// | ||
// The way the constants work is we have three keys: | ||
// * A: 0xF0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, how did you come up with the values of these keys? Are they related to the encoding or just something you picked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are directly part of the encoding and are covered in the architecture manuals.
// clang-format off | ||
static const TernaryLogicInfo ternaryLogicFlags[256] = { | ||
/* FALSE */ { TernaryLogicOperKind::False, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, | ||
/* norABC */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::ABC, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this work? When I read norABC
I see that as A nor B nor C
and having TernaryLogicUseFlags::ABC
as op1Use
while all the others are left None
is slightly confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
norABC
is ~(A | B | C)
. They shorthand the ones that do the same operation on all three, where-as the full name would've been something like norAorBC
Most of these control bytes do exactly 2 operations at once and they are always read as oper Input1 Input2
. So andAorBC
is And(A, Or(B, C))
(aka. A & (B | C)
).
The exception is the conditional select
instructions which do 3 operations (with the third one always being the condition code that "picks" bits from the first or second operation).
/* FALSE */ { TernaryLogicOperKind::False, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, | ||
/* norABC */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::ABC, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, | ||
/* andCnorBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::And, TernaryLogicUseFlags::C, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, | ||
/* norBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to keep the ordering of use flags representation you have here?
/* norBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, | |
/* norAB */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names are taken from the architecture docs, so its helpful to keep them matching as it makes it possible to search for the name in the manual.
We use a normalized form in our flags to keep things simpler.
const TernaryLogicInfo& info = TernaryLogicInfo::lookup(control); | ||
TernaryLogicUseFlags useFlags = info.GetAllUseFlags(); | ||
|
||
if (useFlags != TernaryLogicUseFlags::ABC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I completely follow what happens when useFlag == ABC
. Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When all three operands are used we have a real ternary instruction and don't have an optimizations or normalization to do. This is because the instruction is already "complete" (aside from constant folding) and can't represent any more state.
For the cases when all operands aren't used, then we have the ability to represent the operation more optimally. This happens by either converting it to the regular unary/binary node (e.g. gtNewSimdBinOpNode(GT_AND, ...)
) or by normalizing the format so that the used operands are B/C
(which allows us to not be marked RMW
and allows greatest chance for containment). It will also simplify the logic that will be added in a future PR to combine unary/binary ops into ternary logic nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Found a pre-existing bug in the scalar |
This exposes some instructions unique to the AVX512 family of instructions making progress towards completing:
Remaining after this are the
Scatter/Gather
andShuffle/Permute
intrinsics