all: support AVX-512 #20

Closed
mmcloughlin opened this issue Dec 31, 2018 · 48 comments
Labels
enhancement New feature or request

Comments

@mmcloughlin
Owner

For complexity reasons AVX-512 was not initially considered. We should add support.

avo/internal/load/load.go

Lines 130 to 133 in 9fbb71b

// TODO(mbm): support AVX512
if strings.HasPrefix(isa.ID, "AVX512") {
	return false
}

@quasilyte

Can I provide any help here?

@rleiwang

@mmcloughlin thanks for the effort. Is there any planned date AVX512 support can be added?

@mmcloughlin
Owner Author

@mmcloughlin thanks for the effort. Is there any planned date AVX512 support can be added?

@rleiwang I don't have an exact date, but I am aware that AVX-512 is now an essential feature for this library, since real hardware deployment of AVX-512 is now far more widespread.

I'm trying to finish off another side project right now, then there's another little thing I have to do, but after that I'd like to focus on this. Maybe I'll start work in a couple of weeks.

@mmcloughlin
Owner Author

The opcodes database only supports a subset of AVX-512 extensions.

avo$ cat internal/data/x86_64.xml  | grep AVX512 | grep ISA | sort | uniq -c | sort -nr
   1710       <ISA id="AVX512F"/>
   1498       <ISA id="AVX512VL"/>
    552       <ISA id="AVX512BW"/>
    252       <ISA id="AVX512DQ"/>
     30       <ISA id="AVX512CD"/>
     24       <ISA id="AVX512VBMI"/>
     20       <ISA id="AVX512ER"/>
     16       <ISA id="AVX512PF"/>
     12       <ISA id="AVX512IFMA"/>
      4       <ISA id="AVX512VPOPCNTDQ"/>

Using the opcodes database is the simplest way to get some AVX-512 support for now. Complete support would probably rely on XED #23.

@vsivsi
Contributor

vsivsi commented Dec 3, 2020

I'm very interested in getting this issue solved. If the 6 month lapse in activity is any hint, it seems to have stalled out (for now).

I've been using Avo for some AVX2 coding with great success, but I'm not yet an "expert" in either. That said, if there's any way I can contribute to getting this issue moving, I'd be happy to help.

I did some quick poking around, and it appears that Avo already knows about the AVX-512 registers (well, the ZMMs, not sure about the masks). The xml file referenced above seems to have all of the information for at least a large subset of the most commonly deployed AVX-512 features (e.g. those being widely deployed in the 10th Gen Core architecture rollout). Since all of the Avo x86 instruction API code seems to be auto-generated (from that file?), I'm assuming the code generator is where the bulk of the needed work is concentrated. That and register allocation? Just trying to sort out the next couple of concrete steps.

Or do you have some grander vision for the package that is less incremental?

@mmcloughlin
Owner Author

@vsivsi Sorry about this. My progress on open-source work kinda hit a wall this year due to what I could only describe as COVID malaise. Then I moved apartments, now I'm trying to apply to grad school (including UW!). But this is very much on my radar, and I'm hoping to return to this once the grad school deadlines have passed (December 15). I also have an M1 mac on order, so I'm wondering about avo for arm64 as well!

As for the technical details, yes you've got it about right. I need to add support for masked registers. Then it's a matter of extending the instruction code generator to output those AVX512* instruction types. The XML doesn't contain absolutely everything but I expect this would be fine for almost all use cases.

What's the timeline for your work? How soon do you need this?

@vsivsi
Contributor

vsivsi commented Dec 3, 2020

Well, need is probably putting it too strongly... But I'd like to have it ASAP. The Go assembler works fine for AVX-512 as far as I've been able to test, so manually writing code (or writing simple bespoke generators) is the fallback.

Bringing up ARM64 is interesting, and it has its (semi-)equivalent Neon SIMD/vector extensions. Would you say that Avo was designed with the necessary flexibility to mostly just plug-in different ISA targets? Or would it be a major rewrite to generalize?

BTW, as a part of supporting AVX-512, it might be great if Avo autogenerated the necessary architecture/CPUID checks for a given assembly function to either switch to a less restrictive assembly function (e.g., AVX2 vs AVX-512), fall back to a pure-Go implementation, or, failing that, return an error/panic when the runtime CPU doesn't support required features. That kind of boilerplate is tricky. It's helpful that Avo already tracks the CPUID dependencies of the used instructions, so that is a logical next step (and would also ease multi-architecture ARM/Intel implementations as well!)
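For readers unfamiliar with the dispatch boilerplate being described, here is a rough sketch of what such generated code might look like. Everything in it is hypothetical: the `features` struct stands in for real CPU detection (in practice the flags could come from `cpu.X86.HasAVX2` and `cpu.X86.HasAVX512F` in golang.org/x/sys/cpu), and `sumAVX2`/`sumAVX512` are placeholders for assembly stubs. It is not something avo generates today.

```go
package main

import (
	"errors"
	"fmt"
)

// features is a hypothetical stand-in for CPU feature detection results.
type features struct {
	AVX2    bool
	AVX512F bool
}

type sumFunc func(xs []float64) float64

// sumGeneric is the pure-Go fallback.
func sumGeneric(xs []float64) float64 {
	total := 0.0
	for _, x := range xs {
		total += x
	}
	return total
}

// sumAVX2 and sumAVX512 are placeholders. In a real package they would be
// assembly stubs; here they call the generic code so the sketch runs anywhere.
func sumAVX2(xs []float64) float64   { return sumGeneric(xs) }
func sumAVX512(xs []float64) float64 { return sumGeneric(xs) }

var errUnsupported = errors.New("no implementation for this CPU")

// selectSum picks the most capable implementation the CPU supports. If no
// vector path is available it falls back to pure Go when allowGeneric is set,
// and otherwise reports an error.
func selectSum(f features, allowGeneric bool) (string, sumFunc, error) {
	switch {
	case f.AVX512F:
		return "avx512", sumAVX512, nil
	case f.AVX2:
		return "avx2", sumAVX2, nil
	case allowGeneric:
		return "generic", sumGeneric, nil
	}
	return "", nil, errUnsupported
}

func main() {
	name, sum, err := selectSum(features{AVX2: true}, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(name, sum([]float64{1, 2, 3}))
}
```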

Well this is promising... I guess my next question is, if I were to throw, say, 4 hours at this, what should my next move be? Sussing out the Kn mask register situation in the code? Figuring out what might be missing from the XML data and how to pull that into the "internal database"? I'm pretty resourceful, but with something like this, it's just a bit hard to know where to productively start!

@lukechampine
Contributor

Just to chime in here -- it's technically possible to use AVX-512 instructions in Avo today, you just need to do a lot of the heavy lifting yourself. Here's an example from one of my projects.

@vsivsi
Contributor

vsivsi commented Dec 3, 2020

@lukechampine That's excellent! I had wondered how difficult it would be to just define a few instructions using Avo's lower level interfaces, and it seems from your linked code to be pretty straightforward. Also, seems like a decent starting place to grok a little better how all of this works internally within Avo before diving in to solve this issue more generally. Thanks!

@mmcloughlin
Owner Author

mmcloughlin commented Dec 21, 2020

Goals

  • Backwards Compatible. Don't break any existing avo users.
  • Mimic Go Syntax. If possible, AVX-512 avo syntax should look the same as it would in Go assembler.
  • Exhaustive Support. Support all AVX-512 instructions, if possible.

Features

Masking

"Instructions that support masking can omit K register operand."

"K register should be placed right before destination operand."

VADDPD.Z (AX), Z30, K3, Z10

Zeroing

"Zeroing-masking can be activated with Z opcode suffix."

Broadcast

"For reg-mem instructions with m32bcst/m64bcst operand, broadcasting can be turned on with BCST opcode suffix."

Rounding

"For reg-reg FP instructions with {er} enabled, rounding opcode suffix can be specified:"

VADDPD.RU_SAE Z3, Z2, K1, Z1
VADDPD.RD_SAE Z3, Z2, K1, Z1
VADDPD.RZ_SAE Z3, Z2, K1, Z1
VADDPD.RN_SAE Z3, Z2, K1, Z1

SAE

"For reg-reg FP instructions with {sae} enabled, exception suppression can be specified with SAE opcode suffix."

VMAXPD.SAE.Z Z3, Z2, K1, Z1

Register Blocks

Opcodes database does not support 4VNNIW or 4FMAPS extensions.

VP4DPWSSD 7(SI)(DI*1), [Z2-Z5], K4, Z23

Note: these instructions are only available in Knights Mill processors at the moment, and do not seem to be a priority.

Other Libraries

Go

The Go assembler explicitly represents instruction suffixes using the Scond field of the obj.Prog type.

https://github.com/golang/go/blob/89b44b4e2bb2f88474d6b8476f5c28ea2aea9b28/src/cmd/internal/obj/link.go#L301

See the following file for EVEX support in the x86 assembler:

https://github.com/golang/go/blob/89b44b4e2bb2f88474d6b8476f5c28ea2aea9b28/src/cmd/internal/obj/x86/evex.go

PeachPy

  • KRegister register type
  • RegisterMask specifies a mask register and zeroing mode
  • MaskedRegister is a Register and RegisterMask

Examples:

# zeroing, rounding control
VADDSS(xmm30(k2.z), xmm4, xmm19, {rn_sae})

# Broadcast
VEXPANDPS(zmm9(k6.z), zword[r8 + 64])

asmjit

The following are part of the instruction:

  • Mask Selection
  • Zeroing
  • Rounding Control & Suppress Exceptions

https://asmjit.com/doc/classasmjit_1_1x86_1_1Assembler.html

#include <asmjit/x86.h>

using namespace asmjit;

void generateAVX512Code(x86::Assembler& a) {
  using namespace x86;

  // Opmask Selectors
  // ----------------
  //
  //   - Opmask / zeroing is part of the instruction options / extraReg.
  //   - k(reg) is like {kreg} in Intel syntax.
  //   - z() is like {z} in Intel syntax.

  // vaddpd zmm0 {k1} {z}, zmm1, zmm2
  a.k(k1).z().vaddpd(zmm0, zmm1, zmm2);

  // Memory Broadcasts
  // -----------------
  //
  //   - Broadcast data is part of memory operand.
  //   - Use x86::Mem::_1toN(), which returns a new x86::Mem operand.

  // vaddpd zmm0 {k1} {z}, zmm1, [rcx] {1to8}
  a.k(k1).z().vaddpd(zmm0, zmm1, x86::mem(rcx)._1to8());

  // Embedded Rounding & Suppress-All-Exceptions
  // -------------------------------------------
  //
  //   - Rounding mode and {sae} are part of instruction options.
  //   - Use sae() to enable exception suppression.
  //   - Use rn_sae(), rd_sae(), ru_sae(), and rz_sae() - to enable rounding.
  //   - Embedded rounding implicitly sets {sae} as well, that's why the API
  //     also has sae() suffix, to make it clear.

  // vcmppd k1, zmm1, zmm2, 0x00 {sae}
  a.sae().vcmppd(k1, zmm1, zmm2, 0);

  // vaddpd zmm0, zmm1, zmm2 {rz}
  a.rz_sae().vaddpd(zmm0, zmm1, zmm2);
}

Opcodes Database

The Opcodes database is going to be the easiest way to incorporate AVX-512 into avo for now.

Note that rounding modes and exception suppression are handled as operands, for example:

    <InstructionForm gas-name="vcvtsd2si" xmm-mode="AVX">
      <ISA id="AVX512F"/>
      <Operand type="r32" input="false" output="true"/>
      <Operand type="xmm" input="true" output="false"/>
      <Operand type="{er}"/>
      <Encoding>
        <EVEX mm="01" pp="11" LL="#2" W="0" vvvv="0000" V="0" RR="#0" B="#1" X="#1" b="#2" aaa="000" z="0"/>
        <Opcode byte="2D"/>
        <ModRM mode="11" reg="#0" rm="#1"/>
      </Encoding>
    </InstructionForm>

or:

    <InstructionForm gas-name="vucomiss" xmm-mode="AVX">
      <ISA id="AVX512F"/>
      <Operand type="xmm" input="true" output="false"/>
      <Operand type="xmm" input="true" output="false"/>
      <Operand type="{sae}"/>
      <Encoding>
        <EVEX mm="01" pp="00" W="0" vvvv="0000" V="0" RR="#0" B="#1" X="#1" b="#2" aaa="000" z="0"/>
        <Opcode byte="2E"/>
        <ModRM mode="11" reg="#0" rm="#1"/>
      </Encoding>
    </InstructionForm>
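Because `{er}` and `{sae}` appear as ordinary `<Operand>` elements, a loader presumably has to strip them out and turn them into per-form flags. A minimal sketch of that step follows. The names `xmlForm`, `parsedForm` and `parseForm` are invented for illustration and are not avo's actual loader types. The sample is the vcvtsd2si form above with the encoding omitted.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// xmlForm mirrors only the parts of <InstructionForm> this sketch needs.
type xmlForm struct {
	GasName  string `xml:"gas-name,attr"`
	Operands []struct {
		Type string `xml:"type,attr"`
	} `xml:"Operand"`
}

// parsedForm separates real operands from the {er}/{sae} pseudo-operands.
type parsedForm struct {
	Name             string
	Operands         []string
	EmbeddedRounding bool
	SuppressAll      bool
}

// parseForm decodes one <InstructionForm> element and converts the {er} and
// {sae} pseudo-operands into boolean flags.
func parseForm(src string) (parsedForm, error) {
	var raw xmlForm
	if err := xml.Unmarshal([]byte(src), &raw); err != nil {
		return parsedForm{}, err
	}
	p := parsedForm{Name: raw.GasName}
	for _, op := range raw.Operands {
		switch op.Type {
		case "{er}":
			p.EmbeddedRounding = true
		case "{sae}":
			p.SuppressAll = true
		default:
			p.Operands = append(p.Operands, op.Type)
		}
	}
	return p, nil
}

const sampleForm = `<InstructionForm gas-name="vcvtsd2si" xmm-mode="AVX">
  <ISA id="AVX512F"/>
  <Operand type="r32" input="false" output="true"/>
  <Operand type="xmm" input="true" output="false"/>
  <Operand type="{er}"/>
</InstructionForm>`

func main() {
	p, err := parseForm(sampleForm)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```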

asmdb

https://github.com/asmjit/asmdb/blob/9bb794dc539c048bba8b6e089b116b65c8bef49b/x86data.js#L43-L49

//   * "{op}"  - Optional operand. Mostly used by AVX_512:
//
//               - {k} mask selector.
//               - {z} zeroing.
//               - {1tox} broadcast.
//               - {er} embedded-rounding.
//               - {sae} suppress-all-exceptions.

In reality I don't see {1tox} syntax used. They use b64.

https://github.com/asmjit/asmdb/blob/9bb794dc539c048bba8b6e089b116b65c8bef49b/x86data.js#L2846-L2848

    ["vaddpd"           , "W:xmm {kz},~xmm,~xmm/m128/b64"                   , "RVM-FV"  , "EVEX.128.66.0F.W1 58 /r"      , "AVX512_F-VL"],
    ["vaddpd"           , "W:ymm {kz},~ymm,~ymm/m256/b64"                   , "RVM-FV"  , "EVEX.256.66.0F.W1 58 /r"      , "AVX512_F-VL"],
    ["vaddpd"           , "W:zmm {kz},~zmm,~zmm/m512/b64 {er}"              , "RVM-FV"  , "EVEX.512.66.0F.W1 58 /r"      , "AVX512_F"],

Resources

@mmcloughlin
Owner Author

mmcloughlin commented Dec 21, 2020

Hmm, trying to come up with a good syntax for opcode suffixes.

Example Go assembly:

VADDPD.RD_SAE.Z Z3, Z2, K1, Z1

Possible syntax:

// Accept variadic flags.
VADDPD(Z3, Z2, K1, Z1, mode.Z, mode.RD_SAE)

// Codegen all combinations.
VADDPD_RD_SAE_Z(Z3, Z2, K1, Z1)

// Function decorator.
Mod(VADDPD, mode.RD_SAE, mode.Z)(Z3, Z2, K1, Z1)

// Underscore version that takes flags.
VADDPD_(mode.RD_SAE, mode.Z)(Z3, Z2, K1, Z1)

// Methods.
VADDPD(Z3, Z2, K1, Z1).RD_SAE().Z()

Thoughts?

@lukechampine
Contributor

I think generating all combinations is the most consistent with avo's existing design. It also seems like the easiest API for converting existing asm. The biggest downsides are that the namespace becomes a lot more cluttered (but let's be real, the avo namespace is always going to be enormous) and there could be some annoyances relating to how suffixes are ordered (namely, there's only one legal ordering per combination).

Another consideration is that (unless I'm mistaken) the .Z suffix is going to be very common. If you're typing out tons of instructions, VADDPD_Z is far nicer than any of the alternatives.

Failing that, I think my preference is for the "decorator" or "underscore" APIs, because they make it easy to do this:

var VADDPD_ZB = VADDPD_(mode.Z, mode.BCST)

I defined a bunch of helper functions like this for my blake3 implementation. This feels fairly ergonomic, since most of the time you only need a tiny subset of the possible suffix combinations. Of course, you can accomplish this using any of the suggested APIs; it's just a little nicer to do it concisely in one line with var rather than a full func.

@vsivsi
Contributor

vsivsi commented Dec 21, 2020

I’m not super opinionated about this stuff so long as it’s done consistently throughout a design. A couple of principles that haven’t been mentioned:

  1. Please design APIs that are autocomplete friendly. Is it helpful to start typing an opcode name and have a bunch of variants drop down? Honestly not sure. One of the benefits of Avo is that autocomplete works, unlike Go assembly.

  2. Generally speaking, parameterizing is superior to hard coding. It means less repetition, which is kind of the whole point of Avo, right?

Also, think about whether/how to make the opmask register optional as well. Having to explicitly specify K0 everywhere would be cluttered, and would make spotting non-default masks more difficult.

@vsivsi
Contributor

vsivsi commented Dec 21, 2020

Now that I’ve had the chance to read everything through again, I think I’d vote for a slightly modded version of the “methods“ design.

VADDPD().RD_SAE().Z(Z3, Z2, K1, Z1)

This has the advantage of reading closer to the generated assembly, being autocomplete friendly, and still supporting the case @lukechampine describes above:

var VPADDD_ZB = VADDPD().RD_SAE().Z

And since each function would need to be variadic to handle 0 or 4 parameters, might as well support 3 as well:

VADDPD().RD_SAE().Z(Z3, Z2, Z1)  // K0 implied

Thoughts?

@mmcloughlin
Owner Author

Thanks for your input @lukechampine @vsivsi.

I think generating all combinations is the most consistent with avo's existing design. It also seems like the easiest API for converting existing asm. The biggest downsides are that the namespace becomes a lot more cluttered (but let's be real, the avo namespace is always going to be enormous) and there could be some annoyances relating to how suffixes are ordered (namely, there's only one legal ordering per combination).

Maybe this is the point you're trying to make, but I think at the moment there is only one legal ordering in Go assembly itself. Specifically, https://github.com/golang/go/wiki/AVX512 says "It is important to put zeroing opcode suffix last, otherwise it is a compilation error." That's implemented here:

https://github.com/golang/go/blob/89b44b4e2bb2f88474d6b8476f5c28ea2aea9b28/src/cmd/internal/obj/x86/evex.go#L292

As you say, the avo namespace is already huge, so I'm not especially concerned about that. However, even with the small number of suffixes, it could cause substantial bloat over what we already have. And if more are added we'll have a combinatorial explosion.
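To get a feel for the bloat, the number of generated functions per form can be estimated with a few lines of Go. This is a back-of-the-envelope sketch, not avo's generator: `formFlags` and `suffixCombinations` are hypothetical, and it assumes that broadcast, embedded rounding and SAE are mutually exclusive within one form while `.Z` can be appended to any of them.

```go
package main

import "fmt"

// formFlags records which suffix classes a hypothetical form supports.
type formFlags struct {
	Broadcast        bool
	EmbeddedRounding bool
	SuppressAll      bool
	Zeroing          bool
}

// suffixCombinations lists every opcode suffix string a form would accept
// under the assumptions above. The empty string is the unsuffixed opcode.
func suffixCombinations(f formFlags) []string {
	bases := []string{""}
	if f.Broadcast {
		bases = append(bases, "BCST")
	}
	if f.EmbeddedRounding {
		bases = append(bases, "RN_SAE", "RZ_SAE", "RD_SAE", "RU_SAE")
	}
	if f.SuppressAll {
		bases = append(bases, "SAE")
	}
	var out []string
	for _, b := range bases {
		out = append(out, b)
		if !f.Zeroing {
			continue
		}
		// Zeroing always goes last, matching the Go assembler rule.
		if b == "" {
			out = append(out, "Z")
		} else {
			out = append(out, b+".Z")
		}
	}
	return out
}

func main() {
	// Flags taken from a masked reg-reg form with rounding and zeroing.
	combos := suffixCombinations(formFlags{EmbeddedRounding: true, Zeroing: true})
	fmt.Println(len(combos), combos)
}
```

Adding one more independent on/off suffix doubles the count again, which is the explosion being worried about here.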

Another factor is I have ARM support in the back of my mind, which also uses instruction suffixes. After a very quick search, I'm still not sure how many they have. Does anyone here know? Anyway, whatever design is chosen should ideally work for ARM as well.

I'm leaning against codegen. I agree with @vsivsi that in general "parameterizing is superior to hard coding", and that you can still build shorthands given a parameterizable API.

  1. Please design APIs that are autocomplete friendly. Is it helpful to start typing an opcode name and have a bunch of variants drop down? Honestly not sure. One of the benefits of Avo is that autocomplete works, unlike Go assembly.

Good point, hadn't considered this aspect.

Now that I’ve had the chance to read everything through again, I think I’d vote for a slightly modded version of the “methods“ design.

VADDPD().RD_SAE().Z(Z3, Z2, K1, Z1)

This would be nice. I don't really like the "variadic flags" or "methods" approach I listed before since they put the suffixes in a different place to Go assembly. This would match Go assembly much more closely.

I don't quite see how to implement this given the way the instruction functions like VADDPD work right now. However, @zeebo made a suggestion in Slack #assembly room which might just work. He said "extra wacky option is to have VADDPD be a named func type and it has methods like VADDPD.RD_SAE().Z()(Z3, Z2, K1, Z1)". I need to think through the details of this to confirm it's actually feasible.

@mmcloughlin
Owner Author

mmcloughlin commented Dec 22, 2020

And since each function would need to be variadic to handle 0 or 4 parameters, might as well support 3 as well:

VADDPD().RD_SAE().Z(Z3, Z2, Z1)  // K0 implied

Yep, regarding masking, I see two high-level approaches.

First, as you describe, you could have multiple instruction forms, one with the mask and one without.

Second, you could copy PeachPy's approach and have a MaskedRegister operand type. This way it would look like:

VADDPD(Z3, Z2, Z1) // K0 implied
VADDPD(Z3, Z2, Masked(K1, Z1))

My preference right now is for multiple forms, since it looks closest to Go syntax. However, I haven't thought through the implementation details; there may be a reason to prefer keeping the mask and the register associated with each other.

@mmcloughlin
Owner Author

mmcloughlin commented Dec 22, 2020

Straw Man Proposal

There are multiple places where instructions are generated.

  1. Global functions in build package that operate on the global build.Context.
  2. Builder methods on the build.Context type.
  3. Instruction constructors in the x86 package. These are called from the build.Context methods.

These are in approximate order of importance in terms of API usability. Most people will only ever use the global functions in the build package. Library authors or somebody doing something unusually complex might use the build.Context type directly. The x86 package is an implementation detail and probably should have been internal.

I'm thinking of some kind of chaining builder API. The build.Context might have a Suffix(...) method that adds instruction suffixes to a list? There could be convenience methods for each concrete suffix, like RD_SAE() and Z(). The suffix list would be applied to the next instruction emitted, and then reset to empty. This would end up looking very similar to the asmjit API, something like this:

ctx.RD_SAE().Z().VADDPD(Z3, Z2, K1, Z1)

Remember this is not the most common API, so I think it's fine that it looks "backwards" in this case.

The most common interface is the global functions in the build package. For this we would use the "wacky" idea proposed by @zeebo, specifically VADDPD would become a variable instead of a function. Its type would be a function type that has methods RD_SAE(), Z(). Each of these methods would simply call the corresponding method on the global context, and return the same instruction constructor (mockup). This would achieve the desired syntax in the common case:

VADDPD.RD_SAE().Z()(Z3, Z2, K1, Z1)
VADDPD(Z3, Z2, K1, Z1) // can still be called as a function

There's still an open question about what the constructors in the x86 package would look like, but I don't think this matters much since it's mostly an implementation detail. Perhaps they just take an extra parameter for suffixes?
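To make the straw man concrete, here is a small self-contained Go sketch of the "function variable with methods" idea. It is not the mockup linked above and not avo's real API: `Context` is a toy stand-in for build.Context that records text lines, operands are plain strings, and `vaddpdFunc` is a made-up type name.

```go
package main

import (
	"fmt"
	"strings"
)

// Context is a toy stand-in for build.Context. It collects suffixes for the
// next instruction and records emitted instructions as text.
type Context struct {
	pending []string
	Lines   []string
}

// Suffix queues a suffix for the next emitted instruction.
func (c *Context) Suffix(s string) *Context {
	c.pending = append(c.pending, s)
	return c
}

func (c *Context) RD_SAE() *Context { return c.Suffix("RD_SAE") }
func (c *Context) Z() *Context      { return c.Suffix("Z") }

// emit writes the instruction with any pending suffixes, then resets them.
func (c *Context) emit(opcode string, ops ...string) {
	name := strings.Join(append([]string{opcode}, c.pending...), ".")
	c.Lines = append(c.Lines, name+" "+strings.Join(ops, ", "))
	c.pending = nil
}

func (c *Context) VADDPD(ops ...string) { c.emit("VADDPD", ops...) }

// ctx plays the role of the global context in the build package.
var ctx = &Context{}

// vaddpdFunc is a function type with suffix methods. Each method forwards to
// the global context and returns the same constructor for chaining.
type vaddpdFunc func(ops ...string)

func (f vaddpdFunc) RD_SAE() vaddpdFunc { ctx.RD_SAE(); return f }
func (f vaddpdFunc) Z() vaddpdFunc      { ctx.Z(); return f }

// VADDPD is a variable, not a function, so it can carry methods.
var VADDPD vaddpdFunc = func(ops ...string) { ctx.VADDPD(ops...) }

func main() {
	VADDPD.RD_SAE().Z()("Z3", "Z2", "K1", "Z1")
	VADDPD("Z3", "Z2", "Z1") // still callable as a plain function
	ctx.RD_SAE().Z().VADDPD("Z3", "Z2", "K1", "Z1")
	for _, line := range ctx.Lines {
		fmt.Println(line)
	}
}
```

One thing the sketch makes visible is that the suffix state lives on the context between the method call and the instruction call, so a chain that is started but never invoked would leak suffixes onto the next instruction.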

Questions:

  • Allowed suffix combinations. How do we record which ones are allowed on a given instruction? Do we generate an error?
  • Suffix ordering. For example, the .Z suffix must come last. How do we enforce that? Do we enforce it on the input, or could avo be more flexible than Go itself but ensure the output is sorted correctly?
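One possible answer to both questions, sketched under the assumption that each form carries a set of allowed suffixes: reject anything outside the set, and accept suffixes in any order while always emitting `Z` last. `normalizeSuffixes` is a hypothetical helper, not part of avo.

```go
package main

import "fmt"

// normalizeSuffixes checks every suffix against the allowed set and returns
// them with Z moved to the end. Other suffixes keep their relative order,
// and repeated Z entries collapse into one.
func normalizeSuffixes(suffixes []string, allowed map[string]bool) ([]string, error) {
	out := make([]string, 0, len(suffixes))
	zeroing := false
	for _, s := range suffixes {
		if !allowed[s] {
			return nil, fmt.Errorf("suffix %q not allowed on this instruction form", s)
		}
		if s == "Z" {
			zeroing = true
			continue
		}
		out = append(out, s)
	}
	if zeroing {
		out = append(out, "Z")
	}
	return out, nil
}

func main() {
	allowed := map[string]bool{"RN_SAE": true, "RZ_SAE": true, "RD_SAE": true, "RU_SAE": true, "Z": true}
	fmt.Println(normalizeSuffixes([]string{"Z", "RD_SAE"}, allowed))
	fmt.Println(normalizeSuffixes([]string{"BCST"}, allowed))
}
```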

@mmcloughlin
Owner Author

Making good progress on #163! Does anyone have recommendations for AVX-512 examples? Ideally, these are realistic, small-ish, and demonstrate multiple AVX-512 features.

@mmcloughlin
Owner Author

TL;DR: I've ended up going with code generated functions, so it would look like VADDPD_RD_SAE_Z(Z3, Z2, K1, Z1). Let me know ASAP if you disagree with this choice because #163 is getting close!


Okay, I've gone in circles on this a bit. Yet another example of when I spent too much time thinking and not enough time prototyping.

The proposed implementation right now #163 uses code generation rather than any of the fancier APIs discussed above. The reasons are:

  • This was actually quite easy to implement given the way avo code generation works right now. The more flexible forms discussed were a bit messy and I wasn't convinced it was worth it.
  • Looking at @lukechampine's blake3 code, I believe that code generated functions with underscore suffixes will be easiest to read and write. They also look pretty close to Go assembly.
  • Although I agree in principle that parameterization is better, I can't right now think of a use case for parameterized instruction suffixes. I also don't think using code generated suffixes right now actually locks us into that choice forever: we could make them parameterizable later and reimplement the code generated functions using the parameterized versions (something like a whole file of shortcuts of the form var VADDPD_BCST_Z = VADDPD_(mode.BCST, mode.Z)).

Thoughts? @lukechampine @vsivsi

@mmcloughlin
Owner Author

@vsivsi You were talking about implicit masking before. I've implemented this by duplicating instruction forms that support masking, one with a k operand and one without. So the VADDPD entry in the instruction table looks like

avo/internal/inst/ztable.go

Lines 14224 to 14301 in ebe0387

{
	Opcode:  "VADDPD",
	Summary: "Add Packed Double-Precision Floating-Point Values",
	Forms: []Form{
		{
			ISA: []string{"AVX512F"},
			Operands: []Operand{
				{Type: "m512/m64bcst", Action: 0x1},
				{Type: "zmm", Action: 0x1},
				{Type: "zmm", Action: 0x2},
			},
			Broadcast: true,
		},
		{
			ISA: []string{"AVX512F"},
			Operands: []Operand{
				{Type: "m512/m64bcst", Action: 0x1},
				{Type: "zmm", Action: 0x1},
				{Type: "k", Action: 0x1},
				{Type: "zmm", Action: 0x2},
			},
			Zeroing:   true,
			Broadcast: true,
		},
		{
			ISA: []string{"AVX"},
			Operands: []Operand{
				{Type: "xmm", Action: 0x1},
				{Type: "xmm", Action: 0x1},
				{Type: "xmm", Action: 0x2},
			},
		},
		{
			ISA: []string{"AVX"},
			Operands: []Operand{
				{Type: "m128", Action: 0x1},
				{Type: "xmm", Action: 0x1},
				{Type: "xmm", Action: 0x2},
			},
		},
		{
			ISA: []string{"AVX"},
			Operands: []Operand{
				{Type: "ymm", Action: 0x1},
				{Type: "ymm", Action: 0x1},
				{Type: "ymm", Action: 0x2},
			},
		},
		{
			ISA: []string{"AVX"},
			Operands: []Operand{
				{Type: "m256", Action: 0x1},
				{Type: "ymm", Action: 0x1},
				{Type: "ymm", Action: 0x2},
			},
		},
		{
			ISA: []string{"AVX512F"},
			Operands: []Operand{
				{Type: "zmm", Action: 0x1},
				{Type: "zmm", Action: 0x1},
				{Type: "zmm", Action: 0x2},
			},
			EmbeddedRounding: true,
		},
		{
			ISA: []string{"AVX512F"},
			Operands: []Operand{
				{Type: "zmm", Action: 0x1},
				{Type: "zmm", Action: 0x1},
				{Type: "k", Action: 0x1},
				{Type: "zmm", Action: 0x2},
			},
			Zeroing:          true,
			EmbeddedRounding: true,
		},
	},
},

@mmcloughlin
Owner Author

mmcloughlin commented Jan 2, 2021

Spent more time agonizing over a detail. Leaving notes for myself.

Problem

Masking instruction forms have different actions on the output register depending on whether zeroing is enabled.

So for example, the VPADDD and VPADDD_Z masking forms are actually different. In the first case the output register has RW action because of merge-masking, whereas the zeroing form has W action only.

Currently this is handled incorrectly: there is no way to specify different actions based on the zeroing flag.

Solutions

Fixup Pass. Have a pass that removes the output register from input list in the case where the Z suffix is provided.

Thoughts: Not a big fan of handling something dynamically that could be done with code generation. However, this solution would be more robust if we do later support parameterized instruction suffixes.

Redesign Instruction Table. The current version of the instruction table cannot represent this properly. Each instruction form has a flag indicating whether it supports zeroing, whereas the operands list statically specifies the RW action on each operand. So it cannot represent the fact that the RW action depends on whether zeroing is used. So one option is to redesign the schema so that the zeroing and merging forms are separate forms.

The simplest way to do this is to redefine the meaning of the Zeroing field to mean not that zeroing is optional, but rather that this is the zeroing form of the instruction. In practical terms this means Zeroing = true would always imply the ".Z" as a suffix, rather than allowing both "" and ".Z". Given this interpretation, AVX-512 instructions that support masking and zeroing would have an additional form, allowing the merging/zeroing forms to have different operand actions.

Thoughts: Perhaps confusing that the Zeroing flag would have a different interpretation. Otherwise quite a simple solution. Should fit well with the downstream consumers without many changes, if at all.
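A rough sketch of what this table redesign could look like as a transformation: each masked form is rewritten so that the merging variant reads its destination, and forms that allow zeroing gain a second, write-only variant. The types below are simplified stand-ins for the table types, and the sketch assumes that action bit 0x1 means read, 0x2 means write, and that the destination is the last operand. None of this is confirmed against avo's actual schema.

```go
package main

import "fmt"

// Action is a read/write bitmask. The bit values are an assumption based on
// the table dump earlier in the thread.
type Action uint8

const (
	R  Action = 0x1
	W  Action = 0x2
	RW        = R | W
)

type Operand struct {
	Type   string
	Action Action
}

type Form struct {
	Operands []Operand
	Zeroing  bool // input: zeroing is optional; output: this is the .Z form
}

// splitMasked expands one form into explicit merging and zeroing forms.
// Unmasked forms are returned unchanged.
func splitMasked(f Form) []Form {
	masked := false
	for _, op := range f.Operands {
		if op.Type == "k" {
			masked = true
		}
	}
	if !masked || len(f.Operands) == 0 {
		return []Form{f}
	}
	last := len(f.Operands) - 1

	// Merge-masking keeps unselected lanes, so the destination is also read.
	merging := Form{Operands: append([]Operand(nil), f.Operands...)}
	merging.Operands[last].Action |= R
	forms := []Form{merging}

	if f.Zeroing {
		// Zero-masking overwrites every lane, so the destination stays as is.
		zeroing := Form{Operands: append([]Operand(nil), f.Operands...), Zeroing: true}
		forms = append(forms, zeroing)
	}
	return forms
}

func main() {
	masked := Form{
		Operands: []Operand{{"zmm", R}, {"zmm", R}, {"k", R}, {"zmm", W}},
		Zeroing:  true,
	}
	for _, f := range splitMasked(masked) {
		fmt.Printf("zeroing=%v dst action=%#x\n", f.Zeroing, f.Operands[len(f.Operands)-1].Action)
	}
}
```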

Redesign Function Table. Code generation works by first transforming the instruction table into a list of functions that will construct those instruction forms. Some instructions have multiple different functions corresponding to supported suffix combinations. Similar to above, we could handle this at the point instructions are converted into functions.

Thoughts: Not a fan of this approach. At the moment there is a load of ugly logic in the instruction table building, but after that it gets simpler. I'd rather continue to keep all the nasty details in one place, where you expect them.

Special Action Type. Perhaps an action type like M for "merge" or "masked". This would be interpreted as R in the merging case, and ignored in the zeroing case.

Thoughts: Again quite simple. Perhaps confusing to introduce a third action type, especially since some combinations would make no sense, like RM.

Other Frameworks

PeachPy: I don't see any handling for this. I could have missed it but I suspect there is an edge case bug here.

asmjit: Does not handle this case either. Petr Kobalicek confirmed the bug.

@mmcloughlin
Owner Author

Status: AVX-512 PR #163 builds for the first time, after dealing with more edge cases and enabling all the AVX-512 instruction sets supported by Opcodes database.

I could land this now probably, but I'm hesitating mainly because the generated code has substantially grown to the point that it's affecting compile times. The CI jobs take a lot longer, especially linting. I'm exploring ways to reduce duplication in the generated code, which I hope will also reduce compile times. See Slack discussion, thanks @josharian!

There's also some assorted todos left:

  • I'd like to test a real example to make sure it works, and for documentation. I'll probably copy the histogram computation in the AVX512 documentation.
  • I suspect the canceling inputs pass is broken for AVX-512 opcodes. Need to test this and possibly fix.
  • Code coverage is down and the codecov service is complaining. Need to autogen some test cases to fix this.

The compile times issue is the only real blocker since it affects the API design, which we'll be committed to once it lands. If it can't be fixed, might have to revisit the decision of autogenerating functions for every instruction suffix.

@vsivsi
Contributor

vsivsi commented Jan 6, 2021

@mmcloughlin

Apologies for dropping out of the discussion for a bit there. I decided preserving my sanity required actually treating my vacation time as such, even though we didn't actually go anywhere.

Looks like you've made excellent progress on this. Nice work spotting the result value merge under masking edge cases for register scheduling.

I also find the compile time (and related name space) explosion to be concerning. One consideration is to think about this in terms of the inevitability of ever more complicated future ISA extensions from Intel (and likely on the ARM side as well). These architectures are steadily evolving into full blown vector units, and so I expect that the instruction sets will continue to become ever more complex over time. It feels like continuing to (approximately) double the size of the generated namespace with each such additional instruction option will rapidly become self-limiting in the future, even if you manage to overcome it for the generated forms this time.

It's been a long time since I've had insider knowledge at Intel, but based on recent public statements they seem fully committed to continuing to expand features of the AVX512 architecture and whatever follows it.

All of that aside, I also continue to aesthetically prefer the earlier proposed methods-based API, for whatever that's worth.

@mmcloughlin
Owner Author

@vsivsi Glad you got a real vacation :) Thanks for your input!

I take your points. I'll take some time to consider the alternative API again, maybe prototype it. I actually think with the naive implementation I have, both APIs might stress the compiler. Compile times are not necessarily a deal-breaker because of the build cache, but it doesn't give users a good first impression and it's annoying for the development feedback loop.

On the positive side, we do have something that works now. Wrangling the instruction database into shape was the most annoying part of it, and that's done :)

@vsivsi
Contributor

vsivsi commented Feb 25, 2021

Updated! See next comment.


I've continued to use this branch to great effect. This afternoon, for the first time, I found a little wrinkle in the new forms of:

VGATHERDPS/VGATHERDPD/VGATHERQPS/VGATHERQPD
VPGATHERDD/VPGATHERDQ/VPGATHERQD/VPGATHERQQ
VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ
VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD

The EVEX coded forms of these instructions should optionally accept a Kn mask register. So I would expect Avo to implement them as a variadic function implementing both two and three parameter versions (with and without a mask register). The current implementation appears to require specification of a mask register. Seems simple enough to work around by hard coding K0:

VSCATTERQPD(ZMM(), reg.K0, Mem{Base: GP64(), Index: ZMM(), Scale: 1, Disp: 0})

But this is not consistent with how any of the other EVEX instructions optionally accepting mask registers work (that I've encountered). There may be other examples of this, but these are the ones I've discovered so far.

Thanks again for all of the work that went into this branch! It's been a boon to have this support for the past couple of months. The fact that this is the first little annoyance I've discovered is a testament to how thorough you were with this AVX-512 implementation!

Hope all is well and your grad school decision process is going smoothly!

Best, -V

@vsivsi
Contributor

vsivsi commented Mar 3, 2021

Apologies, I was mistaken about the issue above! I just got back to this code, and being the first time I've attempted to use the scatter/gather instructions in AVX-512, I didn't realize these instructions seem to be irregular relative to the norm in their use of the mask register. Specifically, they zero each bit for every value that is successfully transferred. So it seems that it is not valid to use K0, or to omit a mask register! I was misled by this instruction syntax definition taken straight from the Intel documentation:

EVEX.512.66.0F38.W1 91 /vsib VPGATHERQQ zmm1 {k1}, vm64z

The {k1} implies to me that specifying a Kn register is optional, as is typically the case. The Intel docs are not super clear about all of this. It is certainly possible that you could omit using a K register and the processor could start by copying all bits set to the K0 register and returning the results in the actual K0 register. Intel doesn't spell out that behavior (though it is implied). (Edit: this is incorrect; see the next post.)

But in this case, at least as far as the Golang assembler is concerned, K0 is not a valid mask for these instructions (used explicitly).

asm: invalid instruction: 00290 (.../8way_amd64.s:2407)	VPGATHERQQ	56(BP)(Z0*1), K0, Z8
asm: assembly failed

And if you try to run the assembler against source code that omits a mask register, it actually crashes!

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1191bc2]

goroutine 1 [running]:
cmd/internal/obj/x86.(*AsmBuf).doasm(0xc000591b88, 0xc0001a6000, 0xc000226300, 0xc00023aa90)
	/usr/local/Cellar/go/1.16/libexec/src/cmd/internal/obj/x86/asm6.go:4231 +0x74e2
cmd/internal/obj/x86.(*AsmBuf).asmins(0xc000591b88, 0xc0001a6000, 0xc000226300, 0xc00023aa90)
	/usr/local/Cellar/go/1.16/libexec/src/cmd/internal/obj/x86/asm6.go:5378 +0x65
cmd/internal/obj/x86.span6(0xc0001a6000, 0xc000226300, 0xc000116cf0)
	/usr/local/Cellar/go/1.16/libexec/src/cmd/internal/obj/x86/asm6.go:2150 +0x748
cmd/internal/obj.Flushplist(0xc0001a6000, 0xc000591e28, 0x0, 0x7ffeefbff617, 0x17)
	/usr/local/Cellar/go/1.16/libexec/src/cmd/internal/obj/plist.go:107 +0x7ed
main.main()
	/usr/local/Cellar/go/1.16/libexec/src/cmd/asm/main.go:91 +0x79c

So... Seems like for now, the behavior that Avo currently enforces is needed. Obviously, the Golang assembler should never crash like this, so that's a separate issue that I'll need to chase down.

Thanks again! -V

@vsivsi
Contributor

vsivsi commented Mar 4, 2021

Found the definitive answer in the Intel SDM (Vol. 1):

15.6.1.1 Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 can not be used as a predicate operand. Opmask k0 cannot be encoded as a predicate operand for a vector operation; the encoding value that would select opmask k0 will instead select an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling
masking. Opmask register k0 can still be used for any instruction that takes opmask register(s) as operand(s) (either source or destination).

Note that certain instructions implicitly use the opmask as an extra destination operand. In such cases, trying to use the “no mask” feature will translate into a #UD fault being raised.

That last sentence answers the questions I raised in the post above. Omitting a mask register (K1-K7) from VPSCATTER/GATHER instructions is not allowed, as the K register needs to be a writable destination, and my conjecture about implicit mask --> K0 dest functionality is false.
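For anyone skimming, here is a rough software model of the rule quoted above. It is only a sketch: `effectiveMask` and `gatherQQ` are hypothetical helper names, the mask field is treated as a plain 3-bit number, and faults and the upper mask bits are ignored. It shows that an encoded mask field of 0 means "no masking" rather than "read k0", and that a gather needs a real k1-k7 register because it clears mask bits as lanes complete.

```go
package main

import (
	"errors"
	"fmt"
)

// effectiveMask models the SDM rule for ordinary predicated instructions:
// a mask field of 0 does not read register k0, it selects an implicit
// all-ones mask (masking disabled).
func effectiveMask(field uint8, kregs [8]uint64) uint64 {
	if field&7 == 0 {
		return ^uint64(0)
	}
	return kregs[field&7]
}

// gatherQQ is a simplified model of an 8-lane quadword gather. The mask
// register is also a destination, so the "no mask" encoding is rejected.
func gatherQQ(dst *[8]uint64, mem []uint64, idx [8]int, field uint8, kregs *[8]uint64) error {
	k := field & 7
	if k == 0 {
		return errors.New("gather requires a writable mask register k1-k7")
	}
	for lane := 0; lane < 8; lane++ {
		bit := uint64(1) << uint(lane)
		if kregs[k]&bit == 0 {
			continue // lane masked off: destination element is left unchanged
		}
		dst[lane] = mem[idx[lane]]
		kregs[k] &^= bit // a completed lane clears its mask bit
	}
	return nil
}

func main() {
	var kregs [8]uint64
	kregs[1] = 0x05 // lanes 0 and 2 enabled
	mem := []uint64{10, 20, 30, 40, 50, 60, 70, 80}
	idx := [8]int{7, 6, 5, 4, 3, 2, 1, 0}
	var dst [8]uint64

	err := gatherQQ(&dst, mem, idx, 1, &kregs)
	fmt.Println(dst, kregs[1], err)
	fmt.Println(gatherQQ(&dst, mem, idx, 0, &kregs))
}
```

In this model the mask ends up zero after a successful gather, which is why the same K register has to be reloaded before the next gather or scatter.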

So the only bug here is the Golang assembler crashing when the mask register is omitted from one of these instructions. I'll nail that down to a simple set of test cases and submit an issue for this on the Go project repo.

@vsivsi
Contributor

vsivsi commented Mar 5, 2021

And... I just found the golang issue @mmcloughlin filed on this in December!

golang/go#43421

@vsivsi
Contributor

vsivsi commented Mar 19, 2021

Okay, this time I think I've found a genuine omission of consequence for my work.

According to the Intel SW Dev Prog Ref, when CPUID bit AVX512VL is present at runtime, all forms of the VPOPCNT[B|W|D|Q] instructions are valid for use with 256-bit ymm and 128-bit xmm registers. And because VPOPCNT is an AVX512 "heavy" instruction when used with 512-bit registers, there is a strong performance benefit (avoided license downclocking) to only using the register width needed.

The AVX512 branch currently returns "bad operands" when VPOPCNT is used with registers other than zmm.

For example:

in, out := YMM(), YMM()
VPOPCNTQ(in, out)  // <--- VPOPCNTQ: bad operands 

Not sure if this is caused by a deficiency in the instruction database used in generation or something in Avo's interpretation. I haven't encountered similar difficulties with any other AVX512VL impacted instructions.

The zctors.go file is too large to link to, but here is an example entry for a VPOPCNT form:

// VPOPCNTQ: Packed Population Count for Quadword Integers.
//
// Forms:
//
// 	VPOPCNTQ m512 k zmm
// 	VPOPCNTQ m512 zmm
// 	VPOPCNTQ zmm  k zmm
// 	VPOPCNTQ zmm  zmm
func VPOPCNTQ(ops ...operand.Op) (*intrep.Instruction, error) {
	switch {
	case len(ops) == 3 && operand.IsM512(ops[0]) && operand.IsK(ops[1]) && operand.IsZMM(ops[2]),
		len(ops) == 3 && operand.IsZMM(ops[0]) && operand.IsK(ops[1]) && operand.IsZMM(ops[2]):
		return &intrep.Instruction{
			Opcode:   "VPOPCNTQ",
			Operands: ops,
			Inputs:   []operand.Op{ops[0], ops[1], ops[2]},
			Outputs:  []operand.Op{ops[2]},
			ISA:      []string{"AVX512VPOPCNTDQ"},
		}, nil
	case len(ops) == 2 && operand.IsM512(ops[0]) && operand.IsZMM(ops[1]),
		len(ops) == 2 && operand.IsZMM(ops[0]) && operand.IsZMM(ops[1]):
		return &intrep.Instruction{
			Opcode:   "VPOPCNTQ",
			Operands: ops,
			Inputs:   []operand.Op{ops[0]},
			Outputs:  []operand.Op{ops[1]},
			ISA:      []string{"AVX512VPOPCNTDQ"},
		}, nil
	}
	return nil, errors.New("VPOPCNTQ: bad operands")
}

Other AVX512VL impacted instructions in that file appear to fully elaborate all of the valid xmm/ymm and m128/m256 combinations.

@vsivsi
Contributor

vsivsi commented Mar 19, 2021

@mmcloughlin I've traced the above VPOPCNT issue to the x86_64.xml file. It appears to be missing the AVX512VL forms of VPOPCNT[D|Q]. As an aside, it also appears to be missing all AVX512_BITALG instructions, including VPOPCNT[B|W]. But that's not currently a blocker for me.

I've hand edited the VPOPCNT section of that file to add the 128 and 256 bit forms, using analogous forms for the instruction VPORQ and the Intel documentation as my guide.

The result is here: https://gist.github.com/vsivsi/06b742a04d2c7e226fae4fd3ab0753dd

Unless I've screwed something up, in theory you should just be able to patch that section of the file with the gist contents and this will just work. But I haven't been able to test that yet, because I'm having some trouble sorting out precisely how to reproduce the steps to generate all of the Avo code that depends on x86_64.xml, this being my first time attempting to dive this deeply into the guts of Avo.

How should I proceed here? scripts/generate doesn't just run successfully, and the scripts/bootstrap also doesn't complete successfully. It feels like just casting about trying to haphazardly reproduce your build process isn't going to be terribly productive, so any guidance you can provide would be helpful!

@vsivsi
Contributor

vsivsi commented Mar 19, 2021

A quick update... merging the changes in the go1.16 branch into the avx512 branch solved the bootstrap issues. After that the generate script ran flawlessly. \o/

With that working, the patch in the gist linked above works and solves my immediate blocker. For completeness, over the weekend I'll update it to add support for all of the AVX512_BITALG instructions (and their VL forms).

I guess the next question is strategy for merging these changes. Should I put together a PR for Avo directly, or attempt to submit a PR for Maratyszcza/Opcodes, the original source (which seems pretty dormant)?

@mmcloughlin
Owner Author

Thanks for all your feedback here and sorry for not getting back to you! I'll respond to various points you've made, maybe not all at once.

> Just a quick update. I've been using the avx512 branch for a few days now and so far I've seen no problems at all. Currently I'm just doing integer "bit bashing" with it, so I'm not exercising the opcode suffix code.

Really glad to hear it's working well for you. Do you recall being annoyed by compile times?

I have some local work on using an optab approach for instruction generation, and I think this should deal with the compile time problem. However, I ended up getting blocked in "analysis paralysis" regarding the instruction suffix API (underscores VADDPD_RD_SAE_Z(...) vs method-chaining VADDPD().RD_SAE().Z(...)), and ended up stalling as a result. This probably doesn't matter anywhere near as much as I think it does, but I'm hesitant because API choices are hard to reverse. Any more input to help unblock me haha? @vsivsi @lukechampine
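To make the two candidate suffix styles concrete, here is a toy sketch. Nothing in it is avo's real API: operands are plain strings, `emit` and `inst` are hypothetical, and the output format only approximates Go assembler syntax. Both styles produce the same instruction text; the difference is whether each opcode/suffix combination is its own function or suffixes are chained on a builder.

```go
package main

import (
	"fmt"
	"strings"
)

// emit renders an opcode, its suffixes and its operands as one line.
func emit(opcode string, suffixes, ops []string) string {
	name := strings.Join(append([]string{opcode}, suffixes...), ".")
	return name + " " + strings.Join(ops, ", ")
}

// Style 1: underscores. One generated function per suffix combination.
func VADDPD_RD_SAE_Z(ops ...string) string {
	return emit("VADDPD", []string{"RD_SAE", "Z"}, ops)
}

// Style 2: method chaining. One entry point per opcode, suffixes as methods.
type inst struct {
	opcode   string
	suffixes []string
}

func VADDPD() inst { return inst{opcode: "VADDPD"} }

// with returns a copy of the builder with one more suffix appended.
func (i inst) with(suffix string) inst {
	s := append(append([]string{}, i.suffixes...), suffix)
	return inst{opcode: i.opcode, suffixes: s}
}

func (i inst) RD_SAE() inst { return i.with("RD_SAE") }

// Z adds the zeroing suffix and takes the operands, ending the chain.
func (i inst) Z(ops ...string) string {
	return emit(i.opcode, i.with("Z").suffixes, ops)
}

func main() {
	fmt.Println(VADDPD_RD_SAE_Z("Z0", "Z1", "K1", "Z2"))
	fmt.Println(VADDPD().RD_SAE().Z("Z0", "Z1", "K1", "Z2"))
}
```

The underscore style multiplies the number of exported functions, while the chained style keeps one entry point per opcode but makes the point where operands are supplied less obvious.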

@mmcloughlin
Owner Author

> Semi-related question: Is there a programmatic way to get at the required CPUID feature bits for each generated function?

Good question! I thought about this a while ago, although your suggestion is more extensive than what I had in mind. I've created a separate issue for discussion #168.

@mmcloughlin
Owner Author

> Apologies, I was mistaken about the issue above! I just got back to this code, and being the first time I've attempted to use the scatter/gather instructions in AVX-512, I didn't realize these instructions seem to be irregular relative to the norm in their use of the mask register. Specifically, they zero each bit for every value that is successfully transferred. So it seems that it is not valid to use K0, or to omit a mask register! I was misled by this instruction syntax definition taken straight from the Intel documentation:

You went through the exact same journey I did regarding the gather/scatter instructions. Yes, these are indeed special cases. See:

avo/internal/load/load.go

Lines 534 to 539 in d60cc02

// Almost all instructions take an optional mask, apart from a few
// special cases.
if maskrequired[opcode] {
	return []inst.Form{masked}
}
return []inst.Form{unmasked, masked}

// maskrequired is a set of AVX-512 opcodes where the mask register is required.
// Usually the mask register can be omitted, in which case K0 is implied.
var maskrequired = map[string]bool{
	// Reference: https://github.com/golang/go/blob/4fd94558820100129b98f284e21b19fc27a99926/src/cmd/internal/obj/x86/asm6.go#L4219-L4240
	//
	//	// Checks to warn about instruction/arguments combinations that
	//	// will unconditionally trigger illegal instruction trap (#UD).
	//	switch p.As {
	//	case AVGATHERDPD,
	//		AVGATHERQPD,
	//		AVGATHERDPS,
	//		AVGATHERQPS,
	//		AVPGATHERDD,
	//		AVPGATHERQD,
	//		AVPGATHERDQ,
	//		AVPGATHERQQ:
	//		// AVX512 gather requires explicit K mask.
	//		if p.GetFrom3().Reg >= REG_K0 && p.GetFrom3().Reg <= REG_K7 {
	//			if !avx512gatherValid(ctxt, p) {
	//				return
	//			}
	//		} else {
	//			if !avx2gatherValid(ctxt, p) {
	//				return
	//			}
	//		}
	//	}
	//
	"VGATHERDPD": true,
	"VGATHERQPD": true,
	"VGATHERDPS": true,
	"VGATHERQPS": true,
	"VPGATHERDD": true,
	"VPGATHERQD": true,
	"VPGATHERDQ": true,
	"VPGATHERQQ": true,
	// Restriction applies to SCATTER instructions too.
	"VPSCATTERDD": true,
	"VPSCATTERDQ": true,
	"VPSCATTERQD": true,
	"VPSCATTERQQ": true,
	"VSCATTERDPD": true,
	"VSCATTERDPS": true,
	"VSCATTERQPD": true,
	"VSCATTERQPS": true,
}

https://github.com/golang/go/blob/4fd94558820100129b98f284e21b19fc27a99926/src/cmd/internal/obj/x86/asm6.go#L4219-L4240
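As a standalone illustration of the rule in that snippet, here is a minimal sketch. The `form` type, `withMask` and `formsFor` are hypothetical stand-ins rather than avo's internal types, and the sketch assumes the mask operand sits immediately before the destination, as in the generated form lists earlier in this thread. Most opcodes expand to both an unmasked and a masked form, while gather/scatter opcodes expand to the masked form only.

```go
package main

import "fmt"

// form is a simplified instruction form: just a list of operand types.
type form struct {
	Operands []string
}

// maskRequired lists a couple of opcodes whose mask operand is mandatory.
var maskRequired = map[string]bool{
	"VPGATHERQQ":  true,
	"VPSCATTERQQ": true,
}

// withMask returns a copy of f with a "k" operand inserted before the
// final (destination) operand.
func withMask(f form) form {
	n := len(f.Operands)
	if n == 0 {
		return form{Operands: []string{"k"}}
	}
	ops := make([]string, 0, n+1)
	ops = append(ops, f.Operands[:n-1]...)
	ops = append(ops, "k", f.Operands[n-1])
	return form{Operands: ops}
}

// formsFor expands one unmasked database form into the forms to generate.
func formsFor(opcode string, unmasked form) []form {
	masked := withMask(unmasked)
	if maskRequired[opcode] {
		return []form{masked}
	}
	return []form{unmasked, masked}
}

func main() {
	fmt.Println(formsFor("VPOPCNTQ", form{Operands: []string{"zmm", "zmm"}}))
	fmt.Println(formsFor("VPGATHERQQ", form{Operands: []string{"vm64z", "zmm"}}))
}
```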

@mmcloughlin
Owner Author

> A quick update... merging the changes in the go1.16 branch into the avx512 branch solved the bootstrap issues. After that the generate script ran flawlessly. \o/

You figured this out. Unfortunately Go 1.16 broke the gobin method I was using to install tools dependencies, which is probably what you were running up against. PR #166 fixes this but hasn't landed yet because other changes to the Go toolchain have broken the third-party test suite in infuriating ways.

> With that working, the patch in the gist linked above works and solves my immediate blocker. For completeness, over the weekend I'll update it to add support for all of the AVX512_BITALG instructions (and their VL forms).
>
> I guess the next question is strategy for merging these changes. Should I put together a PR for Avo directly, or attempt to submit a PR for Maratyszcza/Opcodes, the original source (which seems pretty dormant)?

So, yes, you also diagnosed the VPOPCNT problem correctly. The Opcodes database does not seem to contain the BITALG instructions. You could put a PR out on the Opcodes project but I suspect there is some underlying script that generates the XML file which isn't checked into that repo (I might be wrong). Maybe @Maratyszcza can advise? It would seem that fixing this would be good for PeachPy as well.

In the meantime, I'm fine with adding a patch file to avo. The real solution is #23.

@vsivsi
Contributor

vsivsi commented Mar 20, 2021

Yeah, hand editing that x86_64.xml file feels really icky. I think I've finished with all of the BITALG instructions/forms as well. Decided to push through it tonight rather than have it haunt me all weekend!

Everything looks good as far as the generated code goes. The only wrinkles were the fact that the BITALG VPOPCNTB/W don't support memory broadcast, so I had to puzzle that out using the Intel docs and the XML schema for the x86_64.xml file.

The final remaining BITALG instruction VPSHUFBITQMB is pretty crazy, I can't imagine what I'd use it for, but I suppose I'll know it when I see it. That really required me to dig into the EVEX encoding to get right, but I'm about 95% sure I got it. The generated Avo code looks correct.

All of this is on this branch on my fork. It triggered avogen to generate a bunch of new code beyond the instructions themselves because VPSHUFBITQMB has a unique operand signature, but Avo seems to have taken it in stride. So kudos again!

https://github.com/vsivsi/avo/tree/avx512_vpopcnt_VL

I should probably contribute to the test coverage of this new stuff before folding it into a PR, but I'm out of steam for tonight, so it'll need to wait until Monday.

@mmcloughlin
Owner Author

I've picked up work on this in the last few days. #217 has the most recent work. I've finally completed the transition to using an optab approach for the function constructors in x86. It's not exactly hyper efficient or anything, but it helps a lot with the ridiculous compile times I was seeing with the previous version.
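For readers unfamiliar with the term, a table-driven ("optab") constructor can be sketched roughly as below. This is not the code from #217: the `kind`, `entry`, `optab` and `lookup` names are hypothetical, and only a few VPOPCNTQ forms are listed. The idea is that one shared lookup walks a data table instead of every opcode getting its own generated switch statement, so the compiler sees far less code.

```go
package main

import "fmt"

// kind is a coarse operand class used for form matching.
type kind uint8

const (
	K kind = iota
	XMM
	YMM
	ZMM
	M512
)

// entry is one row of the table: an opcode, its operand classes and the
// ISA extensions that form requires.
type entry struct {
	opcode string
	kinds  []kind
	isa    []string
}

// optab holds a handful of illustrative VPOPCNTQ forms, including the
// narrower forms that additionally require AVX512VL.
var optab = []entry{
	{"VPOPCNTQ", []kind{ZMM, ZMM}, []string{"AVX512VPOPCNTDQ"}},
	{"VPOPCNTQ", []kind{M512, ZMM}, []string{"AVX512VPOPCNTDQ"}},
	{"VPOPCNTQ", []kind{ZMM, K, ZMM}, []string{"AVX512VPOPCNTDQ"}},
	{"VPOPCNTQ", []kind{YMM, YMM}, []string{"AVX512VL", "AVX512VPOPCNTDQ"}},
	{"VPOPCNTQ", []kind{XMM, XMM}, []string{"AVX512VL", "AVX512VPOPCNTDQ"}},
}

func sameKinds(a, b []kind) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// lookup finds the table row matching the opcode and operand classes.
func lookup(opcode string, ops ...kind) (entry, error) {
	for _, e := range optab {
		if e.opcode == opcode && sameKinds(e.kinds, ops) {
			return e, nil
		}
	}
	return entry{}, fmt.Errorf("%s: bad operands", opcode)
}

func main() {
	fmt.Println(lookup("VPOPCNTQ", YMM, YMM))
	fmt.Println(lookup("VPOPCNTQ", YMM, ZMM))
}
```

A linear scan keeps the sketch short; a real table would presumably be indexed by opcode.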

@kalamay The segmentio/asm third-party test fails on the PR:

https://github.com/mmcloughlin/avo/runs/4128468815?check_suite_focus=true#step:6:37

It's a tiny fix so if it's okay I'd prefer to just make the fix on your end after the upgrade. The issue is that extending the instruction database has turned the add functions into variadic functions rather than a fixed number of arguments (they now take masked and unmasked versions). Probably most users wouldn't notice but you have some code that relies on the signature, here:

https://github.com/segmentio/asm/blob/18af27c3ce38682a2545f2d95e5ef564dd312d6e/build/slices/sums_asm.go#L23

The fix is just:

diff --git a/build/slices/sums_asm.go b/build/slices/sums_asm.go
index c46d300..4c98520 100644
--- a/build/slices/sums_asm.go
+++ b/build/slices/sums_asm.go
@@ -19,7 +19,7 @@ type Processor struct {
        typ string
        scale uint8
        avxOffset uint64
-       avxAdd func(mxy, xy, xy1 Op)
+       avxAdd func(...Op)
        x86Mov func(imr, mr Op)
        x86Add func(imr, amr Op)
        x86Reg reg.GPVirtual

I've confirmed everything else still generates, at least on your v1.0.0 tag. What do you think?
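The underlying Go rule, shown in isolation: a variadic function value is not assignable to a fixed-arity function type, so a field declared with three parameters stops compiling once the constructor becomes variadic. In this sketch `Op`, `VPADDQ`, `Processor` and `fixed3` are simplified stand-ins, not the avo or segmentio/asm definitions. It shows the field-type change from the diff, plus a small adapter as an alternative for code that wants to keep the old signature.

```go
package main

import "fmt"

// Op stands in for an operand type.
type Op string

// VPADDQ stands in for a constructor that became variadic so it can take
// both unmasked (3 operand) and masked (4 operand) forms.
func VPADDQ(ops ...Op) string {
	return fmt.Sprintf("VPADDQ %v", ops)
}

// Processor stores the add function using the variadic signature.
// Declaring the field as func(a, b, c Op) string and assigning VPADDQ
// to it would not compile.
type Processor struct {
	avxAdd func(...Op) string
}

// fixed3 adapts a variadic constructor back to a three-operand signature.
func fixed3(f func(...Op) string) func(a, b, c Op) string {
	return func(a, b, c Op) string { return f(a, b, c) }
}

func main() {
	p := Processor{avxAdd: VPADDQ}
	fmt.Println(p.avxAdd("Y0", "Y1", "Y2"))
	fmt.Println(p.avxAdd("Z0", "Z1", "K1", "Z2"))
	fmt.Println(fixed3(VPADDQ)("Y0", "Y1", "Y2"))
}
```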

@vsivsi Thanks for your PR #199. I think this is something we can address. It'll be easier once the avx-512 stuff is actually landed. I guess you've had the most experience actually working with the avx512 branch? Has it worked for you? Do you think it's a good enough interface to land?

@kalamay
Contributor

kalamay commented Nov 11, 2021

Hi @mmcloughlin I've just created a branch pulling in this version of Avo here: segmentio/asm#59

There's a little bit of a chicken-or-egg situation as far as released versions go, but we should be fine pulling in this x-avx512 branch for now and updating once it's merged.

@mmcloughlin
Owner Author

Oh wow @kalamay I just realized you're now depending on the x-avx512 branch on master. That really lights a fire under me to land this, which is actually a good thing :)

So as of last night CI is green on #217. I'm working on adding an AVX-512 example or two, which is also a way for me to kick the tires. I have a long weekend to focus on it, so hoping to finally get it landed!

@kalamay
Contributor

kalamay commented Nov 11, 2021

Hah 🔥, well let me know if we need to tweak anything else or help move anything forward. Currently though, we only run avo manually, and we always verify any changes in the produced assembly. This should just be temporary to get around the block on your side, but it's also fairly quick to back out if need be.

@vsivsi
Contributor

vsivsi commented Nov 11, 2021

@mmcloughlin Yes, I've been working extensively off my fork of the avx-512 branch (shared in PR #199)

I've built code that's using Avo to generate ~150K lines of Golang asm. It's all integer (no FP) but otherwise is exercising a good amount of both AVX2 and AVX-512 (I'm generating codepaths for each). The AVX-512 code is exercising opmasks, broadcast and zerofill suffixes, scatter/gather operations, AVX512VL variants, etc.

It was unit testing of Avo generated AVX-512 code that revealed the MacOS kernel bug (opmask clobbering) I recently identified (see: golang/go#49233).

I'd be more than happy to port my code back to using your testing branch once PR #199 is merged, as I need those changes for my code to work.

Please let me know if there's any other way I can help.

mmcloughlin added a commit that referenced this issue Nov 13, 2021
all: AVX-512

Extends avo to support most AVX-512 instruction sets.

The instruction type is extended to support suffixes. The K family of opmask
registers is added to the register package, and the operand package is updated
to support the new operand types. Move instruction deduction in `Load` and
`Store` is extended to support KMOV* and VMOV* forms.

Internal code generation packages were overhauled. Instruction database loading
required various messy changes to account for the additional complexities of the
AVX-512 instruction sets. The internal/api package was added to introduce a
separation between instruction forms in the database, and the functions avo
provides to create them. This was required since with instruction suffixes there
is no longer a one-to-one mapping between instruction constructors and opcodes.

AVX-512 bloated generated source code size substantially, initially increasing
compilation and CI test times to an unacceptable level. Two changes were made to
address this:

1.	Instruction constructors in the `x86` package moved to an optab-based
	approach. This compiles substantially faster than the verbose code
	generation we had before.

2.	The most verbose code-generated tests are moved under build tags and limited
	to a stress test mode. Stress test builds are run on schedule but not in
	regular CI.

An example of AVX-512 accelerated 16-lane MD5 is provided to demonstrate and
test the new functionality.

Updates #20 #163 #229

Co-authored-by: Vaughn Iverson <vsivsi@yahoo.com>
@mmcloughlin
Owner Author

Just landed #217! Thank you for your input and infinite patience while I worked on this.

Excited to see what people build. Hope it works smoothly, and please file bugs if it doesn't.

@kalamay
Contributor

kalamay commented Nov 13, 2021

Congratulations on the release, and thanks so much for the amazing effort!

@vsivsi
Contributor

vsivsi commented Nov 13, 2021

Agreed, it's been quite a journey to get here. Can't wait to try it out. Congrats!

@lukechampine
Contributor

king 😤

@mmcloughlin
Owner Author

Thanks 😊

Going to close this. Please open another issue if there are problems or additional AVX-512 feature requests. For now I'm aware of #193 and #199.
