Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status of AVX 512 ? #28

Open
ManuelCostanzo opened this issue Oct 6, 2020 · 27 comments
Open

Status of AVX 512 ? #28

ManuelCostanzo opened this issue Oct 6, 2020 · 27 comments
Labels
E-needs-docs Needs documentation added.

Comments

@ManuelCostanzo
Copy link

Hello !

I want to ask if this crate supports AVX 512 instructions. If not, Is it in the plans to be able to support it ? This would be the definitive rate for simd in Rust ? Because I understand that the one that is in the official documentation does not have more support.

Thanks

@Lokathor
Copy link
Contributor

Lokathor commented Oct 6, 2020

Hello.

We will support 512-bit vectors. However, you'll need to turn up the enabled features during compilation because by default Rust binaries are not compiled with avx-512 enabled.

@ManuelCostanzo
Copy link
Author

ManuelCostanzo commented Oct 6, 2020

Thank you for reply ! And what features I have to enable ?

@Lokathor
Copy link
Contributor

Lokathor commented Oct 6, 2020

You'd usually use a target-feature list in the RUSTFLAGS value during build.

The allowed features are the same as for the target_feature attribute

It appears that you can't enable avx-512 on stable yet.

Perhaps @Amanieu knows more? I've seen them merging work in stdarch lately.

@ManuelCostanzo
Copy link
Author

ManuelCostanzo commented Oct 6, 2020

I made a N-Body algorithm implementation, and on my server, compiling with target-cpu=knl works much better than with target-cpu=native.

It's like it vectorizes better, but without adding any target-feature.

Although I am not in a KNL, it is true that the server has similar instructions and for some reason it works better (i mean, the algorithm takes less time to finish)

@Amanieu
Copy link
Member

Amanieu commented Oct 6, 2020

Currrently we tie the stabilization of target_feature features with the implementation of the relevant intrinsics in stdarch (The AVX512 intrinsics are still incomplete). However I think we should separate these now that we have stdsimd.

@workingjubilee
Copy link
Member

Knight's Landing chips lack the narrower-width SSE instructions so it is likely that some things that are lowering to SSE instructions while using -Ctarget-cpu=native are lowering to AVX instructions with -Ctarget-cpu=knl.

@workingjubilee
Copy link
Member

I just pestered everyone by mentioning this in the Zulip so I should mention it here: I should note that "AVX-512" is by no means a singular unitary feature, there is avx512f ("F" for "Foundation", perhaps?) and also extension features for AVX-512 that build on top of avx512f, so that's something to be aware of. Our main attention will be on supporting the concepts of 512-bit vectors abstractly in our API and in a manner that is vendor-neutral so that the compiler can do the best job it can with a desired intention without the programmer having to get into the nitty-gritty specifics of Intel's API.

@jedbrown
Copy link

@ManuelCostanzo Note that gcc/clang/icc generally avoid 512-bit registers even when compiled for skylake-avx512 due to license-based downclocking, which includes stalls at frequency transitions. You have to specifically request them by something like -mprefer-vector-width=512. Ice Lake has a big improvement in downclocking so we might see compilers using 512-bit registers by default. rustc --print target-features suggests that there is no way to encourage the compiler to actually use 512-bit registers (which is what you want if you spend lots of time in sustained avx512 code).

@calebzulawski
Copy link
Member

Unfortunately I believe that will also be entirely out of our hands unless LLVM provides a mechanism for encouraging it. Using target-cpu=native may help in some cases?

@jedbrown
Copy link

Aha! rustc -Ctarget-cpu=skylake-avx512 -Ctarget-feature=-prefer-256-bit. It's confusing because +prefer-256-bit is the default and one specifies that they want 512 by disabling it -- I'd been looking for +prefer-512-bit, which doesn't exist.

https://godbolt.org/z/nEMWz9

@workingjubilee
Copy link
Member

At first I was considering to myself, "shouldn't this issue be closed?" since it's not something the Portable SIMD API can help with per se. However, past and future Jubilees, please consider: These specific instructions on targeting AVX512-enabled architectures should probably go somewhere, and from that "guide-level" perspective, this is within the scope of our mandate.

@tarcieri
Copy link

tarcieri commented Feb 4, 2021

AVX512 would certainly be nice for cryptography. For example, curve25519-dalek has a backend leveraging AVX512-IFMA.

GHASH (used by AES-GCM) also benefits from VPCLMULQDQ, but it's already possible to leverage from Rust just by using target-cpu=skylake

The Keccak sponge function (used by the SHA3 family and the KangarooTwelve XOF) is another example of an algorithm that could benefit: https://github.com/XKCP/K12/blob/master/lib/Optimized64/KeccakP-1600-AVX512-plainC.c

@workingjubilee
Copy link
Member

@tarcieri When I said "specific instructions" I meant for human usage.

Conversely, guaranteeing specific machine instructions, including for specific SIMD architectures, are compiled into the binary is not actually in-scope for the SIMD API project, as much of a paradox as that may seem, so usages like those will likely continue to depend on core::arch::x86_64, etc.

@tarcieri
Copy link

tarcieri commented Feb 5, 2021

If I understand what you're saying, there are specific logical operations the above AVX512 use cases map to, but there may not be corresponding Rust traits for those operations.

The curve25519-dalek use case requires a multiply-accumulate operation, namely fused multiply–add.

The GHASH use case is carryless multiplication. I'm not sure what a good API is for distinguishing that from a more traditional multiply-with-carry.

Keccak is simple bitwise ops like XOR and shuffles.

@calebzulawski
Copy link
Member

FMA will likely be supported at some point (regardless of AVX-512). Unfortunately llvm doesn't expose carry-less multiply (https://groups.google.com/g/llvm-dev/c/5cpOboKOBg4/m/kJ9z_xkVAQAJ) so you'd probably need to use std::arch for that.

@workingjubilee
Copy link
Member

Knight's Landing chips lack the narrower-width SSE instructions so it is likely that some things that are lowering to SSE instructions while using -Ctarget-cpu=native are lowering to AVX instructions with -Ctarget-cpu=knl.

This was wrong, actually! It is Knight's Corner and Knight's Ferry that don't support SSE! KNL does support SSE, but it has the really wide vectors plus some other performance quirks that cause LLVM to favor using big fat full vectors.

@jedbrown
Copy link

It'll be 256-bit AVX/FMA versus AVX-512. KNL didn't suffer the license-based downclocking so compilers issue 512-bit (zmm) instructions by default. They need coaxing to issue those when targeting skylake-avx512 (#28 (comment)). Note that the skylake target does not support AVX-512 at all.

@workingjubilee workingjubilee added the E-needs-docs Needs documentation added. label May 3, 2021
@mhnatiuk
Copy link

Hi,
I'm starting to learn Rust for scientific computing. Is this issue already resolved elsewhere by rust developers?

@jedbrown
Copy link

@mhnatiuk It depends what you're striving for. rustc makes portable binaries by default, but you can either change the target globally (see examples ☝️; this is the most common approach in scientific computing) or compile multiple variants of hot vectorizable parts of your code and specialize at run-time (nicer for packaging and distribution).

@jorgecarleitao
Copy link
Contributor

portable-simd does seem to hit the AVX512 instruction set when compiled with target-cpu=native".

This is implicitly deduced by the fact that the performance of a masked sum equals the sum of an un-masked sum when the mask is represented as a bitmap. See https://github.com/DataEngineeringLabs/simd-benches#bench-results-on-native for details. The particular comparison is "Sum of nullable values (Bitmap)" vs "Sum of values".

@JeWaVe
Copy link

JeWaVe commented Feb 8, 2024

Hi,

any news for this issue ?
We are now in 2024 and with rustc 1.76 I get

the target feature avx512f is currently unstable

How could I help to stabilize ?

@HadrienG2
Copy link

I guess the right place to ask would be rust-lang/stdarch#310 ?

@calebzulawski
Copy link
Member

Currrently we tie the stabilization of target_feature features with the implementation of the relevant intrinsics in stdarch (The AVX512 intrinsics are still incomplete). However I think we should separate these now that we have stdsimd.

I think this is the relevant comment--someone will need to spend some time splitting the feature and stabilizing the target features and leave the intrinsics for another time. I'm not sure if there's any good reason for holding back stabilization at this point.

@Amanieu
Copy link
Member

Amanieu commented Feb 12, 2024

I agree that it's fine not to block the target feature on the intrinsics.

@tarcieri
Copy link

Notably it would be nice to have the target_feature stable so ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable

@mert-kurttutan
Copy link

Notably it would be nice to have the target_feature stable so ZMM registers can be used with inline assembly, even if the relevant intrinsics aren't stable

Using ZMM registers as clobbered registers (i.e. out("zmm0) _,) (and using runtime detection with rawcpu_id or cpufeatures crate) seems to work. If you dont need to have ZMM registers as input or output, this may work.
@tarcieri I would love to see your insight on this method?

@tarcieri
Copy link

tarcieri commented Jul 8, 2024

@mert-kurttutan while we could potentially go out of our way to avoid using ZMM registers as inputs/outputs, what we'd really like to eventually use are intrinsics like _mm512_aesenc_epi128, which take ZMM registers as inputs and outputs. But in the meantime, we can use an asm! polyfill instead, or at least we could if ZMM registers are stable.

Really these operations benefit the most from always being able to keep data in ZMM registers, and unless we have a stable way to get data in and out of them it involves hoisting more and more into the inline assembly to fill those ZMM registers. We also have algorithms factored into different crates where it would be nice to be able to keep data in ZMM registers even when calling functions between crates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-needs-docs Needs documentation added.
Projects
None yet
Development

No branches or pull requests