Question re SSE vs AVX2 #202
Comments
Looks correct to me. What kind of performance do you expect? The arithmetic intensity of your kernel (work / memory traffic) is, for an elementwise `f32` addition:

- scalar: 1 addition per 12 bytes of memory traffic (two 4-byte loads + one 4-byte store)
- `f32x4`: 1 vector instruction (4 additions) per 48 bytes of memory traffic
- `f32x8`: 1 vector instruction (8 additions) per 96 bytes of memory traffic

(note: your CPU might be able to execute multiple vector instructions per cycle, e.g., 3, which could reduce the numbers above by a factor of ~3 if you unroll the loop). If you have a CPU running at 3 GHz, then each cycle takes 1/3 of a nanosecond, so at one instruction per cycle these kernels need a memory bandwidth of (note: 1 nanosecond is 1e-9 s and 1 byte is 1e-9 GB):

- scalar: 12 bytes/cycle * 3e9 cycles/s = 36 GB/s
- `f32x4`: 48 bytes/cycle * 3e9 cycles/s = 144 GB/s
- `f32x8`: 96 bytes/cycle * 3e9 cycles/s = 288 GB/s

(note: this completely ignores memory latency) You can compare these rough numbers with the max memory bandwidth of your CPU to get an idea of whether your kernel could be compute bound or memory bound. I doubt your CPU can sustain 36 GB/s for a single thread (although it is in the ballpark of what's possible), much less 144 GB/s, so while you could do a proper roofline analysis of the kernel to see what is limiting its performance, my bet is that if your problem size does not fit into any of the CPU caches, your CPU is idle most of the time just waiting for memory. Doing more work in the same amount of time (using SIMD, parallelization, etc.) just lowers the arithmetic intensity even further, making the CPU idle even more.

If you change your kernel to not allocate any memory, e.g., take a mutable reference to the result buffer as an argument, and benchmark it against problem sizes that fit in your CPU's L1 cache (e.g. an array length of 2048 or 1024 elements), you might see a difference between the 128-bit and 256-bit versions, depending on how high the memory bandwidth between your CPU and L1 is (e.g. Skylake has a bandwidth of 128 bytes / cycle, so you might be able to see a difference there).
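A minimal allocation-free version of such a kernel might look like this (a sketch, assuming an elementwise `f32` addition; the names are illustrative):

```rust
use packed_simd::f32x8;

/// Writes `a[i] + b[i]` into a caller-provided buffer instead of
/// allocating a new one on every call. Assumes all three slices have
/// the same length and that it is a multiple of the vector width.
pub fn add_into(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for i in (0..a.len()).step_by(f32x8::lanes()) {
        let va = f32x8::from_slice_unaligned(&a[i..]);
        let vb = f32x8::from_slice_unaligned(&b[i..]);
        (va + vb).write_to_slice_unaligned(&mut out[i..]);
    }
}
```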
Hi @gnzlbg, thanks very much for the response. You are indeed correct: it does seem to be memory bound, judging by the specs of my CPU.

Naively, I was expecting the 256-bit version to be roughly twice as fast as the 128-bit one. I do have a few follow-up questions, if you do not mind: Is there an advantage to providing all the possible implementations and allowing the user of your library to select the instruction set at compile time, or is runtime detection of the "best possible" instruction set always better? And is there any way to reconcile the instruction set names (the argument to `is_x86_feature_detected!`) with the vector types in `packed_simd`?

Thank you. I will close the issue, as there is no issue as such in `packed_simd`.
Neither solution is always better than the other. If the user knows at compile time which instruction set they always want to use, they can avoid the cost of run-time feature detection. If they don't, the cost of run-time feature detection is often acceptable as long as the performance improvement of the SIMD algorithms makes it worth it.

The types in `packed_simd` are portable: `f32x8` works on every target, but whether its operations compile down to single 256-bit instructions depends on which target features are enabled when the code is generated. So what you want to do is make sure that the appropriate target features are available, either by enabling them globally via `-C target-feature` (or `-C target-cpu=native`) in `RUSTFLAGS`, or per function via `#[target_feature(enable = "...")]` combined with run-time feature detection.
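A minimal sketch of the per-function approach (following the allocation-free kernel above; the names are illustrative):

```rust
use packed_simd::f32x8;

#[inline(always)]
fn add_impl(a: &[f32], b: &[f32], out: &mut [f32]) {
    // The same kernel as above: it is compiled with whatever target
    // features are enabled for the function it gets inlined into.
    for i in (0..a.len()).step_by(f32x8::lanes()) {
        let va = f32x8::from_slice_unaligned(&a[i..]);
        let vb = f32x8::from_slice_unaligned(&b[i..]);
        (va + vb).write_to_slice_unaligned(&mut out[i..]);
    }
}

// AVX2 is enabled for this function only, so the `f32x8` ops can
// lower to single 256-bit instructions here.
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    add_impl(a, b, out)
}

pub fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    if is_x86_feature_detected!("avx2") {
        // Safe: we just verified at run time that AVX2 is available.
        unsafe { add_avx2(a, b, out) }
    } else {
        // Fallback, compiled with only the globally enabled features.
        add_impl(a, b, out)
    }
}
```

The global route needs no dispatch code at all, e.g. `RUSTFLAGS="-C target-feature=+avx2" cargo build --release`, at the cost of producing a binary that requires AVX2.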
This leads me to think that I should always use the largest register types in `packed_simd` and select the right implementation with run-time feature detection, along the lines of the dispatch sketch above.
That sounds like a good solution. Just keep in mind that the code shown above does not check that the length of the array is divisible by the number of elements in the vector. If that is allowed to mismatch, you are going to need to handle the remainder elements "somehow".
I need to check this fully, but I believe that the memory allocations in the Arrow spec (and the Rust impl in particular) might mean that I don't have to worry about this (I'd only be writing to some spare capacity). If you were worried about trampling on adjacent memory, though, what is the most efficient or idiomatic way to assign back only a portion of the SIMD register into the result buffer?
You probably just want to convert the SIMD register into an appropriate array (e.g. `[f32; 8]`) and then copy only the remaining elements from it into the result buffer. There are better ways to do that, but if your array is large enough, how exactly you handle the tail doesn't really matter unless one does something really bad performance-wise (the naive solution here is probably ok).
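A naive version of that tail handling might look like this (a sketch, again assuming the elementwise add; names are illustrative):

```rust
use packed_simd::f32x8;

/// Handles the final `a.len() % f32x8::lanes()` elements: stages the
/// inputs in zero-padded stack arrays, does one vector op, then copies
/// only the valid prefix back so no adjacent memory is touched.
fn add_tail(a: &[f32], b: &[f32], out: &mut [f32], start: usize) {
    let n = a.len() - start;
    debug_assert!(n > 0 && n < f32x8::lanes());
    let mut ta = [0.0f32; 8];
    let mut tb = [0.0f32; 8];
    ta[..n].copy_from_slice(&a[start..]);
    tb[..n].copy_from_slice(&b[start..]);
    let v = f32x8::from_slice_unaligned(&ta) + f32x8::from_slice_unaligned(&tb);
    // Write the full vector to a stack array, then copy back only the
    // `n` valid elements, leaving everything past `out[start + n]` alone.
    let mut tmp = [0.0f32; 8];
    v.write_to_slice_unaligned(&mut tmp);
    out[start..].copy_from_slice(&tmp[..n]);
}
```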
I see. Thank you very much for all your help. It is very much appreciated!
Hi,

I am experimenting with adding SIMD support to Apache Arrow with `packed_simd`. I have a prototype working, but I am not getting any speedup using AVX2 (`f32x8`) vs SSE (`f32x4`). I am creating my accelerated functions using the following macro:
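(A minimal sketch of such a macro, assuming elementwise binary ops; the names are illustrative.)

```rust
use packed_simd::{f32x4, f32x8};

// Generates a SIMD-accelerated elementwise binary op for a given
// vector type, e.g. one function for `f32x4` and one for `f32x8`.
macro_rules! simd_binary_op {
    ($name:ident, $vec:ty, $op:tt) => {
        pub fn $name(a: &[f32], b: &[f32]) -> Vec<f32> {
            assert_eq!(a.len(), b.len());
            // Note: allocates a fresh result buffer on every call.
            let mut out = vec![0.0f32; a.len()];
            for i in (0..a.len()).step_by(<$vec>::lanes()) {
                let va = <$vec>::from_slice_unaligned(&a[i..]);
                let vb = <$vec>::from_slice_unaligned(&b[i..]);
                (va $op vb).write_to_slice_unaligned(&mut out[i..]);
            }
            out
        }
    };
}

simd_binary_op!(add_f32x4, f32x4, +);
simd_binary_op!(add_f32x8, f32x8, +);
```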
My tests pass just fine, and `is_x86_feature_detected!` returns `true` for both `"sse"` and `"avx2"`. However, when I benchmark my code, the results for both of these functions are pretty much the same. Am I missing something, or am I not using `packed_simd` correctly? Thank you.