Question re SSE vs AVX2 #202

Closed
paddyhoran opened this issue Jan 21, 2019 · 8 comments

@paddyhoran

Hi,

I am experimenting with adding SIMD support to Apache Arrow with packed_simd. I have a prototype working, but I am not getting any speed-up from AVX2 (f32x8) over SSE (f32x4).

I am creating my accelerated functions using the following macro:

macro_rules! simd_add {
    ($ins_set:expr, $instruction_set:ident, $simd_ty:ident, $native_ty:ty, $array_ty:ident) => {
        #[target_feature(enable = $ins_set)]
        pub unsafe fn $instruction_set(left: &$array_ty, right: &$array_ty) -> Result<$array_ty>
        {
            if left.len() != right.len() {
                return Err(ArrowError::ComputeError(
                    "Cannot perform math operation on arrays of different length".to_string(),
                ));
            }

            let lanes = $simd_ty::lanes();
            let buffer_size = left.len() * mem::size_of::<$native_ty>();
            let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);

            // Process `lanes` elements per iteration; this assumes
            // left.len() is a multiple of the lane count.
            for i in (0..left.len()).step_by(lanes) {
                let simd_left = $simd_ty::from_slice_unaligned_unchecked(left.value_slice(i, lanes));
                let simd_right = $simd_ty::from_slice_unaligned_unchecked(right.value_slice(i, lanes));
                let simd_result = simd_left + simd_right;

                // Reinterpret the output bytes as `$native_ty` and write
                // the vector sum directly into the buffer.
                let result_slice: &mut [$native_ty] = from_raw_parts_mut(
                    (result.data_mut().as_mut_ptr() as *mut $native_ty).offset(i as isize),
                    lanes
                );
                simd_result.write_to_slice_unaligned_unchecked(result_slice);
            }

            Ok($array_ty::new(
                left.len(),
                result.freeze(),
                0,
                0))
        }
    };
}

simd_add!("sse", add_sse, f32x4, f32, Float32Array);
simd_add!("avx2", add_avx2, f32x8, f32, Float32Array);

My tests pass just fine, and is_x86_feature_detected! returns true for both "sse" and "avx2". However, when I benchmark my code, the results for both of these functions are pretty much the same.

Am I not using packed_simd correctly? Thank you.

@gnzlbg
Contributor

gnzlbg commented Jan 21, 2019

Am I not using packed_simd correctly?

Looks correct to me. What kind of performance do you expect?

The arithmetic intensity of your kernel (work / memory traffic) is:

  • scalar code: (1 scalar f32 add) / (2 × f32 loads + 1 × f32 store) ≈ 1 cycle / 12 bytes
  • simd 128-bit: (1 vector 4×f32 add) / (4 × (2 × f32 loads + 1 × f32 store)) ≈ 1 cycle / 48 bytes
  • simd 256-bit: (1 vector 8×f32 add) / (8 × (2 × f32 loads + 1 × f32 store)) ≈ 1 cycle / 96 bytes

(note: your CPU might be able to execute multiple vector instructions per cycle, e.g. 3, which would multiply the bytes-per-cycle figures above by ~3 if you unroll the loop).

If you have a CPU running at 3 GHz, each cycle takes 1/3 of a nanosecond, so to stay busy these kernels need a memory bandwidth of (note that 1 nanosecond is 1e-9 s and 1 byte is 1e-9 GB):

  • scalar: 12 / (1/3) = 36 GB/s
  • simd 128-bit: 48 / (1/3) = 144 GB/s
  • simd 256-bit: 96 / (1/3) = 288 GB/s

(note: this completely ignores memory latency)

You can compare these rough numbers with the max memory bandwidth of your CPU to get an idea of whether your kernel is compute bound or memory bound. I doubt your CPU can sustain 36 GB/s for a single thread (although that is in the ballpark of what's possible), much less 144 GB/s. So while you could do a proper roofline analysis of the kernel to see what is limiting its performance, my bet is that if your problem size does not fit into any of the CPU caches, your CPU is idle most of the time, just waiting for memory. Doing more work in the same amount of time (using SIMD, parallelization, etc.) just lowers the arithmetic intensity even further, making the CPU idle even more.

If you change your kernel to not allocate any memory, e.g. by taking a mutable reference to the result buffer as an argument, and benchmark it against problem sizes that fit in your CPU's L1 cache (e.g. an array length of 1024 or 2048 elements), you might see a difference between the 128-bit and 256-bit versions, depending on how high the memory bandwidth between your CPU and L1 is (e.g. Skylake has a bandwidth of 128 bytes / cycle, so you might be able to see a difference there).
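
Something like this is what I mean (a minimal, untested sketch over plain &[f32] slices rather than Arrow's array types; the function name and signature are just illustrative):

use packed_simd::f32x8;

// Allocation-free kernel: the caller provides the output buffer.
// Assumes left, right, and out have equal lengths that are a
// multiple of the lane count, and that AVX2 was detected at runtime.
#[target_feature(enable = "avx2")]
pub unsafe fn add_into_avx2(left: &[f32], right: &[f32], out: &mut [f32]) {
    let lanes = f32x8::lanes();
    for i in (0..left.len()).step_by(lanes) {
        let a = f32x8::from_slice_unaligned_unchecked(&left[i..]);
        let b = f32x8::from_slice_unaligned_unchecked(&right[i..]);
        (a + b).write_to_slice_unaligned_unchecked(&mut out[i..]);
    }
}

If you allocate out once outside the timed loop and benchmark at, say, 1024 elements, the allocator and main memory drop out of the measurement and you are timing the arithmetic itself.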

@paddyhoran
Author

Hi @gnzlbg,

Thanks very much for the response. You are indeed correct that it is memory bound; here are the specs of my CPU.

What kind of performance do you expect?

Naively I was expecting that f32x8 would be roughly twice as fast as f32x4, I see now that there are a lot of factors at play.

I do have a few follow-up questions, if you do not mind...

Is there an advantage to providing all the implementations possible and allowing the user of your library to select the instruction set at compile time, or is runtime detection of the "best possible" instruction set always better?

Is there any way to reconcile the instruction set names (the argument to is_x86_feature_detected!) with the types in packed_simd or the methods on those types (if they leverage newer intrinsics rather than larger registers)?

Thank you. I will close this issue, as there is no bug in packed_simd itself.

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

Is there an advantage to providing all the implementations possible and allowing the user of your library to select the instruction set at compile time, or is runtime detection of the "best possible" instruction set always better?

Neither solution is always better than the other. If users know at compile time which instruction set they want, they can avoid the cost of run-time feature detection. If they don't, the cost of run-time feature detection is often acceptable, as long as the performance improvement from the SIMD algorithms makes it worth it.
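
For the run-time route, the usual pattern is a small dispatch wrapper (a sketch, assuming the add_avx2 / add_sse functions generated by your macro; add_scalar is a hypothetical fallback):

pub fn add(left: &Float32Array, right: &Float32Array) -> Result<Float32Array> {
    if is_x86_feature_detected!("avx2") {
        // Sound: the feature was just verified at run time.
        unsafe { add_avx2(left, right) }
    } else if is_x86_feature_detected!("sse") {
        unsafe { add_sse(left, right) }
    } else {
        add_scalar(left, right) // hypothetical non-SIMD fallback
    }
}

If the per-call detection cost matters, you can run the detection once and cache the chosen function pointer.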

Is there any way to reconcile the instruction set names (the argument to is_x86_feature_detected!) with the types in packed_simd or the methods on those types (if they leverage newer intrinsics rather than larger registers)?

The types in packed_simd work correctly independently of the target features available - packed_simd chooses different implementations depending on the target features (e.g. if AVX is not available, f32x8 operations will work on two 128-bit registers instead of a 256-bit one).

So what you want to do is make sure that the appropriate target features are available, either by enabling them globally via -C target-feature or on a per-function basis via #[target_feature(enable = ...)]. Some optimizations in packed_simd only work when the features are enabled globally, though.
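
Concretely, the two options look like this (an illustrative sketch; the names are made up):

use packed_simd::f32x8;

// Per function: only this body is compiled assuming AVX2, and every
// caller must guarantee at run time that AVX2 is actually available.
#[target_feature(enable = "avx2")]
pub unsafe fn add8_avx2(a: f32x8, b: f32x8) -> f32x8 {
    a + b
}

// Globally: build the whole crate with the feature enabled, e.g.
//   RUSTFLAGS="-C target-feature=+avx2" cargo build --release
// Then plain safe code like this also compiles to 256-bit registers:
pub fn add8(a: f32x8, b: f32x8) -> f32x8 {
    a + b
}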

@paddyhoran
Author

if AVX is not available, f32x8 operations will work on two 128-bit registers instead of a 256-bit one

This leads me to think that I should always use the largest register widths in packed_simd and rely on this conversion on targets where only smaller registers are available. So if I don't want runtime feature detection, then for Float32Array above I could provide a single implementation of the add kernel using f32x16 and let the user enable the right features via -C target-feature?

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

That sounds like a good solution.

Just keep in mind that the code shown above does not check that the length of the array is divisible by the number of lanes in the vector. If they are allowed to mismatch, you are going to need to handle the remainder elements "somehow".

@paddyhoran
Author

I need to check this fully, but I believe that the memory allocation rules in the Arrow spec (and the Rust implementation in particular) might mean that I don't have to worry about this (I'd only be writing into some spare capacity).

If you were worried about trampling on adjacent memory, though, what is the most efficient or idiomatic way to assign only a portion of the SIMD register back into the result buffer?

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

If you were worried about trampling on adjacent memory, though, what is the most efficient or idiomatic way to assign only a portion of the SIMD register back into the result buffer?

You probably just want to convert the simd register into an appropriate [T; N] array, and then only copy the elements of the array that you care about into the resulting buffer (using a normal for loop).

There are better ways to do that, but if your array is large enough, how exactly you handle the tail doesn't really matter, unless you do something really bad performance-wise (the naive solution here is probably OK).
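
For example (a sketch over plain &[f32] slices, not Arrow's types; the hardcoded 8s match f32x8's lane count):

use packed_simd::f32x8;

// Vector loop over the full chunks, then a tail that goes through
// stack arrays so the vector load/store never touch memory outside
// the slices. Assumes all three slices have the same length.
pub fn add(left: &[f32], right: &[f32], out: &mut [f32]) {
    let lanes = f32x8::lanes();
    let split = left.len() / lanes * lanes; // end of the full vectors
    for i in (0..split).step_by(lanes) {
        let a = f32x8::from_slice_unaligned(&left[i..]);
        let b = f32x8::from_slice_unaligned(&right[i..]);
        (a + b).write_to_slice_unaligned(&mut out[i..]);
    }
    let rem = left.len() - split;
    if rem > 0 {
        let (mut a, mut b, mut tmp) = ([0.0f32; 8], [0.0f32; 8], [0.0f32; 8]);
        a[..rem].copy_from_slice(&left[split..]);
        b[..rem].copy_from_slice(&right[split..]);
        let v = f32x8::from_slice_unaligned(&a) + f32x8::from_slice_unaligned(&b);
        v.write_to_slice_unaligned(&mut tmp);
        // Copy back only the elements we care about.
        for j in 0..rem {
            out[split + j] = tmp[j];
        }
    }
}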

@paddyhoran
Author

I see. Thank you very much for all your help. It is very much appreciated!
