Question re SSE vs AVX2 #202

Closed
paddyhoran opened this issue Jan 21, 2019 · 8 comments

@paddyhoran

Hi,

I am experimenting with adding SIMD support to Apache Arrow with packed_simd. I have a prototype working, but I am not getting any speed-up from AVX2 (f32x8) over SSE (f32x4).

I am creating my accelerated functions using the following macro:

macro_rules! simd_add {
    ($ins_set:expr, $instruction_set:ident, $simd_ty:ident, $native_ty:ty, $array_ty:ident) => {
        #[target_feature(enable = $ins_set)]
        pub unsafe fn $instruction_set(left: &$array_ty, right: &$array_ty) -> Result<$array_ty>
        {
            if left.len() != right.len() {
                return Err(ArrowError::ComputeError(
                    "Cannot perform math operation on arrays of different length".to_string(),
                ));
            }

            let lanes = $simd_ty::lanes();
            let buffer_size = left.len() * mem::size_of::<$native_ty>();
            let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);

            // Process `lanes` elements per iteration; this assumes
            // left.len() is a multiple of the lane count.
            for i in (0..left.len()).step_by(lanes) {
                let simd_left = $simd_ty::from_slice_unaligned_unchecked(left.value_slice(i, lanes));
                let simd_right = $simd_ty::from_slice_unaligned_unchecked(right.value_slice(i, lanes));
                let simd_result = simd_left + simd_right;

                // Reinterpret the output bytes as `$native_ty` and write
                // the vector sum directly into the buffer.
                let result_slice: &mut [$native_ty] = from_raw_parts_mut(
                    (result.data_mut().as_mut_ptr() as *mut $native_ty).offset(i as isize),
                    lanes
                );
                simd_result.write_to_slice_unaligned_unchecked(result_slice);
            }

            Ok($array_ty::new(
                left.len(),
                result.freeze(),
                0,
                0))
        }
    };
}

simd_add!("sse", add_sse, f32x4, f32, Float32Array);
simd_add!("avx2", add_avx2, f32x8, f32, Float32Array);

My tests pass just fine, and is_x86_feature_detected! returns true for both "sse" and "avx2". However, when I benchmark my code, the results for both of these functions are pretty much the same.

Am I not using packed_simd correctly? Thank you.

@gnzlbg
Contributor

gnzlbg commented Jan 21, 2019

Am I not using packed_simd correctly?

Looks correct to me. What kind of performance do you expect?

The arithmetic intensity of your kernel (work / memory traffic) is:

  • scalar code: (1 scalar f32 add) / (2 × f32 loads + 1 × f32 store) ≈ 1 cycle / 12 bytes
  • simd 128-bit: (1 vector 4×f32 add) / (4 × (2 × f32 loads + 1 × f32 store)) ≈ 1 cycle / 48 bytes
  • simd 256-bit: (1 vector 8×f32 add) / (8 × (2 × f32 loads + 1 × f32 store)) ≈ 1 cycle / 96 bytes

(note: your CPU might be able to execute multiple vector instructions per cycle, e.g. 3, which would multiply the bytes-per-cycle figures above by ~3 if you unroll the loop).

If you have a CPU running at 3 GHz, each cycle takes 1/3 of a nanosecond, so to stay busy these kernels need a memory bandwidth of (note that 1 nanosecond is 1e-9 s and 1 byte is 1e-9 GB):

  • scalar: 12 / (1/3) = 36 GB/s
  • simd 128-bit: 48 / (1/3) = 144 GB/s
  • simd 256-bit: 96 / (1/3) = 288 GB/s

(note: this completely ignores memory latency)

You can compare these rough numbers with the max memory bandwidth of your CPU to get an idea of whether your kernel is compute bound or memory bound. I doubt your CPU can sustain 36 GB/s for a single thread (although that is in the ballpark of what's possible), much less 144 GB/s. So while you could do a proper roofline analysis of the kernel to see what is limiting its performance, my bet is that if your problem size does not fit into any of the CPU caches, your CPU is idle most of the time, just waiting for memory. Doing more work in the same amount of time (using SIMD, parallelization, etc.) just lowers the arithmetic intensity even further, making the CPU idle even more.

If you change your kernel to not allocate any memory, e.g. by taking a mutable reference to the result buffer as an argument, and benchmark it against problem sizes that fit in your CPU's L1 cache (e.g. an array length of 1024 or 2048 elements), you might see a difference between the 128-bit and 256-bit versions, depending on how high the memory bandwidth between your CPU and L1 is (e.g. Skylake has a bandwidth of 128 bytes / cycle, so you might be able to see a difference there).
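
Something like this is what I mean (a minimal, untested sketch over plain &[f32] slices rather than Arrow's array types; the function name and signature are just illustrative):

use packed_simd::f32x8;

// Allocation-free kernel: the caller provides the output buffer.
// Assumes left, right, and out have equal lengths that are a
// multiple of the lane count, and that AVX2 was detected at runtime.
#[target_feature(enable = "avx2")]
pub unsafe fn add_into_avx2(left: &[f32], right: &[f32], out: &mut [f32]) {
    let lanes = f32x8::lanes();
    for i in (0..left.len()).step_by(lanes) {
        let a = f32x8::from_slice_unaligned_unchecked(&left[i..]);
        let b = f32x8::from_slice_unaligned_unchecked(&right[i..]);
        (a + b).write_to_slice_unaligned_unchecked(&mut out[i..]);
    }
}

If you allocate out once outside the timed loop and benchmark at, say, 1024 elements, the allocator and main memory drop out of the measurement and you are timing the arithmetic itself.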

@paddyhoran
Author

Hi @gnzlbg,

Thanks very much for the response. You are indeed correct that it is memory bound; here are the specs of my CPU.

What kind of performance do you expect?

Naively I was expecting that f32x8 would be roughly twice as fast as f32x4, I see now that there are a lot of factors at play.

I do have a few follow-up questions, if you do not mind...

Is there an advantage to providing all the implementations possible and allowing the user of your library to select the instruction set at compile time, or is runtime detection of the "best possible" instruction set always better?

Is there any way to reconcile the instruction set names (the argument to is_x86_feature_detected!) with the types in packed_simd or the methods on those types (if they leverage newer intrinsics rather than larger registers)?

Thank you. I will close this issue, as there is no bug in packed_simd itself.

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

Is there an advantage to providing all the implementations possible and allowing the user of your library to select the instruction set at compile time, or is runtime detection of the "best possible" instruction set always better?

Neither solution is always better than the other. If users know at compile time which instruction set they want, they can avoid the cost of run-time feature detection. If they don't, the cost of run-time feature detection is often acceptable, as long as the performance improvement from the SIMD algorithms makes it worth it.
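
For the run-time route, the usual pattern is a small dispatch wrapper (a sketch, assuming the add_avx2 / add_sse functions generated by your macro; add_scalar is a hypothetical fallback):

pub fn add(left: &Float32Array, right: &Float32Array) -> Result<Float32Array> {
    if is_x86_feature_detected!("avx2") {
        // Sound: the feature was just verified at run time.
        unsafe { add_avx2(left, right) }
    } else if is_x86_feature_detected!("sse") {
        unsafe { add_sse(left, right) }
    } else {
        add_scalar(left, right) // hypothetical non-SIMD fallback
    }
}

If the per-call detection cost matters, you can run the detection once and cache the chosen function pointer.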

Is there any way to reconcile the instruction set names (the argument to is_x86_feature_detected!) with the types in packed_simd or the methods on those types (if they leverage newer intrinsics rather than larger registers)?

The types in packed_simd work correctly independently of the target features available - packed_simd chooses different implementations depending on the target features (e.g. if AVX is not available, f32x8 operations will work on two 128-bit registers instead of a 256-bit one).

So what you want to do is make sure that the appropriate target features are available, either by enabling them globally via -C target-feature or on a per-function basis via #[target_feature(enable = ...)]. Some optimizations in packed_simd only work when the features are enabled globally, though.
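
Concretely, the two options look like this (an illustrative sketch; the names are made up):

use packed_simd::f32x8;

// Per function: only this body is compiled assuming AVX2, and every
// caller must guarantee at run time that AVX2 is actually available.
#[target_feature(enable = "avx2")]
pub unsafe fn add8_avx2(a: f32x8, b: f32x8) -> f32x8 {
    a + b
}

// Globally: build the whole crate with the feature enabled, e.g.
//   RUSTFLAGS="-C target-feature=+avx2" cargo build --release
// Then plain safe code like this also compiles to 256-bit registers:
pub fn add8(a: f32x8, b: f32x8) -> f32x8 {
    a + b
}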

@paddyhoran
Author

if AVX is not available, f32x8 operations will work on two 128-bit registers instead of a 256-bit one

This leads me to think that I should always use the largest register widths in packed_simd and rely on this conversion on targets where only smaller registers are available. So if I don't want runtime feature detection, then for Float32Array above I could provide a single implementation of the add kernel using f32x16 and let the user enable the right features via -C target-feature?

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

That sounds like a good solution.

Just keep in mind that the code shown above does not check that the length of the array is divisible by the number of lanes in the vector. If they are allowed to mismatch, you are going to need to handle the remainder elements "somehow".

@paddyhoran
Author

I need to check this fully, but I believe that the memory allocation rules in the Arrow spec (and the Rust implementation in particular) might mean that I don't have to worry about this (I'd only be writing into some spare capacity).

If you were worried about trampling on adjacent memory, though, what is the most efficient or idiomatic way to assign only a portion of the SIMD register back into the result buffer?

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

If you were worried about trampling on adjacent memory, though, what is the most efficient or idiomatic way to assign only a portion of the SIMD register back into the result buffer?

You probably just want to convert the simd register into an appropriate [T; N] array, and then only copy the elements of the array that you care about into the resulting buffer (using a normal for loop).

There are better ways to do that, but if your array is large enough, how exactly you handle the tail doesn't really matter, unless you do something really bad performance-wise (the naive solution here is probably OK).
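
For example (a sketch over plain &[f32] slices, not Arrow's types; the hardcoded 8s match f32x8's lane count):

use packed_simd::f32x8;

// Vector loop over the full chunks, then a tail that goes through
// stack arrays so the vector load/store never touch memory outside
// the slices. Assumes all three slices have the same length.
pub fn add(left: &[f32], right: &[f32], out: &mut [f32]) {
    let lanes = f32x8::lanes();
    let split = left.len() / lanes * lanes; // end of the full vectors
    for i in (0..split).step_by(lanes) {
        let a = f32x8::from_slice_unaligned(&left[i..]);
        let b = f32x8::from_slice_unaligned(&right[i..]);
        (a + b).write_to_slice_unaligned(&mut out[i..]);
    }
    let rem = left.len() - split;
    if rem > 0 {
        let (mut a, mut b, mut tmp) = ([0.0f32; 8], [0.0f32; 8], [0.0f32; 8]);
        a[..rem].copy_from_slice(&left[split..]);
        b[..rem].copy_from_slice(&right[split..]);
        let v = f32x8::from_slice_unaligned(&a) + f32x8::from_slice_unaligned(&b);
        v.write_to_slice_unaligned(&mut tmp);
        // Copy back only the elements we care about.
        for j in 0..rem {
            out[split + j] = tmp[j];
        }
    }
}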

@paddyhoran
Author

I see. Thank you very much for all your help. It is very much appreciated!
