
[WIP] [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" #3451

Closed
wants to merge 25 commits into from

Conversation

paddyhoran
Contributor

@andygrove @sunchao this is nowhere near done but I did want to get your opinions on two items before I go further:

  • the choice of packed_simd. There are other options out there, and some try to provide a higher-level API. However, I feel that the ecosystem has not matured to the point where any single third-party library is the clear choice for SIMD in Rust, and therefore it is not worth picking one up as a dependency at this time. In Arrow we will always be working with packed vectors; couple this with the fact that the objective of packed_simd is to get stabilized in the future, and I think that packed_simd is a good choice. Alternatively, we could use the raw intrinsics in std::arch.

  • Re-organization of what is called array_ops into a compute sub-module. This is for two reasons. First, although I will try to make the SIMD-optimized versions of the code as easy to use as possible (with run-time detection, etc.), the actual implementation of SIMD code tends to be a little verbose, as we will want to conditionally compile different versions for different CPUs. Second, the C++ version is structured this way, with a compute sub-module.
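For context, the packed-vector pattern under discussion looks roughly like the sketch below. This is plain Rust rather than packed_simd (so the lane width is just a constant), and `LANES` and `add_chunked` are illustrative names, not this PR's API:

```rust
// Illustrative lane width; with packed_simd this would be fixed by the
// vector type (e.g. f32x8 holds 8 lanes).
const LANES: usize = 8;

// Adds two equal-length slices chunk by chunk, mirroring how a SIMD kernel
// loads LANES values at a time and finishes the remainder with a scalar loop.
fn add_chunked(left: &[f32], right: &[f32]) -> Vec<f32> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let full = left.len() / LANES * LANES;
    for i in (0..full).step_by(LANES) {
        // With packed_simd this inner loop becomes one vector load, add, store.
        for j in i..i + LANES {
            out.push(left[j] + right[j]);
        }
    }
    // Scalar tail for lengths that are not a multiple of LANES.
    for j in full..left.len() {
        out.push(left[j] + right[j]);
    }
    out
}

fn main() {
    let l: Vec<f32> = (0..20).map(|i| i as f32).collect();
    let r = vec![1.0f32; 20];
    println!("{:?}", add_chunked(&l, &r));
}
```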

@xhochy
Member

xhochy commented Jan 22, 2019

@paddyhoran This PR includes some commits in the diff view that should not come up. Can you rebase on master?

@paddyhoran
Contributor Author

Yep, sorry about that. I'll rebase when I get a chance.

@andygrove
Member

@paddyhoran This is looking good. I don't have experience with SIMD yet, but it was on my list to learn, so this seems like a good opportunity. I will start testing this.

I like the compute module re-org.

@sunchao
Member

sunchao commented Jan 23, 2019

Thanks for the work, @paddyhoran! I'll take a look at this too.

@paddyhoran
Contributor Author

Hold off for a little, I'm trying to clean it up tonight. I'll post what I have to get your opinions.

@nevi-me
Contributor

nevi-me commented Jan 23, 2019

I don't have any SIMD experience, I however 👍 the compute addition, as it's similar to what's being done in the cpp codebase.

@sunchao sunchao left a comment

Thanks @paddyhoran. I also think packed_simd is a good choice for now, for this purpose. Left a few comments.

    let raw = unsafe { std::slice::from_raw_parts(self.raw_values(), self.len()) };
    &raw[offset..offset + len]

    let raw =
        unsafe { std::slice::from_raw_parts(self.raw_values().offset(offset as isize), len) };
Member

Why this change?

Contributor Author

The comment says that it does not do bounds checking but I found that it did.
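To illustrate the bounds-checking difference behind this change (these are illustrative standalone functions, not the PR's `value_slice` implementation):

```rust
// Range-indexing a safe slice IS bounds-checked: this panics if
// offset + len exceeds values.len().
fn slice_checked(values: &[i32], offset: usize, len: usize) -> &[i32] {
    &values[offset..offset + len]
}

// Building the slice directly from a raw pointer skips that check; the
// caller must guarantee that offset + len <= values.len().
unsafe fn slice_unchecked(values: &[i32], offset: usize, len: usize) -> &[i32] {
    unsafe { std::slice::from_raw_parts(values.as_ptr().add(offset), len) }
}

fn main() {
    let values = [10, 20, 30, 40];
    println!("{:?}", slice_checked(&values, 1, 2));
    println!("{:?}", unsafe { slice_unchecked(&values, 1, 2) });
}
```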

        + Div<Output = T::Native>
        + Zero,
    {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Member

Can we make this runtime detection? e.g.:

    if is_x86_feature_detected!("avx2") {
        return add_simd(&left, &right);
    } else {
        math_op(left, right, |a, b| Ok(a + b))
    }

Contributor Author

See below.


    /// Performs a SIMD add operation
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fn add(left: Self::Simd, right: Self::Simd) -> Self::Simd;
Member

Can we make this general to all math operations? e.g., +, -, *, /. Seems they are supported by the simd type.
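One way this could generalize, sketched with a closure over plain scalar slices rather than the PR's Simd associated type (all names here are illustrative):

```rust
// A single kernel parameterized by the operation, so +, -, *, and / can
// share the chunking and dispatch logic. With packed_simd the closure
// would take vector types instead of scalars.
fn binary_op<F>(left: &[i32], right: &[i32], op: F) -> Vec<i32>
where
    F: Fn(i32, i32) -> i32,
{
    assert_eq!(left.len(), right.len());
    left.iter().zip(right).map(|(&a, &b)| op(a, b)).collect()
}

fn main() {
    let (l, r) = ([6, 8, 10], [2, 4, 5]);
    println!("{:?}", binary_op(&l, &r, |a, b| a + b));
    println!("{:?}", binary_op(&l, &r, |a, b| a / b));
}
```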

    for i in (0..left.len()).step_by(lanes) {
        let simd_left = T::load(left.value_slice(i, lanes));
        let simd_right = T::load(right.value_slice(i, lanes));
        let simd_result = T::add(simd_left, simd_right);
Member

How are we going to handle nulls?

Contributor Author

I still have to work on nulls
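For what it's worth, a common approach is to compute the values unconditionally and combine the two validity bitmaps with a bitwise AND, which is itself SIMD-friendly. A minimal sketch over raw u8 bitmap bytes (illustrative, not this PR's code):

```rust
// A result slot is valid only when it is valid on both inputs, so the
// combined validity bitmap is just the bytewise AND of the two inputs.
fn combine_validity(left: &[u8], right: &[u8]) -> Vec<u8> {
    left.iter().zip(right).map(|(a, b)| a & b).collect()
}

fn main() {
    // Each byte tracks the validity of 8 consecutive slots.
    println!("{:08b}", combine_validity(&[0b1011_0110], &[0b1101_0011])[0]);
}
```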

    @@ -17,6 +17,7 @@

    pub mod array;
    pub mod array_data;
    pub mod compute;
Member

We should not need this file anymore since we have lib.rs.

    /// available.
    pub trait ArrowNumericType: ArrowPrimitiveType {
        /// Defines the SIMD type that should be used for this numeric type
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Member

Can we put this above the trait definition? also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.

Contributor Author

It looked wrong to have #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] in multiple places, I plan to go back and see how to clean this up.

also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.

I don't quite understand what you mean here. What do you propose?

Member

Oh. I'm proposing to perhaps add another trait such as ArrowSIMDType, just for the SIMD purpose. Let me know if this makes sense.

Contributor Author

Right, that probably does make sense as eventually we will need SIMD ops over Boolean Arrays as well.
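A hypothetical shape of that split (ArrowSIMDType, Float32Type, and LANES are illustrative names; the real traits in the PR differ):

```rust
// ArrowNumericType stays arch-independent, and the SIMD-specific items
// move to a separate trait so the cfg attribute lives in one place.
trait ArrowNumericType {
    type Native;
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
trait ArrowSIMDType: ArrowNumericType {
    // Number of lanes the chosen packed_simd vector holds.
    const LANES: usize;
}

struct Float32Type;

impl ArrowNumericType for Float32Type {
    type Native = f32;
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
impl ArrowSIMDType for Float32Type {
    // f32x16: a 512-bit register holds 16 f32 lanes.
    const LANES: usize = 16;
}

fn main() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    println!("f32 lanes: {}", Float32Type::LANES);
}
```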

    };
    }

    make_numeric_type!(Int8Type, i8, i8x64);
Member

Have you considered AVX-512? Also wondering if it would be possible to support both, depending on the architecture...

Contributor Author

I actually thought that packed_simd only supported up to 256-bit registers; I need to update this.

@paddyhoran
Contributor Author

I wanted to leave a note on the general direction. At first I wanted runtime detection of the "best" intrinsics. However, after some research, there is no single "best" option; it depends on the situation.

Here is what I'm planning: support the largest SIMD registers available in packed_simd and allow the user to compile for whichever intrinsics they want via RUSTFLAGS="-C target-feature=+avx2" or similar.

packed_simd will do the correct thing where only smaller registers are available; e.g. an f32x16 would be converted to two f32x8 operations.

I believe that SSE is available on all Intel CPUs, and if you do not use RUSTFLAGS="-C target-feature=***" at all it will still use SSE.

I think we should add runtime detection via a feature flag in another Jira. Runtime detection is not always worth using: even though you may have access to intrinsics with wider registers, the memory bandwidth of your CPU may not allow you to take advantage of them. That's the situation I'm in on my dev machine. In this case you might end up checking for different intrinsics at runtime even though the most basic SSE version is just as fast.
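A sketch of that direction: a fixed scalar fallback, plus a dispatch point where runtime detection could later be slotted in behind a cargo feature. The "simd_runtime_detect" feature name and all function names here are hypothetical, not this PR's API:

```rust
// Scalar fallback kernel.
fn add_scalar(left: &[f32], right: &[f32]) -> Vec<f32> {
    left.iter().zip(right).map(|(a, b)| a + b).collect()
}

fn add_dispatch(left: &[f32], right: &[f32]) -> Vec<f32> {
    // With a hypothetical "simd_runtime_detect" feature enabled, probe the
    // CPU once and prefer a wider kernel only when it actually pays off.
    #[cfg(all(feature = "simd_runtime_detect", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // return add_avx2(left, right); // hypothetical AVX2 kernel
        }
    }
    add_scalar(left, right)
}

fn main() {
    println!("{:?}", add_dispatch(&[1.0, 2.0], &[3.0, 4.0]));
}
```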

@sunchao
Member

sunchao commented Jan 27, 2019

I think we should add runtime detection via a feature flag in another Jira.

Sure. I'm OK with adding the runtime detection later.

Runtime detection is not always worth using: even though you may have access to intrinsics with wider registers, the memory bandwidth of your CPU may not allow you to take advantage of them. That's the situation I'm in on my dev machine.

Thanks. This is interesting to know. Do you have any benchmark number, or any reference that explains this?

@paddyhoran
Contributor Author

The best resource is the issue I opened in packed_simd. I had compiled different versions of the add kernel for different instruction sets, but when I benchmarked them there was no improvement in performance from using larger registers, leading me to open the issue. The discussion with the maintainer of packed_simd explains it well.

@paddyhoran
Contributor Author

Closing in favor of smaller incremental PR's.

@paddyhoran paddyhoran closed this Feb 4, 2019
@paddyhoran paddyhoran changed the title [WIP] ARROW-4196: [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" [WIP] [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" Feb 4, 2019
@paddyhoran paddyhoran deleted the simd branch February 25, 2020 02:32