
[WIP] [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" #3451

Closed
wants to merge 25 commits into from

Conversation

paddyhoran
Contributor

@andygrove @sunchao this is nowhere near done but I did want to get your opinions on two items before I go further:

  • the choice of packed_simd. There are other options out there, and some try to provide a higher-level API. However, I feel that the ecosystem has not matured to the point where any single third-party library is the clear choice for SIMD in Rust, and therefore it is not worth picking one up as a dependency at this time. In Arrow we will always be working with packed vectors; couple this with the fact that the objective of packed_simd is to get stabilized in the future, and I think that packed_simd is a good choice. Alternatively, we could use the raw intrinsics in std::arch.

  • Re-organization of what is called array_ops into a compute sub-module. This is for two reasons. First, although I will try to make the SIMD-optimized versions of the code as easy to use as possible (with run-time detection, etc.), the actual implementation of SIMD code tends to be a little verbose, as we will want to conditionally compile different versions for different CPUs. Second, the C++ version is structured this way, with a compute sub-module.
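For context, the packed-vector pattern under discussion looks roughly like the sketch below. This is plain Rust rather than packed_simd (so the lane width is just a constant), and `LANES` and `add_chunked` are illustrative names, not this PR's API:

```rust
// Illustrative lane width; with packed_simd this would be fixed by the
// vector type (e.g. f32x8 holds 8 lanes).
const LANES: usize = 8;

// Adds two equal-length slices chunk by chunk, mirroring how a SIMD kernel
// loads LANES values at a time and finishes the remainder with a scalar loop.
fn add_chunked(left: &[f32], right: &[f32]) -> Vec<f32> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let full = left.len() / LANES * LANES;
    for i in (0..full).step_by(LANES) {
        // With packed_simd this inner loop becomes one vector load, add, store.
        for j in i..i + LANES {
            out.push(left[j] + right[j]);
        }
    }
    // Scalar tail for lengths that are not a multiple of LANES.
    for j in full..left.len() {
        out.push(left[j] + right[j]);
    }
    out
}

fn main() {
    let l: Vec<f32> = (0..20).map(|i| i as f32).collect();
    let r = vec![1.0f32; 20];
    println!("{:?}", add_chunked(&l, &r));
}
```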

@xhochy
Member

xhochy commented Jan 22, 2019

@paddyhoran This PR includes some commits in the diff view that should not come up. Can you rebase on master?

@paddyhoran
Contributor Author

Yep, sorry about that. I'll rebase when I get a chance.

@andygrove
Member

@paddyhoran This is looking good. I don't have experience with SIMD yet, but it was on my list to learn, so this seems like a good opportunity. I will start testing this.

I like the compute module re-org.

@sunchao
Member

sunchao commented Jan 23, 2019

Thanks for the work, @paddyhoran! I'll take a look at this too.

@paddyhoran
Contributor Author

Hold off for a little, I'm trying to clean it up tonight. I'll post what I have to get your opinions.

@nevi-me
Contributor

nevi-me commented Jan 23, 2019

I don't have any SIMD experience, I however 👍 the compute addition, as it's similar to what's being done in the cpp codebase.

@sunchao sunchao left a comment

Thanks @paddyhoran. I also think packed_simd is a good choice for now, for this purpose. Left a few comments.

    let raw = unsafe { std::slice::from_raw_parts(self.raw_values(), self.len()) };
    &raw[offset..offset + len]

    let raw =
        unsafe { std::slice::from_raw_parts(self.raw_values().offset(offset as isize), len) };
Member

Why this change?

Contributor Author

The comment says that it does not do bounds checking but I found that it did.
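To illustrate the bounds-checking difference behind this change (these are illustrative standalone functions, not the PR's `value_slice` implementation):

```rust
// Range-indexing a safe slice IS bounds-checked: this panics if
// offset + len exceeds values.len().
fn slice_checked(values: &[i32], offset: usize, len: usize) -> &[i32] {
    &values[offset..offset + len]
}

// Building the slice directly from a raw pointer skips that check; the
// caller must guarantee that offset + len <= values.len().
unsafe fn slice_unchecked(values: &[i32], offset: usize, len: usize) -> &[i32] {
    unsafe { std::slice::from_raw_parts(values.as_ptr().add(offset), len) }
}

fn main() {
    let values = [10, 20, 30, 40];
    println!("{:?}", slice_checked(&values, 1, 2));
    println!("{:?}", unsafe { slice_unchecked(&values, 1, 2) });
}
```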

        + Div<Output = T::Native>
        + Zero,
    {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Member

Can we make this runtime detection? e.g.:

    if is_x86_feature_detected!("avx2") {
        return add_simd(&left, &right);
    } else {
        math_op(left, right, |a, b| Ok(a + b))
    }

Contributor Author

See below.


    /// Performs a SIMD add operation
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fn add(left: Self::Simd, right: Self::Simd) -> Self::Simd;
Member

Can we make this general to all math operations? e.g., +, -, *, /. Seems they are supported by the simd type.
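One way this could generalize, sketched with a closure over plain scalar slices rather than the PR's Simd associated type (all names here are illustrative):

```rust
// A single kernel parameterized by the operation, so +, -, *, and / can
// share the chunking and dispatch logic. With packed_simd the closure
// would take vector types instead of scalars.
fn binary_op<F>(left: &[i32], right: &[i32], op: F) -> Vec<i32>
where
    F: Fn(i32, i32) -> i32,
{
    assert_eq!(left.len(), right.len());
    left.iter().zip(right).map(|(&a, &b)| op(a, b)).collect()
}

fn main() {
    let (l, r) = ([6, 8, 10], [2, 4, 5]);
    println!("{:?}", binary_op(&l, &r, |a, b| a + b));
    println!("{:?}", binary_op(&l, &r, |a, b| a / b));
}
```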

    for i in (0..left.len()).step_by(lanes) {
        let simd_left = T::load(left.value_slice(i, lanes));
        let simd_right = T::load(right.value_slice(i, lanes));
        let simd_result = T::add(simd_left, simd_right);
Member

How are we going to handle nulls?

Contributor Author

I still have to work on nulls
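For what it's worth, a common approach is to compute the values unconditionally and combine the two validity bitmaps with a bitwise AND, which is itself SIMD-friendly. A minimal sketch over raw u8 bitmap bytes (illustrative, not this PR's code):

```rust
// A result slot is valid only when it is valid on both inputs, so the
// combined validity bitmap is just the bytewise AND of the two inputs.
fn combine_validity(left: &[u8], right: &[u8]) -> Vec<u8> {
    left.iter().zip(right).map(|(a, b)| a & b).collect()
}

fn main() {
    // Each byte tracks the validity of 8 consecutive slots.
    println!("{:08b}", combine_validity(&[0b1011_0110], &[0b1101_0011])[0]);
}
```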

    @@ -17,6 +17,7 @@

    pub mod array;
    pub mod array_data;
    pub mod compute;
Member

We should not need this file anymore since we have lib.rs.

    /// available.
    pub trait ArrowNumericType: ArrowPrimitiveType {
        /// Defines the SIMD type that should be used for this numeric type
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Member

Can we put this above the trait definition? also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.

Contributor Author

It looked wrong to have #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] in multiple places, I plan to go back and see how to clean this up.

also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.

I don't quite understand what you mean here. What do you propose?

Member

Oh. I'm proposing to perhaps add another trait such as ArrowSIMDType, just for the SIMD purpose. Let me know if this makes sense.

Contributor Author

Right, that probably does make sense as eventually we will need SIMD ops over Boolean Arrays as well.
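A hypothetical shape of that split (ArrowSIMDType, Float32Type, and LANES are illustrative names; the real traits in the PR differ):

```rust
// ArrowNumericType stays arch-independent, and the SIMD-specific items
// move to a separate trait so the cfg attribute lives in one place.
trait ArrowNumericType {
    type Native;
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
trait ArrowSIMDType: ArrowNumericType {
    // Number of lanes the chosen packed_simd vector holds.
    const LANES: usize;
}

struct Float32Type;

impl ArrowNumericType for Float32Type {
    type Native = f32;
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
impl ArrowSIMDType for Float32Type {
    // f32x16: a 512-bit register holds 16 f32 lanes.
    const LANES: usize = 16;
}

fn main() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    println!("f32 lanes: {}", Float32Type::LANES);
}
```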

    };
    }

    make_numeric_type!(Int8Type, i8, i8x64);
Member

Have you considered AVX-512? Also wondering if it would be possible to support both, depending on the architecture...

Contributor Author

I actually thought that packed_simd only supported up to 256-bit registers; I need to update this.

@paddyhoran
Contributor Author

I wanted to leave a note on the general direction. At first I wanted runtime detection of the "best" intrinsics. However, after some research, there is no single "best" option; it depends on the situation.

Here is what I'm planning: support the largest SIMD registers available in packed_simd and allow the user to compile for whichever intrinsics they want via RUSTFLAGS="-C target-feature=+avx2" or similar.

packed_simd will do the correct thing where only smaller registers are available; e.g. an f32x16 would be converted to two f32x8 operations.

I believe that SSE is available on all Intel CPUs, and if you do not use RUSTFLAGS="-C target-feature=***" at all it will still use SSE.

I think we should add runtime detection via a feature flag in another Jira. Runtime detection is not always worth using: even though you may have access to intrinsics with wider registers, the memory bandwidth of your CPU may not allow you to take advantage of them. That's the situation I'm in on my dev machine. In this case you might end up checking for different intrinsics at runtime even though the most basic SSE version is just as fast.
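A sketch of that direction: a fixed scalar fallback, plus a dispatch point where runtime detection could later be slotted in behind a cargo feature. The "simd_runtime_detect" feature name and all function names here are hypothetical, not this PR's API:

```rust
// Scalar fallback kernel.
fn add_scalar(left: &[f32], right: &[f32]) -> Vec<f32> {
    left.iter().zip(right).map(|(a, b)| a + b).collect()
}

fn add_dispatch(left: &[f32], right: &[f32]) -> Vec<f32> {
    // With a hypothetical "simd_runtime_detect" feature enabled, probe the
    // CPU once and prefer a wider kernel only when it actually pays off.
    #[cfg(all(feature = "simd_runtime_detect", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // return add_avx2(left, right); // hypothetical AVX2 kernel
        }
    }
    add_scalar(left, right)
}

fn main() {
    println!("{:?}", add_dispatch(&[1.0, 2.0], &[3.0, 4.0]));
}
```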

@sunchao
Member

sunchao commented Jan 27, 2019

I think we should add runtime detection via a feature flag in another Jira.

Sure. I'm OK with adding the runtime detection later.

Runtime detection is not always worth using: even though you may have access to intrinsics with wider registers, the memory bandwidth of your CPU may not allow you to take advantage of them. That's the situation I'm in on my dev machine.

Thanks. This is interesting to know. Do you have any benchmark number, or any reference that explains this?

@paddyhoran
Contributor Author

The best resource is the issue I opened in packed_simd. I had compiled different versions of the add kernel for different instruction sets, but when I benchmarked them there was no improvement in performance from using larger registers, leading me to open the issue. The discussion with the maintainer of packed_simd explains it well.

@paddyhoran
Contributor Author

Closing in favor of smaller incremental PR's.

@paddyhoran paddyhoran closed this Feb 4, 2019
@paddyhoran paddyhoran changed the title [WIP] ARROW-4196: [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" [WIP] [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" Feb 4, 2019
@paddyhoran paddyhoran deleted the simd branch February 25, 2020 02:32