chore: update example in README and documentation #73

Merged · 4 commits · Dec 5, 2024
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -12,7 +12,7 @@ categories = ["hardware-support", "science", "game-engines"]
edition = "2018"

[lib]
doctest = false
doctest = true

[features]
default = []
81 changes: 46 additions & 35 deletions README.md
@@ -1,19 +1,19 @@
A library that abstracts over SIMD instruction sets, including ones with differing widths.
SIMDeez is designed to allow you to write a function one time and produce SSE2, SSE41, and AVX2 versions of the function.
SIMDeez is designed to allow you to write a function one time and produce SSE2, SSE41, AVX2 and Neon versions of the function.
You can either have the version you want chosen at compile time or automatically at runtime.

Originally developed by @jackmott; I have since volunteered to take over ownership.

If there are intrinsics you need that are not currently implemented, create an issue
and I'll add them. PRs to add more intrinsics are welcome. Currently things are well fleshed out for i32, i64, f32, and f64 types.

As Rust stabilizes support for Neon and AVX-512 I plan to add those as well.
As Rust stabilizes support for AVX-512 I plan to add those as well.

Refer to the excellent [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#) for documentation on these functions.

# Features

* SSE2, SSE41, AVX and AVX2, and scalar fallback
* SSE2, SSE41, AVX2, Neon and scalar fallback
* Can be used with compile time or run time selection
* No runtime overhead
* Uses familiar Intel intrinsic naming conventions, making code easy to port.
@@ -50,25 +50,18 @@ performance as long as you don't run into some of the slower fallback functions.
# Example

```rust
use simdeez::*;
use simdeez::scalar::*;
use simdeez::sse2::*;
use simdeez::sse41::*;
use simdeez::avx::*;
use simdeez::avx2::*;
// If you want your SIMD function to use runtime feature detection to call
// the fastest available version, use the simd_runtime_generate macro:
simd_runtime_generate!(
fn distance(
x1: &[f32],
y1: &[f32],
x2: &[f32],
y2: &[f32]) -> Vec<f32> {
use simdeez::{prelude::*, simd_runtime_generate};

use rand::prelude::*;

// If you want your SIMD function to use runtime feature detection to call
// the fastest available version, use the simd_runtime_generate macro:
simd_runtime_generate!(
fn distance(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
let mut result: Vec<f32> = Vec::with_capacity(x1.len());
result.set_len(x1.len()); // for efficiency

/// Set each slice to the same length for iteration efficiency
// Set each slice to the same length for iteration efficiency
let mut x1 = &x1[..x1.len()];
let mut y1 = &y1[..x1.len()];
let mut x2 = &x2[..x1.len()];
@@ -79,34 +72,34 @@ use simdeez::*;
// so that it will work with any size vector.
// the width of a vector type is provided as a constant
// so the compiler is free to optimize it more.
// S::VF32_WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
while x1.len() >= S::VF32_WIDTH {
// S::Vf32::WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
while x1.len() >= S::Vf32::WIDTH {
// Load data from your vec into a SIMD value
let xv1 = S::loadu_ps(&x1[0]);
let yv1 = S::loadu_ps(&y1[0]);
let xv2 = S::loadu_ps(&x2[0]);
let yv2 = S::loadu_ps(&y2[0]);
let xv1 = S::Vf32::load_from_slice(&x1);
let yv1 = S::Vf32::load_from_slice(&y1);
let xv2 = S::Vf32::load_from_slice(&x2);
let yv2 = S::Vf32::load_from_slice(&y2);

// Use the usual intrinsic syntax if you prefer
let mut xdiff = S::sub_ps(xv1, xv2);
let mut xdiff = xv1 - xv2;
// Or use operator overloading if you like
let mut ydiff = yv1 - yv2;
xdiff *= xdiff;
ydiff *= ydiff;
let distance = S::sqrt_ps(xdiff + ydiff);
let distance = (xdiff + ydiff).sqrt();
// Store the SIMD value into the result vec
S::storeu_ps(&mut res[0], distance);
distance.copy_to_slice(res);

// Move each slice to the next position
x1 = &x1[S::VF32_WIDTH..];
y1 = &y1[S::VF32_WIDTH..];
x2 = &x2[S::VF32_WIDTH..];
y2 = &y2[S::VF32_WIDTH..];
res = &mut res[S::VF32_WIDTH..];
x1 = &x1[S::Vf32::WIDTH..];
y1 = &y1[S::Vf32::WIDTH..];
x2 = &x2[S::Vf32::WIDTH..];
y2 = &y2[S::Vf32::WIDTH..];
res = &mut res[S::Vf32::WIDTH..];
}

// (Optional) Compute the remaining elements. Not necessary if you are sure the length
// of your data is always a multiple of the maximum S::VF32_WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
// of your data is always a multiple of the maximum S::Vf32::WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
// This can be asserted by putting `assert_eq!(x1.len(), 0);` here
for i in 0..x1.len() {
let mut xdiff = x1[i] - x2[i];
@@ -118,17 +111,35 @@ use simdeez::*;
}

result
});
}
);

const SIZE: usize = 200;

fn main() {
let mut rng = rand::thread_rng();

let raw = (0..4)
.map(|_i| (0..SIZE).map(|_j| rng.gen::<f32>()).collect::<Vec<f32>>())
.collect::<Vec<Vec<f32>>>();

let distances = distance(
raw[0].as_slice(),
raw[1].as_slice(),
raw[2].as_slice(),
raw[3].as_slice(),
);
assert_eq!(distances.len(), SIZE);
dbg!(distances);
}
```
This will generate the following functions for you:
* `distance<S:Simd>` the generic version of your function
* `distance_scalar` a scalar fallback
* `distance_sse2` SSE2 version
* `distance_sse41` SSE41 version
* `distance_avx` AVX version
* `distance_avx2` AVX2 version
* `distance_neon` Neon version
* `distance_runtime_select` picks the fastest of the above at runtime

You can use any of these you wish, though typically you would use the runtime_select version
142 changes: 75 additions & 67 deletions src/lib.rs
@@ -1,18 +1,18 @@
//! A library that abstracts over SIMD instruction sets, including ones with differing widths.
//! SIMDeez is designed to allow you to write a function one time and produce scalar, SSE2, SSE41, and AVX2 versions of the function.
//! SIMDeez is designed to allow you to write a function one time and produce scalar, SSE2, SSE41, AVX2 and Neon versions of the function.
//! You can either have the version you want selected automatically at runtime, at compile time, or
//! select it yourself by hand.
//!
//! SIMDeez is currently in Beta; if there are intrinsics you need that are not currently implemented, create an issue
//! and I'll add them. PRs to add more intrinsics are welcome. Currently things are well fleshed out for i32, i64, f32, and f64 types.
//!
//! As Rust stabilizes support for Neon and AVX-512 I plan to add those as well.
//! As Rust stabilizes support for AVX-512 I plan to add those as well.
//!
//! Refer to the excellent [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#) for documentation on these functions.
//!
//! # Features
//!
//! * SSE2, SSE41, and AVX2 and scalar fallback
//! * SSE2, SSE41, AVX2, Neon and scalar fallback
//! * Can be used with compile time or run time selection
//! * No runtime overhead
//! * Uses familiar Intel intrinsic naming conventions, making code easy to port.
@@ -49,85 +49,93 @@
//! # Example
//!
//! ```rust
//! use simdeez::*;
//! use simdeez::scalar::*;
//! use simdeez::sse2::*;
//! use simdeez::sse41::*;
//! use simdeez::avx2::*;
//! // If you want your SIMD function to use runtime feature detection to call
//! // the fastest available version, use the simd_runtime_generate macro:
//! simd_runtime_generate!(
//! fn distance(
//! x1: &[f32],
//! y1: &[f32],
//! x2: &[f32],
//! y2: &[f32]) -> Vec<f32> {
//!use simdeez::{prelude::*, simd_runtime_generate};
//!
//! let mut result: Vec<f32> = Vec::with_capacity(x1.len());
//! result.set_len(x1.len()); // for efficiency
//! // If you want your SIMD function to use runtime feature detection to call
//!// the fastest available version, use the simd_runtime_generate macro:
//!simd_runtime_generate!(
//! fn distance(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
//! let mut result: Vec<f32> = Vec::with_capacity(x1.len());
//! result.set_len(x1.len()); // for efficiency
//!
//! /// Set each slice to the same length for iteration efficiency
//! let mut x1 = &x1[..x1.len()];
//! let mut y1 = &y1[..x1.len()];
//! let mut x2 = &x2[..x1.len()];
//! let mut y2 = &y2[..x1.len()];
//! let mut res = &mut result[..x1.len()];
//! // Set each slice to the same length for iteration efficiency
//! let mut x1 = &x1[..x1.len()];
//! let mut y1 = &y1[..x1.len()];
//! let mut x2 = &x2[..x1.len()];
//! let mut y2 = &y2[..x1.len()];
//! let mut res = &mut result[..x1.len()];
//!
//! // Operations have to be done in terms of the vector width
//! // so that it will work with any size vector.
//! // the width of a vector type is provided as a constant
//! // so the compiler is free to optimize it more.
//! // S::VF32_WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
//! while x1.len() >= S::VF32_WIDTH {
//! //load data from your vec into an SIMD value
//! let xv1 = S::loadu_ps(&x1[0]);
//! let yv1 = S::loadu_ps(&y1[0]);
//! let xv2 = S::loadu_ps(&x2[0]);
//! let yv2 = S::loadu_ps(&y2[0]);
//! // Operations have to be done in terms of the vector width
//! // so that it will work with any size vector.
//! // the width of a vector type is provided as a constant
//! // so the compiler is free to optimize it more.
//! // S::Vf32::WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
//! while x1.len() >= S::Vf32::WIDTH {
//! // Load data from your vec into a SIMD value
//! let xv1 = S::Vf32::load_from_slice(&x1);
//! let yv1 = S::Vf32::load_from_slice(&y1);
//! let xv2 = S::Vf32::load_from_slice(&x2);
//! let yv2 = S::Vf32::load_from_slice(&y2);
//!
//! // Use the usual intrinsic syntax if you prefer
//! let mut xdiff = S::sub_ps(xv1, xv2);
//! // Or use operator overloading if you like
//! let mut ydiff = yv1 - yv2;
//! xdiff *= xdiff;
//! ydiff *= ydiff;
//! let distance = S::sqrt_ps(xdiff + ydiff);
//! // Store the SIMD value into the result vec
//! S::storeu_ps(&mut res[0], distance);
//! // Use the usual intrinsic syntax if you prefer
//! let mut xdiff = xv1 - xv2;
//! // Or use operator overloading if you like
//! let mut ydiff = yv1 - yv2;
//! xdiff *= xdiff;
//! ydiff *= ydiff;
//! let distance = (xdiff + ydiff).sqrt();
//! // Store the SIMD value into the result vec
//! distance.copy_to_slice(res);
//!
//! // Move each slice to the next position
//! x1 = &x1[S::VF32_WIDTH..];
//! y1 = &y1[S::VF32_WIDTH..];
//! x2 = &x2[S::VF32_WIDTH..];
//! y2 = &y2[S::VF32_WIDTH..];
//! res = &mut res[S::VF32_WIDTH..];
//! }
//! // Move each slice to the next position
//! x1 = &x1[S::Vf32::WIDTH..];
//! y1 = &y1[S::Vf32::WIDTH..];
//! x2 = &x2[S::Vf32::WIDTH..];
//! y2 = &y2[S::Vf32::WIDTH..];
//! res = &mut res[S::Vf32::WIDTH..];
//! }
//!
//! // (Optional) Compute the remaining elements. Not necessary if you are sure the length
//! // of your data is always a multiple of the maximum S::VF32_WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
//! // This can be asserted by putting `assert_eq!(x1.len(), 0);` here
//! for i in 0..x1.len() {
//! let mut xdiff = x1[i] - x2[i];
//! let mut ydiff = y1[i] - y2[i];
//! xdiff *= xdiff;
//! ydiff *= ydiff;
//! let distance = (xdiff + ydiff).sqrt();
//! res[i] = distance;
//! }
//! // (Optional) Compute the remaining elements. Not necessary if you are sure the length
//! // of your data is always a multiple of the maximum S::Vf32::WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
//! // This can be asserted by putting `assert_eq!(x1.len(), 0);` here
//! for i in 0..x1.len() {
//! let mut xdiff = x1[i] - x2[i];
//! let mut ydiff = y1[i] - y2[i];
//! xdiff *= xdiff;
//! ydiff *= ydiff;
//! let distance = (xdiff + ydiff).sqrt();
//! res[i] = distance;
//! }
//!
//! result
//! });
//! # fn main() {
//! # }
//! result
//! }
//!);
//!
//!const SIZE: usize = 200;
//!
//!fn main() {
//! let raw = (0..4)
//! .map(|i| (0..SIZE).map(|j| (i*j) as f32).collect::<Vec<f32>>())
//! .collect::<Vec<Vec<f32>>>();
//!
//! let distances = distance(
//! raw[0].as_slice(),
//! raw[1].as_slice(),
//! raw[2].as_slice(),
//! raw[3].as_slice(),
//! );
//! assert_eq!(distances.len(), SIZE);
//! dbg!(distances);
//!}
//! ```
//!
//! This will generate the following functions for you:
//! * `distance<S:Simd>` the generic version of your function
//! * `distance_scalar` a scalar fallback
//! * `distance_sse2` SSE2 version
//! * `distance_sse41` SSE41 version
//! * `distance_avx` AVX version
//! * `distance_avx2` AVX2 version
//! * `distance_neon` Neon version
//! * `distance_runtime_select` picks the fastest of the above at runtime
//!
//! You can use any of these you wish, though typically you would use the runtime_select version
@@ -145,7 +153,7 @@
//! of arcane subtleties with inlining and target_features that must be managed. See how the macros
//! expand for more detail.
#![cfg_attr(
all(target_arch = "wasm32", not(feature = "stable")),

Check failure on line 156 in src/lib.rs — GitHub Actions / Code Checks (formatting, clippy): unexpected `cfg` condition value: `stable` (the x86 Tests and Arm Neon Tests jobs report the same warning)
feature(core_intrinsics)
)]
#![allow(clippy::missing_safety_doc)] // TODO: Work on the safety of functions