slice performance #571
I did some profiling of the benchmark below, with the following `Cargo.toml`:

```toml
[package]
name = "nilgoyette_test"
version = "0.1.0"
edition = "2018"

[dependencies]
ndarray = "0.12"

[profile.release]
debug = true

[patch.crates-io]
ndarray = { git = "https://github.com/jturner314/ndarray.git", branch = "master" }
```

(Using the current `master` branch of jturner314's ndarray fork.)
The benchmark program:

```rust
extern crate ndarray;

use ndarray::prelude::*;
use ndarray::{azip, s};

#[inline(never)]
fn test_slice(data: &Array3<f64>, m: usize, n: usize, o: usize) -> f64 {
    let pl = 3;
    let br = 4;
    let s1 = data.slice(s![br..br + pl, br..br + pl, br..br + pl]);
    let s2 = data.slice(s![m..m + pl, n..n + pl, o..o + pl]);
    let mut sum = 0.0;
    azip!(s1, s2 in { sum += (s1 - s2).powi(2) });
    sum
}

fn main() {
    let data = Array::from_shape_fn((11, 11, 11), |(i, j, k)| {
        i as f64 * 121. + j as f64 * 11. + k as f64
    });
    for _ in 0..50_000_000 {
        test_slice(&data, 2, 4, 5);
    }
}
```

I built in release mode (with debug symbols, per the profile above) and profiled the result.
The biggest piece of the time is spent in the slicing machinery itself. However, 142 ns is very little time anyway, so is this really an issue?
Thank you for testing, @jturner314. You're right, 142 ns is nothing! I made my bench as simple as possible to have a simple issue, but there are more details. Of course, all parameters can change. This function is called in 3 nested for-loops (over m, n, o), which are run for every voxel of a 3D or 4D image whose dimensions are at least 100x100x100. All this makes my running time go from 1m30s to 6m00s for the same 3D dataset. So you're telling me most of the time difference is in this block?

My goal wasn't to complain; I mostly wanted to know if the situation is normal and if there's an easy fix. I guess it's a 'yes' and a 'no' :)
Ah, so you really are calling it hundreds of millions of times or more. :)
I understand; I do think it's worth looking into things like this to see if there are any clear areas for improvement. Differences of nanoseconds can make a practical difference when the algorithm is O(n^3) or more, as you demonstrated. :)
Within the slicing stuff, yes, most of the time is spent there. I wonder if it would be worth adding a check if the step is 1 before doing the division? I would think a step of 1 would be the most common case by far. [Edit: I've created a branch for this. See my comment below.] We might be able to improve the performance of the slicing implementation in other ways too. [Edit: Another thing that might help is adding support for slicing with additional index types.] I think we should make an effort to improve the performance of slicing in general.
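The step-1 specialization described above can be sketched in isolation. This is an illustrative sketch only (hypothetical names, not ndarray's actual internals), assuming a positive step and a half-open `start..end` range:

```rust
// Compute how many elements an axis keeps after slicing `start..end` with a
// positive `step`, skipping the integer division in the common `step == 1`
// case. Illustrative sketch, not ndarray's real code.
fn sliced_axis_len(start: usize, end: usize, step: usize) -> usize {
    assert!(step != 0 && start <= end);
    let span = end - start;
    if step == 1 {
        // Most common case: the length is just the span; no division needed.
        span
    } else {
        // Ceiling division: elements at start, start + step, ... below end.
        (span + step - 1) / step
    }
}
```

The point of the branch is that integer division is one of the slowest ALU operations, so avoiding it on the overwhelmingly common unit-step path can matter when slicing happens hundreds of millions of times.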
@nilgoyette Will you please try your benchmark with this in your `Cargo.toml`?

```toml
[dependencies]
ndarray = "0.12"

[patch.crates-io]
ndarray = { git = "https://github.com/jturner314/ndarray.git", branch = "specialize-step-1" }
```

[Edit: This still isn't close to the speed of the loop version.]
I wonder if there are any possible algorithmic improvements rather than just improving this small section in isolation. Are you able to provide more of the surrounding code? |
Thanks for providing a clear benchmark; that's the groundwork for improvement! It might even be worth it to avoid the division for more than just the step-1 case, and it's a nice idea, jturner. A static division by 2 is strength-reduced to a shift, etc.; basically any static divisor can be inserted conditionally. I guess that yes, making a small cutout in a 3D array has quite some overhead in the loop and the slicing. I'd like `Zip` to handle it, but maybe it needs a special-case function.
Looks like `is_standard_layout` for 2D arrays can be improved.
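For intuition, here is a minimal sketch of what such a layout check does, written over plain shape/stride slices rather than ndarray's internal types (the function name mirrors the method, but the code is illustrative):

```rust
// A shape/strides pair is C-contiguous ("standard layout") when each axis's
// stride equals the product of the dimensions after it, walking axes from
// last to first. Axes of length <= 1 can have any stride.
fn is_standard_layout(shape: &[usize], strides: &[isize]) -> bool {
    let mut expected: isize = 1;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        if dim > 1 && stride != expected {
            return false;
        }
        expected *= dim as isize;
    }
    true
}
```

A 2D special case can skip the loop entirely and just compare two strides, which is presumably the kind of improvement meant here.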
@jturner314 @bluss Thanks a lot for looking into this. I am @nilgoyette's colleague. We have a Rust implementation of [1], based on the implementation in [2], and we found this hot spot while profiling it. I tested your patch on a real use-case (a 3D image, 256x256x176) and the running time improved noticeably. I ran the profiler on the patched version as well.
I'd say if you want a short answer, it's not surprising; it is not designed with this in mind (the overhead of slicing becomes large when the array slice is very small, like 3x3x3). We do have some overhead we need to combat with special cases for lower dimensions (2D and 3D arrays). We have some mandatory overhead since we use bounds checking for correctness, and in this case we are bounds checking in three dimensions.

We do have an ndproducer for windows, so the sliding comparison can be written like this:

```rust
Zip::from(&mut sum)
    .and(X.windows((pl, pl, pl)))
    .and(Y.windows((pl, pl, pl)))
    .apply(|sum, w1, w2| /* compare your windows */);
```

Does that fit your use case?
One possible higher-level improvement is to change the order of iteration. For the sake of illustration, consider a much simpler but related problem: calculating, for each element in a 2D array, the sum of squares of differences between that element and nearby elements (within some distance).
The most obvious approach is to iterate over windows centered on the elements and calculate the sum of squares of differences for each window, like this:

```rust
const RADIUS: usize = 1;
const WINDOW_SIZE: usize = 2 * RADIUS + 1;

fn sum_sq_diff_windows(data: ArrayView2<f64>) -> Array2<f64> {
    let mut out = Array2::zeros((data.rows() - 2 * RADIUS, data.cols() - 2 * RADIUS));
    Zip::from(&mut out)
        .and(data.windows((WINDOW_SIZE, WINDOW_SIZE)))
        .apply(|out, window| {
            let center = window[(RADIUS, RADIUS)];
            *out = window.fold(0., |acc, x| acc + (x - center).powi(2));
        });
    out
}
```

This seems similar to the approach @nilgoyette / @fmorency are currently trying.

Another approach is to swap the order of iteration so that the outer loops are short and the inner loops are long. We can do this by making the outer loops iterate over the offsets within the window shape and making the inner loops iterate over the window centers. Note the example below may be slightly confusing because I'm using `windows` in a different way here:

```rust
const RADIUS: usize = 1;
const WINDOW_SIZE: usize = 2 * RADIUS + 1;

fn sum_sq_diff_offset(data: ArrayView2<f64>) -> Array2<f64> {
    let mut acc = Array2::zeros((data.rows() - 2 * RADIUS, data.cols() - 2 * RADIUS));
    let radius = RADIUS as isize;
    let centers = data.slice(s![radius..(-radius), radius..(-radius)]);
    for offset_data in data.windows(centers.raw_dim()) {
        Zip::from(&mut acc)
            .and(centers)
            .and(offset_data)
            .apply(|acc, center, x| *acc += (x - center).powi(2));
    }
    acc
}
```

There is a trade-off here. The first approach has better cache locality, while the second approach has lower overhead and is much friendlier to the branch predictor because the inner loops are a lot longer. The performance of the second approach is quite a bit better for this problem; in my benchmark, where the horizontal axis is the axis length of the square input array, the second approach wins across the board. [Benchmark plot omitted.]

It's worth noting that while the second approach works better for this example problem, that's not necessarily true for @nilgoyette's / @fmorency's problem, because there is a trade-off as I mentioned. It also may be difficult to make this transformation for the real algorithm; I haven't looked at the paper enough in detail to know whether or not that's the case.
@jturner314 I tested your branch and I do see an interesting speedup on my 9x9x9 (m, n, o) benches. Was there a drawback to your branch?

@bluss Yes, the windows approach was my faster clean version. With this trick, I'm 2x slower instead of 3x slower. However, I just found something else on my side. I think this doesn't make the current issue useless; you seem to have some good optimization ideas. I'll let you close it when you're ready.
It might make slicing marginally slower for steps other than 1, since it adds a branch, but that case should be rare.
Out of curiosity, as I found this thread interesting/informative: the performance issue was from accessing a small 3D/voxel volume within a large dataset, right? And that was due to the multiple dimensions (internal logic issues identified aside)? It was suggested that reducing dimensionality would have helped? Would something like a Hilbert curve have helped here with data layout and cache locality? It's meant to be good for that afaik, and should go well with a voxel grid. A simpler-to-implement alternative would be a Z-order curve / Morton code. I'm not sure how practical that is with ndarray (I don't have any experience with this crate yet). I see Z-order curve support was requested 2 years ago; I take it that response still applies? So while there may be a more suitable data structure for working with this sort of data/logic, ndarray was a pragmatic choice to go with?
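For reference, the Z-order / Morton code idea mentioned above is simple to sketch: interleave the bits of the three coordinates so that spatially nearby voxels tend to get nearby linear indices. This is a generic sketch (ndarray does not provide this); it handles coordinates up to 10 bits each:

```rust
// Spread the low 10 bits of `v` so consecutive bits end up 3 positions
// apart: bit i of `v` moves to bit 3 * i of the result.
fn spread3(v: u32) -> u64 {
    let mut r = 0u64;
    for i in 0..10 {
        r |= (((v >> i) & 1) as u64) << (3 * i);
    }
    r
}

// 3D Morton code: interleave the bits of x, y, and z, with x in the
// lowest position of each 3-bit group.
fn morton3(x: u32, y: u32, z: u32) -> u64 {
    spread3(x) | (spread3(y) << 1) | (spread3(z) << 2)
}
```

Whether the improved locality pays for the encoding cost depends on the access pattern; for small fixed-size patches the strided layout plus the loop-reordering tricks above may already be good enough.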
Don't try to understand my first version too much; the second function is much simpler. It's simply comparing two 3x3x3 patches of data in an 11x11x11 image and returning a score. I refactored this function from an index-based loop version (`test_loop`) to the slice-based version above.

Of course, I'm happy with the code quality now (!), but the clean version is surprisingly slow. I benched those 2 functions using the same data; `test_loop` uses `&data.as_slice().unwrap()` instead of `&data`.

Are those results surprising to you? Both versions don't allocate, calculate the indices (in src or lib) and the sum, etc. I fail to see why the clean version is almost 6 times slower. Is `slice` doing something really complex?
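For comparison, the index-based flavor of the function can be sketched against a flat row-major buffer, mirroring the 3x3x3-patch-in-11x11x11 setup of the benchmark. This is a hedged sketch with hypothetical names, not the author's actual `test_loop`:

```rust
const DIM: usize = 11; // image is DIM x DIM x DIM
const PL: usize = 3;   // patch side length

// Sum of squared differences between two 3x3x3 patches at corners `a` and
// `b` of a flat, row-major 11x11x11 volume. No slicing machinery: just
// manual index arithmetic, which is what made the loop version fast.
fn patch_ssd(data: &[f64], a: (usize, usize, usize), b: (usize, usize, usize)) -> f64 {
    let idx = |i: usize, j: usize, k: usize| (i * DIM + j) * DIM + k;
    let mut sum = 0.0;
    for i in 0..PL {
        for j in 0..PL {
            for k in 0..PL {
                let x = data[idx(a.0 + i, a.1 + j, a.2 + k)];
                let y = data[idx(b.0 + i, b.1 + j, b.2 + k)];
                sum += (x - y) * (x - y);
            }
        }
    }
    sum
}
```

The slice version has to construct two views (computing strides, checking bounds per axis) on every call before any arithmetic happens, while this version amortizes to a handful of adds and multiplies per element, which is consistent with the gap discussed in the thread.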