Feature request: Populate library with C++ STL-like algorithms #5
Hi Thomas,

The main goal of the library is to wrap intrinsics and to provide C and C++ APIs. Thus STL-like algorithms, which are C++-only and higher level than intrinsics wrapping, are not meant to be in the core. The policy is to provide those as separate modules, so that the end user, in your case, will have to do `#include <nsimd/modules/stl-algorithms.hpp>`. As for STL-like algorithms, if you take code from Boost.SIMD, please be aware of the following:
---
Hi, thanks for this useful explanation of the potential pitfalls.

What do you think? For loop unrolling: OK for two functions? Thanks

---
Hi Thomas,

Sorry for the late answer. I am not sure why you need SFINAE; I mean, just take arguments as "standard" templates. I think the reason is that you want to check that the given types are what the function expects. From my point of view this is the role of concepts, which are not available yet (I mean part of the standard, not a TS, as of 2019) and will therefore not be broadly available in the industry anytime soon. "Standard" templates are not meant to do this kind of checking, plus it makes compilation times longer, so I suggest not using them for this. As for alignment, consider this:

```cpp
#include <iostream>

typedef double foo __attribute__ ((aligned (64)));
alignas(64) double bar;
double baz __attribute__ ((aligned (64)));

int main(int argc, char *argv[]) {
  std::cout << "foo sizeof: " << sizeof(foo) << " alignof: " << alignof(foo) << "\n";
  std::cout << "bar sizeof: " << sizeof(bar) << " alignof: " << alignof(decltype(bar)) << "\n";
  std::cout << "baz sizeof: " << sizeof(baz) << " alignof: " << alignof(decltype(baz)) << "\n";
}
```

The above snippet is from https://stackoverflow.com/questions/46457449/is-it-always-the-case-that-sizeoft-alignoft-for-all-object-types-t. So I suggest taking all cases into account:
---
Hello, I am restarting on this issue because I want to switch some code using another SIMD library to nsimd. I am stuck on how to deal with tests for this kind of function (e.g. for STL-like algorithms in general). And yes, of course, thanks for the impressive work on the library 👍 Regards

---
Hi ThomasRetornaz,

(Thank you for the impressive work.)

**My answer**

You are right, we test on all types and SIMD extensions. Maybe something like:

```python
def gen_test(opts):
    for op_name, operator in operators.operators.items():
        if len(operator.params) == 2:  # return type + (1 argument type)
            # generate C++:
            #   std::transform(a.begin(), a.end(), b.begin(), [](arg){
            #     return op_name(arg);
            #   });
            #   for (int i = 0 to n) { nsimd_scalar_op_name(arg); }
            # Check that the for-loop and the std::transform compute
            # the same things.
            pass
        if len(operator.params) == 3:  # return type + (2 argument types)
            # generate C++:
            #   std::transform(a.begin(), a.end(), b.begin(), c.begin(),
            #                  [](arg1, arg2){
            #     return op_name(arg1, arg2);
            #   });
            #   for (int i = 0 to n) { nsimd_scalar_op_name(arg1, arg2); }
            # Check that the for-loop and the std::transform compute
            # the same things.
            pass
```

If it works like I think, the NSIMD operators that are eligible for the STL algorithms are the same ones that are eligible to be used in expression templates and SPMD kernels. One has to check:

```python
for op_name, operator in operators.operators.items():
    if not operator.has_scalar_impl:
        continue
```

Note also that the other two NSIMD modules do not split the interval for using aligned loads and stores. You can write a first version of the module that does the same.

**My comment**

In general, performance-wise, STL-like algorithms are not optimized properly by compilers. I am not talking about a single transform, but about the case where several algorithms are chained. For example, consider the following snippet of code:

```cpp
#include <algorithm>
#include <cmath>

void foo(float *a, int n) {
  std::transform(a, a + n, a, [](float item) {
    return std::abs(item);
  });
  std::transform(a, a + n, a, [](float item) {
    return std::sqrt(item);
  });
}
```

```cpp
#include <nsimd/modules/tet1d.hpp>

void bar(float *a, int n) {
  tet1d::out(a) = tet1d::sqrt(tet1d::abs(tet1d::in(a, n)));
}
```

The `tet1d` version computes everything in a single pass. You will see that the two `std::transform`s are compiled into two separate loops. I advise you to write an old-style C for-loop or use an expression template mechanism, not a C++ for-loop, because the compiler does not always hoist `v.size()` and the `v[i]` accesses:

```cpp
void foo(std::vector<float> &v) {
  for (int i = 0; i < v.size(); i += 4) {
    _mm_storeu_ps(&v[i],
      _mm_sqrt_ps(_mm_loadu_ps(&v[i]))
    );
  }
}
```

gives worse assembly than the "C" version:

```cpp
void bar(std::vector<float> &v_) {
  float *v = v_.data();
  int n = v_.size();
  for (int i = 0; i < n; i += 4) {
    _mm_storeu_ps(&v[i],
      _mm_sqrt_ps(_mm_loadu_ps(&v[i]))
    );
  }
}
```

---
Thanks for the detailed answer 👍 and the self-explanatory examples.

---
I believe there is an interest in providing STL-like algorithms to let people port their code, especially if it supports GPUs. It would be really interesting to do that, in fact. Again, the fact that kernels won't be merged will imply a loss in performance, but it would be a really good first step for C++ people who need to port their algorithms.

---
Hi, I am stuck finding the canonical way to write `transform`:

```cpp
template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES(sizeof_v<L> == sizeof_v<T>)
T *transform(L const *first, L const *last, T *out, UnaryOp f) {
  typedef nsimd::pack<L> pack_L;
  typedef nsimd::pack<T> pack_T;
  //const std::size_t alignment = NSIMD_MAX_ALIGNMENT;
  // Define loop counter and range
  const int step_simd = ???????
  const std::ptrdiff_t range_size = std::distance(first, last);
  const std::ptrdiff_t size_simd_loop =
      (range_size > step_simd) ? ((std::ptrdiff_t)step_simd *
                                  ((range_size) / (std::ptrdiff_t)step_simd))
                               : (std::ptrdiff_t)0;
  std::ptrdiff_t i = 0;
  //---main simd loop
  for (; i < size_simd_loop; i += step_simd) {
    pack_L element, res;
    element = nsimd::loadu<pack_L>(
        first); //[Q] if we check alignment we could use aligned
                // load/store, right?
    res = f(element);
    nsimd::storeu(out, res);
    first += step_simd;
    out += step_simd;
  }
  //---epilogue
  for (; i < range_size; ++i) {
    *out++ = f(*first++);
  }
  return out;
}
```

Thanks

---
Hi Thomas,

I think what you are looking for is `nsimd::len`. You are right that you can check whether pointers are aligned or not and use the proper loads/stores. For convenience we added templated versions of loads and stores based on the alignment, so that you have to write your kernel only once. I would also write the code like this to take GPUs into account (but I think that for CUDA it will be impossible without some help from the end user). I think you should separate the kernels from the implementation of `transform`:

```cpp
// WARNING: this is more pseudo-code than C++
template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          NSIMD_CONCEPT_ALIGNMENT InputAlignment,
          NSIMD_CONCEPT_ALIGNMENT OutputAlignment,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  typedef nsimd::pack<L> pack_L;
  const int step_simd = nsimd::len(pack_L());
  const std::ptrdiff_t n = std::distance(first, last);
  nsimd::nat i = 0;
  for (; i + step_simd <= n; i += step_simd) {
    nsimd::store<OutputAlignment>(out + i, f(
        nsimd::load<InputAlignment, pack_L>(first + i)));
  }
  for (; i < n; ++i) {
    out[i] = f(first[i]);
  }
  return out;
}

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform(L const *first, L const *last, T *out, UnaryOp f) {
  if (first and out are aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else if (only first is aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  } else if (only out is aligned) {
    return transform_impl<L, T, nsimd::unaligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else {
    return transform_impl<L, T, nsimd::unaligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  }
}
```

For GPUs, the problem is CUDA. For HIP and oneAPI there is no need to annotate functions (`__device__`):

```cpp
#ifdef NSIMD_CUDA
#define NSIMD_DEVICE __device__
#else
#define NSIMD_DEVICE
#endif
```

in which case the implementation becomes:

```cpp
// WARNING: this is more pseudo-code than C++
#ifdef NSIMD_CUDA

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
__global__ void transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  int i = blockIdx.x ...;
  if (i < last - first) {
    out[i] = f(first[i]);
  }
}

#else

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          NSIMD_CONCEPT_ALIGNMENT InputAlignment,
          NSIMD_CONCEPT_ALIGNMENT OutputAlignment,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  typedef nsimd::pack<L> pack_L;
  const int step_simd = nsimd::len(pack_L());
  const std::ptrdiff_t n = std::distance(first, last);
  nsimd::nat i = 0;
  for (; i + step_simd <= n; i += step_simd) {
    nsimd::store<OutputAlignment>(out + i, f(
        nsimd::load<InputAlignment, pack_L>(first + i)));
  }
  for (; i < n; ++i) {
    out[i] = f(first[i]);
  }
  return out;
}

#endif

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform(L const *first, L const *last, T *out, UnaryOp f) {
#ifdef NSIMD_CUDA
  // We suppose that pointers are GPU pointers. It is not the job of
  // transform nor NSIMD to handle memory copies between host and devices.
  transform_impl<<< ... >>>(first, last, out, f);
  return out;
#else
  if (first and out are aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else if (only first is aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  } else if (only out is aligned) {
    return transform_impl<L, T, nsimd::unaligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else {
    return transform_impl<L, T, nsimd::unaligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  }
#endif
}
```

Hope it helps. This just got out of my head, so errors are surely in there. I did not take time to check all my assumptions, especially on CUDA.

---
Thanks a lot for this inspirational answer, I will follow your tips and advice. Do you want such a "details" namespace for the implementation? Regards

---
Hi Thomas, I think we already use a `detail` namespace for implementations.

---
Sorry for the delay. But not all nsimd operators have an STL counterpart.

---
My bad, I simply did not think it through. Something like this (taken from the Python generator, hence the doubled braces and the `{op_name}`/`{typ}` placeholders):

```cpp
template<typename T>
struct UnaryOP
{{
  UnaryOP() {{}}
  T operator()(T const &a0) const
  {{
    return nsimd_scalar_{op_name}_{typ}(a0);
  }}
  template<typename U>
  U operator()(U const &a0) const
  {{
    return nsimd::{op_name}(a0);
  }}
}};

void compute_result({typ} *dst, {typ} const *tab0,
                    unsigned int n) {{
  UnaryOP<{typ}> op;
  std::transform(tab0, tab0+n, dst, op);
}}

void compute_output({typ} *dst, {typ} const *tab0,
                    unsigned int n) {{
  UnaryOP<{typ}> op;
  nsimd::stl_algorithms::transform(tab0, tab0+n, dst, op);
}}
```

does the job.

---
Hi Thomas, I saw your two posts. I am really busy right now; as soon as I have some time I will answer you properly.

---
Hi Thomas,

I finally took the time to read your posts. You found the solution yourself by using the scalar versions of the operators. As for GPUs, NSIMD also provides functions to use inside kernels:

```cpp
void kernel() {
  // CPU scalar version
  nsimd::scalar_OPERATOR_TYPE(arg1, ..., argN);
  // GPU scalar version
  nsimd::gpu_OPERATOR_TYPE(arg1, ..., argN);
}
```

---
Hi, I am migrating from boost::simd to nsimd. It seems that nsimd does not provide high-level STL-like algorithms (Transform, Reduce, etc.). I could try to implement them on my side.

Questions:

- Where to put them?
- Where to put the tests?