6 changes: 6 additions & 0 deletions thrust/examples/CMakeLists.txt
@@ -92,6 +92,12 @@ function(thrust_add_example target_name_var example_name example_src thrust_targ
thrust_configure_cuda_target(${example_target} RDC ${THRUST_FORCE_RDC})
endif()

# We do not want to explicitly include `host_device.h` when it is not needed, so force-include the file for non-CUDA targets
target_compile_options(${example_target} PRIVATE
Contributor

@alliepiper Is that fine? I believe we do not want to add all the noise of the host_device includes to all the examples

Contributor Author (@charan-003, Sep 4, 2025)

@miscco I thought something else was causing the CI test failures.

Contributor

I think it's good to show in the examples that users need to be mindful of __host__ __device__ annotations when targeting multiple backends. I'd almost prefer directly including the contents of host_device.h in each example, without the _CCCL macros, to show users how to implement this themselves.

It's noisy and repetitive, but might be worth doing with a good comment explaining why/when it's needed.
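
For concreteness, here is a minimal sketch (not the PR's code) of what such a per-example block could look like, assuming the only requirement is that __host__ and __device__ expand to nothing when a plain C++ compiler builds the example:

// Hypothetical per-example setup, mirroring the intent of host_device.h
// without the _CCCL macros: under a non-CUDA compiler the execution-space
// annotations must expand to nothing so that __host__ __device__ functors compile.
#ifndef __CUDACC__
#  ifndef __host__
#    define __host__
#  endif
#  ifndef __device__
#    define __device__
#  endif
#endif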

Contributor

Otherwise, what you have here is functional and would be fine from a technical standpoint. I just don't like force-including important user-setup bits in example code; hidden magic is bad for teaching, IMO.

Contributor

Then we should move the examples. What does device_vector even mean in the context of TBB and OpenMP?

Contributor

Move them?

what does device_vector even mean in the context of TBB and openMP?

It's backed by host memory when those backends are used.

Contributor

My point is that anything that uses a device_vector should live in examples/cuda and then use appropriately annotated functions.

But anything in examples should work out of the box for any backend without any magic intervention

Contributor

@miscco device_vector allocates memory on the current Thrust device system. If that is CUDA, it's CUDA device memory. If the device system is TBB, OMP or CPP, then a device_vector just behaves like a host vector. This is so Thrust can switch backends with the preprocessor.
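
As an aside (not part of this PR), that backend switch is driven entirely by a preprocessor definition. A minimal sketch, assuming an illustrative command line such as g++ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -fopenmp ...:

// Sketch: the same device_vector code retargets to another backend purely via
// the preprocessor; with the OMP, TBB, or CPP device systems the vector below
// is backed by host memory.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

#include <iostream>

int main()
{
  thrust::device_vector<int> v{1, 2, 3, 4};
  std::cout << thrust::reduce(v.begin(), v.end()) << '\n'; // prints 10 on any backend
  return 0;
}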

$<$<COMPILE_LANG_AND_ID:CXX,MSVC>:/FI include/host_device.h>
$<$<COMPILE_LANGUAGE:CXX>:-include include/host_device.h>
)

# Add to the active configuration's meta target
add_dependencies(${config_meta_target} ${example_target})

23 changes: 6 additions & 17 deletions thrust/examples/arbitrary_transformation.cu
@@ -7,8 +7,6 @@

#include <iostream>

#include "include/host_device.h"

// This example shows how to implement an arbitrary transformation of
// the form output[i] = F(first[i], second[i], third[i], ... ).
// In this example, we use a function with 3 inputs and 1 output.
@@ -62,29 +60,20 @@ struct arbitrary_functor2

int main()
{
// allocate storage
thrust::device_vector<float> A(5);
thrust::device_vector<float> B(5);
thrust::device_vector<float> C(5);
// allocate and initialize
thrust::device_vector<float> A{3, 4, 0, 8, 2};
thrust::device_vector<float> B{6, 7, 2, 1, 8};
thrust::device_vector<float> C{2, 5, 7, 4, 3};
thrust::device_vector<float> D1(5);

// clang-format off
// initialize input vectors
A[0] = 3; B[0] = 6; C[0] = 2;
A[1] = 4; B[1] = 7; C[1] = 5;
A[2] = 0; B[2] = 2; C[2] = 7;
A[3] = 8; B[3] = 1; C[3] = 4;
A[4] = 2; B[4] = 8; C[4] = 3;
// clang-format on

// apply the transformation
thrust::for_each(thrust::make_zip_iterator(A.begin(), B.begin(), C.begin(), D1.begin()),
thrust::make_zip_iterator(A.end(), B.end(), C.end(), D1.end()),
arbitrary_functor1());

// print the output
std::cout << "Tuple functor" << std::endl;
for (int i = 0; i < 5; i++)
for (size_t i = 0; i < A.size(); i++)
Review comment:

Isn't the preferred form `for (size_t i = 0; i != size(A); ++i)`?

Also,

  • is it possible to use iterators? (My C++ has been rusting for 5 years now)
  • let's not use std::endl unless needed (cf here)
  • can we use std::format to our advantage?
  • free functions improve encapsulation (cf. here)
for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
     it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
     ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}

Maybe the make_zip_iterator(make_tuple(begin(A), ...)) can be extracted into a generic helper somehow, along the lines of

auto zip_begin(auto&&... containers) {
  return make_zip_iterator(make_tuple(begin(containers)...));
}
auto zip_end(auto&&... containers) {
  return make_zip_iterator(make_tuple(end(containers)...));
}

In which case the above simplifies further to

for (auto it = zip_begin(A, B, C, D); it != zip_end(A, B, C, D); ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}

Contributor

Thank you for the feedback! You are always free to create a PR yourself or start a discussion.

Isn't the preferred form for (size_t i = 0; i != size(A); ++i) ?

I have no preference here. The PR improved the situation by not using a magic number, which is good.

  • is it possible to use iterators? (My C++ has been rusting for 5 years now)

Yes, but iterating four ranges at the same time using a zip may also be a bit over-engineered. Using an index is fine here, IMO. Examples should be easy.

  • let's not use std::endl unless needed (cf here)

Correct. Feel free to propose a PR to replace them with '\n'.

  • can we use std::format to our advantage?

CCCL still supports C++17, but I don't see a blocker with using C++20 in examples only. I will start a discussion internally.

  • free functions improve encapsulation (cf. here)

Again, for example code I have no preference here. I agree with this when writing library code.

for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
     it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
     ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}

I think this does not increase readability or clarity of the example.

Maybe the make_zip_iterator(make_tuple(begin(A), ...))) can be extracted into a generic somehow, along the lines of

We have that today: just construct the zip_iterator and let CTAD deduce the arguments:

zip_iterator(begin(A), begin(B), begin(C));

That should deduce zip_iterator<decltype(begin(A)), ...>. This only works with cuda::zip_iterator; for Thrust, you can at least skip the make_tuple, which we fixed some time ago.
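
A short sketch of the make_tuple-free Thrust form mentioned here (it mirrors what the updated example already does; the cuda::zip_iterator CTAD spelling above is left as stated):

// Sketch: thrust::make_zip_iterator accepts the iterators directly,
// no thrust::make_tuple wrapper is required.
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>

void zip_demo(thrust::device_vector<float>& A, thrust::device_vector<float>& B)
{
  auto first = thrust::make_zip_iterator(A.begin(), B.begin());
  auto last  = thrust::make_zip_iterator(A.end(), B.end());
  (void) first; // (first, last) can now be passed to any Thrust algorithm
  (void) last;
}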

{
std::cout << A[i] << " + " << B[i] << " * " << C[i] << " = " << D1[i] << std::endl;
}
@@ -97,7 +86,7 @@ int main()

// print the output
std::cout << "N-ary functor" << std::endl;
for (int i = 0; i < 5; i++)
for (size_t i = 0; i < A.size(); i++)
{
std::cout << A[i] << " + " << B[i] << " * " << C[i] << " = " << D2[i] << std::endl;
}
10 changes: 2 additions & 8 deletions thrust/examples/basic_vector.cu
@@ -5,14 +5,8 @@

int main()
{
// H has storage for 4 integers
thrust::host_vector<int> H(4);

// initialize individual elements
H[0] = 14;
H[1] = 20;
H[2] = 38;
H[3] = 46;
// H holds 4 integers
thrust::host_vector<int> H{14, 20, 38, 46};

// H.size() returns the size of vector H
std::cout << "H has size " << H.size() << std::endl;
15 changes: 5 additions & 10 deletions thrust/examples/bounding_box.cu
@@ -4,8 +4,6 @@
#include <thrust/random.h>
#include <thrust/transform_reduce.h>

#include "include/host_device.h"

// This example shows how to compute a bounding box
// for a set of points in two dimensions.

@@ -54,7 +52,7 @@ struct bbox
};

// reduce a pair of bounding boxes (a,b) to a bounding box containing a and b
struct bbox_reduction
struct bbox_union
{
__host__ __device__ bbox operator()(bbox a, bbox b)
{
@@ -71,13 +69,13 @@
int main()
{
const size_t N = 40;
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);

// allocate storage for points
thrust::device_vector<point2d> points(N);

// generate some random points in the unit square
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
for (size_t i = 0; i < N; i++)
{
float x = u01(rng);
@@ -86,13 +84,10 @@ int main()
}

// initial bounding box contains first point
bbox init = bbox(points[0], points[0]);

// binary reduction operation
bbox_reduction binary_op;
bbox init(points[0], points[0]);

// compute the bounding box for the point set
bbox result = thrust::reduce(points.begin(), points.end(), init, binary_op);
bbox result = thrust::reduce(points.begin(), points.end(), init, bbox_union{});

// print output
std::cout << "bounding box " << std::fixed;
2 changes: 0 additions & 2 deletions thrust/examples/bucket_sort2d.cu
@@ -9,8 +9,6 @@
#include <iomanip>
#include <iostream>

#include "include/host_device.h"

// define a 2d float vector
using vec2 = thrust::tuple<float, float>;

6 changes: 1 addition & 5 deletions thrust/examples/constant_iterator.cu
@@ -9,11 +9,7 @@

int main()
{
thrust::device_vector<int> data(4);
data[0] = 3;
data[1] = 7;
data[2] = 2;
data[3] = 5;
thrust::device_vector<int> data{3, 7, 2, 5};

// add 10 to all values in data
thrust::transform(
12 changes: 2 additions & 10 deletions thrust/examples/counting_iterator.cu
@@ -11,15 +11,7 @@ int main()
// this example computes indices for all the nonzero values in a sequence

// sequence of zero and nonzero values
thrust::device_vector<int> stencil(8);
stencil[0] = 0;
stencil[1] = 1;
stencil[2] = 1;
stencil[3] = 0;
stencil[4] = 0;
stencil[5] = 1;
stencil[6] = 0;
stencil[7] = 1;
thrust::device_vector<int> stencil{0, 1, 1, 0, 0, 1, 0, 1};

// storage for the nonzero indices
thrust::device_vector<int> indices(8);
@@ -35,7 +27,7 @@ int main()
// indices now contains [1,2,5,7]

// print result
std::cout << "found " << (indices_end - indices.begin()) << " nonzero values at indices:\n";
std::cout << "found " << cuda::std::distance(indices.begin(), indices_end) << " nonzero values at indices:\n";
thrust::copy(indices.begin(), indices_end, std::ostream_iterator<int>(std::cout, "\n"));

return 0;
2 changes: 1 addition & 1 deletion thrust/examples/device_ptr.cu
@@ -15,7 +15,7 @@ int main()
// device_ptr supports pointer arithmetic
thrust::device_ptr<int> first = d_ptr;
thrust::device_ptr<int> last = d_ptr + 10;
std::cout << "device array contains " << (last - first) << " values\n";
std::cout << "device array contains " << cuda::std::distance(first, last) << " values\n";

// algorithms work as expected
thrust::sequence(first, last);
7 changes: 3 additions & 4 deletions thrust/examples/discrete_voronoi.cu
@@ -10,7 +10,6 @@
#include <iomanip>
#include <iostream>

#include "include/host_device.h"
#include "include/timer.h"

// Compute an approximate Voronoi Diagram with a Jump Flooding Algorithm (JFA)
@@ -26,11 +25,11 @@
// Tuple = <seeds,seeds + k,seeds + m*k, seeds - k,
// seeds - m*k, seeds+ k+m*k,seeds + k-m*k,
// seeds- k+m*k,seeds - k+m*k, i>
struct minFunctor
struct voronoi_site_selector
{
int m, n, k;

__host__ __device__ minFunctor(int m, int n, int k)
__host__ __device__ voronoi_site_selector(int m, int n, int k)
: m(m)
, n(n)
, k(k)
@@ -199,7 +198,7 @@ void jfa(thrust::device_vector<int>& in, thrust::device_vector<int>& out, unsign
thrust::counting_iterator<int>(0))
+ n * m,
out.begin(),
minFunctor(m, n, k));
voronoi_site_selector(m, n, k));
}
/********************************************/

4 changes: 1 addition & 3 deletions thrust/examples/dot_products_with_zip.cu
@@ -5,8 +5,6 @@
#include <thrust/random.h>
#include <thrust/transform.h>

#include "include/host_device.h"

// This example shows how thrust::zip_iterator can be used to create a
// 'virtual' array of structures. In this case the structure is a 3d
// vector type (Float3) whose (x,y,z) components will be stored in
@@ -33,7 +31,7 @@ thrust::host_vector<float> random_vector(const size_t N, unsigned int seed = thr
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
thrust::host_vector<float> temp(N);
for (size_t i = 0; i < N; i++)
for (size_t i = 0; i < N; ++i)
{
temp[i] = u01(rng);
}
13 changes: 4 additions & 9 deletions thrust/examples/expand.cu
@@ -40,7 +40,7 @@ OutputIterator expand(InputIterator1 first1, InputIterator1 last1, InputIterator

// compute max-scan over the output indices, filling in the holes
thrust::inclusive_scan(
output_indices.begin(), output_indices.end(), output_indices.begin(), ::cuda::maximum<difference_type>());
output_indices.begin(), output_indices.end(), output_indices.begin(), cuda::maximum<difference_type>{});

// gather input values according to index array (output = first2[output_indices])
thrust::gather(output_indices.begin(), output_indices.end(), first2, output);
@@ -62,15 +62,10 @@

int main()
{
int counts[] = {3, 5, 2, 0, 1, 3, 4, 2, 4};
int values[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
thrust::device_vector<int> d_counts = {3, 5, 2, 0, 1, 3, 4, 2, 4};
thrust::device_vector<int> d_values = {1, 2, 3, 4, 5, 6, 7, 8, 9};

size_t input_size = sizeof(counts) / sizeof(int);
size_t output_size = thrust::reduce(counts, counts + input_size);

// copy inputs to device
thrust::device_vector<int> d_counts(counts, counts + input_size);
thrust::device_vector<int> d_values(values, values + input_size);
const size_t output_size = thrust::reduce(d_counts.begin(), d_counts.end());
thrust::device_vector<int> d_output(output_size);

// expand values according to counts
2 changes: 2 additions & 0 deletions thrust/examples/include/host_device.h
@@ -16,6 +16,8 @@

#pragma once

#include <cuda/__cccl_config>

#if !_CCCL_HAS_CUDA_COMPILER()

# ifndef __host__
24 changes: 11 additions & 13 deletions thrust/examples/lambda.cu
@@ -4,8 +4,6 @@

#include <iostream>

#include "include/host_device.h"

// This example demonstrates the use of placeholders to implement
// the SAXPY operation (i.e. Y[i] = a * X[i] + Y[i]).
//
@@ -41,14 +39,14 @@ struct saxpy_functor
int main()
{
// input data
float a = 2.0f;
float x[4] = {1, 2, 3, 4};
float y[4] = {1, 1, 1, 1};
float a = 2.0f;
thrust::device_vector<float> x_data = {1, 2, 3, 4};
thrust::device_vector<float> y_data = {1, 1, 1, 1};

// SAXPY implemented with a functor (function object)
{
thrust::device_vector<float> X(x, x + 4);
thrust::device_vector<float> Y(y, y + 4);
thrust::device_vector<float> X = x_data;
thrust::device_vector<float> Y = y_data;

thrust::transform(
X.begin(),
@@ -58,16 +56,16 @@
saxpy_functor(a)); // functor

std::cout << "SAXPY (functor method)" << std::endl;
for (size_t i = 0; i < 4; i++)
for (size_t i = 0; i < Y.size(); i++)
{
std::cout << a << " * " << x[i] << " + " << y[i] << " = " << Y[i] << std::endl;
std::cout << a << " * " << x_data[i] << " + " << y_data[i] << " = " << Y[i] << std::endl;
}
}

// SAXPY implemented with a placeholders
{
thrust::device_vector<float> X(x, x + 4);
thrust::device_vector<float> Y(y, y + 4);
thrust::device_vector<float> X = x_data;
thrust::device_vector<float> Y = y_data;

thrust::transform(
X.begin(),
Expand All @@ -77,9 +75,9 @@ int main()
a * _1 + _2); // placeholder expression

std::cout << "SAXPY (placeholder method)" << std::endl;
for (size_t i = 0; i < 4; i++)
for (size_t i = 0; i < Y.size(); i++)
{
std::cout << a << " * " << x[i] << " + " << y[i] << " = " << Y[i] << std::endl;
std::cout << a << " * " << x_data[i] << " + " << y_data[i] << " = " << Y[i] << std::endl;
}
}

4 changes: 2 additions & 2 deletions thrust/examples/lexicographical_sort.cu
@@ -59,7 +59,7 @@ int main()
thrust::device_vector<int> lower = random_vector(N);

std::cout << "Unsorted Keys" << std::endl;
for (size_t i = 0; i < N; i++)
for (size_t i = 0; i < upper.size(); i++)
{
std::cout << "(" << upper[i] << "," << middle[i] << "," << lower[i] << ")" << std::endl;
}
@@ -82,7 +82,7 @@ int main()
apply_permutation(upper, permutation);

std::cout << "Sorted Keys" << std::endl;
for (size_t i = 0; i < N; i++)
for (size_t i = 0; i < upper.size(); i++)
{
std::cout << "(" << upper[i] << "," << middle[i] << "," << lower[i] << ")" << std::endl;
}
18 changes: 5 additions & 13 deletions thrust/examples/max_abs_diff.cu
@@ -2,10 +2,9 @@
#include <thrust/functional.h>
#include <thrust/inner_product.h>

#include <cmath>
#include <iostream>
#include <cuda/functional>

#include "include/host_device.h"
#include <iostream>

// this example computes the maximum absolute difference
// between the elements of two vectors
@@ -21,21 +20,14 @@

int main()
{
thrust::device_vector<float> d_a(4);
thrust::device_vector<float> d_b(4);

// clang-format off
d_a[0] = 1.0; d_b[0] = 2.0;
d_a[1] = 2.0; d_b[1] = 4.0;
d_a[2] = 3.0; d_b[2] = 3.0;
d_a[3] = 4.0; d_b[3] = 0.0;
// clang-format on
thrust::device_vector<float> d_a = {1.0, 2.0, 3.0, 4.0};
thrust::device_vector<float> d_b = {2.0, 4.0, 3.0, 0.0};

// initial value of the reduction
float init = 0;

// binary operations
::cuda::maximum<float> binary_op1;
cuda::maximum<float> binary_op1{};
abs_diff<float> binary_op2;

float max_abs_diff = thrust::inner_product(d_a.begin(), d_a.end(), d_b.begin(), init, binary_op1, binary_op2);