Meta-Algorithm: Fold/Parallel Scan different implementation #6925

soufianekhiat · 2022-08-06T11:13:50Z

Hello,

I'd like to compare different implementations of a prefix scan, which is the equivalent of Fold or FoldList in Mathematica:
https://reference.wolfram.com/language/ref/Fold.html
https://reference.wolfram.com/language/ref/FoldList.html
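
For readers who don't know the Mathematica names, here is a minimal plain C++ sketch (not Halide code) of the difference for '+': Fold returns only the final accumulated value, while FoldList without an explicit start value returns every running total. The function names `fold_add` and `fold_list_add` are made up for this illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Fold with '+': reduce the whole sequence to a single value.
float fold_add(const std::vector<float>& in) {
    return std::accumulate(in.begin(), in.end(), 0.0f);
}

// FoldList with '+': keep every intermediate result (inclusive prefix sum).
std::vector<float> fold_list_add(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    assert(out.size() == in.size());  // one running total per input element
    return out;
}
```

Both versions are sequential, left to right; the rest of the thread is about how to parallelize them.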

For the sake of the example I'll focus on a fold with '+' (but the question is the same for min, max, ...).
Let's say I want:

Int32 const iSize = // ...
Buffer<f32> buf( &src[ 0 ], { iSize }, "buffer_src" );
Var x( "x" );
Func src{ "source" };
src( x ) = buf( x );
Func out{ "result" };
RDom r( 0, iSize );
out() = sum( src( r ) );

After parallelization, this always produces an atomic_add, which is far from optimal.
Given Halide's definition of an "algorithm", this is by design, so I'd like to implement a meta-algorithm with:

out() = meta_sum( src( r ), Fold::Hillis_Steele ); // { Hillis_Steele, Blelloch, ... }

cf. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
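
As a point of reference, a Hillis-Steele inclusive scan can be sketched in plain C++ (not Halide, and not the CUDA code from the chapter above). Each pass reads one buffer and writes the other, so every element within a pass is independent and could run in parallel; the inner loop is written sequentially here only for simplicity. The names `hillis_steele_pass` and `hillis_steele_scan` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One pass: out[i] = in[i] + in[i - offset], or in[i] when i < offset.
// Reads only `in` and writes only `out`, so iterations are independent.
void hillis_steele_pass(const std::vector<float>& in, std::vector<float>& out,
                        std::size_t offset) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
    }
}

// Inclusive prefix sum: the offset doubles each pass (1, 2, 4, ...),
// and the two buffers swap roles after every pass ("ping-pong").
std::vector<float> hillis_steele_scan(std::vector<float> ping) {
    std::vector<float> pong(ping.size());
    for (std::size_t offset = 1; offset < ping.size(); offset *= 2) {
        hillis_steele_pass(ping, pong, offset);
        std::swap(ping, pong);
    }
    return ping;
}
```

The number of passes is the smallest d with 2^d >= n, and each pass performs n operations, which is the usual trade-off of this variant against the work-efficient Blelloch scan.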

With Halide's current definition of an "algorithm", that implies one implementation per "semantic function".
Semantically, I want to accumulate all the numbers in a 1D array.

Currently I'm discussing the feasibility of that. As I understand it, this implies creating N Funcs, which can be scheduled differently, either with a "ping-pong" buffer or by reusing the same buffer. The main issue is that 'N' is a consequence of the internal size.
For instance we want to do:

namespace H = Halide;
H::Expr width = src.output_buffer().width();
// or
H::Expr width = /* from_input_param */;
H::Expr depth = H::cast< int >( H::floor( H::log( H::cast< f32 >( width ) )/H::log( 2.0f ) ) );
H::Array<H::Func> passes( "pass_%d", depth );

// TBD:
H::For(Expr k = 0, k < depth; ++k,
[&passes](H::Func& f, H::Expr i)
{
   H::Func iter;
   iter(x) = passes[k - 1](x) + passes[k - 1](x + (1 << (i - 1)));
   return H::select( k == 0, src, iter );
} );

// Schedule
H::For(Expr k = 0, k < depth; ++k,
[&passes](H::Func& f, Expr i)
{
   f.parallel( f.rvars( 0 ) );
} );
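
One way to read the pass recurrence above is as a fold by doubling: pass k adds to each element the element 2^(k-1) positions to its right, so after enough passes element 0 holds the total. The following is a plain C++ sketch of that reading, not Halide code and not the proposed `H::For` API. It assumes out-of-range reads count as 0, and it rounds the pass count up rather than down, because with a rounded-down count a width that is not a power of two leaves the last elements out of the sum. `pass_count` and `fold_by_doubling` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Smallest d such that 2^d >= n, i.e. ceil(log2(n)) for n >= 1.
std::size_t pass_count(std::size_t n) {
    assert(n >= 1);
    std::size_t d = 0;
    while ((std::size_t{1} << d) < n) ++d;
    return d;
}

// pass_k(x) = pass_{k-1}(x) + pass_{k-1}(x + 2^(k-1)), with pass_0 = input.
// Two buffers are swapped after each pass; element 0 ends up with the sum.
float fold_by_doubling(std::vector<float> cur) {
    assert(!cur.empty());
    const std::size_t n = cur.size();
    std::vector<float> next(n);
    const std::size_t depth = pass_count(n);
    for (std::size_t k = 1; k <= depth; ++k) {
        const std::size_t offset = std::size_t{1} << (k - 1);
        for (std::size_t x = 0; x < n; ++x) {
            // Assumption: reads past the end contribute 0.
            const float rhs = (x + offset < n) ? cur[x + offset] : 0.0f;
            next[x] = cur[x] + rhs;
        }
        std::swap(cur, next);
    }
    return cur[0];
}
```

Because the number of passes depends on the runtime width, a fixed set of Funcs cannot be written down ahead of time, which is the difficulty the question raises.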

This is an open discussion that could open up a higher level of description for generic algorithms.

Best.

zvookin (Maintainer) · 2022-08-07T15:35:25Z

In the interest of responding quickly rather than comprehensively, have you looked at the rfactor scheduling directive? This problem likely wants each level of the decomposition to be an update stage of a single Func, not multiple Funcs, and rfactor is likely the correct way to create them.

There is an rfactor tutorial: https://github.com/halide/Halide/blob/main/tutorial/lesson_18_parallel_associative_reductions.cpp .

Alas, using the built-in helpers such as sum will not work as they do not surface enough information to schedule the code they create.
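
To make the idea of factoring an associative reduction concrete, here is a plain C++ analogue, not Halide code and not a claim about what rfactor generates. The reduction is split into an intermediate stage that computes one partial sum per chunk, where chunks are independent and need no atomics, followed by a small merge stage. The name `two_stage_sum` and the `chunk` parameter are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

float two_stage_sum(const std::vector<float>& in, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t chunks = (in.size() + chunk - 1) / chunk;
    std::vector<float> partial(chunks, 0.0f);

    // Stage 1 (intermediate): each chunk u writes only partial[u],
    // so the outer loop could run in parallel without atomics.
    for (std::size_t u = 0; u < chunks; ++u) {
        const std::size_t end = std::min(in.size(), (u + 1) * chunk);
        for (std::size_t r = u * chunk; r < end; ++r) {
            partial[u] += in[r];
        }
    }

    // Stage 2 (merge): a short sequential reduction over the partial sums.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

This relies on '+' being associative. For floating point the result can differ slightly from a strictly left-to-right sum because the additions are grouped differently.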

abadams (Maintainer) · 2022-08-08T23:54:43Z

I have a CPU vector sum scan implementation that uses Sklansky for the within-vector summation here, in case it helps: https://github.com/halide/Halide/blob/abadams/gaussian_blur_app/apps/gaussian_blur/box_blur_generator.cpp#L241

It could probably be adapted to the other approaches. I place each vector of inputs at the vertices of a hypercube, and then each pass is an update step on that Func, vectorizing over all of the RDom dimensions (each of which is size 2).
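
For comparison with the other approaches, a scalar Sklansky scan can be sketched in plain C++. This is not the code from the linked generator, only an illustration of the pattern, and it assumes the length is a power of two. At each level the array is viewed as blocks of size 2 * half, and every element in the upper half of a block adds the last element of the lower half. One possible reading of the hypercube description is that each level corresponds to one bit of the index, that is, one dimension of size 2. The name `sklansky_scan` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place inclusive prefix sum using the Sklansky pattern.
void sklansky_scan(std::vector<float>& v) {
    const std::size_t n = v.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two length assumed
    for (std::size_t half = 1; half < n; half *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            if (i & half) {
                // Last index of the lower half of the block containing i.
                const std::size_t src = (i & ~(half - 1)) - 1;
                v[i] += v[src];
            }
        }
    }
}
```

Within a level, the elements that are written (bit `half` set) and the elements that are read (bit `half` clear) are disjoint, so the update is safe in place and the iterations of the inner loop are independent.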
