This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Adds BlockRunLengthDecode algorithm and tests #354

Merged
merged 20 commits into NVIDIA:main from feature/block-run-length-decode
Oct 19, 2021

Conversation

@elstehle elstehle commented Aug 5, 2021

Algorithm Overview

The BlockRunLengthDecode class supports decoding a run-length encoded array of unique_items. That is, given the two arrays unique_items[N] and run_lengths[N], unique_items[i] is repeated run_lengths[i] times in the output array decoded_items[M]. Note: runs of length 0 are supported and will not appear in the output.
The application of BlockRunLengthDecode goes beyond just the decompression use case. Two other use cases are load balancing and generating the keys array of *_by_key algorithm variants. This is a preliminary building block for the BatchMemcpy #297.

Example:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
decoded_items 0 2 2 9 5 5 5 8
decoded_size 8

A specialisation will also output relative offsets. The relative offset indicates, for each decoded item, its offset within its run:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
decoded_items 0 2 2 9 5 5 5 8
relative_offsets 0 0 1 0 0 1 2 0
decoded_size 8

In the fashion of CUB's block-level algorithms, the items contributed by each thread are expected in a blocked arrangement. The decoded items are likewise returned in a blocked arrangement.

Due to the nature of the run-length decoding algorithm ("decompression"), the output size of an algorithm invocation is unbounded. To address this, BlockRunLengthDecode allows retrieving a "window" of the run-length decoded output. The window's offset can be specified by the user, and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (referred to as window_size) decoded items from the specified window will be returned. Memory requirements are always just O(window_size).

// The data type of the items being run-length decoded (i.e., the output data type)
using UniqueItemT = uint16_t;
// The data type of the run lengths
using RunLengthT = uint32_t;
constexpr uint32_t BLOCK_DIM_X = 128;            // the number of threads per block
constexpr uint32_t RUNS_PER_THREAD = 2;          // the number of runs contributed by each thread
constexpr uint32_t DECODED_ITEMS_PER_THREAD = 4; // the number of decoded items returned to each thread
using BlockRunLengthDecodeT = cub::BlockRunLengthDecode<UniqueItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>;

// Temporary shared memory required by the block-level run-length decode for its lifetime
__shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;

// Instantiate the BlockRunLengthDecode and give it the memory it can work with
BlockRunLengthDecodeT run_length_decode(temp_storage);

// Initialize the BlockRunLengthDecode with the runs that we want to run-length decode
UniqueItemT unique_items[RUNS_PER_THREAD];  // assume this holds the run values
RunLengthT run_lengths[RUNS_PER_THREAD];    // assume this holds the run lengths
uint32_t total_decoded_size = 0U;
run_length_decode.Init(unique_items, run_lengths, total_decoded_size);

// Run-length decode ("decompress") the runs into a window buffer of limited size. This is
// repeated until all runs have been decoded.
constexpr uint32_t WINDOW_SIZE = BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;
uint32_t decoded_window_offset = 0U;
while (decoded_window_offset < total_decoded_size)
{
  RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD];
  UniqueItemT decoded_items[DECODED_ITEMS_PER_THREAD];

  // The number of decoded items that are valid within this window (aka pass) of run-length decoding
  uint32_t num_valid_items = min(WINDOW_SIZE, total_decoded_size - decoded_window_offset);
  run_length_decode.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset);
  // Store this window's valid items to global memory (BlockStoreDecodedItemT being, e.g., a
  // cub::BlockStore specialisation with its own temporary storage)
  BlockStoreDecodedItemT(store_temp_storage)
    .Store(d_block_decoded_out + decoded_window_offset, decoded_items, num_valid_items);

  // Advance to the next window
  decoded_window_offset += WINDOW_SIZE;
}

Use Cases

  • Load-balancing batched problems: While individual problems from a batch can be processed independently, simply assigning one thread/warp to each problem has three shortcomings: (1) load imbalance: some problems may take much longer than others, leaving some threads twiddling their thumbs while others are hard at work; (2) limited parallelism: the batch size may be much smaller than the parallelism the GPU could exploit; (3) non-coalesced memory accesses, as individual problems may reside far apart in memory. RunLengthDecode makes it possible to evenly subdivide these problems into equally sized sub-problems.
  • Generating keys for *_by_keys algorithm variants
  • Compression / Decompression

Implementation Details

The original scan-based approach, from when this PR was first opened, was abandoned in favour of the binary search-based approach. In its current state, the PR uses what is described in "New Binary Search-based Approach".

New Binary Search-based Approach

Every thread is assigned to generate the same number of output items per RunLengthDecode() invocation. Each thread knows the offset from which to generate its output. For instance, if each thread is assigned to generate 4 output items, thread 0 generates output items [0, 4), thread 1 generates [4, 8), etc. Now, each thread needs to figure out which run it is initially assigned to. Taking the previous example, thread 1 needs to find the run covering offset 4.
To find the corresponding run, we compute the prefix sum over the run lengths (which yields the beginning offset of each run). We call that result the runs-offsets array. We then do a binary search into the runs-offsets array, using the thread's assigned output offset, via UpperBound. So, thread 1 would do: UpperBound(runs-offsets, my_output_offset).

Original Scan-based Approach

Overview of processing stages:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
[1] prefix sum
run_offsets 0 1 3 4 7
aggregate 8
[2] init buffer
indexes 0 1 2 3 4 5 6 7
decode_buffer.uniques - - - - - - - -
[3] scatter
decode_buffer.uniques 0 2 - 9 5 - - 8
[4] incl. scan
decoded_items 0 2 2 9 5 5 5 8
  1. An exclusive prefix sum computes the index of the beginning of each run in the decoded output.
  2. We initialize the decode_buffer with continuation-of-run items. This is just to differentiate between items that have already been resolved and items that still need to be "filled in".
  3. Scatter the beginning of each run into the appropriate position in the decode buffer, as computed in [1], if that run's length is not 0.
  4. Perform an inclusive prefix scan that uses the following bin_op (where - in the table above means is_unresolved):
__device__ __forceinline__ T operator()(const T &lhs, const T &rhs)
{
    // If the rhs is still unresolved, propagate the lhs (which may or may not be resolved _at this point_, but
    // eventually _some_ lhs will provide the answer)
    return rhs.is_unresolved ? lhs : rhs;
}

A question that arises is how to differentiate between an unresolved (i.e., -) item and an already decoded (or "resolved") item. If there were a value representable by the data type of unique_items that will never appear in the user-provided input, we could simply use it to represent unresolved items. The other alternative is to make decoded_items temporarily a pair of {unique_value, is_resolved}. I started with the former and later added the latter. The former is 10-20% faster but not so nice with regard to the interface. So currently, we only provide an interface for the latter.

If the user also wants to retrieve relative_offsets, the pair of {unique_value, is_resolved} becomes {unique_value, relative_offset} and the scan operator becomes slightly more involved.

So, essentially there are three specialisations (or instances) of BlockRunLengthDecode that have a lot of overlap implementation-wise. They mostly vary in the amount of TempStorage they require and in the signature of the RunLengthDecode member function:
UNUSED (regular run-length decoding, but the user has to tell us which value can be used to represent yet-unresolved items):

RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], UniqueItemT unused_item_that_can_be_used_to_indicate_unresolved, int32_t from_decoded_offset = 0)

NORMAL (regular run-length decoding):

RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], int32_t from_decoded_offset = 0)

OFFSETS (...):

template<typename UserRelativeOffsetT>
RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], UserRelativeOffsetT (&item_offsets)[DECODED_ITEMS_PER_THREAD], int32_t from_decoded_offset = 0)

Currently, only the NORMAL and OFFSETS specialisations are supported. Which implementation the user wants is decided by passing a template parameter to BlockRunLengthDecode.

Question:
I'm inclined to have three different classes (one for each of the above specialisations) sharing a CRTP base class. So far, CUB has refrained from having any inheritance. But here, the specialisations differ not only in implementation but also in the interface they expose. So I think different classes would be the cleanest way to express that?

Overall, I tried to match the CUB style wherever possible. The one deviation is that I decided to go with fixed-width types.

Performance

These are some numbers from a V100 when decoding uint32_t as the unique_items.

RUNS_PER_THREAD  DECODED_ITEMS_PER_THREAD  THREADS_PER_BLOCK  duration STATIC (ms)  achieved BW STATIC  duration default (ms)  achieved BW default  speedup
1 1 64 0.55936 722.254 0.601472 671.685 107.53%
1 1 64 0.487776 414.125 0.540544 373.698 110.82%
1 1 64 0.566144 713.599 0.604736 668.06 106.82%
1 1 64 0.4888 413.257 0.536896 376.237 109.84%
1 3 192 0.520352 776.398 0.518944 778.504 99.73%
1 3 192 0.301472 670.046 0.338432 596.87 112.26%
1 3 192 0.521152 775.206 0.528928 763.809 101.49%
1 3 192 0.289728 697.206 0.33712 599.193 116.36%
1 1 128 0.598688 674.809 0.667072 605.632 111.42%
1 1 128 0.5416 372.969 0.604192 334.331 111.56%
1 1 128 0.588256 686.776 0.650624 620.942 110.60%
1 1 128 0.537536 375.789 0.601728 335.7 111.94%
1 8 128 0.53088 761.001 0.523968 771.039 98.70%
1 8 128 0.282368 715.379 0.288256 700.766 102.09%
1 8 128 0.523008 772.455 0.513568 786.653 98.20%
1 8 128 0.284672 709.589 0.287648 702.247 101.05%
2 8 128 0.523648 771.511 0.52912 763.532 101.05%
2 8 128 0.288416 700.377 0.2904 695.592 100.69%
2 8 128 0.527136 766.406 0.527872 765.337 100.14%
2 8 128 0.28736 702.951 0.289472 697.822 100.74%
3 1 256 0.86096 469.244 0.845504 477.822 98.20%
3 1 256 0.739264 273.245 0.795808 253.83 107.65%
3 1 256 0.820992 492.088 0.87232 463.133 106.25%
3 1 256 0.746784 270.493 0.781408 258.508 104.64%
1 8 256 0.532384 758.851 0.5144 785.381 96.62%
1 8 256 0.28176 716.922 0.287712 702.091 102.11%
1 8 256 0.515712 783.383 0.517344 780.912 100.32%
1 8 256 0.30032 672.616 0.286784 704.363 95.49%
8 1 256 1.03683 389.648 1.09642 368.473 105.75%
8 1 256 0.941856 214.47 1.00131 201.735 106.31%
8 1 256 1.0384 389.06 1.10077 367.016 106.01%
8 1 256 0.945888 213.556 0.999488 202.103 105.67%
1 1 256 0.681152 593.113 0.73616 548.794 108.08%
1 1 256 0.59584 339.017 0.659936 306.09 110.76%
1 1 256 0.675616 597.973 0.741312 544.98 109.72%
1 1 256 0.588512 343.239 0.661696 305.276 112.44%
2 2 384 0.54528 740.904 0.554176 729.01 101.63%
2 2 384 0.426112 474.054 0.451648 447.251 105.99%
2 2 384 0.549056 735.808 0.557888 724.16 101.61%
2 2 384 0.430048 469.715 0.471104 428.78 109.55%

TODOs

  • Describe how decoding into a window of the decode buffer works
  • Performance optimisation: pack uniques and offsets into a single struct in shared memory if they both fit within a four-byte word.
  • Performance optimisation: resurrect the UNUSED specialisation
  • Split into three classes with common CRTP base class

@alliepiper alliepiper added this to the 1.14.0 milestone Aug 5, 2021
@alliepiper

Thanks @elstehle! Just a heads up, I won't be able to look at this until I get back from vacation on 8/16, but @senior-zero will do an initial review in the meantime.

Do you need this merged for any particular release?

@alliepiper

I just noticed the TODOs -- should we wait until those are finished to start reviewing?

elstehle commented Aug 5, 2021

> Thanks @elstehle! Just a heads up, I won't be able to look at this until I get back from vacation on 8/16, but @senior-zero will do an initial review in the meantime.
>
> Do you need this merged for any particular release?

Thanks, Allison. No rush. Enjoy your vacation. I'll align with @senior-zero in the meanwhile.

> I just noticed the TODOs -- should we wait until those are finished to start reviewing?

I think only the CRTP base class would be a somewhat more "intrusive" change (yet not too intrusive). If we can get a decision on whether we want to pursue it, I'll push that. All other TODOs are either optional or really just minor additions.


@gevtushenko gevtushenko left a comment


It's a crucial algorithm to have in CUB! Thank you for developing it 😃

I've proposed an alternative approach for RLE decode below. I think we should have a second iteration when it's considered.

@elstehle

elstehle commented Oct 4, 2021

On top of addressing the review comments, I also switched to a compile-time-"unrolled" binary search that always does log2(NUMBER_OF_RUNS) passes of binary search for all threads, getting rid of some extra branching in the binary search. That improved performance by another 5% on average.

"achieved BW STATIC" and "duration (ms) STATIC" represent the BW/time of the new binary search with a statically determined number of binary search passes, whereas "achieved BW default" and "duration (ms) default" denote the original binary search approach.

Experiments were run on a V100.

RUNS_PER_THREAD  DECODED_ITEMS_PER_THREAD  THREADS_PER_BLOCK  duration STATIC (ms)  achieved BW STATIC  duration default (ms)  achieved BW default  relative performance
1 1 64 0.55936 722.254 0.601472 671.685 107.53%
1 1 64 0.487776 414.125 0.540544 373.698 110.82%
1 1 64 0.566144 713.599 0.604736 668.06 106.82%
1 1 64 0.4888 413.257 0.536896 376.237 109.84%
1 3 192 0.520352 776.398 0.518944 778.504 99.73%
1 3 192 0.301472 670.046 0.338432 596.87 112.26%
1 3 192 0.521152 775.206 0.528928 763.809 101.49%
1 3 192 0.289728 697.206 0.33712 599.193 116.36%
1 1 128 0.598688 674.809 0.667072 605.632 111.42%
1 1 128 0.5416 372.969 0.604192 334.331 111.56%
1 1 128 0.588256 686.776 0.650624 620.942 110.60%
1 1 128 0.537536 375.789 0.601728 335.7 111.94%
1 8 128 0.53088 761.001 0.523968 771.039 98.70%
1 8 128 0.282368 715.379 0.288256 700.766 102.09%
1 8 128 0.523008 772.455 0.513568 786.653 98.20%
1 8 128 0.284672 709.589 0.287648 702.247 101.05%
2 8 128 0.523648 771.511 0.52912 763.532 101.05%
2 8 128 0.288416 700.377 0.2904 695.592 100.69%
2 8 128 0.527136 766.406 0.527872 765.337 100.14%
2 8 128 0.28736 702.951 0.289472 697.822 100.74%
3 1 256 0.86096 469.244 0.845504 477.822 98.20%
3 1 256 0.739264 273.245 0.795808 253.83 107.65%
3 1 256 0.820992 492.088 0.87232 463.133 106.25%
3 1 256 0.746784 270.493 0.781408 258.508 104.64%
1 8 256 0.532384 758.851 0.5144 785.381 96.62%
1 8 256 0.28176 716.922 0.287712 702.091 102.11%
1 8 256 0.515712 783.383 0.517344 780.912 100.32%
1 8 256 0.30032 672.616 0.286784 704.363 95.49%
8 1 256 1.03683 389.648 1.09642 368.473 105.75%
8 1 256 0.941856 214.47 1.00131 201.735 106.31%
8 1 256 1.0384 389.06 1.10077 367.016 106.01%
8 1 256 0.945888 213.556 0.999488 202.103 105.67%
1 1 256 0.681152 593.113 0.73616 548.794 108.08%
1 1 256 0.59584 339.017 0.659936 306.09 110.76%
1 1 256 0.675616 597.973 0.741312 544.98 109.72%
1 1 256 0.588512 343.239 0.661696 305.276 112.44%
2 2 384 0.54528 740.904 0.554176 729.01 101.63%
2 2 384 0.426112 474.054 0.451648 447.251 105.99%
2 2 384 0.549056 735.808 0.557888 724.16 101.61%
2 2 384 0.430048 469.715 0.471104 428.78 109.55%

@alliepiper alliepiper added the labels "helps: rapids" (Helps or needed by RAPIDS) and "P1: should have" (Necessary, but not critical) on Oct 14, 2021

@gevtushenko gevtushenko left a comment


Thank you for these changes, the code is faster, clearer and shorter! I have a few minor comments.

#pragma unroll
for (int i = 0; i < RUNS_PER_THREAD; i++)
{
temp_storage.runs.run_values[thread_dst_offset] = run_values[i];

This memory access pattern can produce bank conflicts (for example with power-of-two RUNS_PER_THREAD). I wonder if padding insertion can help. You can check BlockExchange::ScatterToBlocked for reference. It uses SHR_ADD to distribute accesses. If you are time-limited, it should be fine to file a different issue and research this later.

@elstehle (author) replied:

I'll think about it. It won't be straightforward due to the subsequent binary search on that array.


@alliepiper alliepiper left a comment


Had some minor comments but overall I like how this is looking 👍

@elstehle elstehle force-pushed the feature/block-run-length-decode branch from 3ab765c to 5ea9eaa Compare October 18, 2021 08:54
@alliepiper alliepiper assigned alliepiper and unassigned elstehle Oct 18, 2021
alliepiper added a commit to alliepiper/thrust that referenced this pull request Oct 18, 2021
@alliepiper

LGTM -- kicking off CI:

DVS CL: 30545251
gpuCI: NVIDIA/thrust#1540

@alliepiper alliepiper added the labels "testing: gpuCI passed" (Passed gpuCI testing) and "testing: internal ci passed" (Passed internal NVIDIA CI (DVS)), and removed "testing: gpuCI in progress" and "testing: internal ci in progress" on Oct 18, 2021
@alliepiper alliepiper dismissed gevtushenko’s stale review October 19, 2021 16:01

Requested changes were made.

@alliepiper alliepiper merged commit da50ae9 into NVIDIA:main Oct 19, 2021
@alliepiper

All set! This will ship with 1.15.
