This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Adds BlockRunLengthDecode algorithm and tests #354

Merged
merged 20 commits into NVIDIA:main from feature/block-run-length-decode
Oct 19, 2021

Conversation

@elstehle elstehle commented Aug 5, 2021

Algorithm Overview

The BlockRunLengthDecode class supports decoding a run-length encoded array of unique_items. That is, given the two arrays unique_items[N] and run_lengths[N], unique_items[i] is repeated run_lengths[i] times in the output array decoded_items[M]. Note: runs of length 0 are supported and will not appear in the output.
The application of BlockRunLengthDecode goes beyond just the decompression use case. Two other use cases are load balancing and generating the keys array of *_by_key algorithm variants. This is a preliminary building block for the BatchMemcpy #297.

Example:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
decoded_items 0 2 2 9 5 5 5 8
decoded_size 8

A specialisation will also output relative offsets. The relative offset indicates, for each decoded item, its offset within its run:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
decoded_items 0 2 2 9 5 5 5 8
relative_offsets 0 0 1 0 0 1 2 0
decoded_size 8

In the fashion of CUB's block-level algorithms, the items contributed by each thread are expected in a blocked arrangement. The decoded items are likewise returned in a blocked arrangement.

Due to the nature of the run-length decoding algorithm ("decompression"), the output size of an algorithm invocation is unbounded. To address this, BlockRunLengthDecode allows retrieving a "window" of the run-length decoded output. The window's offset can be specified by the user, and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (referred to as window_size) decoded items from the specified window will be returned. Memory requirements are always just O(window_size).

// The data type of the items being run-length decoded (i.e., the output data type)
using UniqueItemT = uint16_t;
// The data type of the run lengths
using RunLengthT = uint32_t;
constexpr uint32_t BLOCK_DIM_X = 128;            // the number of threads per block
constexpr uint32_t RUNS_PER_THREAD = 2;          // the number of runs contributed by each thread
constexpr uint32_t DECODED_ITEMS_PER_THREAD = 4; // the number of decoded items returned to each thread
using BlockRunLengthDecodeT = cub::BlockRunLengthDecode<UniqueItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>;

// Temporary shared memory required by the block-level run-length decode for its lifetime
__shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;

// Instantiate the BlockRunLengthDecode and give it the memory it can work with
BlockRunLengthDecodeT run_length_decode(temp_storage);

// Initialize the BlockRunLengthDecode with the runs that we want to run-length decode
UniqueItemT unique_items[RUNS_PER_THREAD];  // assume this holds the run values
RunLengthT run_lengths[RUNS_PER_THREAD];    // assume this holds the run lengths
uint32_t total_decoded_size = 0U;
run_length_decode.Init(unique_items, run_lengths, total_decoded_size);

// Run-length decode ("decompress") the runs into a window buffer of limited size. This is
// repeated until all runs have been decoded.
constexpr uint32_t WINDOW_SIZE = BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;
uint32_t decoded_window_offset = 0U;
while (decoded_window_offset < total_decoded_size)
{
  RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD];
  UniqueItemT decoded_items[DECODED_ITEMS_PER_THREAD];

  // The number of decoded items that are valid within this window (aka pass) of run-length decoding
  uint32_t num_valid_items = min(WINDOW_SIZE, total_decoded_size - decoded_window_offset);
  run_length_decode.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset);
  // Store this window's valid items to global memory (BlockStoreDecodedItemT being, e.g., a
  // cub::BlockStore specialisation with its own temporary storage)
  BlockStoreDecodedItemT(store_temp_storage)
    .Store(d_block_decoded_out + decoded_window_offset, decoded_items, num_valid_items);

  // Advance to the next window
  decoded_window_offset += WINDOW_SIZE;
}

Use Cases

  • Load-balancing batched problems: While individual problems from a batch can be processed independently, simply assigning one thread/warp to each problem has three shortcomings: (1) load imbalance: some problems may take much longer than others, leaving some threads twiddling their thumbs while others are hard at work; (2) limited parallelism: the batch size may be much smaller than the parallelism the GPU could exploit; (3) non-coalesced memory accesses, as individual problems may reside far apart in memory. RunLengthDecode makes it possible to evenly subdivide these problems into equally sized sub-problems.
  • Generating keys for *_by_keys algorithm variants
  • Compression / Decompression

Implementation Details

The original scan-based approach, from when this PR was first opened, was abandoned in favour of the binary search-based approach. In its current state, the PR uses what is described in "New Binary Search-based Approach".

New Binary Search-based Approach

Every thread is assigned to generate the same number of output items per RunLengthDecode() invocation. Each thread knows the offset from which to generate its output. For instance, if each thread is assigned to generate 4 output items, thread 0 generates output items [0, 4), thread 1 generates [4, 8), etc. Now, each thread needs to figure out which run it is initially assigned to. Taking the previous example, thread 1 needs to find the run covering offset 4.
To find the corresponding run, we compute the prefix sum over the run lengths (which yields the beginning offset of each run). We call that result the runs-offsets array. We then do a binary search into the runs-offsets array, using the thread's assigned output offset, via UpperBound. So, thread 1 would do: UpperBound(runs-offsets, my_output_offset).

Original Scan-based Approach

Overview of processing stages:

unique_items 0 2 9 5 8
run_lengths 1 2 1 3 1
[1] prefix sum
run_offsets 0 1 3 4 7
aggregate 8
[2] init buffer
indexes 0 1 2 3 4 5 6 7
decode_buffer.uniques - - - - - - - -
[3] scatter
decode_buffer.uniques 0 2 - 9 5 - - 8
[4] incl. scan
decoded_items 0 2 2 9 5 5 5 8
  1. An exclusive prefix sum computes the index of the beginning of each run in the decoded output.
  2. We initialize the decode_buffer with continuation-of-run items. This is just to differentiate between items that have already been resolved and items that still need to be "filled in".
  3. Scatter the beginning of each run into the appropriate position in the decode buffer, as computed in [1], if that run's length is not 0.
  4. Perform an inclusive prefix scan that uses the following bin_op (where - in the table above means is_unresolved):
__device__ __forceinline__ T operator()(const T &lhs, const T &rhs)
{
    // If the rhs is still unresolved, propagate the lhs (which may or may not be resolved _at this point_, but
    // eventually _some_ lhs will provide the answer)
    return rhs.is_unresolved ? lhs : rhs;
}

A question that arises is how to differentiate between an unresolved (i.e., -) item and an already decoded (or "resolved") item. If there were a value representable by the data type of unique_items that will never appear in the user-provided input, we could simply use it to represent unresolved items. The other alternative is to make decoded_items temporarily a pair of {unique_value, is_resolved}. I started with the former and later added the latter. The former is 10-20% faster but not so nice with regard to the interface. So currently, we only provide an interface for the latter.

If the user also wants to retrieve relative_offsets, the pair of {unique_value, is_resolved} becomes {unique_value, relative_offset} and the scan operator becomes slightly more involved.

So, essentially there are three specialisations (or instances) of BlockRunLengthDecode that have a lot of overlap implementation-wise. They mostly vary in the amount of TempStorage they require and in the signature of the RunLengthDecode member function:
UNUSED (regular run-length decoding, but the user has to tell us which value can be used to represent yet-unresolved items):

RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], UniqueItemT unused_item_that_can_be_used_to_indicate_unresolved, int32_t from_decoded_offset = 0)

NORMAL (regular run-length decoding):

RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], int32_t from_decoded_offset = 0)

OFFSETS (...):

template<typename UserRelativeOffsetT>
RunLengthDecode(UniqueItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], UserRelativeOffsetT (&item_offsets)[DECODED_ITEMS_PER_THREAD], int32_t from_decoded_offset = 0)

Currently, only the NORMAL and OFFSETS specialisations are supported. Which implementation the user wants is decided by passing a template parameter to BlockRunLengthDecode.

Question:
I'm inclined to have three different classes (one for each of the above specialisations) sharing a CRTP base class. So far, CUB has refrained from having any inheritance. But here, the specialisations differ not only in implementation but also in the interface they expose. So I think different classes would be the cleanest way to express that?

Overall, I tried to match the CUB style wherever possible. The one deviation is that I decided to go with fixed-width types.

Performance

These are some numbers from a V100 when decoding uint32_t as the unique_items.

RUNS_PER_THREAD  DECODED_ITEMS_PER_THREAD  THREADS_PER_BLOCK  duration STATIC (ms)  achieved BW STATIC  duration default (ms)  achieved BW default  speedup
1 1 64 0.55936 722.254 0.601472 671.685 107.53%
1 1 64 0.487776 414.125 0.540544 373.698 110.82%
1 1 64 0.566144 713.599 0.604736 668.06 106.82%
1 1 64 0.4888 413.257 0.536896 376.237 109.84%
1 3 192 0.520352 776.398 0.518944 778.504 99.73%
1 3 192 0.301472 670.046 0.338432 596.87 112.26%
1 3 192 0.521152 775.206 0.528928 763.809 101.49%
1 3 192 0.289728 697.206 0.33712 599.193 116.36%
1 1 128 0.598688 674.809 0.667072 605.632 111.42%
1 1 128 0.5416 372.969 0.604192 334.331 111.56%
1 1 128 0.588256 686.776 0.650624 620.942 110.60%
1 1 128 0.537536 375.789 0.601728 335.7 111.94%
1 8 128 0.53088 761.001 0.523968 771.039 98.70%
1 8 128 0.282368 715.379 0.288256 700.766 102.09%
1 8 128 0.523008 772.455 0.513568 786.653 98.20%
1 8 128 0.284672 709.589 0.287648 702.247 101.05%
2 8 128 0.523648 771.511 0.52912 763.532 101.05%
2 8 128 0.288416 700.377 0.2904 695.592 100.69%
2 8 128 0.527136 766.406 0.527872 765.337 100.14%
2 8 128 0.28736 702.951 0.289472 697.822 100.74%
3 1 256 0.86096 469.244 0.845504 477.822 98.20%
3 1 256 0.739264 273.245 0.795808 253.83 107.65%
3 1 256 0.820992 492.088 0.87232 463.133 106.25%
3 1 256 0.746784 270.493 0.781408 258.508 104.64%
1 8 256 0.532384 758.851 0.5144 785.381 96.62%
1 8 256 0.28176 716.922 0.287712 702.091 102.11%
1 8 256 0.515712 783.383 0.517344 780.912 100.32%
1 8 256 0.30032 672.616 0.286784 704.363 95.49%
8 1 256 1.03683 389.648 1.09642 368.473 105.75%
8 1 256 0.941856 214.47 1.00131 201.735 106.31%
8 1 256 1.0384 389.06 1.10077 367.016 106.01%
8 1 256 0.945888 213.556 0.999488 202.103 105.67%
1 1 256 0.681152 593.113 0.73616 548.794 108.08%
1 1 256 0.59584 339.017 0.659936 306.09 110.76%
1 1 256 0.675616 597.973 0.741312 544.98 109.72%
1 1 256 0.588512 343.239 0.661696 305.276 112.44%
2 2 384 0.54528 740.904 0.554176 729.01 101.63%
2 2 384 0.426112 474.054 0.451648 447.251 105.99%
2 2 384 0.549056 735.808 0.557888 724.16 101.61%
2 2 384 0.430048 469.715 0.471104 428.78 109.55%

TODOs

  • Describe how decoding into a window of the decode buffer works
  • Performance optimisation: pack uniques and offsets into a single struct in shared memory if they both fit within a four-byte word.
  • Performance optimisation: resurrect the UNUSED specialisation
  • Split into three classes with common CRTP base class

@alliepiper alliepiper added this to the 1.14.0 milestone Aug 5, 2021
@alliepiper

Thanks @elstehle! Just a heads up, I won't be able to look at this until I get back from vacation on 8/16, but @senior-zero will do an initial review in the meantime.

Do you need this merged for any particular release?

@alliepiper

I just noticed the TODOs -- should we wait until those are finished to start reviewing?

elstehle commented Aug 5, 2021

> Thanks @elstehle! Just a heads up, I won't be able to look at this until I get back from vacation on 8/16, but @senior-zero will do an initial review in the meantime.
>
> Do you need this merged for any particular release?

Thanks, Allison. No rush. Enjoy your vacation. I'll align with @senior-zero in the meanwhile.

> I just noticed the TODOs -- should we wait until those are finished to start reviewing?

I think only the CRTP base class would be a somewhat more "intrusive" change (yet not too intrusive). If we can get a decision on whether we want to pursue it, I'll push that. All other TODOs are either optional or really just minor additions.


@gevtushenko gevtushenko left a comment


It's a crucial algorithm to have in CUB! Thank you for developing it 😃

I've proposed an alternative approach for RLE decode below. I think we should have a second iteration when it's considered.

@elstehle

elstehle commented Oct 4, 2021

On top of addressing the review comments, I also switched to a compile-time-"unrolled" binary search that always does log2(NUMBER_OF_RUNS) passes of binary search for all threads, getting rid of some extra branching in the binary search. That improved performance by another 5% on average.

"achieved BW STATIC" and "duration (ms) STATIC" represent the BW/time of the new binary search with a statically determined number of binary search passes, whereas "achieved BW default" and "duration (ms) default" denote the original binary search approach.

Experiments were run on a V100.

RUNS_PER_THREAD  DECODED_ITEMS_PER_THREAD  THREADS_PER_BLOCK  duration STATIC (ms)  achieved BW STATIC  duration default (ms)  achieved BW default  relative performance
1 1 64 0.55936 722.254 0.601472 671.685 107.53%
1 1 64 0.487776 414.125 0.540544 373.698 110.82%
1 1 64 0.566144 713.599 0.604736 668.06 106.82%
1 1 64 0.4888 413.257 0.536896 376.237 109.84%
1 3 192 0.520352 776.398 0.518944 778.504 99.73%
1 3 192 0.301472 670.046 0.338432 596.87 112.26%
1 3 192 0.521152 775.206 0.528928 763.809 101.49%
1 3 192 0.289728 697.206 0.33712 599.193 116.36%
1 1 128 0.598688 674.809 0.667072 605.632 111.42%
1 1 128 0.5416 372.969 0.604192 334.331 111.56%
1 1 128 0.588256 686.776 0.650624 620.942 110.60%
1 1 128 0.537536 375.789 0.601728 335.7 111.94%
1 8 128 0.53088 761.001 0.523968 771.039 98.70%
1 8 128 0.282368 715.379 0.288256 700.766 102.09%
1 8 128 0.523008 772.455 0.513568 786.653 98.20%
1 8 128 0.284672 709.589 0.287648 702.247 101.05%
2 8 128 0.523648 771.511 0.52912 763.532 101.05%
2 8 128 0.288416 700.377 0.2904 695.592 100.69%
2 8 128 0.527136 766.406 0.527872 765.337 100.14%
2 8 128 0.28736 702.951 0.289472 697.822 100.74%
3 1 256 0.86096 469.244 0.845504 477.822 98.20%
3 1 256 0.739264 273.245 0.795808 253.83 107.65%
3 1 256 0.820992 492.088 0.87232 463.133 106.25%
3 1 256 0.746784 270.493 0.781408 258.508 104.64%
1 8 256 0.532384 758.851 0.5144 785.381 96.62%
1 8 256 0.28176 716.922 0.287712 702.091 102.11%
1 8 256 0.515712 783.383 0.517344 780.912 100.32%
1 8 256 0.30032 672.616 0.286784 704.363 95.49%
8 1 256 1.03683 389.648 1.09642 368.473 105.75%
8 1 256 0.941856 214.47 1.00131 201.735 106.31%
8 1 256 1.0384 389.06 1.10077 367.016 106.01%
8 1 256 0.945888 213.556 0.999488 202.103 105.67%
1 1 256 0.681152 593.113 0.73616 548.794 108.08%
1 1 256 0.59584 339.017 0.659936 306.09 110.76%
1 1 256 0.675616 597.973 0.741312 544.98 109.72%
1 1 256 0.588512 343.239 0.661696 305.276 112.44%
2 2 384 0.54528 740.904 0.554176 729.01 101.63%
2 2 384 0.426112 474.054 0.451648 447.251 105.99%
2 2 384 0.549056 735.808 0.557888 724.16 101.61%
2 2 384 0.430048 469.715 0.471104 428.78 109.55%

@alliepiper alliepiper added the labels "helps: rapids" (Helps or needed by RAPIDS) and "P1: should have" (Necessary, but not critical) on Oct 14, 2021

@gevtushenko gevtushenko left a comment


Thank you for these changes, the code is faster, clearer and shorter! I have a few minor comments.

#pragma unroll
for (int i = 0; i < RUNS_PER_THREAD; i++)
{
temp_storage.runs.run_values[thread_dst_offset] = run_values[i];

This memory access pattern can produce bank conflicts (for example with power-of-two RUNS_PER_THREAD). I wonder if padding insertion can help. You can check BlockExchange::ScatterToBlocked for reference. It uses SHR_ADD to distribute accesses. If you are time-limited, it should be fine to file a different issue and research this later.

@elstehle (author) replied:

I'll think about it. It won't be straightforward due to the subsequent binary search on that array.


@alliepiper alliepiper left a comment


Had some minor comments but overall I like how this is looking 👍

@elstehle elstehle force-pushed the feature/block-run-length-decode branch from 3ab765c to 5ea9eaa Compare October 18, 2021 08:54
@alliepiper alliepiper assigned alliepiper and unassigned elstehle Oct 18, 2021
alliepiper added a commit to alliepiper/thrust that referenced this pull request Oct 18, 2021
@alliepiper

LGTM -- kicking off CI:

DVS CL: 30545251
gpuCI: NVIDIA/thrust#1540

@alliepiper alliepiper added the labels "testing: gpuCI passed" (Passed gpuCI testing) and "testing: internal ci passed" (Passed internal NVIDIA CI (DVS)), and removed "testing: gpuCI in progress" and "testing: internal ci in progress" on Oct 18, 2021
@alliepiper alliepiper dismissed gevtushenko’s stale review October 19, 2021 16:01

Requested changes were made.

@alliepiper alliepiper merged commit da50ae9 into NVIDIA:main Oct 19, 2021
@alliepiper

All set! This will ship with 1.15.
