Adds BlockRunLengthDecode algorithm and tests #354
Thanks @elstehle! Just a heads up, I won't be able to look at this until I get back from vacation on 8/16, but @senior-zero will do an initial review in the meantime. Do you need this merged for any particular release?
I just noticed the TODOs -- should we wait until those are finished to start reviewing?
Thanks, Allison. No rush. Enjoy your vacation. I'll align with @senior-zero in the meanwhile.
I think, only the CRTP base class would be some more "intrusive" change (yet, not too intrusive). If we can get a decision on whether we want to pursue it, I'll push that. All other TODOs are either optional or really just minor additions.
It's a crucial algorithm to have in CUB! Thank you for developing it 😃
I've proposed an alternative approach for RLE decode below. I think we should have a second iteration when it's considered.
975e74f to 40951c0
On top of addressing the review comments, I also switched to a compile-time-"unrolled" binary search that would always do log2(
Experiments were run on a V100.
Thank you for these changes, the code is faster, clearer and shorter! I have a few minor comments.
#pragma unroll
for (int i = 0; i < RUNS_PER_THREAD; i++)
{
  temp_storage.runs.run_values[thread_dst_offset] = run_values[i];
This memory access pattern can produce bank conflicts (for example with power-of-two RUNS_PER_THREAD). I wonder if padding insertion can help. You can check BlockExchange::ScatterToBlocked for reference. It uses SHR_ADD to distribute accesses. If you are time-limited, it should be fine to file a different issue and research this later.
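To illustrate the padding idea, here is a minimal host-side sketch, not code from the PR. It assumes 32 shared-memory banks of one word each, and it assumes that thread t writes to element t * RUNS_PER_THREAD; both are assumptions made for illustration. The helper names (padded_index, max_threads_per_bank) are hypothetical. The padding rule index + (index >> log2(banks)) mirrors the shift-and-add style of index adjustment the comment refers to.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumption: 32 banks, each one word wide.
constexpr int kLogBanks = 5;
constexpr int kBanks    = 1 << kLogBanks;

// Hypothetical helper: inserts one padding slot after every kBanks elements.
constexpr int padded_index(int index) { return index + (index >> kLogBanks); }

// Largest number of threads of a 32-thread warp that map to the same bank
// when thread t accesses element t * runs_per_thread (assumed access pattern).
int max_threads_per_bank(int runs_per_thread, bool insert_padding)
{
  assert(runs_per_thread > 0);
  std::array<int, kBanks> hits{};
  for (int thread = 0; thread < kBanks; ++thread)
  {
    int index = thread * runs_per_thread;
    if (insert_padding)
    {
      index = padded_index(index);
    }
    ++hits[index % kBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, a stride of 4 gives a 4-way conflict without padding and no conflict with padding. Note that padding changes where each element lives, so any later lookup on the array has to apply the same index mapping.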
I'll think about it. It won't be straightforward due to the subsequent binary search on that array.
Had some minor comments but overall I like how this is looking 👍
3ab765c to 5ea9eaa
LGTM -- kicking off CI: DVS CL: 30545251
All set! This will ship with 1.15.
Algorithm Overview
The BlockRunLengthDecode class supports decoding a run-length encoded array of unique_items. That is, given the two arrays unique_items[N] and run_lengths[N], unique_items[i] is repeated run_lengths[i] many times in the output array decoded_items[M]. Note: runs of length 0 are supported and will not appear in the output.
The application of BlockRunLengthDecode goes beyond just the decompression use case. Two other use cases are load-balancing and generating the keys array of *_by_key algorithm variants. This is a preliminary building block for the BatchMemcpy #297.
Example:
Specialisation that will also output relative offsets:
The relative offset indicates the offset within its run for each decoded item.
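As a rough illustration of these semantics, the following sequential host-side reference decodes the runs and records each item's relative offset. It is only a sketch of the expected result, not the CUB interface: DecodedRuns and reference_run_length_decode are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the decoded items plus each item's offset within its run.
struct DecodedRuns
{
  std::vector<std::uint32_t> decoded_items;
  std::vector<std::uint32_t> relative_offsets;
};

// Sequential reference: unique_items[i] is repeated run_lengths[i] times.
// Runs of length 0 contribute nothing to the output.
DecodedRuns reference_run_length_decode(const std::vector<std::uint32_t> &unique_items,
                                        const std::vector<std::uint32_t> &run_lengths)
{
  assert(unique_items.size() == run_lengths.size());
  DecodedRuns out;
  for (std::size_t i = 0; i < unique_items.size(); ++i)
  {
    for (std::uint32_t k = 0; k < run_lengths[i]; ++k)
    {
      out.decoded_items.push_back(unique_items[i]);
      out.relative_offsets.push_back(k);
    }
  }
  return out;
}
```

For unique_items = {7, 9, 5} and run_lengths = {2, 0, 3}, this yields decoded_items = {7, 7, 5, 5, 5} and relative_offsets = {0, 1, 0, 1, 2}.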
In the fashion of CUB's block-level algorithms, the items being contributed by each thread are expected in a blocked arrangement. The items being returned are also returned in a blocked arrangement.
Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the algorithm invocation is unbounded. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded output. The window's offset can be specified by the user and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (referred to as window_size) decoded items from the specified window will be returned. Memory requirements are always just O(window_size).
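The window idea can be sketched on the host as follows. This is a sequential illustration under the assumption that a window is simply the half-open range [window_offset, window_offset + window_size) of the full decoded output; decode_window is a hypothetical helper, not part of the PR. Only the window itself is materialised, never the full decoded output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns only the decoded items that fall into
// [window_offset, window_offset + window_size).
std::vector<std::uint32_t> decode_window(const std::vector<std::uint32_t> &unique_items,
                                         const std::vector<std::uint32_t> &run_lengths,
                                         std::uint64_t window_offset,
                                         std::uint64_t window_size)
{
  assert(unique_items.size() == run_lengths.size());
  std::vector<std::uint32_t> window;
  const std::uint64_t window_end = window_offset + window_size;
  std::uint64_t run_begin        = 0;
  for (std::size_t i = 0; i < unique_items.size(); ++i)
  {
    const std::uint64_t run_end = run_begin + run_lengths[i];
    // Overlap of this run with the requested window (empty if lo >= hi).
    const std::uint64_t lo = std::max(run_begin, window_offset);
    const std::uint64_t hi = std::min(run_end, window_end);
    for (std::uint64_t j = lo; j < hi; ++j)
    {
      window.push_back(unique_items[i]);
    }
    run_begin = run_end;
  }
  return window;
}
```

Calling it repeatedly with window_offset advanced by window_size each time walks through the whole decoded output in fixed-size pieces; the last window may come back shorter.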
Use Cases
RunLengthDecode allows to evenly sub-divide these problems into equi-sized sub-problems.
*_by_keys algorithm variants
Implementation Details
The original approach (Scan-based Approach), from when this PR was originally opened, was replaced in favour of the "New Binary Search-based Approach". In the current state, the PR uses what is described in "New Binary Search-based Approach".
New Binary Search-based Approach
Every thread is assigned to generate the same number of output items per RunLengthDecode() invocation. Each thread knows the offset from which to generate the output. For instance, if each thread is assigned to generate 4 output items, thread0 is generating output items [0,4), thread1 does [4,8), etc. Now, each thread needs to figure out which run it is initially assigned to. Taking the previous example, thread1 needs to find out the run at offset 4.
To find out the corresponding run, we use the prefix sum over the run lengths (that yields us the beginning offset of each run). We call that result the runs-offsets array. We then do a binary search into the runs-offsets array with the thread's assigned output offset, using UpperBound. So, thread1 would do: UpperBound(runs-offsets, my_output_offset).
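A host-side sketch of this lookup, simulating the threads with a loop. It is an illustration under stated assumptions rather than the PR's implementation: std::upper_bound stands in for the UpperBound mentioned above, the runs-offsets array is taken to be the exclusive prefix sum of the run lengths, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum of the run lengths: run_offsets[i] is where run i begins.
std::vector<std::uint32_t> compute_run_offsets(const std::vector<std::uint32_t> &run_lengths)
{
  std::vector<std::uint32_t> run_offsets(run_lengths.size());
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < run_lengths.size(); ++i)
  {
    run_offsets[i] = sum;
    sum += run_lengths[i];
  }
  return run_offsets;
}

// Index of the last run whose begin offset is <= output_offset.
// Picking the last such run skips over zero-length runs.
std::size_t find_run(const std::vector<std::uint32_t> &run_offsets, std::uint32_t output_offset)
{
  auto it = std::upper_bound(run_offsets.begin(), run_offsets.end(), output_offset);
  assert(it != run_offsets.begin());
  return static_cast<std::size_t>(it - run_offsets.begin()) - 1;
}

// Simulates num_threads threads that each produce items_per_thread output items.
std::vector<std::uint32_t> decode_as_threads(const std::vector<std::uint32_t> &unique_items,
                                             const std::vector<std::uint32_t> &run_lengths,
                                             std::uint32_t num_threads,
                                             std::uint32_t items_per_thread)
{
  assert(unique_items.size() == run_lengths.size());
  const std::vector<std::uint32_t> run_offsets = compute_run_offsets(run_lengths);
  std::uint32_t total = 0;
  for (std::uint32_t len : run_lengths)
  {
    total += len;
  }

  std::vector<std::uint32_t> out;
  for (std::uint32_t thread = 0; thread < num_threads; ++thread)
  {
    std::uint32_t offset = thread * items_per_thread;
    if (offset >= total)
    {
      break;
    }
    // One binary search per thread finds the run it starts in.
    std::size_t run = find_run(run_offsets, offset);
    for (std::uint32_t k = 0; k < items_per_thread && offset < total; ++k, ++offset)
    {
      // Move on once the current run is exhausted (also skips zero-length runs).
      while (run + 1 < run_offsets.size() && offset >= run_offsets[run + 1])
      {
        ++run;
      }
      out.push_back(unique_items[run]);
    }
  }
  return out;
}
```

After the initial search, each simulated thread only walks forward through the runs, so the binary search cost is paid once per thread rather than once per output item.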
Original Scan-based Approach
Overview of processing stages:
decode_buffer with continuation-of-run items. This is just to differentiate between items that have already been resolved and items that yet need to be "filled in".
0.
bin_op (where - from the table above means is_unresolved)
A question that arises is how to differentiate between an unresolved (i.e., -) and an already decoded (or "resolved") item. If there was a value representable by the data type of unique_items that will never appear in the user-provided input, we could simply use that to represent unresolved items. The other alternative is to make decoded_items temporarily a pair of {unique_value, is_resolved}. I had started with the former and later added the latter. The former is 10-20% faster but not so nice with regards to the interface. So currently, we only provide an interface for the latter.
If the user also wants to retrieve relative_offsets, then the pair of {unique_value, is_resolved} becomes {unique_value, relative_offset} and the scan operator becomes slightly more involved.
So, essentially there are three specialisations (or instances) of the BlockRunLengthDecode which have a lot of overlap implementation-wise. They mostly vary in the amount of TempStorage they require and the RunLengthDecode member function signature:
UNUSED (regular run-length decoding, but the user has to tell us which value we can use to represent yet-unresolved items):
NORMAL (regular run-length decoding):
OFFSETS (...):
Currently only OFFSETS and NORMAL are supported. And which implementation the user wants is decided by passing a template parameter to BlockRunLengthDecode.
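For illustration, a sequential host-side sketch of how a scan-based decode with {unique_value, is_resolved} pairs could work. It is a reconstruction under assumptions, not the code from the PR: it assumes that each non-empty run's value is scattered to the run's begin offset (the exclusive prefix sum of the run lengths) and that an inclusive scan then carries the most recent resolved item forward. Item and scan_based_decode are hypothetical names; bin_op and decode_buffer follow the names used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pair type: a value plus a flag telling whether it is resolved.
struct Item
{
  std::uint32_t unique_value;
  bool is_resolved;
};

// Assumed scan operator: a resolved right-hand item wins, otherwise the
// left-hand item is carried forward. This operator is associative.
Item bin_op(const Item &lhs, const Item &rhs) { return rhs.is_resolved ? rhs : lhs; }

std::vector<std::uint32_t> scan_based_decode(const std::vector<std::uint32_t> &unique_items,
                                             const std::vector<std::uint32_t> &run_lengths)
{
  assert(unique_items.size() == run_lengths.size());
  std::size_t total = 0;
  for (std::uint32_t len : run_lengths)
  {
    total += len;
  }

  // Fill the buffer with unresolved ("continuation-of-run") items.
  std::vector<Item> decode_buffer(total, Item{0, false});

  // Scatter each non-empty run's value to the offset where the run begins.
  std::size_t run_begin = 0;
  for (std::size_t i = 0; i < unique_items.size(); ++i)
  {
    if (run_lengths[i] > 0)
    {
      decode_buffer[run_begin] = Item{unique_items[i], true};
    }
    run_begin += run_lengths[i];
  }

  // Inclusive scan with bin_op fills in the unresolved items.
  for (std::size_t i = 1; i < total; ++i)
  {
    decode_buffer[i] = bin_op(decode_buffer[i - 1], decode_buffer[i]);
  }

  std::vector<std::uint32_t> decoded_items;
  decoded_items.reserve(total);
  for (const Item &item : decode_buffer)
  {
    decoded_items.push_back(item.unique_value);
  }
  return decoded_items;
}
```

The separate is_resolved flag is what removes the need for a reserved "unused" value, at the cost of a wider item in the scan.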
Question:
I'm inclined to have three different super classes (one for each of above specialisations) with a CRTP base class. So far CUB has refrained from having any inheritance. But here, the specialisations not only differ in the implementation but also the interface they expose. So, I think different classes would be the cleanest way to express that?
Overall I tried to match the CUB style wherever possible. The one deviation is that I decided to go with fixed-width types.
Performance
These are some numbers from a V100 when decoding uint32_t as the unique_items.
TODOs
uniques and offsets into a single struct in shared memory if they both fit within a four-byte word.
UnusedUnique specialisation