Welcome to GPUSorting Discussions! #8

b0nes164 · 2024-09-20T17:46:33Z

b0nes164
Sep 20, 2024
Maintainer

👋 Welcome!

We’re using Discussions as a place to connect with other members of our community. We hope that you:

Ask questions you’re wondering about.
Share ideas.
Engage with other community members.
Welcome others and are open-minded. Remember that this is a community we build together 💪.

To get started, comment below with an introduction of yourself and tell us about what you do with this community.

cmhhelgeson · 2024-09-23T04:43:14Z

cmhhelgeson
Sep 23, 2024

Hi b0nes,

Outside of the readings listed on the GitHub, are there any video lectures, PowerPoints, or visualizations of the different parallel prefix sum algorithms that could be used to better understand them?

2 replies

b0nes164 Sep 23, 2024
Maintainer Author

EDIT: I misread your question, and thought you were asking about sorting.

Hi @cmhhelgeson,

Great question. There are probably some out there (sorting diagrams), but I'm not aware of them. I wish I could help, but I mainly learned the algorithm by reading the paper. I do plan on making at least some sort of visualization or writeup, but I've been so busy recently that I doubt I'll ever get around to it.

The best advice I can give is to ignore sorting and learn how to do a device-level prefix sum. Once you grok how to do a prefix sum, the sorting will follow. I do have some help on that part. If you go to my other repo here, I have a survey of prefix sums, and I show how to build up to a single thread-block prefix sum. I recommend starting at TrueBlockInclusiveScan, and once you have an understanding of that, you can start looking at a device prefix sum that uses multiple thread blocks, like ReduceThenScan and ChainedScanDecoupledLookback. I'll also reupload some prefix sum diagrams I have lying around, which should help with visualizing the prefix sum patterns. Good luck!
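For readers following along, here is a minimal CPU-side sketch of the reduce-then-scan idea in Python. It is not the code from the repository mentioned above; the function name and the `tile_size` parameter are hypothetical, and each tile stands in for the work one thread block would do.

```python
from itertools import accumulate

def reduce_then_scan(values, tile_size=4):
    """Inclusive prefix sum in three phases, mimicking a multi-block GPU scan.

    tile_size is a hypothetical stand-in for the number of elements
    handled by one thread block.
    """
    tiles = [values[i:i + tile_size] for i in range(0, len(values), tile_size)]

    # Phase 1 (reduce): each "block" independently sums its own tile.
    tile_sums = [sum(tile) for tile in tiles]

    # Phase 2 (scan the reductions): an exclusive scan over the per-tile sums
    # gives each tile the total of everything that precedes it.
    tile_offsets = [0] + list(accumulate(tile_sums))[:-1]

    # Phase 3 (scan): each "block" scans its own tile and adds its offset.
    out = []
    for tile, offset in zip(tiles, tile_offsets):
        out.extend(offset + x for x in accumulate(tile))
    return out

print(reduce_then_scan([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

On a GPU, phases 1 and 3 run across many thread blocks in parallel, and only phase 2 needs information from other tiles. That is why a single thread-block scan is the natural first thing to understand.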

b0nes164 Sep 23, 2024
Maintainer Author

Ok, I've added back the survey diagrams, which can be found on the README. That should take you all the way up to Reduce-Then-Scan. ChainedScanDecoupledLookback I don't have a diagram for. More specifically, I don't have a good diagram for it, but you should be able to get it once you get RTS.

vassvik · 2024-09-23T10:00:21Z

vassvik
Sep 23, 2024

Hi, I'm Morten, I work with JangaFX on VFX authoring tools for games and film such as EmberGen and LiquiGen.

My primary interests in terms of sorting are sorting smallish input sizes relative to most GPUs (i.e. between 2^10 and 2^18 or so key-values, barely enough to saturate the highest end GPUs) with as little overhead as possible, as well as segmented sorts of similar ranges of larger datasets. Doing a global sort of relatively small datasets (for the GPU) and doing smallish segmented sorts for larger datasets has some interesting similarities, but also very different tradeoffs, which I find interesting and somewhat underexplored. Add in the different characteristics of different GPU architectures and it gets even more interesting.

To be more specific: I'm currently working on a sparse fluid simulation where the number of tiles is typically on the order of a few thousand, and worst case on the order of a few tens of thousands. I need to sort these by distance to the camera for rendering purposes, and I want to try to sort these based on spatial locality using some kind of z-ordering curve, and I'd like to have these sorts take up as little time as possible, preferably close to the cost of dispatching a compute shader altogether if possible. If someone wants to integrate what I am working on in an actual video game where there's a 1-2ms budget, then spending 100µs per sort isn't going to be quite acceptable.
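As an illustration of the z-ordering idea, here is a small Python sketch that builds a 3D Morton key from integer tile coordinates. The function name, the 10-bits-per-axis limit, and the tile tuples are assumptions made for the example, not details from the post.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order key.

    With bits=10 the key uses 30 bits, so it fits in a 32-bit sort key.
    """
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code

# Hypothetical tile coordinates; sorting by key groups spatial neighbours.
tiles = [(1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
print(sorted(tiles, key=lambda t: morton3(*t)))
```

A shader would more likely use a bit-twiddling version without the loop, but the resulting key is the same. The point is that the sort itself only ever sees one plain integer key per tile.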

My interest in GPUSorting in particular probably stops at using it as a great reference implementation and reference benchmark of OneSweep so far, where I'm applying it to a range of inputs it's not really designed for at all. I'll probably look into the SplitSort aspect as well at some point in the future, but for now I'm mostly focusing on my own implementation. I only have a limited selection of GPUs available for testing locally, so I'm also interested in more comprehensive benchmarks across different architectures (e.g. 1000 series and onwards for NV, and RDNA1 and onward for AMD).

4 replies

b0nes164 Sep 23, 2024
Maintainer Author

Hi Morten,

I can say with certainty that if your objective is the best possible speed at the ranges you specify, and you're ok with getting your hands a little dirty, you should be able to get some significant speed improvements for your application.

At sizes up to 2^12, the sort can be done locally in registers and shared memory, yielding significant speedups.
From sizes 2^13 to 2^15, sort the input in tiles of 2^12, then iteratively mergesort the results in device memory using mergepath.
At sizes 2^16 and above, OneSweep will tend to beat out the mergesort approach.
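The merge step in the middle range can be sketched on the CPU as follows. This is a minimal Python illustration of the general merge-path partitioning idea, not the maintainer's implementation; the function names and the `num_workers` parameter are hypothetical, and each "worker" stands in for a thread or thread block.

```python
import heapq

def merge_path_split(a, b, diag):
    """Return i such that a[:i] and b[:diag - i] together hold the first
    `diag` outputs of merging sorted lists a and b (ties favour a)."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        # a[mid] sorts no later than the last b element we would otherwise
        # take, so it belongs among the first `diag` outputs: take more of a.
        if a[mid] <= b[diag - 1 - mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo

def merge_path_merge(a, b, num_workers=4):
    """Merge two sorted lists by giving each worker an equal share of the output."""
    total = len(a) + len(b)
    out = []
    for w in range(num_workers):  # independent on a GPU; serial here
        d0 = total * w // num_workers
        d1 = total * (w + 1) // num_workers
        i0, i1 = merge_path_split(a, b, d0), merge_path_split(a, b, d1)
        j0, j1 = d0 - i0, d1 - i1
        # Each worker merges only its own slices, with no communication.
        out.extend(heapq.merge(a[i0:i1], b[j0:j1]))
    return out

print(merge_path_merge([1, 3, 5, 7, 9], [2, 4, 6, 8, 10, 12]))
```

The binary search is what makes the merge balance well: every worker produces the same number of outputs regardless of how the keys are distributed between the two inputs.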

If your application is in CUDA, you can push the tuning harder and increase the size of the local sorting to 2^13, because you can increase the shared memory carveout to above 32k and can force the register counts to stay in bounds with __launch_bounds__.
Increasing the max local sort size pushes the max mergesort size to 2^17. Another thing you can do with CUDA is that with forward thread progress guarantees and __threadfence() you can elide an unnecessary write to device memory during each round of merging.

So in this approach you would have 3 different algorithms that you choose based on the input size.
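A rough sketch of that size-based selection in Python is shown below. The thresholds come from the non-CUDA numbers in this reply. The handling of sizes that fall between the stated ranges (for example just above 2^12) is an assumption, and all names are hypothetical.

```python
LOCAL_SORT_MAX = 1 << 12  # up to here the sort fits in registers/shared memory
ONESWEEP_MIN = 1 << 16    # from here OneSweep tends to win

def choose_sort(n):
    """Pick one of the three strategies from the input size."""
    if n <= LOCAL_SORT_MAX:
        return "local"
    if n < ONESWEEP_MIN:
        # Assumption: everything between the two thresholds uses the
        # tiled sort followed by merge rounds.
        return "tiled_mergesort"
    return "onesweep"

def merge_rounds(n, tile=LOCAL_SORT_MAX):
    """Number of pairwise merge rounds after sorting tiles of size `tile`."""
    tiles = (n + tile - 1) // tile
    return (tiles - 1).bit_length() if tiles > 0 else 0

for n in (1 << 10, 1 << 14, 1 << 18):
    print(n, choose_sort(n))
```

For example, 2^15 keys split into 8 tiles of 2^12, which takes 3 merge rounds. The growing number of rounds is one reason the mergesort approach stops paying off as the input gets larger.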

b0nes164 Sep 23, 2024
Maintainer Author

To get a rough idea of the speedups you can gain, we can use the data from my SplitSort benchmarking, which uses the same input-size specialization technique I described above, except that it sorts n segments instead of just one.

As you can see, you can get some really nice speedups, particularly at smaller sizes when sorting locally. Keep in mind, this is sorting key value pairs, and again, not all the performance you see here may be portable outside of CUDA for the reasons I mention above.

vassvik Sep 24, 2024

I actually do have some experiments and results here that I could share, but I didn't want to bloat the top level discussion with any of it up front. I'm using compute shaders and GLSL, but the choice of API shouldn't matter much in the end.

Details inside

I initially experimented with segments in the range up to about 2048 on a large range of datasets, but with fixed segment lengths. Eventually I pivoted into actual global sorts of smaller datasets, so I didn't actually wrap up or finish anything, but the results might be interesting, especially juxtaposed against the 2080 Super results.

Caveat: These are all fixed segment lengths, not maximum/random lengths, and it's only sorting 32-bit uniformly random keys (although adding a value should improve bandwidth on the high end since there's more data to load which the heavier calculations must hide).

Here's my early result sorting 32, 64, 128, 256, 512 and 1024 segments on a desktop 4090 (1 TB/s theoretical)

I also experimented with different variants for 512 to get better bandwidth numbers:

And also 1024 segments, which I couldn't quite get sufficiently close to peak theoretical bandwidth at the time, but I think I could do it now if I revisited it:

Eventually I pivoted into getting more interested in sorting smaller sizes, when I noticed the different tradeoffs in different variants:

The last one is colored row by row. Notice how on the low count (max occupancy on the 4090 is 196608 threads), where it's overhead-dominated, the opposite side of the "variants" is proportionally faster, whereas on the high count (where it's bandwidth-limited) it's the other way around.

It's interesting to note how very different the performance characteristics of the 4090 (or any 4000 series) can be due to its architectural differences vs earlier series.

As mentioned, eventually I pivoted into doing small count sorts (up to 64K) by hand-optimizing bitonic sorts, summarized in this image (this time on a laptop 4090, a bit slower than a desktop 4090):

The way to read the graphs is that on the left hand side it stacks the cost of incrementally larger segments on a 64K sized input array of keys, and on the right hand side it splits each into columns. The "Spinlocked" variant is using a single dispatch and syncing workgroups between different phases with spinlocks and atomics, while the "Chained" variant does one dispatch per column.
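For readers unfamiliar with bitonic sort, here is a minimal CPU-side sketch of the compare-exchange network in Python. It is a textbook version, not the hand-optimized GLSL described above, and it makes no claim about how the passes map onto the columns in the graph. The function name and the pass counter are added for illustration.

```python
def bitonic_sort(keys):
    """Sort a power-of-two-length list with a bitonic network.

    Returns the sorted list and the number of data-parallel passes.
    """
    n = len(keys)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(keys)
    passes = 0
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # compare-exchange distance for this pass
            for i in range(n):  # on a GPU: one thread per element
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            passes += 1    # all threads must sync before the next pass
            j //= 2
        k *= 2
    return a, passes

print(bitonic_sort([5, 1, 4, 2, 8, 7, 3, 6]))
```

Every pass depends on the one before it, so a sort of n = 2^m keys needs m(m+1)/2 synchronized passes, which is 136 for 64K keys. Whether that synchronization happens inside one dispatch or across several is the kind of tradeoff the "Spinlocked" and "Chained" variants explore.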

In the end I did a comparison against your OneSweep (which of course is not appropriate for this anyway, since it's designed for the polar opposite use case and has high overhead) as something someone might naively try to implement instead of going through the trouble I have.

If there's any interest in this I could make a separate discussion thread and document my journey so far.

b0nes164 Sep 24, 2024
Maintainer Author

Absolutely, could you make a copy of this in a new discussion and then I'll post my feedback? I don't think/don't know how to move this thread to its own discussion.

cmhhelgeson · 2024-09-23T22:57:39Z

cmhhelgeson
Sep 23, 2024

Could a discussions section be created for the PrefixSums repository? I mostly have questions about parallel prefix sums, but I feel those questions would be better addressed in the relevant repository.

1 reply

b0nes164 Sep 23, 2024
Maintainer Author

It's up!
