Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Improve shuffle quality #1309

Merged
merged 7 commits into from
Nov 12, 2020
Merged

Improve shuffle quality #1309

merged 7 commits into from
Nov 12, 2020

Conversation

RAMitchell
Copy link
Contributor

@RAMitchell RAMitchell commented Oct 10, 2020

This is another version of #1257, without using any proprietary hash functions.

I have included thrust::random::ranlux48 as an additional hash, as well as improved the mixing function (before I used XOR alone, which turns out to be a poor choice) to match boost::hash_combine.

I also increased the number of rounds from 8->16.

This passes all tests from #1257 and fixes #1256

@djns99 please review.

return (uint32_t)((1ull << (uint64_t)w) - 1ull);
// Equivalent to boost::hash_combine
__host__ __device__ size_t hash_combine(uint64_t lhs, uint64_t rhs) const {
lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
Copy link
Contributor

@djns99 djns99 Oct 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is intended for 32 bit numbers. Though this is probably fine since the inputs can be uint32_t, but we would get less randomness from the key so its a trade-off.
I believe boost uses a variant murmur hash 2 for uint64_t.

        inline void hash_combine_impl(boost::uint64_t& h,
                boost::uint64_t k)
        {
            const boost::uint64_t m = UINT64_C(0xc6a4a7935bd1e995);
            const int r = 47;

            k *= m;
            k ^= k >> r;
            k *= m;

            h ^= k;
            h *= m;

            // Completely arbitrary number, to prevent 0's
            // from hashing to 0.
            h += 0xe6546b64;
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok will use this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the above function causes tests to fail for some reason. I will keep my version for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to do that might be worth changing the key to uint32_t so it takes less space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave that for future work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok no worries. I am happy with it then.

thrust/system/detail/generic/shuffle.inl Outdated Show resolved Hide resolved
thrust/system/detail/generic/shuffle.inl Show resolved Hide resolved
@alliepiper alliepiper added this to the 1.11.0 milestone Oct 12, 2020
@alliepiper alliepiper added the type: enhancement New feature or request. label Oct 12, 2020
Copy link
Collaborator

@alliepiper alliepiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for redoing this!

I've requested a few minor changes (mostly style and docs) inline, once those are addressed I'll start testing.

testing/shuffle.cu Show resolved Hide resolved
testing/shuffle.cu Outdated Show resolved Hide resolved
testing/shuffle.cu Outdated Show resolved Hide resolved
testing/shuffle.cu Outdated Show resolved Hide resolved
testing/shuffle.cu Outdated Show resolved Hide resolved
@alliepiper
Copy link
Collaborator

LGTM, thanks for the quick updates!

DVS CL 29194481.

@alliepiper alliepiper added testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Oct 13, 2020
@alliepiper
Copy link
Collaborator

This is failing our internal CI. Can you take a look at this error? Looks like an easy fix.


/dvs/p4/build/sw/gpgpu/thrust/testing/unittest/assertions.h: In instantiation of 'void unittest::assert_equal(T1, T2, const string&, int) [with T1 = long unsigned int; T2 = int; std::__cxx11::string = std::__cxx11::basic_string]':
/dvs/p4/build/sw/gpgpu/thrust/testing//shuffle.cu:407:109: required from here
/dvs/p4/build/sw/gpgpu/thrust/testing/unittest/assertions.h:120:9: error: comparison of integer expressions of different signedness: 'long unsigned int' and 'int' [-Werror=sign-compare]

@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Oct 14, 2020
@alliepiper alliepiper added the testing: gpuCI in progress Started gpuCI testing. label Oct 15, 2020
@alliepiper alliepiper added testing: gpuCI passed Passed gpuCI testing. and removed testing: gpuCI in progress Started gpuCI testing. labels Oct 15, 2020
@alliepiper
Copy link
Collaborator

DVS CL: 29265465

@alliepiper alliepiper added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Oct 30, 2020
@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI passed Passed gpuCI testing. labels Nov 3, 2020
@alliepiper
Copy link
Collaborator

CI found some janky casts in the cephes code. Added some alignment hints to make sure these work out in practice.

New DVS CL: 29278660

@alliepiper alliepiper added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Nov 3, 2020
@alliepiper alliepiper added the testing: gpuCI passed Passed gpuCI testing. label Nov 5, 2020
@alliepiper
Copy link
Collaborator

More XLC fixes, new DVS CL: 29287861

@alliepiper alliepiper self-assigned this Nov 9, 2020
@alliepiper
Copy link
Collaborator

New DVS CL: 29301680

Comment on lines 129 to 141
static constexpr unsigned short A[] = {
0x6661, 0x2733, 0x9850, 0x3f4a, 0xe943, 0xb580, 0x7fbd,
0xbf43, 0x5ebb, 0x20dc, 0x019f, 0x3f4a, 0xa5a1, 0x16b0,
0xc16c, 0xbf66, 0x554b, 0x5555, 0x5555, 0x3fb5};
static constexpr unsigned short B[] = {
0x6761, 0x8ff3, 0x8901, 0xc095, 0xb93e, 0x355b, 0xf234, 0xc0e2,
0x89e5, 0xf890, 0x3d73, 0xc114, 0xdb51, 0xf994, 0xbc82, 0xc131,
0xf20b, 0x0219, 0x4589, 0xc13a, 0x055e, 0x5418, 0x0c67, 0xc12a};
static constexpr unsigned short C[] = {
/*0x0000,0x0000,0x0000,0x3ff0,*/
0x12b2, 0x1cf3, 0xfd0d, 0xc075, 0xd757, 0x7b89, 0xaa0d, 0xc0d0,
0x4c9b, 0xb974, 0xeb84, 0xc10a, 0x0043, 0x7195, 0x6286, 0xc131,
0xf34c, 0x892f, 0x5255, 0xc143, 0xe14a, 0x6a11, 0xce4b, 0xc13e};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to come up with a different implementation for this. I'm not sure why this implementation is written this way, but it's defining a buffer of 16-bit ints that are then bitcast into 64-bit floats.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried using some alignment hints to work around this, but some compilers still aren't happy, and tbh making these sorts of assumptions about alignment, endianness, and type representations makes me nervous.

@RAMitchell @djns99 please find an alternate implementation for this -- if we can get something that passes tests by Friday I can include this PR in 1.11.0, otherwise it may slip to 1.12.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RAMitchell @allisonvacanti
I am not able to test right now, but here are the tables as a double. This should resolve the issues I believe:

static constexpr double A[] = {0.000811614167470508488140545910738410384510643780,-0.000595061904284301438315674115386855191900394857,0.000793650340457716942620114419781884862459264696,-0.002777777777300996942672073330982129846233874559,0.083333333333333189929525985917280195280909538269};
static constexpr double B[] = {-1378.251525691208598800585605204105377197265625,-38801.631513463784358464181423187255859375,-331612.9927388711948879063129425048828125,-1162370.97492762305773794651031494140625,-1721737.00820839661173522472381591796875,-853555.66424576542340219020843505859375};
static constexpr double C[] = {-351.8157014365234545039129443466663360595703125,-17064.21066518811494461260735988616943359375,-220528.59055385444662533700466156005859375,-1139334.44367982516996562480926513671875,-2532523.07177582941949367523193359375,-2018891.4143353276886045932769775390625};

Comment on lines 129 to 141
static constexpr unsigned short A[] = {
0x6661, 0x2733, 0x9850, 0x3f4a, 0xe943, 0xb580, 0x7fbd,
0xbf43, 0x5ebb, 0x20dc, 0x019f, 0x3f4a, 0xa5a1, 0x16b0,
0xc16c, 0xbf66, 0x554b, 0x5555, 0x5555, 0x3fb5};
static constexpr unsigned short B[] = {
0x6761, 0x8ff3, 0x8901, 0xc095, 0xb93e, 0x355b, 0xf234, 0xc0e2,
0x89e5, 0xf890, 0x3d73, 0xc114, 0xdb51, 0xf994, 0xbc82, 0xc131,
0xf20b, 0x0219, 0x4589, 0xc13a, 0x055e, 0x5418, 0x0c67, 0xc12a};
static constexpr unsigned short C[] = {
/*0x0000,0x0000,0x0000,0x3ff0,*/
0x12b2, 0x1cf3, 0xfd0d, 0xc075, 0xd757, 0x7b89, 0xaa0d, 0xc0d0,
0x4c9b, 0xb974, 0xeb84, 0xc10a, 0x0043, 0x7195, 0x6286, 0xc131,
0xf34c, 0x892f, 0x5255, 0xc143, 0xe14a, 0x6a11, 0xce4b, 0xc13e};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried using some alignment hints to work around this, but some compilers still aren't happy, and tbh making these sorts of assumptions about alignment, endianness, and type representations makes me nervous.

@RAMitchell @djns99 please find an alternate implementation for this -- if we can get something that passes tests by Friday I can include this PR in 1.11.0, otherwise it may slip to 1.12.0.

@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI passed Passed gpuCI testing. labels Nov 10, 2020
@alliepiper alliepiper assigned RAMitchell and unassigned alliepiper Nov 10, 2020
@alliepiper
Copy link
Collaborator

Thanks for quick fix -- restarted testing under CL 29315896

@alliepiper alliepiper added testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS). and removed testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Nov 12, 2020
@alliepiper
Copy link
Collaborator

Tests are passing 👍 Merging now.

Thanks to both of you for helping me get this PR over all of the hurdles :)

@alliepiper alliepiper merged commit a61e4e1 into NVIDIA:main Nov 12, 2020
@alliepiper
Copy link
Collaborator

@RAMitchell @djns99 QA noticed that this patch drops the performance of thrust::shuffle by 60-95%. I don't think we want to revert the patch over this -- the shuffle quality improvements are worth paying a bit of perf, so I'll have them update their baselines.

Would either of you be interested in optimizing the new implementation to see if some of the performance can be recovered?

@RAMitchell
Copy link
Contributor Author

Yep, we will be doing more work on it over summer. Most importantly we now have an accurate baseline. Can I ask how the 60-95% is measured?

@alliepiper
Copy link
Collaborator

They're running the shuffle benchmarks in internal/benchmark/bench.cu and comparing the average throughput before and after this PR.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
testing: gpuCI passed Passed gpuCI testing. testing: internal ci passed Passed internal NVIDIA CI (DVS). type: enhancement New feature or request.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

thrust::shuffle gives low randomness permutations
3 participants