Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Alignment requirements of cuda::std::complex #151

Closed
gonzalobg opened this issue Apr 6, 2021 · 9 comments
Closed

Alignment requirements of cuda::std::complex #151

gonzalobg opened this issue Apr 6, 2021 · 9 comments
Assignees
Labels
bug: performance Does not perform as intended. helps: quda Helps or needed by QUDA.
Milestone

Comments

@gonzalobg
Copy link
Collaborator

The following two static_asserts compile without issues:

#include <cuda/std/complex>
static_assert(alignof(cuda::std::complex<double>) == 8);
static_assert(alignof(cuda::std::complex<float>) == 4);

I'd expected them to be 16 and 8 to match the double2 and float2 types.

@dkolsen-pgi
Copy link
Collaborator

I'd expect them to be 8 and 4 to match the std::complex<double> and std::complex<float> types.

There are competing requirements here. It may very well be that the higher alignment is the better choice, but it is not unambiguously the right choice.

@gonzalobg
Copy link
Collaborator Author

What are the requirements for cuda::std::complex?

@jrhemstad
Copy link
Collaborator

jrhemstad commented Apr 6, 2021

What are the requirements for cuda::std::complex?

The requirements for anything in cuda::std are exactly equivalent (when possible) to its counterpart in std::.

Any extensions or deviations from std:: go in cuda:: as opposed to cuda::std::.

@cliffburdick
Copy link

cliffburdick commented Apr 11, 2021

To add from a separate discussion, this becomes an issue when using cuda::std::complex to access global memory within a warp. The unaligned accesses cause a 50% performance drop due to the uncoalesced access.

@brycelelbach brycelelbach added this to the 2.0.0 milestone Apr 12, 2021
@brycelelbach
Copy link
Collaborator

brycelelbach commented Apr 12, 2021

I think we should break from what the standard requires here. The only case where it hits us is if someone is casting unaligned T[2] to [cuda::]std::complex<T[2]>, or casting std::complex<T> to cuda::std::complex<T> (this case can be handled by implicit conversion instead.

There's a big perf hit for doing the wrong thing here.

My inclination is that we just make cuda::std::complex aligned to sizeof(T)*2 and add implicit conversions from std::complex when being used with NVCC.

This will require us to introduce a new ABI version, V4.

We can add a way to opt into the standard-conforming behavior.

@gonzalobg
Copy link
Collaborator Author

@brycelelbach

I think we should break from what the standard requires here. The only case where it hits us is if someone is casting unaligned T[2] to [cuda::]std::complex<T[2]>, or casting std::complex to cuda::std::complex

Can you point to the paragraph of the standard that guarantees that this works?

@griwes
Copy link
Collaborator

griwes commented Apr 12, 2021

The standard only requires reinterpreting complex as an array, not the other way around.

@wmaxey wmaxey self-assigned this Apr 13, 2021
@wmaxey wmaxey added the bug: performance Does not perform as intended. label Apr 13, 2021
@maddyscientist maddyscientist added the helps: quda Helps or needed by QUDA. label Apr 20, 2021
@leofang
Copy link
Member

leofang commented May 7, 2021

My inclination is that we just make cuda::std::complex aligned to sizeof(T)*2 and add implicit conversions from std::complex when being used with NVCC.

This is what Thrust currently does too, and it helps improve the performance.

@wmaxey
Copy link
Member

wmaxey commented Sep 10, 2021

This has been fixed in 1.6.0 with #172

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug: performance Does not perform as intended. helps: quda Helps or needed by QUDA.
Projects
None yet
Development

No branches or pull requests

9 participants