Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ext_vector_type GCC ABI incompatibility #1740

Closed
jglaser opened this issue Dec 16, 2019 · 1 comment · Fixed by #1789
Closed

ext_vector_type GCC ABI incompatibility #1740

jglaser opened this issue Dec 16, 2019 · 1 comment · Fixed by #1789

Comments

@jglaser
Copy link

jglaser commented Dec 16, 2019

Running HOOMD-blue using several GPUs over MPI with HIP 651a91b results in application crashes. The culprit is a packed data structure in HOOMD-blue containing vector types.

The struct in question is

struct pdata_element
    {
    Scalar4 pos;               //!< Position
    Scalar4 vel;               //!< Velocity
    Scalar3 accel;             //!< Acceleration
    Scalar charge;             //!< Charge
    Scalar diameter;           //!< Diameter
    int3 image;                //!< Image
    unsigned int body;         //!< Body id
    Scalar4 orientation;       //!< Orientation
    Scalar4 angmom;            //!< Angular momentum
    Scalar3 inertia;           //!< Moments of inertia
    unsigned int tag;          //!< global tag
    Scalar4 net_force;         //!< net force
    Scalar4 net_torque;        //!< net torque
    Scalar net_virial[6];      //!< net virial
    };

here ScalarX is doubleX or floatX, depending on the build configuration. For now, assume it is double.

In HOOMD, .cu files get compiled using hcc (clang internally), the rest of the code with the host compiler (can be gcc or clang). I put in two printf statements, one in hcc compiled code, and one in gcc compiled code, to report the size of pdata_element:

sizeof(pdata_element) in .cc 328
sizeof(pdata_element) in .cu 352

Obviously, they disagree, explaining the memory access fault and subsequent application crash. The underlying reason is that the vector types now use ABI incompatible underlying data types (ext_vector_type in clang, simple C array in gcc). They are incompatible because ext_vector_type does not exist in gcc, so it cannot be part of the ABI. If we used the same data type in both compilers, e.g. the simple C array or vector_size, I would expect it to work (as it does in roc 2.10.0). I know that @AlexVlx prefers ext_vector_type over vector_size (#1675, #1690), but it seems that we should use the least common denominator between the compilers to guarantee ABI compatibility. Or better even, no special vector type at all.

Alternatively, as I suggested before, compilation with a single compiler could be enforced and using gcc would result in an error. Even though it may seem like an unnecessary restriction, I believe having no compatibility would be preferable over mysterious application crashes the user has to debug.

@AlexVlx
Copy link
Contributor

AlexVlx commented Dec 16, 2019

@jglaser thank you for flagging this - I believe it's actually somewhat less onerous than it appears at first sight, and it is something you pointed out but I forgot to correct i.e. alignment. Vector types have greater alignment than the array equivalent for ranks greater than 1. A PR addressing this will be forthcoming.

mangupta pushed a commit that referenced this issue Feb 10, 2020
Should fix #1740 and the related internal bug.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants