You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Running HOOMD-blue using several GPUs over MPI with HIP 651a91b results in application crashes. The culprit is a packed data structure in HOOMD-blue containing vector types.
The struct in question is
struct pdata_element
{
Scalar4 pos; //!< Position
Scalar4 vel; //!< Velocity
Scalar3 accel; //!< Acceleration
Scalar charge; //!< Charge
Scalar diameter; //!< Diameter
int3 image; //!< Image
unsigned int body; //!< Body id
Scalar4 orientation; //!< Orientation
Scalar4 angmom; //!< Angular momentum
Scalar3 inertia; //!< Moments of inertia
unsigned int tag; //!< global tag
Scalar4 net_force; //!< net force
Scalar4 net_torque; //!< net torque
Scalar net_virial[6]; //!< net virial
};
here ScalarX is doubleX or floatX, depending on the build configuration. For now, assume it is double.
In HOOMD, .cu files get compiled using hcc (clang internally), the rest of the code with the host compiler (can be gcc or clang). I put in two printf statements, one in hcc compiled code, and one in gcc compiled code, to report the size of pdata_element:
sizeof(pdata_element) in .cc 328
sizeof(pdata_element) in .cu 352
Obviously, they disagree, explaining the memory access fault and subsequent application crash. The underlying reason is that the vector types now use ABI incompatible underlying data types (ext_vector_type in clang, simple C array in gcc). They are incompatible because ext_vector_type does not exist in gcc, so it cannot be part of the ABI. If we used the same data type in both compilers, e.g. the simple C array or vector_size, I would expect it to work (as it does in roc 2.10.0). I know that @AlexVlx prefers ext_vector_type over vector_size (#1675, #1690), but it seems that we should use the least common denominator between the compilers to guarantee ABI compatibility. Or better even, no special vector type at all.
Alternatively, as I suggested before, compilation with a single compiler could be enforced and using gcc would result in an error. Even though it may seem like an unnecessary restriction, I believe having no compatibility would be preferable over mysterious application crashes the user has to debug.
The text was updated successfully, but these errors were encountered:
@jglaser thank you for flagging this - I believe it's actually somewhat less onerous than it appears at first sight, and it is something you pointed out but I forgot to correct i.e. alignment. Vector types have greater alignment than the array equivalent for ranks greater than 1. A PR addressing this will be forthcoming.
Running HOOMD-blue using several GPUs over MPI with HIP 651a91b results in application crashes. The culprit is a packed data structure in HOOMD-blue containing vector types.
The struct in question is
here ScalarX is doubleX or floatX, depending on the build configuration. For now, assume it is double.
In HOOMD, .cu files get compiled using hcc (clang internally), the rest of the code with the host compiler (can be gcc or clang). I put in two
printf
statements, one in hcc compiled code, and one in gcc compiled code, to report the size ofpdata_element
:Obviously, they disagree, explaining the memory access fault and subsequent application crash. The underlying reason is that the vector types now use ABI incompatible underlying data types (
ext_vector_type
in clang, simple C array in gcc). They are incompatible becauseext_vector_type
does not exist in gcc, so it cannot be part of the ABI. If we used the same data type in both compilers, e.g. the simple C array orvector_size
, I would expect it to work (as it does in roc 2.10.0). I know that @AlexVlx prefersext_vector_type
overvector_size
(#1675, #1690), but it seems that we should use the least common denominator between the compilers to guarantee ABI compatibility. Or better even, no special vector type at all.Alternatively, as I suggested before, compilation with a single compiler could be enforced and using gcc would result in an error. Even though it may seem like an unnecessary restriction, I believe having no compatibility would be preferable over mysterious application crashes the user has to debug.
The text was updated successfully, but these errors were encountered: