-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ompi/datatype: use size_t for count arguments #12351
Conversation
This change in API will break the MPI datatype management in many ways (basically everything that relies on the datatype storage description such as one-sided, IO, and all the datatype representation manipulation functions, the combiner_*). The root cause is that the internal storage format for the datatype, and all the functions to expose it to other libraries are based on int32_t. I have the feeling that the correct solution is to split the datatype API in two, one used by MPI where we follow MPI expectations, and the internal one where we can use counts but we will not provide any data representation support (the support will remain at the MPI level). Based on this, we will only use the internal API inside OMPI and the external API will be reserved for the MPI layer. |
I also had the concern of breaking API compatibility but none of our CI(internal and gh action) actually caught this so that's interesting. I was reading the code and saw that the I also agree with the proposed solution to introduce another internal API - I haven't yet figured out how. |
@wenduwan It looks like you are still working on this. Do you want to move this PR to Draft? |
ompi/datatype/ompi_datatype.h
Outdated
@@ -150,7 +150,7 @@ ompi_datatype_is_predefined( const ompi_datatype_t* type ) | |||
} | |||
|
|||
static inline int32_t | |||
ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, int32_t count ) | |||
ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, size_t count ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes no sense without modifying first opal_datatype_is_contiguous_memory_layout
.
ompi/datatype/ompi_datatype.h
Outdated
@@ -188,20 +188,20 @@ ompi_datatype_add( ompi_datatype_t* pdtBase, const ompi_datatype_t* pdtAdd, size | |||
OMPI_DECLSPEC int32_t | |||
ompi_datatype_duplicate( const ompi_datatype_t* oldType, ompi_datatype_t** newType ); | |||
|
|||
OMPI_DECLSPEC int32_t ompi_datatype_create_contiguous( int count, const ompi_datatype_t* oldType, ompi_datatype_t** newType ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are more complicated as we would need to change the ddt_elem_desc
and ddt_loop_desc
structs to have count
as int. Unfortunately, that will change the size of these structs, and increase the overall size of the datatype representation.
@@ -31,13 +31,12 @@ | |||
|
|||
|
|||
/* We try to merge together data that are contiguous */ | |||
int32_t ompi_datatype_create_indexed( int count, const int* pBlockLength, const int* pDisp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the count
in indexed datatypes a size_t makes very little sense, because it indicates the number of entries in the displacement and blocklen arrays. In other terms the user-facing datatype representation will be several gigabytes long.
I can understand that for symmetry with the contiguous one would like to have this as a size_t, but my comment above remains valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca I need your thoughts here. I figured MPI_Type_indexed
also has the large-count variant MPI_Type_indexed_c
, which I imagine will also need size_t
support in ompi_datatype_create_indexed
and related functions. Did I missing anything?
Same goes for *struct and *vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenduwan Check out the work that @hppritcha and Jakob did on the *w collectives. There are now disp and count arrays that store whether the source is 32 or 64bit and hand out 64bit consistently: https://github.com/open-mpi/ompi/blob/main/ompi/util/count_disp_array.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The count is the number of elements in the indexed type, and if that number cannot be represented as an int, it basically means that the datatype description (where we need 64 bytes per contiguous element to represent) will be bigger than the represented data. No normal person should even imagine such an API.
@devreal suggestion is mostly irrelevant here. In the collective case the user-provided buffers are guaranteed to be available during the entire collective operation, so piggybacking into the user buffer is possible. There is no such guarantees with the datatype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca Thanks. Could you elaborate more on "the datatype description .. will be bigger than the represented data." part? Are you referring to this struct? I noted that opal_datatype_count_t
is already size_t
struct dt_type_desc_t {
opal_datatype_count_t length; /**< the maximum number of elements in the description array */
opal_datatype_count_t used; /**< the number of used elements in the description array */
dt_elem_desc_t *desc;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each data in the indexed/struct will be represented by an ddt_elem_desc
struct which is 32 bytes. So, if you have a number of entries that will not fit into an int, it means what the data description itself will be over 64GB, there will be little memory left for the data itself (especially that if you use an index or struct you have gaps between these elements).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca Please correct me if I'm wrong. The fundamental concern here is the datatype descriptor struct size.
The root of this issue is in ompi_datatype_add
and therefore opal_datatype_add
, which allocates new memory here:
ompi/opal/datatype/opal_datatype_add.c
Lines 297 to 304 in ce2310a
pdtBase->bdt_used |= pdtAdd->bdt_used; | |
newLength = pdtBase->desc.used + place_needed; | |
if (newLength > pdtBase->desc.length) { | |
newLength = ((newLength / DT_INCREASE_STACK) + 1) * DT_INCREASE_STACK; | |
pdtBase->desc.desc = (dt_elem_desc_t *) realloc(pdtBase->desc.desc, | |
sizeof(dt_elem_desc_t) * newLength); | |
pdtBase->desc.length = newLength; | |
} |
I am likely missing important context here. Why would it be a problem for indexed/struct but not vector, which also calls into ompi_datatype_add
? 🤔
Also, suppose the user has a huge amount of memory to waste, would this concern still hold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vector representation is compact because its regular and repetitive, with a single ddt_elem_desc
(aka. 64 bytes) one can describe a datatype that covers the entire memory. Indexed and struct representations have a number of ddt_elem_desc
entries (I'm not talking about the datatype count here), and when the number of entries is larger than an int (which is what we are talking about here), the datatype representation itself will be comparable with the memory layout it covers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. ompi_datatype_create_vector
is not calling ompi_datatype_create
in a loop, but indexed/struct does.
But still, without embiggening the indexed/struct count(as well as block length, displ) how can we support the MPI_Type_{indexed,struct}_c
APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was mostly about how stupid that API is, not about how we support it. And I'm not even talking about the performance of parsing that extremely large description to pack/unpack data.
But if you ask me how to do it, I would check that that number is below INT_MAX, and return some error otherwise (not enough memory or something). If the number is reasonable, we keep doing as today.
@@ -28,13 +28,12 @@ | |||
|
|||
#include "ompi/datatype/ompi_datatype.h" | |||
|
|||
int32_t ompi_datatype_create_struct( int count, const int* pBlockLength, const ptrdiff_t* pDisp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for indexed types.
@@ -28,7 +28,7 @@ | |||
|
|||
#include "ompi/datatype/ompi_datatype.h" | |||
|
|||
int32_t ompi_datatype_create_vector( int count, int bLength, int stride, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one would be acceptable I guess.
@bosilca Thanks for your comments. I'm looking into this PR again. |
This patch prepares the opal datatype engine for large count support. Related function arguments need to accept size_t input, and accordingly we had to modify codes where those functions are called with smaller integer types. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
ae65011
to
c79bc2d
Compare
{ | ||
opal_datatype_t *datatype = (opal_datatype_t *) OBJ_NEW(opal_datatype_t); | ||
|
||
if (expectedSize == -1) { | ||
if (expectedSize == (size_t) -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very unsure about this - what is the scenario where size will be -1
?
After a fresh look at the change, it appears to have a much larger blast radius than I expected. The original revision might started with the wrong place - it was changing ompi datatype APIs but that requires adaptation to the internal opal datatypes. With that in mind, I think it is better to start with opal datatypes instead. Updated PR accordingly. Still looking at other opal functions to find out what else needs to change. |
@hppritcha @bosilca I'm sorry that I won't have time to work on this. Unfortunately I have to leave this to someone else. Closing the PR. |
The use of int for count arguments is becoming restrictive especially to adopt large count. This change extends the argument type to size_t.