
ompi/datatype: use size_t for count arguments #12351

Closed

Conversation

@wenduwan (Contributor)

The use of int for count arguments is becoming restrictive, especially for adopting large counts. This change extends the argument type to size_t.

@wenduwan self-assigned this on Feb 19, 2024
@bosilca (Member) commented on Feb 20, 2024

This change in API will break the MPI datatype management in many ways (basically everything that relies on the datatype storage description such as one-sided, IO, and all the datatype representation manipulation functions, the combiner_*). The root cause is that the internal storage format for the datatype, and all the functions to expose it to other libraries are based on int32_t.
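For context (a reminder of the standard interface these internal functions ultimately back, not code from this PR): the MPI combiner/introspection calls are specified in terms of int, and only the MPI-4 _c variants switch to MPI_Count.

```c
/* Standard MPI datatype introspection interface (int-based).  The MPI-4
 * large-count variants (MPI_Type_get_envelope_c / MPI_Type_get_contents_c)
 * use MPI_Count instead, but the classic entry points below must keep working. */
int MPI_Type_get_envelope(MPI_Datatype datatype, int *num_integers,
                          int *num_addresses, int *num_datatypes,
                          int *combiner);
int MPI_Type_get_contents(MPI_Datatype datatype, int max_integers,
                          int max_addresses, int max_datatypes,
                          int array_of_integers[],
                          MPI_Aint array_of_addresses[],
                          MPI_Datatype array_of_datatypes[]);
```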

I have the feeling that the correct solution is to split the datatype API in two: one used by MPI, where we follow MPI expectations, and an internal one, where we can use large counts but will not provide any data representation support (that support will remain at the MPI level). Based on this, we will only use the internal API inside OMPI, and the external API will be reserved for the MPI layer.
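A minimal sketch of what that split might look like (the *_internal name is hypothetical, not from this PR or from Open MPI):

```c
/* MPI-facing API: follows MPI expectations (int count) and keeps full
 * combiner/representation support. */
OMPI_DECLSPEC int32_t ompi_datatype_create_contiguous( int count, const ompi_datatype_t* oldType, ompi_datatype_t** newType );

/* Hypothetical internal API: size_t count, used only inside OMPI, with no
 * data-representation support (that stays at the MPI layer). */
int32_t ompi_datatype_create_contiguous_internal( size_t count, const ompi_datatype_t* oldType, ompi_datatype_t** newType );
```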

@wenduwan (Contributor, Author)

I also had concerns about breaking API compatibility, but none of our CI (internal and GitHub Actions) actually caught this, so that's interesting.

I was reading the code and saw that the count argument is sometimes already typed size_t, like here, so I have a hunch that it should work somehow and that it is possible to change the type safely.

I also agree with the proposed solution to introduce another, internal API; I haven't yet figured out how.

@jsquyres (Member) commented on Mar 5, 2024

@wenduwan It looks like you are still working on this. Do you want to move this PR to Draft?

@@ -150,7 +150,7 @@ ompi_datatype_is_predefined( const ompi_datatype_t* type )
 }

 static inline int32_t
-ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, int32_t count )
+ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, size_t count )
Member

This makes no sense without first modifying opal_datatype_is_contiguous_memory_layout.

@@ -188,20 +188,20 @@ ompi_datatype_add( ompi_datatype_t* pdtBase, const ompi_datatype_t* pdtAdd, size
OMPI_DECLSPEC int32_t
ompi_datatype_duplicate( const ompi_datatype_t* oldType, ompi_datatype_t** newType );

OMPI_DECLSPEC int32_t ompi_datatype_create_contiguous( int count, const ompi_datatype_t* oldType, ompi_datatype_t** newType );
Member

These are more complicated, as we would need to change the ddt_elem_desc and ddt_loop_desc structs to hold the count as a size_t. Unfortunately, that will change the size of these structs and increase the overall size of the datatype representation.
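A small self-contained illustration of that size effect (the structs below are simplified stand-ins, not the real ddt_elem_desc_t layout):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for a descriptor entry -- only meant to show how
 * widening the count field grows every entry of the description array. */
typedef struct {
    uint16_t  flags;   /* type/flags word */
    uint32_t  count;   /* 32-bit count, as today */
    ptrdiff_t disp;    /* displacement */
} elem_desc_int;

typedef struct {
    uint16_t  flags;
    size_t    count;   /* widened count (64-bit on LP64) */
    ptrdiff_t disp;
} elem_desc_szt;

int main(void)
{
    /* On a typical LP64 system the widened entry is noticeably larger,
     * and a datatype description pays that cost once per element. */
    printf("int-count entry:    %zu bytes\n", sizeof(elem_desc_int));
    printf("size_t-count entry: %zu bytes\n", sizeof(elem_desc_szt));
    return 0;
}
```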

@@ -31,13 +31,12 @@


/* We try to merge together data that are contiguous */
int32_t ompi_datatype_create_indexed( int count, const int* pBlockLength, const int* pDisp,
Member

Making the count in indexed datatypes a size_t makes very little sense, because it indicates the number of entries in the displacement and block-length arrays. In other words, the user-facing datatype representation would be several gigabytes long.

I can understand that, for symmetry with the contiguous case, one would like to have this as a size_t, but my comment above remains valid.

Contributor Author

@bosilca I need your thoughts here. I figured that MPI_Type_indexed also has the large-count variant MPI_Type_indexed_c, which I imagine will also need size_t support in ompi_datatype_create_indexed and related functions. Did I miss anything?

Same goes for *struct and *vector.

Contributor

@wenduwan Check out the work that @hppritcha and Jakob did on the *w collectives. There are now disp and count arrays that record whether the source is 32-bit or 64-bit and hand out 64-bit values consistently: https://github.com/open-mpi/ompi/blob/main/ompi/util/count_disp_array.h
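The idea there, roughly sketched (hypothetical types and accessor below; this is not the actual interface in count_disp_array.h):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a "tagged" count/displacement array: remember whether the caller
 * handed us 32-bit or 64-bit values, and always return 64-bit on read. */
typedef struct {
    int is_64bit;              /* source width flag */
    union {
        const int     *i32;    /* classic int[] input */
        const int64_t *i64;    /* large-count input */
    } data;
} count_array_sketch_t;

static inline int64_t count_array_sketch_get(const count_array_sketch_t *a, size_t idx)
{
    return a->is_64bit ? a->data.i64[idx] : (int64_t) a->data.i32[idx];
}
```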

Member

The count is the number of elements in the indexed type, and if that number cannot be represented as an int, it basically means that the datatype description (where we need 64 bytes per contiguous element to represent it) will be bigger than the represented data. No normal person should even imagine such an API.

@devreal's suggestion is mostly irrelevant here. In the collective case the user-provided buffers are guaranteed to be available during the entire collective operation, so piggybacking into the user buffer is possible. There are no such guarantees with the datatype.

Contributor Author

@bosilca Thanks. Could you elaborate on the "the datatype description ... will be bigger than the represented data" part? Are you referring to this struct? I noted that opal_datatype_count_t is already size_t:

struct dt_type_desc_t {
    opal_datatype_count_t length; /**< the maximum number of elements in the description array */
    opal_datatype_count_t used;   /**< the number of used elements in the description array */
    dt_elem_desc_t *desc;
};

Member

Each entry in the indexed/struct will be represented by a ddt_elem_desc struct, which is 32 bytes. So, if you have a number of entries that does not fit into an int, it means the data description itself will be over 64 GB, and there will be little memory left for the data itself (especially since using an indexed or struct type means you have gaps between these elements).
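Spelling out the arithmetic behind that figure (using the 32 bytes per ddt_elem_desc quoted above):

    2^31 entries × 32 bytes/entry = 2^36 bytes = 64 GiB

so a description with more than INT_MAX entries already occupies tens of gigabytes before a single byte of user data is accounted for.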

Contributor Author

@bosilca Please correct me if I'm wrong. The fundamental concern here is the datatype descriptor struct size.

The root of this issue is in ompi_datatype_add and therefore opal_datatype_add, which allocates new memory here:

pdtBase->bdt_used |= pdtAdd->bdt_used;
newLength = pdtBase->desc.used + place_needed;
if (newLength > pdtBase->desc.length) {
    newLength = ((newLength / DT_INCREASE_STACK) + 1) * DT_INCREASE_STACK;
    pdtBase->desc.desc = (dt_elem_desc_t *) realloc(pdtBase->desc.desc,
                                                    sizeof(dt_elem_desc_t) * newLength);
    pdtBase->desc.length = newLength;
}

I am likely missing important context here. Why would it be a problem for indexed/struct but not for vector, which also calls into ompi_datatype_add? 🤔

Also, supposing the user has a huge amount of memory to waste, would this concern still hold?

Member

The vector representation is compact because it is regular and repetitive: with a single ddt_elem_desc (i.e., 64 bytes) one can describe a datatype that covers the entire memory. Indexed and struct representations have a number of ddt_elem_desc entries (I'm not talking about the datatype count here), and when the number of entries is larger than an int (which is what we are talking about here), the datatype representation itself becomes comparable in size to the memory layout it covers.

Contributor Author

Ah, I see. ompi_datatype_create_vector does not call ompi_datatype_add in a loop, but indexed/struct do.

But still, without embiggening the indexed/struct count (as well as the block-length and displacement arrays), how can we support the MPI_Type_{indexed,struct}_c APIs?

Member

My comment was mostly about how stupid that API is, not about how we support it. And I'm not even talking about the performance of parsing that extremely large description to pack/unpack data.

But if you ask me how to do it, I would check that the number is below INT_MAX and return an error otherwise (not enough memory or something like that). If the number is reasonable, we keep doing what we do today.
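A minimal sketch of that guard, assuming a hypothetical large-count entry point (the function name and error handling below are illustrative, not from this PR):

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical large-count wrapper: reject counts that cannot be represented
 * as int instead of widening the internal descriptor types. */
static int create_indexed_large_count_sketch(size_t count,
                                             const size_t *block_lengths,
                                             const ptrdiff_t *disps)
{
    if (count > (size_t) INT_MAX) {
        /* The description alone would be tens of GB: treat it as an
         * out-of-resource condition and bail out. */
        return -1;  /* e.g. an OMPI "out of resource" error in real code */
    }
    /* ...otherwise fall through to the existing int-based implementation... */
    (void) block_lengths;
    (void) disps;
    return 0;
}
```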

@@ -28,13 +28,12 @@

#include "ompi/datatype/ompi_datatype.h"

int32_t ompi_datatype_create_struct( int count, const int* pBlockLength, const ptrdiff_t* pDisp,
Member

Same comment as for indexed types.

@@ -28,7 +28,7 @@

#include "ompi/datatype/ompi_datatype.h"

int32_t ompi_datatype_create_vector( int count, int bLength, int stride,
Member

This one would be acceptable I guess.

@wenduwan (Contributor, Author)

@bosilca Thanks for your comments. I'm looking into this PR again.

This patch prepares the opal datatype engine for large-count support.
Related function arguments need to accept size_t input, and accordingly
we had to modify code where those functions are called with smaller
integer types.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan force-pushed the datatype_use_size_t branch from ae65011 to c79bc2d on July 26, 2024, 14:34
 {
     opal_datatype_t *datatype = (opal_datatype_t *) OBJ_NEW(opal_datatype_t);

-    if (expectedSize == -1) {
+    if (expectedSize == (size_t) -1) {
Contributor Author

I'm very unsure about this - what is the scenario where size will be -1?
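For what it's worth, a tiny standalone check of the sentinel semantics (my own illustration, not code from the PR): once expectedSize is a size_t, comparing against -1 already compares against SIZE_MAX via the usual arithmetic conversions, so the explicit cast mainly documents the intent and silences sign-compare warnings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

int main(void)
{
    size_t expectedSize = (size_t) -1;   /* "no expected size" sentinel */

    /* -1 converted to size_t wraps to SIZE_MAX, so both forms are equivalent;
     * the cast keeps -Wsign-compare quiet and makes the intent explicit. */
    assert(expectedSize == SIZE_MAX);
    assert(expectedSize == (size_t) -1);
    return 0;
}
```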

@wenduwan (Contributor, Author)

After a fresh look at the change, it appears to have a much larger blast radius than I expected.

The original revision might have started in the wrong place: it was changing the ompi datatype APIs, but that requires adapting the internal opal datatypes as well. With that in mind, I think it is better to start with the opal datatypes instead. I have updated the PR accordingly.

Still looking at other opal functions to find out what else needs to change.

@wenduwan (Contributor, Author)

@hppritcha @bosilca I'm sorry that I won't have time to work on this. Unfortunately I have to leave this to someone else. Closing the PR.

@wenduwan closed this on Sep 10, 2024
@wenduwan deleted the datatype_use_size_t branch on September 10, 2024, 20:27