Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

part_persist: fix start_all() #12972

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet ggouaillardet marked this pull request as ready for review December 9, 2024 08:18
Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you mostly moved code around, but as I read through it I noticed several issues with the code that could be revised by the author.

/**
* This is a helper function that frees a request. This requires ompi_part_persist.lock be held before calling.
*/
__opal_attribute_always_inline__ static inline int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of having this function as __always_inline__ when its single use is protected by a mutex and the function itself has many calls to release objects and memory ?

} else {
request->req_ompi.req_status.MPI_SOURCE = request->req_comm->c_my_rank;
}
request->req_ompi.req_complete_cb = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct behavior for requests. Setting req_complete_cb to NULL here will prevent the ompi_request_complete from properly calling the callback associated with the request. This means that some of the collectives components that use this functionality could not use partitioned communications (I'm not saying they do, just that this breakage of consistency will be a pain to debug in the future).

size_t i;

/* prevent re-entry, */
int block_entry = opal_atomic_add_fetch_32(&(ompi_part_persist.block_entry), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 atomic operations for each failed access to progress and a lock for the successful case ? Why not using a try_lock ?

return OMPI_SUCCESS;
}

OPAL_LIST_FOREACH(current, ompi_part_persist.progress_list, mca_part_persist_list_t) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything before this FOREACH should be called only once per process. And checking for this should definitely not be part of each progress call !!!

if(false == req->first_send)
{
if(MCA_PART_PERSIST_REQUEST_PSEND == req->req_type) {
req->done_count = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done_count is set to zero in all case, can be moved out of the branches.

@hppritcha
Copy link
Member

please try rebasing on top of main to see if mpi4py error goes away.

@ggouaillardet
Copy link
Contributor Author

@bosilca: FWIW I did split this PR into two commits:

  • the first one is the bug fix
  • the second one moves code around (and get rid of some trailing whitespaces), but has no code change

@ggouaillardet
Copy link
Contributor Author

@hppritcha rebasing did get rid of the mpi4py error.

use a separate loop index for the innermost loop.

Thanks Jonas Harlacher for bringing this to our attention.

Fixes open-mpi#12969

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
move subroutines from header to source files.

Many subroutines are only used by a single source file,
so move them out of the header file. This will make it
easier for debuggers since setting a breakpoint in a header file
is not always correctly supported.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants