Parallel rank0 deadlock fixes #1183

Merged 3 commits into HDFGroup:develop from parallel_rank0_deadlock_fixes on Jan 22, 2022

Conversation

jhendersonHDF (Collaborator):

This PR rewrites several places where rank 0 could skip past collective MPI operations on failure, leaving the other ranks deadlocked. In a few cases after this rewrite, having rank 0 participate in, e.g., an MPI_Bcast after a failure may crash the library with a segfault or similar, but that seems to me a better alternative than hanging.
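As a minimal sketch of the pattern being fixed (plain MPI C for illustration, not HDF5 source; broadcast_entry_count and compute_entry_count are invented names):

#include <mpi.h>

/* Hypothetical failing step, standing in for the real cache operation. */
static int compute_entry_count(long *count_out)
{
    *count_out = 0;
    return -1; /* pretend the operation failed on rank 0 */
}

static int broadcast_entry_count(MPI_Comm comm, int mpi_rank, long *entry_count)
{
    int ret_value = 0;

    *entry_count = 0;

    if (mpi_rank == 0 && compute_entry_count(entry_count) < 0) {
        /* BAD: an early return or "goto done" here skips the collective
         * MPI_Bcast below, so ranks 1..N-1 block in it forever.
         */

        /* FIX: record the error but fall through, so rank 0 still
         * participates in the collective call (as the HDONE_ERROR usage
         * in this PR does).
         */
        ret_value    = -1;
        *entry_count = 0;
    }

    /* Every rank must reach this call, error or not. */
    MPI_Bcast(entry_count, 1, MPI_LONG, 0, comm);

    return ret_value;
}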

HGOTO_ERROR(H5E_CACHE, H5E_CANTFREE, FAIL, "Can't build address list for clean entries")
if (NULL == (addr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size))) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for addr buffer")
jhendersonHDF (Collaborator, Author):

This is a case where the MPI implementation may crash when MPI_Bcast is called with a NULL addr_buf_ptr. At least, this seems to be the case for OpenMPI. However, if memory allocation fails here, there isn't a whole lot else we can do, and rank 0 still needs to participate in the Bcast in the same manner as the other ranks.

Contributor:

Per comment above, I don't see much point in worrying about the Bcast if rank 0 can't create the buffer.

If you want to recover, the better solution would be to allocate the buffer before we Bcast the number of entries in it, and pretend entry_count is zero if the malloc fails. That said, if rank 0 can't allocate a smallish buffer, the computation is hosed regardless.
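A rough sketch of that ordering (again plain MPI C for illustration, not HDF5 source; the names are invented and uint64_t stands in for the real haddr_t buffer):

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* Rank 0 allocates the address buffer *before* the entry count is broadcast;
 * if the allocation fails it simply broadcasts a count of zero, so no rank
 * expects a follow-up broadcast of the buffer itself.
 */
static int bcast_clean_entry_count(MPI_Comm comm, int mpi_rank,
                                   long num_entries,     /* valid on rank 0 */
                                   uint64_t **addr_buf_out,
                                   long *entry_count_out)
{
    int  ret_value   = 0;
    long entry_count = 0;

    *addr_buf_out = NULL;

    if (mpi_rank == 0) {
        entry_count   = num_entries;
        *addr_buf_out = malloc((size_t)entry_count * sizeof(uint64_t));
        if (NULL == *addr_buf_out) {
            ret_value   = -1; /* push an error in the real code */
            entry_count = 0;  /* pretend there is nothing to send */
        }
    }

    MPI_Bcast(&entry_count, 1, MPI_LONG, 0, comm);

    *entry_count_out = entry_count;
    return ret_value;
}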

@@ -1397,9 +1397,7 @@ H5CX_set_apl(hid_t *acspl_id, const H5P_libclass_t *libclass,

/* If parallel is enabled and the file driver used is the MPI-IO
* VFD, issue an MPI barrier for easier debugging if the API function
* calling this is supposed to be called collectively. Note that this
* happens only when the environment variable H5_COLL_BARRIER is set
* to non 0.
jhendersonHDF (Collaborator, Author):

Just a minor documentation issue here I noticed while fixing these other issues: the H5_COLL_BARRIER environment variable doesn't exist.

@@ -997,6 +997,9 @@ H5C__read_cache_image(H5F_t *f, H5C_t *cache_ptr)
#endif /* H5_HAVE_PARALLEL */

/* Read the buffer (if serial access, or rank 0 of parallel access) */
/* NOTE: if this block read is being performed on rank 0 only, throwing
* an error here will cause other ranks to hang in the following MPI_Bcast.
*/
jhendersonHDF (Collaborator, Author):

For now, I'm just leaving a note here, as properly fixing the potential hang is a little tricky and would get very messy with #ifdefs.
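One possible shape of a proper fix, for reference only (not what this PR does, and not HDF5 source; read_image_block and the other names are invented), is to broadcast a status flag from rank 0 before the image data, so the other ranks can fail cleanly instead of waiting on data that will never arrive:

#include <mpi.h>

/* Hypothetical stand-in for the rank 0 block read of the cache image. */
static int read_image_block(void *buf, int len)
{
    (void)buf;
    (void)len;
    return 0;
}

static int read_and_share_image(MPI_Comm comm, int mpi_rank,
                                void *image_buf, int image_len)
{
    int read_ok = 1;

    /* Only rank 0 touches the file. */
    if (mpi_rank == 0 && read_image_block(image_buf, image_len) < 0)
        read_ok = 0;

    /* Every rank learns whether the read succeeded before any data moves. */
    MPI_Bcast(&read_ok, 1, MPI_INT, 0, comm);
    if (!read_ok)
        return -1; /* all ranks fail together; no one is left in a Bcast */

    /* Safe to broadcast the image itself now. */
    MPI_Bcast(image_buf, image_len, MPI_BYTE, 0, comm);

    return 0;
}

The cost of this approach is one extra small collective call on the success path.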

jhendersonHDF linked an issue on Nov 10, 2021 that may be closed by this pull request
jhendersonHDF (Collaborator, Author):

@lrknox please give @jrmainzer a chance to review this before merging, since I believe most of the changes occur in cache code.

/* If we fail to log the deleted entry, push an error but still
* participate in a possible sync point ahead
*/
HDONE_ERROR(H5E_CACHE, H5E_CANTUNPROTECT, FAIL, "H5AC__log_deleted_entry() failed")
Contributor:

I was of the impression that HDONE_ERROR is only used after the done: tag. Is this correct?

jhendersonHDF (Collaborator, Author):

In general, yes. However, the main upside of the macro is just that it doesn't do the typical goto. It still pushes an error message to the stack, sets the err_occurred variable, and sets the return value. The HERROR macro only pushes an error message to the stack, without setting those needed variables. The HCOMMON_ERROR macro might be a good alternative, but its header comment says it shouldn't need to be used outside H5Eprivate.h.
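Roughly, the distinction reads like this; a simplified paraphrase based on the description above, not the actual definitions in H5Eprivate.h, with push_error, err_occurred, and ret_value standing in for the library's internals:

/* SKETCH_* names are illustrative only; see H5Eprivate.h for the real ones. */

/* Push an error record onto the error stack and nothing else. */
#define SKETCH_HERROR(msg) push_error(msg)

/* Push the error, mark the failure, set the return value, but do NOT jump,
 * so the code falls through to the next statement (e.g. an MPI_Bcast).
 */
#define SKETCH_HDONE_ERROR(msg, ret) \
    do {                             \
        push_error(msg);             \
        err_occurred = 1;            \
        ret_value    = (ret);        \
    } while (0)

/* Push the error, set the return value, then jump to the done: label,
 * which is what can let rank 0 skip a collective call and hang its peers.
 */
#define SKETCH_HGOTO_ERROR(msg, ret) \
    do {                             \
        push_error(msg);             \
        err_occurred = 1;            \
        ret_value    = (ret);        \
        goto done;                   \
    } while (0)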

Contributor:

If I understand what is going on here, you are trying to flag an error but not jump to done so as to participate in the upcoming sync point.

Using HDONE_ERROR doesn't seem the right way to do this.

For starters, we have comments expressly stating that HDONE_ERROR is only to be used after the done: tag. Breaking this rule without explanation is bound to cause confusion.

I note that we already do this in H5Fint.c, in H5F_flush_phase1, H5F_flush_phase2, and H5F_dest. However, there we add the comment "/* Push error, but keep going */". At a minimum, we should have that comment here as well, and it would be good to explain why.

One could also argue that we should only be using HDONE_ERROR in this context in parallel builds. I'm not sure that this is a good idea, but it might be worth thinking about.

jhendersonHDF (Collaborator, Author), Nov 23, 2021:

In general, I tried to add a comment to the same effect in all of these places, mentioning that we push an error but keep participating in the collective operation. If any of those comments aren't clear on that, please let me know so I can clarify. I do agree that HDONE_ERROR isn't necessarily meant for this, but I believe an alternative macro would be implemented in exactly the same way, just with a different name.

Perhaps "DONE_ERROR" was not the best name, because there should be no harm in calling it outside the done: tag. It seems to me the reasoning was simply that calling HGOTO_ERROR after the done: tag would, of course, result in an infinite loop.

HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
if (H5AC__copy_candidate_list_to_buffer(cache_ptr, &chk_num_entries, &haddr_buf_ptr) < 0) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
Contributor:

Call to HDONE_ERROR before the done: tag?

Contributor:

Also, if we can't copy the candidate list to the buffer, I expect we will crash and burn regardless, so I don't see much point in participating in the Bcast.

jhendersonHDF (Collaborator, Author):

Yes, it will almost certainly crash. On the other hand, it will probably crash in a useful way from a debugging perspective. The alternative would be a hang, which seemed the worse of the two options, but I could be convinced either way.

lrknox merged commit 99d3962 into HDFGroup:develop on Jan 22, 2022
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Mar 25, 2022
* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
lrknox pushed a commit that referenced this pull request Mar 25, 2022
* Use internal version of H5Eprint2 to avoid possible stack overflow (#661)

* Add support for parallel filters to h5repack (#832)

* Allow parallel filters feature for comm size of 1 (#840)

* Avoid popping API context when one wasn't pushed (#848)

* Fix several warnings (#720)

* Don't allow H5Pset(get)_all_coll_metadata_ops for DXPLs (#1201)

* Fix free list tracking and cleanup cast alignment warnings (#1288)

* Fix free list tracking and cleanup cast alignment warnings

* Add free list tracking code to H5FL 'arr' routines

* Fix usage of several HDfprintf format specifiers after HDfprintf removal (#1324)

* Use appropriate printf format specifiers for haddr_t and hsize_t types directly (#1340)

* Fix H5ACmpio dirty bytes creation debugging (#1357)

* Fix documentation for H5D_space_status_t enum values (#1372)

* Parallel rank0 deadlock fixes (#1183)

* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>

* Fix a few issues noted by LGTM (#1421)

* Fix cache sanity checking code by moving functions to wider scope (#1435)

* Fix metadata cache bug when resizing a pinned/protected entry (v2) (#1463)

* Disable memory alloc sanity checks by default for Autotools debug builds (#1468)

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Apr 13, 2022
* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
jhendersonHDF deleted the parallel_rank0_deadlock_fixes branch on April 24, 2022 05:22
Successfully merging this pull request may close these issues.

Deadlock in H5FD__mpio_open due to bad MPI_File_get_size