Parallel rank0 deadlock fixes #1183

Merged 3 commits into HDFGroup:develop from parallel_rank0_deadlock_fixes on Jan 22, 2022

Conversation

jhendersonHDF (Collaborator):

This PR rewrites several places where rank 0 could skip past collective MPI operations on failure, leaving the other ranks deadlocked. In a few cases after this rewrite, having rank 0 participate in, e.g., an MPI_Bcast after a failure may crash the library with a segfault or similar, but that seems to me a better alternative than hanging.
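As a minimal sketch of the pattern being fixed (plain MPI C for illustration, not HDF5 source; broadcast_entry_count and compute_entry_count are invented names):

#include <mpi.h>

/* Hypothetical failing step, standing in for the real cache operation. */
static int compute_entry_count(long *count_out)
{
    *count_out = 0;
    return -1; /* pretend the operation failed on rank 0 */
}

static int broadcast_entry_count(MPI_Comm comm, int mpi_rank, long *entry_count)
{
    int ret_value = 0;

    *entry_count = 0;

    if (mpi_rank == 0 && compute_entry_count(entry_count) < 0) {
        /* BAD: an early return or "goto done" here skips the collective
         * MPI_Bcast below, so ranks 1..N-1 block in it forever.
         */

        /* FIX: record the error but fall through, so rank 0 still
         * participates in the collective call (as the HDONE_ERROR usage
         * in this PR does).
         */
        ret_value    = -1;
        *entry_count = 0;
    }

    /* Every rank must reach this call, error or not. */
    MPI_Bcast(entry_count, 1, MPI_LONG, 0, comm);

    return ret_value;
}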

HGOTO_ERROR(H5E_CACHE, H5E_CANTFREE, FAIL, "Can't build address list for clean entries")
if (NULL == (addr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size))) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for addr buffer")
jhendersonHDF (Collaborator, Author):

This is a case where the MPI implementation may crash when MPI_Bcast is called with a NULL addr_buf_ptr. At least, this seems to be the case for OpenMPI. However, if memory allocation fails here, there isn't a whole lot else we can do, and rank 0 still needs to participate in the Bcast in the same manner as the other ranks.

Contributor:

Per comment above, I don't see much point in worrying about the Bcast if rank 0 can't create the buffer.

If you want to recover, the better solution would be to allocate the buffer before we Bcast the number of entries in it, and pretend entry_count is zero if the malloc fails. That said, if rank 0 can't allocate a smallish buffer, the computation is hosed regardless.
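A rough sketch of that ordering (again plain MPI C for illustration, not HDF5 source; the names are invented and uint64_t stands in for the real haddr_t buffer):

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* Rank 0 allocates the address buffer *before* the entry count is broadcast;
 * if the allocation fails it simply broadcasts a count of zero, so no rank
 * expects a follow-up broadcast of the buffer itself.
 */
static int bcast_clean_entry_count(MPI_Comm comm, int mpi_rank,
                                   long num_entries,     /* valid on rank 0 */
                                   uint64_t **addr_buf_out,
                                   long *entry_count_out)
{
    int  ret_value   = 0;
    long entry_count = 0;

    *addr_buf_out = NULL;

    if (mpi_rank == 0) {
        entry_count   = num_entries;
        *addr_buf_out = malloc((size_t)entry_count * sizeof(uint64_t));
        if (NULL == *addr_buf_out) {
            ret_value   = -1; /* push an error in the real code */
            entry_count = 0;  /* pretend there is nothing to send */
        }
    }

    MPI_Bcast(&entry_count, 1, MPI_LONG, 0, comm);

    *entry_count_out = entry_count;
    return ret_value;
}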

@@ -1397,9 +1397,7 @@ H5CX_set_apl(hid_t *acspl_id, const H5P_libclass_t *libclass,

/* If parallel is enabled and the file driver used is the MPI-IO
* VFD, issue an MPI barrier for easier debugging if the API function
* calling this is supposed to be called collectively. Note that this
* happens only when the environment variable H5_COLL_BARRIER is set
* to non 0.
jhendersonHDF (Collaborator, Author):

Just a minor documentation issue here I noticed while fixing these other issues: the H5_COLL_BARRIER environment variable doesn't exist.

@@ -997,6 +997,9 @@ H5C__read_cache_image(H5F_t *f, H5C_t *cache_ptr)
#endif /* H5_HAVE_PARALLEL */

/* Read the buffer (if serial access, or rank 0 of parallel access) */
/* NOTE: if this block read is being performed on rank 0 only, throwing
* an error here will cause other ranks to hang in the following MPI_Bcast.
*/
jhendersonHDF (Collaborator, Author):

For now, I'm just leaving a note here, as properly fixing the potential hang is a little tricky and would get very messy with #ifdefs.
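One possible shape of a proper fix, for reference only (not what this PR does, and not HDF5 source; read_image_block and the other names are invented), is to broadcast a status flag from rank 0 before the image data, so the other ranks can fail cleanly instead of waiting on data that will never arrive:

#include <mpi.h>

/* Hypothetical stand-in for the rank 0 block read of the cache image. */
static int read_image_block(void *buf, int len)
{
    (void)buf;
    (void)len;
    return 0;
}

static int read_and_share_image(MPI_Comm comm, int mpi_rank,
                                void *image_buf, int image_len)
{
    int read_ok = 1;

    /* Only rank 0 touches the file. */
    if (mpi_rank == 0 && read_image_block(image_buf, image_len) < 0)
        read_ok = 0;

    /* Every rank learns whether the read succeeded before any data moves. */
    MPI_Bcast(&read_ok, 1, MPI_INT, 0, comm);
    if (!read_ok)
        return -1; /* all ranks fail together; no one is left in a Bcast */

    /* Safe to broadcast the image itself now. */
    MPI_Bcast(image_buf, image_len, MPI_BYTE, 0, comm);

    return 0;
}

The cost of this approach is one extra small collective call on the success path.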

jhendersonHDF linked an issue on Nov 10, 2021 that may be closed by this pull request
jhendersonHDF (Collaborator, Author):

@lrknox please give @jrmainzer a chance to review this before merging, since I believe most of the changes occur in cache code.

/* If we fail to log the deleted entry, push an error but still
* participate in a possible sync point ahead
*/
HDONE_ERROR(H5E_CACHE, H5E_CANTUNPROTECT, FAIL, "H5AC__log_deleted_entry() failed")
Contributor:

I was of the impression that HDONE_ERROR is only used after the done: tag. Is this correct?

jhendersonHDF (Collaborator, Author):

In general, yes. However, the main upside of the macro is just that it doesn't do the typical goto. It still pushes an error message to the stack, sets the err_occurred variable, and sets the return value. The HERROR macro only pushes an error message to the stack, without setting those needed variables. The HCOMMON_ERROR macro might be a good alternative, but its header comment says it shouldn't need to be used outside H5Eprivate.h.
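Roughly, the distinction reads like this; a simplified paraphrase based on the description above, not the actual definitions in H5Eprivate.h, with push_error, err_occurred, and ret_value standing in for the library's internals:

/* SKETCH_* names are illustrative only; see H5Eprivate.h for the real ones. */

/* Push an error record onto the error stack and nothing else. */
#define SKETCH_HERROR(msg) push_error(msg)

/* Push the error, mark the failure, set the return value, but do NOT jump,
 * so the code falls through to the next statement (e.g. an MPI_Bcast).
 */
#define SKETCH_HDONE_ERROR(msg, ret) \
    do {                             \
        push_error(msg);             \
        err_occurred = 1;            \
        ret_value    = (ret);        \
    } while (0)

/* Push the error, set the return value, then jump to the done: label,
 * which is what can let rank 0 skip a collective call and hang its peers.
 */
#define SKETCH_HGOTO_ERROR(msg, ret) \
    do {                             \
        push_error(msg);             \
        err_occurred = 1;            \
        ret_value    = (ret);        \
        goto done;                   \
    } while (0)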

Contributor:

If I understand what is going on here, you are trying to flag an error but not jump to done so as to participate in the upcoming sync point.

Using HDONE_ERROR doesn't seem the right way to do this.

For starters, we have comments expressly stating that HDONE_ERROR is only to be used after the done: tag. Breaking this rule without explanation is bound to cause confusion.

I note that we already do this in H5Fint.c, in H5F_flush_phase1, H5F_flush_phase2, and H5F_dest. However, there we add the comment "/* Push error, but keep going */". At a minimum, we should have that comment here as well, and it would be good to explain why.

One could also argue that we should only be using HDONE_ERROR in this context in parallel builds. I'm not sure that this is a good idea, but it might be worth thinking about.

jhendersonHDF (Collaborator, Author), Nov 23, 2021:

In general, I tried to add a comment to the same effect in all of these places, mentioning that we push an error but keep participating in the collective operation. If any of those comments aren't clear on that, please let me know so I can clarify. I do agree that HDONE_ERROR isn't necessarily meant for this, but I believe an alternative macro would be implemented in exactly the same way, just with a different name.

Perhaps "DONE_ERROR" was not the best name, because there should be no harm in calling it outside the done: tag. It seems to me the reasoning was simply that calling HGOTO_ERROR after the done: tag would, of course, result in an infinite loop.

HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
if (H5AC__copy_candidate_list_to_buffer(cache_ptr, &chk_num_entries, &haddr_buf_ptr) < 0) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
Contributor:

Call to HDONE_ERROR before the done: tag?

Contributor:

Also, if we can't copy the candidate list to the buffer, I expect we will crash and burn regardless, so I don't see much point in participating in the Bcast.

jhendersonHDF (Collaborator, Author):

Yes, it will almost certainly crash. On the other hand, it will probably crash in a useful way from a debugging perspective. The alternative would be a hang, which seemed the worse of the two options, but I could be convinced either way.

lrknox merged commit 99d3962 into HDFGroup:develop on Jan 22, 2022
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Mar 25, 2022
* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
lrknox pushed a commit that referenced this pull request Mar 25, 2022
* Use internal version of H5Eprint2 to avoid possible stack overflow (#661)

* Add support for parallel filters to h5repack (#832)

* Allow parallel filters feature for comm size of 1 (#840)

* Avoid popping API context when one wasn't pushed (#848)

* Fix several warnings (#720)

* Don't allow H5Pset(get)_all_coll_metadata_ops for DXPLs (#1201)

* Fix free list tracking and cleanup cast alignment warnings (#1288)

* Fix free list tracking and cleanup cast alignment warnings

* Add free list tracking code to H5FL 'arr' routines

* Fix usage of several HDfprintf format specifiers after HDfprintf removal (#1324)

* Use appropriate printf format specifiers for haddr_t and hsize_t types directly (#1340)

* Fix H5ACmpio dirty bytes creation debugging (#1357)

* Fix documentation for H5D_space_status_t enum values (#1372)

* Parallel rank0 deadlock fixes (#1183)

* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>

* Fix a few issues noted by LGTM (#1421)

* Fix cache sanity checking code by moving functions to wider scope (#1435)

* Fix metadata cache bug when resizing a pinned/protected entry (v2) (#1463)

* Disable memory alloc sanity checks by default for Autotools debug builds (#1468)

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Apr 13, 2022
* Fix several places where rank 0 can skip past collective MPI operations on failure

* Committing clang-format changes

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
jhendersonHDF deleted the parallel_rank0_deadlock_fixes branch on April 24, 2022 05:22
Successfully merging this pull request may close these issues.

Deadlock in H5FD__mpio_open due to bad MPI_File_get_size