Parallel rank0 deadlock fixes #1183
HGOTO_ERROR(H5E_CACHE, H5E_CANTFREE, FAIL, "Can't build address list for clean entries")
if (NULL == (addr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size))) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for addr buffer")
This is a case where the MPI implementation may crash when MPI_Bcast is called with a NULL addr_buf_ptr. At least, this seems to be the case for Open MPI. However, if memory allocation fails here, there isn't much else we can do, and rank 0 still needs to participate in the MPI_Bcast in the same manner as the other ranks.
Per comment above, I don't see much point in worrying about the Bcast if rank 0 can't create the buffer.
If you want to recover, the better solution would be to allocate the buffer before we Bcast the number of entries in it, and pretend entry_count is zero if the malloc fails. That said, if rank 0 can't allocate a smallish buffer, the computation is hosed regardless.
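The reordering suggested above can be sketched as follows. This is a minimal illustration, not the HDF5 code: `mock_bcast` is a hypothetical stand-in for `MPI_Bcast` so the sketch compiles without an MPI installation, and `rank0_send_addr_list`, `alloc_fn`, and `failing_alloc` are names invented for the example. The point is only the ordering: allocate first, and if that fails, broadcast a count of zero so that no rank expects a second broadcast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MPI_Bcast: it only counts calls so the
 * ordering can be checked without an MPI runtime. */
static int g_bcast_calls     = 0;
static int g_null_buf_bcasts = 0;

static void mock_bcast(const void *buf, int count)
{
    (void)count;
    g_bcast_calls++;
    if (buf == NULL)
        g_null_buf_bcasts++; /* the case that may crash a real MPI library */
}

/* Hypothetical allocator hook so an allocation failure can be simulated */
typedef void *(*alloc_fn)(size_t);

static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}

/* Rank 0 side of the exchange. Returns 0 on success and -1 if the
 * buffer could not be allocated. The entry count is always broadcast. */
static int rank0_send_addr_list(unsigned entry_count, alloc_fn alloc)
{
    int       ret_value = 0;
    uint64_t *addr_buf  = NULL;

    assert(alloc != NULL);

    /* Allocate BEFORE telling the other ranks how many entries to expect */
    if (entry_count > 0) {
        addr_buf = alloc((size_t)entry_count * sizeof(uint64_t));
        if (addr_buf == NULL) {
            ret_value   = -1; /* remember the failure ... */
            entry_count = 0;  /* ... but pretend the list is empty */
        }
    }

    /* First collective: every rank learns the (possibly zeroed) count */
    mock_bcast(&entry_count, 1);

    /* Second collective happens only if all ranks were told to expect it */
    if (entry_count > 0) {
        for (size_t i = 0; i < entry_count; i++)
            addr_buf[i] = (uint64_t)i * 8u; /* placeholder addresses */
        mock_bcast(addr_buf, (int)entry_count);
    }

    free(addr_buf);
    return ret_value;
}
```

With this ordering, rank 0 still reports the allocation failure through its return value, and the receiving ranks never enter a broadcast that rank 0 cannot serve with a valid buffer.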
@@ -1397,9 +1397,7 @@ H5CX_set_apl(hid_t *acspl_id, const H5P_libclass_t *libclass,

/* If parallel is enabled and the file driver used is the MPI-IO
 * VFD, issue an MPI barrier for easier debugging if the API function
 * calling this is supposed to be called collectively. Note that this
 * happens only when the environment variable H5_COLL_BARRIER is set
 * to non 0.
Just a minor documentation issue I noticed while fixing these other issues: the H5_COLL_BARRIER environment variable doesn't exist.
@@ -997,6 +997,9 @@ H5C__read_cache_image(H5F_t *f, H5C_t *cache_ptr)
#endif /* H5_HAVE_PARALLEL */

/* Read the buffer (if serial access, or rank 0 of parallel access) */
/* NOTE: if this block read is being performed on rank 0 only, throwing
 * an error here will cause other ranks to hang in the following MPI_Bcast.
 */
For now, I'm just leaving a note here, as properly fixing the potential hang is a little tricky and very messy with #ifdefs.
@lrknox please give @jrmainzer a chance to review this before merging, since I believe most of the changes occur in cache code.
/* If we fail to log the deleted entry, push an error but still
 * participate in a possible sync point ahead
 */
HDONE_ERROR(H5E_CACHE, H5E_CANTUNPROTECT, FAIL, "H5AC__log_deleted_entry() failed")
I was under the impression that HDONE_ERROR is only used after the done: tag. Is this correct?
In general, yes. However, the main upside of the macro is that it doesn't do the typical goto. It still pushes an error message to the stack, sets the err_occurred variable, and sets the return value. The HERROR macro only pushes an error message to the stack, without setting those variables. The HCOMMON_ERROR macro might be a good alternative, but its header comment says that it shouldn't need to be used outside H5Eprivate.h.
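The difference described in this comment can be illustrated with a small sketch. The macros below are simplified, hypothetical stand-ins written for this example (`SKETCH_GOTO_ERROR`, `SKETCH_DONE_ERROR`); they are not the HDF5 definitions and only mirror the behaviour as described in the comment above. `sync_point` is likewise a placeholder for a collective operation such as MPI_Bcast.

```c
#include <assert.h>
#include <stdio.h>

#define SUCCEED 0
#define FAIL    (-1)

static int g_errors_pushed       = 0;
static int g_sync_points_reached = 0;

/* Placeholder for pushing a message onto an error stack */
static void push_error(const char *msg)
{
    assert(msg != NULL);
    g_errors_pushed++;
    fprintf(stderr, "error: %s\n", msg);
}

/* Placeholder for a collective operation that every rank must enter */
static void sync_point(void)
{
    g_sync_points_reached++;
}

/* Hypothetical goto-style macro: record the error, then jump to done */
#define SKETCH_GOTO_ERROR(ret, msg)                                          \
    do {                                                                     \
        push_error(msg);                                                     \
        err_occurred = 1;                                                    \
        ret_value    = (ret);                                                \
        goto done;                                                           \
    } while (0)

/* Hypothetical keep-going macro: same bookkeeping, but no jump */
#define SKETCH_DONE_ERROR(ret, msg)                                          \
    do {                                                                     \
        push_error(msg);                                                     \
        err_occurred = 1;                                                    \
        ret_value    = (ret);                                                \
    } while (0)

/* A step that may fail, followed by a sync point all ranks must reach */
static int log_then_sync(int log_fails, int use_goto)
{
    int ret_value    = SUCCEED;
    int err_occurred = 0;

    if (log_fails) {
        if (use_goto)
            SKETCH_GOTO_ERROR(FAIL, "logging failed"); /* skips sync_point */
        else
            SKETCH_DONE_ERROR(FAIL, "logging failed"); /* falls through */
    }

    sync_point();

done:
    (void)err_occurred; /* a real cleanup section might consult this flag */
    return ret_value;
}
```

Both variants return FAIL to the caller. Only the keep-going variant reaches `sync_point`, which is the property the PR relies on to keep rank 0 in step with the other ranks.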
If I understand what is going on here, you are trying to flag an error but not jump to done so as to participate in the upcoming sync point.
Using HDONE_ERROR doesn't seem the right way to do this.
For starters, we have comments expressly stating that HDONE_ERROR is only to be used after the done: flag. Breaking this rule without explanation is bound to cause confusion.
I note that we do this already in H5Fint.c in H5F_flush_phase1, H5F_flush_phase2, and H5F_dest. However, we add the comment "/* Push error, but keep going */". At a minimum, we should have this. It would be good to explain why.
One could also argue that we should only be using HDONE_ERROR in this context in parallel builds. I'm not sure that this is a good idea, but it might be worth thinking about.
In general, I tried to add a comment to all these places to the same effect, mentioning that we push an error but keep participating in the collective operation. If any of those comments aren't too clear on that, please let me know so I can clarify. I do agree that HDONE_ERROR isn't necessarily meant for this, but I believe an alternative macro would be implemented exactly the same, just with a different name.
Perhaps "DONE_ERROR" was not the best name, because there should be no harm from calling it outside the done: tag. Seems to me like the reasoning was just the fact that calling HGOTO_ERROR after the done: tag would of course result in an infinite loop.
HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
if (H5AC__copy_candidate_list_to_buffer(cache_ptr, &chk_num_entries, &haddr_buf_ptr) < 0) {
/* Push an error, but still participate in following MPI_Bcast */
HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
call to HDONE_ERROR before done: flag?
Also, if we can't copy the candidate list to buffer, I expect we will crash and burn regardless. Thus I don't see much point in participating in the Bcast.
Yes, it will almost certainly crash. On the other hand, it will probably crash in a useful way from a debugging perspective. The alternative would be a hang, which seemed the worse of the two options, but I could be convinced either way.
* Fix several places where rank 0 can skip past collective MPI operations on failure
* Committing clang-format changes
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Use internal version of H5Eprint2 to avoid possible stack overflow (#661)
* Add support for parallel filters to h5repack (#832)
* Allow parallel filters feature for comm size of 1 (#840)
* Avoid popping API context when one wasn't pushed (#848)
* Fix several warnings (#720)
* Don't allow H5Pset(get)_all_coll_metadata_ops for DXPLs (#1201)
* Fix free list tracking and cleanup cast alignment warnings (#1288)
* Fix free list tracking and cleanup cast alignment warnings
* Add free list tracking code to H5FL 'arr' routines
* Fix usage of several HDfprintf format specifiers after HDfprintf removal (#1324)
* Use appropriate printf format specifiers for haddr_t and hsize_t types directly (#1340)
* Fix H5ACmpio dirty bytes creation debugging (#1357)
* Fix documentation for H5D_space_status_t enum values (#1372)
* Parallel rank0 deadlock fixes (#1183)
* Fix several places where rank 0 can skip past collective MPI operations on failure
* Committing clang-format changes
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Fix a few issues noted by LGTM (#1421)
* Fix cache sanity checking code by moving functions to wider scope (#1435)
* Fix metadata cache bug when resizing a pinned/protected entry (v2) (#1463)
* Disable memory alloc sanity checks by default for Autotools debug builds (#1468)
* Committing clang-format changes
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
This PR rewrites several places where rank 0 could skip past collective MPI operations on failure, leading to deadlocks. After this rewrite there are a few cases where rank 0 participating in a collective operation after a failure (e.g., an MPI_Bcast) may crash the library with a segfault or similar, but this seems to me like a better alternative than hanging.