Fix sporadic h5diff_172 test failure w/ NVHPC #4571
Labels
Component - Parallel
Parallel HDF5 (NOT thread-safety)
Component - Testing
Code in test or testpar directories, GitHub workflows
Priority - 0. Blocker ⛔
This MUST be merged for the release to happen
Type - Bug / Bugfix
Please report security issues to help@hdfgroup.org instead of creating an issue on GitHub
derobins added the Priority - 0. Blocker ⛔, Component - Parallel, Component - Testing, and Type - Bug / Bugfix labels on Jun 15, 2024
derobins added a commit to hyoklee/hdf5 that referenced this issue on Jun 15, 2024
We don't test parallel in other GitHub actions, so this converts the NVHPC check to configure and build only while we discuss how we'll test parallel HDF5 in GitHub. There is a blocking GitHub issue to address the test failures for HDF5 1.14.5 (HDFGroup#4571).
derobins pushed a commit that referenced this issue on Jun 15, 2024
We don't test parallel in other GitHub actions, so this also converts the NVHPC check to configure and build only while we discuss how we'll test parallel HDF5 in GitHub. There is a blocking GitHub issue to address the test failures for HDF5 1.14.5 (#4571).
byrnHDF pushed a commit to byrnHDF/hdf5 that referenced this issue on Jun 26, 2024
We don't test parallel in other GitHub actions, so this also converts the NVHPC check to configure and build only while we discuss how we'll test parallel HDF5 in GitHub. There is a blocking GitHub issue to address the test failures for HDF5 1.14.5 (HDFGroup#4571).
lrknox pushed a commit to lrknox/hdf5 that referenced this issue on Jul 2, 2024
We don't test parallel in other GitHub actions, so this also converts the NVHPC check to configure and build only while we discuss how we'll test parallel HDF5 in GitHub. There is a blocking GitHub issue to address the test failures for HDF5 1.14.5 (HDFGroup#4571).
lrknox added a commit that referenced this issue on Jul 3, 2024
* Fix typos in context/property documentation (#4550)
* Fix CI markdown link check http 500 errors (#4556)
  Sites like GitLab can have internal problems that return http 500 errors while they fix their problems. Some sites also return http 200 OK, which is fine. This PR adds a config file to the markdown link check so those are considered "passing" and don't break the CI.
* Simplify property copying between lists internally (#4551)
* Add Python examples (#4546)
  These examples are referred to from the replacement page of https://portal.hdfgroup.org/display/HDF5/Other+Examples.
* Correct property cb signatures in docs (#4554)
  * Correct property cb signatures in docs
  * Correct delete callback type name in docs
  * add missing word to H5P__free_prop doc
* Move C++ and Fortran and examples to HDF5Examples folder (#4552)
* Document 'return-and-read' field in API context (#4560)
* Add compression includes to tests needing zlib support (#4561)
* Allow usage of page buffering for serial file access from parallel HDF5 builds (#4568)
* Remove old version of libaec (#4567)
* Add property names to context field docs (#4563)
* Document property shared name behavior (#4565)
* Clarify H5CX macro documentation (#4569)
* Document H5Punregister modifying default properties (#4570)
* Update NVHPC to 24.5 (#4171)
  We don't test parallel in other GitHub actions, so this also converts the NVHPC check to configure and build only while we discuss how we'll test parallel HDF5 in GitHub. There is a blocking GitHub issue to address the test failures for HDF5 1.14.5 (#4571).
* Clean up comments in H5FDros3.c (#4572)
* Rename INSTALL_Auto.txt to INSTALL_Autotools.txt (#4575)
* Clean up ros3 VFD stats code (#4579)
  * Removes printf debugging
  * Simplifies and centralizes stats code
  * Use #ifdef ROS3_STATS instead of #if
  * Other misc tidying
* Turn off ros3 VFD stat collection by default (#4581)
  Not a new change - an artifact from a previous check-in.
* Pause recording errors instead of clearing the error stack (#4475)
  An internal capability that's similar to the H5E_BEGIN_TRY / H5E_END_TRY macros in H5Epublic.h, but more efficient since we can avoid pushing errors on the stack entirely (and those macros use public API routines). This capability (and other techniques) can be used to remove use of H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY within library routines. We want to remove H5E_clear_stack() because it can trigger calls to the H5I interface from within the H5E code, which creates a great deal of complexity for threadsafe code. And we want to remove H5E_BEGIN_TRY / H5E_END_TRY's because they make public API calls from within the library code. Also some other minor tidying in routines related to removing the use of H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY from H5Fint.c
* Add page buffer cache command line option to tools (#4562)
  Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Clarify documentation for H5CX_get_data_transform (#4580)
  * Correct comment for H5CX_get_data_transform
  * Document why data transform ctx field doesnt use macro
* Remove public API call from ros3 VFD (#4583)
* Remove printf debugging from H5FDs3comms.c (#4584)
* Cleanup of ros3 test (#4587)
  * Removed JS* macro scheme (replaced w/ h5test.h macros)
  * Moved curl setup/teardown to main()
  * A lot of cleanup and simplification
* Removed unused code from H5FDs3comms.c (#4588)
  * H5FD_s3comms_nlowercase()
  * H5FD_s3comms_trim()
  * H5FD_s3comms_uriencode()
* Remove magic fields from s3comms structs (#4589)
* Remove dead H5FD_s3comms_percent_encode_char() (#4591)
* Rework the TestExpress usage and refactor dead code (#4590)
* Skip examples if running sanitizers (#4592)
* Clean up s3comms test code (#4594)
  * Remove JS* macros
  * Remove dead code
  * Bring in line with other test code
* Add publish to bucket workflow (#4566)
* Update abi report CI workflow for last release (#4596)
  * Update abi report workflow to handle 1.14.4.3 release
  * Update name of java report
* Document that ctx VOL property isn't drawn from the FAPL (#4597)
* Update macos workflow to 14 (keep 13 as alternate) (#4603)
* Removed unnecessary call to H5E_clear_stack (#4607)
  H5FO_opened and H5SL_search don't push errors on the stack
* Bring subfiling VFD code closer to typical library code (#4595)
  Remove API calls, use FUNC_ENTER/LEAVE macros, use the library's error macros, rename functions to have more standardized names, etc.
* Correct documentation for return-and-read fields (#4598)
* These two generators create strings without NUL for testing (#4608)
* Fix Fortran pkconfig to indicate full path of modules (#4593)
* Updated release schedule (#4615)
  1.16 and 2.0 information
* Document VOL object wrapping context (#4611)
* Earray.c and farray.c in hdf5_1_14 still need time_t curr_time for HDsrandom.
* Remove line to use future 116_API from CMakeListat.txt files in HDF5 examples directories
This isn't fixed. We're still getting h5diff mkdir failures after #4171 was merged.
Not seeing h5diff mkdir failures anymore.
We are seeing sporadic test failures in the NVHPC CI action. The one I usually see is h5diff_172, though there may be others. These failures appear to be caused by a failing mkdir call, which appears to be a known problem with OpenMPI.
See here:
open-mpi/ompi#8510
We're currently testing with a pretty elderly version of NVHPC (23.9.0) since newer versions have problems with some long double conversions. This version of NVHPC appears to use an older version of OpenMPI (3.1.5 - see the docs: https://docs.nvidia.com/hpc-sdk/archive/23.9/hpc-sdk-release-notes/index.html). They claim that this is fixed in recent versions of OpenMPI and I don't see it on my VMs, where I build with OpenMPI 4.1.5. We don't see this in other parallel test actions since we usually only configure and build for parallel in GitHub CI.
We probably have a few options:
* Add --mca orte_tmpdir_base <dir> to OpenMPI's mpiexec options

The test failures look like this:
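For the --mca orte_tmpdir_base <dir> option mentioned above, here is a rough sketch of what it could look like in a CI step. This is not output from the failing job. The directory name, the rank count, and the `<parallel-test-program>` placeholder are assumptions for illustration only. The environment-variable form relies on OpenMPI's general convention of reading MCA parameters from `OMPI_MCA_<name>` variables, which may be easier to apply when the mpiexec command line is generated by the build system.

```shell
#!/bin/sh
# Sketch only: point OpenMPI's session directory at a known-writable location.
# TMPBASE is a hypothetical path chosen for illustration.
TMPBASE="${TMPDIR:-/tmp}/ompi-sessions"
mkdir -p "$TMPBASE"

# OpenMPI reads MCA parameters from OMPI_MCA_<name> environment variables,
# so this should have the same effect as --mca orte_tmpdir_base "$TMPBASE".
export OMPI_MCA_orte_tmpdir_base="$TMPBASE"

# Command-line form. It is printed rather than executed because the
# program name and rank count here are placeholders, not real test targets.
MPI_CMD="mpiexec --mca orte_tmpdir_base $TMPBASE -n 2 <parallel-test-program>"
echo "$MPI_CMD"
```

Whether this actually avoids the mkdir failure with the OpenMPI bundled in NVHPC is untested here; it only shows how the option would be passed.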