Fix some issues with collective metadata reads for chunked datasets #3716
Conversation
Force-pushed 2fd430f to aa0db1a
This PR is meant to address the issues found and reported in section 3.2 of …
Force-pushed aa0db1a to 765cec0
Add functions/callbacks for explicit control over chunk index open/close
Add functions/callbacks to check if chunk index is open or not so that it can be opened if necessary before temporarily disabling collective metadata reads in the library
Add functions/callbacks for requesting loading of additional chunk index metadata beyond the chunk index itself
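In rough terms, the additions might look like the following. This is a minimal sketch: the is_open/load_metadata names match functions visible in the diffs below, but the exact H5D_chunk_ops_t layout and signatures are assumptions, not the verbatim HDF5 source.

/* Sketch of the chunk index operations table with the new callbacks;
 * existing members are elided and exact signatures are assumptions */
typedef struct H5D_chunk_ops_t {
    /* ... existing callbacks: init, create, insert, get_addr, iterate, ... */

    /* New: explicit control over opening/closing the chunk index */
    herr_t (*open)(const H5D_chk_idx_info_t *idx_info);
    herr_t (*close)(const H5D_chk_idx_info_t *idx_info);

    /* New: query whether the index is open, so callers can open it
     * before temporarily disabling collective metadata reads */
    herr_t (*is_open)(const H5D_chk_idx_info_t *idx_info, bool *is_open);

    /* New: request loading of additional index metadata (e.g. B-tree
     * nodes) beyond the chunk index itself */
    herr_t (*load_metadata)(const H5D_chk_idx_info_t *idx_info);
} H5D_chunk_ops_t;

The collective path can then check is_open, open the index (and pre-load extra metadata) while collective metadata reads are still enabled, and only afterwards switch to independent metadata reads for the raw chunk lookups.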
 *-------------------------------------------------------------------------
 */
static herr_t
H5D__btree_idx_load_metadata(const H5D_chk_idx_info_t H5_ATTR_UNUSED *idx_info)
Should this load the root node?
For v1 btrees, it looks like the search starts from metadata that will already be in the cache as a consequence of opening the index.
So the root node is loaded collectively?
Yes, at least for v1 btrees the metadata needed should already be loaded collectively after the "index open" call.
H5D__btree_idx_open() is a no-op. Does it happen in some other function?
Ah yes, sorry; on second look it doesn't occur as part of the index open call. It seems that the v1 B-tree root node exists directly at the file address for the chunk index, so no real open occurs. This may need to be revisited, but v1 btrees seem to be less straightforward than the other index types.
I think you could implement it the same way as the other indices, with an address query of chunk 0 (though see my notes on the b-tree 2 implementation).
 *-------------------------------------------------------------------------
 */
static herr_t
H5D__bt2_idx_load_metadata(const H5D_chk_idx_info_t H5_ATTR_UNUSED *idx_info)
Same comment about loading the root node.
Hmm, yes, it looks like I should probably add a lookup here for the root node.
It might take too much time to get in for 1.14.3, since I think you'd need to add an H5B2 function, but it's something to keep in mind for later.
It looks like performing a bt2 get_addr call will have the same effect as for earray and farray: a fake chunk lookup will make sure the root node is read in collectively. We should eventually update the interfaces behind the chunk indices to give a more explicit way of controlling this, though.
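A minimal sketch of that fake chunk-0 lookup (which is essentially the same chunk-0 address query suggested for the v1 btree above). The udata field names and error macros follow common HDF5 internal conventions but are assumptions here, not the PR's exact code; the usual H5Dpkg.h/H5private.h includes are elided.

static herr_t
H5D__bt2_idx_load_metadata(const H5D_chk_idx_info_t *idx_info)
{
    hsize_t        scaled[H5O_LAYOUT_NDIMS] = {0}; /* chunk 0 has all-zero scaled offsets */
    H5D_chunk_ud_t udata;                          /* user data for the lookup */
    herr_t         ret_value = SUCCEED;

    FUNC_ENTER_PACKAGE

    /* Query the address of chunk 0. The result is discarded; the side
     * effect is that the root node (plus the internal and leaf nodes
     * on chunk 0's path) is read into the metadata cache while
     * collective metadata reads are still enabled. */
    memset(&udata, 0, sizeof(udata));
    udata.common.layout  = idx_info->layout;
    udata.common.storage = idx_info->storage;
    udata.common.scaled  = scaled;

    if (H5D__bt2_idx_get_addr(idx_info, &udata) < 0)
        HGOTO_ERROR(H5E_DATASET, H5E_CANTGET, FAIL, "can't query chunk address");

done:
    FUNC_LEAVE_NOAPI(ret_value)
} /* end H5D__bt2_idx_load_metadata() */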
This will also open some other internal nodes and the leaf node for chunk 0, which is honestly fine.
Though since there are multiple steps (each node needs to be read to get the address of the node further down), it could be wasted overhead if no process actually needs any data on that "side" of the dataset. But again, something to keep in mind for later.
Taking this further, we could, after collectively loading only the root node, collectively load every node at the next level down in a single operation, then again at the next level, and so on (up to a limit that maybe could be set depending on the size of the metadata cache and the average node size).
To prime the metadata cache in the most efficient way possible.
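As a generic sketch of that priming order (purely illustrative: node_t, load_collectively, and the budget handling are all hypothetical stand-ins; H5B2 exposes no batch-load interface today):

#include <stddef.h>
#include <string.h>

#define MAX_LEVEL_NODES 1024 /* arbitrary cap for this illustration */

typedef struct node_t {
    struct node_t **children;  /* child nodes at the next level down */
    size_t          nchildren;
    size_t          size;      /* on-disk size of this node */
} node_t;

/* Load the tree one level at a time, each level in a single collective
 * operation, until a budget (e.g. a fraction of the metadata cache
 * size) would be exceeded */
static void
prime_levels(node_t *root, size_t budget,
             void (*load_collectively)(node_t **batch, size_t n))
{
    node_t *level[MAX_LEVEL_NODES];
    size_t  nlevel = 1;
    size_t  used   = root->size;

    /* Level 0: just the root node, in one collective read */
    level[0] = root;
    load_collectively(level, 1);

    while (nlevel > 0 && used < budget) {
        node_t *next[MAX_LEVEL_NODES];
        size_t  nnext = 0;

        /* Gather every child at the next level down (silently
         * truncated at the cap for simplicity) */
        for (size_t i = 0; i < nlevel; i++)
            for (size_t j = 0; j < level[i]->nchildren && nnext < MAX_LEVEL_NODES; j++) {
                next[nnext] = level[i]->children[j];
                used += next[nnext]->size;
                nnext++;
            }

        if (nnext == 0 || used >= budget)
            break;

        /* Whole level in one collective operation */
        load_collectively(next, nnext);

        memcpy(level, next, nnext * sizeof(node_t *));
        nlevel = nnext;
    }
}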
Agreed that without explicit control over loading the root node, there may be some wasted effort in loading the data for that chunk if no one needs it. I suppose the hope is that if the user has collective metadata reads enabled, the benefit of having the root node loaded at scale will be worth it. That, and the ranks that are interested in the chunk won't end up reading it independently.
src/H5Dbtree2.c (Outdated)
@@ -854,7 +920,10 @@ H5D__bt2_idx_insert(const H5D_chk_idx_info_t *idx_info, H5D_chunk_ud_t *udata,
     assert(H5_addr_defined(udata->chunk_block.offset));

     /* Check if the v2 B-tree is open yet */
-    if (NULL == idx_info->storage->u.btree2.bt2) {
+    if (H5D__bt2_idx_is_open(idx_info, &index_open) < 0)
I'm a little nervous about potential function call overhead here (and in the other places that call this function within this file). It might get inlined, but maybe we should create a macro here to be safe?
I'd be very surprised if there was any overhead at all, but I could create a macro that the is_open function calls as well.
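Something like the following, presumably. The macro name and FUNC_ENTER/LEAVE choices are hypothetical; the NULL test mirrors the inline check being replaced in the diff above.

/* Hypothetical macro so hot paths avoid the function call */
#define H5D_BT2_IDX_IS_OPEN(idx_info) (NULL != (idx_info)->storage->u.btree2.bt2)

/* The function is kept for use through the chunk index callback table;
 * it simply wraps the macro */
static herr_t
H5D__bt2_idx_is_open(const H5D_chk_idx_info_t *idx_info, bool *is_open)
{
    FUNC_ENTER_PACKAGE_NOERR

    assert(idx_info);
    assert(is_open);

    *is_open = H5D_BT2_IDX_IS_OPEN(idx_info);

    FUNC_LEAVE_NOAPI(SUCCEED)
} /* end H5D__bt2_idx_is_open() */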
src/H5Dearray.c (Outdated)
@@ -953,7 +1016,10 @@ H5D__earray_idx_insert(const H5D_chk_idx_info_t *idx_info, H5D_chunk_ud_t *udata
     assert(udata);

     /* Check if the extensible array is open yet */
-    if (NULL == idx_info->storage->u.earray.ea) {
+    if (H5D__earray_idx_is_open(idx_info, &index_open) < 0)
Same comment about function call overhead
src/H5Dfarray.c (Outdated)
@@ -901,7 +963,10 @@ H5D__farray_idx_insert(const H5D_chk_idx_info_t *idx_info, H5D_chunk_ud_t *udata
     assert(udata);

     /* Check if the fixed array is open yet */
-    if (NULL == idx_info->storage->u.farray.fa) {
+    if (H5D__farray_idx_is_open(idx_info, &index_open) < 0)
Same comment about function call overhead
Fix comment in fixed array load_metadata callback
Other than the is_open macros this looks good to me. We should keep in mind the idea of collective pre-loading of the root node as a potential performance enhancement.
…DFGroup#3716)
Add functions/callbacks for explicit control over chunk index open/close
Add functions/callbacks to check if chunk index is open or not so that it can be opened if necessary before temporarily disabling collective metadata reads in the library
Add functions/callbacks for requesting loading of additional chunk index metadata beyond the chunk index itself
* Add missing test files to distclean target (#3734): Cleans up new files in Autotools `make distclean` in the test directory.
* Add tools/libtest to Autotools builds (#3735): This was only added to CMake many years ago and tests the tools library.
* Clean up onion VFD files in tools `make clean` (#3739): Cleans up h5dump and h5diff *.onion files in the Autotools when running `make clean`.
* Clean Java test files on Autotools (#3740): Removes generated HDF5 and text output files when running `make clean`.
* Clean the flushrefresh test dir on Autotools (#3741): The flushrefresh_test directory was not being cleaned up with `make clean` under the Autotools.
* Fix file names in tfile.c (#3743): Some tests in tfile.c use h5_fileaccess to get a VFD-dependent file name but use the scheme from testhdf5, reusing the FILE1 and FILE8 names. This leads to files like test1.h5.h5, which are unintended and not cleaned up. This changes the filename scheme for a few tests to work with h5test, resulting in more informative names and allowing the files to be cleaned up at the end of the test. The test files have also been added to the `make clean` target for the Autotools.
* Clean Autotools files in parallel tests (#3744): Adds missing files to `make clean` for parallel, including Fortran.
* Add native VOL checks to deprecated functions (#3647): Adds native VOL checks to deprecated functions, removes unneeded native VOL checks, and moves native checks to top-level calls.
* Fix buffer overflow in cache debugging code (#3691)
* Update stat arg for Apple (#3726): Uses H5_HAVE_DARWIN for the Apple ifdef, fixes a typo, removes a duplicate H5_ih_info_t, and adds the Fortran async test to CMake.
* Fix Windows cpack error in WiX package (#3747)
* Add a simple cache to the ros3 VFD (#3753): Adds a small cache of the first N bytes of a file opened with the read-only S3 (ros3) VFD, where N is 4 KiB or the size of the file, whichever is smaller. This avoids a lot of small I/O operations on file open. Addresses GitHub issue #3381.
* Update Autotools to correctly configure oneAPI (#3751): Splits the Intel config files under the Autotools into 'classic' Intel and oneAPI versions, fixing 'unsupported option' messages. Also turns off `-check uninit` (new in 2023) in Fortran, which kills the H5_buildiface program due to false positives. Enables Fortran in the oneAPI CI workflow and turns on Fortran in CMake (updating LD_LIBRARY_PATH), but goes back to disabling Fortran with Intel: for some reason there is a linking problem with Fortran ("error while loading shared libraries: libifport.so.5: cannot open shared object file: No such file or directory").
* Add h5pget_actual_selection_io_mode fortran wrapper (#3746): Adds tests for h5pget_actual_selection_io_mode_f and fixes an int_f type conversion.
* Update fortran action step (#3748)
* Added missing DLL for H5PGET_ACTUAL_SELECTION_IO_MODE_F (#3760)
* Bump the ros3 VFD cache to 16 MiB (#3759)
* Fix hangs during collective I/O with independent metadata writes (#3693)
* Fix some issues with collective metadata reads for chunked datasets (#3716): Adds functions/callbacks for explicit control over chunk index open/close; adds functions/callbacks to check if the chunk index is open so that it can be opened if necessary before temporarily disabling collective metadata reads in the library; adds functions/callbacks for requesting loading of additional chunk index metadata beyond the chunk index itself.
* Fix failure in t_select_io_dset when run with more than 10 ranks (#3758)
* Fix H5Pset_evict_on_close failing regardless of actual parallel use (#3761): Allows H5Pset_evict_on_close to be called regardless of whether a parallel build of HDF5 is being used. Fails during file opens if H5Pset_evict_on_close has been set to true on the given File Access Property List and the size of the MPI communicator being used is greater than 1.
Add functions/callbacks for explicit control over chunk index open/close
Add functions/callbacks to check if chunk index is open or not so that it can be opened if necessary before temporarily disabling collective metadata reads in the library
Add functions/callbacks for requesting loading of additional chunk index metadata beyond the chunk index itself