Allow users to specify data alignment #2177
Ed asked:
I believe this change is "always on" from my understanding, at file opening time???
I presume you could change the alignment by closing and opening the file. It seems silly that it is a property for the file as a whole at opening time, but I understand that memory management is tricky and they might be trying to avoid a corner case. My code for patching:
from h5py._hl.files import make_fapl as _make_fapl
import h5py
from functools import wraps

@wraps(_make_fapl)
def make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, locking,
              page_buf_size, min_meta_keep, min_raw_keep, **kwds):
    fapl = _make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, locking,
                      page_buf_size, min_meta_keep, min_raw_keep, **kwds)
    fapl.set_alignment(1024 * 1024 - 4096, 4096)
    return fapl

h5py._hl.files.make_fapl = make_fapl
fapl is called at creation / opening time: Cross referencing the post on h5py: h5py/h5py#2034 |
Just writing down other notes: it seems that there is some interplay between the file space strategy too: |
cc @edwardhartnett i realize you have two accounts. |
I am going to try to set the alignment on an open file. Can it be detected by h5dump? |
I don't think the data alignment is persistent. It is part of the I've been:
and for non chunked data
|
The datasets are generated by:
|
I think you can use:
|
In light of this whole discussion (especially around the backward compatible API), there may be a slew of other options that would help those of us that are performance minded. Things like the cache size, and other access property list items. It might be a good idea to think of how to allow the users to have the most flexibility for future tweaks. |
How about this? Add a new function to the netCDF API, to be optionally called BEFORE opening a file. A new function does not harm backward API compatibility. This is for the C API, similar for Python. Tweak the name as you see fit, I don't care.
This would simply change the default behavior of nc_create and nc_open, and invoke H5Pset_alignment under the hood, at the appropriate moment before H5Fcreate or H5Fopen. Because there is no existing file handle or FAPL at the user level before the netCDF open calls, the simplest implementation would be for the alignment settings to be global in some sense, and to persist across multiple file opens unless changed through another call. My understanding is that changing object alignment like this should have no impact at all on backward format compatibility. |
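A minimal sketch of the "set once, applies to every later open" behaviour described above. It is written in Python rather than C, and every name in it (`set_alignment`, `get_alignment`, `open_file`) is hypothetical and chosen for illustration; it is not the netCDF API, and the 0/0 "never set" marker is an assumption of the sketch.

```python
# Hypothetical model of a process-wide alignment setting; not the netCDF API.
_alignment = {"threshold": 0, "alignment": 0}  # 0/0 is this sketch's "never set" marker


def set_alignment(threshold, alignment):
    """Record the values that every later create/open should pick up."""
    if threshold < 0 or alignment < 0:
        raise ValueError("threshold and alignment must be non-negative")
    _alignment["threshold"] = threshold
    _alignment["alignment"] = alignment


def get_alignment():
    """Return the most recently set (threshold, alignment) pair."""
    return _alignment["threshold"], _alignment["alignment"]


def open_file(path):
    """Stand-in for nc_create/nc_open: snapshot the settings in force right now."""
    threshold, alignment = get_alignment()
    return {"path": path, "threshold": threshold, "alignment": alignment}


first = open_file("a.nc")            # opened before any setting was made
set_alignment(15 * 4096, 4096)
second = open_file("b.nc")           # sees the first setting
set_alignment(1024 * 1024, 4096)
third = open_file("c.nc")            # sees the replacement setting
```

Each file keeps the values that were current when it was opened, so a later call changes only subsequent opens.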
That is a really good solution, thanks. And nothing is lost because the user can set and reset Looking ahead, and as has been noted, there are other properties that may need to be set |
In order to implement Greg's solution, I propose adding this function:
|
I see a lot of trouble with "global settings". The main challenge is that you want this to happen at the "application" level, but the definition of an "application" can change quite a bit. It starts to create conflicts between different libraries, each optimizing for a different "default" use case. In Python at least, where "loading" a library is not very different from "executing" the script, it may conflict with the import order of things: modules might get lazily loaded in a different order, causing conflicts between settings. Furthermore, global settings start to require an access control mechanism, locks, coordination, or similar. If you would like to start with global settings, I think that is a good way to begin. However, would it be possible to request an explicit API too? I'm happy to count it as a "future" addition, so long as it is understood that global settings are not always desirable. |
@hmaarrfk I do not understand your point about global settings. Actually we already have some global settings:
So this function would act the same way and should be named similarly. I would suggest not mentioning "hdf5" in the function name, since it's certainly conceivable that some other backend might one day allow control of alignment. (Does Zarr?) I would suggest simply: Just as with, for example nc_set_chunk_cache(), calling nc_set_alignment() will apply to all subsequent file opens. |
It is very conceivable to me, that one application may be using two netcdf files with vastly different optimization settings. One might be on a network drive, the other on a local drive. As such, global settings like this typically cause a clash. It would be nice to be able to override them, even if the initial API is "global-first". |
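One way to picture a "global-first" API that can still be overridden per file is sketched below. All names (`Alignment`, `set_default_alignment`, `resolve_alignment`) are hypothetical, and the specific numbers for the network drive are made up for illustration. The initial default of (1, 1) follows my reading of the HDF5 documentation, where it means "no alignment".

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    threshold: int  # objects at least this many bytes are aligned
    alignment: int  # ... to a multiple of this many bytes


# Process-wide default; (1, 1) means "no alignment".
_default_alignment = Alignment(1, 1)


def set_default_alignment(value: Alignment) -> None:
    """Change the default used by every later call that passes no override."""
    global _default_alignment
    _default_alignment = value


def resolve_alignment(override: Optional[Alignment] = None) -> Alignment:
    """A per-call override wins; otherwise fall back to the global default."""
    return _default_alignment if override is None else override


set_default_alignment(Alignment(15 * 4096, 4096))                # tuned for a local disk
local_file = resolve_alignment()                                 # uses the default
network_file = resolve_alignment(Alignment(1 << 20, 1 << 20))    # hypothetical network tuning
```

The override never touches the global value, so two files with different tuning can coexist in one process.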
Good point. I am going to have to reconsider your original solution, although |
I too am not too excited that you can't change the memory allocator settings in HDF5 after a file is open. It seems a little silly. However, I know they are very interested in parallel access to files in a shared underlying file system. A tough problem. I think the API as I proposed it would also mean a lot of work for me in exposing it to high-level libraries. |
In terms of exposing it to higher level libraries, the nc_set_alignment API is in my experience |
Right. This means that you aren't really making this kind of I presume once you start to understand the scope of parameters that your users want later, you can start to build an API that can override the defaults on a per function call. |
Perfect is the enemy of good enough! |
@hmaarrfk, by "global setting" I meant only that the alignment parameters would persist from one |
Here is a description of what I propose to implement.
Add support for setting the HDF5 alignment property when creating a file.
Provide get/set functions to store global data alignment information.
The API is as follows:
If defined, then for every file created (via nc_create()) The nc_get_alignment function returns the last values set by nc_set_alignment. |
Add "for every file opened", as well as every file created. (Alignment is a non-persistent access property. As such, the same alignment options should be provided for re-opened existing files, as well as for new files.) Same paragraph, I think you meant to say "... H5Pset_alignment is applied to the created FILE using ..." Suggest alternate wording, "using the MOST RECENTLY SET threshold and alignment values." This to avoid misunderstanding about the term "global". |
It is not clear to me that setting alignment on nc_open has any meaning. |
Also, I need a test case; any suggestions? |
i can provide a test case (pseudo code)
For files that are edited, existing variables will remain in place. New variables (that are larger than the threshold) will be aligned. In my test suite, I've avoided checking that a variable is not aligned when no alignment settings are defined, because there is a 1/4096 chance that it will be aligned anyway. |
@DennisHeimbigner the idea is that these two functions would be added to the dispatch table, correct? |
Just an FYI: |
No, since they do not depend on the file, they are free-standing. |
H5Pset_alignment settings operate only when allocating low-level file objects, such as data chunks, at the physical disk space level. At this level, there is no sense of physical address alignment between chunks or other file objects. There is never any interference with alignment of previous variables, or parts of variables, etc. For this type of alignment API to be complete for netCDF, it should apply to re-opened files as well as newly created files. It would be good if someone would check my analysis. https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT |
Alternatively, try a modification of the above test case from @hmaarrfk. Close and re-open the file, then write a second chunk, then close and test as per nos. 6-7 above. I predict that the first chunk will be aligned and the second chunk not aligned, unless nc_open is upgraded for alignment settings. |
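The prediction above can be illustrated with a toy model. This is only a sketch of the rule as I understand it (objects at least `threshold` bytes long are placed at a multiple of `alignment`, and objects already in the file never move); `ToyFile` is a hypothetical end-of-file allocator, not how HDF5 actually manages file space.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return -(-offset // alignment) * alignment


class ToyFile:
    """Toy end-of-file allocator; real HDF5 free-space handling is ignored here."""

    def __init__(self):
        self.eof = 0
        self.objects = []  # (offset, size) pairs, never moved once placed

    def allocate(self, size, threshold=1, alignment=1):
        """Place a new object, honouring the alignment settings of the current 'open'."""
        offset = self.eof
        if alignment > 1 and size >= threshold:
            offset = align_up(offset, alignment)
        self.objects.append((offset, size))
        self.eof = offset + size
        return offset


f = ToyFile()
f.allocate(100)                                              # small header-like object
first = f.allocate(65536, threshold=15 * 4096, alignment=4096)
f.allocate(100)                                              # more small metadata
second = f.allocate(65536)                                   # "re-opened" without alignment settings
third = f.allocate(65536, threshold=15 * 4096, alignment=4096)
```

In this model the first and third chunks land on 4096-byte boundaries, the second does not, and the first chunk stays where it was placed, which matches the argument that the settings need to be applied on open as well as on create.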
Here is a test program, but it does not seem to work as expected.
|
I also tried using H5Dget_offset on the whole dataset, I get this.
|
Use one of the H5Dget_chunk_info* functions, not H5Dread_chunk, to get the physical offset of a chunk within the file. H5Dget_offset should have worked for the whole dataset offset. I do not see the problem there. However, that might be displaying the offset of the dataset head node, rather than the first data chunk. The utility command |
For what it's worth, I had to add the following patch to my builds to make the alignment work out. You have to hit both the creation of the files and the opening of the files:
diff --git a/libhdf5/hdf5create.c b/libhdf5/hdf5create.c
index 0475c525..f5e108d9 100644
--- a/libhdf5/hdf5create.c
+++ b/libhdf5/hdf5create.c
@@ -125,6 +125,10 @@ nc4_create_file(const char *path, int cmode, size_t initialsz,
BAIL(NC_EHDFERR);
if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_WEAK))
BAIL(NC_EHDFERR);
+ // if (H5Pset_alignment(fapl_id, alignment_threshold, alignment_interval) < 0) {
+ if (H5Pset_alignment(fapl_id, 15 * 4096, 4096) < 0) {
+ BAIL(NC_EHDFERR);
+ }
#ifdef USE_PARALLEL4
/* If this is a parallel file create, set up the file creation
diff --git a/libhdf5/hdf5open.c b/libhdf5/hdf5open.c
index f3ede3ed..bd787a6a 100644
--- a/libhdf5/hdf5open.c
+++ b/libhdf5/hdf5open.c
@@ -775,6 +775,10 @@ nc4_open_file(const char *path, int mode, void* parameters, int ncid)
if (H5Pset_fclose_degree(fapl_id, H5F_CLOSE_WEAK) < 0)
BAIL(NC_EHDFERR);
+ // if (H5Pset_alignment(fapl_id, alignment_threshold, alignment_interval) < 0) {
+ if (H5Pset_alignment(fapl_id, 15 * 4096, 4096) < 0) {
+ BAIL(NC_EHDFERR);
+ }
#ifdef USE_PARALLEL4
if (!(mode & (NC_INMEMORY | NC_DISKLESS)) && mpiinfo != NULL) {
/* If this is a parallel file create, set up the file creation
edit: simplify patch to remove unnecessary "cleanup" from demo. |
where do those constants come from: 15 and 4096? |
Generally speaking, for "large" reads you want things to be aligned to something. The Linux kernel likes pages, so things that are 4096 bytes in size are "nice". I'm defining something as large when it takes 16 or more pages (an arbitrary number), so I'm defining the threshold as 15 * 4096. It is entirely possible that I'm off by one in my "threshold" computation. It is possible that it is |
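The off-by-one question can be checked with a little arithmetic. As I read the HDF5 documentation for H5Pset_alignment, an object is aligned when its size is greater than or equal to the threshold. Under that assumption (and with the hypothetical helper name `qualifies_for_alignment`), a threshold of 15 * 4096 already covers objects of exactly 15 pages, while "16 or more pages" would correspond to 16 * 4096:

```python
PAGE = 4096


def qualifies_for_alignment(size: int, threshold: int) -> bool:
    """Assumed rule: objects whose size is >= threshold get aligned."""
    return size >= threshold


threshold_used = 15 * PAGE         # value in the patch
threshold_16_pages = 16 * PAGE     # value matching "16 or more pages"

# Exactly 15 pages qualifies under the patch's threshold.
fifteen_pages_with_patch = qualifies_for_alignment(15 * PAGE, threshold_used)
# Exactly 15 pages does not qualify under a 16-page threshold.
fifteen_pages_with_16 = qualifies_for_alignment(15 * PAGE, threshold_16_pages)
# One byte short of 16 pages still does not qualify.
almost_sixteen_with_16 = qualifies_for_alignment(16 * PAGE - 1, threshold_16_pages)
```

Either choice works in practice; the difference only matters for objects between 15 and 16 pages in size.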
In linux, the page size seems to be 4096 across popular architectures |
In my demo code, I left placeholders for variable names that might be used for the global configuration. |
Certainly there should not be bare constants, so at least use defines for 15 and 4096... |
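As a sketch of what named constants could look like (shown in Python for brevity; in the C patch these would be defines or configurable values), the page size can also be queried rather than hard-coded. The names `LARGE_OBJECT_PAGES`, `ALIGNMENT_INTERVAL`, and `ALIGNMENT_THRESHOLD` are illustrative, loosely following the placeholder names in the patch.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE          # commonly 4096 on Linux, but not guaranteed
LARGE_OBJECT_PAGES = 15            # the arbitrary "large" cutoff used in the patch

ALIGNMENT_INTERVAL = PAGE_SIZE
ALIGNMENT_THRESHOLD = LARGE_OBJECT_PAGES * PAGE_SIZE

print(f"threshold={ALIGNMENT_THRESHOLD} bytes, alignment={ALIGNMENT_INTERVAL} bytes")
```

Deriving the threshold from the page size keeps the two values consistent on systems where the page size is not 4096.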
It seems that the best way would be to allow the user to define the global constants you suggested:
In my "patched" version, I didn't want to build up a parallel API to yours. This "works" for my specific application, I simply wanted to show the two locations I identified that had to be modified. These constants would be set in the proposed |
I have a PR for this, but there are a large number of outstanding PRs, so I will wait until |
I'm working through them now @DennisHeimbigner so feel free :) |
Wow! Thank you both so much for working through this. Excited to integrate this into our system. |
re: Unidata#2177
re: Unidata#2178

Provide get/set functions to store global data alignment information and apply it when a file is created. The API is as follows:
````
int nc_set_alignment(int threshold, int alignment);
int nc_get_alignment(int* thresholdp, int* alignmentp);
````
If defined, then for every file created or opened after the call to nc_set_alignment, for every new variable added to the file, the most recently set threshold and alignment values will be applied to that variable.

The nc_get_alignment function returns the last values set by nc_set_alignment. If nc_set_alignment has not been called, then it returns the value 0 for both threshold and alignment.

The alignment parameters are stored in the NCglobalstate object (see below) for use as needed. Repeated calls to nc_set_alignment will overwrite any existing values in NCglobalstate.

The alignment parameters are applied in libhdf5/hdf5create.c and libhdf5/hdf5open.c. The set/get alignment functions are defined in libsrc4/nc4internal.c.

A test program was added as nc_test4/tst_alignment.c.

## Misc. Changes Unrelated to Alignment
* The NCRCglobalstate type was renamed to NCglobalstate to indicate that it represents more general global state than just .rc data. It was also moved to nc4internal.h. This led to a large number of small changes, mostly renaming. The global state management functions were moved to nc4internal.c.
* The global chunk cache variables have been moved into NCglobalstate. As warranted, other global state will be moved as well.
* Some misc. problems with the nczarr performance tests were corrected.
Fixed by #2206 |
In the mailing list I raised the question about data being aligned.
I provided a small code example that showed how data could be misaligned (code in python with a recent version of netcdf4-python)
Code Example
We came to the point where we determined that the user should set the file access property at creation time, near:
netcdf-c/libhdf5/hdf5open.c
Line 772 in 988e771
I'm opening this as an issue to keep track of the conversation on github.
cc: @edhartnett
Mailing list link: https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2022/msg00000.html