Allow lazy-read of netCDF-4/HDF5 files #857
I should note that the metadata read speedup that I already have …
Another issue I discovered was that if we do lazy eval, then the various integer …
Maybe an easy way to do this would be a special attribute that contains a table of metadata. So instead of reading each object in a lazy way, we read the table, and all our data structure code continues to work. Files that do not have the table of metadata cannot be read quickly, but the table could be added, and then quick reads would work on the file. We could store the metadata in various ways: a file-level attribute that lists all the dims in the file, a group-level table that lists all vars in a group, etc. Or we could package it up as one big metadata element (perhaps just store the output of ncdump -h as a char attribute).
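A minimal sketch of the write side of this idea, assuming a hypothetical reserved attribute name `_NCMetadataTable` (only `nc_put_att_text()` is real netCDF API here; nothing like this exists in the library):

```c
#include <netcdf.h>
#include <string.h>

/* Serialize the header text (e.g. the output of ncdump -h) into one
 * global char attribute, so a reader can recover the whole header in
 * a single access instead of walking every object. */
int
write_metadata_table(int ncid, const char *header_text)
{
    return nc_put_att_text(ncid, NC_GLOBAL, "_NCMetadataTable",
                           strlen(header_text), header_text);
}
```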
Not a bad idea. The problem is that if we have large amounts of metadata, then …
Even a very large ncdump -h output, stored as a char array, would read hugely faster than opening every object in HDF5 and querying each one. The whole thing would be one disk access instead of tens or hundreds of thousands, essentially O(1) instead of O(N), right? The more I think about it, the more I like it. Instead of re-writing all the libsrc4 code, it would all still work just fine.
Would it be at all possible/practical to put the existing ncgen/ncdump parsing code into the library, so that we could literally use the output of ncdump -h? We could zip it to make it smaller. Or is that just nutty?
That is just nutty :-)
Let us say we have an optional text attribute in the root group, with a protected name, which contains some compressed, easy-to-parse representation of the metadata in the file. Each time any metadata is added or altered, the attribute is updated, so it is always correct. We add a function nc_put_metadata() which does the bulk read of metadata and (re-)creates this attribute. Then why wouldn't ncdump be able to use the metadata too? Seems like that would be useful, and a lot faster.
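A sketch of the corresponding reader side, under the same assumptions (the `_NCMetadataTable` name and the fallback behavior are invented; `nc_inq_attlen()` and `nc_get_att_text()` are real netCDF calls):

```c
#include <netcdf.h>
#include <stdlib.h>

/* Return the serialized header as a nul-terminated string, or NULL
 * if the file has no table, in which case the caller falls back to
 * the slow per-object metadata walk. */
char *
read_metadata_table(int ncid)
{
    size_t len;
    char *buf;

    if (nc_inq_attlen(ncid, NC_GLOBAL, "_NCMetadataTable", &len))
        return NULL;            /* no table present */
    if (!(buf = malloc(len + 1)))
        return NULL;
    if (nc_get_att_text(ncid, NC_GLOBAL, "_NCMetadataTable", buf))
    {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```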
But this does not solve anything; instead of reading all the metadata at once …
It does remind me of the HDF-EOS thing too, and that is certainly not a compliment to the idea. So perhaps not the way to go.
Before this goes much further, we need to have …
Having worked with this code a bunch, I do see an optimization which would be pretty easy, and could result in a significant speed-up of file opens. In NC_VAR_INFO_T we add a field atts_read. When opening the file we don't read any variable atts. When the user asks for a variable att, we check atts_read; if zero, we read all atts for that variable and set it to 1. Anyone have any thoughts or opinions on this approach?
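A sketch of the shape of that check, with simplified stand-in types (the real NC_VAR_INFO_T carries far more state, and `read_var_atts_from_hdf5()` is a hypothetical helper, not a library function):

```c
#include <netcdf.h>

typedef struct var_info
{
    int atts_read;   /* 0 until this var's atts have been read */
    /* ... attribute list, type info, etc. ... */
} var_info_t;

/* Hypothetical helper: read every attribute of one variable. */
extern int read_var_atts_from_hdf5(var_info_t *var);

/* Called at the top of every attribute inquiry for this var. */
static int
ensure_atts_read(var_info_t *var)
{
    int ret;

    if (var->atts_read)
        return NC_NOERR;                    /* already in memory */
    if ((ret = read_var_atts_from_hdf5(var)))
        return ret;                         /* propagate read error */
    var->atts_read = 1;
    return NC_NOERR;
}
```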
Developers, have you considered no attribute caching as an alternative strategy to caching and lazy reads, in the case of netCDF-4? HDF5 offers efficient attribute access by name and by index number. It seems to me that a thin wrapper approach for attributes would result in less memory demand and simpler library code, with little impact on performance in most real cases. This approach may also be valid for other alternative storage formats such as cloud.
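A sketch of what that thin wrapper might look like, going straight to HDF5 for one attribute by name with no library-side cache (the H5A*/H5T* calls are real HDF5 API; error handling is trimmed for brevity):

```c
#include <hdf5.h>

/* Read one attribute of an object, converting to the native
 * in-memory type; the caller supplies a large-enough buffer. */
herr_t
read_one_att(hid_t loc_id, const char *att_name, void *buf)
{
    hid_t attid, filetype, memtype;
    herr_t err;

    if ((attid = H5Aopen(loc_id, att_name, H5P_DEFAULT)) < 0)
        return -1;
    filetype = H5Aget_type(attid);
    memtype = H5Tget_native_type(filetype, H5T_DIR_ASCEND);
    err = H5Aread(attid, memtype, buf);
    H5Tclose(memtype);
    H5Tclose(filetype);
    H5Aclose(attid);
    return err;
}
```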
An update on this ticket:
Here's some gprof output for the following code:
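Neither the snippet nor the profiles survived here; what follows is only a guess at the kind of open/close loop one would profile with gprof (the file name and iteration count are invented):

```c
#include <stdio.h>
#include <netcdf.h>

/* Repeatedly open and close the same data file so that gprof
 * attributes nearly all samples to the open/close paths. */
int
main(void)
{
    int ncid, i;

    for (i = 0; i < 100; i++)
    {
        if (nc_open("wrfout_d01.nc", NC_NOWRITE, &ncid))
            return 1;
        if (nc_close(ncid))
            return 1;
    }
    printf("done\n");
    return 0;
}
```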
The file used is from a WRF model run; it's got lots of vars and attributes, but it is a real data file which is read a gadjillion times a day, all over the world, by anyone using the WRF model. (I have cut off the call graph where times dropped below 0.1 s.) This is with current master, plus PR #1234, plus some additional minor cleanups and optimizations that will be in a future PR.
Here's the significant part of the flat profile:
I am analyzing this now to see where the time is really being spent opening and closing a file.
I am going to close, as the changes discussed here have all been merged.
At this point, this represents more an aspiration than a plan, but there has been some discussion (see PR #849) of how to enable lazy reads of netCDF-4 file metadata.
Files with a very large amount of metadata take a long time to load because netCDF reads all metadata at file open. For classic files, this doesn't seem to bother people much. But for netCDF-4/HDF5 files, it does. Perhaps this can be explained by the use of netCDF-4/HDF5 for some really complex and large datasets, which end up with tens of thousands of attributes, variables, dimensions, and/or groups. Or perhaps the classic formats, having all their metadata in a block at the beginning of the file, just load faster.
This has already cost us satellite users: the NPP uses netCDF-4, but the follow-on JPSS spacecraft switched to HDF5 without netCDF, due to the slow load times. I was told a similar story about an ESA satellite system by a very active netCDF user in the Netherlands. (Satellite L2 data files generally contain a very large number of attributes, some of which may be reasonably large arrays.)
One idea I suggested is to read each group only as needed. This would be pretty easy to implement, I think, and it would help where there are lots of groups. @DennisHeimbigner points out that this will not help with files that contain lots of vars; he indicates a known use case with a very large number of vars, all in the root group.
Well, that's another good idea all shot to hell. ;-)
In order to do lazy reads as Dennis suggests I think much of the libsrc4 code would have to be rewritten. (The good news is that with #849 soon to merge, and #856 to follow, the libsrc4 code will be a fair bit smaller than it is now.)
For example, if we open a file and read nothing, and then the user does an nc_inq(), we need to find out how many variables there are. In the current code, we count our list, because we have already read them all. In the lazy-read code, we would need to see if there's a way to get the numbers we need without reading every variable's metadata. That is probably possible in HDF5, but it is not how the code is currently written.
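One possibility (a sketch, not what the library does): HDF5 can report the number of links in a group without opening any of the member objects, which could back a lazy nc_inq() count. Note the link count includes child groups and named types, so vars would still need to be distinguished somehow.

```c
#include <hdf5.h>

/* Get the number of links in a group in one call, without reading
 * any per-object metadata. */
int
count_group_links(hid_t grpid, hsize_t *nlinks)
{
    H5G_info_t info;

    if (H5Gget_info(grpid, &info) < 0)
        return -1;
    *nlinks = info.nlinks;
    return 0;
}
```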
Handling dimensions in a lazy read is going to be particularly tricky. They may be in a different group from the variable that uses them. So if the user opens a file and does an nc_inq_var() on a var deep in the group structure, we will need code smart enough to find all its dimensions in whatever group they are in. All this information is in the HDF5 file, but the code to read it and use it properly remains to be written.
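The scope rule itself is simple: a dimension used by a var must be visible from the var's group, i.e. defined in that group or one of its ancestors. A sketch of the upward walk, with simplified stand-in types (`find_dim_in_grp()` is a hypothetical helper; the real lookup goes through NC_GRP_INFO_T lists):

```c
#include <stddef.h>

typedef struct grp_info
{
    struct grp_info *parent;   /* NULL at the root group */
    /* ... list of dims defined in this group ... */
} grp_info_t;

struct dim_info;

/* Hypothetical helper: find a dim by id within one group only. */
extern struct dim_info *find_dim_in_grp(grp_info_t *grp, int dimid);

/* Walk from the var's group toward the root until the dim is found. */
struct dim_info *
find_dim(grp_info_t *grp, int dimid)
{
    for (; grp; grp = grp->parent)
    {
        struct dim_info *dim = find_dim_in_grp(grp, dimid);
        if (dim)
            return dim;
    }
    return NULL;   /* not visible from this group */
}
```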