
Help with slow open performance with netCDF4 Files #489

Closed
4 of 12 tasks
mathomp4 opened this issue Sep 20, 2017 · 91 comments
Assignees: DennisHeimbigner
Labels: area/netcdf4, type/performance (performance related issues)

Comments

@mathomp4

mathomp4 commented Sep 20, 2017

Environment Information

  • What platform are you using? (please provide specific distribution/version in summary)
    • Linux
    • Windows
    • OSX
    • Other
    • NA
  • 32 and/or 64 bit?
    • 32-bit
    • 64-bit
  • What build system are you using?
    • autotools (configure)
    • cmake
  • Can you provide a sample netCDF file or C code to recreate the issue?
    • Yes (please attach to this issue, thank you!)
    • No
    • Not at this time

Summary of Issue

I work on a climate model, and I/O is slowly becoming a major bottleneck. By default we try to use nc4 files as our main format (a few Fortran binaries remain, but we are trying to rid ourselves of those). We recently ran an experiment to see whether our slow performance might be due to the fact that many files we write use an unlimited time dimension, so they are unlimited when we read them back. So we fixed the time dimension: no change. We also compress, chunk, etc. the files, but still no big changes. Finally, a colleague suggested converting to netCDF Classic, and, hooboy. Muuuuuuch faster.

Eventually I went whole hog. I took our NC4 file, decompressed it, converted it to nc3, nc5, nc6, and nc7 using nccopy -k ncX MERR.nc4 MERR.ncX, and hacked together a simple program to test it out, which you can see here:

https://github.com/mathomp4/netcdfopentester/blob/master/testCNetcdf.c

All it does is open and close a file 1000 times (might be overkill). If I run this on my test files, I get the following output:

(1929)(master) $ ./testCNetcdf.exe
nc3 unlimit dim open Time =  0.820 seconds.
nc4 unlimit dim open Time = 19.870 seconds.
nc5 unlimit dim open Time =  1.020 seconds.
nc6 unlimit dim open Time =  1.130 seconds.
nc7 unlimit dim open Time = 20.950 seconds.
nc4 unlimit dim open with hdf5 Time =  0.250 seconds.
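For reference, the shape of that benchmark loop is roughly the following. This is a self-contained sketch, not the actual testCNetcdf.c: the function name time_opens is made up here, and fopen/fclose stand in for nc_open/nc_close so the sketch compiles without the netCDF library.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Open and close the same file many times and report total CPU time.
 * fopen/fclose is a stand-in for nc_open(path, NC_NOWRITE, &ncid) and
 * nc_close(ncid); the real test program calls the netCDF and HDF5 APIs. */
static double time_opens(const char *path, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        FILE *fp = fopen(path, "rb");   /* stand-in for nc_open() */
        if (fp == NULL)
            return -1.0;                /* open failed */
        fclose(fp);                     /* stand-in for nc_close() */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_opens(path, 1000) and printing the result for each file kind reproduces the pattern in the table above.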

I added what I think is the correct HDF5 code to open and close a file, but I mainly program netCDF in Fortran and Python, so even this program is a huge success for me! (Also: why it's ugly.) It seems a netCDF open is ~2 orders of magnitude slower than an HDF5 open. Is there a way to tune netCDF to open as fast as HDF5? Or do you pay for HDF5's quick opening later?

Or, assuming I might have got the HDF5 wrong, why does netcdf Classic open so much faster as well?

Note: The versions of the libraries I'm using are netCDF-C 4.4.1.1 and HDF5 1.8.19. I compiled with Intel 17.0.4 compilers (backed by GCC 5.3.0). The file is one from the MERRA-2 data that can be retrieved (some basic instructions are in the README of the repo mentioned above).

@edhartnett (Contributor)

Can you do an ncdump -s -h of the netCDF-4 file, and show the metadata?

@mathomp4 (Author)

I can. Here is the file that is the one I uncompressed (MERR.nc4 in the above script).

MERR.nc4.ncdumphs.txt

@DennisHeimbigner (Collaborator)

Since I do not have the necessary authorization, can you publicly post an actual
problematic .nc4 file somewhere temporarily?

@WardF (Member)

WardF commented Sep 20, 2017

I'm currently working on a different issue, but I can easily believe that there is an inefficiency somewhere. Using the testCNetcdf.exe program above, I will generate a usage profile as soon as I can and post the results here for discussion, to see where the overhead is occurring.

@wkliao (Contributor)

wkliao commented Sep 20, 2017

Interesting evaluation!
I can confirm the slow open timing when using files produced by the test program "nc_test", which contain a long list of variables and attributes.

% ./read1000 test.nc4
Open test.nc4 1000 times costs 18.308271 seconds
% ./read1000 test.nc5
Open test.nc5 1000 times costs 0.146202 seconds

@DennisHeimbigner (Collaborator)

Opening a file with large metadata is a known problem because, unlike HDF5,
netcdf-4 reads in all the metadata at once, rather than lazily.
However, looking at the file above (MERR.nc4.ncdumphs.txt),
it does not seem to have all that much metadata (~900 lines).
It has about 60 variables and about 800 attributes. I wonder
if our attribute handling code is the problem; without
looking, I would suspect a linear search problem.

@DennisHeimbigner (Collaborator)

After a quick look, it appears that attributes are stored in a singly linked list,
and as each attribute is read, it is stored at the end of the list => linear.
This is the case for all netcdf-4 metadata storage. Looks like it is time
for a comprehensive rebuild of the libsrc4 code to refactor this to use
at least arrays rather than lists. Hash tables would be better.
If memory serves, someone did this for the netcdf-3 code already,
so perhaps we can share the code.
In any case, this is a big change, so it will not occur until at least the next+1
release (i.e., after 4.5.0).
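A hypothetical miniature of the pattern described above (this is illustrative C, not the actual libsrc4 code; the names Att, append_att, and find_att are made up): appending each newly read attribute at the tail of a singly linked list requires walking the whole list first, so reading n attributes costs O(n^2) pointer hops, and every later lookup by name is another linear scan.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Att {
    char name[32];
    struct Att *next;
} Att;

static long append_walks;  /* counts pointer hops, to expose the O(n^2) cost */

/* Append at the tail, as described above: first walk to the end of the list. */
static Att *append_att(Att *head, const char *name)
{
    Att *att = calloc(1, sizeof(Att));
    strncpy(att->name, name, sizeof(att->name) - 1);
    if (head == NULL)
        return att;
    Att *p = head;
    while (p->next != NULL) { p = p->next; append_walks++; }
    p->next = att;
    return head;
}

/* Lookup by name is a linear scan: O(n) per attribute access. */
static Att *find_att(Att *head, const char *name)
{
    for (Att *p = head; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

Building 100 attributes this way already costs 4851 tail walks; with arrays or a hash table, both the build and the lookups drop to constant or amortized-constant cost per attribute.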

@WardF (Member)

WardF commented Sep 20, 2017

I agree. 4.5.1 would be better, 4.5.0 is overdue.

@mathomp4 (Author)

Since I do not have the necessary authorization, can you publicly post an actual
problematic .nc4 file somewhere temporarily?

Well, it sounds like I'm a little late to the party, but I've put a compressed version of the file on Dropbox.

Opening a file with a large metadata is a known problem because, unlike HDF5,
netcdf-4 reads in all the metadata at once, rather than lazily.

Ah, this makes sense. Although for a bit there, HDF5 looked like it opened really fast because I forgot to check the error status (in Fortran); it turns out it will "open" very fast if it's just erroring out. But I've confirmed it works. If only I could figure out how to cast the HDF5 file pointer to nf90... best of both worlds!

After a quick look, it appears that attributes are stored in a singly linked list.
And as each attribute is read, it is stored at the end of the list => linear.

But it sounds like @DennisHeimbigner might have figured it out. When I get into work tomorrow, I might construct a series of files with varying variables. Luckily I can create a stupid-long file with our climate model and then just remove variables with NCO/CDO. I'd like to know if I can reproduce a linear trend.

@mathomp4 (Author)

I'd like to know if I can reproduce a linear trend.

Wrote a quick and dirty tester and...yup:
netcdf open test

The R2 of the fit didn't export with the image, but it was 0.9871. (https://plot.ly/~Fortran/333)

@DennisHeimbigner (Collaborator)

Thanks for the verification.

@Dave-Allured (Contributor)

Dave-Allured commented Sep 22, 2017

The virtual attribute _IsNetcdf4 was added in 4.4.1. I wondered if this was the performance problem, because it requires scanning all the top-level metadata in the file.

I compared Netcdf-C 4.4.0/HDF5 1.8.16 to current versions, 4.4.1.1 with 1.10.1. These timing results were actually run interleaved between versions, to compensate for possible network biases. I grouped by version and copy/pasted into this summary.

Netcdf version:
4.4.0 of Jan 15 2016 14:13:42 $
nc4 unlimit dim open Time = 33.390 seconds.
nc4 unlimit dim open Time = 39.030 seconds.
nc4 unlimit dim open Time = 36.940 seconds.
nc4 unlimit dim open Time = 42.890 seconds.
nc4 unlimit dim open Time = 41.310 seconds.
nc4 unlimit dim open Time = 43.640 seconds.
nc4 unlimit dim open Time = 44.250 seconds.
nc4 unlimit dim open with hdf5 Time =  0.550 seconds. (1.8.16, typical)

Netcdf version:
4.4.1.1 of May 22 2017 10:44:13 $
nc4 unlimit dim open Time = 51.660 seconds.
nc4 unlimit dim open Time = 55.440 seconds.
nc4 unlimit dim open Time = 54.300 seconds.
nc4 unlimit dim open Time = 56.420 seconds.
nc4 unlimit dim open Time = 51.510 seconds.
nc4 unlimit dim open with hdf5 Time =  0.530 seconds. (1.10.1, typical)

This proves that _IsNetcdf4 is not the primary culprit for slow opens, because the attribute was completely absent from netcdf-c 4.4.0. The main impact is presumably just as Dennis described. However, there does seem to be a smaller time increase between versions here; that might be _IsNetcdf4 or something else.

@DennisHeimbigner (Collaborator)

If memory serves, _IsNetcdf4 is computed only when asked for,
so it should not be accessed as part of the above tests.

@Dave-Allured (Contributor)

That makes sense. Not knowing the internals, I thought it was possible that nc_open might be pre-staging _IsNetcdf4 for some reason.

@Dave-Allured (Contributor)

@mathomp4, your experiment focused on only nc_open. How much of your model I/O time is consumed in nc_open vs. other file I/O? There are some known serious inefficiencies when the netcdf-4 chunk layout conflicts with read and write access patterns, or when chunk caches are too small.

My general experience with netcdf-4 is that performance rivals netcdf-3 when the file structure is reasonably matched with the I/O access pattern. Also in my current model project, 20 to 200 milliseconds is pretty reasonable for a file open that occurs only three times in a 40-minute compute cycle. Your model's file open pattern might be much different, of course.

@mathomp4 (Author)

@Dave-Allured Well, this was sort of an odd test. We were looking at I/O and noticed this, so we sort of focused on the open. Now, in most cases, the open probably isn't as important as the read, so we are looking at that too. But we were wondering about open/read speeds when NC4 files were unlimited time v. fixed time, compressed v. uncompressed, etc. But the open was dominating!

We've also encountered what we think is a bug in Intel MPI when reading compressed NC4 files in our model. It was very weird. We found if we uncompressed, the code worked. Now that we have solved the open mystery, we'll probably look at the read more carefully. Chunking, compression, etc.

But all that said, in one way of running our model we have to open approximately 500 files at start up and ~100 every hour. In that version which was run at very high resolution, I/O (mainly I in this case) was ~25% of the entire model run. Thus, looking at I/O strategies now.

@Dave-Allured (Contributor)

@mathomp4, all things considered, it seems that NC3 format would be a good fit to improve I/O for this particular model. File opens will be much faster as demonstrated here. I think data reads and writes will be similar to NC4, possibly faster for some access patterns. The main thing you give up with NC3 is NC4 data compression. It seems like speed is your main issue, not storage space.

Watch out for NC3 file structure issues which could bog down data writes. You should pre-define all variables and attributes in a new NC3 file before writing any large variables, or else utilize the NC3 padding capability. Otherwise, adding or expanding any variable or attribute can trigger an internal rewrite of the whole file if you are not careful. NC4 is thankfully free of this constraint because of dynamic storage management. Please excuse me if I am explaining stuff you already know.

If you start now with netcdf-3 format, you should be able to easily upgrade to netcdf-4 if nc_open is improved in a future release.

@DennisHeimbigner (Collaborator)

DennisHeimbigner commented Feb 22, 2018

Update: I have been experimenting with the netcdf4 code in
an attempt to improve the performance of nc_open. I thought the
issue was that when the metadata was very large -- multi-megabytes --
the linear searches used by the open() code were causing the slow open speeds.
After running some experiments with highly modified code, I am beginning to believe
that the open cost is dominated by the cost of the HDF5 library
reading all the metadata. Currently, we do not/cannot make use of HDF5's
lazy metadata reading because we always read in all of the metadata.

So, what can be done? Converting netcdf to fully use lazy metadata read is known
to be a large undertaking. However, providing lazy read for attributes seems
possible, because it appears that most often attributes are accessed by name.
Does anyone have any insight on this?

@DennisHeimbigner DennisHeimbigner self-assigned this Feb 22, 2018
@DennisHeimbigner DennisHeimbigner added area/netcdf4 type/performance performance related issues labels Feb 22, 2018
@edhartnett (Contributor)

I have (mostly) split the HDF5 code from the libsrc4 data model code. I now see how lazy reads can work for (almost?) all HDF5 objects. I don't think it will be as hard as I previously estimated.

I deduce from @WardF's recent activity that my HDF4 dispatch PR is soon to be merged, so I can present the HDF5/libsrc4 split after that and we can discuss it.

After that, it will be an easy step to add lazy reads. Also easy would be to use an array for atts, as we now do for vars, instead of linked lists.

@DennisHeimbigner (Collaborator)

After some experimentation, my current speculation is that the only thing
that needs lazy reading is attributes, and since attributes are mostly read by name,
that should be pretty doable.

@edhartnett (Contributor)

Really? If I have a file with 10,000 atts it's slow, but with 10,000 vars it's not? That seems surprising.

@DennisHeimbigner (Collaborator)

I am basing my comment on the example that others gave me.

@DennisHeimbigner (Collaborator)

Also, that is based on a limited sample; thousands of variables may not be common.
Our problem is that we do not have a good statistical sample of files.

@edhartnett (Contributor)

OK, but I do agree with you that lazy read of atts would be a good thing.

@gsjaardema (Contributor)

FYI -- I do have many files that have thousands of variables or more for complicated models.

@edhartnett (Contributor)

What would be great would be to send me the ncdump -h -s output for some of the very worst offenders (or attach it here). Pick your three or four slowest-opening files, and give me their metadata.

Then I can create some samples and have some real examples to work with.

@DennisHeimbigner (Collaborator)

I created some programs to:

  1. create a file with large amounts of metadata (parameterized)
  2. time the open for a file.

Both are attached: bigmeta.zip. We have one other file, but it is apparently proprietary.

@DennisHeimbigner (Collaborator)

DennisHeimbigner commented Jul 30, 2018

The problem is that the hashmap keeps a copy of the key string. I have considered
several alternatives.

  1. Reuse the key string in the NC_OBJECT hdr. I actually did this, but
    in my measurements it did not seem to make any significant difference.
    That could have been because I did not test on sufficient numbers of hashmap
    entries. There is an additional cost in complicating the rename code.
  2. Use a hash function that has an infinitesimal chance of collisions.
    The problem is that if a collision occurs, then we have no chance to detect
    it and that would produce errors that would be very hard to debug.
  3. Add a byte vector as part of the hash table to store the copied key strings.
    This can be reclaimed with a single free(). This complicates renames.
  4. Switch from hashmap to a sorted table and use binary search. Cost
    is moving from O(1) to O(log n). If none of the tables gets "too" large,
    then the search cost difference is probably not an issue. Advantages are that
    rename becomes somewhat less complicated and we avoid computing hash keys
    and rehash costs.

I am open to other ideas.
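Alternative 4 above can be sketched with the standard library's qsort/bsearch. This is an illustration only, not proposed netcdf-c code; the Entry type and the lookup/cmp_entry names are made up. Lookups cost O(log n) string comparisons, but there are no hash keys to compute, no rehashing, and no copied key strings to manage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One row of a sorted name table: a name mapped to, e.g., a varid. */
typedef struct {
    const char *name;
    int id;
} Entry;

/* Comparator shared by qsort (to sort the table) and bsearch (to search it). */
static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const Entry *)a)->name, ((const Entry *)b)->name);
}

/* Binary search over the sorted table: O(log n) per name lookup. */
static const Entry *lookup(const Entry *table, size_t n, const char *name)
{
    Entry key = { name, 0 };
    return bsearch(&key, table, n, sizeof(Entry), cmp_entry);
}
```

A rename under this scheme is a remove plus a sorted re-insert (or a re-sort), which is simpler than rehashing and avoids the collision concerns of alternative 2.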

@DennisHeimbigner (Collaborator)

For your tests, can you tell me the largest number of variables within a group?
This would tell me the largest table size.
For Greg et al.: can you give me an upper bound on the largest variable count
within any group in your data files?

@wkliao (Contributor)

wkliao commented Jul 30, 2018

@edhartnett, can you make your test program and data files available?
Does this slow performance also happen for classic files?
If yes, I can use it to test PnetCDF hashing mechanism.

@gsjaardema (Contributor)

@DennisHeimbigner
I am not using groups yet, other than in some prototypes, so all vars and dims are in the root group. Some of my complex models have around 500,000 vars.

@gsjaardema (Contributor)

I don't think I trust your gprof output. gprof is a sampling-based profiler, and it looks like there aren't enough samples to give you a good indication of what is happening. I would try using Valgrind with callgrind, or perhaps Intel VTune, or some other profiler that counts the true time in a function instead of taking periodic samples.

I also don't see any of the utf8 functions in your profile and they tend to show up when I profile my usages. Perhaps the test program is using the library in a way that bypasses the utf8 conversions; if so, that would explain it.

@edhartnett (Contributor)

@gsjaardema the utf8 functions don't show in the profile table because I am only reading, so no utf8 conversions need to be done. 500K variables -- well, you are one for the record books. I will try the Valgrind profiler next.

@wkliao everything is available in the latest release of HPC NetCDF: https://github.com/NOAA-GSD/netcdf-c/releases. The timing program is nc_test4/tst_attsperf. I did not try classic format files, I will do that for the next round of testing.

@DennisHeimbigner let me rerun the tests, as Greg suggests, before you mess with the hashes. Another improvement within easy reach (but which I have not yet done) is to only create the hash table for attributes when attributes are actually read. That is, instead of creating a hash table of atts for every var, wait to see if the atts are really going to be read for that var. But first I will rerun the profiling, as Greg suggests, to ensure we're targeting the correct code for performance improvements.

@mathomp4 (Author)

mathomp4 commented Jul 31, 2018

@edhartnett I'll need a bit more time. I grabbed it this morning, but it looks like 4.6.2.4 of HPC NetCDF triggers the same ocprint build error on my MacBook as does 4.6.1 of the "standard" NetCDF.

Time to buckle down and try to figure out what's going on with this...

ETA: Followup email I just posted to the list. netCDF's complexity might be beating me...

@edhartnett (Contributor)

Try building with --disable-dap.

@edhartnett (Contributor)

Also, @mathomp4, is there a GitHub issue for this ocprint error? If not, there should be.

@mathomp4 (Author)

mathomp4 commented Jul 31, 2018

@edhartnett Well, technically I build with dap on because the ultimate use for this netCDF (the GEOS model and its users) has a need for OpenDAP-enabled netcdf. But for testing...

And I did have a GitHub issue, but my reading of the README.md suggested that it was more of a mailing-list issue. But I can add one.

ETA: Or, should I re-open the issue I had before? It's about a month old now, so maybe a new issue is better to get it fresh?

@mathomp4 (Author)

Reported as #1095.

@DennisHeimbigner (Collaborator)

As an aside, it seems to me that for Greg's example of a single flat group with
500,000 variables, the proposed lazy variable mechanism will not help at all.
The moment one variable is accessed, all 500,000 will be brought in as well.
For his case, we are going to need a true lazy system, and that means we need to
be prepared for the effect on variable ids.

@DennisHeimbigner (Collaborator)

instead of creating a hash table for atts for every var, wait to see if the atts are really going to be read for that var.

I would be surprised if the cost of creating the empty hash table was a significant
factor; it is the insertions that matter, I think.
I have a local branch that is set up to experiment with various name-based
lookup mechanisms (one of which is hashing) and also includes some alternate
hashkey functions. When you are finished, I will attempt to do some perf
measurements to see if any of my alternatives makes much difference.
@DennisHeimbigner (Collaborator)

I think the gprof output I would like to see is the profile for a single open
of a file with a relatively large number of variables. This should give
results that do not have so many zeros in them and would more accurately
show which routines are eating the time.

@edhartnett (Contributor)

@DennisHeimbigner I didn't fully explain the lazy vars. Initially I thought that I would have to learn all about all the vars in a group if the user asked for one.

But that is not how the implementation works out. Instead, netcdf-4 still iterates through all of the vars in the file on open, but it does not learn everything about them (for example, it does not learn their type, fill value, and some other details).

This allows the file to open about twice as fast.

When you ask about a specific variable (for example, what is its type?), the library will learn all about that variable, but it does not have to learn about all the other variables in the group at that time.

So Greg's file should open about twice as fast with lazy vars.
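The lazy-var scheme just described follows a standard lazy-initialization pattern, sketched below in plain C. This is a hypothetical miniature, not the libnetcdf implementation: the Var type, var_type function, and hdf5_reads counter are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* On open, the library records only that each variable exists; the
 * expensive metadata (type, fill value, ...) is fetched on first use. */
typedef struct {
    const char *name;
    bool meta_read;   /* has the expensive metadata been fetched yet? */
    int type;         /* filled in lazily */
} Var;

static int hdf5_reads;  /* counts simulated trips into the HDF5 layer */

static int var_type(Var *v)
{
    if (!v->meta_read) {   /* first access: learn all about this var */
        hdf5_reads++;      /* stand-in for the real HDF5 metadata queries */
        v->type = 5;       /* pretend the file says this is a float var */
        v->meta_read = true;
    }
    return v->type;        /* later accesses are served from the cache */
}
```

The key property is that opening a file with n variables does no expensive per-variable work, and querying one variable touches only that variable, never the other n-1.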

@DennisHeimbigner (Collaborator)

I see. Sounds good.

@edhartnett (Contributor)

@mathomp4 can you try your original file tests with the latest master? Fast global att reads and lazy att reads are working in master, so that might address many of your performance issues.

I am still working on lazy var reads, but they will not be in the upcoming 4.6.2 release.

@mathomp4 (Author)

mathomp4 commented Nov 9, 2018

@edhartnett I'll give it a try as soon as I can. I'll be at SC18 the next week, so it might be a while (unless the tutorial/workshop/BoF gets really boring!).

@edhartnett (Contributor)

I think with PR #1262 I have reached the end of the effort to speed up file opens. I can't really think of anything else to do to make them faster. ;-)

I will be doing some graphs of the final results, I will post them here before closing this issue.

Those who have been suffering from slow opens are urged to try master after #1262 is merged.

@edhartnett (Contributor)

Here's another chart showing the improvement in file open times.

The file used is a real-world file from the WRF chemistry module. It contains over 360 variables.

The chart shows the time needed to open and close the file for different netCDF versions, for current master with lazy var reads, and for current master with both lazy var reads and use of the hidden coordinates att.

As we see, there is not much difference between versions until 4.6.2. That version included lazy attribute reading and fast global att reads, and is faster.

Doing lazy read of some var data (PR #1260) improves the open time. Adding use of the hidden coordinates atts (PR #1262) results in a large speedup. The end result is that the time to open this real-world file is reduced to less than a third of what it was in 4.6.0/4.6.1.

[Chart: netCDF-4 open/close times for the WRF chem file]

@mathomp4 (Author)

@edhartnett Is this the "Coordinates" attribute listed in the netCDF docs here: https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html

I looked at #1262 and I wasn't too sure how I'd enable this in my code. Or is there an example writing a file with COORDINATES in the netCDF source code?

@edhartnett (Contributor)

No, the coordinates attribute is a hidden attribute (called "_Netcdf4Coordinates"). You can see it with h5dump, but not with ncdump, because netCDF hides it. It is not the same as the "Coordinates" attribute you mention.

Once #1262 is merged, there is no need to turn this speedup on; it will be available for all newly created netCDF-4 files. You don't have to write the _Netcdf4Coordinates attribute; the library will do it automatically.

The presence of _Netcdf4Coordinates will also be detected automatically when reading a file, and it will be used to speed up file opens. If it is not present, then files are opened in the usual (slower) way.

@mathomp4 (Author)

👍 I like speedups/fixes that I don't have to do anything to get! Nice!

@edhartnett (Contributor)

BTW where does the merr sample file come from? That is, what real-world model or application?

@mathomp4 (Author)

@edhartnett Looking back in time, it seems to be output from MERRA-2, a reanalysis generated by the GMAO, where I work. Looking at the metadata, it's the tavg1_2d_lnd_Nx (aka M2T1NXLND) collection for the date 2010-01-05:

https://disc.gsfc.nasa.gov/datasets/M2T1NXLND_V5.12.4/summary

It's 2D land-surface time-averaged output. It's one of the collections with a larger number of variables:

https://gmao.gsfc.nasa.gov/pubs/docs/Bosilovich785.pdf

which is probably why I looked at it. Though it might have been that I had it at hand when this question/issue came up!


7 participants