Faster and more flexible NetCDF IO. #145

Closed
4 tasks
ali-ramadhan opened this issue Mar 21, 2019 · 20 comments
Labels
abstractions 🎨 Whatever that means help wanted 🦮 plz halp (guide dog provided) output 💾 performance 🏍️ So we can get the wrong answer even faster

@ali-ramadhan
Member

NetCDF.jl seems to be missing some features and isn't really being maintained (See JuliaGeo/NetCDF.jl#62 about saving time values and JuliaGeo/NetCDF.jl#39). Maybe it's worth switching to NCDatasets.jl which is actively maintained and grew out of bugs that weren't being fixed in NetCDF.jl. Unfortunately we're choosing between two relatively young packages. An alternative would be to use the much more mature netcdf4-python but I'd rather not have to use PyCall...

Originally posted by @ali-ramadhan in #31 (comment)

Some more comments:

  • Right now NetCDF output is much slower than expected. We have asynchronous output writing (see "Asynchronous NetCDF output", #137), but spending 2-3 minutes to write out the fields of a 256³ grid is ridiculous.
  • Compression doesn't seem to work in NetCDF.jl: see "Document chunking in nccreate" (JuliaGeo/NetCDF.jl#87).
  • We might want to share this output writing feature with CliMA.jl. See "Data file formats" (ClimateMachine.jl#114).
  • @simonbyrne and @charleskawczynski suggested looking at HDF5.jl. As NetCDF4 is built on HDF5, we should be able to generate valid NetCDF files with it.
  • Whatever we end up doing, @christophernhill makes a good point that we should produce portable NetCDF files as this is what all our potential users would expect and want.
  • Thinking more long-term @lcw says that IO performance is a hard problem and thinks we are going to want something that plays well on clusters (e.g., something MPI IO based).

Some extra features we might want in the short-term:

  • Option to only write out a specific subset of the model state, e.g. surface velocities only, or a vertical temperature slice. This would really speed up IO if you don't need full 3D fields.
  • Select which fields to write out to NetCDF.
  • Include diagnostics in NetCDF output. I believe this will be possible if we resolve the item above as output writing occurs right after diagnostics are run, so diagnostic fields can just be included as one of the fields to write out.
  • Option to create one NetCDF file for each iteration, or to combine all output into one NetCDF file.

I think the main point of this issue is that we should decide how to output NetCDF. Packages we could choose from are NCDatasets.jl and HDF5.jl. Or we could go with something else.

@ali-ramadhan ali-ramadhan added help wanted 🦮 plz halp (guide dog provided) performance 🏍️ So we can get the wrong answer even faster abstractions 🎨 Whatever that means labels Mar 21, 2019
@ali-ramadhan ali-ramadhan changed the title from "Faster and more flexible NetCDF output writer." to "Faster and more flexible NetCDF IO." Mar 21, 2019
@glwagner
Member

Some comments:

  1. We may want to allow users to output in JLD2 in addition to NetCDF.

  2. Check out what FourierFlows.jl does:

https://github.com/FourierFlows/FourierFlows.jl/blob/f7a87d4090123fc3c241bed621b64660a6f3596f/src/output.jl#L1

Output is specified by the user as a function that extracts the output from Model. This permits completely arbitrary output, including on-line calculations. An added benefit is that it is simple to implement.

@ali-ramadhan
Member Author

  1. That would definitely be nice and fast.

  2. We already support this. The user would just have to define their own struct CustomOutputWriter <: OutputWriter, write a write_output(::Model, ::CustomOutputWriter) function, and add this custom writer to model.output_writers. The process does need to be documented though.
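
A minimal sketch of that pattern, for illustration only. The `OutputWriter` and `Model` definitions here are simplified stand-ins so the example runs on its own; they are assumptions, not the package's actual types. `SurfaceSliceWriter`, its `dir` field, the plain-text file format, and treating index 1 as the surface level are all hypothetical choices made for the example.

```julia
# Stand-ins for the package's types (assumptions, simplified for this sketch).
abstract type OutputWriter end

struct Model
    iteration::Int
    temperature::Array{Float64,3}
    output_writers::Vector{OutputWriter}
end

# Hypothetical user-defined writer: saves one horizontal temperature slice.
struct SurfaceSliceWriter <: OutputWriter
    dir::String
end

# Dispatching on the writer type is what makes the output "custom".
function write_output(model::Model, writer::SurfaceSliceWriter)
    path = joinpath(writer.dir, "surface_T_$(model.iteration).txt")
    surface = model.temperature[:, :, 1]  # assumes index 1 is the surface level
    open(path, "w") do io
        for i in axes(surface, 1)
            println(io, join(surface[i, :], " "))
        end
    end
    return path
end

# Usage: register the writer, then call every registered writer.
model = Model(10, rand(4, 4, 2), OutputWriter[])
push!(model.output_writers, SurfaceSliceWriter(mktempdir()))
for writer in model.output_writers
    write_output(model, writer)
end
```

The same shape would apply to a JLD2 or NetCDF writer: only the body of `write_output` changes.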

@glwagner
Member

Huh. I see what you're saying about defining a custom output writer. I guess a user can use JLD2 that way too?

@ali-ramadhan
Member Author

Yup. It would be nice to have some good built in writers, e.g. a solid netcdf writer, but otherwise they can always write a custom writer.

@christophernhill
Member

christophernhill commented Mar 22, 2019 via email

@christophernhill
Member

christophernhill commented Mar 22, 2019 via email

@ali-ramadhan
Member Author

Zarr looks interesting! Sounds like we'd need a good internet connection (e.g. on Google Cloud) but might be cool for accessing datasets over OpeNDAP (can it do that?) or saving tons of data to a cloud storage bucket. We could easily play around with a Zarr output writer on a branch (could try ZarrNative.jl or the python implementation with PyCall).

And no, we haven't talked with JuliaGeo. Seems like it's a more GIS oriented organization, but could be helpful to get in touch with @yeesian.

@christophernhill
Member

christophernhill commented Mar 22, 2019 via email

@yeesian

yeesian commented Mar 22, 2019

This looks super cool: I'll be happy to meet for coffee and chat in person (assuming you're both at MIT) but I am not very familiar with tools in climatology* and think @meggart @visr @evetion @mkborregaard and @juliohm will be in a better position to comment about it.

*The packages in JuliaGeo have mostly focused on IO, and we have not had the bandwidth to think about how they might interface with packages for (climate/ocean/etc) models.

@meggart

meggart commented Mar 22, 2019

Hey, original author of both https://github.com/JuliaGeo/NetCDF.jl and https://github.com/meggart/ZarrNative.jl here. Regarding the state of NetCDF.jl: yes, I would say I have mostly stopped developing the package due to time constraints and have shifted my focus towards Zarr, since this is what we are using in our current project.

My last attempt at improving NetCDF.jl (JuliaGeo/NetCDF.jl#61) solved many of the issues with the package but was not merged because of conflicts with other bugfix PRs. However, it might be a source of inspiration if someone wants to do a rewrite.

Regarding write performance, I would be very interested to see examples where NetCDF.jl performs worse than e.g. python-netcdf4, since most of the time should be spent in the same NetCDF C library. I have been using the package extensively and did not experience it to be slower than comparable packages.

If you are worried about the robustness of NetCDF.jl, you should not even look at ZarrNative.jl, since it is still very young and rather a prototype.

I would be very happy to discuss the issues further, maybe in a call? Would also be interested to learn about your project which seems to be very cool.

@christophernhill
Member

@simonbyrne are there Caltech and/or NPS people who might want to get a little involved in #145 preliminary thinking, maybe @lcw? I am unsure if we can have a single I/O framework that spans all of raw DG to observations (or what that really means!), so it might be too much. I'd definitely be interested in making sure we are all aware of the landscape, though.

@tapios

tapios commented Mar 22, 2019

I think we should distinguish between output we need when we run the model, and output we may eventually provide to the broader community. Ideally, formats for both would be the same, but this may not be the best solution. E.g., netCDF has obvious disadvantages but is still widely used. That does not mean it should be the output format we use by default (although we may want to provide model statistics in netCDF in the end, if a few years from now this is still what everyone uses). So: separate the discussion of what is best for us now from what we should provide (e.g., in any CMIPx archive down the line). Zarr and HDF both seem worth discussing.

@lcw

lcw commented Mar 22, 2019

> @simonbyrne are there Caltech and/or NPS people who might want to get a little involved in #145 preliminary thinking, maybe @lcw? I am unsure if we can have a single I/O framework that spans all of raw DG to observations (or what that really means!), so it might be too much. I'd definitely be interested in making sure we are all aware of the landscape, though.

I am happy to be involved with the discussion. My I/O experience is limited but happy to give a DG perspective.

@tapios

tapios commented Mar 22, 2019

Also important to keep in mind in this discussion: Our workflow will be different from most standard models, which write out instantaneous output that then is post-processed to get statistics etc. We will have to accumulate statistics on the fly, and we can (and should) forgo most instantaneous output, at least for the atmosphere. The model will learn from the accumulated statistics. Otherwise, with instantaneous output, the data volume, especially with embedded LES, will create an I/O and data transfer bottleneck that will limit us, and, e.g., will limit our ability to use distributed computing platforms.
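
As an illustration of what accumulating statistics on the fly can look like, here is a minimal sketch that keeps a running mean and variance of a field using Welford's update, so no instantaneous snapshot needs to be stored or written. This technique is swapped in for the example and is not something the thread prescribes; `RunningStats`, `update!`, and `variance` are hypothetical names.

```julia
# Running mean and sum of squared deviations for each grid point.
mutable struct RunningStats
    n::Int
    mean::Array{Float64}
    m2::Array{Float64}
end

# Hypothetical convenience constructor: start from zeros of the given shape.
RunningStats(dims::Tuple) = RunningStats(0, zeros(dims), zeros(dims))

# Welford's update: fold one instantaneous field into the statistics.
function update!(stats::RunningStats, field::AbstractArray)
    stats.n += 1
    delta = field .- stats.mean
    stats.mean .+= delta ./ stats.n
    stats.m2 .+= delta .* (field .- stats.mean)
    return stats
end

# Unbiased sample variance; only meaningful once n >= 2.
variance(stats::RunningStats) = stats.m2 ./ (stats.n - 1)

# Usage: fold in three "time steps" of a 2x2 field.
stats = RunningStats((2, 2))
for value in (1.0, 2.0, 3.0)
    update!(stats, fill(value, 2, 2))
end
```

Only `stats.mean` and `stats.m2` persist between time steps, so memory and output volume stay fixed regardless of how many steps are accumulated.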

@simonbyrne
Member

HDF5 does have some new features that might be useful for parallel IO, such as Virtual Datasets:
https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html

@visr

visr commented Mar 22, 2019

My feeling is that if you want to write NetCDF files through the HDF API that it will be more work, though I never tried.

Regarding NetCDF.jl & NCDatasets.jl, I feel that the statements in the OP that NetCDF.jl is not being maintained and that NCDatasets.jl grew out of bugs not being fixed are a bit of a misrepresentation.

For installations and dependency reduction, hopefully the new HDF5 release, which will for the first time support cross compilation, will lead to HDF5.jl switching to BinaryBuilder, which will allow NetCDF.jl to do the same. Also with the Clang.jl improvements we can regenerate the bindings. I still hope that NetCDF and NCDatasets will be able to share more code in the future, and be mainly about exposing different user-facing APIs.

@ali-ramadhan
Member Author

ali-ramadhan commented Mar 25, 2019

> This looks super cool: I'll be happy to meet for coffee and chat in person (assuming you're both at MIT) but I am not very familiar with tools in climatology* and think @meggart @visr @evetion @mkborregaard and @juliohm will be in a better position to comment about it.

> *The packages in JuliaGeo have mostly focused on IO, and we have not had the bandwidth to think about how they might interface with packages for (climate/ocean/etc) models.

That might be useful down the line! Yes, we're both around MIT. I think we're still figuring out how we want to do IO in the long term, but we will definitely want some way to output NetCDF.

> Hey, original author of both https://github.com/JuliaGeo/NetCDF.jl and https://github.com/meggart/ZarrNative.jl here. Regarding the state of NetCDF.jl: yes, I would say I have mostly stopped developing the package due to time constraints and have shifted my focus towards Zarr, since this is what we are using in our current project.

> My last attempt at improving NetCDF.jl (JuliaGeo/NetCDF.jl#61) solved many of the issues with the package but was not merged because of conflicts with other bugfix PRs. However, it might be a source of inspiration if someone wants to do a rewrite.

> Regarding write performance, I would be very interested to see examples where NetCDF.jl performs worse than e.g. python-netcdf4, since most of the time should be spent in the same NetCDF C library. I have been using the package extensively and did not experience it to be slower than comparable packages.

> If you are worried about the robustness of NetCDF.jl, you should not even look at ZarrNative.jl, since it is still very young and rather a prototype.

> I would be very happy to discuss the issues further, maybe in a call? Would also be interested to learn about your project which seems to be very cool.

Thanks so much for working on NetCDF.jl! I didn't mean to sound ungrateful about NetCDF.jl's performance. We were just debating which package to use. With JuliaGeo/NetCDF.jl#87 fixed, I think we'll be happy for a long time. The compress=9 bug explains why the IO was slow. @glwagner has suggested that for a project of our scale we'd want to help and contribute to the packages we use.

We definitely want to stick with NetCDF as it's the de facto standard in the climate, atmospheric, and ocean sciences. A discussion might be helpful down the line. With faster IO I think we're happy for now, and we're still figuring out how to do IO long-term.

> My feeling is that if you want to write NetCDF files through the HDF API that it will be more work, though I never tried.

> Regarding NetCDF.jl & NCDatasets.jl, I feel that the statements in the OP that NetCDF.jl is not being maintained and that NCDatasets.jl grew out of bugs not being fixed are a bit of a misrepresentation.

> For installations and dependency reduction, hopefully the new HDF5 release, which will for the first time support cross compilation, will lead to HDF5.jl switching to BinaryBuilder, which will allow NetCDF.jl to do the same. Also with the Clang.jl improvements we can regenerate the bindings. I still hope that NetCDF and NCDatasets will be able to share more code in the future, and be mainly about exposing different user-facing APIs.

Thanks for the feedback! My thinking was the same, why use HDF5.jl when NetCDF.jl and NCDatasets.jl exist since we want NetCDF output in the end. Sorry if I misrepresented the two packages, it was just what I gleaned by skimming a few issues and PRs.

Will definitely keep a look out for new HDF5 releases.

@ali-ramadhan
Member Author

ali-ramadhan commented Mar 25, 2019

> I think we should distinguish between output we need when we run the model, and output we may eventually provide to the broader community. Ideally, formats for both would be the same, but this may not be the best solution. E.g., netCDF has obvious disadvantages but is still widely used. That does not mean it should be the output format we use by default (although we may want to provide model statistics in netCDF in the end, if a few years from now this is still what everyone uses). So: separate the discussion of what is best for us now from what we should provide (e.g., in any CMIPx archive down the line). Zarr and HDF both seem worth discussing.

> Also important to keep in mind in this discussion: Our workflow will be different from most standard models, which write out instantaneous output that then is post-processed to get statistics etc. We will have to accumulate statistics on the fly, and we can (and should) forgo most instantaneous output, at least for the atmosphere. The model will learn from the accumulated statistics. Otherwise, with instantaneous output, the data volume, especially with embedded LES, will create an I/O and data transfer bottleneck that will limit us, and, e.g., will limit our ability to use distributed computing platforms.

I agree. Ideally we'd support different formats (e.g. NetCDF, JLD2, HDF, Zarr, etc.) and have the option to use the best format for your application. We can already switch between output writers and choose which field(s)/diagnostics to output but we only do binary, NetCDF, and JLD for now. We were just focusing on NetCDF for our short-term needs, but this will definitely be a challenge for large problems.

@visr

visr commented Apr 2, 2019

Actually, I just came across this: https://github.com/shoyer/h5netcdf. It is a Python library, but it reads and writes NetCDF directly using the HDF5 library. I still don't think it would be the first way to go, but it can be done :)

@ali-ramadhan ali-ramadhan added this to the v1.0 milestone Apr 3, 2019
@ali-ramadhan
Member Author

I think with PR #643 we finally have a flexible enough NetCDF output writer based on NCDatasets.jl and can finally close this issue: it's fast and allows for arbitrary output (fields, scalars, profiles, slices, etc.) so we can output all our diagnostics to NetCDF.

Thanks @suyashbire1 for working on this!

9 participants