Faster and more flexible NetCDF IO. #145
Comments
Some comments:
Output is specified by the user as a function that extracts the output from |
|
Huh. I see what you're saying about defining a custom output writer. I guess a user can use JLD2 that way too? |
Yup. It would be nice to have some good built in writers, e.g. a solid netcdf writer, but otherwise they can always write a custom writer. |
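For illustration, here is a minimal sketch of what "output specified as a function" could look like. It is not the project's actual API: `FunctionOutputWriter`, `ToyModel`, and `maybe_write` are hypothetical names, and an in-memory list stands in for a file on disk.

```python
from typing import Any, Callable, Dict, List


class FunctionOutputWriter:
    """Hypothetical writer: each output is a function that extracts a value from the model."""

    def __init__(self, outputs: Dict[str, Callable[[Any], Any]], interval: int):
        self.outputs = outputs          # output name -> extractor function
        self.interval = interval        # write every `interval` iterations
        self.records: List[dict] = []   # stand-in for a file on disk

    def maybe_write(self, model: Any, iteration: int) -> bool:
        """Extract and store every output if this iteration is due for output."""
        if iteration % self.interval != 0:
            return False
        record = {"iteration": iteration}
        for name, extract in self.outputs.items():
            record[name] = extract(model)
        self.records.append(record)
        return True


class ToyModel:
    """Stand-in for a simulation with a single temperature column."""

    def __init__(self):
        self.temperature = [20.0, 21.0, 22.0]

    def step(self):
        self.temperature = [t + 0.5 for t in self.temperature]


model = ToyModel()
writer = FunctionOutputWriter(
    outputs={
        "T_mean": lambda m: sum(m.temperature) / len(m.temperature),
        "T_surface": lambda m: m.temperature[-1],
    },
    interval=2,
)

for iteration in range(1, 5):
    model.step()
    writer.maybe_write(model, iteration)
```

Because the writer only ever calls the user's functions, swapping the storage backend (NetCDF, JLD2, or something custom) would not change how outputs are declared.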
Hi Both
Are we talking with
https://github.com/JuliaGeo
I am tempted to think we should explore Zarr interfaces (
https://zarr.readthedocs.io/en/stable/index.html ) as well as netCDF.
yeesian@mit.edu who is part of https://github.com/JuliaGeo is at MIT
Chris
|
Zarr looks interesting! Sounds like we'd need a good internet connection (e.g. on Google Cloud) but might be cool for accessing datasets over OPeNDAP (can it do that?) or saving tons of data to a cloud storage bucket. We could easily play around with a Zarr output writer on a branch (could try ZarrNative.jl or the Python implementation with PyCall). And no, we haven't talked with JuliaGeo. Seems like it's a more GIS-oriented organization, but it could be helpful to get in touch with @yeesian. |
JuliaGeo also hosts
https://github.com/JuliaGeo/NetCDF.jl
( but not https://github.com/Alexander-Barth/NCDatasets.jl ).
|
This looks super cool: I'll be happy to meet for coffee and chat in person (assuming you're both at MIT) but I am not very familiar with tools in climatology* and think @meggart @visr @evetion @mkborregaard and @juliohm will be in a better position to comment about it. *The packages in JuliaGeo have been mostly focused on IO and have not had the bandwidth to think about how they might interface with packages for (climate/ocean/etc) models. |
Hey, original author of both https://github.com/JuliaGeo/NetCDF.jl and https://github.com/meggart/ZarrNative.jl here. Regarding the state of NetCDF.jl: yes, I would say I mostly stopped developing the package due to time constraints and am currently shifting my focus towards Zarr, since this is what we are using in our current project. My last attempt at improving NetCDF.jl (JuliaGeo/NetCDF.jl#61) solved many of the issues with the package but was not merged because of conflicts with other bugfix PRs. However, it might be a source of inspiration if someone wants to do a rewrite. Regarding write performance, I would be very interested to see examples where NetCDF.jl performs worse than e.g. python-netcdf4, since most of the time should be spent in the same NetCDF C library. I have been using the package extensively and did not experience it to be slower than comparable packages. If you are worried about the robustness of NetCDF.jl, you should not even look at ZarrNative.jl, since it is still very young and rather a prototype. I would be very happy to discuss the issues further, maybe in a call? I would also be interested to learn about your project, which seems to be very cool. |
@simonbyrne are there Caltech and/or NPS people who might want to get a little involved in #145 preliminary thinking - maybe @lcw? I am unsure if we can have a single I/O framework that spans all of raw DG to observations (or what that really means!), so it might be too much. Definitely be interested in making sure we are all aware of landscape though. |
I think we should distinguish between output we need when we run the model, and output we may eventually provide to the broader community. Ideally, formats for both would be the same, but this may not be the best solution. E.g., netCDF has obvious disadvantages but is still widely used. That does not mean it should be the output format we use by default (although we may want to provide model statistics in netCDF in the end, if a few years from now this is still what everyone uses). So: separate the discussion of what is best for us now from what we should provide (e.g., in any CMIPx archive down the line). Zarr and HDF both seem worth discussing. |
I am happy to be involved with the discussion. My I/O experience is limited but happy to give a DG perspective. |
Also important to keep in mind in this discussion: Our workflow will be different from most standard models, which write out instantaneous output that then is post-processed to get statistics etc. We will have to accumulate statistics on the fly, and we can (and should) forgo most instantaneous output, at least for the atmosphere. The model will learn from the accumulated statistics. Otherwise, with instantaneous output, the data volume, especially with embedded LES, will create an I/O and data transfer bottleneck that will limit us, and, e.g., will limit our ability to use distributed computing platforms. |
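As a rough illustration of accumulating statistics on the fly, the sketch below keeps a running mean and variance without storing any instantaneous samples. It uses Welford's online algorithm, which is one standard choice and not necessarily what the model will use; `RunningStats` is a hypothetical name.

```python
class RunningStats:
    """Accumulate mean and variance one sample at a time (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the samples seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(sample)  # the sample itself can be discarded after this call
```

Only three numbers per statistic need to be kept in memory or written out, so the output volume does not grow with the number of time steps.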
HDF5 does have some new features that might be useful for parallel IO, such as Virtual Datasets: |
My feeling is that if you want to write NetCDF files through the HDF API it will be more work, though I never tried. Regarding NetCDF.jl & NCDatasets.jl, I feel that the statements in the OP that NetCDF.jl is not being maintained and that NCDatasets.jl grew out of bugs not being fixed are a bit of a misrepresentation. For installations and dependency reduction, hopefully the new HDF5 release, which will for the first time support cross compilation, will lead to HDF5.jl switching to BinaryBuilder, which will allow NetCDF.jl to do the same. Also, with the Clang.jl improvements we can regenerate the bindings. I still hope that NetCDF and NCDatasets will be able to share more code in the future, and be mainly about exposing different user-facing APIs. |
That might be useful down the line! Yes we're both around MIT. I think we're still figuring out how we want to do IO in the long-term but will definitely want some way to output NetCDF.
Thanks so much for working on NetCDF.jl! I didn't mean to sound ungrateful about NetCDF.jl's performance. We were just debating which package to use. With JuliaGeo/NetCDF.jl#87 fixed, I think we'll be happy for a long time. We definitely want to stick with NetCDF as it's the de facto standard in the climate, atmospheric, and ocean sciences. A discussion might be helpful down the line. With faster IO I think we're happy now and we're still figuring out how to do IO long-term.
Thanks for the feedback! My thinking was the same, why use HDF5.jl when NetCDF.jl and NCDatasets.jl exist since we want NetCDF output in the end. Sorry if I misrepresented the two packages, it was just what I gleaned by skimming a few issues and PRs. Will definitely keep a look out for new HDF5 releases. |
I agree. Ideally we'd support different formats (e.g. NetCDF, JLD2, HDF, Zarr, etc.) and have the option to use the best format for your application. We can already switch between output writers and choose which field(s)/diagnostics to output but we only do binary, NetCDF, and JLD for now. We were just focusing on NetCDF for our short-term needs, but this will definitely be a challenge for large problems. |
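One way to picture switching between formats is a small registry that maps a format name to an encoder, so callers pick a format without touching the rest of the output code. This is only a sketch: `ENCODERS` and `serialize` are hypothetical names, raw float64 packing stands in for "binary", and JSON stands in for a self-describing format such as NetCDF or JLD.

```python
import json
import struct
from typing import Callable, Dict, List


def encode_binary(values: List[float]) -> bytes:
    """Pack values as little-endian float64 with no metadata."""
    return struct.pack(f"<{len(values)}d", *values)


def encode_json(values: List[float]) -> bytes:
    """Self-describing text encoding (stand-in for a richer format)."""
    return json.dumps({"values": values}).encode("utf-8")


# Registry of available formats; adding a format means adding one entry here.
ENCODERS: Dict[str, Callable[[List[float]], bytes]] = {
    "binary": encode_binary,
    "json": encode_json,
}


def serialize(values: List[float], fmt: str) -> bytes:
    """Encode `values` with the encoder registered under `fmt`."""
    try:
        encoder = ENCODERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported output format: {fmt!r}") from None
    return encoder(values)
```

For example, `serialize([1.0, 2.0, 3.0], "binary")` yields 24 bytes (three float64 values), while an unregistered name raises `ValueError`.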
Actually just came across this: https://github.com/shoyer/h5netcdf. It is a Python library, but it reads and writes NetCDF directly using the HDF5 library. I still don't think it would be the first way to go, but it can be done :) |
I think with PR #643 we finally have a flexible enough NetCDF output writer based on NCDatasets.jl and can finally close this issue: it's fast and allows for arbitrary output (fields, scalars, profiles, slices, etc.) so we can output all our diagnostics to NetCDF. Thanks @suyashbire1 for working on this! |
Some more comments:
Some extra features we might want in the short-term:
I think the main point of this issue is that we should decide how to output NetCDF. Packages we could choose from are NCDatasets.jl and HDF5.jl. Or we could go with something else.