Making NASA JPL PO.DAAC datasets available from ocean.pangeo.io? #686
A follow-up here is that I would like guidance on whether we should create the Zarr manifestations on the PO.DAAC side and serve them from our data center, or if there is a precedent/guidance on how this should be done. Any comments are really appreciated. Thank you
Hi @lewismc - welcome to Pangeo! As a physical oceanographer myself, I'm thrilled to see someone from NASA getting involved in this issue. I personally would love to have the PO.DAAC datasets available in a cloud-optimized format. So THANK YOU for your participation! First let me state that all of this zarr / cloud stuff we are doing is pretty new and experimental. We are working towards defining best practices, but there are still a lot of open questions about how to do things. When trying to host cloud-optimized mirrors of existing data repositories like PO.DAAC, the main issues that come to mind are:
I hope this list can help you figure out how to move forward. I invite you to attend our weekly Pangeo meetings if you want to discuss things face-to-face.
Also, I'm curious if you have official support from PO.DAAC to be doing this?
This is really refreshing to see @rabernat thank you for your detailed thoughts. There is a lot for me to respond to so let me gradually make my way through it
The NASA Earth Science Data Information System (ESDIS) program's Earth Science Data System Working Group (ESDSWG) currently facilitates one WG focused on analysis-ready data. I think presenting your work to them would be an excellent step in the right direction. Maybe you already knew about them. Please let me know if you would like to learn more.
By reading the documentation I understood this. It is very interesting and, if I am honest, until I read the Pangeo documentation I was not actually aware of the Zarr format.
For analysis purposes we've been going in this direction for about 5 years now as well. The way we described it was that the legacy formats e.g. netCDF are optimized for write performance with the timeseries being written in one pass. This is of course not optimal for analysis where data access patterns are significantly different in nature. In short, getting away from files has been something we've been doing with our R&TD projects for some time so we are on the same page here :)
Correct. In the Apache SDAP project, the analysis stack NEXUS uses various tiling schemes which are dataset specific. A suitable chunking or data structuring methodology requires inherent knowledge of both the dataset one is working with and one's access patterns. This is non-trivial in nature, and we only really find something that works well after several iterations and comparisons.
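To make the access-pattern trade-off concrete, here is a minimal sketch in xarray/dask. The variable name `sst` and all dimension sizes are synthetic placeholders, not an actual PO.DAAC product; the point is only that the same array can be chunked for map-style access (one full field per time step) or for time-series access (full time axis, small spatial tiles):

```python
import numpy as np
import xarray as xr

# A tiny synthetic stand-in for a gridded SST product; the variable name
# "sst" and the dimension sizes are made up for this sketch.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"),
             np.zeros((365, 18, 36), dtype="float32"))}
)

# Chunking suited to map-style analysis: one full field per time step.
map_chunks = ds.chunk({"time": 1, "lat": 18, "lon": 36})

# Chunking suited to time-series analysis: whole time axis, small tiles.
series_chunks = ds.chunk({"time": 365, "lat": 6, "lon": 6})

print(map_chunks["sst"].data.chunksize)     # (1, 18, 36)
print(series_chunks["sst"].data.chunksize)  # (365, 6, 6)
```

Extracting one time step touches a single chunk in the first layout but every chunk in the second, and vice versa for a single-point time series, which is why the choice has to be driven by the expected access pattern.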
Maybe working with the PO.DAAC datasets is an opportunity to experiment here...?
I will certainly share my experiences on the referenced Github issue.
I have nothing to share right now, but we do have several datasets at PO.DAAC which would allow us to experiment in this area as well.
I think for the time being, accessing from ocean.pangeo.io is the best place. I wouldn't imagine we will be attempting to host a dedicated Pangeo stack at PO.DAAC (or in one of our AWS accounts) any time soon. Maybe we can discuss further hosting some candidate Zarr-ified PO.DAAC dataset(s) in the Google Cloud US-Central region, which would allow us to address a few of the experimentation areas stated above?
Yes, I agree that the lightweight data catalog is a fitting concept. Once we've reviewed some of the candidates and made some progress on the netCDF/HDF-->Zarr conversion pipeline, we can come back to this one.
Yes, I will attend the next one, which is tomorrow, Wednesday, Aug 7th, 2019. If you could spare some time to include this item on the agenda, that would be excellent.
We were discussing Pangeo only yesterday, and although it is very early days (that is to say, we have known about the Pangeo community and stack for quite some time, but there are few examples which employ PO.DAAC data), I see no reason for us not to push on. There is no official support for us mirroring the entire PO.DAAC archive as Zarr... the justification to do this would need to come from our community, and at this point in time there is just no way that we would get the support required. HOWEVER, there is certainly value in us prototyping a conversion pipeline, staging data for rapid infusion into ocean.pangeo.io, and demonstrating the interactive analysis capability for one or more representative science analyses. I look forward to joining the call tomorrow.
@rabernat just got confirmation that Mike Gangl (my boss and lead of the PO.DAAC Cloud System) will join us as well.
There was quite a bit of interest in the PO.DAAC datasets here at the Pangeo annual meeting in Seattle. When we last talked, our plan was to brainstorm some desired outcomes for this work. I'll kick that off here:
Following up on @rabernat, I have been looking into converting the MUR SST dataset to Zarr using the PO.DAAC OPeNDAP server. This conversion is motivated by interest from @cgentemann in using this dataset in some cloud science tutorials. Here is a gist of the Pangeo Jupyter notebook that I am working on for converting this dataset: https://gist.github.com/abarciauskas-bgse/95259d8ccce60b452afe4415a476225f My next step is to look into running this on a separate cluster (e.g. not on hub.pangeo) so I can have it run over a couple of days. Right now reading each time step (a day) is taking at least a minute using the OPeNDAP server (@rabernat helped me figure this out), which means doing the whole MUR SST archive (16-17 years) will take 4-5 days by my estimation. If anyone has advice on things to look into to speed this up, I would be open to them!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
Hi all. Has there been any progress on this lately? I'd also like to have access to JPL data via Pangeo. Specifically, sea ice would be excellent. |
Yes, NASA DAACs are moving their data to the cloud. You can find the datasets here: https://search.earthdata.nasa.gov/search?ff=Available%20from%20AWS%20Cloud. There is more information here: https://podaac.jpl.nasa.gov/cloud-datasets/about. There are some tutorials on access, but there isn't yet a clean way to easily use your Earthdata login to access it. The tutorials provide some code, but it is a bit complicated. Hopefully a library will come out soon. -chelle
Hi folks,
I'm currently evaluating Pangeo and came across the Google Cloud Storage Data Catalog documentation and specifically how one could prepare cloud optimized Zarr data.
PO.DAAC's data catalog currently offers >550 datasets covering a diverse and rich world of oceanography variables e.g. sea surface salinity, gravity, sea ice, sea surface temperature, etc. I think it would be excellent if these were available for processing within Pangeo through simple Python function calls.
How would I go about investigating the above? Thank you