
[WIP] Pipeline for SMAP soil moisture ingest #117

Closed
wants to merge 159 commits

Conversation

ashiklom
Collaborator

This is a minimal, very much work-in-progress workflow for converting SMAP L3 soil moisture data into Zarr. This is nowhere close to being merged, and I'm totally fine with this being replaced with someone else's implementation of a "real" workflow for this same dataset.

That said, the Python scripts here do work. My question is: What to do next (in the specific context of EIS)? Do I need to set up a CDK (also, what is a CDK?)? Where do I put the resulting Zarr?

We will be repeating very similar workflows for a lot more datasets as EIS moves forward, so I wanted to use this as a pathfinder to understand how this process works.
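For anyone who has not looked at the scripts, here is a rough, purely illustrative sketch of the kind of conversion involved, assuming xarray and h5netcdf are available. The HDF5 group name, file glob, chunking, and output path are assumptions for illustration, not the values the PR's scripts actually use.

```python
# Illustrative sketch only (not the PR's scripts): stack daily SMAP L3 HDF5
# granules along a new "time" dimension and write a single Zarr store.
import glob
import xarray as xr

paths = sorted(glob.glob("SMAP_L3_SM_P_*.h5"))  # assumed file naming

datasets = []
for path in paths:
    ds = xr.open_dataset(
        path,
        group="Soil_Moisture_Retrieval_Data_AM",   # assumed HDF5 group name
        engine="h5netcdf",
        backend_kwargs={"phony_dims": "sort"},     # may be needed if the files lack dimension scales
    )
    datasets.append(ds)

# Concatenate the daily granules along a new time dimension and write one Zarr store.
combined = xr.concat(datasets, dim="time")
combined.to_zarr("SMAP_L3_soil_moisture.zarr", mode="w", consolidated=True)
```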

jvntf and others added 23 commits March 3, 2022 13:50
* Update README.md

* Some docs for having different workflows for different datasets

* Update README.md

* Update README.md

Co-authored-by: Jeevan Farias <jeevanfarias@gmail.com>

Co-authored-by: Jeevan Farias <jeevanfarias@gmail.com>
* Add sql and instructions for OMSO2PCA collection and update item STAC metadata

* Rename some things

* Update assets in stac-gen for non-cmr data

* Remove 'name' key as invalid

* Add example file for OMSO2PCA-COG

* Update stack for OMSO2PCA metadata

* Gitignore cdk context

* add cdk deploy step

* Regex should permit characters after the year

* add OMI NO2 sql

* Working on metadata fixes

* Updated collection SQL for OMI_trno2-COG.sql
…ns-metadata

Add social vulnerability index collections metadata sql
This doesn't work (on DISCOVER) because of weird HDF5 reading errors.
@abarciauskas-bgse
Collaborator

This is a great start to having a Zarr workflow, thanks @ashiklom

AWS CDK is the AWS Cloud Development Kit; we are using it to define and deploy the AWS cloud-hosted infrastructure for our workflow pipelines.

As discussed in #113, we are working on consolidating some of the existing dataset workflows into a single but configurable pipeline that discovers, processes, and publishes data. Given our current use cases, it should be reasonable to configure discovery from an S3 directory or from CMR, configurably process the inputs to COGs (with custom functions), and then generate and publish STAC metadata for those COGs.
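To make "configurable" concrete, here is a purely hypothetical sketch of what a per-dataset configuration for such a pipeline might look like; none of these keys or values come from an existing schema in this repo.

```python
# Hypothetical per-dataset configuration for a generalized ingest pipeline.
# All keys and values are illustrative assumptions, not an existing schema.
dataset_config = {
    "collection": "SPL3SMP",                                   # SMAP L3 soil moisture (passive)
    "discovery": {"source": "cmr", "concept_id": "<concept-id>"},  # or {"source": "s3", "prefix": "..."}
    "processing": {"function": "hdf5_to_zarr"},                # custom conversion step
    "publication": {"target": "stac", "catalog": "veda"},      # generate + publish STAC metadata
}
```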

Right now, this workflow falls into the "manual" or "local" category of a dataset workflow. Regardless of whether we get its functions into cloud-hosted services, we still want to record the code used to generate and publish the cloud-optimized format.

So the next steps (not necessarily in this order) for completing the "local" version of this workflow would be:

For creating a cloud deployment, I think the workflow could be broken up into:
discover (CMR) -> (map) download files -> (wait for all map iterators to finish) generate Zarr -> publish STAC
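As a sketch of that breakdown, the chain could be expressed as a Step Functions state machine with the CDK in Python roughly as below. This is not code from this PR; the stack, construct names, Lambda packaging, and payload paths are all assumptions.

```python
# Hypothetical CDK sketch: discover (CMR) -> map(download) -> generate Zarr -> publish STAC.
from aws_cdk import Stack, Duration, aws_lambda as _lambda, aws_stepfunctions as sfn, aws_stepfunctions_tasks as tasks
from constructs import Construct


class SmapZarrPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        def handler(name: str) -> _lambda.Function:
            # Each step is assumed to be packaged as a Lambda under lambdas/<name>/.
            return _lambda.Function(
                self, name,
                runtime=_lambda.Runtime.PYTHON_3_9,
                handler="handler.main",
                code=_lambda.Code.from_asset(f"lambdas/{name.lower()}"),
                timeout=Duration.minutes(15),
            )

        discover = tasks.LambdaInvoke(self, "DiscoverCMR", lambda_function=handler("Discover"))

        # Fan out over the granules returned by discovery; the Map state only
        # completes when every iterator finishes, which provides the
        # "wait for all downloads" barrier before the Zarr generation step.
        download = sfn.Map(self, "DownloadGranules", items_path="$.Payload.granules")
        download.iterator(tasks.LambdaInvoke(self, "DownloadFile", lambda_function=handler("Download")))

        generate_zarr = tasks.LambdaInvoke(self, "GenerateZarr", lambda_function=handler("GenerateZarrFn"))
        publish_stac = tasks.LambdaInvoke(self, "PublishSTAC", lambda_function=handler("PublishFn"))

        sfn.StateMachine(
            self, "SmapIngest",
            definition=discover.next(download).next(generate_zarr).next(publish_stac),
        )
```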

There are a couple options for moving forward with a cloud deployment of this workflow:

  • break up the steps and write CDK code that runs the workflow above as a Step Functions state machine in a new workflow
  • integrate with the current simple/generalized ingest being developed in Add bmhd metadata #111; however, this would require an almost entirely new (but configurable) set of functions for that workflow
  • Should we write a pangeo-forge recipe for this dataset? If we're interested in exploring this option, I will ask @sharkinsspatial what the next steps should be. I believe we can write the recipe, but I'm not sure whether we can or should run it on the pangeo-forge infrastructure (see pangeo-forge/staged-recipes#30, an example pipeline for SMAP Sea Surface Salinity). A minimal recipe sketch follows this list.
    • If we move forward with using pangeo-forge, we will need to determine how the data would be written to the VEDA data store and published to the VEDA API.
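For reference, the recipe sketch mentioned above, loosely modeled on the older pangeo-forge-recipes API used by the staged-recipes example linked earlier. The URL template, date range, group name, and chunking are all assumptions; real SMAP L3 granules are distributed through NSIDC/Earthdata.

```python
# Hypothetical pangeo-forge recipe sketch for SMAP L3 soil moisture -> Zarr.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range("2015-04-01", "2021-12-31", freq="D")  # assumed temporal coverage


def make_url(time):
    # Placeholder URL template; the real endpoint and naming convention will differ.
    return (
        "https://example.com/SMAP/SPL3SMP/"
        f"{time:%Y.%m.%d}/SMAP_L3_SM_P_{time:%Y%m%d}.h5"
    )


pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 30},  # assumed chunking
    xarray_open_kwargs={
        "group": "Soil_Moisture_Retrieval_Data_AM",  # assumed HDF5 group name
        "engine": "h5netcdf",
    },
)
```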

@ashiklom
Collaborator Author

@abarciauskas-bgse Thanks for the detailed response!

I'm not sure I'll be of much use on most of the points above, so here's another idea: How about we create a separate public repository for manual workflows like this that we will be developing for EIS and related work? Then, you all can use that repository as a collection of use cases for creating automated ingest workflows, Pangeo Forge recipes, etc. I think an approach like that may be more efficient than me trying to learn all of these procedures.

Thoughts?

@abarciauskas-bgse
Collaborator

I like that idea. We might want to iterate on it as we develop a longer-term process for taking science workflows to the cloud, but this sounds good for now. How about a repo named "veda-data-scripts" or "veda-data-processing"? I'm wondering if we need anything that is specific to each EIS application, but I believe many of these datasets will be shared, so it probably makes more sense to have one shared repository for all EIS and VEDA science applications.

@ashiklom
Collaborator Author

ashiklom commented May 3, 2022

Following up on this. I just created https://github.com/ashiklom/veda-data-processing, which includes the SMAP workflow from this PR. All the code in there is under MIT License — you can try to do some fancy git subtree / git submodule magic to add these to this or a similar repo along with the full revision history...or just copy the code over (again...MIT License! I added my name as a contact so people know who to blame even in the absence of the git commit history).

@abarciauskas-bgse
Collaborator

Awesome, thanks @ashiklom! Should we close this PR then?

@ashiklom ashiklom closed this May 9, 2022