
[WIP] Pipeline for SMAP soil moisture ingest #117

Closed
wants to merge 159 commits

Conversation

ashiklom
Collaborator

This is a minimal, very much work-in-progress workflow for converting SMAP L3 soil moisture data into Zarr. This is nowhere close to being merged, and I'm totally fine with this being replaced with someone else's implementation of a "real" workflow for this same dataset.

That said, the Python scripts here do work. My question is: What to do next (in the specific context of EIS)? Do I need to set up a CDK (also, what is a CDK?)? Where do I put the resulting Zarr?

We will be repeating very similar workflows for a lot more datasets as EIS moves forward, so I wanted to use this as a pathfinder to understand how this process works.
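For anyone who has not looked at the scripts, here is a rough, purely illustrative sketch of the kind of conversion involved, assuming xarray and h5netcdf are available. The HDF5 group name, file glob, chunking, and output path are assumptions for illustration, not the values the PR's scripts actually use.

```python
# Illustrative sketch only (not the PR's scripts): stack daily SMAP L3 HDF5
# granules along a new "time" dimension and write a single Zarr store.
import glob
import xarray as xr

paths = sorted(glob.glob("SMAP_L3_SM_P_*.h5"))  # assumed file naming

datasets = []
for path in paths:
    ds = xr.open_dataset(
        path,
        group="Soil_Moisture_Retrieval_Data_AM",   # assumed HDF5 group name
        engine="h5netcdf",
        backend_kwargs={"phony_dims": "sort"},     # may be needed if the files lack dimension scales
    )
    datasets.append(ds)

# Concatenate the daily granules along a new time dimension and write one Zarr store.
combined = xr.concat(datasets, dim="time")
combined.to_zarr("SMAP_L3_soil_moisture.zarr", mode="w", consolidated=True)
```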

jvntf and others added 23 commits March 3, 2022 13:50
* Update README.md

* Some docs for having different workflows for different datasets

* Update README.md

* Update README.md

Co-authored-by: Jeevan Farias <jeevanfarias@gmail.com>

Co-authored-by: Jeevan Farias <jeevanfarias@gmail.com>
* Add sql and instructions for OMSO2PCA collection and update item STAC metadata

* Rename some things

* Update assets in stac-gen for non-cmr data

* Remove 'name' key as invalid

* Add example file for OMSO2PCA-COG

* Update stack for OMSO2PCA metadata

* Gitignore cdk context

* add cdk deploy step

* Regex should permit characters after the year

* add OMI NO2 sql

* Working on metadata fixes

* Updated collection SQL for OMI_trno2-COG.sql
…ns-metadata

Add social vulnerability index collections metadata sql
This doesn't work (on DISCOVER) because of weird HDF5 reading errors.
@abarciauskas-bgse
Collaborator

This is a great start to having a Zarr workflow, thanks @ashiklom

AWS CDK is the AWS Cloud Development Kit; we are using it to define and deploy the AWS cloud-hosted infrastructure for our workflow pipelines.

As discussed in #113, we are working on consolidating some of the existing dataset workflows into a single but configurable pipeline that discovers, processes, and publishes data. Given our current use cases, it should be reasonable to configure discovery from an S3 directory or from CMR, configurably process the inputs to COGs (with custom functions), and then generate and publish STAC metadata for those COGs.
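To make "configurable" concrete, here is a purely hypothetical sketch of what a per-dataset configuration for such a pipeline might look like; none of these keys or values come from an existing schema in this repo.

```python
# Hypothetical per-dataset configuration for a generalized ingest pipeline.
# All keys and values are illustrative assumptions, not an existing schema.
dataset_config = {
    "collection": "SPL3SMP",                                   # SMAP L3 soil moisture (passive)
    "discovery": {"source": "cmr", "concept_id": "<concept-id>"},  # or {"source": "s3", "prefix": "..."}
    "processing": {"function": "hdf5_to_zarr"},                # custom conversion step
    "publication": {"target": "stac", "catalog": "veda"},      # generate + publish STAC metadata
}
```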

Right now, this workflow falls into the "manual" or "local" category of a dataset workflow. Regardless of whether we get its functions into cloud-hosted services, we still want to record the code used to generate and publish the cloud-optimized format.

So the next steps (not necessarily in this order) for completing the "local" version of this workflow would be:

For creating a cloud deployment, I think the workflow could be broken up into:
discover (CMR) -> (map) download files -> (wait for all map iterators to finish) generate Zarr -> publish STAC
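As a sketch of that breakdown, the chain could be expressed as a Step Functions state machine with the CDK in Python roughly as below. This is not code from this PR; the stack, construct names, Lambda packaging, and payload paths are all assumptions.

```python
# Hypothetical CDK sketch: discover (CMR) -> map(download) -> generate Zarr -> publish STAC.
from aws_cdk import Stack, Duration, aws_lambda as _lambda, aws_stepfunctions as sfn, aws_stepfunctions_tasks as tasks
from constructs import Construct


class SmapZarrPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        def handler(name: str) -> _lambda.Function:
            # Each step is assumed to be packaged as a Lambda under lambdas/<name>/.
            return _lambda.Function(
                self, name,
                runtime=_lambda.Runtime.PYTHON_3_9,
                handler="handler.main",
                code=_lambda.Code.from_asset(f"lambdas/{name.lower()}"),
                timeout=Duration.minutes(15),
            )

        discover = tasks.LambdaInvoke(self, "DiscoverCMR", lambda_function=handler("Discover"))

        # Fan out over the granules returned by discovery; the Map state only
        # completes when every iterator finishes, which provides the
        # "wait for all downloads" barrier before the Zarr generation step.
        download = sfn.Map(self, "DownloadGranules", items_path="$.Payload.granules")
        download.iterator(tasks.LambdaInvoke(self, "DownloadFile", lambda_function=handler("Download")))

        generate_zarr = tasks.LambdaInvoke(self, "GenerateZarr", lambda_function=handler("GenerateZarrFn"))
        publish_stac = tasks.LambdaInvoke(self, "PublishSTAC", lambda_function=handler("PublishFn"))

        sfn.StateMachine(
            self, "SmapIngest",
            definition=discover.next(download).next(generate_zarr).next(publish_stac),
        )
```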

There are a couple options for moving forward with a cloud deployment of this workflow:

  • break up the steps and write CDK code that runs the workflow above as a Step Functions state machine in a new workflow
  • integrate with the current simple/generalized ingest being developed in Add bmhd metadata #111; however, this would require an almost entirely new (but configurable) set of functions for that workflow
  • Should we write a pangeo-forge recipe for this dataset? If we're interested in exploring this option, I will ask @sharkinsspatial what the next steps should be. I believe we can write the recipe, but I'm not sure whether we can or should run it on the pangeo-forge infrastructure (see pangeo-forge/staged-recipes#30, an example pipeline for SMAP Sea Surface Salinity). A minimal recipe sketch follows this list.
    • If we move forward with using pangeo-forge, we will need to determine how the data would be written to the VEDA data store and published to the VEDA API.
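For reference, the recipe sketch mentioned above, loosely modeled on the older pangeo-forge-recipes API used by the staged-recipes example linked earlier. The URL template, date range, group name, and chunking are all assumptions; real SMAP L3 granules are distributed through NSIDC/Earthdata.

```python
# Hypothetical pangeo-forge recipe sketch for SMAP L3 soil moisture -> Zarr.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range("2015-04-01", "2021-12-31", freq="D")  # assumed temporal coverage


def make_url(time):
    # Placeholder URL template; the real endpoint and naming convention will differ.
    return (
        "https://example.com/SMAP/SPL3SMP/"
        f"{time:%Y.%m.%d}/SMAP_L3_SM_P_{time:%Y%m%d}.h5"
    )


pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 30},  # assumed chunking
    xarray_open_kwargs={
        "group": "Soil_Moisture_Retrieval_Data_AM",  # assumed HDF5 group name
        "engine": "h5netcdf",
    },
)
```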

@ashiklom
Collaborator Author

@abarciauskas-bgse Thanks for the detailed response!

I'm not sure I'll be of much use on most of the points above, so here's another idea: How about we create a separate public repository for manual workflows like this that we will be developing for EIS and related work? Then, you all can use that repository as a collection of use cases for creating automated ingest workflows, Pangeo Forge recipes, etc. I think an approach like that may be more efficient than me trying to learn all of these procedures.

Thoughts?

@abarciauskas-bgse
Collaborator

I like that idea. We might want to iterate on it as we develop a longer-term process for taking science workflows to the cloud, but this sounds good for now. How about a repo named "veda-data-scripts" or "veda-data-processing"? I'm wondering if we need anything that is specific to each EIS application, but I believe many of these datasets will be shared, so it probably makes more sense to have one shared repository for all EIS and VEDA science applications.

@ashiklom
Collaborator Author

ashiklom commented May 3, 2022

Following up on this. I just created https://github.com/ashiklom/veda-data-processing, which includes the SMAP workflow from this PR. All the code in there is under MIT License — you can try to do some fancy git subtree / git submodule magic to add these to this or a similar repo along with the full revision history...or just copy the code over (again...MIT License! I added my name as a contact so people know who to blame even in the absence of the git commit history).

@abarciauskas-bgse
Collaborator

Awesome, thanks @ashiklom! Should we close this PR then?

@ashiklom ashiklom closed this May 9, 2022