Hi all!

I am currently trying to take the concepts in pangeo-forge and apply them to OOI data in https://github.com/ooi-data. First off, I think this is a really great evolution of the Pangeo project, and I am looking forward to seeing where it's headed. I am currently working on converting a lot of OOI data into Zarr stores. I tried using Prefect and Dask for this, running them via a K8s CronJob, but I stumbled into a lot of roadblocks around getting the status and history of the data pipeline and seeing whether anything broke along the way.
After running across pangeo-forge a few months back, I really loved the idea of a data pipeline that combines GitHub Actions and Prefect! However, I couldn't fully fit the current pangeo-forge to my needs, since it assumes the source dataset is already sitting on a server somewhere, ready to be pulled. The OOI system works differently from other systems: the user has to request the data, wait, and then fetch it. As far as I can see, there isn't a way to express that step in pangeo-forge. One solution I thought might work is to have a fetch-and-wait task within the Prefect flow, but that means a lot of sitting around for the Kubernetes pod.
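To make the request/wait/fetch pattern concrete, here is a minimal sketch of the polling loop such a fetch-and-wait task would run. Everything here is hypothetical: `check_status` stands in for whatever call checks on an OOI data request (returning `None` while the server is still staging files and the download URL once they're ready), and the intervals are placeholders.

```python
import time

def wait_for_request(check_status, poll_interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll until a data request is fulfilled or we run out of time.

    `check_status` is a hypothetical stand-in for an OOI status check:
    it returns None while the request is still being prepared, and the
    download URL once the server has staged the files.
    """
    waited = 0.0
    while waited < timeout:
        url = check_status()
        if url is not None:
            return url
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"data request not fulfilled after {timeout} s")
```

This is exactly the loop that, if run as a Prefect task, keeps a Kubernetes pod idle for the whole wait, which is why offloading it to something like a scheduled GitHub Actions job is attractive.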
Because of those roadblocks, I decided to take the concepts and use https://github.com/pangeo-forge/terraclimate-feedstock as an example to create a pangeo-forge-esque POC system, where GitHub Actions performs the request-and-wait steps, followed by a step for the actual processing.
I also added a sort of history for the request and processing stages, tracked by git, to provide full provenance for the dataset. This is not fully baked yet, but you can see an example history for request and process.
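One way to sketch that git-tracked history idea is to append one JSON record per pipeline stage to a file that gets committed after each run, so the provenance trail lives in the repo's history. The field names below are illustrative, not the actual ooi-data schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_history(history_file, stage, status, **extra):
    """Append one provenance record as a JSON line.

    Committing `history_file` after each run is what yields the
    git-tracked history; `stage` might be "request" or "process",
    `status` something like "pending", "success", or "failed".
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,
        "status": status,
        **extra,
    }
    with Path(history_file).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

An append-only JSON-lines file keeps diffs small and makes each run's outcome visible directly in the commit log.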
Then there's the issue of replicating this for all the datasets that OOI has. So I decided to use GitHub's repository templates to have a nice template to copy from, https://github.com/ooi-data/stream_template, and I found a way to keep all the dataset repos in sync with the template by using https://github.com/koj-co/update-template.

The processing step isn't actually running anything yet; I am still working on the backend logistics within my K8s cluster. But you can see the running pipeline in the screenshot below.
I just thought I should share my experience combining the power of GitHub Actions + Prefect to create a data pipeline from the ideas of pangeo-forge. Thank you for creating this great project and laying out the roadmap. I hope my ideas and prototype spark other ideas. I would love to eventually port everything I have over to pangeo-forge and contribute to the project in any capacity that I can 😄
Hi @lsetiawan, just wanted to check in to see how you're doing with this. The architecture of Pangeo Forge has evolved quite a bit since you first opened this Issue, so I wouldn't be surprised if your early experiments have needed / will need to be updated. Since this is on some level a design question related to pangeo-forge-recipes, perhaps we should move the discussion to https://github.com/pangeo-forge/pangeo-forge-recipes/issues.