-
Notifications
You must be signed in to change notification settings - Fork 4
Updated example #3
Updated example #3
Conversation
@TomAugspurger @CiaranEvans Could you review this PR? I can't apparently assign reviewers in this repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly small change suggestions @abarciauskas-bgse
I wonder if we need to look at further abstracting some stuff in here behind some more pangeo-forge
provided functionality, there's still a fair bit of knowledge needed for this kind of Python work..
README.md
Outdated
```bash | ||
conda env create -f=envirnoment.yml | ||
conda activate pangeo-pipeline | ||
python recipes/pipeline.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python recipes/pipeline.py | |
python recipe/pipeline.py |
README.md
Outdated
Configure S3 Access: | ||
|
||
```bash | ||
cp recipe/config.yml.example config.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see recipe/config.yml.example
in this branch
|
||
## Github Workflows | ||
|
||
NOTE: The scripts and workflows in `.github/` are out of date with the current example code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting on my pragmatic hat, if they're out of date, why don't we just remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that but I also thought github workflows are something that should be a part of this repo (if we decide to keep this repo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's okay, but we can put them in if we need them, the benefit of this statement is minimal I'd argue (and the inclusion of stuff that's not being used).
We shouldn't be afraid of deleting stuff, that's what git's for in the end of the day 🥳
recipe/pipeline.py
Outdated
import argparse | ||
from fsspec.implementations.local import LocalFileSystem | ||
import os | ||
import pandas as pd | ||
import pangeo_forge | ||
import pangeo_forge.utils | ||
from pangeo_forge.tasks.http import download | ||
from pangeo_forge.tasks.xarray import combine_and_write | ||
from pangeo_forge.tasks.zarr import consolidate_metadata | ||
from prefect import Flow, Parameter, task, unmapped | ||
|
||
# We use Prefect to manage pipelines. In this pipeline we'll see | ||
# * Tasks: https://docs.prefect.io/core/concepts/tasks.html | ||
# * Flows: https://docs.prefect.io/core/concepts/flows.html | ||
# * Parameters: https://docs.prefect.io/core/concepts/parameters.html | ||
|
||
# A Task is one step in your pipeline. The `source_url` takes a day | ||
# like '2020-01-01' and returns the URL of the raw data. | ||
|
||
|
||
@task | ||
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe | ||
from pangeo_forge.storage import CacheFSSpecTarget, FSSpecTarget | ||
from pangeo_forge.executors import PythonPipelineExecutor, PrefectPipelineExecutor | ||
import prefect | ||
from prefect import task, Flow | ||
import shutil | ||
import s3fs | ||
import tempfile | ||
import yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we going with on linting/codestyles? AFAIK this isn't PEP8 compliant (which is being really picky, but I think we should just use something opinionated like flake8
, black
and isort
)
Imports should be grouped in the following order:
Standard library imports.
Related third party imports.
Local application/library specific imports.
You should put a blank line between each group of imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recipe is linted when it goes through staged-recipes at https://github.com/pangeo-forge/staged-recipes/blob/0d6fa51a1189a7bdb1927e7987c7a74216d13ab2/.pre-commit-config.yaml.
I'd recommend copying that (or a similar) pre-commit config over when creating a new repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue linting then is a bit too late, I'd rather we upheld formatting across all our repos as they are, rather than transforming them when merged into others
It just makes everything look uniform then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: recipes go through staged-recipes first, and then the GitHub repository is created for that recipe. So the linting will happen there first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine, but if we're using this as an example, we should probably still lint/format this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right:
I'd recommend copying that (or a similar) pre-commit config over when creating a new repository.
recipe/pipeline.py
Outdated
this_dir = os.path.dirname(os.path.abspath(__file__)) | ||
|
||
if args.storage == 's3': | ||
# Read the config file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually if there's a comment like this saying <Verb> <something>
I prefer to take out the following block into a function called that, but this is an open comment haha
recipe/pipeline.py
Outdated
|
||
if args.execution_env == 'prefect': | ||
executor = PrefectPipelineExecutor() | ||
print(executor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this print needed?
recipe/pipeline.py
Outdated
executor = PrefectPipelineExecutor() | ||
print(executor) | ||
plan = executor.pipelines_to_plan(pipeline) | ||
# The 'plan' is a prefect.Flow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's helpful to know that while the plan
is a return from a pangeo_forge module, the return object is a prefect module (and so you can operate on it with prefect functions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wont most editors/IDEs give you those prompts though?
|
||
Pre-requisites: | ||
* [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do they need prefect
installed? I know that this section is for local running but this will require prefect
if the user passes that argument
recipe/pipeline.py
Outdated
parser.add_argument('--storage', help='Optional argument to store data remotely, such as on AWS') | ||
args = parser.parse_args() | ||
|
||
# NOAA SST Specific Functions for generating a list of URLs | ||
def source_url(day: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer this to be refactored to be named get_source_url
It's great to see an update to this! Thanks @abarciauskas-bgse! I really appreciate your efforts. 😄 IMO, it may be a tiny bit premature early to be updating the recipe. We need to decide what ingredients a recipe repo should contain (see pangeo-forge/pangeo-forge-recipes#71). In my [very possibly wrong] opinion, the recipe should just contain a bare minimum. It should not
My draft PR for a new recipe (see pangeo-forge/staged-recipes#20) tries to follow this bare-bones approach. In any case, let's use this as a discussion case at our meeting today. |
Co-authored-by: Ciaran Evans <9111975+ciaranevans@users.noreply.github.com>
Co-authored-by: Ciaran Evans <9111975+ciaranevans@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. A few comments.
import tempfile | ||
import yaml | ||
|
||
parser = argparse.ArgumentParser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this type of stuff will go in the pangeo-forge CLI. So something like pangeo-forge execute recipe --execution-env=... --storage=...
.
pangeo-forge/pangeo-forge-recipes#43 is the tracking issue for that I think.
recipe/pipeline.py
Outdated
) | ||
this_dir = os.path.dirname(os.path.abspath(__file__)) | ||
|
||
if args.storage == 's3': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My hope is that we can rely on fsspec to completely abstract away the file system: https://pangeo-forge.readthedocs.io/en/latest/recipes.html?highlight=fsspec#storage.
So this could be something like
# CLI
$ pangeo-forge execute recipe --storage=file:///my-path
$ pangeo-forge execute recipe --storage=s3://my-bucket/my-path
And then in the recipe we do
fs = fsspec.get_filesystem_class(args.storage)()
README.md
Outdated
## Run the example on Prefect Cloud | ||
|
||
Pre-requisites: | ||
* Create an account on cloud.prefect.io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if a prefect cloud account is required for executing this locally with the prefect executor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so
Another open question: do we want to modify this repo in place? Or should we delete it and regenerate it with the tooling to generate repositories when merged into staged-recipes? (If we do delete it, I think we should continue reviewing here to figure out what we want the output repository look like) |
Thanks all for reviewing, changes were very helpful. I'm happy to discuss our approach to examples during the meeting today. |
@rabernat @TomAugspurger given the discussion yesterday and pangeo-forge/staged-recipes#20, do we still want an example pipeline or example recipe repository? |
IMO, the order should be:
I think that gets us closes to and end-to-end workflow from new recipe to something running on the bakeries. So, I'd say hold off on further effort here for now. |
closing as stale |
No description provided.