Argo is a great system for orchestrating data science workflows, but fiddling around with YAML files to write a pipeline can be frustrating. Although there are libraries like [kfp] and various Argo SDKs that let you specify workflows programmatically, there's still a lot of boilerplate. Running data science workloads in the cloud should be easier.
kung-fu-pipelines abstracts away the boilerplate so you can focus on the logic of your workflow steps.
This library is a work in progress. Feedback is welcome :)
This library contains two main classes. A `Step` represents a single step in a workflow. It contains information about the logic to be performed by that `Step`, along with metadata such as the parameters the `Step` expects and a description. A `Step` can automatically generate the appropriate YAML for performing that step.
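For illustration, wrapping a preprocessing function in a `Step` might look roughly like the sketch below. The import path and the constructor arguments (`name`, `func`, `description`) are assumptions made for this example, not the library's confirmed API.

```python
# Hypothetical sketch -- the import path and constructor arguments are
# illustrative assumptions, not the library's confirmed API.
from kung_fu_pipelines import Step

def preprocess(input_path: str, output_path: str) -> None:
    """Read raw data from input_path, clean it, and write it to output_path."""
    ...

preprocess_step = Step(
    name="preprocess-training-data",
    func=preprocess,
    description="Clean and normalize the raw training data",
)
```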
A `Workflow` is a template for organizing `Step` objects together in a certain way. Many types of pipelines follow the same general structure, so you can create a `Workflow` which encapsulates this structure, instantiate it with the `Step`s relevant to your specific pipeline, and then generate the Argo YAML automatically.
For example, a machine learning `Workflow` might have the following steps:
- Construct reference dataset
- Train/test split
- Preprocess training data
- Train model
- Evaluate model on test set
You could create a `Workflow` which has these steps, or whatever variation you need, and then for each of your experiments just use that `Workflow` while swapping in the appropriate `Step` objects.
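As a rough sketch of what composing such a pipeline might look like (the `Step` and `Workflow` class names come from the description above, but the constructor arguments and the `generate_yaml()` method are assumptions for illustration only):

```python
# Hypothetical sketch -- the import path, constructor arguments, and the
# generate_yaml() method are illustrative assumptions, not the confirmed API.
from kung_fu_pipelines import Step, Workflow

# Stub functions standing in for each stage's real logic.
def construct_reference_dataset(): ...
def train_test_split(): ...
def preprocess(): ...
def train_model(): ...
def evaluate(): ...

# One Step per stage of the machine learning pipeline described above.
steps = [
    Step(name="construct-reference-dataset", func=construct_reference_dataset),
    Step(name="train-test-split", func=train_test_split),
    Step(name="preprocess-training-data", func=preprocess),
    Step(name="train-model", func=train_model),
    Step(name="evaluate-model", func=evaluate),
]

# The same Workflow template can be reused across experiments by swapping Steps.
ml_workflow = Workflow(name="ml-experiment", steps=steps)
print(ml_workflow.generate_yaml())  # emit the Argo workflow YAML
```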
`pip install kung-fu-pipelines`
Pull requests, issues, questions, and comments are welcome. You can also reach me directly at skhan8@mail.einstein.yu.edu.