What infrastructure is in place for handling dependencies between pipelines? #35
Comments
Hi Sean, It can be done in batches: you run one pipeline and save its data, then pick that data up with another pipeline, since I/O throughput with Google Cloud Storage is very fast. This is basically a variation of what WDL does via Cromwell with JES (the Pipelines API) as a backend. Like you, I come from the same background, and I suggested a possible approach a while ago here: But the purpose of the Pipelines API in Google Genomics is different, and it's easier to just quote Matt:
There are other connected-pipeline implementations one can use on Google Cloud, like Dataflow, which come closer to what you are looking for, but they require that the data be loaded into the Google Genomics API first. Here is the repository with examples: https://github.com/googlegenomics/dataflow-java If you have not used the Dataflow API before, this page shows how to construct a Dataflow pipeline: https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline I still think that connected pipelines is a critical feature of the Pipelines API. Hope it helps,
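To make the batch hand-off through Cloud Storage concrete, here is a minimal Python sketch, assuming the google-api-python-client and the v1alpha2 Genomics service. The project, bucket, Docker images, commands, and parameter names are illustrative placeholders, and the request bodies are simplified rather than complete Pipelines API examples.

```python
from googleapiclient.discovery import build

# Placeholder project and hand-off path: step 1 writes it, step 2 reads it.
PROJECT = "my-project"
INTERMEDIATE_BAM = "gs://my-bucket/work/sample1.aligned.bam"

service = build("genomics", "v1alpha2")

# Step 1: align reads and write the BAM to the shared Cloud Storage path.
align_step = {
    "ephemeralPipeline": {
        "projectId": PROJECT,
        "name": "align",
        "docker": {"imageName": "gcr.io/my-project/aligner", "cmd": "bash /align.sh"},
        "outputParameters": [{"name": "alignedBam"}],
    },
    "pipelineArgs": {
        "projectId": PROJECT,
        "outputs": {"alignedBam": INTERMEDIATE_BAM},
        "logging": {"gcsPath": "gs://my-bucket/logs/align"},
    },
}

# Step 2: call variants, reading the same path the first step produced.
call_step = {
    "ephemeralPipeline": {
        "projectId": PROJECT,
        "name": "call-variants",
        "docker": {"imageName": "gcr.io/my-project/caller", "cmd": "bash /call.sh"},
        "inputParameters": [{"name": "alignedBam"}],
    },
    "pipelineArgs": {
        "projectId": PROJECT,
        "inputs": {"alignedBam": INTERMEDIATE_BAM},
        "logging": {"gcsPath": "gs://my-bucket/logs/call"},
    },
}

# Submit step 1, wait for its long-running operation to finish (see the
# blocking-wrapper sketch later in this thread), then submit step 2.
operation = service.pipelines().run(body=align_step).execute()
```

The dependency between the two steps lives only in the shared gs:// path and the order of submission; nothing in the Pipelines API itself links them.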
In addition to the Broad Institute's Cromwell runner for WDL, a new open source project called Funnel builds on top of the Pipelines API to support complex workflows defined using the Common Workflow Language (CWL). Funnel is being developed by the folks who run the DREAM challenges; the authors presented their progress at a workshop last week at the Institute for Systems Biology. Another idea is to write a tiny Python wrapper that makes each call to the Pipelines API blocking. Then you can call it multiple times in a row, with different bioinformatics tools, to build a simple pipeline. As for Paul's suggestion of using Cloud Dataflow (and the Apache Beam Python SDK), that's also a possibility: you can imagine using the Pipelines API to run individual steps and Dataflow for orchestration. Features like caching of intermediate results, retries, and preemptible VMs to reduce cost can all be added and are definitely desirable. They're outside the current scope of the Pipelines API and are probably best built on top of it, along the lines of Cromwell and Funnel. Cheers,
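A rough sketch of that tiny blocking wrapper, again assuming the google-api-python-client and the v1alpha2 service; the function name, polling interval, and error handling are my own illustrative choices, not part of the Pipelines API.

```python
import time


def run_pipeline_blocking(service, request_body, poll_seconds=30):
    """Submit one Pipelines API request and block until its operation finishes.

    Returns the final operation resource; raises if the pipeline reported an error.
    """
    operation = service.pipelines().run(body=request_body).execute()
    name = operation["name"]
    while not operation.get("done", False):
        time.sleep(poll_seconds)
        operation = service.operations().get(name=name).execute()
    if "error" in operation:
        raise RuntimeError("Pipeline failed: %s" % operation["error"])
    return operation


# Usage sketch: call the wrapper once per tool, in order. Each step starts
# only after the previous one has finished writing its outputs to Cloud
# Storage. `align_step` and `call_step` stand for request bodies like the
# ones sketched earlier in the thread.
#
#   service = build("genomics", "v1alpha2")
#   run_pipeline_blocking(service, align_step)
#   run_pipeline_blocking(service, call_step)
```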
Thanks, Jonathan, for the clarification of the scope of the Pipelines API. I agree that building on top of it, treating the Pipelines API as a "raw executor", makes a lot of sense. There are a number of really good workflow engines out there already; adapting them to the Google Genomics Pipelines API is probably just a matter of time for at least some of them.
Do you have any particular workflow engines in mind that you'd like to see supported?
Toil and Nextflow are both getting a fair amount of love and care and have at least some support for CWL, abstract executors, and cloud files. Snakemake also has a good following, but I see it in a slightly different space. I also like the approach that https://github.com/GoogleCloudPlatform/appengine-pipelines uses (it returns promises). A pretty full list is available here:
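To illustrate the promise-style flavor of that last approach without depending on appengine-pipelines itself, here is a small sketch using only the Python standard library; run_step is a stand-in for something like the blocking Pipelines API wrapper above, and the step names and dependency graph are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(name, inputs=()):
    """Stand-in for a blocking call such as run_pipeline_blocking();
    here it just sleeps and returns the (pretend) output path of the step."""
    time.sleep(1)
    return "gs://my-bucket/work/%s.out" % name


# Futures act as promises: a step that depends on earlier steps simply calls
# .result() on their futures, so independent branches run concurrently while
# dependent steps wait for their inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    align_a = pool.submit(run_step, "align_sample_a")
    align_b = pool.submit(run_step, "align_sample_b")

    def joint_call():
        bams = [align_a.result(), align_b.result()]  # wait for both alignments
        return run_step("joint_variant_calling", inputs=bams)

    variants = pool.submit(joint_call)
    print(variants.result())
```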
Thanks! We're definitely keen to help the Toil folks support the Pipelines API.
I assume you mean that Toil is Python? Nextflow is Groovy-based, though it is really a mini-language in a sense.
Yes, typo! I meant Toil is Python.
Complex bioinformatics workflows usually have dependencies from one step to the next. Is there functionality in the Pipelines API to handle this? If not, any suggestions on implementing such workflow dependencies, ideally with caching of previously computed results?