This repository has been archived by the owner on Mar 30, 2020. It is now read-only.

What infrastructure is in place for handling dependencies between pipelines? #35

Open
seandavi opened this issue Aug 9, 2016 · 8 comments

Comments

@seandavi

seandavi commented Aug 9, 2016

Complex bioinformatics workflows usually have dependencies from one step in a workflow to the next. Is there functionality in the pipelines API to handle this? If not, any suggestions on implementation of such workflow dependencies, ideally with caching of previously computed results?

@pgrosu

pgrosu commented Aug 9, 2016

Hi Sean,

It can be done in batches: you run one pipeline and save its output, then pick that output up with another pipeline, since the I/O throughput with Google Storage is very fast. This is basically a variation of what Cromwell does when running WDL with JES (the Pipelines API) as a backend.

Like you, I come from the same background, and I suggested a possible approach a while ago here:

#13 (comment)

But the purpose of Pipeline API in Google Genomics is different, and it's easier to just quote Matt:

The idea of the pipelines API as it stands is to provide this very simple, but powerful, building block. It will enable many different pipeline runners (like cromwell) to add a "run in cloud" feature to their existing workflow definition files without the need to explicitly provision a fixed cluster (like Grid Engine or Mesos).

There are other connected-pipeline implementations one can use with Google, such as Dataflow, that come closer to what you are looking for, but they require the data to be loaded into the Google Genomics API first. Below is a link to the repository with examples:

https://github.com/googlegenomics/dataflow-java

If you have not used the Dataflow API before, below is how to construct a Dataflow pipeline:

https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline

I still think that connected pipelines is a critical feature of the Pipeline API.

Hope it helps,
Paul

@jbingham
Member

jbingham commented Aug 9, 2016

In addition to the Broad Institute's Cromwell runner for WDL, a new open source project called Funnel builds on top of the Pipelines API to support complex workflows defined using the Common Workflow Language (CWL). Funnel is being developed by the folks who run the DREAM challenges. The authors presented progress at a workshop last week at the Institute for Systems Biology.

Another idea is to write a tiny python wrapper that makes each call to the Pipelines API blocking. Then you can call it multiple times in a row, with different bioinformatics tools, to build a simple pipeline.
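A minimal sketch of such a blocking wrapper might look like the following. It is not taken from this repository: `submit` and `poll` are hypothetical callables that you would implement with whatever client you use to start a pipeline and to fetch its operation, and the sketch assumes the returned operation is a dict with `name`, `done`, and (on failure) `error` keys.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Sequence

Operation = Dict[str, Any]


def run_blocking(
    submit: Callable[[Dict[str, Any]], Operation],
    poll: Callable[[str], Operation],
    request: Dict[str, Any],
    poll_interval: float = 30.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Operation:
    """Submit one pipeline request and wait until its operation is done.

    `submit` and `poll` are hypothetical hooks supplied by the caller;
    the "name"/"done"/"error" keys are an assumption about the operation.
    """
    operation = submit(request)
    waited = 0.0
    while not operation.get("done", False):
        if timeout is not None and waited >= timeout:
            raise TimeoutError(
                f"operation {operation.get('name')} still running after {waited:.0f}s"
            )
        sleep(poll_interval)
        waited += poll_interval
        operation = poll(operation["name"])
    if "error" in operation:
        raise RuntimeError(
            f"operation {operation.get('name')} failed: {operation['error']}"
        )
    return operation


def run_sequence(
    submit: Callable[[Dict[str, Any]], Operation],
    poll: Callable[[str], Operation],
    requests: Sequence[Dict[str, Any]],
    **kwargs: Any,
) -> List[Operation]:
    """Run the requests one after another; a failure stops the chain."""
    return [run_blocking(submit, poll, request, **kwargs) for request in requests]
```

Because each call returns only after the previous step has finished, a later step can safely read the files an earlier step wrote to Cloud Storage. The `sleep` parameter exists only so the loop can be exercised without real waiting.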

About Paul's suggestion of using Cloud Dataflow (and the Apache Beam python SDK), that's also a possibility. You can imagine using Pipelines API to run individual steps, and using Dataflow for orchestration.

Features like caching of intermediate results, and retries, and preemptible VMs to reduce cost, are all things that can be added, and are definitely desirable. They're outside of the current scope of the Pipelines API and are probably best built on top, along the lines of Cromwell and Funnel.
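To make "built on top" concrete, here is a rough sketch of caching and retries layered around a step runner. Everything in it is illustrative: `run` is a hypothetical callable that executes one step (for example via the Pipelines API) and returns its result, `cache` is any dict-like store, and the cache key is derived only from the step description, so it will not notice an input file whose contents changed under the same path.

```python
import hashlib
import json
from typing import Any, Callable, Dict, MutableMapping


def cache_key(step: Dict[str, Any]) -> str:
    """Hash a JSON-serializable step description, independent of key order."""
    canonical = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_cached(
    step: Dict[str, Any],
    run: Callable[[Dict[str, Any]], Any],
    cache: MutableMapping[str, Any],
    max_attempts: int = 3,
) -> Any:
    """Return a cached result if present; otherwise run with simple retries."""
    key = cache_key(step)
    if key in cache:
        return cache[key]
    last_error = None
    for _ in range(max_attempts):
        try:
            result = run(step)
        except Exception as exc:  # a real runner would retry only transient errors
            last_error = exc
            continue
        cache[key] = result
        return result
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

A workflow engine would normally persist the cache (for instance as objects in a bucket) and include input checksums in the key; the in-memory mapping here just keeps the example short.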

Cheers,
Jonathan

@seandavi
Author

seandavi commented Aug 9, 2016

Thanks, Jonathan, for the clarification of scope for the pipeline API. I agree that building on top of it, treating the pipeline API as a "raw executor", makes a lot of sense. There are a number of really good workflow engines out there already. Adapting them to the Google Genomics pipeline API is probably just a matter of time for at least some of them.

@jbingham
Member

jbingham commented Aug 9, 2016

Do you have any particular workflow engines in mind that you'd like to see adapted to support the Pipelines API? Just curious which you like best.

@seandavi
Author

seandavi commented Aug 9, 2016

Toil and nextflow are both getting a fair amount of love-and-care and have at least some support for CWL, abstract executors, and cloud files. Snakemake also has a good following, but I see this in a slightly different space. I also like the approach that https://github.com/GoogleCloudPlatform/appengine-pipelines uses (returns promises). A pretty full list is available here:
https://github.com/pditommaso/awesome-pipeline
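For readers unfamiliar with the promise style, the idea can be sketched with Python's standard `concurrent.futures`. This is not the appengine-pipelines API; `submit_after` is a hypothetical helper written for this illustration, in which each step returns a future and a downstream step receives the results of the futures it depends on.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


def submit_after(
    executor: ThreadPoolExecutor,
    fn: Callable[..., Any],
    *deps: Future,
) -> Future:
    """Schedule fn to run on the results of the futures it depends on."""

    def task() -> Any:
        # Blocks this worker until every upstream step has produced a result;
        # an exception in an upstream step is re-raised here and propagates.
        return fn(*[dep.result() for dep in deps])

    return executor.submit(task)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        aligned = pool.submit(lambda: "aligned.bam")
        indexed = submit_after(pool, lambda bam: bam + ".bai", aligned)
        print(indexed.result())
```

The two independent upstream steps of a larger graph could run in parallel in the same way, with the dependent step starting only once both futures resolve. Dependencies should be submitted before the steps that wait on them, so that a waiting step never occupies a worker its own dependency still needs.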

@jbingham
Member

jbingham commented Aug 9, 2016

Thanks! We're definitely keen to help the Toil folks support Pipelines API and Google Cloud Platform generally. What I really like about nextflow is that it's python.

@seandavi
Author

seandavi commented Aug 9, 2016

I assume that you mean Toil is Python? Nextflow is Groovy-based, though it is really a mini-language in a sense.

@jbingham
Member

jbingham commented Aug 9, 2016

Yes, typo! I meant Toil is python.
