
"dry run" option? #844

Open
jemunro opened this issue Aug 23, 2018 · 17 comments · May be fixed by #4214
Comments

@jemunro

jemunro commented Aug 23, 2018

When using the -resume option, it would be helpful to see which processes are going to be run and which will be reused from the cache.
Additionally, it would be useful to indicate which hashes have changed, requiring a cached process to be re-run. I know this can be done manually with the -dump-hashes option, but that is not very user-friendly.

@pditommaso
Member

Yes, it could make sense to have a kind of dry run showing which cached tasks would be used.

@fmorency

fmorency commented Jan 7, 2019

+1! Would be super useful!

@blacky0x0

blacky0x0 commented Feb 20, 2019

Just some thoughts.
A dry-run or plan command should produce the full graph structure, or just a part of the DAG, in text or UI mode. It should fail fast while constructing the AST. As a workaround, parts of the AST could be validated with a JSON schema validator.

nextflow plan -target=process.1A_prepare_genome_samtools
nextflow plan -target=module.'rnaseq.nf'.fastqc

@stevekm
Contributor

stevekm commented Jan 14, 2020

Was this ever implemented?

@kopardev

Is the dry-run or plan feature available in the latest Nextflow version?

@stale

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 27, 2020
@pditommaso pditommaso added stale and removed wontfix labels Apr 27, 2020
@jimhavrilla

Would still love this

@stale stale bot removed the stale label Jun 25, 2020
@stale

stale bot commented Nov 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 23, 2020
@pditommaso
Member

Nextflow now supports the ability to simulate the execution using command stubs. See #1774.

More here https://www.nextflow.io/docs/edge/process.html#stub
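For reference, a minimal sketch of what a stubbed process looks like (the process name, files, and commands here are illustrative, not taken from this thread):

```nextflow
process ALIGN {
    input:
    path reads

    output:
    path 'aligned.bam'

    // Real command, used in a normal run
    script:
    """
    bwa mem ref.fa $reads | samtools sort -o aligned.bam
    """

    // Dummy command, executed instead of the script in a stub run
    stub:
    """
    touch aligned.bam
    """
}
```

Running the pipeline with `nextflow run main.nf -stub-run` executes the stub blocks, so the whole DAG is exercised without the real compute.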

@nick-youngblut
Contributor

@pditommaso a stub run does not cover the following situation, correct? The situation:

  • The pipeline developer runs their Nextflow pipeline (e.g., large-scale requiring large amounts of compute resources & time)
  • The pipeline dies much of the way through the run (e.g., after days of running) due to an error in one step of the code (e.g., the code didn't scale to tens of thousands of files)
  • The pipeline developer modifies the code
  • The pipeline developer wants to know which parts of the pipeline will be re-run as a result of the code modification

I'm currently in that situation, and I'm afraid that my modification will restart the pipeline from (near) the beginning, which means a loss of days and $$. I believe that a stub run must start from the beginning, and will thus not show me whether my modifications will result in a (near-complete) restart of my pipeline run.

I migrated to Nextflow from Snakemake, and the --dryrun feature of Snakemake has helped me many times to avoid needless computation. It would be great to have the same functionality in Nextflow.

@guarelin

I want to bump the feature-request for dry-runs as a previous Snakemake user.

In the meantime, if Nextflow were to output the reason for re-runs, that would also be helpful. Snakemake gives a reason for every rule it schedules.

@bentsherman bentsherman linked a pull request Aug 24, 2023 that will close this issue
@bentsherman
Member

I took a first pass at implementing the most recent suggestions in this thread. Namely, resume a pipeline and determine which tasks would be re-executed, without actually executing them.

Feel free to test the linked PR. Basically run a pipeline with -dry-run and it will resume, print every task that is re-executed, then exit.
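Based on the description above, usage would presumably look like this (the flag comes from the linked, unmerged PR, so the exact interface may change):

```sh
# Resume from the existing cache, print every task that would be
# re-executed, then exit without running anything
nextflow run main.nf -resume -dry-run
```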

I think it could be improved by also showing the reason why a task is executed -- no cache entry found, cache folder found but outputs are missing, etc. If someone could share an example Snakemake scheduling plan, it might be useful to compare.

The drawback of this approach is that it doesn't execute the entire pipeline, it only goes as far as the cache can take it. On the other hand, stub run can execute the entire pipeline but does not correspond to a real pipeline run.

@bentsherman bentsherman reopened this Aug 24, 2023
@stale stale bot removed stale labels Aug 24, 2023
@bentsherman
Member

See also my comments here: #1458 (comment)

@KristinaGagalova

I am new to the Nextflow community, but I would really love to have this option. I use it all the time with Snakemake.

@subwaystation

+1

@Gullumluvl

Gullumluvl commented Dec 19, 2024

See also my comments here: #1458 (comment)

Very insightful explanation. I am very interested in this issue as well, because, as some people have said, it might turn me away from Nextflow despite its dataflow model being very cool. Uncertainty about whether long-running tasks will be restarted is simply not workable for a typical research process that develops the code while simultaneously analyzing the data of interest, under time constraints (is that a bad way to proceed? Maybe, but there are many reasons why it is so in the current research system).

So your comment raises some questions:

1. about "dry resume"

dry resume: resume a pipeline and show which tasks will be re-executed without executing them (doesn't cover the entire pipeline)

Is this going to be merged into the main branch? This is practically the requested feature, in my opinion. Also, I don't understand:

The drawback of this approach is that it doesn't execute the entire pipeline, it only goes as far as the cache can take it.

(from #844 (comment)).

I think it's fine, since we precisely want to know which tasks will be resumed from the cache. Did you mean that you can't simultaneously process the workflow script to figure out the other tasks that are not in the cache?

2. about a true dry-run

I doubt we will ever implement a full "dry run" feature that lays out the entire schedule of tasks without executing them, because there is no way to do that under the dataflow programming model.

Here I lack in-depth knowledge of Nextflow to really judge, so I can only ask questions, please correct me.

  • Why? I get that in the "target" model of make/Snakemake we can easily know the execution graph statically, since all outputs are determined by the script. But what makes this impossible for the "dataflow" model? Is it because we might generate an unknown number of outputs from a given task, which is then piped to following tasks? Is it because of operators like branch? Or because some parts of the execution graph originate from reading file contents (e.g. splitCsv)?
  • Can we not mock up the outputs from these dynamically generated graph nodes? For example, if we don't know how many lines a split file has, can we generate a fixed or random number of lines, according to a pattern defined by the user? Or is that precisely what a stub run does, and irreconcilable with predicting a regular run?
  • Are there any other unsolvable limitations? Such as how the task hashes are used?

Sorry if this is naive. If there are some good resources to understand the problem I would be interested.

@bentsherman
Member

The issue with a full dry run is that task outputs can be filtered / scattered / gathered between the processes, based on runtime conditions, which means the number of tasks cannot be known without running the actual pipeline. Certain pipelines might not have this problem, but in general it is an issue.
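To make this concrete, here is a contrived sketch (all names hypothetical) where the number of downstream tasks equals the number of rows in a CSV file, which is only known once the file is read at runtime, so no static plan can enumerate the tasks in advance:

```nextflow
process PROCESS_SAMPLE {
    input:
    val sample_id

    script:
    """
    echo "processing ${sample_id}"
    """
}

workflow {
    // One PROCESS_SAMPLE task per CSV row: the fan-out depends on the
    // contents of samples.csv, not on the workflow script itself
    Channel
        .fromPath('samples.csv')
        .splitCsv(header: true)
        .map { row -> row.sample_id }
        | PROCESS_SAMPLE
}
```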

We could try to mock the inputs and simulate the actual behavior, but the dry run will diverge from the actual behavior as the pipeline goes further along, to the point of being useless or even misleading.

For this reason I think a more realistic goal is to support a "dry resume" whose purpose is simply to show how much of a run is cached. I agree that the drawback isn't a big deal, it's still useful. It would also be useful to explain immediately why a task wasn't cached, rather than having to do two extra runs and compare hashes as shown here.
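For context, the manual workaround being referred to is roughly the following (log file names are illustrative, and the exact grep pattern depends on the log format of the Nextflow version in use):

```sh
# Run twice with -dump-hashes, then diff the task hash entries
# to see which inputs invalidated the cache
nextflow -log run1.log run main.nf -resume -dump-hashes
nextflow -log run2.log run main.nf -resume -dump-hashes
diff <(grep 'cache hash' run1.log) <(grep 'cache hash' run2.log)
```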

I think these two things would alleviate most of the concerns with resume. The linked PR has the progress I've made so far, but there is still some work to do and I've just been focused on other things.
