"dry run" option? #844
Yes, it could make sense to have a kind of dry run showing which cached tasks would be used.
+1! Would be super useful!
Just thoughts.
Was this ever implemented?
Is a dry-run or plan feature available in the latest Nextflow version?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Would still love this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Nextflow now supports the ability to simulate execution using command stubs. See #1774 and the docs: https://www.nextflow.io/docs/edge/process.html#stub
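For reference, a stub is declared alongside the real command and activated with `nextflow run ... -stub-run`. A minimal sketch (the process, tools, and file names here are made up for illustration):

```nextflow
// Hypothetical process showing the stub feature: with `-stub-run`,
// the `stub:` block runs instead of `script:`, so the pipeline's
// wiring can be exercised without doing the real (expensive) work.
process ALIGN {
    input:
    path reads

    output:
    path 'aligned.bam'

    script:
    """
    bwa mem ref.fa $reads | samtools sort -o aligned.bam
    """

    stub:
    """
    touch aligned.bam
    """
}
```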
@pditommaso a stub run does not cover the following situation, correct? I have modified a pipeline partway through a long run, and I'm afraid that my modification will restart the pipeline from (near) the beginning, which means a loss of days and $$. I believe a stub run must start from the beginning, and will thus not show me whether my modifications will result in a (near-complete) restart of my pipeline run. I migrated to Nextflow from Snakemake, where a dry run makes this easy to check.
As a previous Snakemake user, I want to bump the feature request for dry runs. In the meantime, if Nextflow were to output the reason for re-runs, that would also be helpful; Snakemake gives a reason for every rule.
I took a first pass at implementing the most recent suggestion in this thread: resume a pipeline and determine which tasks would be re-executed, without actually executing them. Feel free to test the linked PR by running a pipeline with the new option. I think it could be improved by also showing the reason why a task is executed: no cache entry found, cache folder found but outputs missing, etc. If someone could share an example Snakemake scheduling plan, it might be useful for comparison. The drawback of this approach is that it doesn't execute the entire pipeline; it only goes as far as the cache can take it. On the other hand, a stub run can execute the entire pipeline but does not correspond to a real pipeline run.
See also my comments here: #1458 (comment)
I am new to the Nextflow community, but I would really love to have this option. I use it all the time with Snakemake.
+1
Very insightful explanation. I am very interested in this issue as well because, as some people have said, it might turn me away from Nextflow, despite its dataflow model being very cool. It's simply not acceptable to have uncertainty about long-running tasks being restarted in a typical research process that develops the code while simultaneously analyzing the data of interest, under time constraints (is that a bad way to proceed? maybe, but there are many reasons why this is so in the current research system). So your comment raises some questions:

1. About "dry resume": is this going to be limited by the drawback you mention, that it "only goes as far as the cache can take it" (from #844 (comment))? I think that's fine, since we precisely want to know which tasks will be resumed from the cache. Or did you mean that you can't simultaneously process the workflow script to figure out the other tasks not in the cache?
2. About a true dry run: here I lack the in-depth knowledge of Nextflow to really judge, so I can only ask questions; please correct me.

Sorry if this is naive. If there are good resources for understanding the problem, I would be interested.
The issue with a full dry run is that task outputs can be filtered / scattered / gathered between processes based on runtime conditions, which means the number of tasks cannot be known without running the actual pipeline. Certain pipelines might not have this problem, but in general it is an issue. We could try to mock the inputs and simulate the actual behavior, but the dry run would diverge from the actual behavior as the pipeline goes further along, to the point of being useless or even misleading.

For this reason, I think a more realistic goal is to support a "dry resume" whose purpose is simply to show how much of a run is cached. I agree that the drawback isn't a big deal; it's still useful. It would also be useful to explain immediately why a task wasn't cached, rather than having to do two extra runs and compare hashes as shown here. I think these two things would alleviate most of the concerns with resume.

The linked PR has the progress I've made so far, but there is still some work to do and I've been focused on other things.
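To make the runtime-dependence point concrete, here is a hypothetical sketch (SPLIT and ALIGN are made-up processes) of a pipeline whose task count cannot be known statically:

```nextflow
// The number of ALIGN tasks equals the number of chunk files SPLIT
// writes, which depends on the input data and is therefore unknown
// until SPLIT has actually executed -- a full dry run would have to
// guess this fan-out.
workflow {
    Channel.fromPath('reads.fq')
        | SPLIT          // emits a data-dependent number of chunks
        | flatten
        | ALIGN          // one task per chunk; count unknowable up front
}
```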
When using the -resume option, it would be helpful to see which processes are going to be run and which will be reused from the cache.
Additionally, it would be useful to indicate which hashes have changed, requiring a cached process to be re-run. I know this can be done manually with the -dump-hashes option, but it is not very user-friendly.
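For context, the manual procedure looks roughly like this (a sketch; the exact format of the dumped hash entries in `.nextflow.log` varies between Nextflow versions, so the diff step is approximate):

```
# Run twice with -dump-hashes, keeping separate log files via the
# top-level -log option, then diff the logs to spot tasks whose
# cache hashes changed between the runs.
nextflow -log run1.log run main.nf -dump-hashes
nextflow -log run2.log run main.nf -dump-hashes -resume
diff run1.log run2.log | less    # timestamps add noise; focus on hash lines
```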