Automatic task cleanup #3849
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
…emote paths) Signed-off-by: Ben Sherman <bentshermann@gmail.com>
As I mentioned in the other PR, this eager cleanup currently won't work correctly with file publishing because publishing is asynchronous. So for each task, we need to wait for any files to be published first...
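To illustrate the ordering constraint, here is a minimal sketch in plain Groovy, not the PR's implementation. It assumes each publish operation can be represented as a `CompletableFuture` (a hypothetical stand-in for whatever the publisher actually uses) and only runs the cleanup step once every publish future for the task has completed.

```groovy
import java.util.concurrent.CompletableFuture

// Hypothetical stand-ins for the asynchronous publish jobs of one task
def publishJobs = [ new CompletableFuture<String>(), new CompletableFuture<String>() ]

// Flag standing in for "the task directory was deleted"
def cleaned = false

// Cleanup runs only after ALL publish jobs have completed
def cleanup = CompletableFuture
    .allOf(publishJobs as CompletableFuture[])
    .thenRun { cleaned = true }

publishJobs[0].complete('a.txt')
assert !cleaned          // one file is still being published

publishJobs[1].complete('b.txt')
cleanup.join()
assert cleaned           // safe to delete the task directory now
```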
If I understand correctly, your current method is to wait until each process has produced its output and then perform cleanup. It may be better to wait until all processes of a single run have finished and then delete. This would probably be easier to run, since you would not need to worry about which files need to be kept, and it would still save a significant amount of space. Example:
The cleanup strategy needs to be reworked due to the upcoming workflow output DSL #4670. If the publish definition is moved to the workflow level, the task has no idea which of its outputs will be published, and the cleanup observer can't delete a task until it knows that all of the outputs that are going to be published have been published.
A simple solution would be to mark an output for deletion when it is published (and downstream tasks are done with it, etc). The downside is that outputs that are not published are not deleted.
Thinking further, the current POC of the output DSL just appends some "publish" operators to the DAG, so I might be able to trace each process output through the DAG to see if it's connected to a publish op. That way we know if an output can "not" be published and delete it sooner. It still misses files that "could" be published but aren't at runtime, e.g. because they get filtered out by a
Finally, we can always fall back to the existing strategy of "delete whatever is left at the end". As long as the eager cleanup can delete enough files early on, it should be enough to move many pipelines from un-runnable to runnable.
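The bookkeeping described above could look roughly like the following sketch. It is an illustration under assumptions, not code from this PR: `OutputTracker` and its methods are hypothetical names, the `willPublish` flag stands in for whatever the DAG trace would tell us about a connection to a publish op, and "deletion" is only recorded in a list rather than performed.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical bookkeeping class; not part of Nextflow
class OutputTracker {
    // number of downstream tasks that still need each output
    private Map<Path,Integer> pendingConsumers = [:]
    // outputs that are connected to a publish op and not yet published
    private Set<Path> pendingPublish = [] as Set
    // outputs considered safe to delete, in order
    List<Path> deleted = []

    void register(Path output, int consumers, boolean willPublish) {
        pendingConsumers[output] = consumers
        if( willPublish )
            pendingPublish << output
        tryDelete(output)
    }

    void onConsumerDone(Path output) {
        pendingConsumers[output] = pendingConsumers[output] - 1
        tryDelete(output)
    }

    void onPublished(Path output) {
        pendingPublish.remove(output)
        tryDelete(output)
    }

    private void tryDelete(Path output) {
        def idle = pendingConsumers[output] == 0
        def published = !(output in pendingPublish)
        if( idle && published && !(output in deleted) )
            deleted << output   // a real implementation would delete the file here
    }
}

def tracker = new OutputTracker()
def bam = Paths.get('work/ab/1234/sample.bam')   // consumed once and published
def tmp = Paths.get('work/ab/1234/scratch.tmp')  // consumed once, never published

tracker.register(bam, 1, true)
tracker.register(tmp, 1, false)

tracker.onConsumerDone(tmp)
assert tracker.deleted == [tmp]        // unpublished output is deleted early

tracker.onConsumerDone(bam)
assert tracker.deleted == [tmp]        // still waiting for publishing

tracker.onPublished(bam)
assert tracker.deleted == [tmp, bam]
```

Anything that never satisfies both conditions is left for the existing "delete whatever is left at the end" fallback.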
Many thanks for working on this. I know many people will already benefit from this even if resumability doesn't work, but I wanted to share our use case for why resumability is important, since I haven't seen anyone mention it yet :)
Our pipeline will be the backbone of a platform that will continuously accept new samples and archive all the results. The first part of the pipeline is QC, filtering, genome assembly, etc., followed by various genome annotation methods. Since annotation methods tend to evolve over time, we want to be able to quickly re-run all samples we have already analyzed. That's why we want to keep the
Hoping that this is a good use case to have in mind while developing the resuming functionality. For now we will try the GEMmaker method. Thank you!!
One way to handle that would be to publish files that you don't want to be lost. They will be deleted from the work directory, but when the automatic cleanup recovers a deleted task on resume, it will also verify that the published files are up to date (the file checksums will be stored in the
Nice, it makes sense that it can reference published files. Is that logic already implemented in the nf-boost plugin?
No, nf-boost doesn't do anything with resume |
// -- recursively check for cached outputs
List<HashCode> queue = [ seedHash ]
Map<HashCode,TaskHandler> handlers = [:]

while( !queue.isEmpty() ) {
Here we are recursively searching for downstream tasks that are still whole (or the end of the pipeline).
Note that we are skipping the dataflow logic between these processes. Essentially we are assuming that we only need to check the consumer tasks.
However, a user could add a new process between runs that depends on this deleted task. In this case we need to re-execute the deleted task. So this approach of only checking the consumer tasks will not work.
Instead, we can optimistically finalize the deleted task and emit "fake" output files (e.g. DeletedPath) that point to the deleted task. A deleted file can provide its hash and maybe some other metadata, but if downstream code tries to access the file contents, it causes the deleted task to be re-executed.
But how to manage multiple file objects pointing to the same deleted task? They will need to be synchronized so that the deleted task is re-executed only once (this is already done in FilePorter). Then the fake file object needs to be "re-hydrated" with the re-computed file and delegate to it.
I don't think the re-computed task will be able to simply emit its outputs. I think it will need to communicate the re-computed files to the fake files that are already downstream. So the deleted file might need to also know something about where it sits in the task outputs.
The above approach should also free us from tracking the downstream consumers of a task. A nice bonus, since tracking downstream consumers doesn't match up well with our plans for provenance tracking (based on tracking upstream producers).
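A rough sketch of the "fake" output file idea, to make the moving parts concrete. Everything here is hypothetical: `DeletedTask`, `DeletedPath`, and the `rerun` closure are illustrative names, file contents are modeled as plain strings, and the real synchronization would more likely follow what FilePorter already does. The point is only that metadata is available without re-execution, that reading contents triggers re-execution, and that several placeholders sharing one task re-run it only once.

```groovy
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical model of a task whose work directory was deleted
class DeletedTask {
    final String hash
    // re-executes the task and returns output name -> contents (assumed shape)
    final Closure<Map<String,String>> rerun
    // populated on first rehydration
    private Map<String,String> outputs

    DeletedTask(String hash, Closure<Map<String,String>> rerun) {
        this.hash = hash
        this.rerun = rerun
    }

    // synchronized so that concurrent readers trigger a single re-execution
    synchronized Map<String,String> rehydrate() {
        if( outputs == null )
            outputs = rerun.call()
        return outputs
    }
}

// Hypothetical placeholder emitted in place of a deleted output file
class DeletedPath {
    final DeletedTask task
    final String name       // which task output this placeholder stands for
    final String checksum   // metadata available without re-execution

    DeletedPath(DeletedTask task, String name, String checksum) {
        this.task = task
        this.name = name
        this.checksum = checksum
    }

    // accessing the contents forces the deleted task to be re-executed
    String getText() {
        return task.rehydrate()[name]
    }
}

def runs = new AtomicInteger()
def task = new DeletedTask('ab12', {
    runs.incrementAndGet()
    ['out.txt': 'hello', 'log.txt': 'done']
})
def out = new DeletedPath(task, 'out.txt', 'c1')
def log = new DeletedPath(task, 'log.txt', 'c2')

assert out.checksum == 'c1' && runs.get() == 0   // metadata only, no re-execution
assert out.text == 'hello'                       // first content access re-runs the task
assert log.text == 'done'                        // second placeholder reuses the result
assert runs.get() == 1
```

Each placeholder carries the output name, which reflects the point that a deleted file needs to know where it sits in the task outputs.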
Alternative to #3818
Instead of adding a temporary option to output paths, this PR facilitates automatic cleanup through the cleanup config option. By setting cleanup = 'eager', Nextflow will automatically delete task directories during the workflow run. Caveats are documented in the PR.
TODO: