Cleanup of work directory without losing command logs #668

Closed
stevekm opened this issue Apr 26, 2018 · 4 comments

@stevekm
Contributor

stevekm commented Apr 26, 2018

After a pipeline completes, the 'work' directory tends to consume a lot of storage space. Deleting the work directory is not ideal, though, because alongside the files produced by your processes it also contains files such as '.command.run' and '.command.log', which are worth preserving for debugging, troubleshooting, and record keeping. It would be very useful to have a feature that reclaims the storage used by the 'work' directory without losing these items.

I came up with a script here to attempt to implement this by:

  • removing 'work' subdirectories that are not listed in the 'trace' file (i.e. that do not belong to the most recent pipeline execution)

  • replacing all files that do not match the pattern '.command*' or '.exitcode' with "file stubs" (empty files of the same name)
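The first step above can be sketched in Python roughly as follows. This is a minimal illustration, not the linked script: the function names are hypothetical, and it assumes a tab-separated trace file with a `hash` column holding truncated hashes such as `ab/cdef12`, with task directories laid out as `work/<xx>/<hash>`.

```python
import csv
import shutil
from pathlib import Path


def read_trace_hashes(trace_file):
    """Return the task hashes listed in a tab-separated trace file."""
    with open(trace_file, newline="") as handle:
        return {row["hash"] for row in csv.DictReader(handle, delimiter="\t")}


def prune_work_dir(work_dir, trace_file, dry_run=True):
    """List (and optionally delete) task dirs that are not in the trace file."""
    hashes = read_trace_hashes(trace_file)
    if not hashes:
        # Refuse to continue: an empty trace would mark every task dir for removal.
        raise ValueError(f"no task hashes found in {trace_file}")
    removed = []
    for task_dir in sorted(Path(work_dir).glob("*/*")):
        if not task_dir.is_dir():
            continue
        # The trace holds a truncated hash, so compare by prefix.
        rel = task_dir.relative_to(work_dir).as_posix()
        if not any(rel.startswith(h) for h in hashes):
            removed.append(task_dir)
            if not dry_run:
                shutil.rmtree(task_dir)
    return removed
```

Calling `prune_work_dir("work", "trace.txt")` with the default `dry_run=True` only reports what would be deleted, which is a cheap safeguard before removing anything.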

However, a native Nextflow implementation of something similar might be helpful.

Alternatively, these aspects could be instead integrated into the 'trace' output (stdout & stderr logs, directory structure & contents, etc.).
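To make this alternative more concrete, here is a rough Python sketch of what gathering those items into one record per task could look like. It is not an existing Nextflow feature; the function name `collect_task_records`, the record layout, and the choice of log files are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Per-task files whose text is copied into the record (an assumed selection).
LOG_NAMES = (".command.sh", ".command.out", ".command.err", ".command.log", ".exitcode")


def collect_task_records(work_dir):
    """Build one dict per work/<xx>/<hash> task directory."""
    records = []
    for task_dir in sorted(Path(work_dir).glob("*/*")):
        if not task_dir.is_dir():
            continue
        record = {
            "workdir": task_dir.as_posix(),
            # Directory contents by name only, relative to the task directory.
            "files": sorted(
                p.relative_to(task_dir).as_posix() for p in task_dir.rglob("*")
            ),
        }
        for name in LOG_NAMES:
            log = task_dir / name
            if log.is_file():
                record[name] = log.read_text(errors="replace")
        records.append(record)
    return records


if __name__ == "__main__":
    print(json.dumps(collect_task_records("work"), indent=2))
```

With such a record saved next to the trace file, the 'work' directory itself could be deleted while the logs and directory listings remain available.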

@pditommaso
Member

pditommaso commented Apr 28, 2018

So your proposal is to delete all task files except the .command.* files?

@stevekm
Contributor Author

stevekm commented Apr 28, 2018

My proposal is to replace all task files with empty file stubs, to show that they were created.
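A minimal Python sketch of the stub idea, for illustration only: the function name is hypothetical, and the keep rule simply follows the '.command*' and '.exitcode' pattern mentioned in the first comment. Symlinks are skipped so that files linked in from other task directories are not truncated through the link.

```python
from pathlib import Path


def is_kept(name):
    """True for the Nextflow bookkeeping files that should stay intact."""
    return name.startswith(".command") or name == ".exitcode"


def stub_task_files(work_dir, dry_run=True):
    """Truncate every regular, non-kept file under work_dir to zero bytes."""
    stubbed = []
    for path in sorted(Path(work_dir).rglob("*")):
        # Skip directories and symlinks; only real files are replaced.
        if path.is_symlink() or not path.is_file():
            continue
        if is_kept(path.name):
            continue
        stubbed.append(path)
        if not dry_run:
            # Opening in "w" mode truncates the file while keeping its name.
            path.open("w").close()
    return stubbed
```

The file names stay in place as evidence that the outputs were created, while the storage they occupied is released.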

@stevekm
Contributor Author

stevekm commented Jun 7, 2018

Following up on this, here is a Makefile entry I have been working on to accomplish it. It performs the following steps:

  • remove subdirs in the Nextflow 'work' directory that aren't part of the latest run as per the trace file

  • resolve any symlinks in the Nextflow 'publishDir'

  • write the contents of each remaining Nextflow 'work' subdir to a file (maybe not that important, or could be adjusted to record file md5 or line count)

  • create empty 'file stubs' for all workflow files in 'work' subdirs

# ~~~~~ FINALIZE ~~~~~ #
# steps for finalizing the Nextflow pipeline 'output' publishDir and 'work' directories
# configured for parallel processing with `make finalize -j8`

# Nextflow "publishDir" directory
publishDir:=output
# Nextflow "work" directory of items to be removed
workDir:=work
TRACEFILE:=trace.txt

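# run the steps one after another; listing them as prerequisites would let
# `make -j` run them concurrently (e.g. stubbing files while they are still being copied)
finalize:
	$(MAKE) finalize-work-rm
	$(MAKE) finalize-output
	$(MAKE) finalize-work-ls
	$(MAKE) finalize-work-stubs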

## ~~~ convert all symlinks to their linked items ~~~ ##
# symlinks in the publishDir to convert to files
publishDirLinks:=
FIND_publishDirLinks:=
ifneq ($(FIND_publishDirLinks),)
publishDirLinks:=$(shell find $(publishDir)/ -type l)
endif
finalize-output:
	echo ">>> Converting symlinks in output dir '$(publishDir)' to their targets..."
	$(MAKE) finalize-output-recurse FIND_publishDirLinks=1
finalize-output-recurse: $(publishDirLinks)
# convert all symlinks to their linked items
$(publishDirLinks):
	@ { \
	destination="$@"; \
	sourcepath="$$(python -c 'import os; print(os.path.realpath("$@"))')" ; \
	echo ">>> Resolving path: $${destination}" ; \
	if [ ! -e "$${sourcepath}" ]; then echo "ERROR: Source does not exist: $${sourcepath}"; \
	elif [ -f "$${sourcepath}" ]; then rsync -va "$$sourcepath" "$$destination" ; \
	elif [ -d "$${sourcepath}" ]; then { \
	timestamp="$$(date +%s)" ; \
	tmpdir="$${destination}.$${timestamp}" ; \
	rsync -va "$${sourcepath}/" "$${tmpdir}" && \
	rm -f "$${destination}" && \
	mv "$${tmpdir}" "$${destination}" ; } ; \
	fi ; }
.PHONY: $(publishDirLinks)


## ~~~ write list of files in each subdir to file '.ls.txt' ~~~ ##
# subdirs in the 'work' dir
NXFWORKSUBDIRS:=
FIND_NXFWORKSUBDIRS:=
ifneq ($(FIND_NXFWORKSUBDIRS),)
NXFWORKSUBDIRS:=$(shell find "$(workDir)/" -maxdepth 2 -mindepth 2)
endif
# file to write 'ls' contents of 'work' subdirs to
LSFILE:=.ls.txt
finalize-work-ls:
	echo ">>> Writing list of directory contents for each subdir in Nextflow work directory '$(workDir)'..."
	$(MAKE) finalize-work-ls-recurse FIND_NXFWORKSUBDIRS=1
finalize-work-ls-recurse: $(NXFWORKSUBDIRS)
# print the 'ls' contents of each subdir to a file
$(NXFWORKSUBDIRS):
	@ls_file="$@/$(LSFILE)" ; \
	echo ">>> Writing file list: $${ls_file}" ; \
	ls -1 "$@" > "$${ls_file}"
.PHONY: $(NXFWORKSUBDIRS)


## ~~~ replace all files in 'work' dirs with empty file stubs ~~~ ##
NXFWORKFILES:=
FIND_NXFWORKFILES:=
# files in work subdirs to keep
LSFILEREGEX:=\.ls\.txt
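# names of the kept files, for reference only; stored in its own variable so that
# the target list NXFWORKFILES stays empty unless FIND_NXFWORKFILES is set
NXFWORKKEEPFILES:='.command.begin|.command.err|.command.log|.command.out|.command.run|.command.sh|.command.stub|.command.trace|.exitcode|$(LSFILE)'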
NXFWORKFILESREGEX:='.*\.command\.begin\|.*\.command\.err\|.*\.command\.log\|.*\.command\.out\|.*\.command\.run\|.*\.command\.sh\|.*\.command\.stub\|.*\.command\.trace\|.*\.exitcode\|.*$(LSFILEREGEX)'
ifneq ($(FIND_NXFWORKFILES),)
NXFWORKFILES:=$(shell find -P "$(workDir)/" -type f ! -regex $(NXFWORKFILESREGEX))
endif
finalize-work-stubs:
	echo ">>> Creating file stubs for pipeline output in Nextflow work directory '$(workDir)'..."
	$(MAKE) finalize-work-stubs-recurse FIND_NXFWORKFILES=1
finalize-work-stubs-recurse: $(NXFWORKFILES)
$(NXFWORKFILES):
	@printf '>>> Creating file stub: $@\n' && rm -f "$@" && touch "$@"
.PHONY: $(NXFWORKFILES)


## ~~~ remove 'work' subdirs that are not in the latest trace file (i.e. not part of the most recent run) ~~~ ##
# subdirs in the 'work' dir
NXFWORKSUBDIRSRM:=
FIND_NXFWORKSUBDIRSRM:=
# regex from the hashes of tasks in the tracefile to match against work subdirs
HASHPATTERN:=
ifneq ($(FIND_NXFWORKSUBDIRSRM),)
NXFWORKSUBDIRSRM:=$(shell find "$(workDir)/" -maxdepth 2 -mindepth 2)
HASHPATTERN:=$(shell python -c 'import csv; reader = csv.DictReader(open("$(TRACEFILE)"), delimiter = "\t"); print("|".join([row["hash"] for row in reader]))')
endif
finalize-work-rm:
	echo ">>> Removing subdirs in Nextflow work directory '$(workDir)' which are not included in Nextflow trace file '$(TRACEFILE)'..."
	$(MAKE) finalize-work-rm-recurse FIND_NXFWORKSUBDIRSRM=1
finalize-work-rm-recurse: $(NXFWORKSUBDIRSRM)
# remove the subdir if it's not listed in the trace hashes
$(NXFWORKSUBDIRSRM):
	@if ! echo '$@' | grep -q -E "$(HASHPATTERN)"; then \
	echo ">>> Removing subdir: $@" ; \
	rm -rf "$@" ; \
	fi
.PHONY: $(NXFWORKSUBDIRSRM)

I saw that Nextflow has a built-in 'clean' feature, though I was not clear on what exactly it cleans. Some of the items listed here might be beyond the scope of Nextflow, but my end goal is to reclaim the storage space used by unnecessary files in the 'work' subdirs without losing complete execution records.

@pditommaso
Member

Closing this in favour of #452.
