[dev branch] Checksumming/checkpointing needed to assure integrity of output files. #71
Opening this for discussion, but I was thinking of checkpointing during a run in terms of one checkpoints file. Arguments for writing one file instead of multiple:
---
I do like the idea of a single file with this information in it. A few suggestions:
The reason I am so paranoid about this: I once worked for a month on fastq files that were corrupted while downloading from the sequencing center. It emerged that even though they were paired-end, they did not have the same number of lines, and neither had a line count divisible by four (which you would expect for fastq). In addition, the order was all messed up. You'd think that would be obvious, right? Well, they were still readable, and it turns out assemblers sometimes do not do basic things like check that the names of two paired reads match!! Imagine troubleshooting a problem with a user for a month and finding it was something as simple as corrupted files.

---
We can store the calculated checksum in the checkpoints file.
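For reference, calculating such a checksum is cheap to sketch with Python's standard `hashlib`; a minimal example (the `checksum` helper here is hypothetical, not an existing Autometa function):

```python
import hashlib

def checksum(fpath, blocksize=65536):
    """Return the MD5 hex digest of fpath, read in blocks to handle large files."""
    md5 = hashlib.md5()
    with open(fpath, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            md5.update(block)
    return md5.hexdigest()
```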
I'm not sure what the difference here would be or why this would matter if we are calculating checksums. If the checksums match, the file should be unchanged regardless of the modified date. And even if it were modified, it must have been restored to the same state to produce the same checksum.
---

I think I'm going to scrap the decorator idea, as I am not sure how we would implement it.
Do you mean checksum the …?
I think placing the datetime of the calculated checksum is a good idea. I'm not convinced it is necessary for ensuring file integrity, but it is a nice way to leave some breadcrumbs for users when they are looking at past data analyses. I'm rethinking the format in terms of a table, where columns would include the following:
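As a purely hypothetical illustration, assuming columns for the file path, its checksum, and the datetime the checksum was calculated (per the discussion above), such a `checkpoints.tsv` might look like (values made up):

```text
file	checksum	checksum_datetime
orfs.faa	d41d8cd98f00b204e9800998ecf8427e	2020-04-20 14:03:11
blastp.tsv	9e107d9d372bb6826bd81d3542a419d6	2020-04-20 16:47:52
```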
---
My goal for implementing this is to add checkpoints at any stage in the pipeline where we would like to ensure the file integrity of the input and output files. In order to provide checkpoints in a general, easy-to-mix-in manner, we should not have to specify the path to the checkpoints information other than providing a pre-instantiated checkpointer (similar to logger object handling):

```python
from autometa.common import checkpointing
# Not sure if `checkpointing.py` would be in `config` or `common`
from autometa.config import checkpointing

checkpointer = checkpointing.getCheckpointer(__name__)

def some_file_handling_function(infpath, outfpath, *args, **kwds):
    checkpointer.verify_inputs(inputs=[infpath])
    obj = stuff_with_infpath(infpath)
    write_to_outfpath(outfpath)
    checkpointer.verify_outputs(outputs=[outfpath])
    return obj
```

At instantiation, the checkpoint file path (whether it exists or not) would be provided, as well as all files that should be added, checked, or updated:

```python
infpaths = [assembly.fna, orfs.fna, orfs.faa, taxonomy.tsv, blastp.tsv]
outfpaths = [blastp.tsv, orfs.faa, orfs.fna, taxonomy.tsv]
chkpt_fpath = '</path/to/checkpoints.tsv>'
checkpointer = checkpointing.basicConfig(fpath=chkpt_fpath, inputs=infpaths, outputs=outfpaths)
```

This is going to require some knowledge of the structure of the logging module, but I think we can get this to work like a `CheckpointMixin`:

```python
from autometa.common.checkpointer import CheckpointMixin
# Not sure if `checkpointer.py` would be in `config` or `common`
from autometa.config.checkpointer import CheckpointMixin

class SomeClass(CheckpointMixin):
    ...

    @checkpointer(inputs=['infpath'], outputs=['outfpath'])
    def some_file_handling_function(self, infpath, outfpath, *args, **kwds):
        obj = stuff_with_infpath(infpath)
        write_to_outfpath(outfpath)
        return obj
```
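To make the API above concrete, here is a minimal sketch of what such a checkpointer might look like internally, assuming MD5 checksums persisted one-per-line in a TSV. All of this (the `Checkpointer` class, `verify_inputs`, `verify_outputs`) is hypothetical, not existing Autometa code, and it reuses the `checksum()` helper sketched earlier:

```python
import os

class Checkpointer:
    """Hypothetical sketch: persists one checksum per file path in a TSV."""

    def __init__(self, fpath):
        # fpath is the checkpoints table, e.g. the chkpt_fpath passed to basicConfig
        self.fpath = fpath
        self.checksums = {}
        if os.path.exists(fpath):
            with open(fpath) as fh:
                for line in fh:
                    path, md5sum = line.rstrip("\n").split("\t")[:2]
                    self.checksums[path] = md5sum

    def verify_inputs(self, inputs):
        # Fail fast if any input is unrecorded or differs from its recorded checksum
        for path in inputs:
            if self.checksums.get(path) != checksum(path):  # checksum() sketched above
                raise ValueError(f"checksum mismatch (or unrecorded file): {path}")

    def verify_outputs(self, outputs):
        # Record fresh checksums only after outputs have been written successfully
        for path in outputs:
            self.checksums[path] = checksum(path)
        with open(self.fpath, "w") as fh:
            for path, md5sum in self.checksums.items():
                fh.write(f"{path}\t{md5sum}\n")
```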
---

Thinking this may actually be more related to creating a `WorkFlow` class. Taking some ideas from above, we could create a similar API. I.e. a taxonomy workflow:

```python
# workflow sequence
name = "taxonomy"
inputs = [assembly, orfs, blastp, lca, hits.pkl]
outputs = [orfs, blastp, lca, hits.pkl, taxonomy]
WorkFlow(inputs, outputs, name)
```

The root workflow would be the entire workflow sequence, with `name = "Autometa"`.

Checkpointing/AutometaManager:

```python
Manager(Filterer)
Checkpoint(Filterer)
Workflow(object)
# workflow is like a logger record
# stage is like a logger level
# Checkpointer is like a logger
# Manager is like a Handler (dispatch workflows to destinations)
checkpointing.basicConfig()
checkpointing.setStage()

user = AutometaUser(user.config)
# either
user.read_config(metagenome.config)
checkpoints = user.get_checkpoints()
# or
checkpoints = user.get_checkpoints(metagenome.config)
# then
manager = AutometaManager(checkpoints)
# Now we can control stages in the pipeline with user and metagenome configuration
manager.start()
manager.restart()
manager.resume()
...
```
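To make the logging analogy concrete, here is a rough sketch of how these pieces might fit together; every name in it is hypothetical:

```python
class Workflow:
    """Like a logging record: one named stage plus its input/output files."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = inputs
        self.outputs = outputs

class AutometaManager:
    """Like a logging Handler: decides which workflows still need dispatching."""

    def __init__(self, checkpoints, workflows):
        self.checkpoints = checkpoints  # e.g. a Checkpointer instance (sketched above)
        self.workflows = workflows      # ordered list of Workflow objects

    def resume(self):
        # Skip workflows whose outputs still match their recorded checksums;
        # everything from the first mismatch onward must be (re)run.
        for i, workflow in enumerate(self.workflows):
            try:
                self.checkpoints.verify_inputs(workflow.outputs)
            except ValueError:
                return self.workflows[i:]
        return []  # everything is up to date
```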
---

I do not have time to work on this right now; however, I am providing some DAGs for insight into the inputs and outputs that will need to be checked at any given stage.

First set (ORFs written at length-filtering):

- Autometa DAG structure: a simple DAG graph of the inputs and outputs from length-filtering the metagenome to recruiting unclustered contigs, for both archaea and bacteria.
- Autometa DAG (w/ labels): this DAG is labeled with the console scripts and inputs/outputs.
- Autometa DAG (boxed/detailed): this DAG is labeled with the console scripts and inputs/outputs and boxed with the generated files. The cores/memory/footprints information may be ignored.

Second set (ORFs written at …): the same three DAGs (structure, w/ labels, boxed/detailed), with ORFs called at a different stage.
---

Currently, I am thinking the easiest approach would be to parse the respective metagenome.config file within a project directory and determine where to resume the Autometa run. With the DAG structures above, one should be able to determine the starting point given any set of files within the parsed metagenome.config.

Pseudocode:

```python
# 1. Parse config file
mgargs = parse_config("metagenome.config")
# mgargs.files.<namespace for any of the inputs/outputs>

# 2. Check outputs (perhaps recursively), then inputs, for each respective workflow, working bottom-up
tasks = []
workflows = ['binning', 'coverages', 'kmers', 'taxonomy', 'markers', 'length_filter']

# 3. While performing checks, mark workflows that are finished and others that need to be run.
for workflow in workflows:
    task = check_workflow(workflow)
    tasks.append(task)

# 4. Get/execute workflows that still need to be performed
for task in tasks:
    task()
```

The workflows could be constructed hierarchically to resemble the DAG structures above with an ordered "dot" syntax.

I.e. without taxonomy:

```python
root_workflow = binning.markers.coverages.kmer_embedded.kmer_normalized.kmer_counts.orfs.length_filter.metagenome
```

I.e. with taxonomy:

```python
root_workflow = binning.markers.coverages.kmer_embedded.kmer_normalized.kmer_counts.taxonomy.lca.blastp.orfs.length_filter.metagenome
```
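One toy way such a dotted chain could be realized (none of this is existing Autometa code) is for each workflow object to hold a reference to the stage it depends on, so that running the root recursively runs everything upstream first:

```python
class ChainedWorkflow:
    """Toy sketch: each workflow wraps the workflow it depends on."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream

    def run(self):
        if self.upstream is not None:
            self.upstream.run()  # satisfy dependencies bottom-up first
        print(f"running {self.name}")

# Mirrors the "without taxonomy" chain above, built from the bottom up
metagenome = ChainedWorkflow("metagenome")
length_filter = ChainedWorkflow("length_filter", metagenome)
orfs = ChainedWorkflow("orfs", length_filter)
kmer_counts = ChainedWorkflow("kmer_counts", orfs)
kmer_normalized = ChainedWorkflow("kmer_normalized", kmer_counts)
kmer_embedded = ChainedWorkflow("kmer_embedded", kmer_normalized)
coverages = ChainedWorkflow("coverages", kmer_embedded)
markers = ChainedWorkflow("markers", coverages)
binning = ChainedWorkflow("binning", markers)
binning.run()  # runs metagenome -> length_filter -> ... -> binning
```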
---

This is great work! How did you make the DAG diagrams?
---

These can be found here. You will need the `makeflow_viz` and `dot` commands. Luckily, conda to the rescue:

```bash
# getting the `makeflow_viz` command
conda install -y -c conda-forge ndcctools
# getting the `dot` command
conda install -y -c anaconda graphviz
```

Now with your makeflow script, you can generate your DAG:

```bash
# Writing the respective dot file
makeflow_viz --dot-no-labels -D dot autometa.mf > autometa.nolabels.dot
makeflow_viz -D dot autometa.mf > autometa.dot
makeflow_viz --dot-details -D dot autometa.mf > autometa.detailed.dot
# Now generate the DAG image using the dot file
dot -Tgif < autometa.nolabels.dot > autometa.nolabels.gif
dot -Tgif < autometa.dot > autometa.gif
dot -Tgif < autometa.detailed.dot > autometa.detailed.gif
```
---

You may notice a few differences between the first set of 3 DAGs and the second set of 3; namely, the stage at which ORFs are called.
---

Similarly, the submitted …
---

During the code review, it became clear that we need a general method of determining not just that an output file exists, but that it is not corrupted in some way. This has been acknowledged by @WiscEvan in a few PRs so far, but I thought I'd flag it as an issue here. This is especially important for time-consuming and/or computationally intensive steps, because there are a lot of reasons why, for example, DIAMOND could crash half-way through writing its output file. I propose the following:

Although some could argue that the above falls under the category of "nice to have, but not essential", I would argue that it would help diagnose problems and might even reduce support requests.
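As a sketch of how this proposal could guard a long-running step like the DIAMOND search (the command and the `run_and_record` helper are illustrative, reusing the hypothetical `checksum()` idea from above):

```python
import subprocess

def run_and_record(cmd, outfpath, checksums):
    """Run a pipeline step; record the output's checksum only on a clean exit."""
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the step crashes
    # If the process died half-way through writing outfpath we never reach here,
    # so a later resume sees a missing/mismatched checksum and re-runs the step.
    checksums[outfpath] = checksum(outfpath)

# e.g. run_and_record(["diamond", "blastp", "--query", "orfs.faa"], "blastp.tsv", checksums)
```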