
[dev branch] Checksumming/checkpointing needed to assure integrity of output files. #71

Closed
jason-c-kwan opened this issue May 7, 2020 · 12 comments · Fixed by #118
Labels: code review · enhancement (New feature or request) · stretch (wonderful for the community, yet may not be top priority)

@jason-c-kwan
Collaborator

During the code review, it became clear that we need a general method of determining not just that an output file exists, but that it is not corrupted in some way. @WiscEvan has acknowledged this in a few PRs so far, but I thought I'd flag it as an issue here. This is especially important for time-consuming and/or computationally intensive steps, because there are many reasons why, for example, DIAMOND could crash halfway through writing its output file. I propose the following:

  1. Whenever a file is written by Autometa, a checksum file is also written. Because we don't necessarily have to worry much about security here, a fast hashing algorithm like SHA-1 could be used.
  2. Whenever a file is required later, a simple check that the checksum file exists and is non-zero would indicate that the previous step completed.
  3. Checking that the checksum in the checksum file matches a recalculated checksum of the output file confirms that the file has not been corrupted since it was written. Secondarily, this confirms that the checksum file itself is not corrupted and that the initial checksum calculation did not fail in some way.

Although some could argue that the above falls under the category of "nice to have, but not essential", I would argue that it would help diagnose problems and could even reduce support requests.
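
A minimal sketch of what steps 1-3 could look like, using SHA-1 via Python's hashlib. The helper names (write_checksum, verify_checksum) are illustrative, not Autometa's actual API:

import hashlib
import os


def calc_checksum(fpath, blocksize=65536):
    """Return the SHA-1 hex digest of fpath, read in blocks."""
    hasher = hashlib.sha1()
    with open(fpath, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def write_checksum(fpath):
    """Step 1: write <fpath>.sha1 containing '<digest>  <basename>'."""
    checksum_fpath = f"{fpath}.sha1"
    with open(checksum_fpath, "w") as fh:
        fh.write(f"{calc_checksum(fpath)}  {os.path.basename(fpath)}\n")
    return checksum_fpath


def verify_checksum(fpath):
    """Steps 2-3: checksum file exists, is non-empty, and matches a recalculation."""
    checksum_fpath = f"{fpath}.sha1"
    if not os.path.exists(checksum_fpath) or os.path.getsize(checksum_fpath) == 0:
        return False
    with open(checksum_fpath) as fh:
        recorded = fh.read().split()[0]
    return recorded == calc_checksum(fpath)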

@evanroyrees
Collaborator

evanroyrees commented May 15, 2020

Opening this for discussion, but I was thinking of checkpointing during a run in terms of a single checkpoints.txt file that lists a checksum alongside the respective step.

Arguments for writing one file instead of multiple:

  1. A user may move their file but not the respective md5 file, in which case the md5 would not be found and the annotation would be falsely re-done, at a performance cost.

  2. Progress reporting: if multiple annotations are located in multiple directories, searching for the respective md5 files becomes even more cumbersome.

  3. One checkpoints.txt file would save the user from having to shuttle both the md5 and their file around.

  4. Inspecting a checkpoints.txt/progress.txt/steps.txt/pipeline.txt file would be easier than globbing for multiple file paths that may be in different directories.

checkpoints.txt file structure:

Structure for writing/lookup in checkpoints.txt

<checkpoints.txt>

# Step successful
# <checksum><whitespace><filename>\n
# or
# <checksum><whitespace><function_called>\n

# Step not yet encountered or failed.
# <whitespace><whitespace><filename>\n

<checkpoints.txt> example

3fda9608a77b3809141f75a6069b4abd length_filter
852011c3716e19e39a5e30be948896f5 call_orfs
1ee28bbbbc7d5478f112dff5b340a025 assign_taxonomy
  get_markers
  get_binning  

Where a step has not yet been performed, the respective checksum is an empty string. This allows us to search only one file path, and only within this one file, for the checksums corresponding to any of the intermediate files produced during a given Autometa run.
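
For lookup, a small parser along these lines could split the file into completed and pending steps based on whether a checksum is present (a sketch only; the function name is hypothetical):

def parse_checkpoints(checkpoints_fpath):
    """Return ({step: checksum} for completed steps, [steps not yet performed])."""
    completed, pending = {}, []
    with open(checkpoints_fpath) as fh:
        for line in fh:
            if not line.strip():
                continue
            fields = line.split()
            if len(fields) == 2:
                # <checksum><whitespace><step> -> step completed successfully
                checksum, step = fields
                completed[step] = checksum
            else:
                # empty checksum field -> step not yet encountered or failed
                pending.append(fields[0])
    return completed, pending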

Implementation idea

I was thinking of writing a general function as a decorator, so when we reach a step in the pipeline that we would like Autometa to check, we can wrap the step with @checkpoint or some equivalent. Perhaps this would take an output file path as an argument (in addition to the function), such as @utilities.checkpoint(out=outfpath). We would similarly require the wrapped function to have an output file path argument to avoid confusion.

Example:

@checkpoint
def some_func(infpath, out):
    """This function takes as input an `infpath` and writes an `out`."""
    some_info = read(infpath)
    write(some_info, out)
    return out

# Since `@checkpoint` is specified
# `out` will be parsed from the function and checksummed
# then added to the checkpoints file for lookup later

I still need to work on this implementation, but going forward, just being able to place a @checkpoint decorator on any step in the pipeline that writes a file seems much simpler than writing checkpoint functionality on a function-by-function basis.
This would also allow anyone to arbitrarily checkpoint any point in the pipeline that writes out a file.
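
A rough sketch of how such a decorator might be written, assuming the wrapped function takes an `out` keyword argument (or returns the output path). `calc_checksum` refers to the utilities helper mentioned later in this thread; `update_checkpoints` is a hypothetical helper that would append/update the step's entry in the checkpoints file:

import functools


def checkpoint(func):
    """Checksum the wrapped function's output file and record it for later lookup."""
    @functools.wraps(func)
    def wrapper(*args, **kwds):
        result = func(*args, **kwds)
        # parse the output path from the call (or fall back to the returned path)
        outfpath = kwds.get("out", result)
        digest = calc_checksum(outfpath)                          # existing utilities helper
        update_checkpoints(step=func.__name__, checksum=digest)  # hypothetical helper
        return result
    return wrapper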

@jason-c-kwan
Collaborator Author

I do like the idea of a single file with this information in it. A few suggestions:

  1. The file should have its own checksum available (in a separate file?) so that we know it is to be trusted.

  2. Perhaps in the file we can also include the date/time that the checksum was calculated. If it is too different from the modified date of the file we are checking (with some allowance for the time taken to generate the checksum), then that would be a sign that something is wrong.

The reason I am so paranoid about this: I once worked for a month on fastq files that had been corrupted while downloading from the sequencing center. It emerged that even though they were paired-end, they did not have the same number of lines, and neither had a number of lines divisible by four (which you would expect for fastq). In addition, the order was all messed up. You'd think that would be obvious, right? Well, they were still readable, and it turns out assemblers sometimes do not do basic things like check that the names of two paired reads are the same! Imagine troubleshooting a problem with a user for a month and finding it was something as simple as corrupted files.

@evanroyrees
Collaborator

> The file should have its own checksum available (separate file?) so that we know it is to be trusted.

  1. Each file has its own checksum, because we can calculate it with utilities.calc_checksum(fpath).

  2. That being said, for database files we should host all of them with a separate checksum file to ensure database file integrity. However, this is not really possible for the files we will be generating for the user, because we will not have their data to compare checksums against a priori.

We can store the calculated checksum in the checkpoints file, and then any time we want to compare, we calculate the checksum of the file and look up the corresponding reference in checkpoints. Do you mean write checksums for each file generated, then checksum the checksums and write this to checkpoints?

> Perhaps in the file we can also include the date/time that the checksum was calculated. If it is too different from the modified date of the file we are checking (with some allowance for generating the checksum), then that would be a sign that something is wrong.

I'm not sure what the difference here would be, or why it would matter if we are calculating checksums. If the checksums match, the file should be unchanged regardless of the modified date. Or even if it were modified, it must have been restored to the same state to have the same checksum.

@evanroyrees
Collaborator

I think I'm going to scrap the decorator idea as I am not sure how we would implement this.

> I do like the idea of a single file with this information in it. A few suggestions:
>
> 1. The file should have its own checksum available (separate file?) so that we know it is to be trusted.

Do you mean checksum the checkpoints.tsv file?

> 2. Perhaps in the file we can also include the date/time that the checksum was calculated. If it is too different from the modified date of the file we are checking (with some allowance for generating the checksum), then that would be a sign that something is wrong.

I think placing the datetime of the calculated checksum is a good idea. I'm not convinced it is necessary for ensuring file integrity, but it is a nice way to leave some breadcrumbs for users when they are looking at past data analyses.

I'm rethinking the format in terms of a table where the columns would include the following:

Checksum | Name | Description | LastUpdated
<checksum><whitespace><filename>\n | filename | Description of generated file/checkpoint | Time of entry
3fda9608a77b3809141f75a6069b4abd metagenome.filtered.fna\n | metagenome.filtered.fna | Metagenome contigs passing length filter cutoff | year-month-day-hour-min-sec
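
A minimal sketch of appending an entry in that four-column, tab-separated format (record_checkpoint is a hypothetical name; calc_checksum refers to the existing utilities helper):

import os
from datetime import datetime


def record_checkpoint(checkpoints_fpath, fpath, description):
    """Append one Checksum/Name/Description/LastUpdated row to checkpoints.tsv."""
    checksum = calc_checksum(fpath)  # existing utilities helper
    last_updated = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    row = "\t".join([checksum, os.path.basename(fpath), description, last_updated])
    with open(checkpoints_fpath, "a") as fh:
        fh.write(f"{row}\n")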

@evanroyrees
Collaborator

My goal for implementing this is to add checkpoints at any stage in the pipeline where we would like to ensure the integrity of the input and output files.

In order to provide checkpoints in a general, easy-to-mix-in manner, we should not have to specify the path to the checkpoints information beyond specifying the inputs and outputs. I'm considering this either as a CheckpointMixin class or as something similar to logging, where the pre-defined logger object is inherited if and only if a logger object has already been instantiated in another program (i.e. where __main__ is being called).

pre-instantiated checkpointer (similar to logger obj handling)

# Not sure if `checkpointing.py` would live in `config` or `common`
from autometa.common import checkpointing
# or: from autometa.config import checkpointing

checkpointer = checkpointing.getCheckpointer(__name__)

def some_file_handling_function(infpath, outfpath, *args, **kwds):
    checkpointer.verify_inputs(inputs=[infpath])
    obj = stuff_with_infpath(infpath)
    write_to_outfpath(outfpath)
    checkpointer.verify_outputs(outputs=[outfpath])
    return obj

At instantiation, the checkpoint file path (whether or not it exists yet) would be provided, as well as all files that should be added, checked, or updated.

infpaths = ["assembly.fna", "orfs.fna", "orfs.faa", "taxonomy.tsv", "blastp.tsv"]
outfpaths = ["blastp.tsv", "orfs.faa", "orfs.fna", "taxonomy.tsv"]
chkpt_fpath = "</path/to/checkpoints.tsv>"
checkpointer = checkpointing.basicConfig(fpath=chkpt_fpath, inputs=infpaths, outputs=outfpaths)

This is going to require some knowledge of the structure of the logging module, but I think if we can get this to work like a logger object, it would be useful and easy to add at any stage in the pipeline without disrupting the flow of the code.

CheckpointMixin

# Not sure if `checkpointer.py` would live in `config` or `common`
from autometa.common.checkpointer import CheckpointMixin
# or: from autometa.config.checkpointer import CheckpointMixin

class SomeClass(CheckpointMixin):
    ...
    @checkpointer(inputs=['infpath'], outputs=['outfpath'])
    def some_file_handling_function(self, infpath, outfpath, *args, **kwds):
        obj = stuff_with_infpath(infpath)
        write_to_outfpath(outfpath)
        return obj
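
One way to mimic the logging-style behaviour described above (a sketch only, not Autometa's actual API): a module-level registry where basicConfig sets the shared checkpoints file path once and getCheckpointer hands back a cached instance, analogous to logging.basicConfig and logging.getLogger:

_checkpointers = {}
_config = {"fpath": None}


class Checkpointer:
    def __init__(self, name, fpath=None):
        self.name = name
        self.fpath = fpath or _config["fpath"]

    def verify_inputs(self, inputs):
        # look up each input's recorded checksum in self.fpath and compare (omitted)
        ...

    def verify_outputs(self, outputs):
        # checksum each output and add/refresh its entry in self.fpath (omitted)
        ...


def basicConfig(fpath, inputs=None, outputs=None):
    """Set the shared checkpoints file path (analogous to logging.basicConfig)."""
    # inputs/outputs could be registered here as well (omitted in this sketch)
    _config["fpath"] = fpath
    return Checkpointer("root", fpath=fpath)


def getCheckpointer(name):
    """Return one cached Checkpointer per name (analogous to logging.getLogger)."""
    if name not in _checkpointers:
        _checkpointers[name] = Checkpointer(name)
    return _checkpointers[name]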

@evanroyrees
Collaborator

Thinking this may actually be more related to creating an AutometaManager class.

Taking some ideas from above, we could create a checkpointer object similar to a logger object, but instead of emitting (logging) messages it would be constructed to dispatch workflows. Similar to logging, where you emit messages at a logger level, you would dispatch workflows at a specified stage in the Autometa pipeline. Each stage (level) would have its own hierarchy of workflows that would be filtered according to stage and file existence/integrity. After a job is dispatched and successfully completed, a checkpoint can be placed by the manager in a stages section of the metagenome.config file written within the workspace metagenome directory.

E.g. the taxonomy workflow:

workflow sequence

  1. filter-length
  2. get_orfs
  3. blastp
  4. lca
  5. majority vote
name = taxonomy
inputs = [assembly, orfs, blastp, lca, hits.pkl]
outputs = [orfs, blastp, lca, hits.pkl, taxonomy]

WorkFlow(inputs, outputs, name)

Root workflow

name = "Autometa"

Workflow sequence

  1. Get k-mer features
  2. Get coverages
  3. Get taxonomy
  4. Get binning
  5. Recruit unclustered

Checkpointing/AutometaManager

Manager(Filterer)
Checkpoint(Filterer)
Workflow(object)

# workflow is like a logger record 
# stage is like a logger level
# Checkpointer is like a logger
# Manager is like a Handler (dispatch workflows to destinations)

checkpointing.basicConfig()
checkpointing.setStage()

user = AutometaUser(user.config)
# either
user.read_config(metagenome.config)
checkpoints = user.get_checkpoints()
# or
checkpoints = user.get_checkpoints(metagenome.config)
# then
manager = AutometaManager(checkpoints)

# Now we can control stages in pipeline with user and metagenome configuration
manager.start()
manager.restart()
manager.resume()
...
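
To make the logging analogy concrete, here is a bare-bones sketch (the names, stages, and numbering are all illustrative assumptions) in which stages act like levels and the manager only dispatches workflows at or above the configured stage:

STAGES = {"length_filter": 10, "kmers": 20, "coverages": 30, "taxonomy": 40, "binning": 50}


class Workflow:
    """A unit of work, analogous to a logging record."""
    def __init__(self, name, stage, inputs, outputs, task):
        self.name, self.stage = name, stage
        self.inputs, self.outputs = inputs, outputs
        self.task = task  # callable that actually runs the workflow


class Manager:
    """Dispatches workflows to their destinations, analogous to a logging Handler."""
    def __init__(self, stage="length_filter"):
        self.threshold = STAGES[stage]
        self.workflows = []

    def add(self, workflow):
        self.workflows.append(workflow)

    def dispatch(self):
        # filter by stage, just as a logger filters records by level
        for workflow in self.workflows:
            if STAGES[workflow.stage] >= self.threshold:
                workflow.task()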

@evanroyrees
Collaborator

evanroyrees commented Jul 21, 2020

I do not have time to work on this right now; however, I am providing some DAGs for insight into the inputs and outputs that will need to be checked at any given stage.

Autometa DAG structure

Here is a simple DAG graph of the inputs and outputs from length-filtering the metagenome to recruiting unclustered contigs for both archaea and bacteria. (ORFs written at length-filtering)

[DAG image: autometa, no labels]

Autometa DAG (w/labels)

This DAG is labeled with the console scripts and inputs/outputs (ORFs written at length-filtering).

[DAG image: autometa, with labels]

Autometa DAG (boxed/detailed)

This DAG is labeled with the console scripts and inputs/outputs and boxed with the generated files (ORFs written at length-filtering). The cores/memory/footprints information may be ignored.

[DAG image: autometa, detailed]

Autometa DAG structure

Here is a simple DAG graph of the inputs and outputs from length-filtering the metagenome to recruiting unclustered contigs for both archaea and bacteria. (ORFs written at autometa-taxonomy).

[DAG image: autometa, no labels]

Autometa DAG (w/labels)

This DAG is labeled with the console scripts and inputs/outputs (ORFs written at autometa-taxonomy).

[DAG image: autometa, with labels]

Autometa DAG (boxed/detailed)

This DAG is labeled with the console scripts and inputs/outputs and boxed with the generated files (ORFs written at autometa-taxonomy). The cores/memory/footprints information may be ignored.

[DAG image: autometa, detailed]

@evanroyrees
Collaborator

evanroyrees commented Jul 21, 2020

Currently, I am thinking the easiest approach would be to parse the respective metagenome.config file within a project directory and determine where to resume the Autometa run. With the DAG structures above, one should be able to determine the starting point given any set of files within the parsed metagenome.config.

Pseudocode

# 1. Parse config file
mgargs = parse_config("metagenome.config")
# mgargs.files.<namespace for any of the inputs/outputs>

# 2. Check outputs (perhaps recursively), then inputs, given each respective workflow working from bottom-up
tasks = []
workflows = ['binning', 'coverages', 'kmers', 'taxonomy', 'markers', 'length_filter']
# 3. While performing checks, mark workflows that are finished and others that need to be run.
for workflow in workflows:
    task = check_workflow(workflow)
    tasks.append(task)

# 4. Get/execute workflows that still need to be performed
for task in tasks:
    task()

The workflows could be constructed hierarchically to resemble the DAG structures above with an ordered "dot" syntax.

i.e. without taxonomy:

root_workflow = binning.markers.coverages.kmer_embedded.kmer_normalized.kmer_counts.orfs.length_filter.metagenome

i.e. with taxonomy:

root_workflow = binning.markers.coverages.kmer_embedded.kmer_normalized.kmer_counts.taxonomy.lca.blastp.orfs.length_filter.metagenome
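
A sketch of what check_workflow from the pseudocode above might do under this bottom-up approach: verify each of the workflow's expected outputs against the recorded checksums and report whether the workflow still needs to be (re)run. The signature, the checkpoints mapping, and the calc_checksum callable are assumptions for illustration; constructing the actual task to execute is left out:

import os


def check_workflow(workflow_name, outputs, checkpoints, calc_checksum):
    """Return True if the workflow must (re)run: an expected output is missing
    or its checksum no longer matches the one recorded in `checkpoints`."""
    for outfpath in outputs:
        if not os.path.exists(outfpath):
            return True
        if checkpoints.get(outfpath) != calc_checksum(outfpath):
            return True
    return False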

@jason-c-kwan
Collaborator Author

This is great work! How did you make the DAG diagrams?

@evanroyrees
Collaborator

These can be found here

You will need ndcctools and dot (Graphviz).

Luckily, conda to the rescue:

# getting the `makeflow_viz` command
conda install -y -c conda-forge ndcctools
# getting the `dot` command
conda install -y -c anaconda graphviz

Now, with your makeflow script, you can generate your DAG:

# Writing the respective dot file
makeflow_viz --dot-no-labels -D dot autometa.mf > autometa.nolabels.dot
makeflow_viz -D dot autometa.mf > autometa.dot
makeflow_viz --dot-details -D dot autometa.mf > autometa.detailed.dot

# Now generate the DAG image using the dot file
dot -Tgif < autometa.nolabels.dot > autometa.nolabels.gif
dot -Tgif < autometa.dot > autometa.gif
dot -Tgif < autometa.detailed.dot > autometa.detailed.gif

@evanroyrees
Collaborator

You may notice a few differences between the first set of three DAGs and the second set of three, namely whether to call ORFs at autometa-length-filter (and possibly rename that entrypoint?), to call the ORFs at autometa-taxonomy, or perhaps both. This raises some design questions regarding entrypoints and what purpose they should serve. This may be somewhat related to checkpoints as well.

@evanroyrees
Collaborator

evanroyrees commented Jul 29, 2020

Similarly, the submitted autometa.mf script is an example (the simplest, with taxonomy). If a user needed to generate coverages.tsv using reads or a BAM or SAM file, they would need to use one of the commented-out autometa-coverage sections. I am not aware of a way to specify an optional command, so the user would either include taxonomy or remove it from the autometa.mf script, depending on their needs.
