Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger #25

evanroyrees · 2020-03-11T23:00:31Z

Issue 17: `kmers.py`

1. Exception handling for load

Line 83 in 01a432d

def load(kmers_fpath):

Custom exception classes have been written in autometa/common/exceptions.py. Relevant exceptions for kmers.py are IncorrectKmerFormat and KmerEmbeddingError.

Exceptions References:

2. & 3. revcomp function in `init_kmers` and `init_kmers` documentation

#17 (comment) The case where -1 will execute should never occur. As init_kmers will alway generate the kmers being reverse complemented (all_kmers) See

Autometa/autometa/common/kmers.py

Lines 65 to 66 in 01a432d

    
           dna_letters = ['A', 'T', 'C', 'G'] 
        
           all_kmers = list(dna_letters)

using the standard ATCG and never encounter IUPAC degenerate nucleotide codes.

4. k-mer counting with multiprocessing

Question - has this been tested to see if it actually uses multiple processors efficiently? Or that it is faster than single processing?
Answer - This has been tested and it appears to run 2-4 times as fast as the sequential counter.

5,6,7. Case: k-mer length is longer than record length (handling for all respective counting functions)

Contigs where this case applies are returned with the k-mer counts filled as NA. After counting has finished counts of 0 are converted to Na. The contigs without any counts are now removed in the normalization step prior to the CLR transform.

7(continued) & 8. CLR calculation

This is performed in df.dropna(axis='index', how='all').

df.dropna(axis='index', how='all'); Drop contig if no counts were recovered (handling of edge case described in comment 7 above).
df.dropna(axis='columns', how='all') ; Drop k-mer column if no contigs contain this k-mer
how='all' ; only drop if all values being investigated are Na

Both in conjunction specify to only drop the row if all of the columns are Na. After dropping the appropriate columns, Na values are filled with 0 and the normalization procedure then takes place.

9 & 10. `embed` function

n_components has been changed to default of 2.
generic text has been removed from docstring.

11. TSNE

I think this may require running the python2.7 TSNE via a separate script specific for this version of TSNE.

Located here -> python 2.7 bhsne
This is already referenced in issue python2.7 BH-SNE #2

Issue 16: prodigal.py

1. Argparse

argparse does not support passing bool as a type arg. However, this is handled in how the booleans were defined. (Pass flags to operate as the switch rather than providing boolean values that need to be interpreted from a string from command line to a boolean)

2 gzip

Prodigal does not support gzipped files

3. prodigal command congruent for parallel and sequential orf calling

done

4. returncode checks

a warning is logged and the files are checked because GNU parallel may return other codes that are warnings but not errors. I.e. difficult to discern based on the return code whether the run was successful.

5. `get_orf_translations` function name

Two functions are written for conversion of ORF headers and contig headers:

func: contigs_from_headers - Retrieve a dict of {ORF_ID: contig_ID} to map ORFs to their respective contigs.
func: orf_records_from_contigs - Retrieve list of SeqIO.SeqRecord objects of ORF records corresponding to input contigs. The docstring clarifies that this function requires an already prodigal annotated file. Otherwise, I'm not sure what to name this to make it more clear.

…dmp and merged.dmp are out of sync with nr.gz

merge nodes when databases out of sync.

…2020.

…g. Renamed 'projects' to 'workspace' to avoid confusion with 'project'. test metagenome.config file has been updated with respective files & parameters. Reconfigured logger to stream info and write debug level to timestamped log file. Added exceptions. to be used across autometa pipeline.

…etrieves versions from each executable dependency in environ.py. This is used in prodigal to parse corresponding to the prodigal version. I.e. 2.5 differs from version >=2.6. Prodigal now will parse ORF headers and convert contigs to ORF headers according to version available. Default config now has versions section and generated config files now contain versions section related to executable dependencies. Renamed 'new_workspace' in user.py to 'new_project' as this is more appropriate.

…es.py for handling databases config. Default behavior is to download and format required databases. Changed flag to flag to be more clear. autometa will print an issue request to the user upon any exceptions being encountered (NOT KeyboardInterrupt.. Although this will also be logged). Logging behavior changed slightly, where user can specify level (default is INFO) and path to log file. binning call has been moved to user.py. autometa.config imports in user.py have been removed and general autometa.config module is imported via to perform respective func call.

…en checking dependencies. Executable versions are now logged in debug info. log is now only written when flag is supplied. Timestamped log has been commented out. In the future, this could be a flag to log each run via a timestamped log. in databases now only returns the config and the method of databases is used when checking dependencies.

…gram is provided as input. Updated respective files using this function. This should be clearer than returning a dict of the program passed in as key and removes redundant calls to pass in the program as input and then again as a key to retrieve the version value.

…ip performing check to place appropriate metagenome number and just return 1.

… a bug fix related to exception hierarchy. changed timeit logging message format. Respective exception handling updatedin metagenome.py

…Note: entire pipeline currently assumes orf calling was performed using prodigal. Update to template.py where ArgumentParser now has default description, where previously this was by default usage. (Which the usage by default should be the name of the script). Updates to respective files where ORF to contig translations are necessary.

autometa/binning/recursive_dbscan.py

autometa/common/coverage.py

jason-c-kwan · 2020-03-20T19:48:54Z

autometa/common/exceptions.py

+"""
+
+class AutometaException(Exception):
+    """docstring for AutometaException."""


The docstring could be more descriptive. For example, describing circumstances in which the error is raised.

This has not yet been implemented, but the idea here is to have an AutometaException wrapper to demarkate between exceptions of which we are aware and are handling/have handled and others that we have not yet encountered. In the latter case, we can emit a message to the user to request they submit an issue.

The AutometaException would be the base exception class for Autometa and other exceptions would be subclassed from this.

jason-c-kwan · 2020-03-20T19:50:59Z

autometa/common/exceptions.py

+    You may file an issue with us at https://github.com/KwanLab/Autometa/issues/new
+    '''
+    def __str__(self):
+        return f'{self.value}\n\n{issue_request}'


It seems that in Autometa.py the whole of main is in a try/except block, which also prints this message. So if this error is raised, will the user see the message twice?

jason-c-kwan · 2020-03-20T19:52:41Z

autometa/common/exceptions.py

+
+    def __str__(self):
+        return f'{self.fpath} does not contain a \"contig\" column. '\
+        'Ensure the k-mer matrix was properly generated.'


I'm not sure if this is that descriptive to a user. Would it be better to say "There is something wrong with the k-mer matrix file. Perhaps it is corrupted? Try deleting [full path] and trying again".

Yes, possibly. The user could have also provided the wrong file, in which case we would not like them to delete anything and simply pass in the appropriate file. Hard to say. I think this would be good to emit to the user though. Probably messages accounting for both situations would be best.

jason-c-kwan · 2020-03-20T19:53:34Z

autometa/common/exceptions.py

+        'Ensure the k-mer matrix was properly generated.'
+
+class KmerEmbeddingError(AutometaException):
+    """KmerEmbeddingError exception class."""


It would be better to have a more descriptive docstring here and an error message for the user. Under what circumstances is this thrown?

jason-c-kwan · 2020-03-20T19:53:58Z

autometa/common/exceptions.py

+        return self.value
+
+class RecursiveDBSCANError(AutometaException):
+    """RecursiveDBSCANError exception class."""


Same comment here. From the code I can't tell why this exists and there is no message for the user.

The RecursiveDBSCANError is passed a message and is described here:

https://github.com/WiscEvan/Autometa/blob/5c735796764df761558d5035bfd7248409a5fe1a/autometa/binning/recursive_dbscan.py#L340-L343

autometa/common/external/bedtools.py

jason-c-kwan · 2020-03-20T20:28:40Z

You said that Prodigal doesn't support .gz files, but could the file be piped into Prodigal with the unix command zcat? I'm thinking of something like: zcat asm.fasta.gz > prodigal ....

jason-c-kwan · 2020-03-20T20:35:31Z

autometa/common/external/prodigal.py

    else:
        with open(os.devnull, 'w') as stdout, open(os.devnull, 'w') as stderr:
            proc = subprocess.run(cmd, stdout=stdout, stderr=stderr)
            returncode = proc.returncode
    if returncode:
        logger.warning(f'Args:{cmd} ReturnCode:{returncode}')
+        # COMBAK: Check all possible return codes for GNU parallel
    for fp in [nucls_out, prots_out]:
        if not os.path.exists(fp):


See issue #16, point 4. Perhaps BioPython could be used to check that nucls_out and prots_out are valid fasta files?

I don't think there is a validFasta or some equivalent function in BioPython, but we can retrieve the records from the files with Biopython and if we do not encounter an error, assume they are valid.

Yes, that is probably the next best thing, but note that it might not give an error with a zero size file (I guess that would technically be a valid fasta file with no contents). So perhaps check that size != 0 and try loading with BioPython?

jason-c-kwan · 2020-03-20T21:41:06Z

autometa/common/kmers.py

+    -------
+    str or int(-1)
+        reverse complemented string.
+        Note: If any weird letters are encountered, int value of -1 is returned.


Nitpick, but in the comment, I would include that the reason for this is so that we don't count k-mers with weird characters. Otherwise someone in the future might decide that this needs "fixing".

This exception will never be encountered. revcomp() is only used once and this is used on the all_kmers variable which is initialized in the script. Therefore, perhaps for more clarity, this could be removed entirely.

Maybe. I just thought it would be worth including a comment about our reasoning. Otherwise it is a bit difficult to grasp from code review alone.

I've simplified the function to make it intuitive to the reader.

I.e. complement the character for every character in the string in reverse order

I've also removed the unnecessary exception handling and denoted this a function used only as an internal by prepending the underscore (conventions to the programming community).

autometa/common/kmers.py

evanroyrees · 2020-03-23T16:54:35Z

autometa/common/coverage.py

@@ -171,15 +224,22 @@ def main(args):
        format='%(asctime)s : %(name)s : %(levelname)s : %(message)s',
        datefmt='%m/%d/%Y %I:%M:%S %p',
        level=logger.DEBUG)
-    parser = argparse.ArgumentParser('Autometa Coverage')
+    parser = argparse.ArgumentParser(description='Constuct contig coverage table given an input assembly and reads.')


I interpreted issue-#21 item 1 as the main script description here.

Yes, the same description is also at the top of the script in a comment.

… 1.0. in kmers and recursive_dbscan. Updated main function for recursive dbscan with required coverage table input and subsetting taxonomy by the provided domain. Datatype conversion in pandas dataframes are now performed to optimize space in mag.py and recursive_dbscan.py. Added script description to coverage.py and removed unused exception handling in docstring. Renamed bedtools column 'breadth' to 'depth_fraction' and 'total_breadth' to 'depth_product'. Added KmerFormatError in docstring in kmers.load() func. Updated docstring in autometa.config.environ.find_executables()

evanroyrees · 2020-03-23T18:16:34Z

You said that Prodigal doesn't support .gz files, but could the file be piped into Prodigal with the unix command zcat? I'm thinking of something like: zcat asm.fasta.gz > prodigal ....

This is a feature request for the next version of prodigal, however, it does not appear this is currently available. See discussion here

Similarly you can try:

gzip tests/data/metagenome.fna && gzip -c tests/data/metagenome.fna.gz | prodigal -m -o /dev/null -p meta -a test.faa -d test.fna

This results in genes found in sequences with runs of N's and the resulting files are binary encoded.

A few example output lines from test.fna below:

>^W<82>^Vˈ<91>.<FD><92>0^\<97><9F>~<EA>Ô<C4><F9>^N^YZ<F4>^^}<E0>_1 # 1 # 5538 # 1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.930
NNNGTNNNNNNNNCNNNGNNATNGNGNTNNNNNATNANNNNGNNNNNNNNNNNNNNANNNNNGNNNNNNN
NNNNNNCNNNNNNNNNNNNNNNNNNTNNNANNNNNNNNNNANNCNNGNGNNNNNNNNNNNNNNNNNCNNN
NNNNNNNNNNANNNNNNTNNNNGNNNNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNNTGNNNNNNNNNN
TNNNNNNNNNNGNNNNNNANNNNNNNNNNNTNNNNGTNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNN
NNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNGNNNNNNNNNNNNNNNNNNNNNNNNNN
NTNGNNNNNNNCNNNTNCNNNNNNTNNNNNNNGNANANNNANNTTCNNNNNGNNNNGNNNNNNNNANNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNANNNACCNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNN

autometa/common/kmers.py

… list handling for multiple reads files in input. Added fasta format check and simple fasta parser from Biopython for performance and Exception handling. Docstrings noting where discussions should be placed on readthedocs relating to specific autometa functionality.

jason-c-kwan

Yes, I think this is ready to merge. Note: I purposely didn't check solving of issue #20 because @kaw97 said he wanted to tackle that issue.

evanroyrees and others added 9 commits February 19, 2020 17:54

updated add_contig_taxonomy.py to merge any nodes if databases nodes.…

e16b4b8

…dmp and merged.dmp are out of sync with nr.gz

added filepath handling to merged.dmp

2889600

resolved merged_fpath variable

6505fda

added extraction of merged.dmp from taxdump.tar.gz

7d16d90

Merge pull request KwanLab#9 from WiscEvan/master

faf8243

merge nodes when databases out of sync.

resolved KwanLab#10 Contributors added and copyright year updated to …

3c2ef35

…2020.

Merge branch 'master' of https://github.com/KwanLab/Autometa into dev

1d8982c

fixed merge conflicts

31e0e2f

evanroyrees added the code review label Mar 11, 2020

evanroyrees added this to the Aim 0: Release 2.0 milestone Mar 11, 2020

evanroyrees requested a review from jason-c-kwan March 11, 2020 23:00

This was linked to issues Mar 11, 2020

[Dev branch] Issues with autometa/common/kmers.py #17

Closed

[Dev branch] Issues with autometa/common/external/prodigal.py #16

Closed

evanroyrees assigned jason-c-kwan Mar 13, 2020

evanroyrees added 5 commits March 13, 2020 15:29

hotfix to case where new project does not contain any metagenomes. sk…

ea282d1

…ip performing check to place appropriate metagenome number and just return 1.

Changed OSError to subclass ChildProcessError in prodigal.py. This is…

ee1fbec

… a bug fix related to exception hierarchy. changed timeit logging message format. Respective exception handling updatedin metagenome.py

evanroyrees added the enhancement New feature or request label Mar 16, 2020

mostly resolves KwanLab#21 and resolves KwanLab#18.

bbfaecd

evanroyrees linked an issue Mar 17, 2020 that may be closed by this pull request

[Dev branch] Issues with autometa/common/coverage.py #21

Closed

evanroyrees changed the title ~~Resolving issues #17 & #16 and update to Autometa API and Logger~~ Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger Mar 17, 2020

evanroyrees linked an issue Mar 17, 2020 that may be closed by this pull request

[Dev branch] Issues with autometa/common/external/bedtools.py #18

Closed