Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger #25

Merged
merged 20 commits into from
Mar 26, 2020

Conversation

evanroyrees
Copy link
Collaborator

Issue 17: kmers.py

1. Exception handling for load

def load(kmers_fpath):

Custom exception classes have been written in autometa/common/exceptions.py. Relevant exceptions for kmers.py are IncorrectKmerFormat and KmerEmbeddingError.

Exceptions References:

2. & 3. revcomp function in init_kmers and init_kmers documentation

#17 (comment) The case where -1 will execute should never occur. As init_kmers will alway generate the kmers being reverse complemented (all_kmers) See

dna_letters = ['A', 'T', 'C', 'G']
all_kmers = list(dna_letters)
using the standard ATCG and never encounter IUPAC degenerate nucleotide codes.

4. k-mer counting with multiprocessing

Question - has this been tested to see if it actually uses multiple processors efficiently? Or that it is faster than single processing?
Answer - This has been tested and it appears to run 2-4 times as fast as the sequential counter.

5,6,7. Case: k-mer length is longer than record length (handling for all respective counting functions)

Contigs where this case applies are returned with the k-mer counts filled as NA. After counting has finished counts of 0 are converted to Na. The contigs without any counts are now removed in the normalization step prior to the CLR transform.

7(continued) & 8. CLR calculation

This is performed in df.dropna(axis='index', how='all').

  • df.dropna(axis='index', how='all'); Drop contig if no counts were recovered (handling of edge case described in comment 7 above).
  • df.dropna(axis='columns', how='all') ; Drop k-mer column if no contigs contain this k-mer
  • how='all' ; only drop if all values being investigated are Na

Both in conjunction specify to only drop the row if all of the columns are Na. After dropping the appropriate columns, Na values are filled with 0 and the normalization procedure then takes place.

9 & 10. embed function

  • n_components has been changed to default of 2.
  • generic text has been removed from docstring.

11. TSNE

I think this may require running the python2.7 TSNE via a separate script specific for this version of TSNE.

Issue 16: prodigal.py

1. Argparse

argparse does not support passing bool as a type arg. However, this is handled in how the booleans were defined. (Pass flags to operate as the switch rather than providing boolean values that need to be interpreted from a string from command line to a boolean)

2 gzip

  • Prodigal does not support gzipped files

3. prodigal command congruent for parallel and sequential orf calling

  • done

4. returncode checks

a warning is logged and the files are checked because GNU parallel may return other codes that are warnings but not errors. I.e. difficult to discern based on the return code whether the run was successful.

5. get_orf_translations function name

Two functions are written for conversion of ORF headers and contig headers:

  1. func: contigs_from_headers - Retrieve a dict of {ORF_ID: contig_ID} to map ORFs to their respective contigs.
  2. func: orf_records_from_contigs - Retrieve list of SeqIO.SeqRecord objects of ORF records corresponding to input contigs. The docstring clarifies that this function requires an already prodigal annotated file. Otherwise, I'm not sure what to name this to make it more clear.

evanroyrees and others added 9 commits February 19, 2020 17:54
…dmp and merged.dmp are out of sync with nr.gz
merge nodes when databases out of sync.
…g. Renamed 'projects' to 'workspace' to avoid confusion with 'project'. test metagenome.config file has been updated with respective files & parameters. Reconfigured logger to stream info and write debug level to timestamped log file. Added exceptions. to be used across autometa pipeline.
…etrieves versions from each executable dependency in environ.py. This is used in prodigal to parse corresponding to the prodigal version. I.e. 2.5 differs from version >=2.6. Prodigal now will parse ORF headers and convert contigs to ORF headers according to version available. Default config now has versions section and generated config files now contain versions section related to executable dependencies. Renamed 'new_workspace' in user.py to 'new_project' as this is more appropriate.
…es.py for handling databases config. Default behavior is to download and format required databases. Changed flag to flag to be more clear. autometa will print an issue request to the user upon any exceptions being encountered (NOT KeyboardInterrupt.. Although this will also be logged). Logging behavior changed slightly, where user can specify level (default is INFO) and path to log file. binning call has been moved to user.py. autometa.config imports in user.py have been removed and general autometa.config module is imported via to perform respective func call.
…en checking dependencies. Executable versions are now logged in debug info. log is now only written when flag is supplied. Timestamped log has been commented out. In the future, this could be a flag to log each run via a timestamped log. in databases now only returns the config and the method of databases is used when checking dependencies.
…gram is provided as input. Updated respective files using this function. This should be clearer than returning a dict of the program passed in as key and removes redundant calls to pass in the program as input and then again as a key to retrieve the version value.
…ip performing check to place appropriate metagenome number and just return 1.
… a bug fix related to exception hierarchy. changed timeit logging message format. Respective exception handling updatedin metagenome.py
@evanroyrees evanroyrees added the enhancement New feature or request label Mar 16, 2020
@evanroyrees evanroyrees linked an issue Mar 17, 2020 that may be closed by this pull request
@evanroyrees evanroyrees changed the title Resolving issues #17 & #16 and update to Autometa API and Logger Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger Mar 17, 2020
@evanroyrees evanroyrees linked an issue Mar 17, 2020 that may be closed by this pull request
…Note: entire pipeline currently assumes orf calling was performed using prodigal. Update to template.py where ArgumentParser now has default description, where previously this was by default usage. (Which the usage by default should be the name of the script). Updates to respective files where ORF to contig translations are necessary.
"""

class AutometaException(Exception):
"""docstring for AutometaException."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring could be more descriptive. For example, describing circumstances in which the error is raised.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has not yet been implemented, but the idea here is to have an AutometaException wrapper to demarkate between exceptions of which we are aware and are handling/have handled and others that we have not yet encountered. In the latter case, we can emit a message to the user to request they submit an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AutometaException would be the base exception class for Autometa and other exceptions would be subclassed from this.

You may file an issue with us at https://github.com/KwanLab/Autometa/issues/new
'''
def __str__(self):
return f'{self.value}\n\n{issue_request}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that in Autometa.py the whole of main is in a try/except block, which also prints this message. So if this error is raised, will the user see the message twice?


def __str__(self):
return f'{self.fpath} does not contain a \"contig\" column. '\
'Ensure the k-mer matrix was properly generated.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is that descriptive to a user. Would it be better to say "There is something wrong with the k-mer matrix file. Perhaps it is corrupted? Try deleting [full path] and trying again".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, possibly. The user could have also provided the wrong file, in which case we would not like them to delete anything and simply pass in the appropriate file. Hard to say. I think this would be good to emit to the user though. Probably messages accounting for both situations would be best.

'Ensure the k-mer matrix was properly generated.'

class KmerEmbeddingError(AutometaException):
"""KmerEmbeddingError exception class."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have a more descriptive docstring here and an error message for the user. Under what circumstances is this thrown?

return self.value

class RecursiveDBSCANError(AutometaException):
"""RecursiveDBSCANError exception class."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here. From the code I can't tell why this exists and there is no message for the user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jason-c-kwan
Copy link
Collaborator

You said that Prodigal doesn't support .gz files, but could the file be piped into Prodigal with the unix command zcat? I'm thinking of something like: zcat asm.fasta.gz > prodigal ....

else:
with open(os.devnull, 'w') as stdout, open(os.devnull, 'w') as stderr:
proc = subprocess.run(cmd, stdout=stdout, stderr=stderr)
returncode = proc.returncode
if returncode:
logger.warning(f'Args:{cmd} ReturnCode:{returncode}')
# COMBAK: Check all possible return codes for GNU parallel
for fp in [nucls_out, prots_out]:
if not os.path.exists(fp):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See issue #16, point 4. Perhaps BioPython could be used to check that nucls_out and prots_out are valid fasta files?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a validFasta or some equivalent function in BioPython, but we can retrieve the records from the files with Biopython and if we do not encounter an error, assume they are valid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is probably the next best thing, but note that it might not give an error with a zero size file (I guess that would technically be a valid fasta file with no contents). So perhaps check that size != 0 and try loading with BioPython?

-------
str or int(-1)
reverse complemented string.
Note: If any weird letters are encountered, int value of -1 is returned.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, but in the comment, I would include that the reason for this is so that we don't count k-mers with weird characters. Otherwise someone in the future might decide that this needs "fixing".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception will never be encountered. revcomp() is only used once and this is used on the all_kmers variable which is initialized in the script. Therefore, perhaps for more clarity, this could be removed entirely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. I just thought it would be worth including a comment about our reasoning. Otherwise it is a bit difficult to grasp from code review alone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simplified the function to make it intuitive to the reader.

I.e. complement the character for every character in the string in reverse order

I've also removed the unnecessary exception handling and denoted this a function used only as an internal by prepending the underscore (conventions to the programming community).

@@ -171,15 +224,22 @@ def main(args):
format='%(asctime)s : %(name)s : %(levelname)s : %(message)s',
datefmt='%m/%d/%Y %I:%M:%S %p',
level=logger.DEBUG)
parser = argparse.ArgumentParser('Autometa Coverage')
parser = argparse.ArgumentParser(description='Constuct contig coverage table given an input assembly and reads.')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I interpreted issue-#21 item 1 as the main script description here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the same description is also at the top of the script in a comment.

… 1.0. in kmers and recursive_dbscan. Updated main function for recursive dbscan with required coverage table input and subsetting taxonomy by the provided domain. Datatype conversion in pandas dataframes are now performed to optimize space in mag.py and recursive_dbscan.py. Added script description to coverage.py and removed unused exception handling in docstring. Renamed bedtools column 'breadth' to 'depth_fraction' and 'total_breadth' to 'depth_product'. Added KmerFormatError in docstring in kmers.load() func. Updated docstring in autometa.config.environ.find_executables()
@evanroyrees
Copy link
Collaborator Author

You said that Prodigal doesn't support .gz files, but could the file be piped into Prodigal with the unix command zcat? I'm thinking of something like: zcat asm.fasta.gz > prodigal ....

This is a feature request for the next version of prodigal, however, it does not appear this is currently available. See discussion here

Similarly you can try:

gzip tests/data/metagenome.fna && gzip -c tests/data/metagenome.fna.gz | prodigal -m -o /dev/null -p meta -a test.faa -d test.fna

This results in genes found in sequences with runs of N's and the resulting files are binary encoded.

A few example output lines from test.fna below:

>^W<82>^Vˈ<91>.<FD><92>0^\<97><9F>~<EA>Ô<C4><F9>^N^YZ<F4>^^}<E0>_1 # 1 # 5538 # 1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.930
NNNGTNNNNNNNNCNNNGNNATNGNGNTNNNNNATNANNNNGNNNNNNNNNNNNNNANNNNNGNNNNNNN
NNNNNNCNNNNNNNNNNNNNNNNNNTNNNANNNNNNNNNNANNCNNGNGNNNNNNNNNNNNNNNNNCNNN
NNNNNNNNNNANNNNNNTNNNNGNNNNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNNTGNNNNNNNNNN
TNNNNNNNNNNGNNNNNNANNNNNNNNNNNTNNNNGTNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNN
NNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNGNNNNNNNNNNNNNNNNNNNNNNNNNN
NTNGNNNNNNNCNNNTNCNNNNNNTNNNNNNNGNANANNNANNTTCNNNNNGNNNNGNNNNNNNNANNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNANNNACCNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNN

… list handling for multiple reads files in input. Added fasta format check and simple fasta parser from Biopython for performance and Exception handling. Docstrings noting where discussions should be placed on readthedocs relating to specific autometa functionality.
Copy link
Collaborator

@jason-c-kwan jason-c-kwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is ready to merge. Note: I purposely didn't check solving of issue #20 because @kaw97 said he wanted to tackle that issue.

@jason-c-kwan jason-c-kwan merged commit 9a81d52 into KwanLab:dev Mar 26, 2020
@evanroyrees evanroyrees linked an issue Mar 26, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment