Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger #25
Conversation
…dmp and merged.dmp are out of sync with nr.gz
Merge nodes when databases are out of sync.
…g. Renamed 'projects' to 'workspace' to avoid confusion with 'project'. The test metagenome.config file has been updated with the respective files & parameters. Reconfigured the logger to stream info and write debug level to a timestamped log file. Added exceptions to be used across the autometa pipeline.
…etrieves versions from each executable dependency in environ.py. This is used in prodigal to parse ORF headers according to the prodigal version, i.e. 2.5 differs from version >=2.6. Prodigal will now parse ORF headers and convert contigs to ORF headers according to the version available (see the version-parsing sketch after these commit summaries). The default config now has a versions section, and generated config files now contain a versions section related to executable dependencies. Renamed 'new_workspace' in user.py to 'new_project' as this is more appropriate.
…es.py for handling the databases config. Default behavior is to download and format required databases. Changed flag to flag to be more clear. autometa will print an issue request to the user upon any exception being encountered (not KeyboardInterrupt, although this will also be logged). Logging behavior changed slightly: the user can specify the level (default is INFO) and the path to the log file. The binning call has been moved to user.py. autometa.config imports in user.py have been removed and the general autometa.config module is imported via to perform the respective func call.
…en checking dependencies. Executable versions are now logged as debug info. The log is now only written when the flag is supplied. The timestamped log has been commented out; in the future, this could be a flag to log each run via a timestamped log. in databases now only returns the config, and the method of databases is used when checking dependencies.
…gram is provided as input. Updated the respective files using this function. This should be clearer than returning a dict keyed by the program passed in, and it removes redundant calls that pass in the program as input and then again as a key to retrieve the version value.
…ip performing check to place appropriate metagenome number and just return 1.
… a bug fix related to the exception hierarchy. Changed the timeit logging message format. Respective exception handling updated in metagenome.py
…Note: the entire pipeline currently assumes ORF calling was performed using prodigal. Update to template.py where ArgumentParser now has a default description; previously this defaulted to the usage (and the usage by default should be the name of the script). Updates to respective files where ORF-to-contig translations are necessary.
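The version-aware ORF header handling mentioned in these commits could look roughly like the sketch below; get_prodigal_version and the >=2.6 header rule are assumptions based on the commit description, not the actual environ.py/prodigal.py code:

import re
import subprocess

def get_prodigal_version(executable='prodigal'):
    """Return the prodigal version string, e.g. '2.6.3'.

    Assumes `prodigal -v` reports something like 'Prodigal Vx.y.z' on stderr or stdout.
    """
    proc = subprocess.run([executable, '-v'], capture_output=True, text=True)
    match = re.search(r'[Vv](\d+\.\d+(?:\.\d+)?)', proc.stderr + proc.stdout)
    if match is None:
        raise RuntimeError(f'Could not parse prodigal version from {executable!r}')
    return match.group(1)

def orf_to_contig(orf_id, version):
    """Map a prodigal ORF header back to its contig name, depending on the version."""
    major_minor = float('.'.join(version.split('.')[:2]))
    if major_minor >= 2.6:
        # Assumed rule: >=2.6 appends '_<orf number>' to the contig name.
        return orf_id.rsplit('_', 1)[0]
    raise NotImplementedError(f'unsupported prodigal version: {version}')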
""" | ||
|
||
class AutometaException(Exception): | ||
"""docstring for AutometaException.""" |
The docstring could be more descriptive. For example, describing circumstances in which the error is raised.
This has not yet been implemented, but the idea here is to have an AutometaException wrapper to demarcate between exceptions of which we are aware and are handling/have handled, and others that we have not yet encountered. In the latter case, we can emit a message to the user to request they submit an issue.
The AutometaException would be the base exception class for Autometa, and other exceptions would be subclassed from this.
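A minimal sketch of that pattern (the top-level handler and the issue_request wording are illustrative assumptions, not the committed Autometa code):

class AutometaException(Exception):
    """Base class for exceptions Autometa raises deliberately.

    Anything other than an AutometaException reaching the top level is an
    unanticipated failure, so the entry point can ask the user to file an issue.
    """

issue_request = 'You may file an issue with us at https://github.com/KwanLab/Autometa/issues/new'

def main():
    ...  # run the pipeline

if __name__ == '__main__':
    try:
        main()
    except AutometaException as err:
        # Known, handled failure: report it plainly.
        print(err)
    except KeyboardInterrupt:
        raise
    except Exception as err:
        # Unanticipated failure: ask the user to open an issue.
        print(f'{err}\n\n{issue_request}')
        raise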
You may file an issue with us at https://github.com/KwanLab/Autometa/issues/new
'''
    def __str__(self):
        return f'{self.value}\n\n{issue_request}'
It seems that in Autometa.py the whole of main is in a try/except block, which also prints this message. So if this error is raised, will the user see the message twice?
    def __str__(self):
        return f'{self.fpath} does not contain a \"contig\" column. '\
            'Ensure the k-mer matrix was properly generated.'
I'm not sure if this is that descriptive to a user. Would it be better to say "There is something wrong with the k-mer matrix file. Perhaps it is corrupted? Try deleting [full path] and trying again".
Yes, possibly. The user could have also provided the wrong file, in which case we would not like them to delete anything and simply pass in the appropriate file. Hard to say. I think this would be good to emit to the user though. Probably messages accounting for both situations would be best.
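A hedged sketch of a message covering both situations (the class and attribute names mirror the snippet above but are otherwise illustrative, not the committed wording):

class KmerFormatError(AutometaException):
    """Raised when a k-mer matrix file lacks the required 'contig' column."""

    def __init__(self, fpath):
        self.fpath = fpath

    def __str__(self):
        return (
            f'{self.fpath} does not contain a "contig" column. '
            'Either the wrong file was provided (pass in the appropriate k-mer matrix), '
            f'or the matrix is corrupted (delete {self.fpath} and regenerate it).'
        )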
class KmerEmbeddingError(AutometaException):
    """KmerEmbeddingError exception class."""
It would be better to have a more descriptive docstring here and an error message for the user. Under what circumstances is this thrown?
        return self.value

class RecursiveDBSCANError(AutometaException):
    """RecursiveDBSCANError exception class."""
Same comment here. From the code I can't tell why this exists and there is no message for the user.
The RecursiveDBSCANError is passed a message and is described here:

You said that Prodigal doesn't support
autometa/common/external/prodigal.py
Outdated
else:
    with open(os.devnull, 'w') as stdout, open(os.devnull, 'w') as stderr:
        proc = subprocess.run(cmd, stdout=stdout, stderr=stderr)
    returncode = proc.returncode
if returncode:
    logger.warning(f'Args:{cmd} ReturnCode:{returncode}')
# COMBAK: Check all possible return codes for GNU parallel
for fp in [nucls_out, prots_out]:
    if not os.path.exists(fp):
See issue #16, point 4. Perhaps BioPython could be used to check that nucls_out and prots_out are valid fasta files?
I don't think there is a validFasta or some equivalent function in BioPython, but we can retrieve the records from the files with Biopython and, if we do not encounter an error, assume they are valid.
Yes, that is probably the next best thing, but note that it might not give an error with a zero-size file (I guess that would technically be a valid fasta file with no contents). So perhaps check that size != 0 and try loading with BioPython?
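That check could look something like the sketch below (illustrative only; it assumes Biopython is available and treats an empty or unparsable file as a failed prodigal run):

import os
from Bio import SeqIO

def valid_fasta(fpath):
    """Return True if fpath exists, is non-empty and parses as FASTA."""
    if not os.path.exists(fpath) or os.path.getsize(fpath) == 0:
        return False
    try:
        # SeqIO.parse is lazy, so iterate to surface parsing errors.
        return any(True for _ in SeqIO.parse(fpath, 'fasta'))
    except (ValueError, UnicodeDecodeError):
        return False

# e.g. after running prodigal (nucls_out/prots_out as in the hunk above):
# for fp in [nucls_out, prots_out]:
#     if not valid_fasta(fp):
#         raise ChildProcessError(f'prodigal output {fp} is missing or malformed')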
autometa/common/kmers.py
Outdated
-------
str or int(-1)
    reverse complemented string.
    Note: If any weird letters are encountered, int value of -1 is returned.
Nitpick, but in the comment, I would include that the reason for this is so that we don't count k-mers with weird characters. Otherwise someone in the future might decide that this needs "fixing".
This exception will never be encountered. revcomp() is only used once, and this is used on the all_kmers variable which is initialized in the script. Therefore, perhaps for more clarity, this could be removed entirely.
Maybe. I just thought it would be worth including a comment about our reasoning. Otherwise it is a bit difficult to grasp from code review alone.
I've simplified the function to make it intuitive to the reader, i.e. complement the character for every character in the string in reverse order. I've also removed the unnecessary exception handling and marked this as an internal function by prepending an underscore (per the usual Python convention).
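The simplified helper might look roughly like this (a sketch of the approach described above, not necessarily the committed code):

# Complement lookup for canonical nucleotides; anything else maps to itself.
COMPLEMENTS = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def _revcomp(kmer):
    """Reverse complement a k-mer by complementing each character in reverse order."""
    return ''.join(COMPLEMENTS.get(base, base) for base in reversed(kmer))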
autometa/common/coverage.py
Outdated
@@ -171,15 +224,22 @@ def main(args):
    format='%(asctime)s : %(name)s : %(levelname)s : %(message)s',
    datefmt='%m/%d/%Y %I:%M:%S %p',
    level=logger.DEBUG)
parser = argparse.ArgumentParser('Autometa Coverage')
parser = argparse.ArgumentParser(description='Constuct contig coverage table given an input assembly and reads.')
I interpreted issue-#21 item 1 as the main script description here.
Yes, the same description is also at the top of the script in a comment.
… 1.0. in kmers and recursive_dbscan. Updated the main function for recursive dbscan with a required coverage table input and subsetting taxonomy by the provided domain. Datatype conversions in pandas dataframes are now performed to optimize space in mag.py and recursive_dbscan.py. Added a script description to coverage.py and removed unused exception handling in a docstring. Renamed the bedtools column 'breadth' to 'depth_fraction' and 'total_breadth' to 'depth_product'. Added KmerFormatError to the docstring in the kmers.load() func. Updated the docstring in autometa.config.environ.find_executables()
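The dtype optimization mentioned in that commit could be along these lines (illustrative only; the downcast targets are assumptions, not the committed code):

import pandas as pd

def downcast(df):
    """Return df with numeric columns downcast to smaller dtypes to save memory."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df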
This is a feature request for the next version of prodigal; however, it does not appear to be currently available. See the discussion here. Similarly, you can try:

This results in genes being found in sequences with runs of N's, and the resulting files are binary encoded. A few example output lines from test.fna below:
>^W<82>^Vˈ<91>.<FD><92>0^\<97><9F>~<EA>Ô<C4><F9>^N^YZ<F4>^^}<E0>_1 # 1 # 5538 # 1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.930
NNNGTNNNNNNNNCNNNGNNATNGNGNTNNNNNATNANNNNGNNNNNNNNNNNNNNANNNNNGNNNNNNN
NNNNNNCNNNNNNNNNNNNNNNNNNTNNNANNNNNNNNNNANNCNNGNGNNNNNNNNNNNNNNNNNCNNN
NNNNNNNNNNANNNNNNTNNNNGNNNNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNNTGNNNNNNNNNN
TNNNNNNNNNNGNNNNNNANNNNNNNNNNNTNNNNGTNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNN
NNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNGNNNNNNNNNNNNNNNNNNNNNNNNNN
NTNGNNNNNNNCNNNTNCNNNNNNTNNNNNNNGNANANNNANNTTCNNNNNGNNNNGNNNNNNNNANNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNANNNACCNNNNNNNNNNNNCNNNNNNNNNCNNNNNNNNN
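Since piping gzipped input produces garbled records like the ones above, a safer interim approach is to decompress to a temporary file before calling prodigal; a sketch under that assumption (the flags shown are standard prodigal options, the function name is illustrative):

import gzip
import shutil
import subprocess
import tempfile

def run_prodigal(assembly, nucls_out, prots_out):
    """Run prodigal, transparently decompressing a gzipped assembly first."""
    if assembly.endswith('.gz'):
        with gzip.open(assembly, 'rb') as fin, tempfile.NamedTemporaryFile(
            suffix='.fna', delete=False
        ) as fout:
            shutil.copyfileobj(fin, fout)  # write the decompressed assembly
            assembly = fout.name
    cmd = ['prodigal', '-i', assembly, '-d', nucls_out, '-a', prots_out, '-p', 'meta']
    subprocess.run(cmd, check=True)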
… list handling for multiple reads files in input. Added fasta format check and simple fasta parser from Biopython for performance and Exception handling. Docstrings noting where discussions should be placed on readthedocs relating to specific autometa functionality.
Issue 17: kmers.py

1. Exception handling for load
(Autometa/autometa/common/kmers.py, line 83 in 01a432d)
Custom exception classes have been written in autometa/common/exceptions.py. Relevant exceptions for kmers.py are IncorrectKmerFormat and KmerEmbeddingError.
Exceptions References:
2. & 3. revcomp function in init_kmers and init_kmers documentation (#17 (comment))
The case where -1 will execute should never occur, as init_kmers will always generate the kmers being reverse complemented (all_kmers). See Autometa/autometa/common/kmers.py, lines 65 to 66 in 01a432d.
4. k-mer counting with multiprocessing
Question - has this been tested to see if it actually uses multiple processors efficiently? Or that it is faster than single processing?
Answer - This has been tested and it appears to run 2-4 times as fast as the sequential counter.
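For reference, a parallel counter along those lines might look like the following (a simplified sketch, not the Autometa implementation; canonical k-mer selection and DataFrame assembly are omitted):

import multiprocessing as mp
from collections import Counter

def count_kmers(record, k=5):
    """Count k-mers in one (contig_id, sequence) pair."""
    contig_id, seq = record
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return contig_id, counts

def count_all(records, k=5, cpus=None):
    """Count k-mers for many contigs in parallel."""
    with mp.Pool(cpus or mp.cpu_count()) as pool:
        return dict(pool.starmap(count_kmers, ((rec, k) for rec in records)))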
5,6,7. Case: k-mer length is longer than record length (handling for all respective counting functions)
Contigs where this case applies are returned with the k-mer counts filled as NA. After counting has finished, counts of 0 are converted to NA. The contigs without any counts are then removed in the normalization step prior to the CLR transform.
7 (continued) & 8. CLR calculation
This is performed in df.dropna(axis='index', how='all'):
- df.dropna(axis='index', how='all'): drop a contig if no counts were recovered (handling of the edge case described in comment 7 above).
- df.dropna(axis='columns', how='all'): drop a k-mer column if no contigs contain this k-mer.
- how='all': only drop if all values being investigated are NA.
Both in conjunction specify to only drop the row if all of the columns are NA. After dropping the appropriate columns, NA values are filled with 0 and the normalization procedure then takes place (see the sketch below).
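Put together, the handling could look roughly like the following (a sketch only; the CLR step is a generic implementation, not necessarily the one in kmers.py):

import numpy as np
import pandas as pd

def clean_counts(df):
    """Drop empty rows/columns, then fill remaining gaps before normalization."""
    df = df.dropna(axis='index', how='all')    # contig with no recovered counts
    df = df.dropna(axis='columns', how='all')  # k-mer absent from every contig
    return df.fillna(0)

def clr(df):
    """Centered log-ratio transform of a counts table."""
    counts = df + 1                                  # pseudocount to avoid log(0)
    logged = np.log(counts.div(counts.sum(axis=1), axis=0))
    return logged.sub(logged.mean(axis=1), axis=0)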
9 & 10. embed function

11. TSNE
I think this may require running the python2.7 TSNE via a separate script specific for this version of TSNE.
Issue 16: prodigal.py
1. Argparse
argparse does not support passing bool as a type arg. However, this is handled in how the booleans were defined: flags operate as switches rather than boolean values that need to be interpreted from a command-line string.
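For example, switch-style flags can be defined with action='store_true' (the flag names here are illustrative, not necessarily those in prodigal.py):

import argparse

parser = argparse.ArgumentParser(description='ORF calling with prodigal')
# Present -> True, absent -> False; no string-to-bool conversion is needed.
parser.add_argument('--force', action='store_true', help='overwrite existing output files')
parser.add_argument('--parallel', action='store_true', help='run prodigal via GNU parallel')

args = parser.parse_args(['--force'])
assert args.force and not args.parallel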
2. gzip
3. prodigal command congruent for parallel and sequential orf calling
4. returncode checks
A warning is logged and the files are checked because GNU parallel may return other codes that are warnings but not errors, i.e. it is difficult to discern from the return code alone whether the run was successful.
5. get_orf_translations function name
Two functions are written for conversion of ORF headers and contig headers:
- contigs_from_headers: retrieve a dict of {ORF_ID: contig_ID} to map ORFs to their respective contigs.
- orf_records_from_contigs: retrieve a list of SeqIO.SeqRecord objects of ORF records corresponding to input contigs. The docstring clarifies that this function requires an already prodigal-annotated file. Otherwise, I'm not sure what to name this to make it clearer.
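A sketch of what these two functions could look like, assuming prodigal's usual '>contigID_orfN # start # end # strand # ...' header format (not necessarily the committed implementation):

from Bio import SeqIO

def contigs_from_headers(fasta):
    """Map each prodigal ORF ID to its contig ID by stripping the trailing '_<n>'."""
    return {
        record.id: record.id.rsplit('_', 1)[0]
        for record in SeqIO.parse(fasta, 'fasta')
    }

def orf_records_from_contigs(contigs, fasta):
    """Retrieve ORF SeqRecords belonging to the given contigs from a prodigal-annotated FASTA."""
    wanted = set(contigs)
    return [
        record
        for record in SeqIO.parse(fasta, 'fasta')
        if record.id.rsplit('_', 1)[0] in wanted
    ]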