-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metagenomics.py::filter_bam_to_taxa #883
Conversation
add the CLI function filter_bam_to_taxa to metagenomics.py (and two basic unit tests). This functino filters an input bam file to include only reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification TSV file, as produced by tools such as Kraken, as well as the NCBI taxonomy database. The column numbers of the tax ID and read ID can be specified, allowing use beyond kraken-format read classification files, however the relationship is assumed to be bijective. Closes #875
metagenomics.py
Outdated
if type(names) == list: | ||
# if taxID->list (of names) | ||
for name in names: | ||
if name_pattern.match(name.lower()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use re.I
for ignorecase instead of comparing both with lower()
metagenomics.py
Outdated
for heading in tax_names: | ||
# map heading to taxID | ||
name_pattern = re.compile(heading.lower()) | ||
for row_tax_id, names in db.names.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be cleaner in a function but might be marginally slower
metagenomics.py
Outdated
break | ||
if found_heading: | ||
break | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do names = list(names)
to guarantee it will be a list and just use the top logic?
metagenomics.py
Outdated
|
||
tax_ids |= tax_ids_from_headings | ||
|
||
log.debug("tax_ids %s" % tax_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use , tax_ids)
instead of % tax_ids
formatting in a log function call
read_tax_id = int(row[tax_id_col]) | ||
|
||
# transform read ID to take read pairs into account | ||
read_id_match = re.match(paired_read_base_pattern,read_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only support of read names with format ending /1
or /2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it should be more flexible now.
tax_ids_to_include |= set(child_ids) | ||
|
||
tax_ids_to_include = frozenset(tax_ids_to_include) # frozenset membership check slightly faster | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.debug
here for tax_ids_to_include
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed such a line since the set can be huge depending on the number of children (ex. --taxNames="Viruses"
), and since we usually set --loglevel=DEBUG
on DNAnexus it's nice to avoid filling the logfile with tax IDs.
metagenomics.py
Outdated
parser.add_argument('in_taxdb_names_path', help='names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/') | ||
parser.add_argument('--taxNames', nargs="+", dest="tax_names", help='The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. "Viruses" This is in addition to any taxonomic IDs provided.') | ||
parser.add_argument('--taxIDs', nargs="+", type=int, dest="tax_ids", help='The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.') | ||
parser.add_argument('--omit_children', action='store_true', dest="omit_children", help='Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps --without-children
is a little more conventional argument name
metagenomics.py
Outdated
parser.add_argument('in_bam', help='Input bam file.') | ||
parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.') | ||
parser.add_argument('out_bam', help='Output bam file, filtered to the taxa specified') | ||
parser.add_argument('in_taxdb_nodes_path', help='nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just nodes_dmp
metagenomics.py
Outdated
@@ -1145,6 +1150,122 @@ def kraken_library_ids(library): | |||
class KrakenBuildError(Exception): | |||
'''Error while building kraken database.''' | |||
|
|||
def parser_filter_bam_to_taxa(parser=argparse.ArgumentParser()): | |||
parser.add_argument('in_bam', help='Input bam file.') | |||
parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A shorter name perhaps. The help is very descriptive.
$TAX_IDs \ | ||
--loglevel=DEBUG | ||
|
||
samtools view -c ${classified_bam} | tee classified_taxonomic_filter_read_count_pre |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why tee ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use read_int()
downstream to ingest the value as a WDL output, and with tee with get the value in the log as well.
This adds a CLI function,
filter_bam_to_taxa
tometagenomics.py
(and two basic unit tests). This function filters an input bam file to include only reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification TSV file, as produced by tools such as Kraken, as well as the NCBI taxonomy database. The column numbers of the tax ID and read ID can be specified, allowing use beyond kraken-format read classification files, however the relationship is assumed to be bijective. Closes #875