Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metagenomics.py::filter_bam_to_taxa #883

Merged
merged 6 commits into from
Aug 28, 2018
Merged

add metagenomics.py::filter_bam_to_taxa #883

merged 6 commits into from
Aug 28, 2018

Conversation

tomkinsc
Copy link
Member

@tomkinsc tomkinsc commented Aug 24, 2018

This adds a CLI function, filter_bam_to_taxa to metagenomics.py (and two basic unit tests). This function filters an input bam file to include only reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification TSV file, as produced by tools such as Kraken, as well as the NCBI taxonomy database. The column numbers of the tax ID and read ID can be specified, allowing use beyond kraken-format read classification files, however the relationship is assumed to be bijective. Closes #875

add the CLI function filter_bam_to_taxa to metagenomics.py (and two basic unit tests). This functino filters an input bam file to include only reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification TSV file, as produced by tools such as Kraken, as well as the NCBI taxonomy database. The column numbers of the tax ID and read ID can be specified, allowing use beyond kraken-format read classification files, however the relationship is assumed to be bijective. Closes #875
metagenomics.py Outdated
if type(names) == list:
# if taxID->list (of names)
for name in names:
if name_pattern.match(name.lower()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use re.I for ignorecase instead of comparing both with lower()

metagenomics.py Outdated
for heading in tax_names:
# map heading to taxID
name_pattern = re.compile(heading.lower())
for row_tax_id, names in db.names.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be cleaner in a function but might be marginally slower

metagenomics.py Outdated
break
if found_heading:
break
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do names = list(names) to guarantee it will be a list and just use the top logic?

metagenomics.py Outdated

tax_ids |= tax_ids_from_headings

log.debug("tax_ids %s" % tax_ids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use , tax_ids) instead of % tax_ids formatting in a log function call

read_tax_id = int(row[tax_id_col])

# transform read ID to take read pairs into account
read_id_match = re.match(paired_read_base_pattern,read_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only support of read names with format ending /1 or /2 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it should be more flexible now.

tax_ids_to_include |= set(child_ids)

tax_ids_to_include = frozenset(tax_ids_to_include) # frozenset membership check slightly faster

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.debug here for tax_ids_to_include ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed such a line since the set can be huge depending on the number of children (ex. --taxNames="Viruses"), and since we usually set --loglevel=DEBUG on DNAnexus it's nice to avoid filling the logfile with tax IDs.

metagenomics.py Outdated
parser.add_argument('in_taxdb_names_path', help='names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/')
parser.add_argument('--taxNames', nargs="+", dest="tax_names", help='The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. "Viruses" This is in addition to any taxonomic IDs provided.')
parser.add_argument('--taxIDs', nargs="+", type=int, dest="tax_ids", help='The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.')
parser.add_argument('--omit_children', action='store_true', dest="omit_children", help='Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps --without-children is a little more conventional argument name

metagenomics.py Outdated
parser.add_argument('in_bam', help='Input bam file.')
parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.')
parser.add_argument('out_bam', help='Output bam file, filtered to the taxa specified')
parser.add_argument('in_taxdb_nodes_path', help='nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just nodes_dmp

metagenomics.py Outdated
@@ -1145,6 +1150,122 @@ def kraken_library_ids(library):
class KrakenBuildError(Exception):
'''Error while building kraken database.'''

def parser_filter_bam_to_taxa(parser=argparse.ArgumentParser()):
parser.add_argument('in_bam', help='Input bam file.')
parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A shorter name perhaps. The help is very descriptive.

$TAX_IDs \
--loglevel=DEBUG

samtools view -c ${classified_bam} | tee classified_taxonomic_filter_read_count_pre
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why tee ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use read_int() downstream to ingest the value as a WDL output, and with tee with get the value in the log as well.

@tomkinsc tomkinsc merged commit 2faf679 into master Aug 28, 2018
@tomkinsc tomkinsc deleted the ct-filter-bam branch August 28, 2018 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants