add metagenomics.py::filter_bam_to_taxa #883

tomkinsc · 2018-08-24T22:45:22Z

This adds a CLI function, filter_bam_to_taxa, to metagenomics.py (and two basic unit tests). This function filters an input bam file to include only reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification TSV file, as produced by tools such as Kraken, as well as the NCBI taxonomy database. The column numbers of the tax ID and read ID can be specified, allowing use beyond Kraken-format read classification files; however, the mapping of read ID to tax ID is assumed to be bijective. Closes #875
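
For readers unfamiliar with the input, the read-selection step described above can be sketched roughly as follows. This is not the code from the PR: the function name `read_ids_for_taxa`, its parameters, the 0-based column convention, and the toy classification lines are all assumptions made for illustration. The default columns follow Kraken's per-read output (status, read ID, tax ID, ...).

```python
import csv
import io

def read_ids_for_taxa(tsv_lines, tax_ids, read_id_col=1, tax_id_col=2):
    """Return the set of read IDs whose classified tax ID is in `tax_ids`.

    Hypothetical helper; column indices are 0-based here.
    """
    wanted = frozenset(tax_ids)
    keep = set()
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if int(row[tax_id_col]) in wanted:
            keep.add(row[read_id_col])
    return keep

# Toy Kraken-style classification lines (made up for illustration).
kraken_tsv = io.StringIO(
    "C\tread1\t11320\t150\t-\n"
    "C\tread2\t9606\t150\t-\n"
    "U\tread3\t0\t150\t-\n"
)
selected = read_ids_for_taxa(kraken_tsv, {11320})
```

The resulting set of read IDs would then be used to decide which records of the input bam are written to the output bam.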

yesimon · 2018-08-27T17:37:48Z

metagenomics.py

+ if type(names) == list:
+ # if taxID->list (of names)
+ for name in names:
+ if name_pattern.match(name.lower()):


Perhaps use re.I for ignorecase instead of comparing both with lower()
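
A minimal sketch of this suggestion, assuming `heading` is a plain name and `names` is a list of strings (the helper `names_matching` and the sample values are hypothetical, not taken from the PR):

```python
import re

def names_matching(heading, names):
    """Return the names that match `heading` case-insensitively."""
    # re.IGNORECASE (alias re.I) removes the need to call lower() on both sides.
    pattern = re.compile(heading, re.IGNORECASE)
    return [name for name in names if pattern.match(name)]

matches = names_matching("viruses", ["Viruses", "Vira", "bacteria"])
```

Note that `match()` only anchors at the start of the string, so a strict exact match would need `fullmatch()` or an explicit `$` in the pattern.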

yesimon · 2018-08-27T17:40:44Z

metagenomics.py

+ for heading in tax_names:
+ # map heading to taxID
+ name_pattern = re.compile(heading.lower())
+ for row_tax_id, names in db.names.items():


Would be cleaner in a function but might be marginally slower

yesimon · 2018-08-27T17:41:21Z

metagenomics.py

+ break
+ if found_heading:
+ break
+ else:


Why not just do names = list(names) to guarantee it will be a list and just use the top logic?
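
One caveat worth checking with this approach: if a value can be a bare string, `list()` splits it into characters rather than wrapping it. A small sketch of a normalizer that avoids this, assuming values are either a single string or a list of strings (the helper name `as_name_list` and the sample names are hypothetical):

```python
def as_name_list(names):
    """Return `names` as a list, wrapping a single string rather than splitting it."""
    if isinstance(names, str):
        return [names]
    return list(names)

single = as_name_list("Zika virus")
several = as_name_list(["Zika virus", "ZIKV"])
split_chars = list("ab")  # what plain list() does to a string
```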

yesimon · 2018-08-27T17:42:25Z

metagenomics.py

+
+ tax_ids |= tax_ids_from_headings
+
+ log.debug("tax_ids %s" % tax_ids)


Pass the value as an argument (`, tax_ids)`) instead of using `% tax_ids` formatting in a log function call
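
The difference can be sketched as follows. The logger name and the `CountingRepr` helper are hypothetical and exist only to make the deferred formatting observable:

```python
import logging

class CountingRepr:
    """Hypothetical helper that counts how often it is formatted."""
    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "ids"

log = logging.getLogger("filter_bam_to_taxa_example")
log.setLevel(logging.INFO)  # DEBUG records are dropped at this level

value = CountingRepr()
log.debug("tax_ids %s", value)   # deferred: the value is never formatted
deferred_calls = value.calls
log.debug("tax_ids %s" % value)  # eager: formatted before debug() is even called
eager_calls = value.calls
```

With the argument form, the string is only built if a handler actually emits the record.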

yesimon · 2018-08-27T17:50:38Z

metagenomics.py

+ read_tax_id = int(row[tax_id_col])
+
+ # transform read ID to take read pairs into account
+ read_id_match = re.match(paired_read_base_pattern,read_id)


Does this only support read names ending in /1 or /2?

Thanks, it should be more flexible now.
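
For illustration, one way to accept read IDs both with and without a mate suffix is to make the suffix optional. The pattern below is an assumption, not necessarily the value of `paired_read_base_pattern` in the PR, and `read_base_id` is a hypothetical helper:

```python
import re

# Assumed pattern: capture the read name, optionally followed by /1 or /2.
PAIRED_READ_BASE_PATTERN = re.compile(r"^(.*?)(/[12])?$")

def read_base_id(read_id):
    """Return the read ID with any trailing /1 or /2 mate suffix removed."""
    return PAIRED_READ_BASE_PATTERN.match(read_id).group(1)

with_suffix = read_base_id("M01:1:2/1")
without_suffix = read_base_id("M01:1:2")
other_suffix = read_base_id("M01:1:2/3")  # not a mate suffix, left untouched
```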

yesimon · 2018-08-27T17:51:54Z

metagenomics.py

+ tax_ids_to_include |= set(child_ids)
+
+ tax_ids_to_include = frozenset(tax_ids_to_include) # frozenset membership check slightly faster
+


log.debug here for tax_ids_to_include ?

I removed such a line since the set can be huge depending on the number of children (ex. --taxNames="Viruses"), and since we usually set --loglevel=DEBUG on DNAnexus it's nice to avoid filling the logfile with tax IDs.
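
A possible middle ground, sketched here under assumptions rather than taken from the PR, is to log only the size of the set. The `with_descendants` helper, the logger name, and the toy parent-to-children map are hypothetical; real relationships would come from nodes.dmp.

```python
import logging
from collections import deque

log = logging.getLogger("filter_bam_to_taxa_example")  # hypothetical logger name

def with_descendants(tax_ids, children):
    """Return `tax_ids` plus all of their descendants as a frozenset."""
    include = set(tax_ids)
    queue = deque(tax_ids)
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in include:
                include.add(child)
                queue.append(child)
    return frozenset(include)

# Toy parent -> children map with made-up IDs.
children = {1: [10, 20], 10: [100, 101], 20: [200]}
tax_ids_to_include = with_descendants({10}, children)

# Log the size only, so a broad taxon does not flood a DEBUG logfile.
log.debug("tax_ids_to_include holds %d IDs", len(tax_ids_to_include))
```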

yesimon · 2018-08-27T17:56:09Z

metagenomics.py

+ parser.add_argument('in_taxdb_names_path', help='names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/')
+ parser.add_argument('--taxNames', nargs="+", dest="tax_names", help='The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. "Viruses" This is in addition to any taxonomic IDs provided.')
+ parser.add_argument('--taxIDs', nargs="+", type=int, dest="tax_ids", help='The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.')
+ parser.add_argument('--omit_children', action='store_true', dest="omit_children", help='Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).')


Perhaps --without-children is a slightly more conventional argument name
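
A reduced sketch of how the renamed flag could look, keeping `dest="omit_children"` from the diff above so downstream code is unchanged. Only two of the arguments are shown, and the sample tax IDs are arbitrary:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--taxIDs", nargs="+", type=int, dest="tax_ids")
# store_true flags default to False, so children are included unless the flag is given.
parser.add_argument("--without-children", action="store_true", dest="omit_children")

args = parser.parse_args(["--taxIDs", "10239", "11320", "--without-children"])
default_args = parser.parse_args([])
```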

yesimon · 2018-08-27T17:57:19Z

metagenomics.py

+ parser.add_argument('in_bam', help='Input bam file.')
+ parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.') 
+ parser.add_argument('out_bam', help='Output bam file, filtered to the taxa specified')
+ parser.add_argument('in_taxdb_nodes_path', help='nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/')


Perhaps just nodes_dmp

yesimon · 2018-08-27T17:57:46Z

metagenomics.py

@@ -1145,6 +1150,122 @@ def kraken_library_ids(library):
 class KrakenBuildError(Exception):
 '''Error while building kraken database.'''

+def parser_filter_bam_to_taxa(parser=argparse.ArgumentParser()):
+ parser.add_argument('in_bam', help='Input bam file.')
+ parser.add_argument('in_reads_to_tax_ID_map_file', help='TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.') 


A shorter name perhaps. The help is very descriptive.

yesimon · 2018-08-27T18:00:00Z

pipes/WDL/workflows/tasks/tasks_metagenomics.wdl

+ $TAX_IDs \
+ --loglevel=DEBUG
+
+ samtools view -c ${classified_bam} | tee classified_taxonomic_filter_read_count_pre


We use read_int() downstream to ingest the value as a WDL output, and with tee we get the value in the log as well.

tomkinsc added 4 commits August 24, 2018 18:45

add WDL for filter_bam_to_taxa

cd8589b

add pre/post read counts as WDL output in filter_bam_to_taxa task

cdf80be

correction for extract_tarball command

24dbf1a

yesimon reviewed Aug 27, 2018


tomkinsc added 2 commits August 27, 2018 15:50

revisions following code review by @yesimon

d759e9c

allow WDL task filter_bam_to_taxa to pass --without-children boolean

b0fe08e

tomkinsc merged commit 2faf679 into master Aug 28, 2018

tomkinsc deleted the ct-filter-bam branch August 28, 2018 14:20
