Coverage Calculation #135
Conversation
Implementations of mapper (plain and ordinal) now return dictionaries keyed by subject rather than sets of subjects. The dictionary currently maps each subject to a list of (start, end) tuples defining read coverage (only implemented in the plain mapper). Read coverage is demultiplexed, used to compute coverage (if a coverage_map object is passed), then stripped out.
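A minimal sketch of the described return-structure change, with illustrative names (map_reads, strip_coverage, and coverage_map are assumptions for this sketch, not Woltka's actual API):

```python
# Hedged sketch of the described interface change; names are
# illustrative, not Woltka's actual API.

from collections import defaultdict

def map_reads(alignments):
    """Return a dict keyed by subject (formerly a set of subjects),
    mapping each subject to a list of (start, end) coverage tuples."""
    result = defaultdict(list)
    for subject, start, end in alignments:
        result[subject].append((start, end))
    return result

def strip_coverage(mapped, coverage_map=None):
    """Feed coverage ranges to coverage_map if one is given, then
    strip coordinates so downstream code sees plain subjects again."""
    if coverage_map is not None:
        for subject, ranges in mapped.items():
            coverage_map.setdefault(subject, []).extend(ranges)
    return set(mapped)
```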
@dhakim87 Thank you for contributing! I will test it and get back to you!
This line is too long. Modified to pass code style check.
@dhakim87 The
@dhakim87 I have carefully read your code and added a CLI entry. A test of performance on a small dataset suggested:

Therefore, performance loss is a problem to be resolved.
Also for reference: the standard code for calculating per-site depth is here: https://github.com/samtools/samtools/blob/develop/bam2depth.c. Not sure if a Python implementation that works for the entire reference genome set will have reasonable performance. Likely not...
That's very cool. Do you think it would be better to see if we can modify that program instead? I think it would be hard to achieve comparable performance otherwise. But if that's not straightforward then I think this is still a great option.
I do think we should discuss the high-level approach: it's not clear to me whether coverage calculations should be a separate preprocessor on SAM files prior to Woltka. Modifying that C code would be reasonably straightforward in my opinion. A quick skim says they are allocating a count array of the same size as the genome reference. My approach is algorithmically faster, but only produces Boolean coverage rather than counts. It would be trivial to port my coverage bit set implementation to C++ for a substantial speed boost, but figuring out how that integrates into the bigger picture is less clear to me.
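For illustration, a minimal Python sketch of the Boolean bit-set approach described above (an approximation under my assumptions, not the PR's actual implementation; a C++ port would follow the same logic):

```python
# Hedged sketch: Boolean coverage via a bit set, one bit per genome
# position. Marks positions as covered/uncovered; does not count depth.

class CoverageBitSet:
    def __init__(self, genome_length):
        # one bit per position, rounded up to whole bytes
        self.bits = bytearray((genome_length + 7) // 8)

    def add_range(self, start, end):
        """Mark positions start..end (inclusive) as covered."""
        for pos in range(start, end + 1):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def covered_sites(self):
        """Total number of covered positions."""
        return sum(bin(b).count('1') for b in self.bits)

cov = CoverageBitSet(1000)
cov.add_range(10, 99)
cov.add_range(50, 149)          # overlap is absorbed, not double-counted
print(cov.covered_sites())      # 140
```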
@ElDeveloper @dhakim87 Directly importing / translating the SAMtools code might have the following hurdles:

- SAMtools was originally designed for host RNA-seq rather than any meta- application. The design goal was to find how well a host chromosome or genomic region is covered by reads. That is, it is a much lower-throughput use case than typical shotgun metagenomics, which involves many reference genomes. So scalability needs to be reconsidered.
- To my understanding, samtools depth requires that the SAM file is already sorted by the coordinates of alignments on the reference genomes. This sorting operation is expensive, especially when the SAM file is very large. Therefore, Woltka is designed to process the SAM file chunk by chunk, without needing to know the overall layout of the file or to sort it. (Note: I haven't tried samtools depth on unsorted SAM files, so I don't know the outcome.)
- samtools depth and other SAMtools operations require a header section in the SAM file. This header defines the metadata of the reference genomes. When the database is large and the biodiversity of the sample is high, this header can be very large (one line per genome). The SHOGUN protocol omits the header, so SAMtools cannot process its output files. To run SAMtools, we need to use vanilla Bowtie2 instead of SHOGUN.

Appendix: According to my notes, the typical usage of samtools depth is like this:

samtools view -bS input.sam > input.bam
samtools sort input.bam input.sorted
samtools index input.sorted.bam
samtools view -bh input.sorted.bam chromosome1 > chr1.bam
samtools depth chr1.bam > chr1.depth.txt

To my knowledge, k-mer-based approaches provide decent scalability for typical metagenomics use cases, and have been shown to be effective. However, if we are able to implement an efficient alignment-based approach, its accuracy should in theory be superior to that of k-mer methods.
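As a point of reference for the scalability concern, a rough Python sketch of per-site depth counting over unsorted alignments (an illustrative assumption, not bam2depth.c's pileup-based algorithm); note it allocates one counter array per reference genome hit, which is exactly what becomes costly with a large reference set:

```python
# Hedged sketch: per-site depth from unsorted alignments, processed
# without sorting and without a SAM header. Input format is assumed.

def depth_from_alignments(alignments, genome_lengths):
    """alignments: iterable of (subject, start, end), 0-based,
    end-exclusive. genome_lengths: dict of subject -> genome length.
    Returns dict of subject -> per-site depth list."""
    depth = {}
    for subject, start, end in alignments:
        if subject not in depth:
            # one counter per site: memory grows with every genome hit
            depth[subject] = [0] * genome_lengths[subject]
        counts = depth[subject]
        for pos in range(start, end):
            counts[pos] += 1
    return depth

hits = [("G1", 0, 5), ("G1", 3, 8), ("G2", 2, 4)]
print(depth_from_alignments(hits, {"G1": 10, "G2": 6}))
```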
Thanks so much for the explanation, Qiyun. That's very helpful. In that case this implementation should be good as a starting point, and further performance improvements can be scoped via Cython or a dedicated C extension if needed.
@qiyunzhu, one more question based on the performance stats you posted before: what's causing the 7-minute runtime increase in the execution without coverage computation?
Restructured coverage functions
@ElDeveloper You are welcome! That increase in runtime was due to the modified mechanism for parsing alignments. Previously it returned subjects only; in Dan's code it returns coordinates in addition to subjects. I have modified Dan's code so that coordinates are not included when the user does not choose to report coverage. Therefore the performance is back to its original level. @dhakim87 As we discussed, I submitted another PR (#2) to your repo with the flattened interleaved start/end optimization. The runtime dropped further from 49 sec to 45 sec, and the memory usage was almost halved. Will appreciate your review.
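For readers following along, a minimal sketch of what the flattened interleaved start/end representation might look like (illustrative, not the exact code in the PR to Dan's repo):

```python
# Hedged sketch: replace [(s1, e1), (s2, e2), ...] with a single flat
# list [s1, e1, s2, e2, ...]. This drops one tuple object per range,
# which is where the memory savings come from.

ranges = []                 # flat, interleaved starts and ends

def add_range(start, end):
    ranges.append(start)
    ranges.append(end)

def iter_ranges():
    """Yield (start, end) pairs from the flat representation."""
    it = iter(ranges)
    return zip(it, it)      # pairs consecutive items

add_range(5, 20)
add_range(30, 42)
print(list(iter_ranges()))  # [(5, 20), (30, 42)]
```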
Flattened list of ranges
I think we can merge now. @dhakim87 @ElDeveloper
Very exciting!
Added zebra filter's SortedRangeList for computing coverage.
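A hedged sketch of how such a sorted-range list for coverage might work (illustrative only; the actual zebra filter SortedRangeList may differ in detail):

```python
# Hedged sketch: a sorted-range list that merges overlapping/adjacent
# (start, end) ranges so covered length can be computed without
# allocating a per-site array.

class SortedRangeList:
    def __init__(self):
        self.ranges = []            # kept unsorted until compressed

    def add_range(self, start, end):
        self.ranges.append((start, end))

    def compress(self):
        """Sort and merge overlapping or adjacent ranges in place."""
        merged = []
        for start, end in sorted(self.ranges):
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        self.ranges = merged

    def compute_length(self):
        """Total covered sites (ranges inclusive of both ends)."""
        self.compress()
        return sum(end - start + 1 for start, end in self.ranges)

srl = SortedRangeList()
srl.add_range(10, 20)
srl.add_range(15, 30)
srl.add_range(40, 45)
print(srl.compute_length())     # 27 (sites 10..30 plus 40..45)
```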