Coverage Calculation #135
Conversation
Implementations of mapper (plain and ordinal) now return dictionaries keyed by subject rather than sets of subjects. The dictionary currently maps each subject to a list of (start, end) tuples defining read coverage (only implemented in the plain mapper). Read coverage is demultiplexed, used to compute coverage (if a coverage_map object is passed), then stripped out.
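A minimal sketch of the described return-structure change, with illustrative names (map_reads, strip_coverage, and coverage_map are assumptions for this sketch, not Woltka's actual API):

```python
# Hedged sketch of the described interface change; names are
# illustrative, not Woltka's actual API.

from collections import defaultdict

def map_reads(alignments):
    """Return a dict keyed by subject (formerly a set of subjects),
    mapping each subject to a list of (start, end) coverage tuples."""
    result = defaultdict(list)
    for subject, start, end in alignments:
        result[subject].append((start, end))
    return result

def strip_coverage(mapped, coverage_map=None):
    """Feed coverage ranges to coverage_map if one is given, then
    strip coordinates so downstream code sees plain subjects again."""
    if coverage_map is not None:
        for subject, ranges in mapped.items():
            coverage_map.setdefault(subject, []).extend(ranges)
    return set(mapped)
```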
@dhakim87 Thank you for contributing! I will test it and get back to you!
This line is too long. Modified to pass code style check.
@dhakim87 The
@dhakim87 I have carefully read your code and added a CLI entry. A test of performance on a small dataset suggested:

Therefore, performance loss is a problem to be resolved.
Also for reference: the standard code for calculating per-site depth is here: https://github.com/samtools/samtools/blob/develop/bam2depth.c. Not sure if a Python implementation that works for the entire reference genome set will have reasonable performance. Likely not...
That's very cool. Do you think it would be better to see if we can modify that program instead? I think it would be hard to achieve comparable performance otherwise. But if that's not straightforward then I think this is still a great option.
I do think we should discuss the high-level approach: it's not clear to me whether coverage calculations should be a separate preprocessor on SAM files prior to Woltka. Modifying that C code would be reasonably straightforward in my opinion. A quick skim says they are allocating a count array of the same size as the genome reference. My approach is algorithmically faster, but only produces Boolean coverage rather than counts. It would be trivial to port my coverage bit set implementation to C++ for a substantial speed boost, but figuring out how that integrates into the bigger picture is less clear to me.
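For illustration, a minimal Python sketch of the Boolean bit-set approach described above (an approximation under my assumptions, not the PR's actual implementation; a C++ port would follow the same logic):

```python
# Hedged sketch: Boolean coverage via a bit set, one bit per genome
# position. Marks positions as covered/uncovered; does not count depth.

class CoverageBitSet:
    def __init__(self, genome_length):
        # one bit per position, rounded up to whole bytes
        self.bits = bytearray((genome_length + 7) // 8)

    def add_range(self, start, end):
        """Mark positions start..end (inclusive) as covered."""
        for pos in range(start, end + 1):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def covered_sites(self):
        """Total number of covered positions."""
        return sum(bin(b).count('1') for b in self.bits)

cov = CoverageBitSet(1000)
cov.add_range(10, 99)
cov.add_range(50, 149)          # overlap is absorbed, not double-counted
print(cov.covered_sites())      # 140
```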
@ElDeveloper @dhakim87 Directly importing / translating the SAMtools code might have the following hurdles:

- SAMtools was originally designed for host RNA-seq rather than any meta- application. The design goal was to find how well a host chromosome or genomic region is covered by reads. That is, it is a much lower-throughput use case than typical shotgun metagenomics, which involves many reference genomes. So scalability needs to be reconsidered.
- To my understanding, samtools depth requires that the SAM file is already sorted by the coordinates of alignments on the reference genomes. This sorting operation is expensive, especially when the SAM file is very large. Therefore, Woltka is designed to process the SAM file chunk by chunk, without needing to know the overall layout of the file or to sort it. (Note: I haven't tried samtools depth on unsorted SAM files, so I don't know the outcome.)
- samtools depth and other SAMtools operations require a header section in the SAM file. This header defines the metadata of the reference genomes. When the database is large and the biodiversity of the sample is high, this header can be very large (one line per genome). The SHOGUN protocol omits the header, so SAMtools cannot process its output files. To run SAMtools, we need to use vanilla Bowtie2 instead of SHOGUN.

Appendix: According to my notes, the typical usage of samtools depth is like this:

samtools view -bS input.sam > input.bam
samtools sort input.bam input.sorted
samtools index input.sorted.bam
samtools view -bh input.sorted.bam chromosome1 > chr1.bam
samtools depth chr1.bam > chr1.depth.txt

To my knowledge, k-mer-based approaches provide decent scalability for typical metagenomics use cases, and have been shown to be effective. However, if we are able to implement an efficient alignment-based approach, its accuracy should in theory be superior to that of k-mer methods.
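As a point of reference for the scalability concern, a rough Python sketch of per-site depth counting over unsorted alignments (an illustrative assumption, not bam2depth.c's pileup-based algorithm); note it allocates one counter array per reference genome hit, which is exactly what becomes costly with a large reference set:

```python
# Hedged sketch: per-site depth from unsorted alignments, processed
# without sorting and without a SAM header. Input format is assumed.

def depth_from_alignments(alignments, genome_lengths):
    """alignments: iterable of (subject, start, end), 0-based,
    end-exclusive. genome_lengths: dict of subject -> genome length.
    Returns dict of subject -> per-site depth list."""
    depth = {}
    for subject, start, end in alignments:
        if subject not in depth:
            # one counter per site: memory grows with every genome hit
            depth[subject] = [0] * genome_lengths[subject]
        counts = depth[subject]
        for pos in range(start, end):
            counts[pos] += 1
    return depth

hits = [("G1", 0, 5), ("G1", 3, 8), ("G2", 2, 4)]
print(depth_from_alignments(hits, {"G1": 10, "G2": 6}))
```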
Thanks so much for the explanation, Qiyun. That's very helpful. In that case this implementation should be good as a starting point, and further performance improvements can be scoped via Cython or a dedicated C extension if needed.
@qiyunzhu, one more question based on the performance stats you posted before: what's causing the 7-minute runtime increase in the execution without coverage computation?
Restructured coverage functions
@ElDeveloper You are welcome! That increase in runtime was due to the modified mechanism for parsing alignments. Previously it returned subjects only; in Dan's code it returns coordinates in addition to subjects. I have modified Dan's code so that coordinates are not included when the user does not choose to report coverage. Therefore the performance is back to its original level. @dhakim87 As we discussed, I submitted another PR (#2) to your repo with the flattened interleaved start/end optimization. The runtime dropped further from 49 sec to 45 sec, and the memory usage was almost halved. Will appreciate your review.
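For readers following along, a minimal sketch of what the flattened interleaved start/end representation might look like (illustrative, not the exact code in the PR to Dan's repo):

```python
# Hedged sketch: replace [(s1, e1), (s2, e2), ...] with a single flat
# list [s1, e1, s2, e2, ...]. This drops one tuple object per range,
# which is where the memory savings come from.

ranges = []                 # flat, interleaved starts and ends

def add_range(start, end):
    ranges.append(start)
    ranges.append(end)

def iter_ranges():
    """Yield (start, end) pairs from the flat representation."""
    it = iter(ranges)
    return zip(it, it)      # pairs consecutive items

add_range(5, 20)
add_range(30, 42)
print(list(iter_ranges()))  # [(5, 20), (30, 42)]
```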
Flattened list of ranges
I think we can merge now. @dhakim87 @ElDeveloper
Very exciting!
Added zebra filter's SortedRangeList for computing coverage.
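A hedged sketch of how such a sorted-range list for coverage might work (illustrative only; the actual zebra filter SortedRangeList may differ in detail):

```python
# Hedged sketch: a sorted-range list that merges overlapping/adjacent
# (start, end) ranges so covered length can be computed without
# allocating a per-site array.

class SortedRangeList:
    def __init__(self):
        self.ranges = []            # kept unsorted until compressed

    def add_range(self, start, end):
        self.ranges.append((start, end))

    def compress(self):
        """Sort and merge overlapping or adjacent ranges in place."""
        merged = []
        for start, end in sorted(self.ranges):
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        self.ranges = merged

    def compute_length(self):
        """Total covered sites (ranges inclusive of both ends)."""
        self.compress()
        return sum(end - start + 1 for start, end in self.ranges)

srl = SortedRangeList()
srl.add_range(10, 20)
srl.add_range(15, 30)
srl.add_range(40, 45)
print(srl.compute_length())     # 27 (sites 10..30 plus 40..45)
```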