
Coverage Calculation #135
Merged: 17 commits, Sep 9, 2021

Conversation

@dhakim87 (Contributor)

Added the zebra filter's SortedRangeList for computing coverage.
Implementations of the mapper (plain and ordinal) now return dictionaries keyed by subject rather than sets of subjects.
The dictionary currently maps each subject to a list of (start, end) tuples defining read coverage (only implemented in the plain mapper).
Read coverage is demultiplexed, used to compute coverage (if a coverage_map object is passed in), then stripped out.
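
The return-shape change can be pictured with a minimal sketch (illustrative only; names and signatures are placeholders, not the actual woltka mapper code):

def map_alignments(hits):
    # hits: iterable of (subject, start, end) tuples, one per aligned read
    subject_ranges = {}
    for subject, start, end in hits:
        subject_ranges.setdefault(subject, []).append((start, end))
    return subject_ranges

def strip_coverage(subject_ranges):
    # Recover the original set-of-subjects behavior after the (start, end)
    # ranges have been consumed by the coverage calculation.
    return set(subject_ranges)

hits = [('genomeA', 10, 110), ('genomeA', 90, 190), ('genomeB', 5, 105)]
ranges = map_alignments(hits)      # {'genomeA': [(10, 110), (90, 190)], 'genomeB': [(5, 105)]}
subjects = strip_coverage(ranges)  # {'genomeA', 'genomeB'}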

@qiyunzhu (Owner)

@dhakim87 Thank you for contributing! I will test it and get back to you!

This line was too long; I modified it to pass the code style check.
@qiyunzhu (Owner)

@dhakim87 The coverage_map parameter is currently not connected to any CLI argument. One can test the program in Python but not through the CLI. It would be great if the coverage map could be directed to an output file. At the moment, I am most interested in seeing what the output file looks like!

@qiyunzhu (Owner)

@dhakim87 I have carefully read your code, added a CLI option --outcov to enable writing coverage maps to disk, and made a few tweaks. I am thinking of further tweaks.
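
For reference, here is a rough sketch of how a coverage map could be written to disk, assuming a nested dict of the form {sample: {subject: [(start, end), ...]}}; this is only an illustration, and the actual file layout produced by --outcov may differ:

import os

def write_coverage(coverage_map, outdir):
    # One file per sample; one line per subject listing its covered ranges.
    os.makedirs(outdir, exist_ok=True)
    for sample, subjects in coverage_map.items():
        with open(os.path.join(outdir, sample + '.cov'), 'w') as fh:
            for subject, ranges in sorted(subjects.items()):
                spans = ','.join('{0}-{1}'.format(start, end) for start, end in ranges)
                fh.write(subject + '\t' + spans + '\n')

# write_coverage({'S1': {'genomeA': [(10, 120), (300, 350)]}}, 'coverage')
# would produce coverage/S1.cov containing:  genomeA<TAB>10-120,300-350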

@qiyunzhu (Owner)

A performance test on a small dataset showed (maxres = maximum resident set size):

  • Old code: runtime = 0:21.60, maxres = 102892.
  • New code, without coverage calculation: runtime = 0:28.18, maxres = 104348.
  • New code, with coverage calculation: runtime = 1:04.18, maxres = 621548.

Therefore, the performance loss is a problem that needs to be resolved.

@qiyunzhu (Owner)

The code has been updated in my PR to @dhakim87's repo.

@qiyunzhu (Owner)

Also for reference: the standard code for calculating per-site depth is here:

I am not sure whether a Python implementation that works across the entire reference genome set would have reasonable performance. Likely not...

@ElDeveloper (Contributor) commented Aug 30, 2021 via email

@dhakim87 (Contributor, Author)

I do think we should discuss the high-level approach: it's not clear to me whether coverage calculations should instead be a separate preprocessor run on SAM files prior to woltka. Modifying that C code would be reasonably straightforward in my opinion; a quick skim suggests they allocate a count array the same size as the genome reference. My approach is algorithmically faster, but only produces boolean coverage rather than counts. It would be trivial to port my coverage bit-set implementation to C++ for a substantial speed boost, but figuring out how that integrates into the bigger picture is less clear to me.
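
To make the trade-off concrete, here is an illustrative contrast of the two strategies (not the actual SAMtools or woltka code); reads are (start, end), 0-based, end-exclusive:

def per_site_counts(genome_length, reads):
    # samtools-style: one counter per reference position.
    # Memory is O(genome length); yields true per-site depth.
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

def boolean_ranges(reads):
    # Interval-merging style: memory is O(number of distinct covered ranges);
    # records only whether a site is covered, not how many times.
    merged = []
    for start, end in sorted(reads):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

reads = [(0, 100), (50, 150), (400, 500)]
print(sum(d > 0 for d in per_site_counts(1000, reads)))   # 250 covered sites
print(sum(e - s for s, e in boolean_ranges(reads)))       # 250 as well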

@qiyunzhu (Owner)

@ElDeveloper @dhakim87 Directly importing or translating the SAMtools code might face the following hurdles:

  1. SAMtools was originally designed for host RNA-seq rather than any meta-omics application. The design goal was to determine how well a host chromosome or genomic region is covered by reads. That is a much lower-throughput use case than typical shotgun metagenomics, which involves many reference genomes, so scalability needs to be reconsidered.

  2. To my understanding, samtools depth requires the SAM file to already be sorted by the coordinates of the alignments on the reference genomes. This sorting operation is expensive, especially when the SAM file is very large. Therefore, Woltka is designed to process the SAM file chunk by chunk, without needing a global view of the file or any sorting; a minimal sketch of this streaming idea follows the list. (Note: I haven't tried samtools depth on unsorted SAM files, so I don't know the outcome.)

  3. samtools depth and other SAMtools operations require a header section in the SAM file, which defines the metadata of the reference genomes. When the database is large and the biodiversity of the sample is high, this header can be very large (one line per genome). The SHOGUN protocol omits the header, so SAMtools cannot process its output files. To run SAMtools, we would need to use vanilla Bowtie2 instead of SHOGUN.
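
A minimal sketch of the streaming idea mentioned in item 2, assuming a plain tab-delimited (possibly headerless) SAM file; this is illustrative only, as the real woltka parser additionally handles CIGAR strings, paired ends, and chunked grouping of queries:

def stream_sam(path):
    # Accumulate per-subject read ranges in one pass, with no sorting.
    subject_ranges = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith('@'):        # tolerate a header if present
                continue
            fields = line.rstrip('\n').split('\t')
            subject, pos, seq = fields[2], int(fields[3]), fields[9]
            if subject == '*':              # skip unaligned reads
                continue
            end = pos + len(seq) - 1        # crude end; real code uses the CIGAR
            subject_ranges.setdefault(subject, []).append((pos, end))
    return subject_ranges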

Appendix: according to my notes, the typical usage of samtools depth is as follows (comments added; the sort step is written with the modern -o syntax, whereas older samtools versions took an output prefix instead):

samtools view -bS input.sam > input.bam                     # convert SAM to BAM
samtools sort input.bam -o input.sorted.bam                 # sort by reference coordinates
samtools index input.sorted.bam                             # build the BAM index
samtools view -bh input.sorted.bam chromosome1 > chr1.bam   # extract one reference sequence
samtools depth chr1.bam > chr1.depth.txt                    # per-site depth table

To my knowledge, k-mer-based approaches provide decent scalability for typical metagenomics use cases and have been shown to be effective. However, if we are able to implement an efficient alignment-based approach, its accuracy should theoretically be superior to that of k-mer methods.

@ElDeveloper (Contributor) commented Aug 31, 2021 via email

dhakim87 and others added 2 commits August 31, 2021 18:52
@qiyunzhu (Owner) commented Sep 2, 2021

@ElDeveloper You are welcome!

The increase in runtime was due to the modified mechanism for parsing alignments. Previously it returned subjects only; in Dan's code it returns coordinates in addition to subjects. I have modified Dan's code so that coordinates are not included unless the user chooses to report coverage, so performance is back to its original level.

@dhakim87 As we discussed, I submitted another PR (#2) to your repo with the flattened, interleaved start/end optimization. The runtime was further reduced from 49 s to 45 s, and memory usage was nearly halved. I would appreciate your review.
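
For the record, the gist of the flattened interleaved representation is sketched below (illustrative; the actual code in PR #2 may differ in detail): instead of a list of (start, end) tuples, coordinates are kept in a single flat list [start1, end1, start2, end2, ...], which avoids per-tuple object overhead.

def add_range(flat, start, end):
    # Append one read's range to the flat interleaved list.
    flat.append(start)
    flat.append(end)

def merge_flat(flat):
    # Pair up starts (even indices) and ends (odd indices), sort by start,
    # then sweep to merge overlapping or adjacent ranges.
    merged = []
    for start, end in sorted(zip(flat[::2], flat[1::2])):
        if merged and start <= merged[-1] + 1:
            merged[-1] = max(merged[-1], end)
        else:
            merged.extend((start, end))
    return merged   # still flat: [start1, end1, start2, end2, ...]

cov = []
add_range(cov, 10, 60)
add_range(cov, 50, 120)
add_range(cov, 300, 350)
print(merge_flat(cov))   # [10, 120, 300, 350]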

@qiyunzhu (Owner) commented Sep 9, 2021

I think we can merge now. @dhakim87 @ElDeveloper

@qiyunzhu merged commit 385243c into qiyunzhu:master on Sep 9, 2021
@ElDeveloper (Contributor) commented Sep 9, 2021 via email
