
bamCoverage gives wrong results #438

Closed · cgirardot opened this issue Oct 26, 2016 · 12 comments

@cgirardot
Contributor

I'd first like to thank you for your great suite of tools; we use them very frequently.

I think I have found a bug in bamCoverage when using it in a somewhat unusual fashion. I have done the following with both v2.3.6 and v1.5.11:

I have a bam file that contains 1-bp features (these are actually peak summits from different datasets), and the goal is to build a signal track using different smoothing lengths in order to extract "clusters of summits". The idea is to produce a bedGraph file and extract the stretches with non-zero scores (see the sketch below).
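
The extraction step itself isn't spelled out in this report; a minimal Python sketch of it, assuming a plain four-column bedGraph as input (the file name is only illustrative, matching the commands that follow), could look like:

```python
def nonzero_stretches(bedgraph_lines):
    # Yield (chrom, start, end) for maximal runs of adjacent intervals
    # whose score is non-zero.
    current = None
    for line in bedgraph_lines:
        chrom, start, end, score = line.split()
        start, end = int(start), int(end)
        if float(score) != 0:
            if current and current[0] == chrom and current[2] == start:
                current[2] = end                  # extend the open stretch
            else:
                if current:
                    yield tuple(current)
                current = [chrom, start, end]
        else:
            if current:
                yield tuple(current)
            current = None
    if current:
        yield tuple(current)

# Hypothetical usage with one of the smoothed files produced below:
with open("all_summits_smooth300.bdg") as bdg:
    for chrom, start, end in nonzero_stretches(bdg):
        print(chrom, start, end, sep="\t")
```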

I have a sorted bam file containing these 1-bp features, and I ran the following (for different smoothing lengths L):

v1.5.11:

```
bamCoverage --numberOfProcessors 5 --bam all_summits_sorted.bam --outFileName all_summits_smooth$L.bdg --outFileFormat bedgraph --fragmentLength 1 --binSize 1 --smoothLength $L --minMappingQuality 1 --missingDataAsZero yes
```

v2.3.6:

```
bamCoverage --numberOfProcessors 5 --bam all_summits_sorted.bam --outFileName all_summits_smooth$L.bdg --outFileFormat bedgraph --binSize 1 --smoothLength $L --minMappingQuality 1
```

These gave me roughly the same results, except for a few extra lines in the v2.3.6 output, all of which turned out to be previously missing stretches of 0 counts; so all fine.

When using L > 80, I noticed a few weird situations like those depicted in the attached pictures. The pictures show the bam file, the bedGraph files produced by bamCoverage (v2.3.6) for smoothing lengths 200 and 300, and the stretches of non-zero scores extracted from the bedGraph (named SMOOTH XX).

(screenshots: bamcoveragebug2, bamcoveragebug1)

Below is a screenshot of the 300 bp smoothed bedGraph (sorted using sort -k1,1V -k2,2n), showing that this is not a display issue:

(screenshot: bdg_extract)

I attached the summits for chr2R only, in bed format, along with dm3.genome; you can then reproduce my work using:

```
bedtools bedtobam -g dm3.genome -i chr2R_summits.bed > all_summits.bam
samtools sort -o all_summits_sorted.bam all_summits.bam
samtools index all_summits_sorted.bam
```

dm3.genome.txt
chr2R_summits.bed.gz

Charles
PS: it would also be nice to output sorted bedGraph files.

@dpryan79
Collaborator

Thanks for reporting this, I'll have a look.

Regarding the bedGraph file being sorted, I agree completely and have created a stand-alone issue for this (#439).

@cgirardot
Contributor Author

To elaborate a bit more: the problem appears with a smoothing length of 300, not before (among the lengths I tested), i.e. if I check how many of my initial summits do not fall within a non-zero signal stretch:

```
$ bedtools intersect -a all_summits_sorted.bed -b all_summits_smooth300_clusters.bed -c | awk '$6!=1 {print $0}' | wc -l
13304
$ bedtools intersect -a all_summits_sorted.bed -b all_summits_smooth200_clusters.bed -c | awk '$6!=1 {print $0}' | wc -l
0
```

That is about 3% of the initial summits at smoothing length 300, so this is quite a systematic error.

@dpryan79
Collaborator

I've fixed the bedGraph sorting issue for the next release. Regarding the issue that you're seeing, this is just because the bedGraph files have values with at most 2 digits after the decimal point. That suffices for our purposes, but doesn't really suffice in your unusual use case. The easiest solution is to change the following line in writeBedGraph.py:

```python
line_string = "{}\t{}\t{}\t{:.2f}\n"
```

to:

```python
line_string = "{}\t{}\t{}\t{:f}\n"
```

In 99.99% of cases this just leads to much larger output files, though in yours it'd be useful. I don't expect that we'll change this, but we'll talk about it internally tomorrow.
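
To make the rounding concrete: with 1-bp features, a smoothing length of 300 spreads each feature to roughly 1/300 ≈ 0.0033 per base, which the two-digit format prints as 0.00. A quick check in plain Python (independent of deepTools):

```python
value = 1 / 300.0                 # one 1-bp feature smoothed over 300 bp
print("{:.2f}".format(value))     # '0.00'     -> the stretch looks empty
print("{:f}".format(value))       # '0.003333' -> survives, but every value
                                  #               now carries six decimals
```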

@cgirardot
Contributor Author

Thanks for looking into this. I have two comments.

May I first suggest slightly different solutions? One could provide an option to control the precision; this is the most flexible approach and would also address similar issues that people could run into when using a very large smoothing window on shallowly sequenced datasets. Alternatively, or additionally, one could make sure a signal stretch never ends up at 0 purely due to precision, i.e. in the current situation values lower than 0.01 would be floored at 0.01 (see the sketch below).
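
A rough sketch of that second suggestion (the function name and shape are mine, not deepTools'):

```python
def format_score(value, precision=2):
    # Floor non-zero scores at the smallest printable value so that a real
    # signal stretch never collapses to 0 through rounding alone.
    floor = 10.0 ** -precision
    if 0 < value < floor:
        value = floor
    return "{:.{p}f}".format(value, p=precision)

print(format_score(1 / 300.0))    # '0.01' rather than '0.00'
print(format_score(0.0))          # '0.00' (true zeros stay zero)
```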

My second comment is that I think you still have another issue. Please check this screenshot, taken with a smoothing length of 100:

(screenshot: igv_snapshot)

As you can see, the region in the middle is derived from a single peak, so it should be exactly 100 bp wide and have a flat score (of 0.01). Instead, the bedGraph file truncates the signal at 5000000 (as visible in the picture). Here are the lines from the bedGraph file:

```
chr3L 4999745 4999905 0
chr3L 4999905 4999951 0.01
...
chr3L 4999983 4999984 0.01
chr3L 4999984 4999985 0.02
...
chr3L 4999998 4999999 0.02
chr3L 4999999 5000000 0
chr3L 5000000 5000103 0
chr3L 5000103 5000106 0.01
...
```

This is the only place with such an issue, and the truncation at exactly 5000000 can't be a coincidence. I suppose the wrong signal values come from the truncation in some way.

Best

Charles

@dpryan79
Collaborator

dpryan79 commented Oct 27, 2016

Yeah, there will be an edge effect every 5 megabases, because that's the size of the chunk that gets processed at a time. The smoothing happens within each chunk, so it'll be off right at the chunk borders.
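
To see why, here is a toy illustration (a simple centered running mean over numpy arrays, not the actual deepTools code): smoothing each chunk independently cannot see counts that fall just across the chunk border.

```python
import numpy as np

def smooth(counts, smooth_len):
    # Centered running mean of width smooth_len, zero-padded at the edges.
    kernel = np.ones(smooth_len) / smooth_len
    return np.convolve(counts, kernel, mode="same")

chunk_size = 20                      # stand-in for the 5 Mb processing chunk
counts = np.zeros(2 * chunk_size)
counts[chunk_size - 2] = 1.0         # a 1-bp feature just before the border

whole = smooth(counts, 7)            # smoothing everything in one piece
per_chunk = np.concatenate([smooth(counts[:chunk_size], 7),
                            smooth(counts[chunk_size:], 7)])

# The two only disagree right after the border: the per-chunk signal is
# truncated there, just like the bedGraph truncation at 5000000 above.
print(np.flatnonzero(~np.isclose(whole, per_chunk)))   # -> [20 21]
```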

@fidelram
Collaborator

Although the solution seems trivial (add a new parameter to specify the precision), we have to consider the overhead of updating the manuals and Galaxy wrappers, as well as how unfriendly a long list of options is. Furthermore, it is not obvious to the user when the precision should be increased.

One idea is to see whether the precision can be adjusted depending on the range of the values. For example, if the values range from 0 to 1, the precision would be higher than if they range from 0 to 1000. I think this could be achieved using Python's 'general' format (see https://docs.python.org/2/library/string.html).
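
For instance, the 'g' presentation type keeps a fixed number of significant digits rather than a fixed number of decimals, so the printed precision adapts to the magnitude of the value:

```python
for v in (1 / 300.0, 0.5, 12.0, 1234.5678):
    print("{:g}".format(v))
# 0.00333333
# 0.5
# 12         <- trailing '.0' is dropped at integer values
# 1234.57
```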

@dpryan79
Collaborator

I just created a branch testing the general format. I updated all of the tests that I found broken locally (40 of them, though not all of the tests get run on my laptop).

@dpryan79
Collaborator

dpryan79 commented Oct 27, 2016

The CLI Travis tests are passing. I didn't go through the Galaxy ones; I'll do that tomorrow. I'll also run a couple of full-scale tests and double-check that the resulting files (A) work and (B) aren't something like 10x the size.

@cgirardot
Contributor Author

The general-format solution does indeed seem sensible. Thanks again for looking into this.

@dpryan79
Collaborator

All of the tests are now passing, and the file size isn't immensely different (the decrease in digits at integer values largely offsets the increase elsewhere). This is now merged for the 2.4 release on Tuesday/Wednesday.

@cgirardot
Contributor Author

Awesome! Thanks for the prompt reply and quick fixes. Do you mean Tuesday/Wednesday next week?

@dpryan79
Collaborator

Yes, the 2.4.0 release was scheduled for the first (though it'll probably come out on the 2nd, since the 1st is a holiday here).
