
bamCoverage gives wrong results #438

Closed · cgirardot opened this issue Oct 26, 2016 · 12 comments

@cgirardot
Contributor

I'd first like to thank you for your great suite of tools; we use them very frequently.

I think I have found a bug in bamCoverage when using it in a somewhat unusual fashion. I have done the following with both v2.3.6 and v1.5.11:

I have a bam file that contains 1-bp features (these are actually peak summits from different datasets), and the goal is to build a signal track using different smoothing lengths in order to extract "clusters of summits". The idea is to produce a bedGraph file and extract the stretches with non-zero scores (see the sketch below).
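
The extraction step itself isn't spelled out in this report; a minimal Python sketch of it, assuming a plain four-column bedGraph as input (the file name is only illustrative, matching the commands that follow), could look like:

```python
def nonzero_stretches(bedgraph_lines):
    # Yield (chrom, start, end) for maximal runs of adjacent intervals
    # whose score is non-zero.
    current = None
    for line in bedgraph_lines:
        chrom, start, end, score = line.split()
        start, end = int(start), int(end)
        if float(score) != 0:
            if current and current[0] == chrom and current[2] == start:
                current[2] = end                  # extend the open stretch
            else:
                if current:
                    yield tuple(current)
                current = [chrom, start, end]
        else:
            if current:
                yield tuple(current)
            current = None
    if current:
        yield tuple(current)

# Hypothetical usage with one of the smoothed files produced below:
with open("all_summits_smooth300.bdg") as bdg:
    for chrom, start, end in nonzero_stretches(bdg):
        print(chrom, start, end, sep="\t")
```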

I have a sorted bam file containing these 1-bp features, and I ran the following (for different smoothing lengths L):

v1.5.11:

```
bamCoverage --numberOfProcessors 5 --bam all_summits_sorted.bam --outFileName all_summits_smooth$L.bdg --outFileFormat bedgraph --fragmentLength 1 --binSize 1 --smoothLength $L --minMappingQuality 1 --missingDataAsZero yes
```

v2.3.6:

```
bamCoverage --numberOfProcessors 5 --bam all_summits_sorted.bam --outFileName all_summits_smooth$L.bdg --outFileFormat bedgraph --binSize 1 --smoothLength $L --minMappingQuality 1
```

These gave me roughly the same results, except for a few extra lines in the v2.3.6 output, all of which turned out to be previously missing stretches of 0 counts; so all fine.

When using L > 80, I noticed a few weird situations like those depicted in the attached pictures. The pictures show the bam file, the bedGraph files produced by bamCoverage (v2.3.6) for smoothing lengths 200 and 300, and the stretches of non-zero scores extracted from the bedGraph (named SMOOTH XX).

(screenshots: bamcoveragebug2, bamcoveragebug1)

Below is a screenshot of the 300 bp smoothed bedGraph (sorted using sort -k1,1V -k2,2n), showing that this is not a display issue:

(screenshot: bdg_extract)

I attached the summits for chr2R only, in bed format, along with dm3.genome; you can then reproduce my work using:

```
bedtools bedtobam -g dm3.genome -i chr2R_summits.bed > all_summits.bam
samtools sort -o all_summits_sorted.bam all_summits.bam
samtools index all_summits_sorted.bam
```

dm3.genome.txt
chr2R_summits.bed.gz

Charles
PS: it would also be nice to output sorted bedGraph files.

@dpryan79
Collaborator

Thanks for reporting this, I'll have a look.

Regarding the bedGraph file being sorted, I agree completely and have created a stand-alone issue for this (#439).

@cgirardot
Contributor Author

To elaborate a bit more: the problem appears with a smoothing length of 300, not before (among the lengths I tested), i.e. if I check how many of my initial summits do not fall within a non-zero signal stretch:

```
$ bedtools intersect -a all_summits_sorted.bed -b all_summits_smooth300_clusters.bed -c | awk '$6!=1 {print $0}' | wc -l
13304
$ bedtools intersect -a all_summits_sorted.bed -b all_summits_smooth200_clusters.bed -c | awk '$6!=1 {print $0}' | wc -l
0
```

That is about 3% of the initial summits at smoothing length 300, so this is quite a systematic error.

@dpryan79
Collaborator

I've fixed the bedGraph sorting issue for the next release. Regarding the issue that you're seeing, this is just because the bedGraph files have values with at most 2 digits after the decimal point. That suffices for our purposes, but doesn't really suffice in your unusual use case. The easiest solution is to change the following line in writeBedGraph.py:

```python
line_string = "{}\t{}\t{}\t{:.2f}\n"
```

to:

```python
line_string = "{}\t{}\t{}\t{:f}\n"
```

In 99.99% of cases this just leads to much larger output files, though in yours it'd be useful. I don't expect that we'll change this, but we'll talk about it internally tomorrow.
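
To make the rounding concrete: with 1-bp features, a smoothing length of 300 spreads each feature to roughly 1/300 ≈ 0.0033 per base, which the two-digit format prints as 0.00. A quick check in plain Python (independent of deepTools):

```python
value = 1 / 300.0                 # one 1-bp feature smoothed over 300 bp
print("{:.2f}".format(value))     # '0.00'     -> the stretch looks empty
print("{:f}".format(value))       # '0.003333' -> survives, but every value
                                  #               now carries six decimals
```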

@cgirardot
Contributor Author

Thanks for looking into this. I have two comments.

May I first suggest slightly different solutions? One could provide an option to control the precision; this is the most flexible approach and would also address similar issues that people could run into when using a very large smoothing window on shallowly sequenced datasets. Alternatively, or additionally, one could make sure a signal stretch never ends up at 0 purely due to precision, i.e. in the current situation values lower than 0.01 would be floored at 0.01 (see the sketch below).
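
A rough sketch of that second suggestion (the function name and shape are mine, not deepTools'):

```python
def format_score(value, precision=2):
    # Floor non-zero scores at the smallest printable value so that a real
    # signal stretch never collapses to 0 through rounding alone.
    floor = 10.0 ** -precision
    if 0 < value < floor:
        value = floor
    return "{:.{p}f}".format(value, p=precision)

print(format_score(1 / 300.0))    # '0.01' rather than '0.00'
print(format_score(0.0))          # '0.00' (true zeros stay zero)
```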

My second comment is that I think you still have another issue. Please check this screenshot, taken with a smoothing length of 100:

(screenshot: igv_snapshot)

As you can see, the region in the middle is derived from a single peak, so it should be exactly 100 bp wide and have a flat score (of 0.01). Instead, the bedGraph file truncates the signal at 5000000 (as visible in the picture). Here are the lines from the bedGraph file:

```
chr3L 4999745 4999905 0
chr3L 4999905 4999951 0.01
...
chr3L 4999983 4999984 0.01
chr3L 4999984 4999985 0.02
...
chr3L 4999998 4999999 0.02
chr3L 4999999 5000000 0
chr3L 5000000 5000103 0
chr3L 5000103 5000106 0.01
...
```

This is the only place with such an issue, and the truncation at exactly 5000000 can't be a coincidence. I suppose the wrong signal values come from the truncation in some way.

Best

Charles

@dpryan79
Collaborator

dpryan79 commented Oct 27, 2016

Yeah, there will be an edge effect every 5 megabases, because that's the size of the chunk that gets processed at a time. The smoothing happens within each chunk, so it'll be off right at the chunk borders.
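
To see why, here is a toy illustration (a simple centered running mean over numpy arrays, not the actual deepTools code): smoothing each chunk independently cannot see counts that fall just across the chunk border.

```python
import numpy as np

def smooth(counts, smooth_len):
    # Centered running mean of width smooth_len, zero-padded at the edges.
    kernel = np.ones(smooth_len) / smooth_len
    return np.convolve(counts, kernel, mode="same")

chunk_size = 20                      # stand-in for the 5 Mb processing chunk
counts = np.zeros(2 * chunk_size)
counts[chunk_size - 2] = 1.0         # a 1-bp feature just before the border

whole = smooth(counts, 7)            # smoothing everything in one piece
per_chunk = np.concatenate([smooth(counts[:chunk_size], 7),
                            smooth(counts[chunk_size:], 7)])

# The two only disagree right after the border: the per-chunk signal is
# truncated there, just like the bedGraph truncation at 5000000 above.
print(np.flatnonzero(~np.isclose(whole, per_chunk)))   # -> [20 21]
```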

@fidelram
Collaborator

Although the solution seems trivial (add a new parameter to specify the precision), we have to consider the overhead of updating the manuals and Galaxy wrappers, as well as how unfriendly a long list of options is. Furthermore, it is not obvious to the user when the precision should be increased.

One idea is to see whether the precision can be adjusted depending on the range of the values. For example, if the values range from 0 to 1, the precision would be higher than if they range from 0 to 1000. I think this could be achieved using Python's 'general' format (see https://docs.python.org/2/library/string.html).
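
For instance, the 'g' presentation type keeps a fixed number of significant digits rather than a fixed number of decimals, so the printed precision adapts to the magnitude of the value:

```python
for v in (1 / 300.0, 0.5, 12.0, 1234.5678):
    print("{:g}".format(v))
# 0.00333333
# 0.5
# 12         <- trailing '.0' is dropped at integer values
# 1234.57
```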

@dpryan79
Collaborator

I just created a branch testing the general format. I updated all of the tests that I found broken locally (40 of them, though not all of the tests get run on my laptop).

@dpryan79
Collaborator

dpryan79 commented Oct 27, 2016

The CLI Travis tests are passing. I didn't go through the Galaxy ones; I'll do that tomorrow. I'll also run a couple of full-scale tests and double-check that the resulting files (A) work and (B) aren't something like 10x the size.

@cgirardot
Contributor Author

The general-format solution does indeed seem sensible. Thanks again for looking into this.

@dpryan79
Collaborator

All of the tests are now passing, and the file size isn't immensely different (the decrease in digits at integer values largely offsets the increase elsewhere). This is now merged for the 2.4 release on Tuesday/Wednesday.

@cgirardot
Contributor Author

Awesome! Thanks for the prompt reply and quick fixes. Do you mean Tuesday/Wednesday next week?

@dpryan79
Collaborator

Yes, the 2.4.0 release was scheduled for the first (though it'll probably come out on the 2nd, since the 1st is a holiday here).
