bamCoverage gives wrong results #438
Comments
Thanks for reporting this, I'll have a look. Regarding the bedGraph file being sorted, I agree completely and have created a standalone issue for this (#439).
To elaborate a bit more: the problem appears with smooth length 300, not before (for the lengths I tested), i.e. when I check how many of my initial summits do not fall within a non-zero signal stretch.
I've fixed the bedGraph sorting issue for the next release. Regarding the issue that you're seeing, this is just because the bedGraph files have values with at most 2 digits after the decimal point. That suffices for our purposes, but doesn't really in your unusual use-case. The easiest solution is to just change the following line in
to:
In 99.99% of cases this just leads to much larger output files, though in yours it'd be useful. I don't expect that we'll change this, but we'll talk about it internally tomorrow.
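To illustrate the rounding effect described above, here is a minimal Python sketch. It assumes, purely for illustration, that an isolated 1-bp feature smoothed over a window of length L ends up with a value of roughly 1/L; that relation and the helper name `two_decimals` are assumptions, not taken from the deepTools source.

```python
def two_decimals(value):
    """Format a score with two digits after the decimal point."""
    return "{:.2f}".format(value)


# Assumed smoothed value of an isolated feature for 100 bp and 300 bp windows.
for window in (100, 300):
    value = 1.0 / window
    print(window, value, two_decimals(value))

# A strictly positive value below 0.005 is written as "0.00", so a downstream
# filter that keeps only non-zero scores would silently drop it.
small_but_positive = 1.0 / 300
is_lost = small_but_positive > 0 and float(two_decimals(small_but_positive)) == 0.0
print(is_lost)
```

Under that assumption, a summit smoothed over 300 bp would be written as zero, while shorter windows would still produce a visible non-zero score.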
Yeah, there will be an edge effect every 5 Mbases. This is because that's the size of the chunk processed at a time. The smoothing happens within each chunk, so it'll be off right at the borders.
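The border effect can be reproduced with a toy example. The sketch below is not the deepTools implementation: `smooth` is a hypothetical centered moving average that simply ignores positions past the ends of its input, and the chunk size of 5 stands in for the 5 Mbases mentioned above.

```python
def smooth(values, window):
    """Centered moving average; positions past the ends are ignored."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def smooth_in_chunks(values, window, chunk_size):
    """Smooth each chunk independently, without looking across borders."""
    out = []
    for start in range(0, len(values), chunk_size):
        out.extend(smooth(values[start:start + chunk_size], window))
    return out


# A single feature at index 4, right before the chunk border at index 5.
signal = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0]
whole = smooth(signal, 3)
chunked = smooth_in_chunks(signal, 3, 5)

# Positions where per-chunk smoothing disagrees with whole-array smoothing.
differing = [i for i, (a, b) in enumerate(zip(whole, chunked)) if a != b]
print(differing)
```

In this toy case the two results only disagree at the positions adjacent to the chunk border, which matches the description of the smoothing being off right at the borders.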
Although the solution seems trivial (add a new parameter to specify the precision), we have to consider the overhead of updating the manuals and Galaxy wrappers, as well as the unfriendliness of a long list of options. Furthermore, it is not obvious to the user when to increase the precision. One idea is to adjust the precision depending on the range of the values. For example, if the values range from 0 to 1, the precision would be higher than if they range from 0 to 1000. I think that this could be achieved using the 'general format' from Python (see https://docs.python.org/2/library/string.html).
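As a quick illustration of that idea, the sketch below compares fixed two-decimal formatting with Python's general format (`g`), which keeps a fixed number of significant digits (6 by default) and drops trailing zeros. The sample values are made up for illustration, and this is not necessarily the exact format string used in the branch.

```python
# Made-up scores spanning several orders of magnitude.
values = [0.0, 1.0 / 300, 0.25, 3.0, 1234.5678]

for v in values:
    fixed = "{:.2f}".format(v)    # always two digits after the decimal point
    general = "{:g}".format(v)    # six significant digits, trailing zeros dropped
    print(fixed, general)
```

Small values keep their significant digits, while integers become shorter (`3` instead of `3.00`), which is why the output size need not grow much.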
I just created a branch testing the general format. I updated all of the tests that I found broken locally (40, but not all would get tested on my laptop).
The CLI Travis tests are passing. I didn't go through the Galaxy ones, I'll do that tomorrow. I'll also run through a couple of full-scale tests and double check that a few files (A) work and (B) aren't like 10x the size.
The general format solution seems indeed sensible. Thanks again for looking into this.
All of the tests are now passing and the file size isn't immensely different (the decrease in digits at integers largely offsets the increase elsewhere). This is now merged in for the 2.4 release on Tuesday/Wednesday.
Awesome! Thanks for the prompt reply and quick fixes. Do you mean Tuesday/Wednesday next week?
Yes, the 2.4.0 release was scheduled for the first (though it'll probably come out on the 2nd, since the 1st is a holiday here). |
I'd first like to thank you for your great suite of tools. We are using it very frequently.
I think I have found a bug in bamCoverage when using it in a somewhat unusual fashion. I have done the following with both v2.3.6 and v1.5.11:
I have a bam file that contains 1-bp features (these are actually peak summits from different datasets), and the goal is to build a signal track using different smooth lengths in order to extract "clusters of summits". The idea is to produce a bedGraph file and extract the stretches with non-0 scores.
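The extraction step could look roughly like the following Python sketch. It is an assumption about the workflow rather than the exact commands used here: `nonzero_stretches` is a hypothetical helper, the example lines are made up, and the input is assumed to be a position-sorted, four-column bedGraph.

```python
def nonzero_stretches(bedgraph_lines):
    """Merge adjacent bedGraph intervals with non-zero scores into stretches.

    Assumes the lines are sorted by chromosome and start position.
    """
    stretches = []
    for line in bedgraph_lines:
        chrom, start, end, score = line.split()[:4]
        start, end = int(start), int(end)
        if float(score) == 0:
            continue
        last = stretches[-1] if stretches else None
        if last and last[0] == chrom and last[2] == start:
            last[2] = end  # extend the current stretch
        else:
            stretches.append([chrom, start, end])
    return [tuple(s) for s in stretches]


# Made-up example input.
example = [
    "chr2R\t0\t100\t0",
    "chr2R\t100\t150\t0.5",
    "chr2R\t150\t200\t1",
    "chr2R\t200\t300\t0",
    "chr2R\t300\t310\t0.2",
]
clusters = nonzero_stretches(example)
print(clusters)
```

Note that any score written to the file as zero ends a stretch, so the result depends on how the scores were rounded on output.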
I have a sorted bam file containing these 1-bp features and I ran (for different smooth lengths L):
v1.5.11
v2.3.6
These give me about the same results, except for a few extra lines in the v2.3.6 output, all of which turned out to be the missing stretches of 0 counts; so all fine.
When using L > 80, I noticed a few weird situations like those depicted in the attached pictures. The pictures show the bam file, the bedGraph files produced by bamCoverage (v2.3.6) for smooth lengths 200 and 300, and the stretches of non-0 scores extracted from the bedGraph files (named SMOOTH XX).
Below is a screenshot of the 300 bp smoothed bedGraph (sorted using sort -k1,1V -k2,2n), showing that this is not a display issue:
I attached the summits for chr2R only (in bed format) and dm3.genome; you can then reproduce my work using:
dm3.genome.txt
chr2R_summits.bed.gz
Charles
PS: it would also be nice to output sorted bdg files