blacklist regions for bamCoverage/ bamCompare/ ... #101
Why would you want to keep the reads in the BAM file if you're going to ignore them downstream?
Many of the individual deepTools like bamCoverage/bamCompare/bamFingerprint have recently gained options to filter the BAM according to SAM flags (include/exclude) or mapping quality. This is all quite nice, as it makes storing separate BAM files for specific filters superfluous. We often use the BAM file as the archive of all the reads from a sequencing run, and filtering a BAM again and again is tedious and time-consuming. We use blacklist filtering, for example, to filter out artefact regions that are usually enriched in the input. For mouse and human, there are blacklists from the ENCODE project. This allows for a more accurate coverage estimation. Recently we were also looking into some normalization issues, and a properly filtered coverage track is essential for that as well. The (currently last) missing piece in deepTools is some sort of blacklist-region filter. Of course, one can imagine many more filters, but I would still treat a blacklist filter as one of general interest/purpose.
So, the resulting file should also lack the reads overlapping with the blacklisted regions? (Originally, I thought you only wanted to exclude the reads for the calculation of scaling factors.)
I'm working on some code for this now. A question arises about exactly how to deal with this. The simplest way to do this is (1) to just ignore any genomic chunk that happens to overlap a blacklisted region. Alternatively, (2) one could attempt to trim the bounds of a genomic chunk such that it no longer overlaps. That's possible, though explaining how that works and handling all of the edge cases might prove a bit annoying. I would personally prefer (1), simply because it's easy to implement and explain to users. In general I would then add a blacklist option for all tools that don't accept a BED file of regions (alternatively, I could just make the two options mutually exclusive).
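To make option (1) concrete, here is a minimal sketch of what skipping overlapping chunks could look like; this is illustrative Python, not the actual deepTools implementation, and the data layout (a dict of (start, end) tuples per chromosome) is an assumption:

```python
# Illustrative only: option (1), skip any genomic chunk that overlaps a
# blacklisted region instead of trimming the chunk boundaries.
def overlaps(chunk_start, chunk_end, regions):
    """True if the half-open interval [chunk_start, chunk_end) hits any region."""
    return any(start < chunk_end and chunk_start < end for start, end in regions)

def usable_chunks(chunks, blacklist):
    """Yield only the chunks that do not touch a blacklisted region.

    chunks:    iterable of (chrom, start, end) tuples
    blacklist: dict mapping chrom -> list of (start, end) tuples
    """
    for chrom, start, end in chunks:
        if not overlaps(start, end, blacklist.get(chrom, [])):
            yield chrom, start, end
```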
I think filtering out any fragments (not only reads) overlapping with blacklisted regions is fine. Adding a blacklist option that accepts a BED file seems totally sufficient to me.
Looking at fragments would seriously decrease performance. I suppose that this could be done, but I'll start with filtering chunks; we'll see if that suffices and only try to ignore entire fragments if absolutely needed.
Ok. I prefer filtering by fragments when writing BAM files, as otherwise SAM flags have to be fixed for singletons (the "read paired" flag has to be unset). This, however, requires resource-intensive sorting by read name.
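For reference, the SAM-flag fixing mentioned here would look roughly like the sketch below with pysam (an assumed helper, not deepTools code; the expensive name-sorting step needed to find reads that lost their mate is not shown):

```python
# Sketch only: clear the pair-related SAM flag bits on a read whose mate has
# been removed, so the remaining read becomes a valid singleton.
PAIR_FLAGS = 0x1 | 0x2 | 0x8 | 0x20 | 0x40 | 0x80  # paired, proper pair,
                                                   # mate unmapped, mate reverse,
                                                   # first/second in pair

def demote_to_singleton(read):
    """read: a pysam.AlignedSegment whose mate was filtered out."""
    read.flag &= ~PAIR_FLAGS
    return read
```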
Filtering fragments was also my initial thought, but if it is much easier/faster to filter out the genomic chunks, I think it is OK this way. We can see how it "behaves" and then improve later on... One thing: should we also adjust some scaling factors like the effective genome size? I mean, a 1 kb region doesn't matter, but if the blacklist adds up to some Mb then it might be substantial... Thanks a lot btw :-)
If the blacklisted region is really large, then the user will need to adjust the effective genome size for the tools that require it (this is the case for anything in deepTools that uses an effective genome size and also allows restricting the computation to a particular region). At the moment, the blacklist wouldn't normally affect things like normalization unless you use a method that actually samples the genome.
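As a rough illustration of that manual adjustment (all numbers below are placeholders, not recommendations), the correction is just a subtraction of the total blacklisted length:

```python
# Placeholder values for illustration only.
effective_genome_size = 2_150_570_000                # whatever value you would normally use
blacklist = [("chr1", 3_000_000, 3_500_000),          # hypothetical blacklisted regions
             ("chr2", 98_000_000, 99_200_000)]
blacklisted_bases = sum(end - start for _, start, end in blacklist)
adjusted_genome_size = effective_genome_size - blacklisted_bases
print(adjusted_genome_size)  # pass this to tools that take an effective genome size
```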
computeGCBias already has a
I'd also prefer to have the same option everywhere... :-) ...but it's not so important...
Maybe --blackListFileName?
great
So @dpryan79 implemented blacklist filtering, and here is an example of how the results look. The top track is the original track from a BAM file. The light green track is from bamCoverage with the blacklistFileName option. The red track is how a track looks after filtering the BAM with bamutils. In blue is the blacklisted region, which was defined as >10 enrichment in the original track. So this seems to work nicely. Please note the border effects at the blacklisted regions. These come from reads starting/ending outside the blacklist region, as it would be too slow to filter out all reads that overlap the blacklist region. By extending the blacklist regions by about 1x insert size at each end these should go away, but it is totally fine to leave this to the user. In addition, there is also an estimation of how many reads fall into the blacklisted regions. This is used to adjust the scaling factors for e.g. 1x coverage. @dpryan79: can you comment on how this is finally done? Thanks! bamutils does a good job of also removing all overlapping reads (red track), but it took ~50 minutes just to filter the BAM! The impact on the deepTools runtime is minor and comes only from the additional read-number estimation for the blacklisted regions... Any suggestions/comments?!
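Widening the blacklist entries as suggested can be done before handing the BED file to deepTools; a small sketch, where the file names and the 200 bp pad are made-up examples:

```python
# Hypothetical pre-processing step: pad each BED interval by roughly one
# insert size on both sides, so reads starting/ending just outside the
# blacklisted region are also excluded.
PAD = 200  # assumed insert size; adjust for your library

with open("blacklist.bed") as src, open("blacklist_padded.bed", "w") as dst:
    for line in src:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        start = max(0, int(start) - PAD)   # clamp at the chromosome start
        end = int(end) + PAD               # clamping at the chromosome end is omitted here
        dst.write("\t".join([chrom, str(start), str(end), *rest]) + "\n")
```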
@steffenheyne Do you know how bamutils works? Why don't we see a 'hole' as in the bamCoverage case?
Thanks for testing this! The results will generally be a bit better if the blacklisted regions are widened slightly (one could do this automatically, but I'd prefer to avoid that). This generally works by taking each genomic chunk that would have been queried and subtracting out the blacklisted regions. For the scaling factors, the normal read count from the BAM index is taken and the number of reads falling into the blacklisted regions is then subtracted from it. I hope this suffices for people; performance would degrade to that of bamutils if we really went whole-hog on this.
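A rough pysam-based sketch of that bookkeeping (not the deepTools code itself); it assumes a coordinate-sorted, indexed BAM and simply counts reads overlapping each blacklisted interval, so a read spanning two intervals could be counted twice:

```python
import pysam

def scaled_read_count(bam_path, blacklist):
    """Total mapped reads from the BAM index minus reads in blacklisted regions.

    blacklist: iterable of (chrom, start, end) tuples.
    """
    with pysam.AlignmentFile(bam_path) as bam:
        total = bam.mapped                    # cheap: read from the .bai index
        in_blacklist = sum(bam.count(chrom, start, end)
                           for chrom, start, end in blacklist)
    return total - in_blacklist
```

The difference would then take the place of the plain index count when computing, e.g., the 1x-coverage scaling factor.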
@fidelram Ah yes, I didn't use the option "--skipNonCoveredRegions" for any of the tracks, so this is probably how @dpryan79 implemented the blacklisted-region encoding... and it is probably always "nan" no matter whether "--skipNonCoveredRegions" was used or not... Should this be changed? Hmm... maybe blacklist regions should rather also get 0, and "nan" only in case "--skipNonCoveredRegions" was used in addition.
Yes, you'll get NAs for blacklisted regions, which is the only sane way to do it. I won't implement a different method than that. It's likely that there's no gap in the bamUtils-preprocessed data due to read extension. Read extension would still occur with blacklists, but extensions into a blacklisted region would be ignored (this is a good thing).
Yes, all right, but couldn't we simply change the NAs to 0 when writing the wig/bigwig!?
No. NAs aren't written; nothing is written, which is the correct way to do it.
If there's no other feedback on this by Wednesday I'll merge that branch into
This is now merged in for the 2.2 release.
Hi,
currently there is only a limited possibility to ignore reads by region in the BAM file. One can only ignore reads on a whole chromosome via:
--ignoreForNormalization IGNOREFORNORMALIZATION
It would be helpful if one could also provide a BED file with blacklisted regions; all reads falling into this set should be ignored when counting coverage etc. These reads should also be excluded when scaling to 1x coverage or calculating other normalization factors.
In case of paired-end data, one should have the possibility to exclude the whole read pair if either of the reads/mates overlaps a blacklisted region.
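As a rough illustration of that per-fragment behaviour (hypothetical code, not an existing deepTools feature), one could drop a read when either its own coordinates or, approximately, its mate's start position fall inside a blacklisted interval:

```python
import pysam

def fragment_in_blacklist(read, blacklist):
    """True if the read or (approximately) its mate touches a blacklisted region.

    blacklist: dict mapping reference name -> list of (start, end) tuples.
    The mate is checked only by its start position, since its end is not
    available without fetching it; this is an approximation.
    """
    def hits(ref, start, end):
        return any(s < end and start < e for s, e in blacklist.get(ref, []))

    end = read.reference_end or read.reference_start + 1
    if hits(read.reference_name, read.reference_start, end):
        return True
    if read.is_paired and not read.mate_is_unmapped:
        mate_start = read.next_reference_start
        return hits(read.next_reference_name, mate_start, mate_start + 1)
    return False

# Hypothetical usage: keep only fragments that stay clear of the blacklist.
blacklist = {"chr1": [(1_000_000, 1_050_000)]}  # made-up region
with pysam.AlignmentFile("input.bam") as bam, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam) as out:
    for read in bam:
        if not fragment_in_blacklist(read, blacklist):
            out.write(read)
```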
What do you think?
Thanks!