
blacklist regions for bamCoverage/ bamCompare/ ... #101

Closed · steffenheyne opened this issue Sep 21, 2015 · 23 comments

@steffenheyne (Collaborator) commented Sep 21, 2015

Hi,

currently there is only a limited possibility to ignore reads by region in the BAM file. One can only ignore reads on whole chromosomes via:
--ignoreForNormalization IGNOREFORNORMALIZATION

It would be helpful if one could also provide a BED file of blacklisted regions; all reads falling into these regions should then be ignored when counting coverage etc. These reads should also be excluded when scaling to 1x coverage or calculating other normalization factors.

For paired-end data, it should be possible to exclude the whole read pair if either mate overlaps a blacklisted region.
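
(For illustration only: a minimal sketch of the requested filtering, assuming pysam and a three-column BED file. This is not deepTools code and all names are placeholders; skipping every read whose query name ends up in the returned set would drop both mates of a pair.)

```python
import pysam

def load_blacklist(bed_path):
    """Parse a BED file into {chrom: [(start, end), ...]}."""
    regions = {}
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return regions

def blacklisted_read_names(bam_path, regions):
    """Query names of all reads overlapping a blacklisted region; skipping
    these names during counting also drops the mate of every such read."""
    names = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, intervals in regions.items():
            if chrom not in bam.references:
                continue
            for start, end in intervals:
                for read in bam.fetch(chrom, start, end):
                    names.add(read.query_name)
    return names
```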

What do you think?
Thanks!

@friedue (Collaborator) commented Sep 21, 2015

Why would you want to keep the reads in the BAM file if you're going to ignore them downstream?

@steffenheyne (Collaborator, Author)

Many of the individual deepTools, like bamCoverage/bamCompare/bamFingerprint, have recently gained options to filter the BAM according to SAM flags (include/exclude) or mapping quality.

This is all quite nice, as it makes storing separate BAM files for specific filters superfluous. We often use the BAM file as the archive of all reads from a sequencing run, and filtering a BAM again and again is tedious and time-consuming.

We use blacklist filtering, for example, to remove artefact regions that are typically enriched in the input. For mouse and human, there are blacklists from the ENCODE project. This allows for a more accurate coverage estimation. Recently we were also looking into some normalization issues, and a properly filtered coverage track is essential for that as well.

The (currently) last missing piece in deepTools is some sort of blacklist-region support. Of course, one can imagine many more filters, but I would still consider a blacklist filter to be of general interest/purpose.

fidelram self-assigned this Sep 22, 2015
@friedue (Collaborator) commented Sep 23, 2015

So the resulting file should also lack the reads overlapping the blacklisted regions? (Originally, I thought you only wanted to exclude those reads from the calculation of scaling factors.)

dpryan79 added this to the 2.1 milestone Jan 17, 2016
@dpryan79 (Collaborator) commented Feb 9, 2016

I'm working on some code for this now. A question arises about exactly how to handle it. The simplest way is (1) to just ignore any genomic chunk that happens to overlap a blacklisted region. Alternatively, (2) one could try to trim the bounds of a genomic chunk so that it no longer overlaps. That's possible, though explaining how it works and handling all of the edge cases might prove a bit annoying. I would personally prefer (1), simply because it's easy to implement and explain to users. In general I would then add a blacklist option to all tools that don't accept a BED file of regions (alternatively, I could make the two options mutually exclusive).
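
(A rough sketch of approach (1), assuming chunks and blacklist entries are simple (chrom, start, end) tuples; none of these names come from the deepTools code base.)

```python
def overlaps_blacklist(chunk, blacklist):
    """True if the genomic chunk (chrom, start, end) touches any blacklisted region."""
    chrom, start, end = chunk
    return any(chrom == b_chrom and start < b_end and end > b_start
               for b_chrom, b_start, b_end in blacklist)

def filter_chunks(chunks, blacklist):
    """Approach (1): drop whole chunks that overlap the blacklist."""
    return [chunk for chunk in chunks if not overlaps_blacklist(chunk, blacklist)]
```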

@asrichter (Collaborator)

I think filtering out any fragments (not only reads) overlapping with blacklisted regions is fine.
Trimming could cause some trouble in downstream analyses.

Adding a blacklist option that accepts a BED file seems totally sufficient to me.

@dpryan79 (Collaborator) commented Feb 9, 2016

Looking at fragments would seriously decrease performance. I suppose it could be done, but I'll start with filtering chunks; we'll see if that suffices and only try to ignore entire fragments if absolutely needed.

@asrichter (Collaborator)

OK. When writing BAM files I prefer filtering by fragment, as otherwise the SAM flags have to be fixed for singletons (the "read paired" flag has to be unset). This, however, requires resource-intensive sorting by read name.
For deepTools, read-level filtering should be fine unless one also filters on SAM flags at the same time.
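
(A hedged illustration of the flag fix mentioned above, assuming pysam; deepTools itself only filters reads internally and does not rewrite BAMs this way.)

```python
def make_singleton(read):
    """Clear the pairing-related flags of a pysam AlignedSegment whose mate
    has been filtered out, so the remaining record is a valid singleton."""
    read.is_paired = False          # 0x1
    read.is_proper_pair = False     # 0x2
    read.mate_is_unmapped = False   # 0x8
    read.mate_is_reverse = False    # 0x20
    read.is_read1 = False           # 0x40
    read.is_read2 = False           # 0x80
    read.next_reference_id = -1     # RNEXT becomes '*'
    read.next_reference_start = -1  # PNEXT becomes 0
    return read
```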

@steffenheyne (Collaborator, Author)

Filtering fragments was also my initial thought, but if filtering out the genomic chunks is much easier/faster, I think it is OK this way. We can see how it "behaves" and improve later on...

One thing: should we also adjust some scaling factors like the effective genome size? A 1 kb region doesn't matter, but if the blacklist adds up to several Mb it might be substantial...

Thanks a lot btw :-)

@dpryan79 (Collaborator) commented Feb 9, 2016

If the blacklisted region is really large, then the user will need to adjust the effective genome size for the tools that require it (this applies to anything in deepTools that uses an effective genome size and also allows restricting the computation to a particular region). At the moment, the blacklist wouldn't normally affect things like normalization unless you use a method that actually samples the genome.
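
(To make that adjustment concrete, a back-of-the-envelope sketch; the file name and the effective genome size value are placeholders, and the BED is assumed to be non-overlapping.)

```python
def blacklist_span(bed_path):
    """Total number of blacklisted bases in a non-overlapping BED file."""
    total = 0
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) >= 3 and not line.startswith(("#", "track", "browser")):
                total += int(fields[2]) - int(fields[1])
    return total

# the value you would otherwise pass as the effective genome size (placeholder)
effective_genome_size = 2451960000
adjusted_genome_size = effective_genome_size - blacklist_span("blacklist.bed")
```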

@dpryan79 (Collaborator)

computeGCBias already has a --filterOut option that does this. Would people prefer that it be renamed to match every other tool (I'm currently calling the option --blackListFile), that the option in all the other tools be renamed to match it, or that we not worry about the difference? I'd generally prefer that the commands all have identical options when they're doing essentially the same thing.

@steffenheyne (Collaborator, Author)

I'd also prefer to have the same option everywhere...
Maybe even more in line with some other existing options would be
--blacklistRegionsFileName or
--blacklistRegionsFile or
--blacklistFileName

:-) ...but it's not so important...

@dpryan79 (Collaborator)

Maybe --blacklistFileName then for everything.

@steffenheyne (Collaborator, Author)

great

@steffenheyne (Collaborator, Author)

So @dpryan79 implemented blacklist filtering; here is an example of what the results look like.

[IGV snapshot comparing the tracks described below]

The top track is the original track from a BAM file. The light green track is from bamCoverage with the blacklistFileName option. The red track shows what a track looks like after filtering with bamutils. In blue is the blacklisted region, which was defined as >10 enrichment in the original track.

So this seems to work nicely. Please note the border effects at the blacklisted regions. These come from reads starting/ending outside the blacklisted region, since it would be too slow to filter out every read that overlaps it. By extending the blacklisted regions by about 1x the insert size at each end (see the sketch below), these effects should go away, and it is perfectly fine to leave that to the user.
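
(A small sketch of that work-around, similar to what `bedtools slop` does; file names and the 250 bp default padding are placeholders.)

```python
def extend_blacklist(bed_in, bed_out, pad=250):
    """Write a copy of the blacklist with `pad` bp added to each side of every region."""
    with open(bed_in) as src, open(bed_out, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            dst.write("%s\t%d\t%d\n" % (chrom, max(0, start - pad), end + pad))
```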

In addition, there is an estimation of how many reads fall into the blacklisted regions. This is used to adjust the scaling factors, e.g. for 1x coverage.

@dpryan79 can comment on how this is finally done! Thanks!

bamutils also does a good job of removing all overlapping reads (red track)! But it took about ~50 minutes just to filter the BAM! The impact on deepTools runtime is minor and comes only from the additional read-number estimation for the blacklisted regions...

Any suggestions/comments?!

@fidelram (Collaborator)

@steffenheyne Do you know how bamutils works? Why don't we see a 'hole' as in the bamCoverage case?

@dpryan79 (Collaborator)

Thanks for testing this! The results will generally be a bit better if the blacklisted regions are widened slightly (one could do this automatically, but I'd prefer to avoid that).

This generally works by taking each genomic chunk that would have been queried and subtracting out the blacklisted regions. For the scaling factors, the normal read count from the BAM index is taken and the number of reads in the blacklisted regions (from pysam's count() command) is subtracted. This has the benefit of giving better scaling.

I hope this suffices for people; performance would degrade to that of bamutils if we really went whole-hog on this.
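
(For reference, a hedged sketch of the scaling-factor adjustment described above, assuming pysam; the function name and the blacklist tuple format are placeholders, not the actual deepTools implementation.)

```python
import pysam

def reads_outside_blacklist(bam_path, blacklist):
    """Mapped-read count from the BAM index minus the reads that pysam's
    count() reports inside each blacklisted (chrom, start, end) region."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        total = bam.mapped
        in_blacklist = sum(bam.count(chrom, start, end)
                           for chrom, start, end in blacklist)
    return total - in_blacklist
```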

@steffenheyne (Collaborator, Author)

@fidelram Ah yes, I didn't use the "--skipNonCoveredRegions" option for any of the tracks, so this is probably how @dpryan79 implemented the encoding of blacklisted regions... and it is probably always "nan" no matter whether "--skipNonCoveredRegions" was used or not...

Should this be changed? Hmm... maybe blacklisted regions should rather get 0, with "nan" only when "--skipNonCoveredRegions" is used in addition.

@dpryan79 (Collaborator)

Yes, you'll get NAs for blacklisted regions, which is the only sane way to do it. I won't implement a different method. The lack of a gap in the bamutils-preprocessed data is likely due to read extension. Read extension still occurs with blacklists, but extensions into a blacklisted region are ignored (which is a good thing).

@steffenheyne (Collaborator, Author)

Yes, all right, but couldn't we simply change the NAs to 0 when writing the wig/bigWig?!

@dpryan79 (Collaborator)

No

@dpryan79 (Collaborator)

NAs aren't written; nothing is written for those regions, which is the correct way to do it.

@dpryan79 (Collaborator)

If there's no other feedback on this by Wednesday I'll merge that branch into develop and close this.

dpryan79 mentioned this issue Feb 24, 2016
@dpryan79 (Collaborator)

This is now merged in for the 2.2 release.
