
blacklist regions for bamCoverage/ bamCompare/ ... #101

Closed · steffenheyne opened this issue Sep 21, 2015 · 23 comments

@steffenheyne (Collaborator) commented Sep 21, 2015

Hi,

currently there is only a limited possibility to ignore reads by region in the BAM file. One can only ignore reads on whole chromosomes via:
--ignoreForNormalization IGNOREFORNORMALIZATION

It would be helpful if one could also provide a BED file of blacklisted regions; all reads falling into these regions should then be ignored when counting coverage etc. These reads should also be excluded when scaling to 1x coverage or calculating other normalization factors.

For paired-end data, it should be possible to exclude the whole read pair if either mate overlaps a blacklisted region.
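
(For illustration only: a minimal sketch of the requested filtering, assuming pysam and a three-column BED file. This is not deepTools code and all names are placeholders; skipping every read whose query name ends up in the returned set would drop both mates of a pair.)

```python
import pysam

def load_blacklist(bed_path):
    """Parse a BED file into {chrom: [(start, end), ...]}."""
    regions = {}
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return regions

def blacklisted_read_names(bam_path, regions):
    """Query names of all reads overlapping a blacklisted region; skipping
    these names during counting also drops the mate of every such read."""
    names = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, intervals in regions.items():
            if chrom not in bam.references:
                continue
            for start, end in intervals:
                for read in bam.fetch(chrom, start, end):
                    names.add(read.query_name)
    return names
```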

What do you think?
Thanks!

@friedue (Collaborator) commented Sep 21, 2015

Why would you want to keep the reads in the BAM file if you're going to ignore them downstream?

@steffenheyne (Collaborator, Author)

Many of the individual deepTools, like bamCoverage/bamCompare/bamFingerprint, have recently gained options to filter the BAM according to SAM flags (include/exclude) or mapping quality.

This is all quite nice, as it makes storing separate BAM files for specific filters superfluous. We often use the BAM file as the archive of all reads from a sequencing run, and filtering a BAM again and again is tedious and time-consuming.

We use blacklist filtering, for example, to remove artefact regions that are typically enriched in the input. For mouse and human, there are blacklists from the ENCODE project. This allows for a more accurate coverage estimation. Recently we were also looking into some normalization issues, and a properly filtered coverage track is essential for that as well.

The (currently) last missing piece in deepTools is some sort of blacklist-region support. Of course, one can imagine many more filters, but I would still consider a blacklist filter to be of general interest/purpose.

fidelram self-assigned this Sep 22, 2015
@friedue (Collaborator) commented Sep 23, 2015

So the resulting file should also lack the reads overlapping the blacklisted regions? (Originally, I thought you only wanted to exclude those reads from the calculation of scaling factors.)

dpryan79 added this to the 2.1 milestone Jan 17, 2016
@dpryan79 (Collaborator) commented Feb 9, 2016

I'm working on some code for this now. A question arises about exactly how to handle it. The simplest way is (1) to just ignore any genomic chunk that happens to overlap a blacklisted region. Alternatively, (2) one could try to trim the bounds of a genomic chunk so that it no longer overlaps. That's possible, though explaining how it works and handling all of the edge cases might prove a bit annoying. I would personally prefer (1), simply because it's easy to implement and explain to users. In general I would then add a blacklist option to all tools that don't accept a BED file of regions (alternatively, I could make the two options mutually exclusive).
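
(A rough sketch of approach (1), assuming chunks and blacklist entries are simple (chrom, start, end) tuples; none of these names come from the deepTools code base.)

```python
def overlaps_blacklist(chunk, blacklist):
    """True if the genomic chunk (chrom, start, end) touches any blacklisted region."""
    chrom, start, end = chunk
    return any(chrom == b_chrom and start < b_end and end > b_start
               for b_chrom, b_start, b_end in blacklist)

def filter_chunks(chunks, blacklist):
    """Approach (1): drop whole chunks that overlap the blacklist."""
    return [chunk for chunk in chunks if not overlaps_blacklist(chunk, blacklist)]
```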

@asrichter (Collaborator)

I think filtering out any fragments (not only reads) overlapping with blacklisted regions is fine.
Trimming could cause some trouble in downstream analyses.

Adding a blacklist option that accepts a BED file seems totally sufficient to me.

@dpryan79 (Collaborator) commented Feb 9, 2016

Looking at fragments would seriously decrease performance. I suppose it could be done, but I'll start with filtering chunks; we'll see if that suffices and only try to ignore entire fragments if absolutely needed.

@asrichter (Collaborator)

OK. When writing BAM files I prefer filtering by fragment, as otherwise the SAM flags have to be fixed for singletons (the "read paired" flag has to be unset). This, however, requires resource-intensive sorting by read name.
For deepTools, read-level filtering should be fine unless one also filters on SAM flags at the same time.
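
(A hedged illustration of the flag fix mentioned above, assuming pysam; deepTools itself only filters reads internally and does not rewrite BAMs this way.)

```python
def make_singleton(read):
    """Clear the pairing-related flags of a pysam AlignedSegment whose mate
    has been filtered out, so the remaining record is a valid singleton."""
    read.is_paired = False          # 0x1
    read.is_proper_pair = False     # 0x2
    read.mate_is_unmapped = False   # 0x8
    read.mate_is_reverse = False    # 0x20
    read.is_read1 = False           # 0x40
    read.is_read2 = False           # 0x80
    read.next_reference_id = -1     # RNEXT becomes '*'
    read.next_reference_start = -1  # PNEXT becomes 0
    return read
```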

@steffenheyne (Collaborator, Author)

Filtering fragments was also my initial thought, but if filtering out the genomic chunks is much easier/faster, I think it is OK this way. We can see how it "behaves" and improve later on...

One thing: should we also adjust some scaling factors like the effective genome size? A 1 kb region doesn't matter, but if the blacklist adds up to several Mb it might be substantial...

Thanks a lot btw :-)

@dpryan79 (Collaborator) commented Feb 9, 2016

If the blacklisted region is really large, then the user will need to adjust the effective genome size for the tools that require it (this applies to anything in deepTools that uses an effective genome size and also allows restricting the computation to a particular region). At the moment, the blacklist wouldn't normally affect things like normalization unless you use a method that actually samples the genome.
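
(To make that adjustment concrete, a back-of-the-envelope sketch; the file name and the effective genome size value are placeholders, and the BED is assumed to be non-overlapping.)

```python
def blacklist_span(bed_path):
    """Total number of blacklisted bases in a non-overlapping BED file."""
    total = 0
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            if len(fields) >= 3 and not line.startswith(("#", "track", "browser")):
                total += int(fields[2]) - int(fields[1])
    return total

# the value you would otherwise pass as the effective genome size (placeholder)
effective_genome_size = 2451960000
adjusted_genome_size = effective_genome_size - blacklist_span("blacklist.bed")
```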

@dpryan79 (Collaborator)

computeGCBias already has a --filterOut option that does this. Would people prefer that it be renamed to match every other tool (I'm currently calling the option --blackListFile), that the option in all the other tools be renamed to match it, or that we not worry about the difference? I'd generally prefer that the commands all have identical options when they're doing essentially the same thing.

@steffenheyne (Collaborator, Author)

I'd also prefer to have the same option everywhere...
Maybe even more in line with some other existing options would be
--blacklistRegionsFileName or
--blacklistRegionsFile or
--blacklistFileName

:-) ...but it's not so important...

@dpryan79 (Collaborator)

Maybe --blacklistFileName then for everything.

@steffenheyne (Collaborator, Author)

great

@steffenheyne (Collaborator, Author)

So @dpryan79 implemented blacklist filtering; here is an example of what the results look like.

[IGV snapshot comparing the tracks described below]

The top track is the original track from a BAM file. The light green track is from bamCoverage with the blacklistFileName option. The red track shows what a track looks like after filtering with bamutils. In blue is the blacklisted region, which was defined as >10 enrichment in the original track.

So this seems to work nicely. Please note the border effects at the blacklisted regions. These come from reads starting/ending outside the blacklisted region, since it would be too slow to filter out every read that overlaps it. By extending the blacklisted regions by about 1x the insert size at each end (see the sketch below), these effects should go away, and it is perfectly fine to leave that to the user.
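
(A small sketch of that work-around, similar to what `bedtools slop` does; file names and the 250 bp default padding are placeholders.)

```python
def extend_blacklist(bed_in, bed_out, pad=250):
    """Write a copy of the blacklist with `pad` bp added to each side of every region."""
    with open(bed_in) as src, open(bed_out, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            dst.write("%s\t%d\t%d\n" % (chrom, max(0, start - pad), end + pad))
```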

In addition, there is an estimation of how many reads fall into the blacklisted regions. This is used to adjust the scaling factors, e.g. for 1x coverage.

@dpryan79 can comment on how this is finally done! Thanks!

bamutils also does a good job of removing all overlapping reads (red track)! But it took about ~50 minutes just to filter the BAM! The impact on deepTools runtime is minor and comes only from the additional read-number estimation for the blacklisted regions...

Any suggestions/comments?!

@fidelram (Collaborator)

@steffenheyne Do you know how bamutils works? Why don't we see a 'hole' as in the bamCoverage case?

@dpryan79 (Collaborator)

Thanks for testing this! The results will generally be a bit better if the blacklisted regions are widened slightly (one could do this automatically, but I'd prefer to avoid that).

This generally works by taking each genomic chunk that would have been queried and subtracting out the blacklisted regions. For the scaling factors, the normal read count from the BAM index is taken and the number of reads in the blacklisted regions (from pysam's count() command) is subtracted. This has the benefit of giving better scaling.

I hope this suffices for people; performance would degrade to that of bamutils if we really went whole-hog on this.
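
(For reference, a hedged sketch of the scaling-factor adjustment described above, assuming pysam; the function name and the blacklist tuple format are placeholders, not the actual deepTools implementation.)

```python
import pysam

def reads_outside_blacklist(bam_path, blacklist):
    """Mapped-read count from the BAM index minus the reads that pysam's
    count() reports inside each blacklisted (chrom, start, end) region."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        total = bam.mapped
        in_blacklist = sum(bam.count(chrom, start, end)
                           for chrom, start, end in blacklist)
    return total - in_blacklist
```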

@steffenheyne (Collaborator, Author)

@fidelram Ah yes, I didn't use the "--skipNonCoveredRegions" option for any of the tracks, so this is probably how @dpryan79 implemented the encoding of blacklisted regions... and it is probably always "nan" no matter whether "--skipNonCoveredRegions" was used or not...

Should this be changed? Hmm... maybe blacklisted regions should rather get 0, with "nan" only when "--skipNonCoveredRegions" is used in addition.

@dpryan79 (Collaborator)

Yes, you'll get NAs for blacklisted regions, which is the only sane way to do it. I won't implement a different method. The lack of a gap in the bamutils-preprocessed data is likely due to read extension. Read extension still occurs with blacklists, but extensions into a blacklisted region are ignored (which is a good thing).

@steffenheyne (Collaborator, Author)

Yes, all right, but couldn't we simply change the NAs to 0 when writing the wig/bigWig?!

@dpryan79 (Collaborator)

No

@dpryan79 (Collaborator)

NAs aren't written; nothing is written for those regions, which is the correct way to do it.

@dpryan79 (Collaborator)

If there's no other feedback on this by Wednesday I'll merge that branch into develop and close this.

dpryan79 mentioned this issue Feb 24, 2016
@dpryan79 (Collaborator)

This is now merged in for the 2.2 release.
