-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inaccurate counts with bedtools intersect -split and -f #750
Comments
Thanks for reporting this. I am away on vacation but will dig into this when I return. |
I see the problem and will start on a solution. Thanks very much for reporting this. |
… -a record. Not for each sub-aligment.
This has been fixed in master. Thanks so much for reporting! |
Thank you Aaron! Can I assume that it use the call as first tried? |
Yep, and you can also use |
The visualizations and numbers look good. Thanks! |
I'm trying to do something that I thought was straightforward but is not giving me the expected answer using
bedtools intersect
. I had written this to the Google forum (yet to be approved by the moderator) and therefore I kept working on it. As such, I've found a workaround solution which I share at the bottom.The task I'm attempting is "How many reads from my RNA-seq sample overlap with any exon from my GTF file, given that overlaps are at least 50% of the read's length?" The majority of our read lengths are between 73 and 76 (one read of a paired-end sequencing is trimmed) so I expected overlaps of at least 37 bp to be counted.
From bedtools documentation, one way I believed to perform the the call was the following:
$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -u -f 0.5 -split | wc -l
The bedtools version I'm using is 2.26 (but I get the same results with 2.27).
STAR_BAM_PA
is my primary alignments BAM file using STAR then samtools.GTF_REF_EXON
refers to the genomic coordinates file from NCBI RefSeq with only the lines containing "exon".Not surprisingly, removing the
-f
parameter (so that the default is min. 1 bp overlap) increased this number which was expected. In order to visualize what was actually going on, I removed thewc -l
and focused on one transcript and the reads that would align there. To my surprise, I found that when using the-f
parameter that all of the reads were within an exon and ignored those that spanned exon-exon boundaries.As you can see, the reads without the
-f
parameter showed reads spanning exon-exon junctions, several of which one end was spanning an exon with I thought using-split
would take care of this as it is from RNA-seq. The image is the same as above but zooming in between exon 1 and the start of exon 2.I tried the following:
–split
option (results were the same)-bed -f 0.5 -u -split
)-u
These all did not yield the desired results.
My interpretation of this is that the
-f
parameter isn't working on the length of the-abam
input file has the documentation states (since theSTAR_BAM_PA
file appears to show the original read lengths), but rather on the resulting alignment which is a partial output of thebedtools intersect
call. This misses many reads that span exon-exon junctions. This appears to be a bug to me.My workaround is the following:
$BEDTOOLS intersect -abam STAR_BAM_PA -b GTF_REF_EXON -bed -split \
| cut -f 1,2,3,4,5 \
| awk '$3-$2 > 36.5’ \
| awk ' { print $2, $3, $3-$2, $4}’ \
| cut -d ' ' -f 4 \
| uniq \
| wc -l
Line 1: The output of the
bedtools intersect
call with-split
shows all overlaps to an exon.Line 2 and 3: I filter the results to have only overlaps greater than 36.5 (since most read lengths are at least 73).
Lines 4-6: I count only the reads once if they have this overlap.
Of course it'd be nice if something can be recommended (or fixed) to use in bedtools where I don't have to do this. (It's more accurate than the bedtools only call but it requires hard-coding of the read length filter and other testing shows it stops with very large BAM files.)
I look forward to feedback.
The text was updated successfully, but these errors were encountered: