Before deduplication the BAM file size is 102M, and after deduplication it is severely reduced to 378K #518
That's odd. The log suggests that UMI-tools is only reading in 5951 reads, which is very different from your 2760592. The only way I can see this happening is if you have a very large number of read2 alignments, but a far smaller number of read1 alignments. What sort of sequencing is this?
Hello IanSudbery, thank you so much for your timely reply, and for the future replies too.
Okay, so, as this is single cell, you need to do things a bit differently.

Firstly, as far as I can tell, read1 contains little or no usable information. Once you have processed the cell barcodes and UMIs from read1, you will want to discard read1 and not use it further (i.e. map only read2, not read1). This means that your data is not paired end, and you should drop the `--paired` flag.

Secondly, you need to tell UMI-tools that it is dealing with data that has separate cells that need to be dealt with separately. You can do this by adding the `--per-cell` option.

Finally, I believe that in CEL-seq, PCR happens before fragmentation, and thus mapping position is not informative of duplication status. You will therefore need to assign reads to genes, and then deduplicate on the basis of assigned gene rather than mapping position. The whole process will be very similar to that for 10X outlined in this tutorial:
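The steps described above can be sketched as shell commands. This is only an outline, not the poster's actual pipeline: all file names, the thread count, and in particular the `--bc-pattern` barcode layout are assumptions that must be adapted to the real CEL-seq library design.

```shell
# 1. Move the UMI + cell barcode from read1 onto the read names of read2,
#    then discard read1. The pattern below (6 bp UMI "N", 6 bp cell
#    barcode "C") is an ASSUMPTION -- set it to your actual layout.
umi_tools extract --bc-pattern=NNNNNNCCCCCC \
    --stdin=read1.fastq.gz --stdout=/dev/null \
    --read2-in=read2.fastq.gz --read2-out=read2_extracted.fastq.gz

# 2. Map ONLY read2 (single-end), since read1 carried no usable sequence.
STAR --runThreadN 4 --genomeDir star_index \
    --readFilesIn read2_extracted.fastq.gz --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate

# 3. Assign reads to genes (featureCounts writes the gene into an XT tag),
#    then count per assigned gene and per cell, not by mapping position.
featureCounts -a annotation.gtf -o gene_assignments \
    -R BAM Aligned.sortedByCoord.out.bam
samtools sort Aligned.sortedByCoord.out.bam.featureCounts.bam \
    -o assigned_sorted.bam
samtools index assigned_sorted.bam
umi_tools count --per-gene --gene-tag=XT --per-cell \
    -I assigned_sorted.bam -S counts.tsv.gz
```

This mirrors the structure of the 10X tutorial mentioned above, with the barcode pattern swapped for a CEL-seq-style one.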
Greetings IanSudbery,

Log files of deduplication & count:

Header of the sorted BAM after assigning genes (assigned_sorted.bam), which is used above:

Thanks in advance.
The logs suggest there has been an improvement from ~5,000 reads surviving deduplication to ~50,000 reads, which is a ten-fold improvement. Note, however, that 3M reads are being skipped because they are not assigned to any gene.
Yeah, I agree, IanSudbery; compared to the previous deduplication it is increasing ten-fold now. My concern is only that huge number of unassigned reads, because the people who processed this same dataset had 1,12,819 UMI reads after deduplication.
What isn't mentioned in your description is what they do with reads that multimap. In the sample you provided, all the unassigned reads were unassigned because they were multimapped. There is no reason you couldn't follow their procedure, but using UMI-tools to do the collapsing. If you have mapped to the transcriptome, you can use the `--gene-transcript-map` option. One alternative would be to use the switches.

I'll just finally note that if the original study didn't use UMI-tools, then it is likely that you will find fewer reads post-deduplication, as we implement error correction that most other tools don't, and this leads to more reads being collapsed. Although I wouldn't expect that to be 1M -> 50k.
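Counting per gene from a transcriptome alignment could look roughly like the sketch below. The file names are assumptions; the map file must have the gene in column 1 and the transcript in column 2.

```shell
# Sort and index STAR's transcriptome alignment, then count UMIs per
# gene, collapsing reads that hit different transcripts of the same
# gene via the gene-transcript map.
samtools sort Aligned.toTranscriptome.out.bam -o tx_sorted.bam
samtools index tx_sorted.bam
umi_tools count --per-gene --per-cell \
    --gene-transcript-map=transcript2gene.txt \
    -I tx_sorted.bam -S counts.tsv.gz
```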
Regarding multimapping, they didn't mention any details about it.

STAR mapping: I got this file aligned to the transcriptome, "Aligned.toTranscriptome.out.bam".

Transcript-to-gene map generation:

UMI deduplication/count: I got the following output.

I attach here the header of the BAM file from STAR and the Transcript-gene-map.txt file.
In the tx2gene map, you want genes in column 1 and transcripts in column 2, not vice versa.
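One way to build such a map in that column order is to pull `gene_id` and `transcript_id` out of a GTF with awk. The two `printf` lines below fabricate a tiny illustrative GTF; in practice you would point the awk command at your real annotation file.

```shell
# Tiny illustrative GTF (tab-separated); a real annotation would come
# from Ensembl/GENCODE.
printf 'chr1\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >  annotation.gtf
printf 'chr1\thavana\ttranscript\t1\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n' >> annotation.gtf

# Gene in column 1, transcript in column 2 -- the order umi_tools expects.
awk -F'\t' '$3 == "transcript" {
    match($9, /gene_id "[^"]+"/);       gene = substr($9, RSTART + 9,  RLENGTH - 10)
    match($9, /transcript_id "[^"]+"/); tx   = substr($9, RSTART + 15, RLENGTH - 16)
    print gene "\t" tx
}' annotation.gtf > transcript2gene.txt

# Prints two lines: G1<TAB>T1 and G1<TAB>T2
cat transcript2gene.txt
```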
Thanks a lot, IanSudbery, for your response.

Again, Ian, in the above transcriptome-mapping method with STAR:

IMPORTANT INFORMATION: if a read mapped within 1 kb of the 3′ end of a gene and was in the same orientation as the gene, they assessed it as a transcript-mapping read for that gene.
Again, if I change --outFilterMultimapNmax to 50 I get 100684 reads, but only 58.7% of the genes from the original data are covered.
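For reference, a STAR invocation with that setting might look like the sketch below (index path, input file, and thread count are assumptions):

```shell
# Allow up to 50 alignments per read instead of STAR's default of 10;
# reads exceeding the limit are reported as unmapped (too many loci).
# --quantMode TranscriptomeSAM additionally emits
# Aligned.toTranscriptome.out.bam for transcriptome-based counting.
STAR --runThreadN 4 --genomeDir star_index \
    --readFilesIn read2_extracted.fastq.gz --readFilesCommand zcat \
    --outFilterMultimapNmax 50 \
    --quantMode TranscriptomeSAM \
    --outSAMtype BAM SortedByCoordinate
```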
I'm closing due to inactivity |
Greetings,
My code for deduplicating my paired-end data:

## Step 1: barcode/UMI extraction

## Step 2: mapping the processed reads with STAR

## Step 3: deduplication
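The commands themselves did not survive the issue rendering; a minimal sketch of what step 3 typically looks like for paired-end data (file names are assumptions, not the poster's actual command):

```shell
# Deduplicate a coordinate-sorted, indexed BAM of paired-end reads.
# umi_tools dedup collapses reads sharing a mapping position and UMI.
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
umi_tools dedup --paired \
    -I mapped_sorted.bam -S deduped.bam --output-stats=dedup_stats
```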
My terminal output for the above deduping:

BAM content before: 2760592 lines
BAM content after deduping: 8454 lines
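When comparing input and output, counting alignment records with samtools is more informative than file sizes or raw line counts, since BAM is a binary format (file names below are assumptions):

```shell
# Count alignment records before and after deduplication.
samtools view -c mapped_sorted.bam   # all records, before dedup
samtools view -c deduped.bam         # records surviving dedup

# Breakdown into mapped / paired / secondary records, which helps spot
# a read1-vs-read2 imbalance like the one noted above.
samtools flagstat mapped_sorted.bam
```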