Is there a user group? #19
Comments
@tdlong Hi, I would love to create a user group. Thank you for the suggestion!
How large is your BAM file? If it goes beyond RAM capacity, the program can fail, but it might also be a bad memory allocation. It would be great if you could provide me such a BAM file so I can do something about it.
Yep, "-q 10" might be enough; I would recommend it. I am currently working on a user guide but haven't quite finished it. I will include the best practices there.
My BAM file is 123 Gb! I have a few flow cells of RNA-seq data from several different tissues that I wish to use to annotate a de novo genome assembly. I am now filtering the BAM file to include only high-quality mapped pairs, which should reduce the size (I will keep track of this!).
It sounds like the best thing for me to do is to re-run on the filtered BAM and profile memory while it is running. I will report back with the memory profile over time.
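In case it is useful, the kind of memory sampling I have in mind is just a rough sketch: /usr/bin/time -v already reports the peak ("Maximum resident set size") at exit, and a simple ps loop can give the time course. The process match and log file name are placeholders.
pid=$(pgrep -f strawberry | head -n 1)       # assumes a single strawberry process on the node
while kill -0 "$pid" 2>/dev/null; do
    ps -o rss=,etime= -p "$pid" >> strawberry_mem.log   # resident set size (kB) and elapsed time
    sleep 60
done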
Thanks for writing the software! I think it is an important contribution as we see more and more de novo assemblies, and it is fairly cost-effective to get pretty good RNA-seq from several dozen tissues.
I guess one strategy would be to run the program on a small subset of the data and somehow subtract already-annotated (housekeeping) genes from the BAM file. With these RNA-seq datasets the top 100 genes can soak up >25% of the reads.
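A rough sketch of that idea with samtools (the BED file of already-annotated loci, the output names, and the scaffold name are placeholders):
# 1) keep only properly paired, MAPQ >= 30 alignments
samtools view -b -f 0x2 -q 30 RNAseq.sort.bam > RNAseq.filter.bam
# 2) reads overlapping the annotated loci go to hk.bam; everything else,
#    the part still worth assembling, goes to the -U file
samtools view -b -L housekeeping.bed -o hk.bam -U RNAseq.minus_hk.bam RNAseq.filter.bam
# or just test-drive the program on a single chromosome/scaffold
samtools index RNAseq.filter.bam
samtools view -b RNAseq.filter.bam chr19 > RNAseq.chr19.bam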
T.
@tdlong First of all, thanks for using Strawberry and giving me some feedback; I really appreciate it. Please let me know after you profile it. I am willing to work with you to fix any issues/bugs you find.
For your purpose, I agree: if you have a very large BAM, you can select a subset using a known set of loci, but those highly expressed genes might still be of interest to you.
I am considering a feature to process the BAM file on the fly to avoid such memory problems. I am also interested to know whether you have problems running other de novo assemblers, like Cufflinks or StringTie.
It is on the big memory node now.
Here is the script I submitted to my SGE queuing software.
#$ -N strawberry
#$ -q bigmemory
#$ -pe openmp 80
#$ -R y
module purge
module load samtools/1.8-11
module load perl/5.16.2
module load java/1.8.0.51
#/usr/bin/time -v samtools index hisat2_out/mouse.RNAseq.filter.sort.bam
#echo "finished samtools run at $(date)"
/usr/bin/time -v ./Strawberry/bin/strawberry hisat2_out/mouse.RNAseq.filter.sort.bam -o strawberry_June26 -p 80
Oddly, profiling the job shows it using only about 6 cores, not the 80 I passed to the program, and it is using those cores rather strangely.
Funny: depending on which measure you use, it is either using 605% of a core (~6 cores, via ps) or 1 core (via top).
But note the CPU time reported by top: 1955 min, or about 33 hours. Divided by 6 that is ~5.4 hours, approximately the time the job has been running, modulo the period when it was not running in parallel. So that is odd; top is definitely underestimating the load.
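For what it is worth, here are two less ambiguous ways to look at the same job (the PID is a placeholder); one possible explanation for the discrepancy is top running in "Solaris" CPU accounting mode, where %CPU is divided by the number of cores:
# per-process: %CPU summed over all threads (can legitimately exceed 100%),
# plus thread count, total CPU time, and elapsed wall-clock time
ps -o pid,nlwp,pcpu,cputime,etime -p <strawberry_PID>
# per-thread view; the 'I' key inside top toggles Irix/Solaris CPU accounting
top -H -p <strawberry_PID>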
@tdlong Currently the parallelization can be improved a lot; the multithreading has a large dispatching overhead, so I am not very surprised to see the low CPU load. For now I recommend using at most 10 cores so you don't waste resources. Better multithreading is a feature I am working on right now; at the moment, -p 10 is enough.
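For reference, the earlier submission line with the suggested cap would just change the -p value (the output directory name is arbitrary):
/usr/bin/time -v ./Strawberry/bin/strawberry hisat2_out/mouse.RNAseq.filter.sort.bam -o strawberry_out -p 10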
It seems to do a pretty good job. In some cases my old Trinity -> Augustus pipeline seems a little closer to the exonerate predictions; in other cases HISAT2 -> Strawberry does.
When I filtered the input BAM file, the program did not crash. Here is my "pipeline" for your reference.
foreach RNAseq_experiment:
    hisat2 -p 8 -x $TREF -1 $R1 -2 $R2 | samtools view -Sbo hisat2_out/$samplename.bam -
    samtools sort -o hisat2_out/$samplename.sort.bam hisat2_out/$samplename.bam
# merge into one big file
ls hisat2_out/*.sort.bam > bamfiles.tomerge.txt
bamtools merge -list bamfiles.tomerge.txt -out hisat2_out/RNAseq.bam
samtools sort -o hisat2_out/RNAseq.sort.bam hisat2_out/RNAseq.bam
# filter out poorly mapped reads
samtools view -b -f 0x2 -q 30 hisat2_out/RNAseq.sort.bam > hisat2_out/RNAseq.filter.sort.bam
samtools index hisat2_out/RNAseq.filter.sort.bam
# filtered bam file is 75Gb, roughly half the size of the unfiltered
./Strawberry/bin/strawberry hisat2_out/RNAseq.filter.sort.bam -o strawberry_June26 -p 8
cd strawberry_June26/
# I want to visualize in SCGB
module load ucsc-tools/jan-19-2016
gtfToGenePred assembled_transcripts.gtf strawberry.Gp
genePredToBed strawberry.Gp strawberry.BED12
# bed formatting
sort -k1,1 -k2,2n strawberry.BED12 >temp.temp
sizes=".../PP.chrom.sizes"
bedToBigBed -type=bed12 temp.temp $sizes strawberry.bigBed -tab
I was annotating a mammalian genome and the program crashed. There do not appear to be any intermediate files (beyond logs). Before I profile, is it likely that I simply went above 500 Gb of RAM and should be running this on a high-memory node?
I am getting lots and lots of warnings about reads having multiple hits and/or mapping to multiple chromosomes. This does not surprise me, as I am giving the program the BAM file straight from HISAT2 (and mammalian genomes have lots of pseudogenes, etc.). Should that BAM file be pre-processed to keep only "-q 30" read pairs? What is "best practice"?
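For concreteness, the kind of pre-filtering I have in mind, just a sketch with placeholder file names (-f 0x2 keeps properly paired reads, -q 30 drops low-MAPQ alignments):
samtools view -b -f 0x2 -q 30 RNAseq.sort.bam > RNAseq.filter.sort.bam
samtools index RNAseq.filter.sort.bam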