Impact of partitioning Genome? #31

BrendanBeahan · 2024-08-13T12:20:25Z

Hello,

I attempted to run DRUMMER in exome mode across my genome-aligned reads, one chromosome at a time, as follows:

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER
rm -rf /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER

cut /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa.fai -f 1 > /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt

mv minimap.sortG*.bam /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
mv minimap.sortG*.bai /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
cp -r /DRUMMER .
cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/

while read -r line;
do
  python3 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/DRUMMER.py \
    -r /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa \
    -t /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.2.*.bam \
    -c /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.1.*.bam \
    -o /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/$line/ \
    -a exome \
    -p 1 \
    -n $line \
    -m true ;
done < /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt || true

However, this consistently gave me an out-of-memory (OOM) error, despite requesting 60 GB on my school's HPC. So I attempted to limit the analysis to just the first chromosome to conserve memory, as follows:

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER
rm -rf /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER

cut /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa.fai -f 1 > /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt

mv minimap.sortG*.bam /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
mv minimap.sortG*.bai /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
cp -r /DRUMMER .
cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/


first_line=$(head -n 1 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt)

python3 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/DRUMMER.py \
  -r /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa \
  -t /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.2.*.bam \
  -c /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.1.*.bam \
  -o /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/$first_line/ \
  -a exome \
  -p 1 \
  -n $first_line \
  -m true ;

But despite now running with 120 GB, the job is still going after about two days. The files in the generated chromosome 1 directory haven't been updated in a while either, and I don't see any intermediate logging or files that would tell me how far the analysis has progressed. My questions are:

Is it OK to split the run by chromosome like this, or will it undermine the output or prevent completion?
What is the typical run time for a Mus musculus sample with about 3.2 million reads, and how much memory is reasonable to request?
Finally, I see in your troubleshooting section that the FASTA ID names must be simplified. Currently my genome IDs look like this: >1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
whereas the names taken from my .fai are just the chromosome numbers (1, 2, 3, etc.).
Could this plausibly be what is preventing the analysis from moving forward?
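
For reference, a minimal sketch of what "simplifying" the FASTA IDs could look like, assuming it means keeping only the first whitespace-delimited token of each header (the same name that ends up in the first column of the .fai). The function name and the file names in the usage comment are hypothetical, and whether DRUMMER needs this is not confirmed here.

```shell
# Hypothetical helper: trim every FASTA header to its first token,
# e.g. ">1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF" becomes ">1".
# Sequence lines are passed through unchanged.
simplify_fasta_headers() {
  awk '/^>/ { print $1; next } { print }' "$@"
}

# Example usage (file names are placeholders):
#   simplify_fasta_headers Mus_musculus.GRCm38.dna.primary_assembly.fa > ref.simple.fa
#   samtools faidx ref.simple.fa   # rebuild the index for the rewritten file
```

Because the trimmed names match the .fai names, existing alignments made against the original reference should not need to be redone for this change alone.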

Apologies for writing a novel, but I'd appreciate any insights you all would have in helping me to get your lovely tool working for me.

Best,
Brendan


DepledgeLab · 2024-08-16T06:46:22Z

Hi Brendan,

As an initial test to troubleshoot this, could you supply DRUMMER with just one BAM file per condition and make sure these two BAM files contain ~50k reads each? This should make things run very quickly and would allow you to see whether DRUMMER runs to completion.

I've never tried DRUMMER with quite so many reads on such a large genome, but I am a little surprised that the run time and memory requests are not sufficient. Let me know when you've tried the subsampling approach and hopefully we can better resolve the issue.
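
One way the subsampling test could be set up is sketched below. It assumes samtools is installed; the function names, the default seed, and the file names in the usage comment are illustrative and not part of DRUMMER. Note that `samtools view -c` counts alignment records (including secondary and supplementary ones), so the result is only approximately 50k reads.

```shell
# Fraction of records to keep so that roughly `target` out of `total` remain (capped at 1).
subsample_fraction() {
  local total=$1 target=$2
  awk -v t="$total" -v k="$target" \
    'BEGIN { f = (t > 0 && k < t) ? k / t : 1; printf "%.6f\n", f }'
}

# Write a subsample of roughly `target` records from in_bam to out_bam, then index it.
subsample_bam() {
  local in_bam=$1 out_bam=$2 target=${3:-50000} seed=${4:-42}
  local total frac
  total=$(samtools view -c "$in_bam")
  frac=$(subsample_fraction "$total" "$target")
  if [ "$frac" = "1.000000" ]; then
    # Already at or below the target size: keep everything.
    cp "$in_bam" "$out_bam"
  else
    # samtools view -s takes SEED.FRACTION: integer part is the seed,
    # decimal part is the fraction of templates to keep.
    samtools view -b -s "${seed}${frac#0}" -o "$out_bam" "$in_bam"
  fi
  samtools index "$out_bam"
}

# Example usage (file names are placeholders):
#   subsample_bam control.bam control.50k.bam
#   subsample_bam treated.bam treated.50k.bam
```

The two resulting files can then be passed to DRUMMER as the single `-c` and `-t` inputs for the quick test.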
