Impact of partitioning Genome? #31

Open
BrendanBeahan opened this issue Aug 13, 2024 · 1 comment
BrendanBeahan commented Aug 13, 2024

Hello,

I attempted to run DRUMMER in exome mode on my genome-aligned reads as follows:

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER
rm -rf /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER

cut /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa.fai -f 1 > /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt

mv minimap.sortG*.bam /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
mv minimap.sortG*.bai /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
cp -r /DRUMMER .
cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/

while read -r line; do
  python3 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/DRUMMER.py \
    -r /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa \
    -t /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.2.*.bam \
    -c /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.1.*.bam \
    -o /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/$line/ \
    -a exome \
    -p 1 \
    -n $line \
    -m true
done < /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt || true

However, this consistently failed with an out-of-memory (OOM) error, even with 60 GB of memory on my school's HPC. I therefore tried limiting the analysis to just the first chromosome to conserve memory, as follows:

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

mkdir -p /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER
rm -rf /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER

cut /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa.fai -f 1 > /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt

mv  minimap.sortG*.bam /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
mv  minimap.sortG*.bai /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/

cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/
cp -r /DRUMMER .
cd /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/


first_line=$(head -n 1 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/chromosomes.txt)

python3 /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/DRUMMER.py \
  -r /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/Pipeline/inputs/refs/refs/ensemble/Mus_musculus.GRCm38.dna.primary_assembly.fa \
  -t /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.2.*.bam \
  -c /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/minimap.sortG.1.*.bam \
  -o /rhea/scratch/brussel/vo/000/bvo00030/vsc11010/results_mouse_debugging/drummer/DRUMMER/$first_line/ \
  -a exome \
  -p 1 \
  -n $first_line \
  -m true

Even with 120 GB of memory, however, the job is still running after about two days. The files in the chromosome 1 output directory have not been updated in a while, and I don't see any intermediate logging or files that would indicate the progress of the analysis. My questions are:

  1. Is it OK to split the run by chromosome like this (using the names cut from the .fai), or will it undermine the output or prevent completion?
  2. What is the typical run time for a Mus musculus sample with about 3.2 million reads, and how much memory is reasonable to request?
  3. Finally, I see in your troubleshooting section that the FASTA sequence names must be simplified. Currently my genome headers look like this: >1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
    whereas the names taken from my .fai are just the chromosome numbers: 1 2 3 etc.
    Could this plausibly be what is preventing the analysis from moving forward?
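For reference on question 3, a minimal sketch of one way to simplify the FASTA headers so that each one contains only the sequence name, matching the first column of the .fai. The function name and the example file names are hypothetical, and whether this is the exact format DRUMMER expects is an assumption based on the troubleshooting note mentioned above.

```shell
# Hypothetical helper: keep only the first whitespace-delimited token of each
# FASTA header line (e.g. ">1 dna:chromosome ..." becomes ">1").
# Sequence lines are passed through unchanged.
simplify_fasta_headers() {
  awk '/^>/ { print $1; next } { print }' "$1"
}

# Example usage (file names are placeholders):
#   simplify_fasta_headers genome.fa > genome.simple.fa
#   samtools faidx genome.simple.fa   # rebuild the index for the new file
```

Writing to a new file rather than editing in place keeps the original reference intact, so existing alignments and indexes can still be checked against it.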

Apologies for writing a novel, but I'd appreciate any insights you have that would help me get your lovely tool working.

Best,
Brendan

@DepledgeLab
Owner

Hi Brendan,

As an initial troubleshooting step, could you supply DRUMMER with just one BAM file per condition, and make sure each of these two BAM files contains ~50k reads? This should make things run very quickly and let you see whether DRUMMER runs to completion.

I've never tried DRUMMER with quite so many reads on such a large genome, but I am a little surprised that the run time and memory you requested are not sufficient. Let me know when you've tried the subsampling approach and hopefully we can better resolve the issue.
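The subsampling test could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the seed value, and the file names are hypothetical, `samtools` is assumed to be on the PATH, and the `--subsample` / `--subsample-seed` options assume a reasonably recent samtools release. Note that `samtools view -c` counts alignment records, so the result approximates rather than exactly matches a read count.

```shell
# Hypothetical helper: fraction of records to keep so that roughly `target`
# remain out of `total`, capped at 1 (keep everything).
subsample_fraction() {
  awk -v total="$1" -v target="$2" 'BEGIN {
    f = (total > 0) ? target / total : 1
    if (f > 1) f = 1
    printf "%.4f\n", f
  }'
}

# Hypothetical helper: write a subsampled, indexed copy of a BAM file.
# Arguments: input BAM, output BAM, target record count (default 50000).
subsample_bam() {
  in_bam=$1
  out_bam=$2
  target=${3:-50000}
  total=$(samtools view -c "$in_bam")            # count alignment records
  frac=$(subsample_fraction "$total" "$target")
  samtools view -b --subsample "$frac" --subsample-seed 42 \
    -o "$out_bam" "$in_bam"
  samtools index "$out_bam"
}

# Example usage (file names are placeholders):
#   subsample_bam control.bam control.50k.bam 50000
#   subsample_bam treatment.bam treatment.50k.bam 50000
```

For example, `subsample_fraction 3200000 50000` prints `0.0156`. Fixing the seed makes the subsample reproducible, which helps when comparing repeated test runs.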
