Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

freebayes joint calling memory exceeded #771

Closed
dwaggott opened this issue Feb 21, 2015 · 8 comments
Closed

freebayes joint calling memory exceeded #771

dwaggott opened this issue Feb 21, 2015 · 8 comments

Comments

@dwaggott
Copy link

I thought I'd start a new issue regarding processes being killed and jobs going idle. The last run #741 ended up going to completion, thank you.

I'm trying to run the freebayes joint caller using bams and vcfs from #741. The previous run actually did pooled and not joint calling via jointcaller: freebayes-joint.

The slurm scheduler killed some of the processes due to memory at what I think is the freebayes step. I previously increased the memory to 4g from 2g for freebayes. I'm now upping to 8g, does that seem rational?

Processing reference #83 (GL000225.1)
Processing reference #84 (GL000192.1)
slurmstepd: *** JOB 1616873 CANCELLED AT 2015-02-20T13:07:23 *** on sh-5-8
slurmstepd: Exceeded step memory limit at some point. oom-killer likely killed a process.
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1616873/step_batch: Device or resource busy

However in the engine logs I do see some sambamba issues. I'm using the latest dev code with updated tools.

Processing reference #19 (19)
Processing reference #20 (20)
Processing reference #21 (21)
sambamba-depth:  (Broken pipe)
Processing reference #1 (1)
Processing reference #2 (2)
Processing reference #3 (3)
Processing reference #4 (4)
Processing reference #5 (5)

The submission script also shows a memory issue. It's using 8g and I'm now bumping to 12g.

[2015-02-20T21:27Z] sh-5-15.local: Could not find any mapped reads in target region 1:109707980..109708009
[2015-02-20T21:28Z] sh-3-21.local: Could not find any mapped reads in target region 11:125313108..125313163
[2015-02-20T21:27Z] sh-3-7.local: Could not find any mapped reads in target region 2:32211131..32211188
/var/spool/slurmd/job1616871/slurm_script: line 50: 10499 Killed                  bcbio_nextgen.py --timeout 10000 --retries 2 ../config/project_n143-fb-joint.yaml -t ipython -n 320 -q normal -s slurm -r "timelimit=2-00:00:00"
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1616871/step_batch: Device or resource busy
@chapmanb
Copy link
Member

Daryl;
Sorry about the issues. Trying to tackle from top to bottom:

  • Is the worker engine being killed, or the main bcbio process? I wouldn't expect adding more samples to take more memory here, so it is possible that killed message is from the main process?
  • Those broken pipes are expected. It will report this if you stream a BAM file and only view the head to look up information, so I wouldn't worry about them.
  • If you've increased the number of samples, the main process and controller will require additional memory since we need to manage more messages for all the items to process. This is something we hope will be better/more scalable in the future using a CWL-based workflow processing pipeline that uses databases to manage message instead of memory like IPython.

Hope this helps.

@dwaggott
Copy link
Author

Okay, things ran for a bit with the bumped memory but seem to be stuck again i.e. no files being updated but jobs are still running.

The controller job log doesn't end with an error. Still it's about 12g with a lot of lines like below:

[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:36 sh-2-16.local INFO [bcbio.run.itx] - removing AT from Y:18006687
[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:36 sh-2-16.local INFO [bcbio.run.itx] - removing AT from Y:18040350
[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:37 sh-2-16.local INFO [bcbio.run.itx] - removing GTT from Y:18054002
[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:37 sh-2-16.local INFO [bcbio.run.itx] - removing AG from Y:18070661
[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:37 sh-2-16.local INFO [bcbio.run.itx] - removing AC from Y:18101593
[2015-02-21T20:05Z] sh-2-16.local: 2015-02-21 12:05:37 sh-2-16.local INFO [bcbio.run.itx] - removing GTAGAT from Y:18209106

There were a few earlier errors in the controller log:

[2015-02-21T12:26Z] sh-3-3.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run 
    _do_run(cmd, checks, log_stdout)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'bcbio-variation-recall square -Xms250m -Xmx8g -XX:+UseSerialGC -c 16 -r 1:142542311-158052760 --caller freebayes /scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/1/batch1-1_142542310_158052760.vcf.gz /share/PI/euan/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /scratch/PI/    euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/1/batch1-1_142542310_158052760.vcf-inputs.txt
2015-Feb-21 04:26:55 -0800 sh-3-3.local ERROR [bcbio.variation.recall.main] - 
^[[1;31mjava.lang.RuntimeException^[[m: ^[[3mEOF while reading string^[[m
             ^[[37mclojure.lang.Util.runtimeException^[[m  ^[[32m      Util.java:  219^[[m
    ^[[37mclojure.lang.LispReader$StringReader.invoke^[[m  ^[[32mLispReader.java:  461^[[m
      ^[[37mclojure.lang.LispReader.readDelimitedList^[[m  ^[[32mLispReader.java: 1148^[[m
       ^[[37mclojure.lang.LispReader$MapReader.invoke^[[m  ^[[32mLispReader.java: 1091^[[m
                   ^[[37mclojure.lang.LispReader.read^[[m  ^[[32mLispReader.java:  185^[[m
                     ^[[37mclojure.lang.RT.readString^[[m  ^[[32m        RT.java: 1738^[[m
                       ^[[33mclojure.core/^[[1;33mread-string^[[m  ^[[32m       core.clj: 3427^[[m
^[[33mbcbio.variation.recall.square/^[[1;33msample-to-bam-map^[[m  ^[[32m     square.clj:  245^[[m
     ^[[33mbcbio.variation.recall.square/^[[1;33mcombine-vcfs^[[m  ^[[32m     square.clj:  269^[[m
            ^[[33mbcbio.variation.recall.square/^[[1;33m-main^[[m  ^[[32m     square.clj:  321^[[m
                    ^[[37mclojure.lang.RestFn.applyTo^[[m  ^[[32m    RestFn.java:  137^[[m
                             ^[[33mclojure.core/^[[1;33mapply^[[m  ^[[32m       core.clj:  617^[[m
              ^[[33mbcbio.variation.recall.main/^[[1;33m-main^[[m  ^[[32m       main.clj:   33^[[m
                    ^[[37mclojure.lang.RestFn.applyTo^[[m  ^[[32m    RestFn.java:  137^[[m
               ^[[37mbcbio.variation.recall.main.main^[[m  ^[[32m               :     ^[[m

' returned non-zero exit status 1

[2015-02-21T12:26Z] sh-3-3.local: Unexpected error
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
    yield config
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 288, in square_batch_region
    return ipython.zip_args(apply(joint.square_batch_region, *args))
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/joint.py", line 123, in square_batch_region
    _square_batch_bcbio_variation(data, region, bam_files, vrn_files, out_file, "square")
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/joint.py", line 173, in _square_batch_bcbio_variation
    do.run(cmd, "%s in region: %s" % (cmd, bamprep.region_to_gatk(region)))
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run 
    _do_run(cmd, checks, log_stdout)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'bcbio-variation-recall square -Xms250m -Xmx8g -XX:+UseSerialGC -c 16 -r 1:142542311-158052760 --caller freebayes /scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/1/batch1-1_142542310_158052760.vcf.gz /share/PI/euan/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /scratch/PI/    euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/1/batch1-1_142542310_158052760.vcf-inputs.txt
2015-Feb-21 04:26:55 -0800 sh-3-3.local ERROR [bcbio.variation.recall.main] - 
^[[1;31mjava.lang.RuntimeException^[[m: ^[[3mEOF while reading string^[[m
             ^[[37mclojure.lang.Util.runtimeException^[[m  ^[[32m      Util.java:  219^[[m
    ^[[37mclojure.lang.LispReader$StringReader.invoke^[[m  ^[[32mLispReader.java:  461^[[m
      ^[[37mclojure.lang.LispReader.readDelimitedList^[[m  ^[[32mLispReader.java: 1148^[[m
       ^[[37mclojure.lang.LispReader$MapReader.invoke^[[m  ^[[32mLispReader.java: 1091^[[m
                   ^[[37mclojure.lang.LispReader.read^[[m  ^[[32mLispReader.java:  185^[[m
                     ^[[37mclojure.lang.RT.readString^[[m  ^[[32m        RT.java: 1738^[[m
                       ^[[33mclojure.core/^[[1;33mread-string^[[m  ^[[32m       core.clj: 3427^[[m
^[[33mbcbio.variation.recall.square/^[[1;33msample-to-bam-map^[[m  ^[[32m     square.clj:  245^[[m
     ^[[33mbcbio.variation.recall.square/^[[1;33mcombine-vcfs^[[m  ^[[32m     square.clj:  269^[[m
            ^[[33mbcbio.variation.recall.square/^[[1;33m-main^[[m  ^[[32m     square.clj:  321^[[m
                    ^[[37mclojure.lang.RestFn.applyTo^[[m  ^[[32m    RestFn.java:  137^[[m
                             ^[[33mclojure.core/^[[1;33mapply^[[m  ^[[32m       core.clj:  617^[[m
              ^[[33mbcbio.variation.recall.main/^[[1;33m-main^[[m  ^[[32m       main.clj:   33^[[m
                    ^[[37mclojure.lang.RestFn.applyTo^[[m  ^[[32m    RestFn.java:  137^[[m
               ^[[37mbcbio.variation.recall.main.main^[[m  ^[[32m               :     ^[[m


' returned non-zero exit status 1


Individual engine jobs have this as the last line:

2015-02-21 08:37:49.007 [IPEngineApp] WARNING | No heartbeat in the last 60010 ms (1 time(s) in a row).

@dwaggott
Copy link
Author

sacct seems to still indicate the calling script. Let's try 20g.

@dwaggott
Copy link
Author

Didn't run for too long before I got some GATK errors. The difference with this one was that the engine/controller jobs actually died.

log

@dwaggott
Copy link
Author

Reran and ended up with the same

org.broadinstitute.gatk.tools.CatVariants
IPython.parallel.error.CompositeError: one or more exceptions from call to method: concat_variant_files

@chapmanb
Copy link
Member

Daryl;
Sorry about the issues. It looks like one of the files produce by joint calling is truncated or otherwise problematic. Sorry, I'm not sure how that happened, and GATK is not helping with that error message. My best suggestion would be to list through the files in:

/scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/batch1-joint-files.list

and see if anything looks truncated or obviously off. If so, removing it and re-running would probably fix it. If you can provide any more feedback about what is wrong I could try to improve recalling to avoid having the issue in the future. Hope this helps some.

@dwaggott
Copy link
Author

Should have thought of that. bcftools flags a file that ends in a truncated line. Might be worth adding a quick vcf validation step. Not sure how to quickly check the last line in a bgzip file without reading everything.

$ for i in `cat work/joint/freebayes-joint/batch1/batch1-joint-files.list`; do echo $i; bcftools view -H $i | wc -l; done
/scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/7/batch1-7_46554833_62066746.vcf.gz
74210
/scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/7/batch1-7_62067861_77579279.vcf.gz
151063
/scratch/PI/euan/projects/athletes/bcbio/project_n143-fb-joint/work/joint/freebayes-joint/batch1/7/batch1-7_77580407_93090530.vcf.gz
Incorrect number of FORMAT fields at 7:92248076
59639

@chapmanb
Copy link
Member

Daryl;
Thanks much for digging into this. I identified some places in bcbio.variation.recall where we weren't using transcation directories and could end up with truncated files. I hope this solves the problems of your errors and will let you run future runs without problems like this. Hope the re-run after removing the truncated file worked cleanly. Thanks again for all the help debugging this.

@dwaggott dwaggott mentioned this issue Jun 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants