
start pipeline from --joint_germline step #755

Open
jrhaas opened this issue Sep 20, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@jrhaas

jrhaas commented Sep 20, 2022

Description of feature

In order to deal with the continuously growing number of GVCF files from HaplotypeCaller, it would be helpful to start the pipeline from the --joint_germline step. This would make it possible to tackle the n+1 problem without having to rerun HaplotypeCaller for all samples.
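For context, GATK itself supports the n+1 pattern: GenomicsDBImport can add new samples to an existing workspace via --genomicsdb-update-workspace-path, after which GenotypeGVCFs is re-run against the workspace. The sketch below only assembles the command lines to illustrate the idea; the file names and workspace paths are hypothetical, and this is not how sarek wires the step internally.

```python
# Hypothetical sketch of the "n+1" restart: add new GVCFs to an existing
# GenomicsDB workspace and re-run joint genotyping, instead of re-running
# HaplotypeCaller for every sample. Paths and names are illustrative only.

def n_plus_one_commands(workspace, new_gvcfs, reference, out_vcf):
    """Return GATK command lines for updating a workspace and re-genotyping."""
    update = (
        ["gatk", "GenomicsDBImport",
         "--genomicsdb-update-workspace-path", workspace]
        + [arg for g in new_gvcfs for arg in ("-V", g)]
    )
    genotype = ["gatk", "GenotypeGVCFs",
                "-R", reference,
                "-V", f"gendb://{workspace}",  # read directly from the workspace
                "-O", out_vcf]
    return update, genotype

update, genotype = n_plus_one_commands(
    "chr1.gdb", ["new_sample.g.vcf.gz"], "genome.fasta", "joint.vcf.gz")
```

Only the new sample's GVCF needs to be imported; the existing samples stay in the workspace, which is exactly the saving this feature request is after.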

@jrhaas jrhaas added the enhancement New feature or request label Sep 20, 2022
@maxulysse maxulysse added this to the 3.1 milestone Sep 26, 2022
@maxulysse maxulysse modified the milestones: 3.1, 3.2 Nov 23, 2022
@FriederikeHanssen FriederikeHanssen self-assigned this Jun 17, 2023
@FriederikeHanssen
Contributor

I started looking into this. I think it would be easily possible if all intervals are processed together in one DB. However, it seems quite a bit more work to set it up for one DB per interval.

@maxulysse maxulysse modified the milestones: 3.2, 3.3 Jun 22, 2023
@amizeranschi
Contributor

+1 for this request, it would be very useful

@FriederikeHanssen
Contributor

@amizeranschi any thoughts on this? The way I see it now, it should be doable when processing all intervals at once, because you end up with one DB. Splitting it up seems really tricky. Would it be feasible for you if we only allowed this when all intervals are processed in one group, or is that too time-consuming to be useful?

@amizeranschi
Contributor

Hi @FriederikeHanssen, thanks a lot for looking into this. It's been a while since I tested, but I remember setting NPS to a huge number at one point, which ended up enforcing a single DB (please correct me if I'm wrong about this). I also remember serious performance issues with that test (very long runtimes, as well as frequent OOM errors and process restarts). I was testing with 10 WGS samples in cattle, which has a genome size similar to human.

With the default NPS value, my test jobs effectively created one DB for each chromosome, which ran pretty well on my infrastructure, although it eventually resulted in the joint_germline.vcf problems reported on Slack (https://nfcore.slack.com/archives/CGFUX04HZ/p1690553832909949), as well as on GitHub (#1137).
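The NPS behaviour described above can be illustrated with a toy grouping function (this is not sarek's actual grouping code, and the cost model and threshold are assumptions): intervals are weighted by an estimated runtime of length / nucleotides_per_second, so a huge NPS makes every interval look cheap and collapses everything into a single group, i.e. a single GenomicsDB.

```python
# Illustrative only -- not sarek's implementation. Greedy grouping of
# intervals by estimated runtime (length / nucleotides_per_second):
# a very large NPS makes every interval "cheap", so all intervals end
# up in one group, which corresponds to one GenomicsDB.

def group_intervals(intervals, nucleotides_per_second, max_seconds=1000):
    groups, current, load = [], [], 0.0
    for name, length in intervals:
        cost = length / nucleotides_per_second
        if current and load + cost > max_seconds:
            groups.append(current)       # close the full group
            current, load = [], 0.0
        current.append(name)
        load += cost
    if current:
        groups.append(current)
    return groups

chroms = [(f"chr{i}", 100_000_000) for i in range(1, 23)]
many = group_intervals(chroms, 1000)     # default-ish NPS: one group per chrom
one = group_intervals(chroms, 10**12)    # huge NPS -> a single group / one DB
```

This also shows why the single-DB variant hits the performance wall amizeranschi describes: all samples over all intervals funnel into one import job.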

From a computational performance POV, having one GenomicsDB for each interval would very much be worth it for the --joint_germline use case, IMO.

@FriederikeHanssen
Contributor

I see. That makes adding the restart feature a lot trickier. I'll try to figure something out

@maxulysse maxulysse modified the milestones: 3.3, 3.4, 3.5 Feb 8, 2024
@wpoehlm

wpoehlm commented Apr 23, 2024

Hi, I just want to add another +1 for this feature request. As new samples are generated and QC identifies samples for removal/filtering, it becomes necessary to re-run joint calling with different groups of samples. Being able to start the pipeline with input GVCF files would remove the need to re-run haplotype calling, which is quite cumbersome and expensive.

@amizeranschi
Contributor

Hi @FriederikeHanssen

Is this feature still on your radar?

@FriederikeHanssen
Contributor

Generally yes, but honestly I have no time right now. I think @maxulysse brought it up as well. But if someone wants to take a stab at it or has some ideas on how to do it, please shout :D.

The biggest issue is that we run GenomicsDBImport split by intervals to make it faster. If we use this as an entry point, we need all the per-interval GenomicsDBs plus the VCFs. I am not sure whether that means the user needs to supply split VCF files as well, which we would then need to match somehow, or whether we can add the complete input VCFs to each GenomicsDB. If that is possible, we might just need a new way of handling a list of GenomicsDBs.
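If the "complete input VCFs to each GenomicsDB" route works, the matching problem becomes trivial, since GenomicsDBImport only reads the records covered by its interval anyway. A sketch of that pairing, with assumed data shapes (this is plain Python, not the pipeline's channel logic):

```python
# Sketch of the matching problem (assumed data shapes, not sarek code):
# if users supply one GenomicsDB workspace per interval plus whole-genome
# GVCFs, every per-interval job can simply receive the full GVCF list.

def pair_dbs_with_gvcfs(dbs_by_interval, gvcfs):
    """dbs_by_interval: {interval: workspace_path} -> list of job tuples."""
    return [(interval, workspace, list(gvcfs))
            for interval, workspace in sorted(dbs_by_interval.items())]

jobs = pair_dbs_with_gvcfs(
    {"chr1": "gdb/chr1", "chr2": "gdb/chr2"},
    ["s1.g.vcf.gz", "s2.g.vcf.gz"])
```

The alternative, matching per-interval split VCFs to per-interval DBs, would need a join key (the interval name) supplied by the user, which is what makes that path harder.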

@cmatKhan
Contributor

cmatKhan commented May 23, 2024

An idea:

Rather than providing split VCFs as input, what about optionally outputting the split GenomicsDB workspaces from this step to the outputDir:

GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)

along with a new file in the csv directory that associates each interval with a GenomicsDB workspace.

This step:

https://github.com/nf-core/sarek/blob/b5b766d3b4ac89864f2fa07441cdc8844e70a79e/subworkflows/local/bam_joint_calling_germline_gatk/main.nf#L39C1-L46C1

can be modified so that, if the genomeDB-interval map input sheet is passed, it maps the appropriate genomeDB file for that interval to the sixth position of the input tuple. Otherwise, it would stay a blank [] as it is now.
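The lookup side of that idea might look like the following (the sheet layout and column names are hypothetical, invented for illustration; sarek does not currently emit such a file):

```python
import csv
import io

# Hypothetical lookup sheet (column names are assumptions): one row per
# interval, pointing to the GenomicsDB workspace emitted by a previous run.
SHEET = """interval,genomicsdb
chr1,gdb/chr1
chr2,gdb/chr2
"""

def genomedb_for(interval, sheet_text=None):
    """Return the workspace path for an interval, or [] when no sheet is
    supplied -- mirroring the 'blank [] as it is now' default above."""
    if sheet_text is None:
        return []
    lookup = {row["interval"]: row["genomicsdb"]
              for row in csv.DictReader(io.StringIO(sheet_text))}
    return lookup.get(interval, [])
```

An interval missing from the sheet also falls back to [], so a partially populated sheet degrades to the current behaviour rather than failing.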

@FriederikeHanssen
Contributor

FriederikeHanssen commented May 24, 2024

Sounds like a good idea. Do you mean changing the current functionality as well, or adding an option to output the split genomeDB + samplesheet?

@cmatKhan
Contributor

I would -- no promises on how quickly I get to it, though.

The current functionality wouldn't change. I think this will entail the following:

  1. an option to output the genomeDB files. That should also output a lookup CSV into the csv output subdir, with the interval in one column and the relative path from the outputDir to the genomeDB file in the other.

  2. A way to provide the lookup CSV to sarek. This will be a bit more involved: it will impact the input channel I linked above, as well as the prepare_intervals subworkflow and the steps that use its output.

But, aside from adding a way to output the splits and then input them in another run, functionality would not be affected, I don't think.
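Step 1 above can be sketched as a small writer (the output directory layout, file names, and relative-path convention are all assumptions, not what sarek currently produces):

```python
import csv
import io
from pathlib import PurePosixPath

# Sketch of step 1 (names and layout assumed): after publishing the
# per-interval workspaces, write a lookup CSV mapping each interval to
# the workspace path *relative to* the output directory, so the sheet
# stays valid if the results folder is moved.

def write_lookup_csv(outdir, workspaces, fh):
    writer = csv.writer(fh)
    writer.writerow(["interval", "genomicsdb"])
    for interval, ws in sorted(workspaces.items()):
        rel = PurePosixPath(ws).relative_to(outdir)
        writer.writerow([interval, str(rel)])

buf = io.StringIO()
write_lookup_csv("results",
                 {"chr1": "results/genomicsdb/chr1",
                  "chr2": "results/genomicsdb/chr2"}, buf)
```

Storing relative paths is the key design choice here: it makes the sheet portable, and the consuming run only needs the outputDir of the producing run to resolve every workspace.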
