
start pipeline from --joint_germline step #755

Open
jrhaas opened this issue Sep 20, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@jrhaas

jrhaas commented Sep 20, 2022

Description of feature

In order to deal with the continuously growing number of GVCF files from HaplotypeCaller, it would be helpful to start the pipeline from the --joint_germline step. This would make it possible to tackle the n+1 problem without having to rerun HaplotypeCaller for all samples.
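For context, GATK itself supports the n+1 pattern: GenomicsDBImport can add new samples to an existing workspace via --genomicsdb-update-workspace-path, after which GenotypeGVCFs is re-run against the workspace. The sketch below only assembles the command lines to illustrate the idea; the file names and workspace paths are hypothetical, and this is not how sarek wires the step internally.

```python
# Hypothetical sketch of the "n+1" restart: add new GVCFs to an existing
# GenomicsDB workspace and re-run joint genotyping, instead of re-running
# HaplotypeCaller for every sample. Paths and names are illustrative only.

def n_plus_one_commands(workspace, new_gvcfs, reference, out_vcf):
    """Return GATK command lines for updating a workspace and re-genotyping."""
    update = (
        ["gatk", "GenomicsDBImport",
         "--genomicsdb-update-workspace-path", workspace]
        + [arg for g in new_gvcfs for arg in ("-V", g)]
    )
    genotype = ["gatk", "GenotypeGVCFs",
                "-R", reference,
                "-V", f"gendb://{workspace}",  # read directly from the workspace
                "-O", out_vcf]
    return update, genotype

update, genotype = n_plus_one_commands(
    "chr1.gdb", ["new_sample.g.vcf.gz"], "genome.fasta", "joint.vcf.gz")
```

Only the new sample's GVCF needs to be imported; the existing samples stay in the workspace, which is exactly the saving this feature request is after.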

@jrhaas jrhaas added the enhancement New feature or request label Sep 20, 2022
@maxulysse maxulysse added this to the 3.1 milestone Sep 26, 2022
@maxulysse maxulysse modified the milestones: 3.1, 3.2 Nov 23, 2022
@FriederikeHanssen FriederikeHanssen self-assigned this Jun 17, 2023
@FriederikeHanssen
Contributor

I started looking into this. I think it would be easily possible if all intervals are processed together in one DB. However, it seems quite a bit more work to set it up for one DB per interval.

@maxulysse maxulysse modified the milestones: 3.2, 3.3 Jun 22, 2023
@amizeranschi
Contributor

+1 for this request, it would be very useful

@FriederikeHanssen
Contributor

@amizeranschi any thoughts on this? The way I see it now, it should be doable when processing all intervals at once, because you end up with one DB. Splitting it up seems really tricky. Would it be feasible for you if we only allowed this when all intervals are processed in one group, or is that too time-consuming to be useful?

@amizeranschi
Contributor

Hi @FriederikeHanssen, thanks a lot for looking into this. It's been a while since I tested, but I remember setting NPS to a huge number at one point, which ended up enforcing a single DB (please correct me if I'm wrong about this). I also remember serious performance issues with that test (very long runtimes, as well as frequent OOM errors and process restarts). I was testing with 10 WGS samples in cattle, which has a genome size similar to human.

With the default NPS value, my test jobs effectively created one DB for each chromosome, which ran pretty well on my infrastructure, although it eventually resulted in the joint_germline.vcf problems reported on Slack (https://nfcore.slack.com/archives/CGFUX04HZ/p1690553832909949), as well as on GitHub (#1137).
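The NPS behaviour described above can be illustrated with a toy grouping function (this is not sarek's actual grouping code, and the cost model and threshold are assumptions): intervals are weighted by an estimated runtime of length / nucleotides_per_second, so a huge NPS makes every interval look cheap and collapses everything into a single group, i.e. a single GenomicsDB.

```python
# Illustrative only -- not sarek's implementation. Greedy grouping of
# intervals by estimated runtime (length / nucleotides_per_second):
# a very large NPS makes every interval "cheap", so all intervals end
# up in one group, which corresponds to one GenomicsDB.

def group_intervals(intervals, nucleotides_per_second, max_seconds=1000):
    groups, current, load = [], [], 0.0
    for name, length in intervals:
        cost = length / nucleotides_per_second
        if current and load + cost > max_seconds:
            groups.append(current)       # close the full group
            current, load = [], 0.0
        current.append(name)
        load += cost
    if current:
        groups.append(current)
    return groups

chroms = [(f"chr{i}", 100_000_000) for i in range(1, 23)]
many = group_intervals(chroms, 1000)     # default-ish NPS: one group per chrom
one = group_intervals(chroms, 10**12)    # huge NPS -> a single group / one DB
```

This also shows why the single-DB variant hits the performance wall amizeranschi describes: all samples over all intervals funnel into one import job.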

From a computational performance POV, having one GenomicsDB for each interval would very much be worth it for the --joint_germline use case, IMO.

@FriederikeHanssen
Contributor

I see. That makes adding the restart feature a lot trickier. I'll try to figure something out

@maxulysse maxulysse modified the milestones: 3.3, 3.4, 3.5 Feb 8, 2024
@wpoehlm

wpoehlm commented Apr 23, 2024

Hi, I just want to add another +1 for this feature request. As new samples are generated and QC identifies samples for removal/filtering, it becomes necessary to re-run joint calling with different groups of samples. Being able to start the pipeline with input GVCF files would remove the need to re-run haplotype calling, which is quite cumbersome and expensive.

@amizeranschi
Contributor

Hi @FriederikeHanssen

Is this feature still on your radar?

@FriederikeHanssen
Contributor

Generally yes, but honestly I have no time right now. I think @maxulysse brought it up as well. But if someone wants to take a stab at it or has some ideas on how to do it, please shout :D.

The biggest issue is that we run GenomicsDBImport split by intervals to make it faster. If we use this as an entry point, we need all the per-interval GenomicsDBs plus the VCFs. I am not sure whether that means the user needs to supply split VCF files as well, which we would then need to match somehow, or whether we can add the complete input VCFs to each GenomicsDB. If that is possible, we might just need a new way of handling a list of GenomicsDBs.
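If the "complete input VCFs to each GenomicsDB" route works, the matching problem becomes trivial, since GenomicsDBImport only reads the records covered by its interval anyway. A sketch of that pairing, with assumed data shapes (this is plain Python, not the pipeline's channel logic):

```python
# Sketch of the matching problem (assumed data shapes, not sarek code):
# if users supply one GenomicsDB workspace per interval plus whole-genome
# GVCFs, every per-interval job can simply receive the full GVCF list.

def pair_dbs_with_gvcfs(dbs_by_interval, gvcfs):
    """dbs_by_interval: {interval: workspace_path} -> list of job tuples."""
    return [(interval, workspace, list(gvcfs))
            for interval, workspace in sorted(dbs_by_interval.items())]

jobs = pair_dbs_with_gvcfs(
    {"chr1": "gdb/chr1", "chr2": "gdb/chr2"},
    ["s1.g.vcf.gz", "s2.g.vcf.gz"])
```

The alternative, matching per-interval split VCFs to per-interval DBs, would need a join key (the interval name) supplied by the user, which is what makes that path harder.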

@cmatKhan
Contributor

cmatKhan commented May 23, 2024

An idea:

Rather than providing split VCFs as input, what about optionally outputting the split GenomicsDB workspaces from this step to the outputDir:

GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)

along with a new file in the csv directory that associates each interval with a GenomicsDB workspace.

This step:

https://github.com/nf-core/sarek/blob/b5b766d3b4ac89864f2fa07441cdc8844e70a79e/subworkflows/local/bam_joint_calling_germline_gatk/main.nf#L39C1-L46C1

can be modified so that, if the genomeDB-interval map input sheet is passed, it maps the appropriate genomeDB file for that interval to the sixth position of the input tuple. Otherwise, it would stay a blank [] as it is now.
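The lookup side of that idea might look like the following (the sheet layout and column names are hypothetical, invented for illustration; sarek does not currently emit such a file):

```python
import csv
import io

# Hypothetical lookup sheet (column names are assumptions): one row per
# interval, pointing to the GenomicsDB workspace emitted by a previous run.
SHEET = """interval,genomicsdb
chr1,gdb/chr1
chr2,gdb/chr2
"""

def genomedb_for(interval, sheet_text=None):
    """Return the workspace path for an interval, or [] when no sheet is
    supplied -- mirroring the 'blank [] as it is now' default above."""
    if sheet_text is None:
        return []
    lookup = {row["interval"]: row["genomicsdb"]
              for row in csv.DictReader(io.StringIO(sheet_text))}
    return lookup.get(interval, [])
```

An interval missing from the sheet also falls back to [], so a partially populated sheet degrades to the current behaviour rather than failing.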

@FriederikeHanssen
Contributor

FriederikeHanssen commented May 24, 2024

Sounds like a good idea. Do you mean changing the current functionality as well, or adding an option to output the split genomeDB + samplesheet?

@cmatKhan
Contributor

I would -- no promises on how quickly I get to it, though.

The current functionality wouldn't change. I think this will entail the following:

  1. an option to output the genomeDB files. That should also output a lookup CSV into the csv output subdir, with the interval in one column and the relative path from the outputDir to the genomeDB file in the other.

  2. A way to provide the lookup CSV to sarek. This will be a bit more involved: it will impact the input channel I linked above, as well as the prepare_intervals subworkflow and the steps that use its output.

But, aside from adding a way to output the splits and then input them in another run, functionality would not be affected, I don't think.
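Step 1 above can be sketched as a small writer (the output directory layout, file names, and relative-path convention are all assumptions, not what sarek currently produces):

```python
import csv
import io
from pathlib import PurePosixPath

# Sketch of step 1 (names and layout assumed): after publishing the
# per-interval workspaces, write a lookup CSV mapping each interval to
# the workspace path *relative to* the output directory, so the sheet
# stays valid if the results folder is moved.

def write_lookup_csv(outdir, workspaces, fh):
    writer = csv.writer(fh)
    writer.writerow(["interval", "genomicsdb"])
    for interval, ws in sorted(workspaces.items()):
        rel = PurePosixPath(ws).relative_to(outdir)
        writer.writerow([interval, str(rel)])

buf = io.StringIO()
write_lookup_csv("results",
                 {"chr1": "results/genomicsdb/chr1",
                  "chr2": "results/genomicsdb/chr2"}, buf)
```

Storing relative paths is the key design choice here: it makes the sheet portable, and the consuming run only needs the outputDir of the producing run to resolve every workspace.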
