start pipeline from --joint_germline step #755
Comments
I started looking into this. I think it would be easy to do if all intervals are processed together in one DB. However, setting it up for one DB per interval seems like quite a bit more work.
+1 for this request, it would be very useful
@amizeranschi any thoughts on this: the way I see it now, it should be doable when processing all intervals at once, because you end up with one DB. Splitting it up seems really tricky. Would it be feasible for you to only allow this when all intervals are processed in one group, or is that too time-consuming to be useful?
Hi @FriederikeHanssen, thanks a lot for looking into this. It's been a while since I tested, but I remember setting NPS to a huge number at one point, which ended up enforcing a single DB (please correct me if I'm wrong about this). I also remember having some serious performance issues with that test (very long runtimes, as well as frequent OOM errors and process restarts). I was testing with 10 WGS samples in cattle, which has a genome size similar to human. With the default NPS value, my test jobs were effectively setting up one DB for each chromosome, which ran pretty well on my infrastructure, although it eventually ran into problems as well.

From a computational performance POV, having one GenomicsDB for each interval would very much be worth it.
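For reference, here is a minimal sketch of what a per-interval import looks like at the GATK level. The process name, channel shape, and paths are illustrative assumptions, not sarek's actual module:

```nextflow
// Minimal sketch (not sarek's actual module): one GenomicsDBImport call per interval.
// The process name, channel shape and output paths are assumptions for illustration only.
process GENOMICSDBIMPORT_PER_INTERVAL {
    tag "${interval}"

    input:
    tuple val(interval), path(gvcfs), path(tbis)   // gvcfs is assumed to be a list of g.vcf.gz files

    output:
    tuple val(interval), path("genomicsdb_${interval}")

    script:
    def vcf_args = gvcfs.collect { "-V ${it}" }.join(' ')
    """
    gatk GenomicsDBImport \\
        ${vcf_args} \\
        -L ${interval} \\
        --genomicsdb-workspace-path genomicsdb_${interval}
    """
}
```

For the n+1 case itself, GATK can also add samples to an existing workspace with `--genomicsdb-update-workspace-path` (intervals are then taken from the existing workspace rather than passed with `-L`), which is what would make restarting from saved per-interval workspaces attractive.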
I see. That makes adding the restart feature a lot trickier. I'll try to figure something out.
Hi, I just want to add another +1 for this feature request. As new samples are generated and QC identifies samples for removal/filtering, it becomes necessary to re-run joint calling with different groups of samples. Being able to start the pipeline with input gVCF files would remove the need to re-run haplotype calling, which is quite cumbersome and expensive.
Is this feature still on your radar?
Generally yes, but honestly no time right now. I think @maxulysse brought it up as well. But if someone wants to take a stab at it or has some ideas on how to do it, please shout :D. The biggest issue is that we run GenomicsDBImport split by intervals to make it faster. If we have this as an entry point, we need all the per-interval GenomicsDBs plus the VCFs. I am not sure if that means the user needs to supply split VCF files as well and we need to match them somehow, or if we can add the complete input VCFs to each GenomicsDB. If that is possible, we might just need a new way of handling a list of GenomicsDBs.
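One possible answer to the matching question, sketched as channels. The samplesheet columns and parameter names below are assumptions rather than existing sarek inputs. Since GenomicsDBImport only imports records overlapping the workspace's intervals, the complete gVCFs could in principle be combined with every per-interval workspace, so users would not need to supply split VCFs:

```nextflow
// Sketch of handling a list of per-interval GenomicsDB workspaces as a channel.
// params.genomicsdb_map and its columns (interval, genomicsdb) are hypothetical.
existing_dbs = Channel
    .fromPath(params.genomicsdb_map)                       // CSV: interval,genomicsdb
    .splitCsv(header: true)
    .map { row -> [ row.interval, file(row.genomicsdb) ] }

// The complete gVCFs as one list, to be paired with every interval's workspace.
new_gvcfs = Channel
    .fromPath(params.input_gvcfs)
    .collect()
    .map { gvcfs -> [ gvcfs ] }

per_interval_input = existing_dbs.combine(new_gvcfs)
// emits: [ interval, genomicsdb_workspace, [ gvcf1, gvcf2, ... ] ] once per interval
```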
An idea: Rather than providing the split VCFs as input, what about optionally outputting the per-interval GenomicsDB workspaces, along with a new file mapping each interval to its workspace? The relevant step could then be modified so that, if the GenomicsDB interval map input sheet were passed, it would map the appropriate GenomicsDB for that interval to the sixth position of the input tuple. Otherwise, it would just be blank.
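A rough sketch of that "sixth position" wiring. The tuple layout, channel names, and params.genomicsdb_map are placeholders, not sarek's actual code:

```nextflow
// If a GenomicsDB interval map sheet is passed, join it onto the per-interval input
// and put the matching workspace in the sixth tuple position; otherwise leave it blank.
genomicsdb_map = params.genomicsdb_map ?
    Channel.fromPath(params.genomicsdb_map)
           .splitCsv(header: true)
           .map { row -> [ row.interval, file(row.genomicsdb) ] } :
    Channel.empty()

// gendb_input is assumed to carry [ meta, gvcfs, tbis, intervals, dict ],
// with the interval name available in meta.
joint_input = gendb_input
    .map { meta, gvcfs, tbis, intervals, dict -> [ meta.interval_name, meta, gvcfs, tbis, intervals, dict ] }
    .join(genomicsdb_map, remainder: true)
    .map { interval_name, meta, gvcfs, tbis, intervals, dict, existing_db ->
        [ meta, gvcfs, tbis, intervals, dict, existing_db ?: [] ]   // blank sixth slot when no workspace
    }
```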
Sounds like a good idea. Do you mean also changing the current functionality, or adding an option to output the split GenomicsDBs + samplesheet?
I would -- no promises on how quickly I'll get to it, though. The current functionality wouldn't change. I think this will entail the following:

But aside from adding a way to output the splits and then input them in another run, I don't think functionality would be affected.
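A sketch of those two pieces, i.e. publishing the per-interval workspaces and writing a map file a later run could consume. The parameter name save_genomicsdb, the output paths, and the CSV columns are made up for illustration:

```nextflow
// 1) In the GenomicsDBImport module, publish each per-interval workspace on request, e.g.:
//      publishDir "${params.outdir}/genomicsdb", mode: 'copy', enabled: params.save_genomicsdb

// 2) In the subworkflow, collect one CSV row per interval into a samplesheet
//    that a later run could take as its genomicsdb map input.
genomicsdb_out                                             // assumed shape: [ interval, workspace_dir ]
    .map { interval, db -> "${interval},${params.outdir}/genomicsdb/${db.name}\n" }
    .collectFile(
        name:     'genomicsdb_map.csv',
        storeDir: "${params.outdir}/genomicsdb",
        seed:     'interval,genomicsdb\n',
        sort:     true
    )
```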
Description of feature
In order to deal with a continuously growing number of gVCF files from HaplotypeCaller, it would be helpful to be able to start the pipeline from the --joint_germline step. This would make it possible to tackle the n+1 problem without having to rerun HaplotypeCaller for all samples.
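A purely hypothetical sketch of what restarting from gVCFs could look like from the user's side. Neither the gVCF samplesheet columns shown here nor a joint_germline value for --step exist in sarek today; both illustrate the request rather than working options.

```csv
patient,sample,gvcf,tbi
P1,S1,/data/gvcf/S1.g.vcf.gz,/data/gvcf/S1.g.vcf.gz.tbi
P2,S2,/data/gvcf/S2.g.vcf.gz,/data/gvcf/S2.g.vcf.gz.tbi
```

```bash
# Hypothetical invocation: the joint_germline step value is the requested feature, not an existing one.
nextflow run nf-core/sarek \
    --input gvcf_samplesheet.csv \
    --step joint_germline \
    --tools haplotypecaller \
    --joint_germline \
    --outdir results
```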