gcnv_contig_ploidy wrapper script crashes because gatk DetermineGermlineContigPloidy output is never cleaned but always fully read #443

Closed
Nicolai-vKuegelgen opened this issue Sep 18, 2023 · 1 comment · Fixed by #468

Comments

@Nicolai-vKuegelgen
Contributor

Nicolai-vKuegelgen commented Sep 18, 2023

Describe the bug
The {mapper}.gcnv_contig_ploidy.wgs module in the sv_calling_wgs step (and probably sv_calling_wes in a similar way) creates a subfolder for each sample that is run: sv_calling_wgs/work/{mapper}.gcnv_contig_ploidy.wgs/out/{mapper}.gcnv_contig_ploidy.wgs/ploidy-calls/SAMPLE_*. However, these folders are never cleaned up or removed, even when the samplesheet is updated and, in particular, when samples are removed from it.
The gcnv_contig_ploidy wrapper script reads all of these SAMPLE_* folders with a glob, so it fails as soon as one of them belongs to a sample that is no longer defined in the samplesheet.

To Reproduce
Steps to reproduce the behavior:

  1. Run the sv_calling_wgs step with any set of samples.
  2. Change the samplesheet so that overall fewer samples are present than in the previous run.
  3. Try to rerun the sv_calling_wgs / {mapper}.gcnv_contig_ploidy.wgs step
  4. See error (sv_calling_wgs/slurm_log/{id}/snakejob.sv_calling_wgs_gcnv_contig_ploidy.{n}.sh-{id}.log) :
File "/data/cephfs-1/work/projects/medgen_genomes/2023-01-23_Limb_Study_Reboot/GRCh37/sv_calling_wgs/.snakemake/scripts/tmp9ot4x6ul.wrapper.py", line 58, in <module>
sample_sex = sex_map[sample_name]
KeyError: '{previous_sample}-N1-DNA1-WGS1' 

Expected behavior
There are several relatively easy options to fix the behaviour of the wrapper script:

  1. Read all SAMPLE_* folders but ignore those whose sample is not defined in the samplesheet.
  2. Only read the first N SAMPLE_* folders (N = number of samples in the samplesheet).
  3. Delete all SAMPLE_* folders at the start of the wrapper script to ensure that the output folder only ever contains the most recent output data.
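
Option 1 could look roughly like the sketch below. It is not the actual wrapper code: the helper name `select_known_sample_dirs` is hypothetical, `sex_map` is taken from the traceback above, and the sketch assumes that each SAMPLE_* folder contains a `sample_name.txt` file holding the sample name.

```python
from pathlib import Path


def select_known_sample_dirs(ploidy_calls_dir, sex_map):
    """Map sample name -> SAMPLE_* folder, skipping samples missing from sex_map.

    Assumption: every SAMPLE_* folder holds a `sample_name.txt` file with the
    sample name. Folders without that file are skipped as well.
    """
    known = {}
    for sample_dir in sorted(Path(ploidy_calls_dir).glob("SAMPLE_*")):
        name_file = sample_dir / "sample_name.txt"
        if not name_file.is_file():
            continue
        sample_name = name_file.read_text().strip()
        # Stale folders from earlier runs name samples that are no longer
        # in the samplesheet; ignore them instead of raising KeyError.
        if sample_name in sex_map:
            known[sample_name] = sample_dir
    return known
```

The wrapper would then iterate over the returned mapping instead of over the raw glob result, so that `sex_map[sample_name]` is only evaluated for samples that exist in the current samplesheet.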

Additional context
It also seems that gatk DetermineGermlineContigPloidy will always overwrite the SAMPLE_N folders starting with N=0 up to the number of samples in any given run.
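
If that observation holds, option 2 from the list above could be sketched as follows. The helper name `first_n_sample_dirs` is hypothetical, and the approach relies entirely on the assumption that the current run always occupies SAMPLE_0 to SAMPLE_{N-1}. The folders are sorted by their numeric suffix, because a plain lexicographic sort would place SAMPLE_10 before SAMPLE_2.

```python
import re
from pathlib import Path


def first_n_sample_dirs(ploidy_calls_dir, n_samples):
    """Return the SAMPLE_0 .. SAMPLE_{n-1} folders in numeric order.

    Assumption: the current run overwrote the folders starting at N=0, so
    everything with a higher index is left over from an earlier run.
    """
    pattern = re.compile(r"SAMPLE_(\d+)$")
    indexed = []
    for sample_dir in Path(ploidy_calls_dir).glob("SAMPLE_*"):
        match = pattern.match(sample_dir.name)
        if match and sample_dir.is_dir():
            indexed.append((int(match.group(1)), sample_dir))
    # Sort on the integer index only, not on the folder name.
    indexed.sort(key=lambda item: item[0])
    return [sample_dir for _, sample_dir in indexed[:n_samples]]
```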

@Nicolai-vKuegelgen
Contributor Author

Nicolai-vKuegelgen commented Sep 18, 2023

Addendum: the subsequent {mapper}.gcnv_call_cnvs.wgs.XXXX_of_YYYY rules will also fail if their output files already exist but were created by a different user (since gatk tries to copy the file ownership with shutil). Potentially the wrapper script here should remove the whole output folder as a first step.
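
Removing the whole output folder as a first step could be sketched like this. The helper name `reset_output_dir` is hypothetical and the sketch is not taken from the wrapper. It also assumes that the user running the step has write permission on the directories involved, since otherwise files created by a different user cannot be deleted either.

```python
import shutil
from pathlib import Path


def reset_output_dir(output_dir):
    """Delete output_dir with all of its contents and recreate it empty."""
    output_dir = Path(output_dir)
    if output_dir.exists():
        # Drops stale SAMPLE_* folders and files left over from earlier runs.
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True)
    return output_dir
```

Calling this before gatk is invoked would cover option 3 above as well, at the cost of discarding any earlier results stored in that folder.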
