Error with GATK4_GENOMICSDBIMPORT during joint calling: Badly formed genome unclippedLoc: Query interval "[]" is not valid for this input #1035

Closed
amizeranschi opened this issue May 24, 2023 · 22 comments · Fixed by #1061
Labels
bug Something isn't working


@amizeranschi
Contributor

Description of the bug

Hello,

I'm getting the error from the title while running joint calling with GATK. I'm attaching the full log below.

I am running nf-core/sarek -r cbdaaa0. I can provide my script if needed, for reproducibility.

Command used and terminal output

No response

Relevant files

nextflow-error.log.txt

System information

No response

@amizeranschi amizeranschi added the bug Something isn't working label May 24, 2023
@asp8200
Contributor

asp8200 commented May 29, 2023

What is the status on this, @amizeranschi ?

When I try to run your script sarek-joint-VC.sh, I get:

-[nf-core/sarek] Pipeline completed with errors-
WARN: Input tuple does not match input set cardinality declared by process `NFCORE_SAREK:sarek:BAM_MERGE_INDEX_SAMTOOLS:MERGE_BAM` -- offending value: []
ERROR ~ Error executing process > 'NFCORE_SAREK:sarek:BAM_MERGE_INDEX_SAMTOOLS:MERGE_BAM (1)'

Caused by:
 Not a valid path value type: org.codehaus.groovy.runtime.NullObject (null)

@amizeranschi
Contributor Author

Thanks a lot @asp8200 for looking into this. I tested again now (the script is configured to run with -r dev -latest) and I got the same error.

@FriederikeHanssen
Contributor

Hey! So with the test profile I cannot reproduce the --save_mapped error:

nextflow run main.nf -profile test,targeted,docker --input ./tests/csv/3.0/fastq_pair.csv --tools haplotypecaller --joint_germline --outdir results --save_mapped -resume

I manipulated the fastq_pair.csv to this:

patient,sex,status,sample,lane,fastq_1,fastq_2
test,XX,0,test,test_L1,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_2.fastq.gz
test2,XX,0,test2,test_L1,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test2_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test2_2.fastq.gz

My local copy is on revision: a5122a0

@FriederikeHanssen
Contributor

I also can't reproduce it with --save_mapped --save_output_as_bam; it works for both just fine. :/ So I'm not sure what to do here. @asp8200, can you test this again with the latest dev?

Off to issue number 2.

@asp8200
Contributor

asp8200 commented Jun 2, 2023

I also can't reproduce it with --save_mapped --save_output_as_bam; it works for both just fine. :/ So I'm not sure what to do here. @asp8200, can you test this again with the latest dev?

Off to issue number 2.

Yes, I'll try to see if I can reproduce the error.

@FriederikeHanssen
Contributor

Finally reproduced the save_mapped error

@FriederikeHanssen
Contributor

Fixed the NullObject issue. If you want to test it, you can run with FriederikeHanssen/sarek -r issue_1035. Now I can look into the actual problem.

@asp8200
Contributor

asp8200 commented Jun 2, 2023

Fixed the NullObject issue. If you want to test it, you can run with FriederikeHanssen/sarek -r issue_1035. Now I can look into the actual problem.

I can confirm that BAM_MERGE_INDEX_SAMTOOLS:MERGE_BAM works on FriederikeHanssen/sarek -r issue_1035. I'm now running a reduced version of Alexandru's script and it successfully passed the step BAM_MERGE_INDEX_SAMTOOLS:MERGE_BAM.

@FriederikeHanssen FriederikeHanssen linked a pull request Jun 2, 2023 that will close this issue
9 tasks
@FriederikeHanssen
Contributor

Fixed, if you'd like to test it

@asp8200
Contributor

asp8200 commented Jun 2, 2023

The MERGE_BAM error is solved on FriederikeHanssen/sarek -r issue_1035, and the pipeline now continues to the GenomicsDBImport error reported in the description of this GitHub issue:

-[nf-core/sarek] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_GENOMICSDBIMPORT (joint_variant_calling)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_GENOMICSDBIMPORT (joint_variant_calling)` terminated with an error exit status (2)

Command executed:

  gatk --java-options "-Xmx24576M" GenomicsDBImport \
      --variant SRR9041541.haplotypecaller.g.vcf.gz --variant SRR9041540.haplotypecaller.g.vcf.gz \
      --genomicsdb-workspace-path [].joint \
      --intervals [] \
      --tmp-dir . \
      --genomicsdb-shared-posixfs-optimizations true --bypass-feature-reader

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_GENOMICSDBIMPORT":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  2

Command output:
  (empty)

Command error:
  Using GATK jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx24576M -jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar GenomicsDBImport --variant SRR9041541.haplotypecaller.g.vcf.gz --variant SRR9041540.haplotypecaller.g.vcf.gz --genomicsdb-workspace-path [].joint --intervals [] --tmp-dir . --genomicsdb-shared-posixfs-optimizations true --bypass-feature-reader
  16:11:27.508 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
  16:11:27.536 INFO  GenomicsDBImport - ------------------------------------------------------------
  16:11:27.538 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.4.0.0
  16:11:27.539 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
  16:11:27.539 INFO  GenomicsDBImport - Executing as ubuntu@fcd4a8ea21cc on Linux v5.15.0-1026-aws amd64
  16:11:27.539 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
  16:11:27.539 INFO  GenomicsDBImport - Start Date/Time: June 2, 2023 at 4:11:27 PM GMT
  16:11:27.539 INFO  GenomicsDBImport - ------------------------------------------------------------
  16:11:27.539 INFO  GenomicsDBImport - ------------------------------------------------------------
  16:11:27.540 INFO  GenomicsDBImport - HTSJDK Version: 3.0.5
  16:11:27.540 INFO  GenomicsDBImport - Picard Version: 3.0.0
  16:11:27.540 INFO  GenomicsDBImport - Built for Spark Version: 3.3.1
  16:11:27.540 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
  16:11:27.540 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
  16:11:27.541 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
  16:11:27.541 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
  16:11:27.541 INFO  GenomicsDBImport - Deflater: IntelDeflater
  16:11:27.541 INFO  GenomicsDBImport - Inflater: IntelInflater
  16:11:27.541 INFO  GenomicsDBImport - GCS max retries/reopens: 20
  16:11:27.541 INFO  GenomicsDBImport - Requester pays: disabled
  16:11:27.542 INFO  GenomicsDBImport - Initializing engine
  16:11:27.654 INFO  GenomicsDBImport - Shutting down engine
  [June 2, 2023 at 4:11:27 PM GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.00 minutes.
  Runtime.totalMemory()=285212672
  ***********************************************************************

  A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "[]" is not valid for this input.

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

The intervals went missing, i.e. --intervals [] :-/

The pipeline was given this cmd:

nextflow run FriederikeHanssen/sarek -r issue_1035 \
-profile docker \
-resume \
--input sample-sheet.csv \
--outdir results \
--max_time '10.h' \
--max_cpus '16' \
--max_memory '30.GB' \
--fasta ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae.R64-1-1-chr.fa \
--fasta_fai ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae.R64-1-1-chr.fa.fai \
--dict ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae.R64-1-1-chr.dict \
--save_mapped \
--vep_cache ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae \
--bwa ${main_dir}/reproducible-sarek-error/bwa-index \
--igenomes_ignore \
--genome null \
--joint_germline \
--tools haplotypecaller \
--dbsnp ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae.vcf.gz \
--dbsnp_tbi ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae.vcf.gz.tbi \
--dbsnp_vqsr '--resource:ensemblvcf,known=false,training=true,truth=true,prior=10.0 Saccharomyces_cerevisiae.vcf.gz' \
--known_indels ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae_indels.vcf.gz \
--known_indels_tbi ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae_indels.vcf.gz.tbi \
--known_indels_vqsr '--resource:ensemblsnps,known=false,training=true,truth=true,prior=10.0 Saccharomyces_cerevisiae_indels.vcf.gz' \
--known_snps ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae_snps.vcf.gz \
--known_snps_tbi ${main_dir}/reproducible-sarek-error/Saccharomyces_cerevisiae_snps.vcf.gz.tbi \
--known_snps_vqsr '--resource:ensemblindels,known=false,training=true,truth=true,prior=10.0 Saccharomyces_cerevisiae_snps.vcf.gz'
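The `--intervals []` and `--genomicsdb-workspace-path [].joint` in the failing command above look like an empty list that was rendered straight into the shell script (an empty Groovy list prints as `[]`). As a minimal sketch of how such a value could be caught before GATK sees it, the guard below rejects interval arguments that are empty or look like a stringified empty value. The function name `check_interval` is hypothetical and is not part of sarek or GATK; the actual fix for this issue is in the linked pull request.

```shell
# Hypothetical pre-flight guard (not part of sarek or GATK):
# reject interval arguments that are empty or look like a
# stringified empty list / null value.
check_interval() {
    local interval="$1"
    case "$interval" in
        ""|"[]"|"null")
            echo "invalid interval: '${interval}'" >&2
            return 1
            ;;
    esac
    return 0
}
```

It could be used as `check_interval "$interval" || exit 1` before the `gatk ... GenomicsDBImport` call, so that the task fails with a message pointing at the missing intervals rather than with GATK's "Badly formed genome unclippedLoc" error.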

@FriederikeHanssen
Contributor

Which commit were you on?

@FriederikeHanssen
Contributor

I am getting this now

A USER ERROR has occurred: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

but that seems more of a data issue

@asp8200
Contributor

asp8200 commented Jun 2, 2023

Which commit were you on?

Jun-02 15:30:43.369 [main] INFO nextflow.cli.CmdRun - Launching https://github.com/FriederikeHanssen/sarek [romantic_albattani] DSL2 - revision: 5f73731 [issue_1035]
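Since the question of which commit was actually run came up here, this is a small sketch for pulling the revision out of a Nextflow log. It assumes the log contains a "Launching ... revision: <hash>" line in the format quoted above; the function name `launched_revision` is made up for illustration.

```shell
# Print the short commit hash from "Launching ... revision: <hash>" lines
# in a Nextflow log file (log format assumed from the line quoted above).
launched_revision() {
    sed -n 's/.*Launching .* revision: \([0-9a-f]\{1,\}\).*/\1/p' "$1"
}
```

For example, `launched_revision .nextflow.log` would print the hash to compare against the commit that contains the fix.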

@FriederikeHanssen
Contributor

The issue was only fixed a few commits later, in 04a563c.

@asp8200
Contributor

asp8200 commented Jun 2, 2023

5f73731

Ok. I'll relaunch the test.

@FriederikeHanssen
Contributor

See #1061.

@asp8200
Contributor

asp8200 commented Jun 2, 2023

I am getting this now

A USER ERROR has occurred: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

but that seems more of a data issue

I now also get "Bad input: Values for QD annotation not detected for ANY training variant in the input callset". And I agree that it is more of a data issue.

@amizeranschi
Contributor Author

Hey @FriederikeHanssen and @asp8200

Thank you very much for addressing this issue. As it turns out, the yeast VCF file with known variants from Ensembl doesn't contain any indels, so it isn't suitable for VQSR. This is the reason for the QD annotation error you saw.
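For anyone who wants to check a known-variants file for indels before using it as a VQSR resource, here is a rough sketch. It is only a heuristic, not a GATK or sarek utility: it counts records where at least one ALT allele has a different length than REF, skipping symbolic alleles and `*`. The function name `count_indels` is hypothetical.

```shell
# Rough heuristic: count VCF records that have at least one ALT allele
# whose length differs from REF. Symbolic alleles (<...>), the
# spanning-deletion allele (*), and missing ALT (.) are ignored.
count_indels() {
    local vcf="$1"
    gzip -dc "$vcf" | awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($5, alts, ",")       # ALT may list several alleles
            for (i = 1; i <= n; i++) {
                a = alts[i]
                if (a ~ /^</ || a == "*" || a == ".") continue
                if (length(a) != length($4)) { count++; break }
            }
        }
        END { print count + 0 }
    '
}
```

Running `count_indels Saccharomyces_cerevisiae.vcf.gz` and getting `0` would be consistent with the file having no indels to train on.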

I tested the same code over the weekend on Bos taurus data and things went pretty much all right, including VQSR. This is the command I ran:

nextflow run nf-core/sarek -r dev -latest \
--skip_tools haplotypecaller_filter \
-with-dag flowchart.png \
-profile docker \
--input sample-sheet.csv \
--outdir sarek-test-jointVC \
--max_time '336.h' \
--max_cpus '64' \
--max_memory '256.GB' \
--fasta ${main_dir}/reproducible-sarek-error/bos_taurus.ARS-UCD1.2-chr.fa \
--fasta_fai ${main_dir}/reproducible-sarek-error/bos_taurus.ARS-UCD1.2-chr.fa.fai \
--dict ${main_dir}/reproducible-sarek-error/bos_taurus.ARS-UCD1.2-chr.dict \
--save_mapped \
--save_output_as_bam \
--vep_cache ${main_dir}/reproducible-sarek-error/bos_taurus \
--bwa ${main_dir}/reproducible-sarek-error/bwa-index \
--igenomes_ignore \
--genome null \
--joint_germline \
--tools haplotypecaller,vep \
--dbsnp ${main_dir}/reproducible-sarek-error/bos_taurus.vcf.gz \
--dbsnp_tbi ${main_dir}/reproducible-sarek-error/bos_taurus.vcf.gz.tbi \
--dbsnp_vqsr '--resource:ensemblvcf,known=false,training=true,truth=true,prior=10.0 bos_taurus.vcf.gz' \
--known_indels ${main_dir}/reproducible-sarek-error/bos_taurus_indels.vcf.gz \
--known_indels_tbi ${main_dir}/reproducible-sarek-error/bos_taurus_indels.vcf.gz.tbi \
--known_indels_vqsr '--resource:ensemblindels,known=false,training=true,truth=true,prior=10.0 bos_taurus_indels.vcf.gz' \
--known_snps ${main_dir}/reproducible-sarek-error/bos_taurus_snps.vcf.gz \
--known_snps_tbi ${main_dir}/reproducible-sarek-error/bos_taurus_snps.vcf.gz.tbi \
--known_snps_vqsr '--resource:ensemblsnps,known=false,training=true,truth=true,prior=10.0 bos_taurus_snps.vcf.gz' \
--vep_out_format tab \
--vep_include_fasta

And below are the last lines of the output. There was an error right at the end (after MultiQC), but the pipeline appears to have finished successfully in spite of it. One thing that puzzles me, however, is why Ensembl VEP ended up being skipped, even though I had it enabled in my command above. Is this by design, for joint calling runs?

[73/fada7f] process > NFCORE_SAREK:SAREK:CRAM_TO_BAM (ERR1746309)                                                                                                  [100%] 2 of 2 ✔
[06/9b459a] process > NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR (ERR1746309)                                                                  [100%] 34 of 34 ✔
[3a/a400ba] process > NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_GATHERBQSRREPORTS (ERR1746309)                                                                 [100%] 2 of 2 ✔
[c3/f64d50] process > NFCORE_SAREK:SAREK:BAM_APPLYBQSR:GATK4_APPLYBQSR (ERR1746309)                                                                                [100%] 34 of 34 ✔
[78/8e337c] process > NFCORE_SAREK:SAREK:BAM_APPLYBQSR:CRAM_MERGE_INDEX_SAMTOOLS:MERGE_CRAM (ERR1746309)                                                           [100%] 2 of 2 ✔
[2a/0dda71] process > NFCORE_SAREK:SAREK:BAM_APPLYBQSR:CRAM_MERGE_INDEX_SAMTOOLS:INDEX_CRAM (ERR1746309)                                                           [100%] 2 of 2 ✔
[36/a39db9] process > NFCORE_SAREK:SAREK:CRAM_QC_RECAL:SAMTOOLS_STATS (ERR1746309)                                                                                 [100%] 2 of 2 ✔
[23/d2b71e] process > NFCORE_SAREK:SAREK:CRAM_QC_RECAL:MOSDEPTH (ERR1746309)                                                                                       [100%] 2 of 2 ✔
[ce/88972c] process > NFCORE_SAREK:SAREK:CRAM_TO_BAM_RECAL (ERR1746309)                                                                                            [100%] 2 of 2 ✔
[7f/3b9386] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_HAPLOTYPECALLER:GATK4_HAPLOTYPECALLER (ERR1746309)                   [100%] 34 of 34 ✔
[70/8a2e85] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_HAPLOTYPECALLER:MERGE_HAPLOTYPECALLER (ERR1746309)                   [100%] 2 of 2 ✔
[-        ] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_HAPLOTYPECALLER:BAM_MERGE_INDEX_SAMTOOLS:MERGE_BAM                   -
[-        ] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_HAPLOTYPECALLER:BAM_MERGE_INDEX_SAMTOOLS:INDEX_MERGE_BAM             -
[ba/9adcba] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_GENOMICSDBIMPORT (joint_variant_calling)           [100%] 17 of 17 ✔
[e2/eb80d4] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_GENOTYPEGVCFS (joint_variant_calling)              [100%] 17 of 17 ✔
[83/f1cfaa] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:BCFTOOLS_SORT (joint_variant_calling)                    [100%] 17 of 17 ✔
[7c/60a949] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:MERGE_GENOTYPEGVCFS (joint_variant_calling)              [100%] 1 of 1 ✔
[1c/328523] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:VARIANTRECALIBRATOR_INDEL (joint_variant_calling)        [100%] 1 of 1 ✔
[91/627e76] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:VARIANTRECALIBRATOR_SNP (joint_variant_calling)          [100%] 1 of 1 ✔
[3e/4ad249] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_APPLYVQSR_SNP (recalibrated_joint_variant_calling) [100%] 1 of 1 ✔
[0d/d85e96] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:GATK4_APPLYVQSR_INDEL (recalibrated_joint_variant_cal... [100%] 1 of 1 ✔
[5c/0b025e] process > NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_JOINT_CALLING_GERMLINE_GATK:MERGE_VQSR (recalibrated_joint_variant_calling)          [100%] 1 of 1 ✔
[61/2b8b53] process > NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:BCFTOOLS_STATS (recalibrated_joint_variant_calling)                                              [100%] 2 of 2 ✔
[be/7f4e3d] process > NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT (recalibrated_joint_variant_calling)                                         [100%] 2 of 2 ✔
[8c/b5b9e4] process > NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_QUAL (recalibrated_joint_variant_calling)                                          [100%] 2 of 2 ✔
[78/a7aab3] process > NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_SUMMARY (recalibrated_joint_variant_calling)                                            [100%] 2 of 2 ✔
[-        ] process > NFCORE_SAREK:SAREK:VCF_ANNOTATE_ALL:VCF_ANNOTATE_ENSEMBLVEP:ENSEMBLVEP_VEP                                                                   -
[-        ] process > NFCORE_SAREK:SAREK:VCF_ANNOTATE_ALL:VCF_ANNOTATE_ENSEMBLVEP:TABIX_TABIX                                                                      -
[b7/9bd1bf] process > NFCORE_SAREK:SAREK:CUSTOM_DUMPSOFTWAREVERSIONS (1)                                                                                           [100%] 1 of 1 ✔
[4a/ddbce7] process > NFCORE_SAREK:SAREK:MULTIQC                                                                                                                   [100%] 1 of 1 ✔
ERROR ~ Failed to invoke `workflow.onComplete` event handler

 -- Check script '/home/ubuntu/.nextflow/assets/nf-core/sarek/./workflows/sarek.nf' at line: 1092 or see '.nextflow.log' file for more details

Completed at: 04-Jun-2023 18:46:09
Duration    : 1d 32m 28s
CPU hours   : 877.7
Succeeded   : 245

@FriederikeHanssen
Contributor

Great news, thanks for testing. Can you send the log?

@amizeranschi
Contributor Author

Sure, here's the full log. I couldn't find any errors or warnings inside related to Ensembl VEP, so no clue what happened to it.

nextflow-error.log.txt

@amizeranschi
Contributor Author

For the record, VEP getting skipped was a completely separate issue (unrelated to joint calling). I've opened a new bug report here: #1084.
