Releases: broadinstitute/gatk

4.2.3.0

02 Nov 22:08
f31f019

Download release: gatk-4.2.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.3.0 release:

  • Notable bug fixes for Mutect2 and Funcotator

  • Support in CombineGVCFs and GenotypeGVCFs for "reblocked" GVCFs as produced by the ReblockGVCF tool. Reblocked GVCFs have a significantly reduced storage footprint.

  • More control over the Smith-Waterman parameters in HaplotypeCaller and Mutect2

  • A new Fragment Allele Depth (FAD) variant annotation similar to the AD annotation except that allele support is considered per read pair, not per individual read

  • GenomicsDB bug fixes and enhancements

Full list of changes:

  • HaplotypeCaller/Mutect2

    • Fixed a bug where Mutect2 failed to filter germline variants with alternate representations (#7103)
      • Variants with alternative representations in gnomAD were sometimes not recognized as matching the called variants, so calls that should have been filtered as "germline" passed unfiltered.
    • Exposed Smith-Waterman parameters as tool arguments in HaplotypeCaller, Mutect2, and FilterAlignmentArtifacts. (#6885)
      • Enables use of alternative parameters for different event representation (e.g. three consecutive SNPs instead of two small indels)
    • Can now specify the Smith-Waterman implementation in FilterAlignmentArtifacts (#7105)
    • Added a --debug-assembly-variants-out diagnostic option to output a side VCF with variants detected by assembly for HaplotypeCaller and Mutect2 (#7384)
    • Mutect2: the --genotype-germline-sites argument is no longer marked as experimental (#7533)
  • GenotypeGVCFs / CombineGVCFs

    • Updated CombineGVCFs and GenotypeGVCFs to handle "reblocked" GVCFs with diploid data that are potentially missing hom-ref genotype PLs (#7223)
    • Homozygous reference genotypes with no PLs and zero depth are now output as no-calls by GenotypeGVCFs (#7471)
    • Bug fixes for GenotypeGVCFs/GnarlyGenotyper when allele-specific annotations have empty values due to lack of informative reads or no depth (#7491) (#7186)
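
The no-call rule described for #7471 can be sketched as follows. This is a minimal illustration, not GATK's code: the function names and the tuple representation of a diploid genotype are hypothetical.

```python
from typing import Optional, Sequence, Tuple

NO_CALL = (None, None)  # rendered as ./. in a VCF


def emit_genotype(gt: Tuple[int, int], pls: Optional[Sequence[int]], depth: int):
    """Return the genotype to emit for a diploid sample (simplified rule)."""
    is_hom_ref = gt == (0, 0)
    # A hom-ref call with no PLs and zero depth carries no evidence: emit a no-call.
    if is_hom_ref and not pls and depth == 0:
        return NO_CALL
    return gt


def format_gt(gt) -> str:
    """Format a genotype tuple as an unphased VCF GT string."""
    return "/".join("." if allele is None else str(allele) for allele in gt)


print(format_gt(emit_genotype((0, 0), None, 0)))
print(format_gt(emit_genotype((0, 0), None, 12)))
```

Hom-ref genotypes that still have depth, and any non-reference genotype, pass through unchanged in this sketch.
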
  • GenomicsDB

    • Added a new --call-genotypes GenomicsDB argument, enabling output of called genotypes (i.e. not ./.) when tools like CombineGVCFs and SelectVariants read from a GenomicsDB workspace (#7223)
    • Added a --bypass-feature-reader argument to GenomicsDBImport to allow the C-based htslib VCF reader implementation to be used instead of the Java implementation (#7393)
      • Using this option will reduce memory usage and potentially speed up the import process
    • Updated to GenomicsDB 1.4.2 (#7520)
  • Funcotator

    • Fixed a StringIndexOutOfBoundsException in the protein change prediction code that could be triggered by certain indels. The fix avoids the crash by adding additional bounds checking. (#7513)
    • Allow FilterFuncotations to process multi-transcript genes (#7506)
  • CNV Calling

    • CNV WDLs now handle BAM/CRAM index paths explicitly, to support cases where the index is not in the same location as its file (#7518)
    • gCNV in the CASE mode now fills in all hidden DenoisingModelConfig and CopyNumberCallingConfig arguments from the input model configuration (#7464)
    • Exposed number of samples used for estimating denoised copy ratios in gCNV via a new --num-samples-copy-ratio-approx argument (#7450)
  • SV Calling

    • JointGermlineCNVSegmentation: bug fixes and refactoring (#7243)
      • A number of bugs, particularly with max-clique clustering, have been fixed, as well as a parameter swap bug in JointGermlineCNVSegmentation
      • Reworked the classes used by JointGermlineCNVSegmentation for SV clustering and defragmentation. The design of SVClusterEngine has been overhauled to enable the implementation of CNVDefragmenter and BinnedCNVDefragmenter subclasses. Logic for producing representative records from a collection of clustered SVs has been separated into an SVCollapser class, which provides enhanced functionality for handling genotypes for SVs more generally.
  • Notable Enhancements

    • Added a new Fragment Allele Depth (FAD) variant annotation (#7511)
      • This annotation is identical to the AD annotation except that allele support is considered per read pair, not per individual read
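
The difference between per-read (AD) and per-read-pair (FAD) counting can be illustrated with a small sketch. It is not GATK's implementation: reads are modeled as hypothetical (fragment name, supported allele) pairs, and skipping fragments whose mates support different alleles is an assumption made here for simplicity.

```python
from collections import Counter


def allele_depth(reads):
    """AD-style count: every individual read votes for the allele it supports."""
    return Counter(allele for _, allele in reads)


def fragment_allele_depth(reads):
    """FAD-style count: each read pair (fragment) votes at most once."""
    alleles_by_fragment = {}
    for name, allele in reads:
        alleles_by_fragment.setdefault(name, set()).add(allele)
    counts = Counter()
    for alleles in alleles_by_fragment.values():
        if len(alleles) == 1:  # assumption: skip fragments whose mates disagree
            counts[next(iter(alleles))] += 1
    return counts


reads = [
    ("frag1", "A"), ("frag1", "A"),  # both mates overlap the site and agree
    ("frag2", "T"),                  # only one mate overlaps the site
    ("frag3", "A"), ("frag3", "T"),  # mates disagree
]
print(allele_depth(reads))
print(fragment_allele_depth(reads))
```

Overlapping mates inflate the per-read count for frag1, while the per-fragment count treats the pair as a single observation.
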
  • Miscellaneous Changes

    • SplitIntervals: added new tool arguments to control output file naming (#7488)
    • Fixed an issue that caused the Travis CI test suite reports to fail to be uploaded (#7525)
    • Updated Travis CI authentication information (#7521)
  • Documentation

    • Updated StrandBiasBySample documentation (#7283)
    • Updated MarkDuplicatesSpark documentation (#7191) (#7535)
    • Added a comment to .travis.yml about the checkout depth (#7421)
  • Dependencies

    • Updated to GenomicsDB 1.4.2 (#7520)
    • Updated sqlite-jdbc library to a newer version to support M1 Macs (#7519)

4.2.2.0

19 Aug 00:08
cafc6ce

Download release: gatk-4.2.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.2.0 release:

  • The ReblockGVCF tool is now out of beta with several important improvements. This tool can be used to postprocess HaplotypeCaller GVCFs to decrease filesize.

  • FilterMutectCalls now has a --microbial-mode argument that sets filters to defaults appropriate for microbial calling

  • Important bug fixes to CalibrateDragstrModel and Funcotator

Full list of changes:

  • New Tools

    • ShiftFasta: create a fasta with the bases shifted by an offset (#6694)
  • ReblockGVCF

    • ReblockGVCF is now out of beta (#7419)
    • Improved ReblockGVCF output to eliminate overlapping reference blocks and reference gaps following trimmed deletions (#7122)
    • Fixed bugs associated with input no-call genotypes and fixed an off-by-one error at contig starts (#7404)
    • Fixed an error on ref blocks with missing DPs (if --floor-blocks arg is not provided); fixed rare cases where spanning deletion (*) allele is incorrectly modified (#7400)
  • Mutect2

    • FilterMutectCalls: added a --microbial-mode argument that sets filters to defaults appropriate for microbial calling (#6694)
  • ValidateVariants

    • Added an optional argument to check for GVCF reference blocks overlapping variants or other reference blocks (#7405)
  • DRAGEN-GATK

    • Fixed a thread safety issue in CalibrateDragstrModel that could cause intermittent ArrayIndexOutOfBoundsExceptions (#7417)
    • Added documentation for ComposeSTRTableFile (#7409)
  • Funcotator

    • Fixed an issue where the Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 fields were not being populated in MAF output (#7422)
  • Mitochondrial pipeline

    • Removed calls to FilterNuMTs and FilterLowHetSites, which are no longer being used (#7325)
  • CNV Calling

    • Fixed a bug resulting from prefix strings of less than 3 characters when creating temporary files in GermlineCNVCaller and improved documentation of corresponding utility methods. (#7411)
  • Documentation

    • Fixed an argument name typo in the CombineGVCFs docs (#7413)
    • Fixed the wording of a comment in MultiVariantDataSource (#7388)

4.2.1.0

30 Jul 21:27
9951f77

Download release: gatk-4.2.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.1.0 release:

  • Several important fixes to HaplotypeCaller and the new DRAGEN-GATK code introduced in GATK 4.2.0.0

  • Started laying the groundwork in Mutect2 for Mutect3, which will be more machine learning focused

  • LocalAssembler: a new tool that performs local assembly of small regions to discover structural variants (#6989)

  • Support for multi-sample segmentation in ModelSegments

  • Major speed improvements and several important fixes to Funcotator

  • A new version of the Intel Genomics Kernel Library (GKL), with many important fixes and improvements

  • A new version of GenomicsDB, with improved cloud support

  • A GATK-wide option to shard VCFs on output, which is often useful for pipelining

  • GATK support for block compressed interval (.bci) files, which is useful when working with extremely large interval lists

Full list of changes:

  • New Tools

    • LocalAssembler: a new tool that performs local assembly of small regions to discover structural variants (#6989)
  • HaplotypeCaller

    • Fixed a rare edge case in DRAGEN mode that could result in negative GQs when USE_POSTERIOR_PROBABILITIES is set (#7120)
    • Fixed a rare edge case (mainly affecting DRAGEN mode) that could cause the PL arrays to be deleted when genotyping in HaplotypeCaller (#7148)
    • Fixed a bug in AlleleLikelihoods that could result in new evidence being assigned arbitrary likelihoods left over from previous evidence (#7154)
    • Fixed a "Padded span must contain active span" error caused by invalid feature file intervals that weren't being checked for validity against the sequence dictionary (#7295)
    • Do not add the artificial haplotype read group to the bamout file when --bam-writer-type NO_HAPLOTYPES is specified (#7141)
    • Suppressed excessive log output related to JumboAnnotation warnings in HaplotypeCaller (#7358)
  • DRAGEN-GATK

    • CalibrateDragstrModel: fixed a sporadic out-of-memory error (#7212)
    • CalibrateDragstrModel: fixed an "IllegalArgumentException: Start cannot exceed end" error (#7212)
  • Mutect2

    • Added a training data mode (--training-data-mode) to Mutect2 to prepare for Mutect3 (#7109)
      • Training data mode collects data on variant- and artifact-supporting read sets for fitting a deep learning filtering model
    • Better error bars for samples with small contamination in CalculateContamination (#7003)
  • Funcotator

    • Greatly improved Funcotator performance by optimizing the VCF sanitization code (#7370)
      • In our tests, this change appears to speed up the tool by roughly 2x
    • Updated the Gencode GTF Codec to be more permissive with transcript and gene types (#7166)
      • Now the Gencode GTF Codec no longer restricts transcriptType and geneType to a limited set of values. These fields are now each stored as a String. This allows for arbitrary values in these fields and will help to future-proof (and species-proof) the GTF parser.
      • Fixes "IndexFeatureFile Error to Run Funcotator with Mouse Ensembl GTF" (#7054)
    • Funcotator can now decode codons containing IUPAC bases into amino acids. (#7188)
    • Updated the tool to allow for protein changes with N / IUPAC bases. (#6778)
      • Added the ability to have IUPAC bases in either the ref/alt alleles OR in the reference when calculating the amino acid sequence. In this case, the code will no longer throw a user exception, but will log a warning and will produce ? amino acids in the case that they cannot be decoded from the amino acid table. Currently this will happen any time an N or IUPAC base is in the region to be coded into amino acids.
      • Added AminoAcid.UNDECODABLE as a placeholder for any unknown / undecodable amino acid (such as in the case of an ambiguous IUPAC base).
    • Funcotator now checks whether the input has already been annotated, and by default throws an error in that case.
      • We also added a --reannotate-vcf override argument to explicitly allow reannotation (#7349)
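
The IUPAC handling described above (emit ? instead of failing when a codon contains N or another ambiguity code) can be sketched as follows. This is an illustration only, not Funcotator's code: the function names are hypothetical, and the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

UNDECODABLE = "?"  # placeholder for any codon that cannot be decoded


def decode_codon(codon: str) -> str:
    """Decode one codon; any non-ACGT base makes it undecodable."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in "ACGT" for base in codon):
        return UNDECODABLE
    return CODON_TABLE[codon]


def translate(sequence: str) -> str:
    """Translate complete codons, ignoring a trailing partial codon."""
    usable = len(sequence) - len(sequence) % 3
    return "".join(decode_codon(sequence[i:i + 3]) for i in range(0, usable, 3))


print(translate("ATGGCTTAA"))
print(translate("ATGGCNTGG"))
```

Note that this sketch marks every codon containing an ambiguity code as undecodable, matching the release note's statement that this currently happens any time an N or IUPAC base is in the coding region.
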
  • CNV Calling

    • Enabled multi-sample segmentation in ModelSegments (#6499)
    • Removed mapping error rate from estimate of denoised copy ratios output by gCNV, and updated sklearn. (#7261)
    • Moved gCNV sample QA check into the Postprocessing task in the WDL (#7150)
  • SV Calling

    • Added LocalAssembler, a new tool that performs local assembly of small regions to discover structural variants (#6989)
  • The Genomics Kernel Library (GKL)

    • Updated to GKL version 0.8.8, and removed the FPGA PairHMM as an option (#7203)
      • This is a significant update to the GKL that comes with many fixes and improvements:
        • Updated ISAL and OTC Zlib libraries to the latest version (Q1 2021)
        • Fixed 3 reproducible issues and retested out of 4 more in GKL
        • Updated build for CentOS 7 and current Mac.
        • Ran valgrind on limited C unit tests (passed)
        • Major improvements to input validation
        • Major updates to error handling and propagation.
        • Added Negative space unit testing coverage
        • Regular Static Code Scanning
        • Good overall quality of life improvement for the software
  • GenomicsDB

    • Moved to GenomicsDB 1.4.1, and added a toggle between the GCS Connector and native GCS support (#7224)
      • This release allows for the direct use of the native GCS C++ client instead of the GCS Cloud Connector via HDFS. The GCS Cloud Connector can still be used with GenomicsDB via the --genomicsdb-use-gcs-hdfs-connector option
      • Using the native client with GCS allows for GenomicsDB to use the standard paradigms to help with authentication, retries with exponential backoff, configuring credentials, etc., and also helps with performance issues with GCS. See #7070.
    • Allow specifying S3 and Azure blob storage URIs to GenomicsDB in addition to GCS and HDFS (#7271)
    • Fixes related to the GenomicsDB upgrade (#7257)
      • Fixed an issue where the combine operation for certain fields needs to take care to not remap missing fields to NON_REF
      • Fixes "Regression in GenomicsDBImport progress meter" #7222
      • Adds tests for "GenomicsDBImport Creating Workspace Where REF is Inappropriately N?" #7089
    • Improved the error message in GenomicsDBImport when failing to open a FeatureReader (#7375)
  • Mitochondrial pipeline

    • Added median coverage metric to the mitochondrial pipeline (#7253)
  • Notable Enhancements

    • Added a GATK-wide option (--max-variants-per-shard) to shard VCFs on output (#6959)
      • Sharded output is often extremely useful for pipelining
    • Added GATK support for block compressed interval (.bci) files (#7142)
    • Added an AlleleDepthPseudoCounts (DD) genotype annotation. (#7303)
      • Similar to AD, the new annotation (DD) captures the depth of each allele's supporting evidence or reads. However, it does so by following a variational Bayes approach that looks at the likelihoods rather than applying a fixed threshold, which turns out to be more robust in some instances.
      • To get the new non-standard annotation in HaplotypeCaller you need to add -A AllelePseudoDepth
    • We now track the source of variants in MultiVariantWalkers, which is important for some tools such as VariantEval (#7219)
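
The idea behind sharded output (--max-variants-per-shard) can be sketched as a simple chunking generator. This only illustrates the concept: how GATK names shard files and handles headers or indexes is not covered by the note above, and shard_records is a hypothetical helper.

```python
from typing import Iterable, Iterator, List


def shard_records(records: Iterable[str], max_variants_per_shard: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_variants_per_shard records."""
    if max_variants_per_shard <= 0:
        raise ValueError("max_variants_per_shard must be positive")
    shard: List[str] = []
    for record in records:
        shard.append(record)
        if len(shard) == max_variants_per_shard:
            yield shard
            shard = []
    if shard:  # final, possibly smaller shard
        yield shard


variants = ["v1", "v2", "v3", "v4", "v5"]
for index, shard in enumerate(shard_records(variants, 2)):
    print(index, shard)
```

Each shard can then be handed to a separate downstream task, which is why this kind of output is convenient for pipelining.
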
  • Bug Fixes

    • Fixed key ordering bugs in the implementations of Histogram.median() and CompressedDataList.iterator() (#7131)
      • These bugs could result in incorrect RankSumTest annotations in some cases
    • Fixed the DepthPerSampleHC and StrandBiasBySample annotations to not spam the logs with "Annotation will not be calculated" warnings (#7357)
    • VariantEval: fixed contig stratification to defer to user-defined intervals (#7238)
  • Miscellaneous Changes

    • The ProgressMeter can now be completely disabled for all tools / traversals by overriding GATKTool.disableProgressMeter() (#7354)
    • We now authenticate with Dockerhub in our Travis builds, to help avoid tests failing due to quota issues (#7204) (#7256)
    • Migrated VariantEval to be a MultiVariantWalkerGroupedOnStart (#6973)
    • VariantEval: added an argument to specify the PedigreeValidationType (#7240)
    • Converted InfoFieldAnnotation/GenotypeAnnotation into interfaces. (#7041)
    • Allow MultiVariantWalkerGroupedOnStart subclasses to view/set ignoreIntervalsOutsideStart (#7301)
    • PedigreeAnnotation: consolidate code, provide getters, and allow PedigreeValidationType to be set (#7277)
    • ASEReadCounter: added a warning for variants lacking GT fields (#7326)
    • Added filters to dockstore.yml so that only the master branch and the releases get synced to Dockstore (#7217)
    • Fixed a compatibility issue between Java 11 and log4j2 (#7339)
    • We now update the gcloud package signing key at the start of every docker build (#7180)
    • Updated our Artifactory key (#7208)
    • Disabled some Spark dataproc tests because of dependency issues. (#7170)
    • Removed some embedded licenses from scripts (#7340)
  • Documentation

    • Variant annotation documentation: removed broken links to related annotations from the tool docs (#7307)
    • Updated the link to an article on Jexl expressions (#7317)
    • Fixed several broken links in docs for the CNV tools (#7309)
    • Fixed broken links in the docs for `Funco...

4.2.0.0

19 Feb 21:26
1f7fade

Download release: gatk-4.2.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.0.0 release:

  • We've worked closely with Illumina to port a number of significant innovations for germline short variant calling from their DRAGEN pipeline to GATK. These improvements will form the basis of the upcoming open-source implementation of the DRAGEN pipeline which we're calling DRAGEN-GATK

  • A number of other fixes and improvements to HaplotypeCaller to improve the phasing of variant calls and to fix edge cases with indels and spanning deletions

  • A new pipeline for gCNV exome joint calling

Full list of changes:

  • DRAGEN-GATK (#6634) (#7063)

    • With this release we've worked closely with Illumina to make improvements to the GATK HaplotypeCaller to allow it to output germline short variant calls that are functionally equivalent to the calls made by their DRAGEN 3.4.12 pipeline. See our blog post on DRAGEN-GATK for more details on these improvements. A full DRAGEN-GATK pipeline that leverages these new features will be released in the near future as a WDL workflow script in the WARP repo on GitHub as well as a featured workspace in Terra.
    • Below is a summary of the improvements we've ported from DRAGEN in this release. We recommend that most users wait until the complete DRAGEN-GATK pipeline is released as a WDL workflow before evaluating these features, though advanced users comfortable with building their own pipelines are welcome to try them out now:
      • DragSTR: a port of DRAGEN's model for STRs (Short Tandem Repeats) that adjusts HMM indel priors based on empirical reference contexts for better indel calling.
        • Using DragSTR involves running two new tools prior to the HaplotypeCaller:
          • ComposeSTRTableFile: scans a reference for STR sites and outputs a table file with a subsample of the available STR sites across the genome.
          • CalibrateDragstrModel: given the STR table for a reference produced by ComposeSTRTableFile and the reads for a specific sample, generates a model for potential sequencing errors for STR sites of various sizes for that sample.
        • After running these tools, you then run HaplotypeCaller with the --dragstr-params-path argument to pass it the DragSTR model generated by CalibrateDragstrModel.
      • BQD (Base Quality Dropout) and FRD (Foreign Read Detection): two new genotyper error models ported from DRAGEN
        • The Base Quality Dropout (BQD) model penalizes variants with low average base quality scores and high average sequencing cycle counts among genotyped reads and reads that were otherwise excluded from the genotyper to model read-context dependent sequencing errors.
        • The Foreign Read Detection (FRD) model uses an adjusted mapping quality score as well as read strandedness information to penalize reads that are likely to have originated from somewhere else on the genome or from contamination.
        • To activate the BQD and FRD models, run HaplotypeCaller with the --dragen-mode argument.
      • Added a new variant QUAL score model that reports the variant QUAL score as the posterior of the reference genotype based on the sample-dependent DRAGEN STR and flat SNP priors.
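
As a sketch of the last step described above, the following builds a HaplotypeCaller command line that enables the DRAGEN models and passes in a DragSTR model. Only --dragen-mode and --dragstr-params-path come from these notes; the file names are hypothetical, and the arguments for ComposeSTRTableFile and CalibrateDragstrModel are deliberately left out because they are not described here.

```python
import shlex


def haplotypecaller_dragen_command(reference, bam, dragstr_params, output_vcf):
    """Build (but do not run) a HaplotypeCaller command with the DRAGEN features on."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output_vcf,
        "--dragen-mode",                           # activates the BQD and FRD models
        "--dragstr-params-path", dragstr_params,   # model from CalibrateDragstrModel
    ]


# Hypothetical file names, for illustration only.
cmd = haplotypecaller_dragen_command(
    "ref.fasta", "sample.bam", "sample.dragstr_params.txt", "sample.vcf.gz"
)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes the command easy to log or to pass to subprocess.run once the inputs exist.
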
  • HaplotypeCaller

    • We now add physical phasing information (PGT/PID/PS attributes) to genotypes with spanning deletion alleles (#6937)
    • Fixed two phasing bugs (#7019)
      • Fixed "HaplotypeCaller emitting incorrect phasing when genotyping hom-het-het" (#6463)
      • Fixed "Phased variants do not have the same phase set identifier" (#6845)
    • Fixed quality score calculation for sites with spanning deletions (#6859)
      • This fixes a bug in the AlleleFrequencyCalculator that was causing quality to be overestimated for sites with * alleles representing spanning deletions.
    • Added the ability for indels to be recovered from dangling heads in the assembly graph, and a new --num-matching-bases-in-dangling-end-to-recover argument for filtering dangling ends (#6113) (#7086)
    • Improved handling of indels/spanning deletions in the cigar base quality adjustment code. (#6886)
      • This aims to better handle the edge cases that come up when mates have mismatching numbers of bases at the start or end of the reads relative to each other.
    • Fixed a bug where overlapping reads in subsequent assembly regions could have invalid base qualities (#6943)
    • Convert non-ACGT IUPAC bases to N in HaplotypeCaller prior to assembly to prevent a crash (#6868)
    • Renamed the --mapping-quality-threshold argument to --mapping-quality-threshold-for-genotyping, and updated its documentation to be less confusing (#7036)
    • Added an option for HaplotypeCaller and Mutect2 to produce a bamout without artificial haplotypes (#6991)
    • Updated the --debug-graph-transformations argument to emit the assembly graph both before and after chain pruning (#7049)
  • Mutect2

    • Fixed the --dont-use-soft-clipped-bases argument in Mutect2 to actually work as intended (#6823)
      • Due to a bug, this option did nothing because a copy of the original reads was modified. By deleting the unnecessary mapping quality filtering (this is totally redundant with the M2 read filter), we finalize (and thereby discard soft clips if requested) an assembly region made from the original reads, not a copy.
    • Fixed a bug in the Mutect2 engine active region code that could affect the ability to call tumor alts when the normal has a different alt at the same site (#6908)
    • Removed an obsolete cram to bam conversion step in the Mutect2 WDL (#6970)
    • Updated the Mutect2 whitepaper in docs/mutect/mutect.pdf to accurately reflect current filter names, and updated the section on FilterAlignmentArtifacts (#6967)
  • CNV Calling

    • A new pipeline for gCNV exome joint calling (#6554)
      • Added a new tool (JointGermlineCNVSegmentation) and associated workflow (scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl) to combine gCNV segments and calls across samples
      • JointGermlineCNVSegmentation segments and genotypes CNV calls from the germline CNV pipeline jointly across multiple samples.
      • The workflow in scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl produces a joint, multi-sample genotyped VCF.
      • For whole genomes, we recommend calling CNVs as part of a full SV callset with https://github.com/broadinstitute/gatk-sv (soon to be added to Terra)
    • GermlineCNVCaller now restarts inference once with a new random seed when inference diverges. Also added a new entry point to PythonScriptExecutor that returns ProcessOutput. (#6866)
      • This is intended to alleviate transient issues with GermlineCNVCaller inference in which the ELBO converges to a NaN value, by calling the python gCNV code with an updated random seed input.
    • CreateReadCountPanelOfNormals: fixed a bug in the logic for filtering zero-coverage samples and intervals (#6624)
    • FilterIntervals: fixed a bug in the tool logic when filtering on annotations and -XL is used to exclude intervals (#7046)
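
The restart-once behavior described for GermlineCNVCaller (#6866) follows a common pattern, sketched below. This is not the GATK code: run_inference stands in for the Python gCNV call, and deriving the new seed as seed + 1 is an assumption made for illustration.

```python
import math


def run_with_one_restart(run_inference, seed):
    """Run inference; if the result is NaN, retry exactly once with a new seed."""
    result = run_inference(seed)
    if not math.isnan(result):
        return result, seed
    new_seed = seed + 1  # assumption: any different seed would do
    result = run_inference(new_seed)
    if math.isnan(result):
        raise RuntimeError("inference diverged again after restarting with a new seed")
    return result, new_seed


def fake_inference(seed):
    """Stand-in for the real inference: diverges only for seed 42."""
    return float("nan") if seed == 42 else -123.4


print(run_with_one_restart(fake_inference, 42))
```

Limiting the retry to a single attempt keeps a persistently diverging model from looping forever while still absorbing transient failures.
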
  • SV Calling

    • PrintSVEvidence: a new tool that prints any of the Structural Variation evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF) (#7026)
      • This tool is used frequently in the GATK-SV pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing GATK-SV pipeline.
  • GenomicsDB

    • Introduced a new feature for GenomicsDBImport that allows merging multiple contigs into fewer GenomicsDB partitions (#6681)
      • Controlled via the new --merge-contigs-into-num-partitions argument to GenomicsDBImport
      • This should produce a huge performance boost in cases where users have a very large number of contigs. Prior to this change, GenomicsDB would create a separate folder/partition for each contig, which slowed down import to a crawl when there were many contigs.
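
To illustrate why merging helps, the sketch below groups many contigs into a fixed number of partitions of roughly equal total length. The grouping strategy is an assumption for illustration; the notes above do not describe how GenomicsDBImport assigns contigs to partitions, and merge_contigs is a hypothetical helper.

```python
def merge_contigs(contigs, num_partitions):
    """Group (name, length) contigs, in input order, into at most num_partitions groups."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    total = sum(length for _, length in contigs)
    target = total / num_partitions  # approximate length per partition
    partitions, current, current_length = [], [], 0
    for name, length in contigs:
        current.append(name)
        current_length += length
        # Close the partition once it reaches the target, keeping one slot for the rest.
        if current_length >= target and len(partitions) < num_partitions - 1:
            partitions.append(current)
            current, current_length = [], 0
    if current:
        partitions.append(current)
    return partitions


contigs = [("chr1", 100), ("chr2", 90), ("s1", 5), ("s2", 5), ("s3", 5), ("s4", 5)]
print(merge_contigs(contigs, 2))
```

With six contigs and two partitions, only two folders would be needed instead of six, which is the effect the release note attributes to --merge-contigs-into-num-partitions.
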
  • Funcotator

    • Added sorting by strand order for transcript subcomponents (#7065)
      • This fixes an issue where the coding sequence, protein prediction, and other annotations could be incorrect for the hg19 version of Gencode, due to the individual elements of each transcript appearing in numerical order, rather than the order in which they appear in the transcript at transcription time.
    • Updated the Funcotator tutorial link in the tool documentation. (#6920) (#6925)
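
The strand-order issue fixed in #7065 can be shown with a small sketch: on the minus strand, transcription proceeds from higher to lower genomic coordinates, so sorting elements by ascending position gives the wrong order. The tuple representation and function name here are hypothetical, not Funcotator's data model.

```python
def transcription_order(exons, strand):
    """Sort (start, end) exons into the order they are transcribed."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Plus strand: ascending start. Minus strand: descending start.
    return sorted(exons, key=lambda exon: exon[0], reverse=(strand == "-"))


exons = [(300, 400), (100, 200), (500, 600)]
print(transcription_order(exons, "+"))
print(transcription_order(exons, "-"))
```

Concatenating exon sequences in the wrong order would change the coding sequence, which is consistent with the incorrect protein predictions the release note mentions.
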
  • Mitochondrial pipeline

    • Simplified the max_reads_per_alignment_start argument in mitochondria_m2_wdl/AlignAndCall.wdl (#6904)
    • Removed the unused "autosomal_coverage" parameter from the Filter task in mitochondria_m2_wdl/AlignAndCall.wdl (#6888)
  • Notable Enhancements

    • Added a -O option to save the output to a file in the following tools: FlagStat, CountBases, CountReads, CountVariants, and CountBasesInReference (#7072)
    • DepthOfCoverage: added a new gene_statistics output file (#7025)
    • ReblockGVCF: allow reblocking with no P...

4.1.9.0

09 Oct 22:35
9d5727d

Download release: gatk-4.1.9.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.9.0 release:

  • A major update to Funcotator, bringing in the latest Gencode release, fixing compatibility issues with dbSNP, and more!

  • Two new tools, GeneExpressionEvaluation and ReferenceBlockConcordance

  • Significant performance improvements to DepthOfCoverage and SelectVariants

  • Some important bug fixes:

    • Fixed a bug in HaplotypeCaller and Mutect2 where we were losing insertion events that immediately followed a deletion
    • A fix for the "CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before" issue reported in #6744
    • A fix for a frequently-encountered NullPointerException in the AS_StrandBiasTest annotation when running CombineGVCFs reported in #6766

Full list of changes:

  • New Tools

    • GeneExpressionEvaluation: a tool for evaluating gene expression from RNA-seq reads aligned to whole genome (#6602)

      • This tool counts fragments to evaluate gene expression from RNA-seq reads aligned to the genome. Features to evaluate expression over are defined in an input annotation file in gff3 format. Output is a tsv listing sense and antisense expression for all stranded grouping features, and expression (labeled as sense) for all unstranded grouping features.
    • ReferenceBlockConcordance: a new tool to evaluate concordance of reference blocks in GVCF files (#6802)

      • This tool compares the reference blocks of two GVCF files against each other and produces three histograms:
        • Truth block histogram: Indicates the number of occurrences of reference blocks with a given confidence score and length in the truth GVCF
        • Eval block histogram: Indicates the number of occurrences of reference blocks with a given confidence score and length in the eval GVCF
        • Confidence concordance histogram: Reflects the confidence scores of bases in reference blocks in the truth and eval GVCFs, respectively. An entry of 10 at bin "80,90" means that there are 10 bases which simultaneously have a reference confidence of 80 in the truth GVCF and a reference confidence of 90 in the eval GVCF.
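
The confidence concordance histogram can be sketched as a per-base tally of (truth confidence, eval confidence) pairs. This is a simplified illustration, not the tool's implementation: blocks are hypothetical (start, end, confidence) tuples on a single contig with 1-based inclusive coordinates, and the per-base dictionary favors clarity over efficiency.

```python
from collections import Counter


def per_base_confidence(blocks):
    """Expand (start, end, confidence) reference blocks into a position -> confidence map."""
    confidence = {}
    for start, end, value in blocks:
        for position in range(start, end + 1):
            confidence[position] = value
    return confidence


def confidence_concordance(truth_blocks, eval_blocks):
    """Count bases by their (truth confidence, eval confidence) pair."""
    truth = per_base_confidence(truth_blocks)
    evaluated = per_base_confidence(eval_blocks)
    shared = truth.keys() & evaluated.keys()  # bases covered by a block in both files
    return Counter((truth[p], evaluated[p]) for p in shared)


truth_blocks = [(1, 10, 80), (11, 20, 99)]
eval_blocks = [(1, 15, 90), (16, 20, 99)]
histogram = confidence_concordance(truth_blocks, eval_blocks)
print(histogram[(80, 90)])
```

In this toy input, positions 1 through 10 have confidence 80 in the truth blocks and 90 in the eval blocks, so the bin (80, 90) holds 10 bases, mirroring the example in the description.
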
  • HaplotypeCaller/Mutect2

    • Fixed a bug in HaplotypeCaller and Mutect2 where we were losing insertion events that immediately followed a deletion (#6696)
    • Added a workaround for an issue with multiallelics in the CreateSomaticPanelOfNormals pipeline (#6871)
      • This fixes the "CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before" issue reported in #6744
    • Made improvements to the Mutect2 active region detection code that resulted in recovering some low-AF calls that we were missing (#6821)
    • Made the HaplotypeCaller/Mutect2 adaptive pruner smarter in complex graphs, resulting in modest improvements to indel sensitivity when using the adaptive pruning option (#6520)
    • Fixed a bug in variation event detection code that could sometimes lead to mistreating indel assembly windows as SNP assembly windows (#6661)
    • Fixed a bug in FragmentUtils where insertion quals were used instead of deletion quals when adjusting base qualities for two overlapping reads from the same fragment (#6815)
    • Fixed a concurrent modification exception error for local runs of HaplotypeCallerSpark (#6741)
    • Marked the --linked-de-bruijn-graph argument as Advanced rather than Hidden (#6737)
    • Made a small tweak to Mutect2's callable sites count (#6791)
    • Added a "requester pays" option to Mutect2 WDL tasks that access bams for use with Google Cloud "requester pays" buckets (#6879)
  • Funcotator

    • A major set of updates to Funcotator (#6660)
      • Updated to the latest Gencode release
      • Fixed the contig naming compatibility issue with dbSNP reported in #6564 ("hg38 dbSNP has incorrect contig names")
      • Now both hg19 and hg38 have the contig names translated to "chr__"
      • Added 'lncRNA' to GeneTranscriptType.
      • Added "TAGENE" gene tag.
      • Added the MANE_SELECT tag to FeatureTag.
      • Added the STOP_CODON_READTHROUGH tag to FeatureTag.
      • Updated the GTF versions that are parseable.
      • Fixed a parsing error with new versions of gencode and the remap positions (for liftover files).
      • Added test for indexing new lifted over gencode GTF.
      • Added Gencode_34 entries to MAF output map.
      • Pointed data source downloader at new data sources URL.
      • Minor updates to workflows to point at new data sources.
      • Updated retrieval scripts for dbSNP and Gencode.
      • Added required field to gencode config file generation.
      • Now gencode retrieval script enforces double hash comments at top of gencode GTF files.
      • Fixed an erroneous trailing tab in MAF file output reported in #6693
    • Added a maximum version number for data sources in Funcotator (#6807)
    • Added a "requester pays" option to the Funcotator WDL for use with Google Cloud "requester pays" buckets (#6874)
    • FuncotateSegments: fixed an issue with the default value of --alias-to-key-mapping being set to an immutable value (#6700)
  • GenomicsDB

    • Updated to GenomicsDB Version 1.3.2, which brings better propagation of error messages from the GenomicsDB library (#6852)
      • Using the GATK option GATK_STACKTRACE_ON_USER_EXCEPTION will now also output a limited C/C++ stacktrace
  • CNV Tools

    • Fixed a bug in the KernelSegmenter: the minimal data to calculate the segmentation cost should be 2 * windowSize, rather than windowSize (#6835)
    • Germline CNV WDL improvements for WGS (#6607)
      • Modified gCNV WDLs to improve Cromwell performance when running on a large number of intervals, as in WGS
      • Added optional disabled_read_filters input to CollectCounts
      • Enabled GCS streaming for CollectCounts and CollectAllelicCounts
    • Added a "requester pays" option to the germline and somatic CNV WDLs for use with Google Cloud "requester pays" buckets (#6870)
  • Mitochondrial Pipeline

    • Fixed the handling of spaces in sample names in the Mitochondria WDL (#6773)
    • Exposed a max_reads_per_alignment_start argument in the Mitochondria WDL (#6739)
    • Updated the HaploChecker Dockerfile to reflect the correct haplocheck CLI (#6867)
  • Notable Enhancements

    • Significantly improved the performance of DepthOfCoverage by removing slow string formatting calls (#6740)
      • In a local test run with default arguments, the runtime for a full WGS chr15 drops from ~8.9 minutes to ~4.7 minutes with this patch
    • Significantly improved the performance of SelectVariants with large numbers of samples by changing an operation to scale linearly instead of quadratically with the number of samples (#6729)
      • On one example with several thousand samples, there was a speedup from ~5 minutes to 0.1 minutes
    • WDL generation: made several improvements to automatic WDL generation, annotated additional tools for WDL generation, and added a section to the README with instructions on generating WDLs for GATK tools (#6800)
    • Added a suite of utility methods for working with Google BigQuery: BigQueryUtils (#6759) (#6861)
    • The GATK docker image can now be built with a simple docker build . command (no extra arguments needed) (#6764) (#6842) (#6782)
    • Added a Dockstore yml file with workflow descriptions for the WDLs in the GATK repo, to facilitate automatic publication to Dockstore (#6770)
  • Bug Fixes

    • Fixed a NullPointerException in the AS_StrandBiasTest annotation reported in #6766 (#6847)
    • Fixed a bug with soft clips in LeftAlignIndels (#6792)
    • VariantRecalibrator: uniquify annotations to fix the error reported in #2221 (#6723)
    • Fixed an issue where ContextCovariate in BaseRecalibrator mistakenly assumed that all non-ACGT bases in the read are N (#6625)
    • Fixed a crash in CountBasesSpark when using the -L option (#6767)
  • Miscellaneous Changes

    • Significant refactoring of the SV discovery classes (#6652)
    • FilterVariantTranches: report more info when the ref alleles don't match (#6723)
    • We now report the target url in exceptions thrown by HtsgetReader (#6799)
    • Added more information to error messages in AssemblyRegion for contigs not in the reference dictionary (#6781)
    • Improved an error message in GATKRead.setMatePosition() (#6779)
    • Updated the Barclay WDL template for compatibility with the Debian distribution (#6841)
    • Temporarily disabled HtsgetReader tests to work around issues caused by a server-side upgrade. (#6804)
    • Re-enabled an IndexFeatureFile test for uncompressed BCF. (#6716)
  • Documentation

    • Marked LearnReadOrientationModel as a DocumentedFeature (#6726)
    • Added a gentle warning about loss of True Positives with the default FilterIntervals params (#6751)
    • Updated the README to mention that the conda environment is not officially supported on macOS at this time. (#6788)
    • Fixed a typo in the example command for SplitIntervals (#6869)
    • Fixed a typo in the `--tmp-dir...
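
The KernelSegmenter fix listed under CNV Tools above can be illustrated with a minimal Python sketch. It assumes that the cost at a candidate changepoint compares one window of points on each side, which is why fewer than 2 * windowSize points cannot support a single evaluation. The function name `local_costs` and the mean-difference cost are hypothetical stand-ins chosen for brevity, not the kernel-based cost GATK uses; only the 2 * windowSize requirement comes from the release note.

```python
from statistics import fmean


def local_costs(data, window_size):
    """Return {index: cost} for each candidate changepoint index that has a
    full window of `window_size` points on both sides.

    Hypothetical illustration: the cost here is a plain difference of window
    means, not the kernel-based cost used by the real KernelSegmenter.
    """
    if window_size < 1:
        raise ValueError("window_size must be positive")
    # Two full windows are required, so the minimum amount of data is
    # 2 * window_size rather than window_size.
    if len(data) < 2 * window_size:
        return {}
    costs = {}
    for i in range(window_size, len(data) - window_size + 1):
        left = data[i - window_size:i]
        right = data[i:i + window_size]
        costs[i] = abs(fmean(left) - fmean(right))
    return costs
```

With `window_size=2`, three data points yield no costs, while `[0, 0, 1, 1]` yields a single cost at index 2.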

4.1.8.1

20 Jul 21:14
297f24e

Download release: gatk-4.1.8.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.8.1 release:

  • This is a minor point release intended primarily to push out a needed enhancement to the Mutect2 pipeline.

  • This release also introduces a new framework for the auto-generation of WDLs for GATK/Picard tools. Over the next several GATK releases, we intend to hook GATK/Picard tools up to the new WDL generator, with the ultimate goal of having WDLs automatically published for all tools with each release.

Full list of changes:

  • Mutect2

    • We now allow for the passing of additional arguments to GetPileupSummaries from the Mutect2 WDL (#6713)
  • GATK Engine

    • Added a new framework for the auto-generation of WDLs for GATK/Picard tools (#6504)
      • Over the next several GATK releases, we intend to hook GATK/Picard tools up to the new WDL generator, with the ultimate goal of having WDLs automatically published for all tools with each release
  • Bug Fixes

    • Fixed an error (reported in #6664) when trying to read .vcf/.tbi files located in a path that contains spaces in the name (#6702)
  • Miscellaneous Changes

    • Removed a few GATK classes that are redundant with Picard classes. (#6678)
  • Documentation

    • Added instructions for running Spark tools in LOCAL mode to the README (#6682)
    • Removed documentation reference to a GATK 3.x annotation that no longer exists (#6679)
  • Dependencies

    • Updated HTSJDK to 2.23.0 (#6702)

4.1.8.0

26 Jun 19:16
bc0994c

Download release: gatk-4.1.8.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.8.0 release:

  • A major new release of GenomicsDB (1.3.0), with enhanced support for shared filesystems such as NFS and Lustre, support for MNVs, and better compression leading to a roughly 50% reduction in workspace size in our tests. This also includes a fix for an error in GenotypeGVCFs that several users were encountering when reading from GenomicsDB.

  • A major overhaul of the PathSeq microbial detection pipeline containing many improvements

  • Initial/prototype support for reading from HTSGET services in GATK

    • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
  • Fixes for a couple of frequently-reported errors in HaplotypeCaller and Mutect2 (#6586 and #6516)

  • Significant updates to our Python/R library dependencies and Docker image

Full list of changes:

  • New Tools

    • HtsgetReader: an experimental tool to localize files from an HTSGET service (#6611)
      • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
    • ReadAnonymizer: a tool to anonymize reads with information from the reference (#6653)
      • This tool is useful in the case where you want to use data for analysis, but cannot publish the data without anonymizing the sequence information.
  • HaplotypeCaller/Mutect2

    • Fixed an "evidence provided is not in sample" error in HaplotypeCaller when performing contamination downsampling (#6593)
      • This fixes the issue reported in #6586
    • Fixed a "String index out of range" error in the TandemRepeat annotation with HaplotypeCaller and Mutect2 (#6583)
      • This addresses an edge case reported in #6516 where an alt haplotype starts with an indel, and hence the variant start is one base before the assembly region due to padding a leading matching base
    • Better documentation for FilterAlignmentArtifacts (#6638)
    • Updated the CreateSomaticPanelOfNormals documentation (#6584)
    • Improved the tests for NuMTFilterTool (#6569)
  • PathSeq

    • Major overhaul of the PathSeq WDLs (#6536)
      • This new PathSeq WDL redesigns the workflow for improved performance in the cloud.
      • Downsampling can be applied to BAMs with high microbial content (i.e., >10M reads) that normally cause performance issues.
      • Removed microbial fasta input, as only the sequence dictionary is needed.
      • Broke the pipeline down into smaller tasks. This helps reduce costs by a) provisioning fewer resources at the filter and score phases of the pipeline and b) reducing job wall time to minimize the likelihood of VM preemption.
      • Filter-only option, which can be used to cheaply estimate the number of microbial reads in the sample.
      • Metrics are now parsed so they can be fed as output to the Terra data model.
      • CRAM-to-BAM capability
      • Updated WDL readme
      • Deleted unneeded WDL json configuration, as the configuration can be provided in Terra
    • Added an --ignore-alignment-contigs argument to PathSeq filtering that lets users specify any contigs that should be ignored. (#6537)
      • This is useful for BAMs aligned to hg38, which contains the Epstein-Barr virus (chrEBV)
  • GenomicsDB

    • Upgraded to GenomicsDB version 1.3.0 (#6654)
      • Added a new argument --genomicsdb-shared-posixfs-optimizations to help with shared POSIX filesystems like NFS and Lustre. This disables file locking and, for GenomicsDB import, minimizes writes to disk. On some of the GATK datasets, the import time for about 10 samples on NFS went from 23.72m to 6.34m, which is comparable to importing to a local filesystem. Hopefully this helps with issues #6487 and #6627. It also fixes issue #6519.
      • This version of GenomicsDB also uses pre-compression filters for offset and compression files for new workspaces and genomicsdb arrays. The total sizes for a GenomicsDB workspace using the same dataset as above and the 10 samples went from 313MB to 170MB with no change in import and query times. Smaller GenomicsDB arrays also help with performance on distributed and cloud file systems.
      • This version has added support to handle MNVs similar to deletions as described in Issue #6500.
      • There is added support in GenomicsDBImport to have multiple contigs in the same GenomicsDB partition/array. This will hopefully help import times in cases where users have many thousands of contigs. Changes are still needed from the GATK side to make use of this support.
      • Logging has been improved somewhat, with the native C/C++ code using spdlog and fmt and the Java layer using Apache log4j and the log4j.properties provided by the application. Also, info messages like "No valid combination operation found for INFO field AA - the field will NOT be part of INFO fields in the generated VCF records" will only be output once for the operation.
    • Made VCFCodec the default for query streams from GenomicsDB (#6675)
      • This fixes the frequently-reported NullPointerException in GenotypeGVCFs when reading from GenomicsDB (see #6667)
      • Added a --genomicsdb-use-bcf-codec argument to opt back in to using the BCFCodec, which is faster but prone to the above error on certain datasets
  • CNV Tools

    • DetermineGermlineContigPloidy can now process interval lists with a single contig (#6613)
    • FilterIntervals now filters out any singleton intervals (#6559)
    • Fixed an inaccurate error message in SVDDenoisingUtils (#6608)
  • Docker/Conda Overhaul (#5026)

    • Our docker image is now built off of Ubuntu 18.04 instead of 16.04
      • This brings in newer versions of several important packages such as samtools
    • Updated many of the Python libraries installed via our conda environment and included in our Docker image to newer versions, resolving several outstanding issues in the process
    • R dependencies are now installed via conda in our Docker build instead of the now-removed install_R_packages.R script
      • Due to this change, we recommend running tools that use R packages (e.g., to create plots) with the GATK docker image or the conda environment.
    • NOTE: significant updates and changes to the Ubuntu version, native packages, and R/python packages may result in corresponding numerical changes in results.
  • Mitochondrial Pipeline

    • Minor updates to the mitochondrial pipeline WDLs (#6597)
  • Notable Enhancements

    • RevertSamSpark now supports CRAMs (#6641)
    • Fixed a VariantAnnotator performance issue that could cause the tool to run very slowly on certain inputs (#6672)
    • More flexible matching of dbSNP variants during variant annotation (#6626)
      • Add all dbSNP IDs that match a particular variant to the variant's ID, instead of just the first one found in the dbSNP VCF.
      • Be less brittle to variant normalization issues, and match differing representations of the same underlying variant. This is implemented by splitting and trimming multiallelics, which are suspected to be the predominant cause of these matching failures, before checking for a match.
    • Added a --min-num-bases-for-segment-funcotation argument to FuncotateSegments (#6577)
      • This will allow for segments of length less than 150 bases to be annotated if given at run time (defaults to 150 bases to preserve the previous behavior).
    • SplitIntervals can now handle more than 10,000 shards (#6587)
  • Bug Fixes

    • Fixed interval summary files being empty in DepthOfCoverage (#6609)
    • Fixed a crash in the BQSR R script with newer versions of R (#6677)
    • Fixed a crash when reporting an error while trying to build GATK with a JRE (#6676)
    • Fixed an issue where ReadsSourceSpark.getHeader() wasn't propagating the reference at all when a CRAM file input resides on GCS, so it always resulted in a "no reference was provided" error, even when a reference was provided. (#6517)
    • Fixed an issue where ReadsSourceSpark.checkCramReference() always tried to create a Hadoop Path object for the reference no matter what file system it lives on, which fails when using a reference on GCS. (#6517)
    • Fixed an issue where the tab completion integration tests weren't emitting any output (#6647)
  • Miscellaneous Changes

    • Created a new ReadsDataSource interface (#6633)
    • Migrated read arguments and downstream code to GATKPath (#6561)
    • Renamed GATKPathSpecifier to GATKPath. (#6632)
    • Add a read/write roundtrip Spark integration test for a CRAM and reference on HDFS. (#6618)
    • Deleted redundant methods in SVCigarUtils, and rewrote and moved the rest to CigarUtils (#6481)
    • Re-enabled tests for HTSGET now that the reference server is back to a stable version (#6668)
    • Disabled SortSamSparkIntegrationTest.testSortBAMsSharded() (#6635)
    • Fixed a typo in a SortSamSpark log message. (#6636)
    • Removed incorrect logger from DepthOfCoverage. (#6622)
  • Documentation

    • Fixed annotation equation rendering in the tool docs. (#6606)
    • Added a note on how to filter on MappingQuality in DepthOfCoverage (#6619)
    • Clarified the doc...
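
The more flexible dbSNP matching described under Notable Enhancements above (split multiallelics, trim, then compare, and collect every matching ID) can be sketched in Python as follows. This is a simplified illustration, not the GATK implementation: the function names and the tuple layout are hypothetical, contigs are omitted, and the trimming keeps at least one base per allele.

```python
def trim(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Drop shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def biallelic_keys(pos, ref, alts):
    """Split a (possibly multiallelic) site into trimmed biallelic keys."""
    return {trim(pos, ref, alt) for alt in alts}


def matching_ids(call, dbsnp_records):
    """Return every dbSNP ID whose record shares a trimmed allele with the call.

    `call` is (pos, ref, alts); each dbSNP record is (rsid, pos, ref, alts).
    """
    call_keys = biallelic_keys(*call)
    ids = []
    for rsid, pos, ref, alts in dbsnp_records:
        if call_keys & biallelic_keys(pos, ref, alts):
            ids.append(rsid)
    return ids
```

For example, a called SNP `(100, "A", ["G"])` matches both a plain dbSNP record `A>G` at position 100 and a multiallelic record `AT>GT,A` at the same position, because the `AT>GT` allele trims down to `A>G`.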

4.1.7.0

23 Apr 23:16
4ec2a4b

Download release: gatk-4.1.7.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.7.0 release:

  • Added allele-specific filtering to the mitochondrial pipeline.

    • Allele-specific filtering is important for mitochondrial calling because there are many more multi-allelic sites than in the germline autosome.
  • A fix for the frequently-encountered "Smith-Waterman alignment failure" error in HaplotypeCaller and Mutect2

  • Initial support for http(s) paths for BAM inputs, including signed urls

  • A new tool, DownsampleByDuplicateSet, to randomly sample a fraction of duplicate sets from an input bam sorted by UMI

Full list of changes:

  • New Tools

    • DownsampleByDuplicateSet: a new tool to randomly sample a fraction of an input bam sorted by UMI. (#6512)
      • Given a bam grouped by unique molecular identifier (UMI), this tool drops a specified fraction of duplicate sets and returns a new bam.
      • A duplicate set refers to a group of reads whose fragments start and end at the same genomic coordinate and share the same UMI.
      • The input bam must first be sorted by UMI using FGBio GroupReadsByUmi.
      • Use this tool to create, for instance, an in silico mixture of duplex-sequenced samples to simulate tumor subclones.
  • HaplotypeCaller/Mutect2

    • Fixed a regression in HaplotypeCaller and Mutect2 where alt haplotypes with a deletion at the end of the padded region caused exceptions (#6544)
      • This bug produced error messages like the following: "Smith-Waterman alignment failure. Cigar = 275M with reference length 275 but expecting reference length of 303"
    • Fixed an ArrayIndexOutOfBoundsException in GenotypeUtils.computeDiploidGenotypeCounts() caused by mistakenly assuming ploidy two for no-calls (#6563)
    • Added more control over scattering in the Mutect2 PON WDL to allow arbitrarily fine scattering, reducing the memory required for downstream runs of GenomicsDBImport (#6527)
    • Invert --correct-overlapping-quality argument in HaplotypeCaller to --do-not-correct-overlapping-quality (#6528)
  • Mitochondrial Pipeline

    • Added allele-specific filtering to the mitochondrial pipeline (#6399)
      • Allele-specific filtering is important for mitochondria because there are many more multi-allelic sites than in the germline autosome. With allele-specific filters, downstream tools have access to more of the good allele data.
      • These Mutect2 filters used in the MT pipeline are now allele-specific: weak_evidence, base_qual, map_qual, duplicate, strand_bias, strand_artifact, position, contamination, and low_allele_frac.
      • They are added to the AS_FilterStatus annotation in the INFO field.
      • The numt_chimera and numt_novel filters have been replaced by the possible_numt filter.
      • Two new filtering tools have been added: NuMTFilterTool for the possible_numt filter and MTLowHeteroplasmyFilterTool for the mt_many_low_hets filter, both of which are allele-specific.
      • The --split-multi-allelics option of the LeftAlignAndTrimVariants tool now splits the annotations in the FORMAT and INFO fields that are of type A and R (allele-specific, and allele-specific with reference).
      • The VariantFiltration tool now has an --apply-allele-specific-filters option that will apply masks at the allele level. Before this addition, sites that should not be masked but had deletions spanning a masked site would have been masked. Now, if this option is specified, only the alleles spanning the masked site will be masked.
  • GATK Engine

    • Added initial support for http(s) paths for BAM inputs, including signed urls (#6526)
  • Miscellaneous Changes

    • Exposed maximum copy ratio and point size for CNV plotting tools (#6482)
    • Decreased an epsilon value in VariantRecalibrator so that our production exome joint genotyping tests pass (#6534)
    • Migrated reference arguments and downstream code to GATKPathSpecifier (#6524)
    • Removed obsolete isCompatibleWithSparkBroadcast() method. (#6523)
  • Documentation

    • Cleaned up the handling of some missing values in auto-generated GATK tool documentation (#6565)
      • Now docs won't include null, "", or [] in the default value list.
    • Added a README for the CNN variant scoring workflow, and added an input JSON for Mutect2 workflow files located in GCS buckets (#6542)
    • Fixed a typo in a ploidy prior example in the docs for DetermineGermlineContigPloidy (#6531)

4.1.6.0

25 Mar 16:35
bdb2e15

Download release: gatk-4.1.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.6.0 release:

  • Funcotator now supports ENSEMBL GTF files (and non-human species)

  • A beta port of the GATK3 tool DepthOfCoverage, a tool to assess sequence coverage by a wide array of metrics, partitioned by sample, read group, library, or gene (#5913)

  • Several important bug fixes and enhancements to HaplotypeCaller and Mutect2, including:

    • A fix for an often-reported issue where HaplotypeCaller could produce reads starting with deletions during the realignment step and error out.
    • A fix for another often-reported issue where Mutect2 could emit MNPs despite --max-mnp-distance being 0, causing downstream errors in GenomicsDB about MNPs not being supported.

Full list of changes:

  • New Tools

    • A beta port of the GATK3 tool DepthOfCoverage, a tool to assess sequence coverage by a wide array of metrics, partitioned by sample, read group, library, or gene (#5913)
      • This port fixes several bugs and changes some behavior present in the GATK3 version:
        • Fixed a longstanding bug in GATK3 DepthOfCoverage where using multiple partition types results in column header and body lines having mismatching ordering, causing incorrect output.
        • The old version used to merge adjacent and overlapping intervals when generating interval summary files. This is no longer the case: in GATK4, adjacent and overlapping intervals are tabulated as separate lines in the output. This also applies to gene lists, which would previously have been merged as well.
        • Changed the behavior of gene list coverage to no longer count introns when generating interval summaries for gene lists.
        • Added support for RefSeqGeneList files as optional gene list input.
  • HaplotypeCaller

    • Fixed a bug where single-base intervals led to no calls (#6507)
      • This fixes the issue reported in #6495 "HaplotypeCaller doesn't detect alternate alleles with 1 bp intervals"
    • Clean leading deletions from reads realigned to best haplotypes (#6498)
      • This fixes the issue reported in #6490 "HaplotypeCaller might be producing bogus reads with deletions at their alignment start during realignment to best haplotype step"
    • Fixed an edge case when haplotypes have leading insertion after trimming (#6518)
  • Mutect2

    • Mutect2 can now filter MNVs with orientation bias (#6486)
    • Added an experimental pileup-based read error corrector, which in our evaluations reduces false positives and improves speed at no cost to sensitivity (#6470)
    • Switched CigarBuilder's order for adjacent indels to be deletion first (#6510)
      • Fixes #6473 "Mutect2 (GATK 4.1.5.0) emitting MNPs despite max-mnp-distance 0"
      • This also resolves downstream errors in GenomicsDB about not supporting MNPs
    • Fixed several bugs involving getReadCoordinateForReferenceCoordinate() (#6485)
      • Fixes #6342 "Mutect2 occasionally writes nonsense / invalid values for MPOS info tag"
      • Fixes #6314 "GATK4.1.3.0 Mutect2 enable-all-annotations option error"
      • Fixes #6294 "ReadPosRankSumTest with leading insertions"
      • Fixes #5492 "ReadPosRankSumTest doesn't work for two deletions with one base in between"
  • Funcotator

    • Funcotator now supports ENSEMBL GTF files (and non-human species) (#6477) (#6492)
      • Users can now create datasources for any species for which ENSEMBL has an annotated GTF file and the corresponding coding region FASTA file
      • When creating new data sources, the user must still use gencode as the parent folder for the GTF data source subfolders. For example, for E. coli MG1655:
        • DATASOURCES
          • gencode
            • ASM584v2
              • Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.44.gtf
              • Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa
              • gencode.config
      • For more information on creating data sources see the Funcotator tutorial on the GATK Forums.
      • An example datasource for E. coli MG1655 can be found in the large test files for Funcotator
      • For ENSEMBL datasources for vertebrates: ftp://ftp.ensembl.org/pub/
      • For ENSEMBL datasources for other species: ftp://ftp.ensemblgenomes.org/pub/
  • CNV Calling

    • Upgrade CNV WDLs to 1.0 spec (#6506)
    • Fixed an off-by-one segmentation argument in ModelSegments. (#6497)
  • Miscellaneous Changes

    • Simplified cigar and clipping code; added tests and fixed a few bugs including #6130 (#6403)
    • Refactored and enhanced ArgumentsBuilder (#6474)
    • Allow all GATKSparkTools to set the SBI index granularity (#6458)
    • Delete NioBam and related classes (#6479)
    • Clean up old interval code (#6465)
    • Remove duplicate copy of the NIO prefetching code (#6464)
    • Fix ignored test in GATKReadAdaptersUnitTest (#6471)
    • Fix alternate spellings of De Bruijn in the codebase (#6472)
  • Documentation

    • Fix a broken set of javadoc references in FeatureDataSource (#6478)

4.1.5.0

28 Feb 23:01
7f9b849

Download release: gatk-4.1.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.5.0 release:

  • A new, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)

  • A new version of GenomicsDB that fixes many frequently-reported issues

  • LeftAlignIndels now works for multiple indels

  • VariantAnnotator and Concordance are now out of beta

  • A significant number of bug fixes to major tools like GenotypeGVCFs and SelectVariants

Full list of changes:

  • HaplotypeCaller

    • New, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)
      • Running HaplotypeCaller in this mode will reduce the number of erroneous haplotypes discovered which can improve genotyping, phasing, and runtime.
      • Changed the haplotype recovery step to check that it covers all paths through the graph even if there are poorly supported paths in the JunctionTrees. Added the argument --disable-artificial-haplotype-recovery to disable this behavior.
      • Added the ability to expand graph kmer size after haplotype recovery in the event that there was a failure due to overcomplicated assembly graphs.
      • Added code to squeeze extra sensitivity out of the junction trees by tolerating SNP errors when threading the junction trees themselves
    • Realigning to best haplotype handles indels better (#6461)
    • Fixed issue #5434 on inconsistent selection of reads for the PL, AD, and DP calculations. (#6055)
    • Fixed bug where SNP and indel pseudocounts were swapped in the AlleleFrequencyCalculator (#6401)
    • The qual used in HaplotypeCaller's isActive() method now matches that of GenotypeGVCFs. That is, they both now use the new qual. (#6343)
    • Skip non-nucleotide alleles in force-calling mode, fixing bug (#6405)
    • Fixed the hidden/experimental --error-correct-reads argument to actually correct the bases and qualities (#6366)
    • Removed the deprecated and obsolete --use-new-qual-calculator argument (#6398)
    • Refactored code related to windows and padding for assembly and genotyping, with slight changes to HMM padding for indels (#6358)
  • Mutect2

    • Improved SomaticClusteringModel (#6337)
    • Sped up Mutect2 reference confidence model with fast likelihoods model (#6457)
    • Modified Fragment creation for Mutect2 to not fail for supplementary reads (#6327)
    • Uniquify PG IDs in FilterAlignmentArtifacts (#6304)
    • Fixed error in RealignmentEngine due to converting from exclusive to inclusive interval ends (#6404)
    • Added an error message for no callable sites in Mutect2 (#6445)
    • Changed filter reporting in Mutect2 (#6288)
    • Fixed force-calling mode in M2 mito WDL (#6359)
    • Pass the reference to the realignment filter in the Mutect2 WDL (#6360)
    • Deleted the old orientation bias filter (#6408)
    • Made callable sites a Long to avoid integer overflow (#6303)
  • GenomicsDB

    • Move to GenomicsDB 1.2.0 (#6305)
      • Fixes an issue with GenomicsDBImport erroring out due to duplicate fields in the Info, Format, and/or Filter fields. (#6158)
      • Fixes an issue with GenomicsDBImport not completing for mixed ploidy samples (#6275)
      • This version uses a 64-bit htslib to work around overflow issues when computed annotation sizes exceed the 32-bit integer space
  • Joint Calling

    • GenotypeGVCFs: improved checking for upstream deletions in the GenotypingEngine (#6429)
      • Fixes rare cases where GenotypeGVCFs could emit a variant with a spanned allele (*), and a genotype that references the spanned allele, but fail to emit the upstream spanning variant.
    • GenotypeGVCFs: Don't call the NON_REF allele in genotypes or ADs (#6437)
    • Parse combined AS_QUALapprox values from older reblocked GVCFs properly (#6442)
    • Added a force output sites argument to GenotypeGVCFs (#6263)
    • Remove extraneous alleles in GenotypeGVCFs force-output mode (#6406)
  • CNV Calling

    • Copy temporary files early in gcnvkernel to avoid inadvertent temporary directory cleanup. (#6297)
    • Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. (#6266)
    • Fixed shard index in PostprocessGermlineCNVCalls log message. (#6313)
    • gCNV vcf cleanup (#6352)
    • Index output VCFs for GCNV postprocessing (#6330)
  • Notable Enhancements

    • VariantAnnotator is now out of beta (#6402)
    • Concordance is out of beta (#6397)
    • LeftAlignIndels now works for multiple indels (#6427)
    • FilterVariantTranches can now handle cases where there are only SNPs or only indels, and not both (#6411)
    • Added new read filters for NotProperlyPaired and for MateDistant (#6295)
    • Made the .git directory optional during build (#6450)
  • Bug Fixes

    • Handle zero-weight Gaussians correctly in VariantRecalibrator (#6425)
    • Fixed the --invalidate-previous-filters argument in VariantFiltration to work as intended (i.e., roll back all variants to unfiltered status) (#6412)
    • Fixed a bug where SelectVariants takes forever on many-allelic somatic samples (#6446)
    • Make sure SelectVariants outputs variants in correct order (assuming input vcf is correctly sorted) (#6444)
    • Fixed a NPE crash in VariantEval when run with no intervals/reference (#6283)
    • Fixed a NPE crash in FastaReferenceMaker (#6435)
    • Fixed an out-of-bounds error in CountNs annotation (#6355)
    • Fixed a bug in hardClipCigar function that caused incorrect cigar calculation (#6280)
    • AnalyzeSaturationMutagenesis: fixed bug in codon calling for in-frame inserts (#6332)
  • Miscellaneous Changes

    • Collect split read and paired end evidence files for GATK-SV pipeline (#6356)
    • Add "PASS" filter line for ApplyVQSR and FilterMutectCalls (#6436)
    • Added engine functionality for accessing the user defined intervals without merging them (#5887)
    • Trim intervals loaded from interval files. (#6375)
    • Propagate read group filters in ReadGroupBlackListReadFilter. (#6300)
    • Modified ANDed read filter output message for readability (#6315)
    • Clearly label the number of reads processed in the BaseRecalibrator log output (#6447)
    • Clearly label the CountReads tool output (#6449)
    • Improved the error messages for missing contigs in the reference (#6469)
    • Avoid a copy and reverse operation in CigarUtils.isGood() (#6439)
    • Fixed GenotypeAlleleCount's toString() method (#6376)
    • Minor Funcotator WDL updates. (#6326)
    • Added a getPairOrientation() method to GATKRead (#6420)
    • Merged GATKProtectedVariantContextUtils methods into other classes (#6409)
    • Deleted a lot of unused VCF constants (#6361)
    • Deleted some unused genotyping code (#6354)
    • Fixed incoherent unit test cases in allele subsetting utils (#6448)
    • Add Python script executor error message for SIGKILL exit code 137. (#6414)
    • Pip install pinned numpy. (#6413)
    • Do not install R on travis, and only run the R tests on the Docker. (#6454)
    • Fixes for IndexFeatureFile error reporting. (#6367)
    • Temporarily remove dead Berkeley mirror to unblock builds. (#6422)
    • Disable CNNVariantPipelineTest.testTrainingReadModel until failures are resolved. (#6331)
    • Delete unused JsonSerializer (#6415)
    • Delete empty file SparkToggleCommandLineProgram.java. (#6311)
  • Documentation

    • Clarify the definition of the NON_REF allele (#6431)
    • Clarify behavior of SplitIntervals for lists of adjacent intervals (#6423)
    • Update docs to reflect the fact that TandemRepeat works with HaplotypeCaller (#5943)
    • Update LeftAlignIndels documentation (#6177)
    • Update hyperlink to new GATK forum page in the README (#6381)
    • Add minValue/minRecommended value to ApplyBQSRArgumentCollection (#6438)
    • Small README fixes (#6451)
    • Fix some GATK doc issues (#6318)
    • Update copyright date in LICENSE.TXT (#6383)
  • Dependencies

    • Updated HTSJDK to 2.21.2 (#6462)
    • Updated Picard to 2.21.9 (#6462)
    • Updated Disq to 0.3.5 (#6323)
    • Updated GenomicsDB to 1.2.0 (#6305)
    • Updated TestNG to 7.0.0 (#5787)
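
The Mutect2 change above that made the callable-sites count a Long guards against 32-bit integer overflow. The Python sketch below simulates what a signed 32-bit counter would hold, since Python integers do not overflow on their own. The helper names and the interval lengths are hypothetical; the only facts relied on are that a signed 32-bit integer tops out at 2,147,483,647 and that a count of roughly three billion sites exceeds it.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def wrap_int32(value):
    """Reduce an integer to the value a signed 32-bit counter would hold."""
    return (value + 2**31) % 2**32 - 2**31


def count_sites(interval_lengths, use_64_bit=True):
    """Sum interval lengths, optionally simulating a 32-bit accumulator.

    With use_64_bit=True the sum is left untouched, which stands in for a
    64-bit counter here because the totals involved are far below 2**63.
    """
    total = 0
    for length in interval_lengths:
        total += length
        if not use_64_bit:
            total = wrap_int32(total)
    return total
```

Summing two intervals of 2,000,000,000 and 1,100,000,000 sites gives 3,100,000,000 with the wide counter, but wraps to -1,194,967,296 in the simulated 32-bit case.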