Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to GenomicsDB 1.2.0 #6305

Merged
merged 13 commits into from
Feb 28, 2020
Merged

Move to GenomicsDB 1.2.0 #6305

merged 13 commits into from
Feb 28, 2020

Conversation

nalinigans
Copy link
Collaborator

@nalinigans nalinigans commented Dec 7, 2019

Move to GenomicsDB 1.2.0. This version uses a 64-bit htslib to workaround overflow issues when computed annotation sizes exceed the 32-bit integer space as requested by @ldgauthier.

The htslib and GenomicsDB fixes by @kgururaj also enable read/write of VCFs where the values of INFO fields can be 64-bits. But, if the input VCF (produced by HaplotypeCaller) contains INFO fields with values larger than what can fit in 32 bits, the vid must list those fields as type int64 (currently, it will be int by default). However, no changes are needed in the vid if the input values fit within 32 bits but the result of the combine operation [e.g. sum] is larger.

Note that 64-bit integers are currently only supported for INFO fields (no support for FORMAT fields).

…flow issues with computed annotation sizes outside the 32-bit integer space.
@kgururaj
Copy link
Collaborator

kgururaj commented Dec 9, 2019

I believe this release also fixes #6158 and #6275 (thanks @mlathara).

@nalinigans
Copy link
Collaborator Author

nalinigans commented Dec 9, 2019

GenomicsDB 1.2.0 also supports suppression of "No valid combination operation" warnings - see #2689 . This will however require changes in gatk(maybe hardcoded initially via GenomicsDBUtils) to indicate to GenomicsDB that a set of fields have no combinations. I will open another PR with an example.

@@ -396,7 +397,7 @@ final void printCacheStats() {
final GenomicsDBExportConfiguration.ExportConfiguration exportConfigurationBuilder =
createExportConfiguration(workspace, callsetJson, vidmapJson, vcfHeader, genomicsDBOptions);
return new GenomicsDBFeatureReader<>(exportConfigurationBuilder,
new GenomicsDBBCFCodec(),
new VCFCodec(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about the performance implications of changing to VCF format ? Also, if we no longer need the GenomicsDBBCFCodec codec, that class (and it's tests) can be deleted.

Copy link
Collaborator

@cmnbroad cmnbroad Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are some other tests where GenomicsDBBCFCodec is still used (GenomicsDBImportIntegrationTest) - I'm surprised those don't seem to be failing.

Copy link
Collaborator Author

@nalinigans nalinigans Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenomicsDBImportIntegrationTest has been ported already to use VCFCodec in the change set for this PR. But, I can parameterize the tests to test both BCFCodec and VCFCodec if we decide to keep BCFCodec for performance reasons.

Regarding performance, we will have to benchmark GenomicsDB queries as @droazen mentions. We can probably add a FeatureReader option to choose between BCFCodec and VCFCodec for GenomicsDB if performance turns out to be an issue. If BCFCodec is chosen, GenomicsDB gracefully handles a number of overflow scenarios already.

Note that -

  • BCF and BCFCodec do not support 64-bit fields.
  • VCFCodec does not fully decode the VCF streams by default, it uses the LazyParser whereas BCFCodec always fully decodes the BCF streams. Hopefully, this keeps the VCFCodec somewhat performant even if fields have to be explicitly decoded.
  • GenomicsDBImport already uses VCFCodec to read vcfs, so no change there.

Copy link
Collaborator

@cmnbroad cmnbroad Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, I see. So then we just need to delete the GenomicsDBBCFCodec class (unless we keep an option for BCF, in which case it will depend on the version of BCF that uses). The GenomicsDBBCFCodec is only necessary if the BCF header is stamped with version 2.2.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a benchmark run with about 5K samples more than a year back - negligible difference in performance between VCF and BCF serialization/deseralization. It would be good to re-check again with your data.

@@ -396,7 +397,7 @@ final void printCacheStats() {
final GenomicsDBExportConfiguration.ExportConfiguration exportConfigurationBuilder =
createExportConfiguration(workspace, callsetJson, vidmapJson, vcfHeader, genomicsDBOptions);
return new GenomicsDBFeatureReader<>(exportConfigurationBuilder,
new GenomicsDBBCFCodec(),
new VCFCodec(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person, we should benchmark this branch to make sure that the switch to VCFCodec did not cause a major performance regression.

@nalinigans
Copy link
Collaborator Author

@ldgauthier, does this branch fix the overflow with combined annotation for RAW_MQandDP in the GenomicsDB you encountered for the large cohort? Thanks.

@ldgauthier
Copy link
Contributor

@nalinigans I can test on my use case specifically, but it's going to take me some time to dig up all the data again.

nalinigans and others added 3 commits February 1, 2020 12:26
… default codec will be BCF2Codec.

Also, support using max-genotype-counts/max-alternate-alleles arg options from GenotypeVCFs/GnarlyGenotyper to be passed to GenomicsDB - see #6377
@nalinigans
Copy link
Collaborator Author

nalinigans commented Feb 7, 2020

@droazen, I have introduced --genomicsdb-use-vcf-codec argument defaulting to false. This will allow the user to choose between VCFCodec and BCF2Codec for GenomicsDB streams.

--genomicsdb-use-vcf-codec:Boolean
                              Use VCF Codec Streaming for data from GenomicsDB  Default value: true. Possible values:
                              {true, false} 

Also, hooked --max-genotype-count arg from GenotypeGVCFs/GnarlyGenotyper to GenomicsDB Export Configuration and introduced --genomicsdb-max-genotype-count to be used by tools like SelectVariants that do not use the Genotype ArgumentCollection. This will help with Issue #6275.

--genomicsdb-max-genotype-count:Integer
                              Maximum number of genotypes to consider at any site  Default value: 1024.

@nalinigans
Copy link
Collaborator Author

@ldgauthier, you mentioned that it would take some time for you to dig into this. But, does this branch fix the overflow with combined annotation for RAW_MQandDP in the GenomicsDB you encountered for the large cohort? Please note that you have to explicitly set the useVCFCodec arg/option to true to test.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a WIP - submitting partial comments.

import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeCalculationArgumentCollection;

import java.io.Serializable;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dangling unused import..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this.

@@ -0,0 +1,18 @@
package org.broadinstitute.hellbender.tools.genomicsdb;

import htsjdk.variant.variantcontext.GenotypeLikelihoods;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this.

fullName = USE_VCF_CODEC_LONG_NAME,
doc = "Use VCF Codec Streaming for data from GenomicsDB",
optional = true)
public boolean useVCFCodec = DEFAULT_USE_VCF_CODEC;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally try to have boolean command line arguments default to false, with true being used to opt in to non-default behavior. If we're going to default to VCF, then this arg should probably be "genomicsdb-use-bcf-codec" with default of false.

Copy link
Collaborator Author

@nalinigans nalinigans Feb 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we need the default be BCF as we are not sure about the performance implications of VCF as yet. Changed the DEFAULT_USE_VCF_CODEC to false.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. But it looks like the GenomicsDBImportIntegrationTest tests are still using the VCF codec, which now isn't the one used by default. If we're going to keep BCF as the default, those tests should probably be reverted back to BCF, and there really should be at least one new test that uses the opt-in VCF codec, preferably with 64 bit values.

Copy link
Collaborator Author

@nalinigans nalinigans Feb 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed the GenomicsDB/GnarlyGenotyper tests to run with both codecs. But, I don't have the bandwidth to introduce new 64-bit VCFs to test VCF Codec. We do have tests in GenomicsDB for the 64 bit VCFs - see GenomicsDB PR and vcfs like bigint0.vcf.gz for the GenomicsDB CI tests and they can be ported to work with GATK test references. I can look into this later or perhaps it is easy enough for someone else from the team to do it.

@Advanced
@Argument(
fullName = USE_VCF_CODEC_LONG_NAME,
doc = "Use VCF Codec Streaming for data from GenomicsDB",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should state what the alternative is (that is toggles between vcf/bcf) for the internal interchange format.

/**
* Currently there is no support for 64-bit fields in BCFCodec. Specifying this option will allow
* for 64-bit width positions and INFO fields and for computed annotation sizes to exceed the 32-bit
* integer space while encoding/decoding with GenomicsDB.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments below; if we reverse the toggle polarity as I suggest, we should update this to reflect the reason for opting in to BCF (though I've lost track of what that is - is it just that we're just keeping it as a failsafe in case there is a perf issue ?).


import java.io.Serializable;

public class GenomicsDBBaseArgumentCollection implements Serializable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better name we can use to make the logical distinction between the the two arg collections clearer ?

this.callGenotypes = genomicsdbArgs.callGenotypes;
this.maxDiploidAltAllelesThatCanBeGenotyped = genomicsdbArgs.maxDiploidAltAllelesThatCanBeGenotyped;
this.maxGenotypeCount = genotypeCalcArgs.MAX_GENOTYPE_COUNT;
this.useVCFCodec = genomicsdbArgs.useVCFCodec;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having constructor overloads like this, add getters to argument collection classes (its basically the public interface of GenomicsDBOptions, minus the reference). Then the default argument values could be defined in one place (whichever collection winds up being the "base" collection class), and some of these aliases and constructor overloads could be eliminated.

Copy link
Collaborator

@cmnbroad cmnbroad Feb 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalinigans This suggestion can be deferred and made in a subsequent PR if you'd prefer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's defer this to another PR. Thanks.

* Genotype.getAnyAttribute() returns String or Integer based on the VC being decoded fully, so use this routine instead to
* guarantee that it is an int.
*/
private static int getAttributeAsInt(Genotype genotype, String key) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use VariantContextGetters. getAttributeAsInt() instead of adding this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @param v2 List of expected variant contexts
*/
private static void assertEqualVariants(
final List<VariantContext> v1, final List<VariantContext> v2) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pleas rename to actualVariants and expectedVariants.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, might as well put this right in VariantContextTestUtils adjacent to the method its calling.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Utils.nonNull(v2, "v2");
Assert.assertEquals(v1.size(), v2.size());
for (int i = 0; i < v1.size(); i++) {
VariantContextTestUtils.assertVariantContextsAreEqual(v1.get(i), v2.get(i), new ArrayList<String>(), new ArrayList<String>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use the diamond operator for these empty lists, or better yet, just use Collections.emptyList().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed using Collections.emptyList(). Thanks.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalinigans A few more comments and questions. I may not be understanding where some of these changes are headed, so please let me know if I'm misunderstanding.

fullName = USE_VCF_CODEC_LONG_NAME,
doc = "Use VCF Codec Streaming for data from GenomicsDB",
optional = true)
public boolean useVCFCodec = DEFAULT_USE_VCF_CODEC;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. But it looks like the GenomicsDBImportIntegrationTest tests are still using the VCF codec, which now isn't the one used by default. If we're going to keep BCF as the default, those tests should probably be reverted back to BCF, and there really should be at least one new test that uses the opt-in VCF codec, preferably with 64 bit values.

* @param g genotype for a sample
* @return int array for the given genotype
*/
private static int[] getSBFieldAsIntArray(Genotype g) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there is already code that does all this in StrandBiasTest.getStrandCounts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just refactored the existing code with existing patterns of throwing exceptions in GnarlyGenotyper. I will leave to @ldgauthier to clean this up further. Thanks.

@@ -420,6 +423,9 @@
doc="Suppress reference path in output for test result differencing")
private boolean suppressReferencePath = false;

@ArgumentCollection
private GenomicsDBArgumentCollection genomicsdbArgs = new GenomicsDBArgumentCollection();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has any thought been given to how this new argument interacts with the existing genotype subsetting arguments SelectVariants already has ? If not, I'm thinking this change will likely cause confusion, and that we should take this out until those are resolved and some tests are added.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no existing argument that clashes with the existing subsetting arguments. However, if you would rather take this out, please let me know.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SV has several arguments that do various kinds of genotype subsetting/downsampling/filtering. The argument is marked as @Advanced, so maybe its ok. But if this argument does what I think it does, it may not be obvious to a user that this subsetting is done independently before SV ever sees the data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more thought. Have removed all support for GenomicsDB args in SelectVariants for now and have filed #6456 to generally get the tools hook up to GenomicsDBExportConfiguration better.

@@ -169,7 +172,9 @@ public boolean requiresReference() {
@Override
protected GenomicsDBOptions getGenomicsDBOptions() {
if (genomicsDBOptions == null) {
genomicsDBOptions = new GenomicsDBOptions(referenceArguments.getReferencePath(), CALL_GENOTYPES, PIPELINE_MAX_ALT_COUNT);
genomicsdbArgs.callGenotypes = CALL_GENOTYPES;
genomicsdbArgs.maxDiploidAltAllelesThatCanBeGenotyped = PIPELINE_MAX_ALT_COUNT;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a bit strange to mutate values in an argument collection like this (I know they aren't annotated as @Argument, but presumably someday they will be exposed ?). Can you verify that the only net change here in this part of the PR is that we're exposing the vcfCodec switch to GnarlyGenotyper (?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the net change here allows GnarlyGenotyper to access the vcfCodec arg. But, have also hooked --max-genotype-count arg from GenotypeGVCFs/GnarlyGenotyper to be passed to GenomicsDBOptions to help with Issue #6275.

@nalinigans
Copy link
Collaborator Author

nalinigans commented Feb 13, 2020

I have merged to the latest master, but Travis builds are still failing not related to the changes in this PR. @droazen, @cmnbroad, can someone take a look and let me know what should be done? Thanks.

@nalinigans nalinigans requested a review from cmnbroad February 13, 2020 18:45
@cmnbroad
Copy link
Collaborator

@nalinigans thanks for the updates. All builds are failing for everyone at the moment, so its not just this branch. We're working on a fix. I'll take a look at your changes.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalinigans Thanks for the changes. A couple more minor comments (also, is it possible to add at least one test, maybe to Select Variants, for the --genomicsdb-max-genotype-count arg, which is exposed in several tools now ?) Also, build is fixed now, so if you rebase on master the tests should pass again.

private static final String MAX_GENOTYPE_COUNT_LONG_NAME = "genomicsdb-max-genotype-count";

@Advanced
@Argument(fullName=MAX_GENOTYPE_COUNT_LONG_NAME, doc="Maximum number of genotypes to consider at any site", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used outside outside of the genotyping tools, I think it would make sense to remove the word "consider" from the doc string, and also reinforce that this only applies when the underlying store is GenomicsDB. Maybe something like "Maximum number of genotypes to retrieve when the underlying store is GenomicsDB".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument was being used only by SelectVariants. Note that the existing GenomicsDBArgumentCollection is removed and GenomicsDBBaseArgumentCollection is renamed to GenomicsDBArgumentCollection.

@@ -420,6 +423,9 @@
doc="Suppress reference path in output for test result differencing")
private boolean suppressReferencePath = false;

@ArgumentCollection
private GenomicsDBArgumentCollection genomicsdbArgs = new GenomicsDBArgumentCollection();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SV has several arguments that do various kinds of genotype subsetting/downsampling/filtering. The argument is marked as @Advanced, so maybe its ok. But if this argument does what I think it does, it may not be obvious to a user that this subsetting is done independently before SV ever sees the data.

@nalinigans nalinigans requested a review from cmnbroad February 15, 2020 04:25
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen you're on the reviewer list too, so not sure if you want to review as well before merge, but I'm approving this as is so we can get this in the next release.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants