-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move to GenomicsDB 1.2.0 #6305
Move to GenomicsDB 1.2.0 #6305
Conversation
…flow issues with computed annotation sizes outside the 32-bit integer space.
GenomicsDB 1.2.0 also supports suppression of "No valid combination operation" warnings - see #2689 . This will however require changes in gatk(maybe hardcoded initially via GenomicsDBUtils) to indicate to GenomicsDB that a set of fields have no combinations. I will open another PR with an example. |
@@ -396,7 +397,7 @@ final void printCacheStats() { | |||
final GenomicsDBExportConfiguration.ExportConfiguration exportConfigurationBuilder = | |||
createExportConfiguration(workspace, callsetJson, vidmapJson, vcfHeader, genomicsDBOptions); | |||
return new GenomicsDBFeatureReader<>(exportConfigurationBuilder, | |||
new GenomicsDBBCFCodec(), | |||
new VCFCodec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering about the performance implications of changing to VCF format ? Also, if we no longer need the GenomicsDBBCFCodec
codec, that class (and it's tests) can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are some other tests where GenomicsDBBCFCodec
is still used (GenomicsDBImportIntegrationTest) - I'm surprised those don't seem to be failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenomicsDBImportIntegrationTest has been ported already to use VCFCodec in the change set for this PR. But, I can parameterize the tests to test both BCFCodec and VCFCodec if we decide to keep BCFCodec for performance reasons.
Regarding performance, we will have to benchmark GenomicsDB queries as @droazen mentions. We can probably add a FeatureReader option to choose between BCFCodec and VCFCodec for GenomicsDB if performance turns out to be an issue. If BCFCodec is chosen, GenomicsDB gracefully handles a number of overflow scenarios already.
Note that -
- BCF and BCFCodec do not support 64-bit fields.
- VCFCodec does not fully decode the VCF streams by default, it uses the LazyParser whereas BCFCodec always fully decodes the BCF streams. Hopefully, this keeps the VCFCodec somewhat performant even if fields have to be explicitly decoded.
- GenomicsDBImport already uses VCFCodec to read vcfs, so no change there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, I see. So then we just need to delete the GenomicsDBBCFCodec
class (unless we keep an option for BCF, in which case it will depend on the version of BCF that uses). The GenomicsDBBCFCodec
is only necessary if the BCF header is stamped with version 2.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a benchmark run with about 5K samples more than a year back - negligible difference in performance between VCF and BCF serialization/deseralization. It would be good to re-check again with your data.
@@ -396,7 +397,7 @@ final void printCacheStats() { | |||
final GenomicsDBExportConfiguration.ExportConfiguration exportConfigurationBuilder = | |||
createExportConfiguration(workspace, callsetJson, vidmapJson, vcfHeader, genomicsDBOptions); | |||
return new GenomicsDBFeatureReader<>(exportConfigurationBuilder, | |||
new GenomicsDBBCFCodec(), | |||
new VCFCodec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in person, we should benchmark this branch to make sure that the switch to VCFCodec
did not cause a major performance regression.
@ldgauthier, does this branch fix the overflow with combined annotation for RAW_MQandDP in the GenomicsDB you encountered for the large cohort? Thanks. |
@nalinigans I can test on my use case specifically, but it's going to take me some time to dig up all the data again. |
… default codec will be BCF2Codec. Also, support using max-genotype-counts/max-alternate-alleles arg options from GenotypeVCFs/GnarlyGenotyper to be passed to GenomicsDB - see #6377
@droazen, I have introduced --genomicsdb-use-vcf-codec argument defaulting to false. This will allow the user to choose between VCFCodec and BCF2Codec for GenomicsDB streams.
Also, hooked --max-genotype-count arg from GenotypeGVCFs/GnarlyGenotyper to GenomicsDB Export Configuration and introduced --genomicsdb-max-genotype-count to be used by tools like SelectVariants that do not use the Genotype ArgumentCollection. This will help with Issue #6275.
|
@ldgauthier, you mentioned that it would take some time for you to dig into this. But, does this branch fix the overflow with combined annotation for RAW_MQandDP in the GenomicsDB you encountered for the large cohort? Please note that you have to explicitly set the useVCFCodec arg/option to true to test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a WIP - submitting partial comments.
import org.broadinstitute.barclay.argparser.Argument; | ||
import org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeCalculationArgumentCollection; | ||
|
||
import java.io.Serializable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dangling unused import..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this.
@@ -0,0 +1,18 @@ | |||
package org.broadinstitute.hellbender.tools.genomicsdb; | |||
|
|||
import htsjdk.variant.variantcontext.GenotypeLikelihoods; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this.
fullName = USE_VCF_CODEC_LONG_NAME, | ||
doc = "Use VCF Codec Streaming for data from GenomicsDB", | ||
optional = true) | ||
public boolean useVCFCodec = DEFAULT_USE_VCF_CODEC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally try to have boolean command line arguments default to false, with true being used to opt in to non-default behavior. If we're going to default to VCF, then this arg should probably be "genomicsdb-use-bcf-codec" with default of false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we need the default be BCF as we are not sure about the performance implications of VCF as yet. Changed the DEFAULT_USE_VCF_CODEC
to false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. But it looks like the GenomicsDBImportIntegrationTest
tests are still using the VCF codec, which now isn't the one used by default. If we're going to keep BCF as the default, those tests should probably be reverted back to BCF, and there really should be at least one new test that uses the opt-in VCF codec, preferably with 64 bit values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have changed the GenomicsDB/GnarlyGenotyper tests to run with both codecs. But, I don't have the bandwidth to introduce new 64-bit VCFs to test VCF Codec. We do have tests in GenomicsDB for the 64 bit VCFs - see GenomicsDB PR and vcfs like bigint0.vcf.gz for the GenomicsDB CI tests and they can be ported to work with GATK test references. I can look into this later or perhaps it is easy enough for someone else from the team to do it.
@Advanced | ||
@Argument( | ||
fullName = USE_VCF_CODEC_LONG_NAME, | ||
doc = "Use VCF Codec Streaming for data from GenomicsDB", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should state what the alternative is (that is toggles between vcf/bcf) for the internal interchange format.
/** | ||
* Currently there is no support for 64-bit fields in BCFCodec. Specifying this option will allow | ||
* for 64-bit width positions and INFO fields and for computed annotation sizes to exceed the 32-bit | ||
* integer space while encoding/decoding with GenomicsDB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments below; if we reverse the toggle polarity as I suggest, we should update this to reflect the reason for opting in to BCF (though I've lost track of what that is - is it just that we're just keeping it as a failsafe in case there is a perf issue ?).
|
||
import java.io.Serializable; | ||
|
||
public class GenomicsDBBaseArgumentCollection implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better name we can use to make the logical distinction between the the two arg collections clearer ?
this.callGenotypes = genomicsdbArgs.callGenotypes; | ||
this.maxDiploidAltAllelesThatCanBeGenotyped = genomicsdbArgs.maxDiploidAltAllelesThatCanBeGenotyped; | ||
this.maxGenotypeCount = genotypeCalcArgs.MAX_GENOTYPE_COUNT; | ||
this.useVCFCodec = genomicsdbArgs.useVCFCodec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than having constructor overloads like this, add getters to argument collection classes (its basically the public interface of GenomicsDBOptions
, minus the reference). Then the default argument values could be defined in one place (whichever collection winds up being the "base" collection class), and some of these aliases and constructor overloads could be eliminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalinigans This suggestion can be deferred and made in a subsequent PR if you'd prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's defer this to another PR. Thanks.
* Genotype.getAnyAttribute() returns String or Integer based on the VC being decoded fully, so use this routine instead to | ||
* guarantee that it is an int. | ||
*/ | ||
private static int getAttributeAsInt(Genotype genotype, String key) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use VariantContextGetters. getAttributeAsInt()
instead of adding this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @param v2 List of expected variant contexts | ||
*/ | ||
private static void assertEqualVariants( | ||
final List<VariantContext> v1, final List<VariantContext> v2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pleas rename to actualVariants
and expectedVariants
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, might as well put this right in VariantContextTestUtils
adjacent to the method its calling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Utils.nonNull(v2, "v2"); | ||
Assert.assertEquals(v1.size(), v2.size()); | ||
for (int i = 0; i < v1.size(); i++) { | ||
VariantContextTestUtils.assertVariantContextsAreEqual(v1.get(i), v2.get(i), new ArrayList<String>(), new ArrayList<String>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use the diamond operator for these empty lists, or better yet, just use Collections.emptyList()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed using Collections.emptyList(). Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalinigans A few more comments and questions. I may not be understanding where some of these changes are headed, so please let me know if I'm misunderstanding.
fullName = USE_VCF_CODEC_LONG_NAME, | ||
doc = "Use VCF Codec Streaming for data from GenomicsDB", | ||
optional = true) | ||
public boolean useVCFCodec = DEFAULT_USE_VCF_CODEC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. But it looks like the GenomicsDBImportIntegrationTest
tests are still using the VCF codec, which now isn't the one used by default. If we're going to keep BCF as the default, those tests should probably be reverted back to BCF, and there really should be at least one new test that uses the opt-in VCF codec, preferably with 64 bit values.
* @param g genotype for a sample | ||
* @return int array for the given genotype | ||
*/ | ||
private static int[] getSBFieldAsIntArray(Genotype g) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there is already code that does all this in StrandBiasTest.getStrandCounts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just refactored the existing code with existing patterns of throwing exceptions in GnarlyGenotyper. I will leave to @ldgauthier to clean this up further. Thanks.
@@ -420,6 +423,9 @@ | |||
doc="Suppress reference path in output for test result differencing") | |||
private boolean suppressReferencePath = false; | |||
|
|||
@ArgumentCollection | |||
private GenomicsDBArgumentCollection genomicsdbArgs = new GenomicsDBArgumentCollection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has any thought been given to how this new argument interacts with the existing genotype subsetting arguments SelectVariants already has ? If not, I'm thinking this change will likely cause confusion, and that we should take this out until those are resolved and some tests are added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no existing argument that clashes with the existing subsetting arguments. However, if you would rather take this out, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SV has several arguments that do various kinds of genotype subsetting/downsampling/filtering. The argument is marked as @Advanced
, so maybe its ok. But if this argument does what I think it does, it may not be obvious to a user that this subsetting is done independently before SV ever sees the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more thought. Have removed all support for GenomicsDB args in SelectVariants for now and have filed #6456 to generally get the tools hook up to GenomicsDBExportConfiguration better.
@@ -169,7 +172,9 @@ public boolean requiresReference() { | |||
@Override | |||
protected GenomicsDBOptions getGenomicsDBOptions() { | |||
if (genomicsDBOptions == null) { | |||
genomicsDBOptions = new GenomicsDBOptions(referenceArguments.getReferencePath(), CALL_GENOTYPES, PIPELINE_MAX_ALT_COUNT); | |||
genomicsdbArgs.callGenotypes = CALL_GENOTYPES; | |||
genomicsdbArgs.maxDiploidAltAllelesThatCanBeGenotyped = PIPELINE_MAX_ALT_COUNT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its a bit strange to mutate values in an argument collection like this (I know they aren't annotated as @Argument
, but presumably someday they will be exposed ?). Can you verify that the only net change here in this part of the PR is that we're exposing the vcfCodec switch to GnarlyGenotyper
(?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the net change here allows GnarlyGenotyper to access the vcfCodec arg. But, have also hooked --max-genotype-count
arg from GenotypeGVCFs/GnarlyGenotyper to be passed to GenomicsDBOptions to help with Issue #6275.
I have merged to the latest master, but Travis builds are still failing not related to the changes in this PR. @droazen, @cmnbroad, can someone take a look and let me know what should be done? Thanks. |
@nalinigans thanks for the updates. All builds are failing for everyone at the moment, so its not just this branch. We're working on a fix. I'll take a look at your changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalinigans Thanks for the changes. A couple more minor comments (also, is it possible to add at least one test, maybe to Select Variants, for the --genomicsdb-max-genotype-count
arg, which is exposed in several tools now ?) Also, build is fixed now, so if you rebase on master the tests should pass again.
private static final String MAX_GENOTYPE_COUNT_LONG_NAME = "genomicsdb-max-genotype-count"; | ||
|
||
@Advanced | ||
@Argument(fullName=MAX_GENOTYPE_COUNT_LONG_NAME, doc="Maximum number of genotypes to consider at any site", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is only used outside outside of the genotyping tools, I think it would make sense to remove the word "consider" from the doc string, and also reinforce that this only applies when the underlying store is GenomicsDB. Maybe something like "Maximum number of genotypes to retrieve when the underlying store is GenomicsDB".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument was being used only by SelectVariants. Note that the existing GenomicsDBArgumentCollection
is removed and GenomicsDBBaseArgumentCollection
is renamed to GenomicsDBArgumentCollection
.
@@ -420,6 +423,9 @@ | |||
doc="Suppress reference path in output for test result differencing") | |||
private boolean suppressReferencePath = false; | |||
|
|||
@ArgumentCollection | |||
private GenomicsDBArgumentCollection genomicsdbArgs = new GenomicsDBArgumentCollection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SV has several arguments that do various kinds of genotype subsetting/downsampling/filtering. The argument is marked as @Advanced
, so maybe its ok. But if this argument does what I think it does, it may not be obvious to a user that this subsetting is done independently before SV ever sees the data.
...est/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImportIntegrationTest.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen you're on the reviewer list too, so not sure if you want to review as well before merge, but I'm approving this as is so we can get this in the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Move to GenomicsDB 1.2.0. This version uses a 64-bit htslib to workaround overflow issues when computed annotation sizes exceed the 32-bit integer space as requested by @ldgauthier.
The htslib and GenomicsDB fixes by @kgururaj also enable read/write of VCFs where the values of INFO fields can be 64-bits. But, if the input VCF (produced by HaplotypeCaller) contains INFO fields with values larger than what can fit in 32 bits, the vid must list those fields as type int64 (currently, it will be int by default). However, no changes are needed in the vid if the input values fit within 32 bits but the result of the combine operation [e.g. sum] is larger.
Note that 64-bit integers are currently only supported for INFO fields (no support for FORMAT fields).