Merge VariantAnnotation and DatabaseVariantAnnotation records #1250

heuermh · 2016-11-08T12:52:14Z

Supersedes #1144

heuermh · 2016-11-08T12:58:39Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
 * @tparam T The type of records to return
 * @return An RDD with records of the specified type
 */
- private[rdd] def loadParquet[T](
+ def loadParquet[T](


I had to make this public again for unit tests in package o.b.a.projections. It also allows for loading user-defined schema (e.g. extensions to bdg-formats) from Avro-in-Parquet files.

Perhaps let's make it private[adam]?

It seems like this would be useful outside of ADAM, although I haven't fully thought through the use case. Someone who wants to add a new schema record Foo would extend ADAMKryoRegistrator to register it, then extend ADAMContext to add a new loadFoo method, which presumably would call loadParquet for foo.adam paths. If Foo has a sequence dictionary or samples, the Avro metadata methods would also be useful, and extending GenomicRDD and friends would need to be possible.

I don't disagree, but I'd rather keep these private until someone knocks on our door asking to make them public. My philosophy here is simply that it is easier to make private interfaces public than it is to make public interfaces private. That being said, this is a weak preference: if you feel strongly about it, I'm OK with making it public, esp. since loadParquet has been public previously.
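For readers less familiar with Scala's qualified access modifiers, the three options discussed here (private[rdd], private[adam], and public) can be sketched as follows. This is only an illustration: nested objects stand in for the real packages so that it fits in one file, and every name in it (Context, loadNarrow, loadShared, loadPublic) is hypothetical rather than part of ADAM.

```scala
// Hypothetical names throughout; nested objects stand in for the
// org.bdgenomics.adam packages so the sketch fits in one file.
object adam {
  object rdd {
    class Context {
      // Visible only inside `rdd` (the modifier removed in the diff).
      private[rdd] def loadNarrow(path: String): String = s"narrow:$path"

      // Visible anywhere inside `adam`, including sibling scopes.
      private[adam] def loadShared(path: String): String = s"shared:$path"

      // Visible to every caller, including code outside `adam`.
      def loadPublic(path: String): String = s"public:$path"
    }

    object SameScope {
      def use(c: Context): String = c.loadNarrow("foo.adam")
    }
  }

  object projections {
    // c.loadNarrow(...) would not compile here; c.loadShared(...) does.
    def use(c: rdd.Context): String = c.loadShared("foo.adam")
  }
}

object External {
  // Outside `adam`, only the unqualified (public) method is reachable.
  def use(c: adam.rdd.Context): String = c.loadPublic("foo.adam")
}
```

Under this reading, private[adam] would cover the unit tests in the projections package, while user-defined loaders outside ADAM would still need the public form.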

AmplabJenkins · 2016-11-08T13:18:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1564/
Test PASSed.

fnothaft

This looks awesome. I've dropped a variety of suggestions and nits inline. Do we have a VCF with proper ANN fields that we could pull in and load in org.bdgenomics.adam.rdd.ADAMContextSuite and then save back out? I think that's a good round trip test that we should add.

Also, I think we can punt the next thing to a later PR, but I think we could probably autogen the test data (and more tests) for the *FieldSuites. Adding them is a massive step forward though. Thanks for pushing those in as well!

fnothaft · 2016-11-08T16:27:09Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

 val te = TranscriptEffect.newBuilder()
 setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
+ // note: annotationImpact is not mapped


I don't get this comment; can you flesh it out more?

The annotationImpact field (and variable above) is output by SnpEff version 4.2 but is not part of the VCF ANN specification, so I did not include it in our TranscriptEffect schema.

That makes sense, can you add that inline?

fnothaft · 2016-11-08T16:27:39Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

 val te = TranscriptEffect.newBuilder()
 setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
+ // note: annotationImpact is not mapped
 if (!effects.isEmpty) te.setEffects(effects.asJava)


effects.nonEmpty

fnothaft · 2016-11-08T16:29:10Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

@@ -132,26 +170,98 @@ object VariantAnnotations extends Serializable with Logging {
 Seq(te.build())


Unrelated to this PR, as this line is unchanged, but whenever possible, I prefer Iterable to Seq unless you need random lookup by index.

fnothaft · 2016-11-08T16:30:20Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

 variant: Variant,
 vc: VariantContext,
- stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
+ stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {


Instead of returning Option[List[TranscriptEffect]] I would just return List[TranscriptEffect]. If you would return a None, I would just return a List.empty instead.

That would make my brain hurt less. The thinking was that elsewhere it matters whether this field has been set, so checking an Option seemed more correct than checking for an empty list.
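The trade-off between the two return types can be made concrete with a small sketch. It assumes a toy record (Effect) and a toy comma-and-pipe parser in place of TranscriptEffect and the real ANN grammar; all names are hypothetical.

```scala
// Hypothetical stand-in for the generated TranscriptEffect record.
final case class Effect(alternateAllele: String, effect: String)

object AnnReturnTypes {
  // Hypothetical marker for an absent attribute value.
  val Missing = "."

  // Option-returning variant: None means "attribute not set",
  // Some(Nil) means "set, but nothing matched this allele".
  def effectsOpt(attr: String, alt: String): Option[List[Effect]] =
    if (attr == Missing) None
    else Some(parse(attr).filter(_.alternateAllele == alt))

  // List-returning variant: both cases collapse to an empty list.
  def effectsList(attr: String, alt: String): List[Effect] =
    effectsOpt(attr, alt).getOrElse(List.empty)

  // Toy parser: "A|missense,T|synonymous" yields two effects.
  // This is not the ANN grammar, only enough structure for the example.
  private def parse(attr: String): List[Effect] =
    attr.split(",").toList.map { entry =>
      val parts = entry.split("\\|", 2)
      Effect(parts(0), parts(1))
    }
}
```

The List form is simpler for callers, but it can no longer distinguish a missing attribute from one that was present and filtered down to nothing, which is the distinction raised in the reply above.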

fnothaft · 2016-11-08T16:32:21Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

+ if (attr == VCFConstants.MISSING_VALUE_v4) {
+ None
+ } else {
+ val filtered = parseAnn(attr, stringency).filter(_.getAlternateAllele == variant.getAlternateAllele)


If you make the above change, then the if-else clause here just becomes:

    if (attr == VCFConstants.MISSING_VALUE_v4) {
      List.empty
    } else {
      parseAnn(attr, stringency)
        .filter(_.getAlternateAllele == variant.getAlternateAllele)
    }

Also, I would break at the .filter, because that line is a bit long.

Will be adding try catch with validation stringency here shortly...
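One way the "try catch with validation stringency" idea could look is sketched here. The thread does not show the eventual implementation, so this is an assumption: the Stringency type is a local stand-in for htsjdk's ValidationStringency, and parseEntry is a toy parser, not parseAnn.

```scala
import scala.util.control.NonFatal

object StringencyDemo {
  // Hypothetical stand-in for htsjdk.samtools.ValidationStringency.
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  // Toy parser that rejects entries without a '|' separator.
  private def parseEntry(entry: String): (String, String) = {
    val parts = entry.split("\\|", 2)
    if (parts.length != 2) {
      throw new IllegalArgumentException(s"Bad ANN entry: $entry")
    }
    (parts(0), parts(1))
  }

  // Parse every entry, keep those for the given alternate allele, and let
  // the stringency decide what happens when parsing fails.
  def parseAndFilter(
    attr: String,
    alt: String,
    stringency: Stringency): List[(String, String)] =
    try {
      attr.split(",").toList
        .map(parseEntry)
        .filter(_._1 == alt)
    } catch {
      case NonFatal(e) =>
        stringency match {
          case Strict => throw e // propagate the failure
          case Lenient =>
            // report the problem, then drop the annotations
            Console.err.println(s"Could not parse '$attr': ${e.getMessage}")
            List.empty
          case Silent => List.empty // drop without reporting
        }
    }
}
```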

fnothaft · 2016-11-08T16:39:35Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

+ val numOpt = Option(numerator)
+ val denomOpt = Option(denominator)
+
+ val sb = StringBuilder.newBuilder


I think this code would be a bit cleaner with a match:

    (numOpt, denomOpt) match {
      case (Some(n), Some(d)) => {
        "%d/%d".format(n, d)
      }
      case (None, None) => {
        ""
      }
      case _ => {
        // validate/throw?
        if (validationStringency == ValidationStringency.STRICT) {
          throw new IllegalArgumentException("Incorrect fractional value in %s.".format(te))
        } else if (validationStringency == ValidationStringency.LENIENT) {
          log.warn("Incorrect fractional value in %s.".format(te))
        }
        ""
      }
    }

Also, I would either make this package private/private, or move it inside of toAnn, which I think is the only place it is used.

I thought it was already private since it is nested in convertToVcfInfoAnnValue? Still have some to learn about visibility in Scala. The tuple of options is cleaner. (I can't believe I just said that)

Ah yes, you are right RE: protection; I had missed the nesting.
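A self-contained version of the tuple-of-options match might look like the following. It is a sketch under assumptions: Stringency is a local stand-in for htsjdk's ValidationStringency, the warning goes to stderr instead of a logger, and the error message names the two arguments because no te value is in scope here.

```scala
object FractionDemo {
  // Hypothetical stand-in for htsjdk.samtools.ValidationStringency.
  sealed trait Stringency
  case object Strict extends Stringency
  case object Lenient extends Stringency
  case object Silent extends Stringency

  def toFraction(
    numerator: java.lang.Integer,
    denominator: java.lang.Integer,
    stringency: Stringency = Strict): String = {
    // Option(x) maps a null boxed Integer to None.
    (Option(numerator), Option(denominator)) match {
      case (Some(n), Some(d)) => s"$n/$d"
      case (None, None)       => ""
      case _ =>
        // Exactly one side is set, which is not a valid fraction.
        val msg = s"Incorrect fractional value: $numerator/$denominator."
        stringency match {
          case Strict  => throw new IllegalArgumentException(msg)
          case Lenient => Console.err.println(msg); ""
          case Silent  => ""
        }
    }
  }
}
```

Matching on the pair keeps the three cases (both set, neither set, exactly one set) visible in one place, which is the readability gain the review comment is after.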

fnothaft · 2016-11-08T16:41:41Z

adam-core/src/main/scala/org/bdgenomics/adam/projections/ContigField.scala

@@ -17,18 +17,12 @@
 */
 package org.bdgenomics.adam.projections

-import org.bdgenomics.formats.avro.DatabaseVariantAnnotation
+import org.bdgenomics.formats.avro.Contig


Out of curiosity, why does this show up as a file move? Any thoughts? It may just be GitHub being funky.

fnothaft · 2016-11-08T16:42:16Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
 * @tparam T The type of records to return
 * @return An RDD with records of the specified type
 */
- private[rdd] def loadParquet[T](
+ def loadParquet[T](


Perhaps let's make it private[adam]?

fnothaft · 2016-11-08T16:44:18Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/VariantContextRDD.scala

@@ -60,20 +60,19 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
 * @param ann Annotation RDD to join against.
 * @return Returns a VariantContextRDD where annotations have been filled in.
 */
- def joinDatabaseVariantAnnotation(ann: DatabaseVariantAnnotationRDD): VariantContextRDD = {
+ def joinVariantAnnotations(ann: VariantAnnotationRDD): VariantContextRDD = {
 replaceRdd(rdd.keyBy(_.variant)


We might want to open a ticket for this, but after #1216 this should probably be implemented using a region join instead of a Spark core leftOuterJoin.

Created new issue #1259
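To illustrate the difference between the two join styles mentioned here, the sketch below contrasts an exact-key left outer join with a naive overlap-based one on plain Scala collections. It is not Spark code and not ADAM's region join; Region, keyJoin, and regionJoin are hypothetical names, and half-open coordinates are an assumption.

```scala
// Hypothetical, simplified region; not a bdg-formats or ADAM type.
final case class Region(contig: String, start: Long, end: Long) {
  // Half-open intervals [start, end) overlap when each starts
  // before the other ends.
  def overlaps(that: Region): Boolean =
    contig == that.contig && start < that.end && that.start < end
}

object JoinDemo {
  // Exact-key left outer join: a right value attaches only on key equality.
  def keyJoin[K, A, B](
    left: Seq[(K, A)],
    right: Seq[(K, B)]): Seq[(A, Option[B])] = {
    val byKey = right.groupBy(_._1)
    left.flatMap {
      case (k, a) =>
        byKey.get(k) match {
          case Some(matches) => matches.map { case (_, b) => (a, Option(b)) }
          case None          => Seq((a, Option.empty[B]))
        }
    }
  }

  // Naive O(n*m) left outer region join: attach on interval overlap.
  def regionJoin[A, B](
    left: Seq[(Region, A)],
    right: Seq[(Region, B)]): Seq[(A, Option[B])] =
    left.flatMap {
      case (r, a) =>
        val hits = right.collect {
          case (s, b) if r.overlaps(s) => (a, Option(b))
        }
        if (hits.isEmpty) Seq((a, Option.empty[B])) else hits
    }
}
```

With a variant at 1:100-101 and an annotation spanning 1:90-110, keyJoin leaves the variant unannotated because the keys differ, while regionJoin attaches the annotation.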

fnothaft · 2016-11-08T16:44:54Z

adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala

+import com.google.common.collect.ImmutableList
+import htsjdk.samtools.ValidationStringency
+import htsjdk.variant.vcf.VCFConstants
+import htsjdk.variant.variantcontext.VariantContext


Nit: htsjdk.variant.vcf after htsjdk.variant.variantcontext

AmplabJenkins · 2016-11-08T17:46:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1565/
Test PASSed.

AmplabJenkins · 2016-11-08T18:18:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1566/
Test PASSed.

AmplabJenkins · 2016-11-10T15:59:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1587/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1250/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 7eff061 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1250/merge^{commit} # timeout=10
Checking out Revision 7eff061 (origin/pr/1250/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 7eff06161dcee656f3c48996818a95cb92e96267
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

jpdna · 2016-11-11T02:47:00Z

@heuermh - is it reasonable / useful for me to try to build the code in this PR locally to test it out at this point? I tried to compile it, but can't seem to find a version of bdg-formats that works with it.

I tried both https://github.com/heuermh/bdg-formats/tree/master, which was a 0.9.1-SNAPSHOT version, and the current bdg-formats 0.10.1-SNAPSHOT, but I hit compile problems with both.

heuermh · 2016-11-11T02:55:21Z

@jpdna As is, this branch will not compile due to the filter-related changes in bdg-formats. I've made the code changes locally, but they need more unit tests. I'll push these in a commit tomorrow morning.

jpdna · 2016-11-11T04:14:36Z

adam-core/src/main/scala/org/bdgenomics/adam/models/VariantContext.scala

@@ -143,6 +143,5 @@ class VariantContext(
 val position: ReferencePosition,
 val variant: RichVariant,
 val genotypes: Iterable[Genotype],
- val databases: Option[DatabaseVariantAnnotation] = None) {
+ val databases: Option[VariantAnnotation] = None) {


"databases" seems like a strange name for this field now; I might prefer "annotations".

+1, databases was always a kinda strange name, but it's definitely weird now!

nice catch! fixed

heuermh · 2016-11-11T17:23:58Z

Pushed new commits that fix the separate variant and genotype filters issue and update bdg-formats to the release version 0.10.0. I implemented the filter handling as far as htsjdk allows; I could either continue to hack on it so that Genotype.filtersApplied is set correctly or punt until a later issue.

AmplabJenkins · 2016-11-11T17:42:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1591/
Test PASSed.

heuermh · 2016-11-11T17:48:44Z

Fixes #194

fnothaft

1 small nit on the filters, otherwise LGTM

fnothaft · 2016-11-11T17:51:15Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/VariantContextConverter.scala

+ val copy = VariantCallingAnnotations.newBuilder(annotations)
+ // htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
+ // we might be able to calculate it by querying the FT FORMAT field value directly
+ copy.setFiltersApplied(true)


I think this would work:

g.getAnyAttribute("FT") != null

Unfortunately it does not
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/variantcontext/Genotype.java#L560

And careful, FT is considered a forbidden key :)
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/variantcontext/Genotype.java#L660

Can we create an issue to track the upstream htsjdk issue?

Created new issue #1269

jpdna · 2016-11-11T18:19:22Z

I tried to load this VCF:
https://drive.google.com/open?id=0B6jh69UgixwpTDlSemtreklDNUU
which is the ClinVar VCF with an ANN column added by SnpEff.

It seems to fail, but without an error message:

val x = sc.loadVariantAnnotations("/home/paschallj/ADAM/nov11_annot/run1/test1.200.vcf")

scala> x.rdd.count
res13: Long = 0

This VCF does seem to load fine into a VariantRDD with loadVCF.

I suspect there is something unexpected about the format of the ANN field in my VCF file, but if this is current SnpEff output then it could be a problem for some users.

Can you point me to a test VCF with an ANN field that is working properly that I can compare to?

fnothaft · 2016-11-11T18:20:56Z

@jpdna do you get any error/warning messages in the logs? If you have ValidationStringency.LENIENT set, I'd expect to see error messages there.

jpdna · 2016-11-11T18:30:38Z

Where do I set `ValidationStringency.LENIENT`?
Currently I don't see any error messages in adam.log when I run this in adam-shell.

heuermh · 2016-11-11T18:44:38Z

@jpdna It might be hard to follow, since things are spread over several issues, but this pull request does not yet support populating VariantAnnotation.transcriptEffects from VCF INFO reserved key ANN values. See #1044 (comment)

fnothaft · 2016-11-11T18:49:05Z

Sure, but even then @jpdna should be getting one VariantAnnotation record per Variant, no?

heuermh · 2016-11-11T19:22:01Z

Maybe, I don't know how well that part of the code works. Based on this and recent conversations on gitter (same issue apparently), not too well?

jpdna · 2016-11-11T20:42:02Z

> this pull request does not yet support populating VariantAnnotation.transcriptEffects

Ah, thanks for clarifying, @heuermh. I'll plan to watch this PR for further commits and try my test again when you ping that reading the ANN field into transcriptEffects is ready. Perhaps some rows of the VCF I linked to above could be useful in the test suite; both a VEP-derived and a SnpEff-derived annotated VCF would be good examples.

fnothaft

Just two small changes: parseAndFilter should be private, and there's still an Option-related NPE issue in convertToVcfInfoAnnValue. Can you clean these two up? I will then merge this PR manually.

fnothaft · 2016-11-15T16:48:41Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

+ */
+ def convertToVcfInfoAnnValue(effects: Seq[TranscriptEffect]): String = {
+ def toFraction(numerator: java.lang.Integer, denominator: java.lang.Integer): String = {
+ val numOpt = Option(numerator)


This NPE with Option types still needs to be fixed.
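The thread does not show where the NPE arises, so the following is only a general illustration of how nullable boxed Integers (as produced by Avro-generated getters) interact with Option in Scala. BoxedNullDemo and its methods are hypothetical.

```scala
object BoxedNullDemo {
  // Option.apply turns a null reference into None, so the map is skipped.
  def safe(n: java.lang.Integer): Option[Int] =
    Option(n).map(_.intValue)

  // Some(null) keeps the null, and unboxing it throws a
  // NullPointerException.
  def unsafe(n: java.lang.Integer): Option[Int] =
    Some(n).map(_.intValue)
}
```

Wrapping with Option(...) rather than Some(...), and never unboxing before the wrap, is the usual way to keep nulls from leaking through.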

fnothaft · 2016-11-15T16:48:55Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala

- stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
+ stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {
+
+ def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {


This method should be private.

error: illegal start of statement (no modifiers allowed here)
[ERROR] private def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

Ah, sorry, I misread this and didn't notice that it is nested inside another function.
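The compiler error above reflects a general Scala rule: a method defined inside another method is local to it, so it cannot take an access modifier and is already invisible to outside callers. A minimal sketch with hypothetical names:

```scala
object LocalDefDemo {
  def convert(values: Seq[String]): String = {
    // A local method: no access modifier is allowed (or needed) here,
    // because the name is only in scope inside `convert`.
    def clean(s: String): String = s.trim.toLowerCase

    values.map(clean).mkString(",")
  }

  // LocalDefDemo.clean("x") would not compile: `clean` is not a member.

  // A member helper, by contrast, can carry `private`.
  private def memberHelper(s: String): String = s.reverse

  def reversed(s: String): String = memberHelper(s)
}
```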

fnothaft · 2016-11-15T16:49:56Z

adam-core/src/main/scala/org/bdgenomics/adam/converters/VariantContextConverter.scala

+ val copy = VariantCallingAnnotations.newBuilder(annotations)
+ // htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
+ // we might be able to calculate it by querying the FT FORMAT field value directly
+ copy.setFiltersApplied(true)


Can we create an issue to track the upstream htsjdk issue?

fnothaft · 2016-11-15T16:52:54Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

- loadVcf(filePath).toDatabaseVariantAnnotationRDD
+ def loadVcfAnnotations(
+ filePath: String): VariantAnnotationRDD = {
+ loadVcf(filePath).toVariantAnnotationRDD


Just for tracking, regarding @jpdna's comment about not getting any annotations from a VCF: this line is the culprit. Specifically, loadVcf right now just parses the Genotypes. We should make the VariantContextConverter parse out the annotations by default in the follow-on PR.

No changes necessary in this PR, just an FYI.

Also, @jpdna it'd be great to add some unit tests that use that file and try to load a few ANN fields. That should be an acceptance test for the release. Would you be able to do that?

Running SnpEff on the VCF files we are already using for unit tests ends up being not too interesting, with all intragenic variants. It might take a little thinking to generate a more useful VCF file, say with variants right at intron/exon boundaries of a gene with a lot of splice variants, for example.

heuermh · 2016-11-15T18:16:58Z

Pushed commit with some additional unit tests. Let me know if I've addressed all the review comments, and thank you for volunteering to merge this manually.

fnothaft · 2016-11-15T18:18:24Z

LGTM now! I will merge this manually shortly.

AmplabJenkins · 2016-11-15T18:44:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1601/
Test PASSed.

fnothaft · 2016-11-15T19:45:49Z

Merged into upgrade-to-0.10.0 as c750830 and c06143b. Thanks @heuermh! It is fantastic to get this refactor in. I will merge in the upgrade-to-0.10.0 branch on tests rerunning and passing.

heuermh · 2016-11-15T19:47:47Z

Woot! Thank you, @fnothaft!

fnothaft and others added 4 commits November 3, 2016 13:49

Bump to bdg-formats 0.10.0-SNAPSHOT.

80f9557

Update for bdg-formats code style changes (bigdatagenomics#1126)

5c30085

Remove StructuralVariant and StructuralVariantType, add names field to Variant

12f245c

Merge VariantAnnotation and DatabaseVariantAnnotation records

8c18150

heuermh commented Nov 8, 2016

heuermh mentioned this pull request Nov 8, 2016

Merge VariantAnnotation and DatabaseVariantAnnotation records #1144

Closed

4 tasks

Add variant filter fields for projection

028b245

fnothaft requested changes Nov 8, 2016

heuermh added 2 commits November 8, 2016 10:52

Register read, sequence, and slice schema records

b3e3438

Support validation stringency when parsing VCF ANN attribute values

b43a252

Address some review comments

9b33c81

fnothaft added this to the 0.21.0 milestone Nov 8, 2016

fix typo

bcb92e7

jpdna reviewed Nov 11, 2016

heuermh added 2 commits November 11, 2016 11:16

Implement separate variant and genotype filters; rename VariantContext.databases field

c799092

Update bdg-formats dependency version to 0.10.0

bd15743

fnothaft requested changes Nov 11, 2016

fnothaft requested changes Nov 15, 2016

Add additional unit tests around conversion to VCF ANN string value

95c4107

fnothaft approved these changes Nov 15, 2016

fnothaft force-pushed the upgrade-to-bdg-formats-0.10.0 branch from 12f245c to c06143b Compare November 15, 2016 19:44

fnothaft closed this Nov 15, 2016

heuermh deleted the merge-variant-annotation2 branch November 15, 2016 19:47

This was referenced Nov 15, 2016

Upgrade to bdg-formats-0.10.0 #1135

Merged

Variant filter representation #194

Closed

loadVariantAnnotations returns empty RDD #1271

Closed
