Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge VariantAnnotation and DatabaseVariantAnnotation records #1250

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented Nov 8, 2016

Supercedes #1144

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
* @tparam T The type of records to return
* @return An RDD with records of the specified type
*/
private[rdd] def loadParquet[T](
def loadParquet[T](
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make this public again for unit tests in package o.b.a.projections. It also allows for loading user-defined schema (e.g. extensions to bdg-formats) from Avro-in-Parquet files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps let's make it private[adam]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this would be useful outside of ADAM. I haven't fully thought through the use case though: someone wants to add a new schema record Foo, they extend ADAMKyroRegistrator to register, then extend ADAMContext to add their new loadFoo method, which presumably would call loadParquet for foo.adam paths. If Foo has a sequence dictionary or samples, those avro metadata methods would also be useful, and extending from GenomicRDD and friends needs to be possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree, but I'd rather keep these private until someone knocks on our door asking to make them public. My philosophy here is simply that it is easier to make private interfaces public than it is to make public interfaces private. That being said, this is a weak preference: if you feel strongly about it, I'm OK with making it public, esp. since loadParquet has been public previously.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1564/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. I've dropped a variety of suggestions and nits inline. Do we have a VCF with proper ANN fields that we could pull in and load in org.bdgenomics.adam.rdd.ADAMContextSuite and then save back out? I think that's a good round trip test that we should add.

Also, I think we can punt the next thing to a later PR, but I think we could probably autogen the test data (and more tests) for the *FieldSuites. Adding them is a massive step forward though. Thanks for pushing those in as well!

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
// note: annotationImpact is not mapped
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this comment; can you flesh it out more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotationImpact field (and variable above) is output by SnpEff version 4.2 but is not part of the VCF ANN specification, so I did not include it in our TranscriptEffect schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, can you add that inline?

@@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

val te = TranscriptEffect.newBuilder()
setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
// note: annotationImpact is not mapped
if (!effects.isEmpty) te.setEffects(effects.asJava)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

effects.nonEmpty

@@ -132,26 +170,98 @@ object VariantAnnotations extends Serializable with Logging {
Seq(te.build())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, as this line is unchanged, but whenever possible, I prefer Iterable to Seq unless you need random lookup by index.

variant: Variant,
vc: VariantContext,
stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning Option[List[TranscriptEffect]] I would just return List[TranscriptEffect]. If you would return a None, I would just return a List.empty instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make my brain hurt less. The thought is elsewhere it matters whether this field has been set, so checking Option seemed more correct than checking for an empty list.

if (attr == VCFConstants.MISSING_VALUE_v4) {
None
} else {
val filtered = parseAnn(attr, stringency).filter(_.getAlternateAllele == variant.getAlternateAllele)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make the above change, then the if-else clause here just becomes:

if (attr == VCFConstants.MISSING_VALUE_v4) {
  List.empty
} else {
  parseAnn(attr, stringency)
    .filter(_.getAlternateAllele == variant.getAlternateAllele)
}

Also, I would break at the .filter, because that line is a bit long.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be adding try catch with validation stringency here shortly...

val numOpt = Option(numerator)
val denomOpt = Option(denominator)

val sb = StringBuilder.newBuilder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code would be a bit cleaner with a match:

(numOpt, denomOpt) match {
  case (Some(n), Some(d)) => {
    "%d/%d".format(n, d)
  }
  case (None, None) => {
    ""
  }
  case _ => {
     // validate/throw?
     if (validationStringency == ValidationStringency.STRICT) {
       throw new IllegalArgumentException("Incorrect fractional value in %s.".format(te))
     } else if (validationStringency == ValidationStringency.LENIENT) {
       log.warn("Incorrect fractional value in %s.".format(te))
     }
     ""
  }
}

Also, I would either make this package private/private, or move it inside of toAnn, which I think is the only place it is used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was already private since it is nested in convertToVcfInfoAnnValue? Still have some to learn about visibility in Scala. The tuple of options is cleaner. (I can't believe I just said that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you are right RE: protection; I had missed the nesting.

@@ -17,18 +17,12 @@
*/
package org.bdgenomics.adam.projections

import org.bdgenomics.formats.avro.DatabaseVariantAnnotation
import org.bdgenomics.formats.avro.Contig
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC, why does this show up as a file move? Any thoughts? May be just github being funky.

@@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
* @tparam T The type of records to return
* @return An RDD with records of the specified type
*/
private[rdd] def loadParquet[T](
def loadParquet[T](
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps let's make it private[adam]?

@@ -60,20 +60,19 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
* @param ann Annotation RDD to join against.
* @return Returns a VariantContextRDD where annotations have been filled in.
*/
def joinDatabaseVariantAnnotation(ann: DatabaseVariantAnnotationRDD): VariantContextRDD = {
def joinVariantAnnotations(ann: VariantAnnotationRDD): VariantContextRDD = {
replaceRdd(rdd.keyBy(_.variant)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to open a ticket for this, but after #1216 this should probably be implemented using a region join instead of a Spark core leftOuterJoin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created new issue #1259

import com.google.common.collect.ImmutableList
import htsjdk.samtools.ValidationStringency
import htsjdk.variant.vcf.VCFConstants
import htsjdk.variant.variantcontext.VariantContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: htsjdk.variant.vcf after htsjdk.variant.variantcontext

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1565/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1566/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1587/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1250/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 7eff061 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1250/merge^{commit} # timeout=10Checking out Revision 7eff061 (origin/pr/1250/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 7eff06161dcee656f3c48996818a95cb92e96267First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

@heuermh - is it reasonable / useful for me to try to build the code in this PR locally to test it out at this point? I tried to compile it, but can't seem to find a version bdg-formats that works with it.

I tried both
https://github.com/heuermh/bdg-formats/tree/master as it was a 0.9.1-SNAPSHOT version and then tried changing to current bdg-formats 0.10.1-SNAPSHOT, but bdg-formats compile problems with both.

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

@jpdna As is this branch will not compile due to the filter-related changes in bdg-formats. I've made the code changes locally but they need more unit tests. I'll push these in a commit tomorrow morning.

@@ -143,6 +143,5 @@ class VariantContext(
val position: ReferencePosition,
val variant: RichVariant,
val genotypes: Iterable[Genotype],
val databases: Option[DatabaseVariantAnnotation] = None) {
val databases: Option[VariantAnnotation] = None) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"databases" seems kind of a strange name for this field now to me, I might prefer "annotations".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, databases was always a kinda strange name, but it's definitely weird now!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch! fixed

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Pushed new commits that fixes the separate variant and genotype filters issue and updates bdg-formats to the release version 0.10.0. I implemented the filter stuff to the best that htsjdk makes available to us; I could either continue to hack on it so that Genotype.filtersApplied is set correctly or punt until a later issue.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1591/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Fixes #194

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 small nit on the filters, otherwise LGTM

val copy = VariantCallingAnnotations.newBuilder(annotations)
// htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
// we might be able to calculate it by querying the FT FORMAT field value directly
copy.setFiltersApplied(true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would work:

g.getAnyAttribute("FT") != null

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create an issue to track the upstream htsjdk issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created new issue #1269

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

I tried to load this VCF:
https://drive.google.com/open?id=0B6jh69UgixwpTDlSemtreklDNUU
which is the ClinVar VCF with ANN column added by SNPEff

It seems to fail, but without an error message

val x = sc.loadVariantAnnotations("/home/paschallj/ADAM/nov11_annot/run1/test1.200.vcf")

scala> x.rdd.count
res13: Long = 0

This VCF does seem to load into a VariantRDD with loadVCF fine.

I suspect there is something unexpected about the format of my VCF file ANN field, but if this is current snpeff output then it could be problem for some users.

Can you point me to a test VCF with a ANN field that is working properly that I can compare to?

@fnothaft
Copy link
Member

@jpdna do you get any error/warning messages in the logs? If you have ValidationStringency.LENIENT set, I'd expect to see error messages there.

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

Where do I set `ValidationStringency.LENIENT" ?
currently I don't see any error messages in adam.log when I run this in adam-shell

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

@jpdna It might be hard to follow, since things are spread over several issues, but this pull request does not yet support populating VariantAnnotation.transcriptEffects from VCF INFO reserved key ANN values. See #1044 (comment)

@fnothaft
Copy link
Member

Sure, but even then @jpdna should be getting one VariantAnnotation record per Variant, no?

@heuermh
Copy link
Member Author

heuermh commented Nov 11, 2016

Maybe, I don't know how well that part of the code works. Based on this and recent conversations on gitter (same issue apparently), not too well?

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

this pull request does not yet support populating VariantAnnotation.transcriptEffects

ah, thanks for clarifying @heuermh - I'll plan to watch this PR then for the further commits and try my test again when you ping that reading ANN field into transcriptEffects is ready. Perhaps some rows of the VCF I linked to above can be a useful in the test suite - both a VEP and SNPeff derived example annotated VCF would be good.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two small changes: parseAndFilter should be private and there's still an Option related NPE issue in convertToVcfInfoAnnValue. Can you clean these two up and I will merge this PR manually?

*/
def convertToVcfInfoAnnValue(effects: Seq[TranscriptEffect]): String = {
def toFraction(numerator: java.lang.Integer, denominator: java.lang.Integer): String = {
val numOpt = Option(numerator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This NPE with Option types still needs to be fixed.

stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: illegal start of statement (no modifiers allowed here) [ERROR] private def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I misread this and didn't notice that it is nested inside another function.

val copy = VariantCallingAnnotations.newBuilder(annotations)
// htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
// we might be able to calculate it by querying the FT FORMAT field value directly
copy.setFiltersApplied(true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create an issue to track the upstream htsjdk issue?

loadVcf(filePath).toDatabaseVariantAnnotationRDD
def loadVcfAnnotations(
filePath: String): VariantAnnotationRDD = {
loadVcf(filePath).toVariantAnnotationRDD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for tracking RE @jpdna's comment about not getting any annotations from a VCF, this line is the culprit. Specifically, loadVcf right now just parses the Genotypes. We should make the VariantContextConverter parse out the annotations by default in the follow on PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes necessary in this PR, just an FYI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, @jpdna it'd be great to add some unit tests that use that file and try to load a few ANN fields. That should be an acceptance test for the release. Would you be able to do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running SnpEff on the VCF files we are already using for unit tests ends up being not too interesting, with all intragenic variants. It might take a little thinking to generate a more useful VCF file, say with variants right at intron/exon boundaries of a gene with a lot of splice variants, for example.

@heuermh
Copy link
Member Author

heuermh commented Nov 15, 2016

Pushed commit with some additional unit tests. Let me know if I've addressed all the review comments, and thank you for volunteering to merge this manually.

@fnothaft
Copy link
Member

LGTM now! I will merge this manually shortly.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1601/
Test PASSed.

@fnothaft fnothaft force-pushed the upgrade-to-bdg-formats-0.10.0 branch from 12f245c to c06143b Compare November 15, 2016 19:44
@fnothaft
Copy link
Member

Merged into upgrade-to-0.10.0 as c750830 and c06143b. Thanks @heuermh! It is fantastic to get this refactor in. I will merge in the upgrade-to-0.10.0 branch on tests rerunning and passing.

@fnothaft fnothaft closed this Nov 15, 2016
@heuermh heuermh deleted the merge-variant-annotation2 branch November 15, 2016 19:47
@heuermh
Copy link
Member Author

heuermh commented Nov 15, 2016

Woot! Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants