filterByOverlappingRegion Incorrect for Genotypes #1042

erictu · 2016-06-03T22:59:55Z

Currently this:

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getVariant.getContigName == query.referenceName &&
        rec.getVariant.getStart < query.end &&
        rec.getVariant.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

Should be this, where we get start, end, and contigName directly from the genotype. (I will submit a PR):

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getContigName == query.referenceName &&
        rec.getStart < query.end &&
        rec.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

It seems like when we convert VCF to ADAM, the ADAM format won't have the start, end, and contigName fields populated in the variant record, but they will be populated in the genotype record.

The filter condition should get the start, end, and contigName fields from the genotype record, not the variant record. Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

The text was updated successfully, but these errors were encountered:

jpdna · 2016-06-04T02:17:09Z

Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

Agreed there seems to be redundancy here. If the fields inside of variant variant: contigName start stop were populated could we get rid of these fields at the base level of Genotype or is there some performance reason to avoid having these nested?

erictu · 2016-06-04T02:35:30Z

Theoretically the fields inside of variant should work too, but there is a performance benefit to avoid nesting these fields. Specifically, if we're using a projection, we only need to use those three fields, instead of using the entire variant, which will minimize the size of data being transferred.

jpdna · 2016-06-04T06:42:51Z

Then Eric's question still remains as to if the the fields inside variant are left null intentionally or not, can you comment @fnothaft

fnothaft · 2016-06-05T16:21:19Z

The fields were left intentionally null, with the assumption being that if you are loading Genotypes in, you'll use the fields in the Genotype record, not the nested fields in the Variant record. We promoted the fields to the Genotype record because of performance anomalies when using pushed down predicates.

jpdna added the bug label Jun 3, 2016

erictu mentioned this issue Jun 3, 2016

overlap query reflects new formats #1043

Merged

heuermh closed this as completed in #1043 Jun 5, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filterByOverlappingRegion Incorrect for Genotypes #1042

filterByOverlappingRegion Incorrect for Genotypes #1042

erictu commented Jun 3, 2016 •

edited

Loading

jpdna commented Jun 4, 2016

erictu commented Jun 4, 2016

jpdna commented Jun 4, 2016

fnothaft commented Jun 5, 2016

filterByOverlappingRegion Incorrect for Genotypes #1042

filterByOverlappingRegion Incorrect for Genotypes #1042

Comments

erictu commented Jun 3, 2016 • edited Loading

jpdna commented Jun 4, 2016

erictu commented Jun 4, 2016

jpdna commented Jun 4, 2016

fnothaft commented Jun 5, 2016

erictu commented Jun 3, 2016 •

edited

Loading