filterByOverlappingRegion Incorrect for Genotypes #1042

Closed
erictu opened this issue Jun 3, 2016 · 4 comments

erictu commented Jun 3, 2016

Currently this:

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getVariant.getContigName == query.referenceName &&
        rec.getVariant.getStart < query.end &&
        rec.getVariant.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

It should be this, where we get start, end, and contigName directly from the genotype (I will submit a PR):

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getContigName == query.referenceName &&
        rec.getStart < query.end &&
        rec.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

It seems that when we convert VCF to ADAM, the start, end, and contigName fields are not populated in the variant record, but they are populated in the genotype record.

The filter condition should get the start, end, and contigName fields from the genotype record, not the variant record. Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?
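
To illustrate the difference, here is a minimal, self-contained sketch in plain Scala collections rather than Spark RDDs. The case classes are hypothetical, simplified stand-ins for the ADAM records (not the real Avro classes), and the unpopulated variant fields are modeled as `None` instead of `null`. Under those assumptions, the predicate that reads the nested variant fields matches nothing, while the one that reads the genotype's own fields finds the overlapping record:

```scala
object OverlapSketch {
  // Hypothetical stand-ins for the ADAM records; the real classes are Avro-generated.
  final case class Variant(contigName: Option[String], start: Option[Long], end: Option[Long])
  final case class Genotype(variant: Variant, contigName: String, start: Long, end: Long)
  final case class ReferenceRegion(referenceName: String, start: Long, end: Long)

  // Models the state described above: nested variant fields are left unpopulated.
  private val emptyVariant = Variant(None, None, None)

  val genotypes: Seq[Genotype] = Seq(
    Genotype(emptyVariant, "chr1", 100L, 200L),
    Genotype(emptyVariant, "chr1", 300L, 400L),
    Genotype(emptyVariant, "chr2", 100L, 200L)
  )

  // Reads the nested variant fields, as the current filter does.
  def overlapsViaVariant(query: ReferenceRegion): Genotype => Boolean = rec =>
    rec.variant.contigName.contains(query.referenceName) &&
      rec.variant.start.exists(_ < query.end) &&
      rec.variant.end.exists(_ > query.start)

  // Reads the fields on the genotype itself, as the proposed filter does.
  def overlapsViaGenotype(query: ReferenceRegion): Genotype => Boolean = rec =>
    rec.contigName == query.referenceName &&
      rec.start < query.end &&
      rec.end > query.start

  def main(args: Array[String]): Unit = {
    val query = ReferenceRegion("chr1", 150L, 250L)
    println(genotypes.count(overlapsViaVariant(query)))
    println(genotypes.count(overlapsViaGenotype(query)))
  }
}
```

Both predicates use the same overlap test (record start before query end, record end after query start); only the source of the coordinates differs.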

jpdna commented Jun 4, 2016

> Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

Agreed, there seems to be redundancy here. If the fields inside the nested variant (contigName, start, end) were populated, could we get rid of these fields at the base level of Genotype, or is there some performance reason to avoid having these nested?

erictu commented Jun 4, 2016

Theoretically the fields inside of variant should work too, but there is a performance benefit to not nesting these fields. Specifically, if we're using a projection, we only need to project those three fields rather than the entire variant, which minimizes the amount of data being transferred.

jpdna commented Jun 4, 2016

Then Eric's question still remains as to whether the fields inside variant are left null intentionally or not. Can you comment, @fnothaft?

fnothaft commented Jun 5, 2016

The fields were intentionally left null, on the assumption that if you are loading Genotypes, you'll use the fields in the Genotype record, not the nested fields in the Variant record. We promoted the fields to the Genotype record because of performance anomalies when using pushed-down predicates.
