
Predicate to filter conversion #234

Merged: 1 commit into bigdatagenomics:master on May 5, 2014

Conversation

@arahuja (Contributor) commented Apr 29, 2014

This PR is for issue #62

ADAMPredicate derives from UnboundRecordFilter and can be used with ParquetInputFormat.setUnboundRecordFilter. It also has an apply method to filter an existing RDD. This lets us use predicates both for predicate pushdown on Parquet files and on an RDD that is already loaded, for example when we load from a BAM/SAM file, or when we reapply the same filters after some processing (e.g. removing duplicates with mark_duplicates before proceeding to the other read-prep stages).
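As a rough sketch of that dual-use design (illustrative names only, not the actual ADAM classes; the real ADAMPredicate also binds the Parquet UnboundRecordFilter so the same test can be pushed down at load time), a predicate that filters a loaded RDD and composes with AND/OR might look like:

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch only -- not the actual ADAM classes.
trait RecordPredicate[T] extends Serializable {

  // Per-record test; the real ADAMPredicate also exposes this logic to
  // Parquet as an UnboundRecordFilter for pushdown at load time.
  def accepts(record: T): Boolean

  // Filter an RDD that has already been loaded (e.g. from a BAM/SAM file).
  def apply(rdd: RDD[T]): RDD[T] = rdd.filter(record => accepts(record))

  // AND/OR combinators for building new predicates from existing ones.
  def and(other: RecordPredicate[T]): RecordPredicate[T] = {
    val self = this
    new RecordPredicate[T] {
      def accepts(record: T): Boolean = self.accepts(record) && other.accepts(record)
    }
  }

  def or(other: RecordPredicate[T]): RecordPredicate[T] = {
    val self = this
    new RecordPredicate[T] {
      def accepts(record: T): Boolean = self.accepts(record) || other.accepts(record)
    }
  }
}
```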

I added a few examples - HighQualityReadPredicate, UniqueMappedRead and GenotypeRecordPASSPredicate.

ADAMRecordConditions and ADAMGenotypeConditions also contain utility predicates that can be composed with AND and OR to create new predicates, and non-equality predicates are easy to specify as well.
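For example, building on the RecordPredicate sketch above (the record type and field names here are placeholders, not the actual ADAMRecordConditions), two conditions could be ANDed into a new predicate and applied to an already-loaded RDD:

```scala
// Placeholder record type; the real conditions operate on ADAMRecord / ADAMGenotype.
case class Read(readMapped: Boolean, mapq: Int)

// Two small conditions, analogous to the utility predicates mentioned above.
val isMapped = new RecordPredicate[Read] {
  def accepts(r: Read): Boolean = r.readMapped
}
val isHighQuality = new RecordPredicate[Read] {
  def accepts(r: Read): Boolean = r.mapq >= 30 // a non-equality condition
}

// Compose with AND, then filter an RDD that was loaded earlier.
val highQualityMapped = isMapped.and(isHighQuality)
// val filtered = highQualityMapped(reads) // reads: RDD[Read]
```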

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/310/

@@ -48,7 +48,7 @@ class PileupAggregator(protected val args: PileupAggregatorArgs)
val companion = PileupAggregator

def run(sc: SparkContext, job: Job) {
val pileups: RDD[ADAMPileup] = sc.adamLoad(args.readInput, predicate = Some(classOf[LocusPredicate]))
@arahuja (Contributor, author): Not sure why this was using LocusPredicate before?

Member: That predicate is necessary. Pileups are created from mapped reads only.

Member: Maybe we should add a comment to keep others from tripping up on this too?

@arahuja (Contributor, author): Sure, but this is after pileup creation, right? Also, the fields that LocusPredicate checks against are not defined for an ADAMPileup. I was going to substitute MappedReadPredicate, but couldn't because of the typing: it is only applicable to ADAMRecord.

Member: You're right, Arun. This is after pileup creation, so the predicate isn't needed.

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/311/

@massie (Member) commented Apr 30, 2014

Please run scalariform, e.g. mvn org.scalariform:scalariform-maven-plugin:format

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/312/

massie added a commit that referenced this pull request on May 5, 2014: Predicate to filter conversion
@massie merged commit 78bc6c1 into bigdatagenomics:master on May 5, 2014
@massie (Member) commented May 5, 2014

Thanks, Arun! I really liked seeing all the tests you put in this pull request. 👍
