Replaced Contig with ContigName in AlignmentRecord and related changes #988
Test FAILed.
Build result: FAILURE
GitHub pull request #988 of commit b32a60b automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/988/merge^{commit} # timeout=10
 > git branch -a --contains b350f9c # timeout=10
 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10
Checking out Revision b350f9c (origin/pr/988/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f b350f9cb2506348e4fd90f36639fdb06edc0ff8f
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.
@@ -106,6 +106,7 @@
 <dependency>
 <groupId>org.bdgenomics.bdg-formats</groupId>
 <artifactId>bdg-formats</artifactId>
+<version>0.7.2-SNAPSHOT</version>
You don't need the version here; it inherits from the ${ADAM_HOME}/pom.xml.
Aside from a few nits, this LGTM! Thanks for profiling the change, @jpdna!
Made the changes suggested by @fnothaft above and squashed.
Test FAILed.
Build result: FAILURE
GitHub pull request #988 of commit c924952 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/988/merge^{commit} # timeout=10
 > git branch -a --contains 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 # timeout=10
 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10
Checking out Revision 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 (origin/pr/988/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.
 import org.bdgenomics.formats.avro._
 import scala.collection.JavaConversions._

-class FragmentRDDFunctions(rdd: RDD[Fragment]) extends ADAMSequenceDictionaryRDDAggregator[Fragment](rdd) {
+class FragmentRDDFunctions(rdd: RDD[Fragment]) extends Serializable with Logging {
Why is this removed? I'm actually not sure what ADAMSequenceDictionaryRDDAggregator was for in the first place.
ADAMSequenceDictionaryRDDAggregator seemed to be used to extract SeqDicts from existing objects, but it no longer made sense after removing Contig from AlignmentRecord, including within Fragment. Importantly, it was not actually used in any code or tests. Extending ADAMSequenceDictionaryRDDAggregator here in fact blocked factoring out Contig, because it requires data that is no longer in AlignmentRecord. I searched for any usages of the removed functions and there were none, so I removed it as a superclass.
Sounds good to me, I'm always in favor of removing unnecessary code.
I wonder if this changes anyone's opinion on this commit ryan-williams@c5a8f51 that adds extension to some of the Functions classes.
Both this and the commit you pointed at @heuermh LGTM.
Contig factor out project, cleaning up some comments
clean up some unintended whitespace arbitrary diffs
Removed subproject POM changes and other small issues
Updated bdg-formats to new published maven dependency 0.7.1, and fixed a comment typo
The bdg-formats dependency has been updated to 0.7.1 and the change squashed, so I think this is ready to go now.
Well crud, the GitHub webpage for my fork is not updating with my push, though git thinks I pushed. GitHub was down a few minutes ago. I'll ping this thread again when it matches.
Test FAILed.
Jenkins, retest this please.
Test FAILed.
Jenkins, retest this please.
Test PASSed.
LGTM now. Let's discuss on the call tomorrow and make sure everyone is good with merging.
+1000 from me
Thanks @jpdna! Merged.
This PR depends on the related PR #72 in the bigdatagenomics/bdg-formats repo, which supplies version 0.7.2-SNAPSHOT of bdg-formats. With the above available, this PR should compile and pass all tests.
See related discussion in #983
The goal of this PR is to remove from AlignmentRecord, for performance reasons, the Contig and mateContig metadata details, such as MD5 and URL, that are repeated for each contig. This contig metadata is now referred to only in the SequenceDictionary and is joined using the contigName and mateContigName fields, which have replaced contig and mateContig in AlignmentRecord.
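As a rough illustration of the name-based join described above, here is a minimal sketch in plain Scala. The case classes `ContigMetadata` and `Read`, the object `ContigLookupSketch`, and the placeholder values are hypothetical stand-ins invented for this example; they are not the actual ADAM or bdg-formats classes, and the real SequenceDictionary API may look different.

```scala
// Hypothetical stand-ins for illustration only; not the ADAM or bdg-formats types.
final case class ContigMetadata(
  name: String,
  length: Long,
  md5: Option[String],
  url: Option[String])

// A read carries only contig names, not the full contig metadata.
final case class Read(
  readName: String,
  contigName: String,
  mateContigName: Option[String],
  start: Long)

object ContigLookupSketch {
  // One metadata entry per contig, keyed by contig name.
  def buildDictionary(contigs: Seq[ContigMetadata]): Map[String, ContigMetadata] =
    contigs.map(c => c.name -> c).toMap

  // Join a read back to its contig metadata by name; None if the name is unknown.
  def contigFor(read: Read, dict: Map[String, ContigMetadata]): Option[ContigMetadata] =
    dict.get(read.contigName)

  // Same lookup for the mate, which may be absent for unpaired reads.
  def mateContigFor(read: Read, dict: Map[String, ContigMetadata]): Option[ContigMetadata] =
    read.mateContigName.flatMap(dict.get)

  def main(args: Array[String]): Unit = {
    // Placeholder contigs and values, chosen only for the example.
    val dict = buildDictionary(Seq(
      ContigMetadata("ctgA", 1000L, Some("placeholder-md5"), None),
      ContigMetadata("ctgB", 2000L, None, None)))
    val read = Read("read1", "ctgA", Some("ctgB"), 42L)
    println(contigFor(read, dict))
    println(mateContigFor(read, dict))
  }
}
```

The point of the sketch is that MD5, URL and similar details are stored once per contig in the dictionary instead of once per read, and each read keeps only the short name needed to look them up.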
Performance Improvement:
Testing on a 3.8 GB BAM file input on a single machine, I see the following improvements, based on metrics in the Spark web GUI, with the changes in this PR as compared to the current code as of b8e36b2: