Replaced Contig with ContigName in AlignmentRecord and related changes #988

Merged
merged 1 commit into from
Apr 6, 2016

jpdna
Member

@jpdna jpdna commented Apr 2, 2016

This PR depends on the related PR #72 in the bigdatagenomics/bdg-formats repo, which supplies version 0.7.2-SNAPSHOT of bdg-formats.
With that version available, this PR should compile and pass all tests.

See the related discussion in #983.

The goal of this PR is to remove from AlignmentRecord, for performance reasons, the Contig and mateContig metadata details (such as MD5 and URL) that are repeated in every record. This contig metadata now lives only in the SequenceDictionary and is joined by name, using the contigName and mateContigName fields that have replaced contig and mateContig in AlignmentRecord.
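To make the name-based join concrete, here is a minimal sketch under stated assumptions. `ContigInfo`, `SlimRead`, and `Dictionary` are hypothetical stand-ins invented for illustration; they are not the real bdg-formats `AlignmentRecord` or ADAM `SequenceDictionary`, and the lengths, MD5 strings, and URL are placeholders.

```scala
// Illustrative sketch only: these types are hypothetical stand-ins,
// not the real bdg-formats AlignmentRecord or ADAM SequenceDictionary.
object ContigNameJoinSketch {

  // Per-contig metadata, stored once per contig rather than once per read.
  final case class ContigInfo(
    name: String,
    length: Long,
    md5: Option[String],
    url: Option[String])

  // A read that carries only contig names, as in the slimmed-down record.
  final case class SlimRead(
    readName: String,
    contigName: String,
    mateContigName: Option[String])

  // Name-keyed dictionary; assumes contig names are unique.
  final case class Dictionary(contigs: Seq[ContigInfo]) {
    private val byName: Map[String, ContigInfo] =
      contigs.map(c => c.name -> c).toMap

    def lookup(name: String): Option[ContigInfo] = byName.get(name)
  }

  // Recover the full metadata for a read's contig by name.
  def contigOf(read: SlimRead, dict: Dictionary): Option[ContigInfo] =
    dict.lookup(read.contigName)

  // Same lookup for the mate; None if the read has no mate contig name.
  def mateContigOf(read: SlimRead, dict: Dictionary): Option[ContigInfo] =
    read.mateContigName.flatMap(dict.lookup)

  // Toy data: lengths, MD5 strings, and the URL are placeholders.
  val exampleDict: Dictionary = Dictionary(Seq(
    ContigInfo("chr1", 1000L, Some("placeholder-md5-chr1"), Some("file:///placeholder/ref.fa")),
    ContigInfo("chr2", 2000L, Some("placeholder-md5-chr2"), None)))

  val exampleRead: SlimRead = SlimRead("read1", "chr1", Some("chr2"))

  def main(args: Array[String]): Unit = {
    println(contigOf(exampleRead, exampleDict).map(_.length))
    println(mateContigOf(exampleRead, exampleDict).flatMap(_.md5))
  }
}
```

The point of the design is that each read carries only a short name, while the heavier fields are stored once per contig and looked up when needed. In a Spark job the dictionary would presumably be shipped to executors separately from the records, which is consistent with the smaller shuffle size reported in this PR.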

Performance Improvement:
Testing with a 3.8 GB BAM file as input on a single machine, I see the following improvements with the changes in this PR compared to the current code as of b8e36b2, based on metrics from the Spark web UI:

27% speedup in MarkDuplicates (2.3 minutes to 1.7 minutes)
11% reduction in Shuffle Read and Write size in MarkDuplicates (10.1 GB to 9.0 GB)
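
As a quick sanity check on these figures, the sketch below recomputes both percentages from the rounded before/after values quoted above. It assumes each percentage is a relative reduction, (before - after) / before; the thread does not say how the numbers were derived, and the object and method names are made up for illustration.

```scala
// Hypothetical helper, not part of ADAM: recomputes the quoted percentages
// from the rounded values in the PR description.
object ReportedGains {

  // Relative reduction in percent, assuming (before - after) / before.
  def relativeReductionPct(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  // MarkDuplicates runtime in minutes, as quoted.
  val runtimePct: Double = relativeReductionPct(2.3, 1.7)

  // MarkDuplicates shuffle read/write size in GB, as quoted.
  val shufflePct: Double = relativeReductionPct(10.1, 9.0)

  def main(args: Array[String]): Unit = {
    println(f"runtime reduction: $runtimePct%.1f%%")
    println(f"shuffle reduction: $shufflePct%.1f%%")
  }
}
```

With the rounded inputs this gives roughly 26% for runtime and roughly 11% for shuffle size. The small gap from the quoted 27% is plausibly due to the minutes having been rounded before being reported.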

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1128/

Build result: FAILURE

GitHub pull request #988 of commit b32a60b automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/988/merge^{commit} # timeout=10
 > git branch -a --contains b350f9c # timeout=10
 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10
Checking out Revision b350f9c (origin/pr/988/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f b350f9cb2506348e4fd90f36639fdb06edc0ff8f
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@@ -106,6 +106,7 @@
<dependency>
<groupId>org.bdgenomics.bdg-formats</groupId>
<artifactId>bdg-formats</artifactId>
<version>0.7.2-SNAPSHOT</version>
Member

You don't need the version here; it is inherited from ${ADAM_HOME}/pom.xml.

@fnothaft
Member

fnothaft commented Apr 3, 2016

Aside from a few nits, this LGTM! Thanks for profiling the change, @jpdna!

@jpdna
Member Author

jpdna commented Apr 5, 2016

Made the changes suggested by @fnothaft above and squashed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1130/

Build result: FAILURE

GitHub pull request #988 of commit c924952 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/988/merge^{commit} # timeout=10
 > git branch -a --contains 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 # timeout=10
 > git rev-parse remotes/origin/pr/988/merge^{commit} # timeout=10
Checking out Revision 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4 (origin/pr/988/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7a4dfe9d2d7bebb498573cd1c3fcf115ad878cb4
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft fnothaft added this to the 0.20.0 milestone Apr 5, 2016
import org.bdgenomics.formats.avro._
import scala.collection.JavaConversions._

-class FragmentRDDFunctions(rdd: RDD[Fragment]) extends ADAMSequenceDictionaryRDDAggregator[Fragment](rdd) {
+class FragmentRDDFunctions(rdd: RDD[Fragment]) extends Serializable with Logging {
Member

Why is this removed? I'm actually not sure what ADAMSequenceDictionaryRDDAggregator was for in the first place.

Member Author

ADAMSequenceDictionaryRDDAggregator seemed to be used to extract SeqDicts from existing objects, but it no longer makes sense now that Contig has been removed from AlignmentRecord, including within Fragment. Importantly, it was not actually used in any code or tests. Extending ADAMSequenceDictionaryRDDAggregator here in fact blocked factoring out Contig, because it requires data that is no longer in AlignmentRecord. I searched for usages of the removed functions and found none, so I removed it as a superclass.

Member

Sounds good to me, I'm always in favor of removing unnecessary code.

I wonder if this changes anyone's opinion on the commit ryan-williams@c5a8f51, which adds an extension to some of the Functions classes.

Member

Both this and the commit you pointed at @heuermh LGTM.

Contig factor out project, cleaning up some comments

clean up some unintended whitespace arbitrary diffs

Removed subproject POM changes and other small issues

Updated bdg-formats to the newly published Maven dependency 0.7.1, and fixed a comment typo
@jpdna
Member Author

jpdna commented Apr 5, 2016

The bdg-formats dependency has been updated to 0.7.1 and the change squashed, so I think this is ready to go now.

@jpdna
Member Author

jpdna commented Apr 5, 2016

Well crud, the GitHub webpage for my fork is not updating with my push, though git thinks I pushed. GitHub was down a few minutes ago. I'll ping this thread again when it matches.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1131/

@fnothaft
Member

fnothaft commented Apr 5, 2016

Jenkins, retest this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1132/

@fnothaft
Member

fnothaft commented Apr 5, 2016

Jenkins, retest this please.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1133/

@fnothaft
Member

fnothaft commented Apr 5, 2016

LGTM now. Let's discuss on the call tomorrow and make sure everyone is good with merging.

@tdanford
Contributor

tdanford commented Apr 6, 2016

+1000 from me

@fnothaft fnothaft merged commit 7823abd into bigdatagenomics:master Apr 6, 2016
@fnothaft
Member

fnothaft commented Apr 6, 2016

Thanks @jpdna! Merged.

5 participants