Add pipe API in and out formatters for Features #1378

Merged: fnothaft merged 1 commit into bigdatagenomics:master from heuermh:feature-formatters on Mar 14, 2017

heuermh
Member

@heuermh heuermh commented Jan 27, 2017

Work in progress, opened pull request for review.

Fixes #1374

.collect
.toVector)
// create sequence records based on largest end coordinate
val featuresByContigName = rdd.keyBy(_.getContigName)
Member Author

I'm not sure this is a good idea, but it (partially) solves the problem of trying to partition a sequence of length 1L in GenomicRDD.pipe.
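
For illustration, the idea in the diff context above (deriving a sequence length per contig from the largest feature end coordinate) can be sketched with plain Scala collections. This is a minimal sketch, not the code in this pull request: `SimpleFeature` and `inferLengths` are hypothetical names, and the actual change works on a Spark RDD of ADAM features rather than a `Seq`.

```scala
// Hypothetical stand-in for a feature record; only the fields used here.
final case class SimpleFeature(contigName: String, start: Long, end: Long)

object SequenceLengthSketch {
  // For each contig name, use the largest feature end coordinate
  // as a lower bound on the sequence length.
  def inferLengths(features: Seq[SimpleFeature]): Map[String, Long] =
    features
      .groupBy(_.contigName)
      .map { case (name, fs) => name -> fs.map(_.end).max }
}
```

A length inferred this way is only a lower bound on the real contig length, which may be why it only partially addresses the partitioning problem.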

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1763/

Build result: FAILURE

[...truncated 57 lines...]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
    at com.tikal.hudson.plugins.notification.Protocol$3.send(Protocol.java:99)
    at com.tikal.hudson.plugins.notification.Phase.handle(Phase.java:45)
    at com.tikal.hudson.plugins.notification.JobListener.onCompleted(JobListener.java:36)
    at hudson.model.listeners.RunListener.fireCompleted(RunListener.java:201)
    at hudson.model.Run.execute(Run.java:1783)
    at hudson.matrix.MatrixBuild.run(MatrixBuild.java:306)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:410)
Failed to notify endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8' - java.net.SocketTimeoutException: connect timed out
Test FAILed.

@heuermh
Member Author

heuermh commented Jan 30, 2017

I'm not sure why the counts don't add up for GTF and GFF3 when they are fine for BED and narrowPeak

- don't lose any features when piping as GTF format *** FAILED ***
  95 did not equal 114 (FeatureRDDSuite.scala:759)

- don't lose any features when piping as GFF3 format *** FAILED ***
  195 did not equal 199 (FeatureRDDSuite.scala:772)

It appears we are duplicating features when partitioning. Is this reasonable? Should the unit tests do a distinct after piping back in?

scala> val gtf = sc.loadGtf("src/test/resources/Homo_sapiens.GRCh37.75.trun100.gtf")
gtf: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[28] at flatMap at ADAMContext.scala:1186,SequenceDictionary{
1->36081})

scala> gtf.rdd.count
res2: Long = 95

scala> val pipedRdd: FeatureRDD = gtf.pipe("tee /dev/null") 
pipedRdd: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[11] at mapPartitionsWithIndex at GenomicRDD.scala:336,SequenceDictionary{
1->36081})

scala> pipedRdd.rdd.count
res0: Long = 129

scala> pipedRdd.rdd.distinct.count
res3: Long = 95

scala> val gff3 = sc.loadGff3("src/test/resources/dvl1.200.gff3")
gff3: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[21] at flatMap at ADAMContext.scala:1166,SequenceDictionary{
1->1363541})

scala> gff3.rdd.count
res1: Long = 195

scala> gff3.rdd.distinct.count
res3: Long = 181

scala> val pipedRdd: FeatureRDD = gff3.pipe("tee /dev/null") 
pipedRdd: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[16] at mapPartitionsWithIndex at GenomicRDD.scala:336,SequenceDictionary{
1->1363541})

scala> pipedRdd.rdd.count
res0: Long = 199

scala> pipedRdd.rdd.distinct.count
res1: Long = 181
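
One way to read the counts above, assuming (as the comment suggests) that `pipe` copies a feature into every partition whose region it overlaps: the piped count exceeds the input count, while the distinct count is unchanged. The sketch below reproduces that effect with fixed-width bins on plain Scala collections. `Region`, `overlappingBins`, and `replicate` are hypothetical names; this is not how `GenomicRDD.pipe` is implemented.

```scala
// Hypothetical half-open region [start, end) on a contig; assumes end > start.
final case class Region(contigName: String, start: Long, end: Long)

object BinReplicationSketch {
  // Indices of the fixed-width bins that a region overlaps.
  def overlappingBins(r: Region, binSize: Long): Seq[Long] =
    (r.start / binSize) to ((r.end - 1) / binSize)

  // Emit one (bin, region) pair per overlapped bin, so a region that
  // spans a bin boundary appears once in each bin it touches.
  def replicate(regions: Seq[Region], binSize: Long): Seq[(Long, Region)] =
    regions.flatMap(r => overlappingBins(r, binSize).map(bin => (bin, r)))
}
```

Note that the GFF3 input already contains repeated records (195 total, 181 distinct), so a test that applies `distinct` only after piping would compare 181 against 195. Both sides of the comparison would need the same treatment.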

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1765/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 06893683001603d0decd286abb1ceca7fc021d73 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 06893683001603d0decd286abb1ceca7fc021d73 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 06893683001603d0decd286abb1ceca7fc021d73
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1766/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 6b78a58 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 6b78a58 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 6b78a5814a64fc2414df00e87df31003ceca4b8c
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

implicit val tFormatter = BEDInFormatter
implicit val uFormatter = new BEDOutFormatter

val pipedRdd: FeatureRDD = frdd.pipe("tee /dev/null")
Member

Why do we use tee /dev/null in the test?

Member

It was the easiest one-liner I could think of that would pipe standard in to standard out unmodified without creating any artifacts.
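
To make the round trip concrete, here is a minimal sketch of what such a test checks: records are written as text, piped through a command that copies standard input to standard output, and parsed back. It uses `scala.sys.process` rather than the ADAM pipe API, and `BedRecord`, `toBed`, `fromBed`, and `pipe` are hypothetical names introduced only for this example. It assumes a Unix-like system where `tee` and `/dev/null` are available.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.sys.process._

// Hypothetical minimal record; not ADAM's Feature class.
final case class BedRecord(contigName: String, start: Long, end: Long)

object IdentityPipeSketch {
  // Serialize records as three-column, tab-separated BED lines.
  def toBed(records: Seq[BedRecord]): String =
    records.map(r => s"${r.contigName}\t${r.start}\t${r.end}\n").mkString

  // Parse three-column BED lines back into records.
  def fromBed(text: String): Seq[BedRecord] =
    text.split("\n").toSeq.filter(_.nonEmpty).map { line =>
      val Array(contig, start, end) = line.split("\t")
      BedRecord(contig, start.toLong, end.toLong)
    }

  // Feed the text to the command's standard input and capture its standard output.
  def pipe(command: String, input: String): String =
    (command #< new ByteArrayInputStream(input.getBytes(UTF_8))).!!
}
```

With `tee /dev/null` as the command, the copy written to `/dev/null` is discarded and standard output carries the input unchanged, so any difference in record counts comes from the formatting and partitioning steps rather than from the command itself.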

@heuermh
Member Author

heuermh commented Mar 7, 2017

Rebased to pull in #1411; two unit tests still fail, as discussed above.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1843/

Build result: FAILURE

[...truncated 16 lines...]
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 67905d5 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 67905d5 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 67905d5772d8457a83894e099926d7b5b45987b5
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh heuermh modified the milestone: 0.23.0 Mar 8, 2017
@coveralls

coveralls commented Mar 10, 2017

Coverage Status

Coverage increased (+0.2%) to 76.61% when pulling 193bab4 on heuermh:feature-formatters into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1852/
Test PASSed.

Member

@fnothaft fnothaft left a comment

LGTM!

@fnothaft
Member

@heuermh I just realized that I approved this but forgot to merge it. Is this good to go from your side? If yes, what I propose is:

How's that sound on your end?

@heuermh
Member Author

heuermh commented Mar 14, 2017

Sounds good. I'll push a doc commit and squash after #1422 is merged.

@coveralls

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+0.3%) to 76.659% when pulling 707567c on heuermh:feature-formatters into 1cae769 on bigdatagenomics:master.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1863/
Test PASSed.

@fnothaft fnothaft merged commit b8477dc into bigdatagenomics:master Mar 14, 2017
@fnothaft
Member

Merged! Thanks @heuermh!

@heuermh heuermh deleted the feature-formatters branch March 14, 2017 16:12
@heuermh heuermh modified the milestones: 0.22.0, 0.23.0 Mar 14, 2017
5 participants