Add outer joins #1109

fnothaft · 2016-08-10T18:53:16Z

Resolves #1098. Still a WIP; needs tests, as well as more documentation.

AmplabJenkins · 2016-08-10T19:24:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1377/

fnothaft · 2016-08-25T05:26:23Z

Ping for review.

fnothaft · 2016-08-31T23:03:43Z

@akmorrow13 I made the region joins public again (resolves #1143) in f8019d6. Can you review?

AmplabJenkins · 2016-08-31T23:17:13Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1443/

Build result: FAILURE

GitHub pull request #1109 of commit f8019d6 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1109/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 0c0c983 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1109/merge^{commit} # timeout=10
Checking out Revision 0c0c983 (origin/pr/1109/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 0c0c983ed5f7a7ee7704cd4a00bf473dad398c3c
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

fnothaft · 2016-09-01T03:37:45Z

Jenkins, retest this please.

AmplabJenkins · 2016-09-01T04:16:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1450/

heuermh · 2016-09-06T16:01:50Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala

+ }
+}
+
+private trait VictimlessSortedIntervalPartitionJoin[T, U, RU] extends SortedIntervalPartitionJoin[T, U, T, RU] with Serializable {


Does this mean all the other joins take victims?

fnothaft · 2016-09-06

Some of the outer joins use a "victim cache" to store elements from one of the two iterators that did not match any element in the other iterator. The phrase "victim cache" comes from a type of cache that is occasionally used in computer architecture to "save" cache lines that have been evicted. I'll make a pass and add more docs.
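
To make the victim cache idea concrete, here is a minimal sketch of a right outer join over two interval lists sorted by start position. It is an illustration only, not the code in this PR: `Interval`, `rightOuterJoin`, and the single-contig, half-open coordinate convention are all assumptions made for the example. Right-side elements that are evicted from the working cache without ever having matched are parked in `victims` and emitted at the end paired with `None`.

```scala
import scala.collection.mutable.ArrayBuffer

object VictimCacheSketch {
  // Hypothetical half-open interval on a single contig; not an ADAM type.
  case class Interval(start: Long, end: Long) {
    def overlaps(o: Interval): Boolean = start < o.end && o.start < end
  }

  // Right outer join; both inputs are assumed to be sorted by start.
  def rightOuterJoin(
    left: Seq[Interval],
    right: Seq[Interval]): Seq[(Option[Interval], Interval)] = {
    val out = ArrayBuffer.empty[(Option[Interval], Interval)]
    // Right-side elements still in play, each with a "has matched" flag.
    var cache = Vector.empty[(Interval, Boolean)]
    // Right-side elements evicted from the cache without ever matching.
    val victims = ArrayBuffer.empty[Interval]
    val rightIter = right.iterator.buffered

    for (l <- left) {
      // Pull in every right element that starts before this left element ends.
      while (rightIter.hasNext && rightIter.head.start < l.end) {
        cache :+= ((rightIter.next(), false))
      }
      // Anything ending at or before l.start cannot overlap l or any later left.
      val (evicted, kept) = cache.partition { case (r, _) => r.end <= l.start }
      victims ++= evicted.collect { case (r, false) => r }
      // Emit matches and remember which cached elements have matched.
      cache = kept.map { case (r, matched) =>
        if (r.overlaps(l)) { out += ((Some(l), r)); (r, true) } else (r, matched)
      }
    }
    // Flush: unmatched cached elements and unread right elements are victims too.
    victims ++= cache.collect { case (r, false) => r }
    victims ++= rightIter
    out ++= victims.map(r => (Option.empty[Interval], r))
    out.toSeq
  }
}
```

In this sketch the unmatched right-side elements come out after all matched pairs; a real implementation could interleave them to preserve sort order.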

heuermh · 2016-09-06T16:04:58Z

I'm not technically proficient enough to review the implementations of these. The code style looks fine.

It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)

fnothaft · 2016-09-06T16:07:27Z

> It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)

I'll make a pass and write these up. Do you actually have a presentation you need these for tomorrow? If so, give me a ping so I can figure out the best way to get it to you.

heuermh · 2016-09-06T16:12:11Z

> Do you actually have a presentation you need these for tomorrow?

Yeah it will be something short for a local audience, who won't necessarily care about the biology but may be interested in how we extend Spark.

fnothaft · 2016-09-06T16:19:21Z

OK, cool. How about I send you a slide for that? Would you prefer Keynote, Powerpoint, Google Drive, etc...?

heuermh · 2016-09-06T16:22:19Z

Any format would be fine, thanks! For my own good, I want to go through these and sketch out whiteboard diagrams of what is happening, similar to those I saw in one Spark book or another.

jpdna · 2016-09-14T19:21:38Z

some aside comments on #1171

akmorrow13 · 2016-09-14T21:25:52Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala

+ @transient val sc: SparkContext
+
+ // Create the set of bins across the genome for parallel processing
+ protected val seqLengths = Map(sd.records.toSeq.map(rec => (rec.name, rec.length)): _*)


This throws an error downstream in GenomeBins because it tries to set seqLengths from sd when sd is not yet set (is null). What would be the cleanest way to ensure this doesn't happen?

fnothaft · 2016-09-27

Just pushed rebased commits with this fixed. Thanks for catching this, @akmorrow13.
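
The null `sd` described above is a general Scala initialization-order effect: a strict `val` in a trait body is evaluated while the trait is being initialized, which happens before the body of the implementing class assigns its own fields. The sketch below reproduces the effect and shows one common remedy, a `lazy val`. The thread does not say how the fix in this PR was actually made, and `Record`, `Dict`, `EagerBins`, and `LazyBins` are hypothetical stand-ins rather than ADAM classes.

```scala
// Hypothetical stand-ins for a sequence dictionary; not ADAM's real classes.
case class Record(name: String, length: Long)
case class Dict(records: Seq[Record])

trait EagerBins {
  val sd: Dict
  // Strict val: evaluated during trait initialization, before an implementing
  // class body has assigned sd, so sd is still null at this point.
  val seqLengths: Map[String, Long] =
    sd.records.map(rec => (rec.name, rec.length)).toMap
}

trait LazyBins {
  val sd: Dict
  // Lazy val: evaluated on first access, after construction has finished.
  lazy val seqLengths: Map[String, Long] =
    sd.records.map(rec => (rec.name, rec.length)).toMap
}

// sd is assigned in the class body, which runs after the trait initializer.
class EagerJoin(d: Dict) extends EagerBins { val sd: Dict = d }
class LazyJoin(d: Dict) extends LazyBins { val sd: Dict = d }

object InitOrderDemo {
  def main(args: Array[String]): Unit = {
    val dict = Dict(Seq(Record("chr1", 100L)))
    // Construction fails with a NullPointerException in the eager variant.
    println(scala.util.Try(new EagerJoin(dict)).isFailure)
    // The lazy variant resolves sd only when seqLengths is first read.
    println(new LazyJoin(dict).seqLengths)
  }
}
```

Other options with the same effect include turning `seqLengths` into a `def` or passing the dictionary as a constructor parameter of the implementing class.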

AmplabJenkins · 2016-09-27T02:47:00Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1503/

fnothaft · 2016-09-29T21:38:58Z

Ping for review/merge.

jpdna · 2016-09-30T13:10:05Z

I'm going to merge later today unless anyone asks for more time.

heuermh · 2016-09-30T14:41:24Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

 genomicRdd.flattenRddByRegions()),
 sequences ++ genomicRdd.sequences,
 kv => { getReferenceRegions(kv._1) ++ genomicRdd.getReferenceRegions(kv._2) })
 .asInstanceOf[GenomicRDD[(T, X), Z]]
 }

+ def rightOuterBroadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(Option[T], X), Z]](genomicRdd: GenomicRDD[X, Y])(


All these public join methods on GenomicRDD need code level doc.
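
As an aid to reading the `(Option[T], X)` element type in the signature above, here is a small sketch of right outer region join semantics on plain Scala collections. It is not the ADAM implementation and uses no Spark; `Region`, `overlaps`, and `rightOuterRegionJoin` are names assumed for the example, with half-open coordinates. Every right-side value appears at least once in the output, paired with each overlapping left-side value or with `None` when nothing overlaps.

```scala
object RightOuterSketch {
  // Hypothetical half-open genomic region; not ADAM's ReferenceRegion.
  case class Region(contig: String, start: Long, end: Long) {
    def overlaps(o: Region): Boolean =
      contig == o.contig && start < o.end && o.start < end
  }

  // Quadratic reference semantics: keep every right value, attach each
  // overlapping left value, or None if no left value overlaps.
  def rightOuterRegionJoin[T, X](
    left: Seq[(Region, T)],
    right: Seq[(Region, X)]): Seq[(Option[T], X)] = {
    right.flatMap { case (rightRegion, x) =>
      val hits = left.collect { case (leftRegion, t) if leftRegion.overlaps(rightRegion) => t }
      if (hits.isEmpty) Seq((Option.empty[T], x))
      else hits.map(t => (Option(t), x))
    }
  }
}
```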

akmorrow13 · 2016-09-30T15:27:29Z

Apparently GitHub ate my earlier comment. My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice. In my case, this would not work because both RDDs were too large to collect and broadcast. Is there any way around this?

fnothaft · 2016-09-30T20:06:39Z

> My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice.

What code uses the BroadcastRegionJoin? This PR largely extends the shuffle region join code (provides 5 new shuffle joins), but does extend the BroadcastRegionJoin in two places.
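
For readers weighing the two strategies: a broadcast join has to hold one side entirely in memory, whereas a shuffle join partitions both sides by genomic bin and joins within each bin, so neither side needs to be collected. The following is a rough sketch of the binning idea on plain Scala collections, under stated assumptions. It is not the ShuffleRegionJoin code from this PR; `Region`, `bins`, `binnedInnerJoin`, and the fixed `binSize` are hypothetical, and regions are assumed non-empty with non-negative, half-open coordinates.

```scala
object BinnedJoinSketch {
  // Hypothetical half-open genomic region; not ADAM's ReferenceRegion.
  case class Region(contig: String, start: Long, end: Long) {
    def overlaps(o: Region): Boolean =
      contig == o.contig && start < o.end && o.start < end
  }

  // Every fixed-width bin that a region touches, keyed by (contig, bin index).
  def bins(r: Region, binSize: Long): Seq[(String, Long)] =
    (r.start / binSize to (r.end - 1) / binSize).map(b => (r.contig, b))

  // Inner join: replicate each element into its bins, group by bin, and
  // compare only elements that share a bin.
  def binnedInnerJoin[T, X](
    left: Seq[(Region, T)],
    right: Seq[(Region, X)],
    binSize: Long): Seq[(T, X)] = {
    val leftByBin = left
      .flatMap { case (r, t) => bins(r, binSize).map(b => (b, (r, t))) }
      .groupBy(_._1)
    val rightByBin = right
      .flatMap { case (r, x) => bins(r, binSize).map(b => (b, (r, x))) }
      .groupBy(_._1)
    for {
      (bin, ls) <- leftByBin.toSeq
      (_, (lr, t)) <- ls
      (_, (rr, x)) <- rightByBin.getOrElse(bin, Seq.empty)
      if lr.overlaps(rr)
      // A pair can share several bins; emit it only where the overlap begins.
      if math.max(lr.start, rr.start) / binSize == bin._2
    } yield (t, x)
  }
}
```

In a distributed setting the `groupBy` step would correspond to a shuffle keyed by bin, which is what lets both inputs stay partitioned.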

akmorrow13 · 2016-09-30T20:32:47Z

@fnothaft maybe it was a temporary moment of insanity but it looks like I was wrong. I believe InnerShuffleRegionJoinAndGroupByLeft was previously calling a collect somewhere but this must have been fixed. I cannot seem to find it.

fnothaft · 2016-09-30T20:33:26Z

@akmorrow13 no sweat!

fnothaft · 2016-10-03T02:52:29Z

@heuermh added docs. Can you make another review pass?

AmplabJenkins · 2016-10-03T03:11:51Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1513/

heuermh

Docs read great, thanks! Found a couple minor typos.

heuermh · 2016-10-03T12:01:03Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala

+ * @param genomicRdd The right RDD in the join.
+ * @return Returns a new genomic RDD containing all pairs of keys that
+ * overlapped in the genomic coordinate space, grouped together by
+ * the value they overlapped in the left RDD., and all values from the


minor typo, ,.

heuermh · 2016-10-03T12:03:21Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala

+
+/**
+ * Extends the ShuffleRegionJoin trait to implement an inner join followed by
+ * grouping by the left value..


minor typo, ..


fnothaft · 2016-10-03T15:50:16Z

Thanks for catching @heuermh! I've fixed the typos, squashed down the documentation commit, and rebased.

AmplabJenkins · 2016-10-03T16:11:55Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1515/

heuermh · 2016-10-03T22:37:00Z

Thank you, @fnothaft!

fnothaft mentioned this pull request Aug 10, 2016

BroadcastRegionJoin is not a broadcast join #1110

Closed

fnothaft force-pushed the issues/1098-outer-joins branch from c88e512 to f8019d6 Compare August 31, 2016 23:03

fnothaft mentioned this pull request Sep 3, 2016

Updated versions and avro formats to work with most ADAM 0.19.1-SNAPS… fnothaft/fig#5

Merged

heuermh reviewed Sep 6, 2016

heuermh mentioned this pull request Sep 7, 2016

Release ADAM version 0.20.0 #1048

Closed

61 tasks

jpdna mentioned this pull request Sep 14, 2016

Interval tree join in ADAM #1171

Closed

akmorrow13 reviewed Sep 14, 2016


fnothaft force-pushed the issues/1098-outer-joins branch from f8019d6 to a24e852 Compare September 27, 2016 02:26

heuermh requested changes Sep 30, 2016


heuermh requested changes Oct 3, 2016


fnothaft added 3 commits October 3, 2016 08:49

Refactor join object signatures to sealed traits and case classes.

66ae3f3

Concrete implementations are now Inner<x>RegionJoin.

Adding left/right outer shuffle join, right outer broadcast join.

0a7a697

[ADAM-1143] Expose region join case classes as public.

6bfac01

Resolves bigdatagenomics#1143.

fnothaft force-pushed the issues/1098-outer-joins branch from 13840da to 6bfac01 Compare October 3, 2016 15:49

heuermh approved these changes Oct 3, 2016


heuermh merged commit bd3c62a into bigdatagenomics:master Oct 3, 2016

heuermh mentioned this pull request Oct 3, 2016

Add new feature-overlap command to demonstrate new region joins #1194

Closed
