-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add outer joins #1109
Add outer joins #1109
Conversation
Test PASSed. |
Ping for review. |
c88e512
to
f8019d6
Compare
@akmorrow13 I made the region joins public again (resolves #1143) in f8019d6. Can you review? |
Test FAILed. Build result: FAILUREGitHub pull request #1109 of commit f8019d6 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1109/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0c0c983 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1109/merge^{commit} # timeout=10Checking out Revision 0c0c983 (origin/pr/1109/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0c0c983ed5f7a7ee7704cd4a00bf473dad398c3cFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please. |
Test PASSed. |
} | ||
} | ||
|
||
private trait VictimlessSortedIntervalPartitionJoin[T, U, RU] extends SortedIntervalPartitionJoin[T, U, T, RU] with Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean all the other joins take victims?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the outer joins make use of a "victim cache" to store elements from one of the two iterators that did not match to an element in the other iterator. The phrase "victim cache" comes from a type of cache that is occasionally used in computer architecture to "save" cache lines that have been evicted. I'll make a pass and add more docs.
I'm not technically proficient enough to review the implementations of these. The code style looks fine. It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :) |
I'll make a pass and write these up. Do you actually have a presentation you need these for tomorrow? If so, give me a ping so I can figure out the best way to get it to you. |
Yeah it will be something short for a local audience, who won't necessarily care about the biology but may be interested in how we extend Spark. |
OK, cool. How about I send you a slide for that? Would you prefer Keynote, Powerpoint, Google Drive, etc...? |
Any format would be fine, thanks! For my own good, I want to go through these and sketch out whiteboard diagrams of what is happening, similar to those I saw in some or another Spark book. |
some aside comments on #1171 |
@transient val sc: SparkContext | ||
|
||
// Create the set of bins across the genome for parallel processing | ||
protected val seqLengths = Map(sd.records.toSeq.map(rec => (rec.name, rec.length)): _*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This throws an error downstream in GenomeBins because it tries to set seqLengths from sd when sd is not yet set (is null). What would be the cleanest way to ensure this doesn't happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed rebased commits with this fixed. Thanks for catching @akmorrow13.
f8019d6
to
a24e852
Compare
Test PASSed. |
Ping for review/merge. |
I'm going to merge later today unless anyone asks for more time. |
genomicRdd.flattenRddByRegions()), | ||
sequences ++ genomicRdd.sequences, | ||
kv => { getReferenceRegions(kv._1) ++ genomicRdd.getReferenceRegions(kv._2) }) | ||
.asInstanceOf[GenomicRDD[(T, X), Z]] | ||
} | ||
|
||
def rightOuterBroadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(Option[T], X), Z]](genomicRdd: GenomicRDD[X, Y])( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these public join methods on GenomicRDD
need code level doc.
Apparently github ate my earlier comment.. My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice. In my case, this would not work because both RDD's were too large to collect and broadcast. Is there any way around this? |
What code uses the BroadcastRegionJoin? This PR largely extends the shuffle region join code (provides 5 new shuffle joins), but does extend the BroadcastRegionJoin in two places. |
@fnothaft maybe it was a temporary moment of insanity but it looks like I was wrong. I believe InnerShuffleRegionJoinAndGroupByLeft was previously calling a collect somewhere but this must have been fixed. I cannot seem to find it. |
@akmorrow13 no sweat! |
@heuermh added docs. Can you make another review pass? |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs read great, thanks! Found a couple minor typos.
* @param genomicRdd The right RDD in the join. | ||
* @return Returns a new genomic RDD containing all pairs of keys that | ||
* overlapped in the genomic coordinate space, grouped together by | ||
* the value they overlapped in the left RDD., and all values from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor typo, ,.
|
||
/** | ||
* Extends the ShuffleRegionJoin trait to implement an inner join followed by | ||
* grouping by the left value.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor typo, ..
Concrete implementations are now Inner<x>RegionJoin.
13840da
to
6bfac01
Compare
Thanks for catching @heuermh! I've fixed the typos, squashed down the documentation commit, and rebased. |
Test PASSed. |
Thank you, @fnothaft! |
Resolves #1098. Still a WIP; needs tests, as well as more documentation.