Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add outer joins #1109

Merged
merged 3 commits into from
Oct 3, 2016
Merged

Conversation

fnothaft
Copy link
Member

Resolves #1098. Still a WIP; needs tests, as well as more documentation.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1377/
Test PASSed.

@fnothaft
Copy link
Member Author

Ping for review.

@fnothaft
Copy link
Member Author

@akmorrow13 I made the region joins public again (resolves #1143) in f8019d6. Can you review?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1443/

Build result: FAILURE

GitHub pull request #1109 of commit f8019d6 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1109/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0c0c983 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1109/merge^{commit} # timeout=10Checking out Revision 0c0c983 (origin/pr/1109/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0c0c983ed5f7a7ee7704cd4a00bf473dad398c3cFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

fnothaft commented Sep 1, 2016

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1450/
Test PASSed.

}
}

private trait VictimlessSortedIntervalPartitionJoin[T, U, RU] extends SortedIntervalPartitionJoin[T, U, T, RU] with Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean all the other joins take victims?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the outer joins make use of a "victim cache" to store elements from one of the two iterators that did not match to an element in the other iterator. The phrase "victim cache" comes from a type of cache that is occasionally used in computer architecture to "save" cache lines that have been evicted. I'll make a pass and add more docs.

@heuermh
Copy link
Member

heuermh commented Sep 6, 2016

I'm not technically proficient enough to review the implementations of these. The code style looks fine.

It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)

@fnothaft
Copy link
Member Author

fnothaft commented Sep 6, 2016

It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)

I'll make a pass and write these up. Do you actually have a presentation you need these for tomorrow? If so, give me a ping so I can figure out the best way to get it to you.

@heuermh
Copy link
Member

heuermh commented Sep 6, 2016

Do you actually have a presentation you need these for tomorrow?

Yeah it will be something short for a local audience, who won't necessarily care about the biology but may be interested in how we extend Spark.

@fnothaft
Copy link
Member Author

fnothaft commented Sep 6, 2016

OK, cool. How about I send you a slide for that? Would you prefer Keynote, Powerpoint, Google Drive, etc...?

@heuermh
Copy link
Member

heuermh commented Sep 6, 2016

Any format would be fine, thanks! For my own good, I want to go through these and sketch out whiteboard diagrams of what is happening, similar to those I saw in some or another Spark book.

@heuermh heuermh mentioned this pull request Sep 7, 2016
61 tasks
@jpdna
Copy link
Member

jpdna commented Sep 14, 2016

some aside comments on #1171

@transient val sc: SparkContext

// Create the set of bins across the genome for parallel processing
protected val seqLengths = Map(sd.records.toSeq.map(rec => (rec.name, rec.length)): _*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This throws an error downstream in GenomeBins because it tries to set seqLengths from sd when sd is not yet set (is null). What would be the cleanest way to ensure this doesn't happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed rebased commits with this fixed. Thanks for catching @akmorrow13.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1503/
Test PASSed.

@fnothaft
Copy link
Member Author

Ping for review/merge.

@jpdna
Copy link
Member

jpdna commented Sep 30, 2016

I'm going to merge later today unless anyone asks for more time.

genomicRdd.flattenRddByRegions()),
sequences ++ genomicRdd.sequences,
kv => { getReferenceRegions(kv._1) ++ genomicRdd.getReferenceRegions(kv._2) })
.asInstanceOf[GenomicRDD[(T, X), Z]]
}

def rightOuterBroadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(Option[T], X), Z]](genomicRdd: GenomicRDD[X, Y])(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these public join methods on GenomicRDD need code level doc.

@akmorrow13
Copy link
Contributor

Apparently github ate my earlier comment.. My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice. In my case, this would not work because both RDD's were too large to collect and broadcast. Is there any way around this?

@fnothaft
Copy link
Member Author

My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice.

What code uses the BroadcastRegionJoin? This PR largely extends the shuffle region join code (provides 5 new shuffle joins), but does extend the BroadcastRegionJoin in two places.

@akmorrow13
Copy link
Contributor

@fnothaft maybe it was a temporary moment of insanity but it looks like I was wrong. I believe InnerShuffleRegionJoinAndGroupByLeft was previously calling a collect somewhere but this must have been fixed. I cannot seem to find it.

@fnothaft
Copy link
Member Author

@akmorrow13 no sweat!

@fnothaft
Copy link
Member Author

fnothaft commented Oct 3, 2016

@heuermh added docs. Can you make another review pass?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1513/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs read great, thanks! Found a couple minor typos.

* @param genomicRdd The right RDD in the join.
* @return Returns a new genomic RDD containing all pairs of keys that
* overlapped in the genomic coordinate space, grouped together by
* the value they overlapped in the left RDD., and all values from the
Copy link
Member

@heuermh heuermh Oct 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typo, ,.


/**
* Extends the ShuffleRegionJoin trait to implement an inner join followed by
* grouping by the left value..
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typo, ..

@fnothaft
Copy link
Member Author

fnothaft commented Oct 3, 2016

Thanks for catching @heuermh! I've fixed the typos, squashed down the documentation commit, and rebased.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1515/
Test PASSed.

@heuermh heuermh merged commit bd3c62a into bigdatagenomics:master Oct 3, 2016
@heuermh
Copy link
Member

heuermh commented Oct 3, 2016

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants