HBase genotypes backend - revised #1335
Conversation
Commits in this pull request:
- Added HBaseFunctions
- Added VCF genotype save and load HBase functions
- Added saveHBaseGenotypesSingleSample, packing multiple genotypes at the same position and sample allele into the same HBase value
- Added multi-sample-capable loadHBaseGenotypes()
- Removed commented-out earlier versions of save and load genotypes
- Removed more dead code
- Cleaned up formatting - limit line length
- Added saveHbaseSampleMetadata function
- Added save and load SequenceDictionary functions to HBaseFunctions
- Added createHbaseGenotypeTable
- Adding loadGenotypeRddFromHBase (in progress)
- Updates to multi-sample HBase save
- Multi-sample VCF save and load now working
- Added repartitioning parameter to HBase genotype load
- Added comments identifying the public API vs. helper functions
- COB Aug 25
- Added genomic start and stop parameters to loadGenotypesFromHBaseToGenotypeRDD
- Added boolean saveSequenceDictionary toggle parameter to saveVariantContextRddToHBase
- Fixed start/stop null pointer exception
- First steps in adding hbaseToVariantcontextRDD
- Changed region query to use ADAM ReferenceRegion
- Added custom HBaseEncoder1 save function
- Added custom Encoder1 HBase loader
- Added Encoder1 HBase variant context loader
- Working - before rowkey int
- Changed end in key to be size; added data block encoding in create table
- Added create table splits
- Removed dead code of encoder2
- Added option to repartition vcrdd before saving to HBase
- Added bulk load save to HBase option
- Changed to CDH HBase API dependencies in POM
- Allow sample name file list as input to load functions
- Made sample_ids list parameter in load optional
- Added deleteSamplesFromHBase function
- Fixed bulk delete and made loadVariantContext work even when requested sample IDs are missing
- Removed code from failed version of sample delete function
- Moved delete function up with test of genotype code
- Fixed errors after rebase against master
- Small formatting cleanup
- First pass HBase CLI demo
- Second pass HBase CLI demo
- Removed saveSeqDict check
- Added seq dict ID to CLI import
- Clean up
- Removed unneeded demo and temp.js code
- Ran ./scripts/format-source due to build failure on Jenkins
- Changed HBase dependencies to provided
- Addressed some review issues
- Changed HBase dependencies back to NOT being provided
- Fixed None.get
- First step of key Strategy pattern
- Second step of key Strategy pattern
- More cleanup from review
- Made private functions package-private for hbase
- Factored out HBase connection
- Factored out HBase connection - fix 1
- Factored out HBase connection - cleanup
- Removed the non-bulk-load HBase function
- Step one adding tests
- First step implementing DAO
- Step 2 DAO
- Step 3 DAO
- Added first loadHBase test
- Added some javadoc
- Use DAO everywhere
- Added more javadoc
- Step 1 in adding save HBase test
- Step 2 in adding save HBase test
- Cleaned up HBase doc
- Added more assertions to HBase load test
- Improved POM and imports in HBaseFunctions
- Fixed more PR suggestions
Test FAILed. Build result: FAILURE.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb; cloned https://github.com/bigdatagenomics/adam.git and checked out revision 5903417ce134e3726b6d0df89b1a4eeab3394a3d (origin/pr/1335/merge).
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos and ADAM-prb ? 2.6.0,2.10,1.5.2,centos. Touchstone configurations resulted in FAILURE, so aborting.
Test FAILed.
Looks really good, definitely heading in the right direction!
package org.bdgenomics.adam.cli

import org.apache.spark.SparkContext
Nit: remove whitespace.
@Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
var vcfPath: String = _

@Argument(required = true, metaVar = "-hbase_table", usage = "HBase table name in which to load VCF file")
For arguments, the metaVar should be all caps with no leading dash, i.e., HBASE_TABLE.
@Argument(required = true, metaVar = "-hbase_table", usage = "HBase table name in which to load VCF file")
var hbaseTable: String = null

@Args4jOption(required = true, name = "-seq_dict_id", usage = "User-defined name to apply to the sequence dictionary created from this VCF, or name of an existing sequence dictionary to be used")
Ditto: SEQ_DICT_ID
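To make the naming suggestion concrete, here is a minimal, hypothetical sketch of how the positional argument declarations might look with all-caps, dash-free metaVars. The class and field names come from the diff in this PR, but the base class, index values, and overall shape are assumptions rather than the code that was actually committed:

```scala
import org.bdgenomics.utils.cli.Args4jBase
import org.kohsuke.args4j.Argument

// Hypothetical sketch only: positional arguments with all-caps metaVars and no leading dash.
class Vcf2HBaseArgs extends Args4jBase {
  @Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
  var vcfPath: String = null

  @Argument(required = true, metaVar = "HBASE_TABLE", usage = "HBase table name in which to load the VCF file", index = 1)
  var hbaseTable: String = null
}
```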
@Argument(required = true, metaVar = "-staging_folder", usage = "Location for temporary files during bulk load")
var stagingFolder: String = null
Nit: remove whitespace.
}

class Vcf2HBase(protected val args: Vcf2HBaseArgs) extends BDGSparkCommand[Vcf2HBaseArgs] with Logging {
  val companion = Transform
Wrong companion object.
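For reference, a hedged sketch of what a dedicated companion object could look like, following the BDGCommandCompanion pattern used elsewhere in the ADAM CLI; the command name and description strings here are invented for illustration:

```scala
import org.bdgenomics.utils.cli.{ Args4j, BDGCommandCompanion }

// Hypothetical companion object for the new command, rather than reusing Transform's.
object Vcf2HBase extends BDGCommandCompanion {
  val commandName = "vcf2hbase"
  val commandDescription = "Load genotypes from a VCF file into an HBase table"

  def apply(cmdLine: Array[String]) = {
    new Vcf2HBase(Args4j[Vcf2HBaseArgs](cmdLine))
  }
}
```

The class body would then set companion to this object instead of Transform.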
@@ -0,0 +1,111 @@
package org.bdgenomics.adam.hbase

import org.bdgenomics.adam.util.ADAMFunSuite
Nit: sort imports.
  assert(genoData.getAlleles.get(1) === GenotypeAllele.ALT)
}

sparkAfter("Done with HBase Tests") {
Do we need this?
@@ -1,5 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
Nit: extra whitespace before/after.
<version>${spark.version}</version>
<scope>provided</scope>
<exclusions>
  <exclusion>
We shouldn't need these exclusions, IIRC.
</dependencies>
</dependencyManagement>
Nit: keep whitespace.
Test FAILed. Build result: FAILURE.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb; cloned https://github.com/bigdatagenomics/adam.git and checked out revision b96c96f78dd117968ba25dfe2044613e7dd10ce4 (origin/pr/1335/merge).
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos and ADAM-prb ? 2.6.0,2.10,1.5.2,centos. Touchstone configurations resulted in FAILURE, so aborting.
Test FAILed.
This branch at 0ab45cd builds and passes tests locally for me after cloning from my repo.
@heuermh suggests this could be a JDK version issue, given the byte-level comparisons I am making. I am running Java 1.8.0_111 locally. Do we intend / need to not use Java 8 for Jenkins?
Jenkins must be using JDK 8, since our build requires it. I don't know the exact version though.
What's with the JAVA_HOME setting then?
There's a long, boring story as to why the JAVA_HOME values are configured the way they are, which mostly has to do with Jenkins-specific oddities, as @shaneknapp can attest to. If you grep later in the build logs for …
Test FAILed. Build result: FAILURE. [...truncated 38 lines...]
ADAM-prb matrix results (Hadoop version, Scala version, Spark version, centos):
2.3.0,2.10,2.0.0 - FAILURE; 2.6.0,2.10,2.0.0 - FAILURE; 2.3.0,2.11,2.0.0 - FAILURE; 2.6.0,2.11,2.0.0 - FAILURE.
All other configurations (Spark 1.3.1, 1.4.1, 1.5.2, and 1.6.1 across Hadoop 2.3.0/2.6.0 and Scala 2.10/2.11) completed with result SUCCESS.
Test FAILed.
Argh - now the remaining problem is that the HBase-Spark library version I am using does not seem to support Spark 2.0, and thus we are getting complaints about the missing Spark Logging class that went away in Spark 2.0. I can check whether the newest version of the HBase/Spark connector supports Spark 2.0, as it seems to here: https://github.com/apache/hbase/blob/master/hbase-spark/pom.xml Any ideas about how best to manage these issues of differing library versions and Jenkins?
As you can see here, my HBase PR is failing only the 2.0.0 column of the matrix; I take this to be the Spark 2.0 Logging issue. The first target I see for the HBase work is making it work as we have thus far with CDH, and CDH Spark/HBase support clearly doesn't yet support Spark 2.x, given its continued use of the removed Spark Logging class.
In general, it looks like a Spark 2.0 HBase connector may not exist at all. So it may, for now, be impossible for this PR to pass against Spark 2.0 (without rewriting the connector). Assuming no immediate solution, how should we manage this with respect to Jenkins? Thoughts, @heuermh and @fnothaft?
Ping for suggestions about how I should deal with this Spark 2.0 compatibility issue and Jenkins.
We could move it into a module that is only accessible via a profile and then disable that profile when building for Spark 2. That would be simple.
Yep, that would work. And this would be a good opportunity to bother upstream about it.
Yep, I agree.
Alright, so do you suggest I make an HBase module at the same level as the existing modules? It seems like the approach needed in the POM would then be something like what I see here: …
Can you comment on what would be needed in the main project POM, or elsewhere, to selectively disable the HBase module for the Spark 2 build?
There are a few different ways to activate a profile, see …
Thanks, but I still feel like I'll be futzing around in the dark here, so to pick your brain a bit more: it looks like in all of these I'd use the … Some questions: …
I would just activate it through a profile, as we do with the distribution module.
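A minimal sketch of what that could look like in the parent pom.xml, assuming the new module is called adam-hbase; the profile id and activation below are illustrative guesses, not the final merged configuration:

```xml
<!-- Parent pom.xml: the HBase module is only built when this profile is active. -->
<profiles>
  <profile>
    <id>hbase</id>
    <activation>
      <!-- Built by default; Spark 2 builds can disable it with: mvn -P '!hbase' ... -->
      <activeByDefault>true</activeByDefault>
    </activation>
    <modules>
      <module>adam-hbase</module>
    </modules>
  </profile>
</profiles>
```

The Spark 2 rows of the Jenkins matrix would then deactivate the profile (or an equivalent property-based activation could key off the Spark version), so the hbase-spark dependency is never compiled against Spark 2.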
I could use another set of eyes if someone could look at the below. I'm attempting to move HBase into a separate module in this new branch, dev/b6, at my repo here: …
Trying to build after moving the module (skipping tests for the moment), I get this error:
[ERROR] /home/paschallj/2_4_2017/adam/adam-hbase/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala:52: error: object rdd is not a member of package org.bdgenomics.adam
Be sure to skip tests for the moment if you try to replicate from: …
I've got more POM cleanup to do, but first I need to figure out why the module can't currently see the org.bdgenomics.adam.rdd package.
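That compile error suggests the new adam-hbase module simply doesn't have adam-core on its compile classpath yet, so nothing under org.bdgenomics.adam.rdd is visible to it. A hedged guess at the fix, to be adapted to the actual artifact names used in the repo, would be an explicit dependency in adam-hbase/pom.xml:

```xml
<!-- adam-hbase/pom.xml (sketch): depend on adam-core so org.bdgenomics.adam.rdd resolves. -->
<dependencies>
  <dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <!-- or the matching Scala cross-version suffix for the build (e.g. _2.10) -->
    <artifactId>adam-core_2.11</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>
```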
Continued in #1388. Closing.
Supersedes #1246 and addresses comments from there.
Please make a review pass on this.
I'll add instructions soon on how to run this on a cluster, for reviewers to try it live.