
HBase genotypes backend - revised #1335

Closed
wants to merge 5 commits into from

Conversation


@jpdna jpdna commented Jan 3, 2017

supersedes #1246
and addresses comments from there.

Please make a review pass on this.

I'll add instructions soon as to how to run on cluster for reviewers to try live.

Added HBaseFunctions

Added VCF genotype save and load Hbase functions

Added saveHBaseGenotypesSingleSample, packing multiple genotypes at the same position and sample allele into the same HBase value

Added multisample capable loadHBaseGenotypes()

Removed commented-out earlier versions of save and load genotypes

removed more dead code

Clean up formatting - limit line length

Added saveHbaseSampleMetadata function

Added save and load SequenceDictionary functions to HBaseFunctions

Added createHbaseGenotypeTable

Adding loadGenotypeRddFromHBase

in progress updates to multi-sample hbase save

multi sample VCF save and load now working

Added repartitioning parameter to HBase genotype load

Added comments identifying the public api vs helper functions

COB Aug 25

Added genomic start and stop parameters to loadGenotypesFromHBaseToGenotypeRDD

Added boolean saveSequenceDictionary toggle parameter to saveVariantContextRddToHBase

fixed start, stop null ptr exception

first steps in adding hbaseToVariantcontextRDD

Changed region query to use ADAM ReferenceRegion

Added custom HBaseEncoder1 save function

Added custom Encoder1 Hbase loader

Added Encoder1 hbase variant context loader

Working - before rowkey int

Changed end in key to be size, added data block encoding in create table

Added create table splits

Removed dead code of encoder2

Added option to repartition vcrdd before saving to HBase

Added bulk load save to Hbase option

changed to cdh hbase api dependencies in POM

allow sample name file list as input to load functions

made sample_ids list parameter in load optional

Added deleteSamplesFromHBase function

Fixed bulk delete and made loadVariantContext work even when requested sample ids are missing

Removed code from failed version of sample delete function

Moved delete function up with test of genotype code

Fixed errors after rebase against master

small formatting cleanup

first pass hbase cli demo

second pass hbase cli demo

remove saveSeqDict check

add seq dict id to cli

import clean up

removed unneeded demo and temp.js code

Ran ./scripts/format-source due to build failure on Jenkins

changed Hbase dependencies to provided

addressed some review issues

Changed HBase dependencies back to NOT being provided

fixed None.get

first step key Strategy pattern

second step key Strategy pattern

more cleanup from review

made private functions package private for hbase

Factored out hbase connection

Factored out hbase connection - fix1

Factored out hbase connection - cleanup

Removed the non-bulk load hbase function

step one adding tests

first step implement DAO

step2 DAO

step3 DAO

Added first loadHBase Test

Added some javadoc

use dao everywhere

added more javadoc

step 1 in adding save hbase test

step 2 in adding save hbase test

clean up hbase doc

add more assertions to hbase load test

improved POM and imports in HBaseFunctions

fixed more PR suggestions
@jpdna jpdna mentioned this pull request Jan 3, 2017
3 tasks
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1714/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1335/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 5903417 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1335/merge^{commit} # timeout=10Checking out Revision 5903417 (origin/pr/1335/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 5903417ce134e3726b6d0df89b1a4eeab3394a3dFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft fnothaft left a comment


Looks really good, definitely heading in the right direction!

package org.bdgenomics.adam.cli

import org.apache.spark.SparkContext


Nit: remove whitespace.

@Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
var vcfPath: String = _

@Argument(required = true, metaVar = "-hbase_table", usage = "HBase table name in which to load VCF file")

For arguments, should be allcaps, no -, i.e., HBASE_TABLE

@Argument(required = true, metaVar = "-hbase_table", usage = "HBase table name in which to load VCF file")
var hbaseTable: String = null
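
For illustration, a hypothetical corrected declaration following the ALL_CAPS, no-dash metaVar convention suggested above (a sketch, not the actual committed code):

```scala
// metaVar names the positional argument in the generated usage text;
// the convention requested here is ALL_CAPS with no leading dash.
@Argument(required = true, metaVar = "HBASE_TABLE", usage = "HBase table name in which to load VCF file")
var hbaseTable: String = null
```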

@Args4jOption(required = true, name = "-seq_dict_id", usage = "User defined name to apply to the sequence dictionary create from this VCF, or name of existing sequence dictionary to be used")

Ditto: SEQ_DICT_ID


@Argument(required = true, metaVar = "-staging_folder", usage = "Location for temporary files during bulk load")
var stagingFolder: String = null


Nit: remove whitespace.

}

class Vcf2HBase(protected val args: Vcf2HBaseArgs) extends BDGSparkCommand[Vcf2HBaseArgs] with Logging {
val companion = Transform

wrong companion object

@@ -0,0 +1,111 @@
package org.bdgenomics.adam.hbase

import org.bdgenomics.adam.util.ADAMFunSuite

Nit: sort imports.

assert(genoData.getAlleles.get(1) === GenotypeAllele.ALT)
}

sparkAfter("Done with HBase Tests") {

Do we need this?

@@ -1,5 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">



Nit: extra whitespace before/after.

<version>${spark.version}</version>
<scope>provided</scope>
<exclusions>
<exclusion>

We shouldn't need these exclusions, IIRC.

</dependencies>
</dependencyManagement>


Nit: keep whitespace.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1734/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1335/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b96c96f # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1335/merge^{commit} # timeout=10Checking out Revision b96c96f (origin/pr/1335/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b96c96f78dd117968ba25dfe2044613e7dd10ce4First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


jpdna commented Jan 12, 2017

This branch at 0ab45cd builds and passes tests locally for me after cloning from my repo.
Any ideas on why it fails when running on Jenkins? I did run the format script.


jpdna commented Jan 12, 2017

@heuermh suggests this could be a JDK version issue, given the byte-level comparisons I am making.

I am running Java 1.8.0_111 locally,
and Jenkins is running JDK_7u60.

Do we intend (or need) to not use Java 8 for Jenkins?


heuermh commented Jan 12, 2017

Jenkins must be using JDK8, since our build requires it. I don't know the exact version though.


jpdna commented Jan 12, 2017

What, then, is the
JAVA_HOME /home/jenkins/tools/hudson.model.JDK/JDK_7u60
that I see here:
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/1734/injectedEnvVars/


fnothaft commented Jan 12, 2017

There's a long boring story to why the JAVA_HOME values are configured the way they are, which mostly has to do with Jenkins specific oddities as @shaneknapp can attest to. If you grep later in the build logs for export JAVA_HOME, you'll see that we do indeed set the Java version to jdk1.8.0_60. This happens later in the build than the environment variable injection stage.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1735/

Build result: FAILURE

[...truncated 38 lines...]Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


jpdna commented Jan 12, 2017

Argh - now the remaining problem is that the HBase-Spark library version I am using seems not to support Spark 2.0, so we are getting complaints about the missing Spark Logging class that went away in Spark 2.0. I can check whether the newest version of the HBase/Spark connector supports Spark 2.0, as it seems to here: https://github.com/apache/hbase/blob/master/hbase-spark/pom.xml
However, I am not sure whether that will be compatible with the CDH version we'd want for the cluster.

Any ideas about how best to manage these issues of differing library versions and Jenkins?
Maven profiles?

@jpdna
Copy link
Member Author

jpdna commented Jan 12, 2017

As you can see here:
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1735/

My HBase PR is failing only the 2.0.0 column of the matrix
due to:
"[ERROR] error: bad symbolic reference. A signature in HBaseContext.class refers to type Logging"

I take this to be the Spark 2.0 Logging issue.

The first target I see for the HBase work is making it work as we have thus far with CDH, and CDH Spark/HBase support clearly does not yet support Spark 2.0, given its continued use of
org.apache.spark.Logging

https://github.com/cloudera/hbase/blob/cdh5-1.2.0_5.9.0/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseContext.scala#L41

In general, it looks like the Spark 2.0 HBase connector may not exist at all:
http://stackoverflow.com/questions/40908891/which-hbase-connector-for-spark-2-0-should-i-use

So - it may for now be impossible for this PR to pass against Spark 2.0 (without rewriting the connector).
(The needed changes might not be so major, but I'm not sure that would be a good use of my time. Also, we might want to contact the person at Cloudera who would work on this - does anyone know https://github.com/tedyu ?
What would be the right forum to inquire with Cloudera about running Spark 2.0 with HBase on CDH?)

Assuming no immediate solution, how should we manage this wrt Jenkins? Thoughts @heuermh and @fnothaft ?


jpdna commented Jan 16, 2017

Ping for suggestions about how I should deal with this Spark 2.0 compatibility and Jenkins.

@fnothaft

We could move it into a module that is only accessible via a profile and then disable that profile when building for Spark 2. That would be simple.


heuermh commented Jan 16, 2017

We could move it into a module that is only accessible via a profile and then disable that profile when building for Spark 2.

Yep, that would work. And this would be a good opportunity to bother upstream about it.


shaneknapp commented Jan 16, 2017 via email


jpdna commented Jan 17, 2017

Alright, so do you suggest I make an HBase module at the same level as the adam-core and adam-cli modules?

It seems like the approach needed in the POM would then be something like what I see here:
http://maven.40175.n5.nabble.com/Excluding-certain-modules-in-a-profile-td81566.html#a81577

<project>
  ... 
  <modules>
    <module>module3</module>
    <module>module4</module>
  </modules>

  <profiles>
    <profile>
      <activation>
        <property>
          <name>!exclude-cpp-qa</name>
        </property>
      </activation>
      <modules>
        <module>module1</module>
        <module>module2</module>
      </modules>
    </profile>
  </profiles>
  ... 
</project>

Can you comment on what would be needed in the main project POM or elsewhere to selectively disable HBase module for the Spark 2 build?


heuermh commented Jan 17, 2017

There are a few different ways to activate a profile, see
http://books.sonatype.com/mvnref-book/reference/profiles-sect-activation.html


jpdna commented Jan 17, 2017

http://books.sonatype.com/mvnref-book/reference/profiles-sect-activation.html

Thanks, but I still feel like I'll be futzing around in the dark here - so to pick your brain more:

It looks like in all of these I'd use the <activation> section to activate, or with ! deactivate, the module mentioned in the profile.

Some questions:

  1. Assuming we want this to work with Jenkins based on which Spark version is active (which I assume the Jenkins configuration determines, not by specifying the name of a profile or setting a special env variable directly), what parameter/property will I be able to test inside <activation> to determine which version of Spark is active? I'm not clear whether this property is itself allowed to be a POM-defined property.

  2. Example 5.3.1 in the doc you pointed to seems to use an AND combination of values, whereas we'd possibly need an OR, like "activate for Spark 1.6.2, 1.6.3, 1.6.4". I'm not sure how I'd include or exclude these multiple possible versions. If Jenkins could set an OS env variable like Spark2=True for any Spark version greater than 1.x, then <activation> could test just for that.
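
For what it's worth, a minimal sketch of property-based activation (the `spark2` property name and `adam-hbase` module name are assumptions for illustration, not the project's actual configuration): the profile is active by default and is disabled when `-Dspark2` is passed on the command line, which would let the Spark 2 builds skip the HBase module.

```xml
<!-- In the parent POM: adam-hbase builds only when -Dspark2 is NOT set. -->
<profiles>
  <profile>
    <id>hbase</id>
    <activation>
      <property>
        <!-- '!' activates the profile when the named property is absent. -->
        <name>!spark2</name>
      </property>
    </activation>
    <modules>
      <module>adam-hbase</module>
    </modules>
  </profile>
</profiles>
```

A Spark 2 build would then be invoked as `mvn package -Dspark2`, while the default (Spark 1.x) build needs no extra flags.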

@fnothaft

I would just activate it through a profile as we do with the distribution module.


jpdna commented Feb 6, 2017

I could use another set of eyes if someone could look at below:

I'm attempting to move HBase into a separate module in this new branch dev/b6 in my repo here:
https://github.com/jpdna/adam/tree/dev/b6

Trying to build after moving the module (skipping tests for the moment), I get this error:

[ERROR] /home/paschallj/2_4_2017/adam/adam-hbase/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala:52: error: object rdd is not a member of package org.bdgenomics.adam
[ERROR] import org.bdgenomics.adam.rdd.variant.VariantContextRDD
[ERROR] ^
[ERROR] /home/paschallj/2_4_2017/adam/adam-hbase/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala:53: error: object rich is not a member of package org.bdgenomics.adam
[ERROR] import org.bdgenomics.adam.rich.RichVariant
[ERROR] ^
[ERROR] /home/paschallj/2_4_2017/adam/adam-hbase/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala:55: error: object models is not a member of package org.bdgenomics.adam
[ERROR] import org.bdgenomics.adam.models.{ ReferencePosition, ReferenceRegion, SequenceDictionary, VariantContext }

coming from:
https://github.com/jpdna/adam/blob/dev/b6/adam-hbase/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala#L52

Be sure to skip tests for the moment if you try to replicate from:
https://github.com/jpdna/adam/tree/dev/b6

I've got more POM cleanup to do, but first I need to figure out why it currently can't see:
org.bdgenomics.adam.rdd.variant.VariantContextRDD
from the above. Thanks!
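
For what it's worth, errors of the form "object rdd is not a member of package org.bdgenomics.adam" when compiling a newly split-out module usually mean the module's POM lacks a dependency on the module that provides those packages (adam-core here). A hedged sketch of what the adam-hbase POM might need (the artifactId suffix and property names are guesses based on common Maven/Scala conventions, not verified against the ADAM build):

```xml
<!-- adam-hbase/pom.xml: depend on adam-core so org.bdgenomics.adam.rdd.* resolves. -->
<dependency>
  <groupId>org.bdgenomics.adam</groupId>
  <!-- Scala projects typically suffix the artifactId with the Scala binary version. -->
  <artifactId>adam-core_${scala.version.prefix}</artifactId>
  <version>${project.version}</version>
</dependency>
```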


fnothaft commented Mar 3, 2017

Continued in #1388. Closing.

@fnothaft fnothaft closed this Mar 3, 2017