
Conversation

@Leemoonsoo (Contributor)

#308 implements/fixes runtime dependency library loading, but the feature is unreliable: some libraries load correctly and some do not.

Since a reliable solution for runtime library loading looks hard to find, this PR loads libraries before SparkIMain is created, so they do not need to be loaded dynamically at runtime but are simply included on the classpath.

To do this, this PR adds a new interpreter, "DepInterpreter".
It provides a separate Scala interpreter and an API for loading dependencies. It fetches the necessary libraries from a Maven repository and keeps the list of resolved files. Then, when SparkInterpreter is initializing, that file list is passed to SparkInterpreter, which adds the files to its classpath instead of trying to load them at runtime. A sketch of this handoff follows the list below.

  • DepInterpreter implementation
  • Warning message when DepInterpreter is used after SparkInterpreter has initialized.
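
The handoff, in a minimal sketch. The names below are hypothetical stand-ins, not the PR's actual code, and the Maven resolution step that the real DependencyContext performs is skipped:

import java.io.File
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for com.nflabs.zeppelin.spark.dep.DependencyContext:
// %dep collects resolved dependency files here, before SparkIMain exists.
class DependencyContext {
  private val collectedFiles = ListBuffer[File]()

  // A local jar; Maven coordinates would be resolved to files at this point.
  def load(path: String): DependencyContext = {
    collectedFiles += new File(path)
    this // returned for method chaining, as in the real API
  }

  def getFiles: List[File] = collectedFiles.toList
}

// Inside SparkInterpreter initialization: turn the file list into a
// classpath string instead of loading anything dynamically at runtime.
def classpathFrom(dep: DependencyContext): String =
  dep.getFiles.map(_.getAbsolutePath).mkString(File.pathSeparator)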

Usage: DepInterpreter is used with %dep, which exposes an instance of com.nflabs.zeppelin.spark.dep.DependencyContext as the variable z.

Here's the API:

z.reset() // clean up previously added artifacts and repositories

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository
z.load("groupId:artifactId:version")

// add artifact recursively (with all its dependencies)
z.load("groupId:artifactId:version").recursive()

// add artifact recursively, except the comma-separated groupId:artifactId list
z.load("groupId:artifactId:version").recursive().exclude("groupId:artifactId,groupId:artifactId, ...")

// add artifact recursively and distribute them to spark workers (sc.addJar())
z.load("groupId:artifactId:version").recursive().dist()

Example of use: [screenshot]
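
For a concrete illustration, a %dep paragraph using the API above might look like this (the coordinate is mine and only illustrative, assuming it resolves from Maven Central):

%dep
z.reset()

// fetch the artifact with its transitive dependencies and ship them
// to the spark workers via sc.addJar()
z.load("org.json4s:json4s-native_2.10:3.2.11").recursive().dist()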

Leemoonsoo mentioned this pull request Feb 5, 2015
@Leemoonsoo (Contributor, Author)

It's ready. Could someone please review this PR?

@swkimme (Contributor) commented Feb 5, 2015

It works in a local environment; let me test more in a cluster environment. Great job!!

Comments and questions:

  1. Does the %dep interpreter keep previously added libraries, given that it can only be used before the Spark interpreter has initialized?
  2. How can I reload dependencies? My thought was that it should work after restarting SparkInterpreter, but "Must be used before SparkInterpreter (%spark) initialized" still comes up after I restart SparkInterpreter.

@Leemoonsoo (Contributor, Author)

@swkimme

  1. Unless you a) restart the interpreter or b) call z.reset(), the %dep interpreter keeps previously added libraries.
  2. That's right. Restart SparkInterpreter and run %dep before %spark; that's the way to reload dependencies. It works for me, but let me try again.

@swkimme (Contributor) commented Feb 6, 2015

z.load("org.apache.james:apache-mime4j:0.7.2")
org.sonatype.aether.resolution.DependencyResolutionException: Could not find artifact org.apache.james:apache-mime4j:jar:0.7.2 in central (http://repo1.maven.org/maven2/)

It loads with build.sbt,
"org.apache.james" % "apache-mime4j" % "0.7.2",
but it fails in the %dep interpreter.

@Leemoonsoo (Contributor, Author)

@swkimme

That was because apache-mime4j is a pom-type artifact. I pushed a fix, and it can now be loaded by specifying the extension between the artifactId and version, like:

%dep
z.load("org.apache.james:apache-mime4j:pom:0.7.2")

@swkimme (Contributor) commented Feb 7, 2015

For the restart issue in 2), I found it was related to #309.

@Leemoonsoo (Contributor, Author)

Made some improvements:

  • infer Scala version using '::':
    z.load("eu.unicredit::hbase-rdd:0.4.0-SNAPSHOT") is now equivalent to z.load("eu.unicredit:hbase-rdd_2.10:0.4.0-SNAPSHOT") (a sketch of the inference follows this list)
  • recursive loading is now the default, so recursive() is removed from the API and excludeAll() is added instead
  • exclusion is now possible with a pattern (wildcard '*')
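
Here is a minimal sketch of how the '::' inference could work. It is my illustration, not the PR's actual implementation, and it assumes a fixed Scala binary version of 2.10:

// Rewrite "group::artifact:version" into "group:artifact_2.10:version".
def inferScalaVersion(coordinate: String): String = {
  val scalaBinaryVersion = "2.10" // assumption: the build's Scala binary version
  coordinate.split("::") match {
    case Array(groupId, rest) =>
      val Array(artifactId, version) = rest.split(":", 2)
      s"$groupId:${artifactId}_$scalaBinaryVersion:$version"
    case _ => coordinate // no '::', leave the coordinate as-is
  }
}

inferScalaVersion("eu.unicredit::hbase-rdd:0.4.0-SNAPSHOT")
// => "eu.unicredit:hbase-rdd_2.10:0.4.0-SNAPSHOT"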

Here's the updated API:

z.reset() // clean up previously added artifacts and repositories

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository, without transitive dependencies
z.load("groupId:artifactId:version").excludeAll()

// add artifact recursively
z.load("groupId:artifactId:version")

// add artifact recursively, except the comma-separated groupId:artifactId list
z.load("groupId:artifactId:version").exclude("groupId:artifactId,groupId:artifactId, ...")

// exclude with pattern
z.load("groupId:artifactId:version").exclude("*")
z.load("groupId:artifactId:version").exclude("groupId:artifactId:*")
z.load("groupId:artifactId:version").exclude("groupId:*")

// add artifact recursively and distribute them to spark workers (sc.addJar())
z.load("groupId:artifactId:version").dist()

@swkimme (Contributor) commented Feb 8, 2015

OMG, AWESOME job!

I have one discussion point: shouldn't .dist() be the default?
I would guess that in the usual case the libraries should be available on the cluster.

@Leemoonsoo (Contributor, Author)

@swkimme

Updated to make 'dist' the default. 'dist()' is removed from the API and 'local()' is added, for the case where you do not want to add an artifact to the Spark cluster. Here's the updated API:

z.reset() // clean up previously added artifacts and repositories

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository, without transitive dependencies
z.load("groupId:artifactId:version").excludeAll()

// add artifact recursively
z.load("groupId:artifactId:version")

// add artifact recursively, except the comma-separated groupId:artifactId list
z.load("groupId:artifactId:version").exclude("groupId:artifactId,groupId:artifactId, ...")

// exclude with pattern
z.load("groupId:artifactId:version").exclude("*")
z.load("groupId:artifactId:version").exclude("groupId:artifactId:*")
z.load("groupId:artifactId:version").exclude("groupId:*")

// local() skips adding the artifact to the spark cluster (skips sc.addJar())
z.load("groupId:artifactId:version").local()

@Leemoonsoo (Contributor, Author)

Ready to be merged! #319 -> #308 -> master

@swkimme (Contributor) commented Feb 11, 2015

LGTM!


Leemoonsoo added a commit that referenced this pull request Feb 12, 2015: "Reliable dependency loading mechanism"
Leemoonsoo merged commit ab20344 into improve/libload Feb 12, 2015
Leemoonsoo deleted the new/depinterpreter branch February 12, 2015 03:04
asfgit pushed a commit to apache/zeppelin that referenced this pull request Mar 30, 2015
From ZEPL/zeppelin#388.

Update description of dependency loader to reflect ZEPL/zeppelin#319.
To do this, the document structure is changed.
* docs/zeppelincontext -> removed
* interpreter/spark -> added (includes description about zeppelincontext and dependencyloader)

Ready to merge.

Author: Lee moon soo <leemoonsoo@gmail.com>

Closes #7 from Leemoonsoo/gh-pages_update_changes and squashes the following commits:

a3894cf [Lee moon soo] Add interpreter/spark.md instead of docs/zeppelincontext.md update description about dependency loader
