@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR takes over #15666

Adds a flag to sc.addJar to add the jar to the current classloader

How was this patch tested?

Manually tested.

Unit tests, manual tests

This is a continuation of the pull request in #9313 and is mostly a rebase of that onto master, with SparkR additions.

Closes #15666

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 2, 2017

Most of the code is by @mariusvniekerk. I did some cleanup and addressed the review comments that had not yet been addressed.

Member Author

It looks like we have a problem here with handling URIs and Windows paths, although most other cases should be fine:

> normalizePath("file:/C:/a/b/c")
[1] "C:\\Users\\IEUser\\workspace\\spark\\file:\\C:\\a\\b\\c"
Warning message:
In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="file:/C:/a/b/c": The filename, directory name, or volume label syntax
 is incorrect

This ends up as a weird path like "C:\\Users\\IEUser\\workspace\\spark\\file:\\C:\\a\\b\\c".

I am not sure how we should handle this, as the pattern normalizedPath <- suppressWarnings(normalizePath(path)) looks quite common.

If that is fine, I would like to address this issue separately, together with other APIs such as spark.addFile right above.

Member Author

@HyukjinKwon Nov 3, 2017

BTW, I avoided passing a URI here by passing the absolute path in the test for now.

Member

yea, normalizePath wouldn't handle url...
https://stat.ethz.ch/R-manual/R-devel/library/base/html/normalizePath.html

I think we should require absolute paths in their canonical form here and just pass through..
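
The problem discussed above is that a URI such as file:/C:/a/b/c gets treated as a relative path and prefixed with the working directory. A minimal Python sketch of one possible guard follows. It is a Python analogue only, not SparkR code, and the helper names has_uri_scheme and normalize_unless_uri are hypothetical. It also differs slightly from the suggestion above: instead of passing everything through, it passes only URIs through and still makes plain paths absolute.

```python
import os
import re

# A URI scheme in the RFC 3986 sense, required here to be at least two
# characters long so that Windows drive letters such as "C:" are not
# mistaken for schemes. This length rule is an assumption of the sketch.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]+:")


def has_uri_scheme(path):
    """Return True if `path` starts with something like 'file:' or 'hdfs:'."""
    return _SCHEME.match(path) is not None


def normalize_unless_uri(path):
    """Hypothetical helper: pass URIs through, make plain paths absolute."""
    if has_uri_scheme(path):
        return path
    return os.path.abspath(path)
```

With this guard, normalize_unless_uri("file:/C:/a/b/c") returns its input unchanged, while a relative path such as "libs/dep.jar" is resolved against the current working directory.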

@HyukjinKwon
Member Author

cc @shivaram, @felixcheung, @mariusvniekerk, @holdenk and @brkyvz who were in the PR. Would you guys mind taking a look please?

@SparkQA

SparkQA commented Nov 2, 2017

Test build #83337 has finished for PR 19643 at commit 49b9d48.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 3, 2017

Test build #83365 has finished for PR 19643 at commit b928ab8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver.
Member

hmm, is "add the jar to the current driver." right?

Member Author

I think it is roughly right. I wanted to avoid words like "classloader" or "thread". I am not sure what the best wording is to describe this in R / Python contexts.

Member

Maybe something like "underlying/backing Java process"?

Member Author

Oh, you are back!

Member Author

Yup, that's probably better wording. Let me update it after waiting a bit longer for other review comments. @mariusvniekerk, I am okay with closing this if you happen to have time to proceed with yours now, or I can proceed here. Either way works. Up to you :)

Member Author

@mariusvniekerk, are you okay with me proceeding with this here?

#' Adds a JAR dependency for Spark tasks to be executed in the future.
#'
#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
Member

Is local:/path referring to a Windows drive/path, or should the actual text local:/ be there?

Member Author

I think it refers to the actual text local:/:

case "local" => "file:" + uri.getPath
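
To make the quoted case concrete, here is a minimal Python sketch of the same rewrite. The helper name jar_key is hypothetical, and only the "local" branch comes from the snippet above; how other schemes are handled is not shown in this thread, so the sketch simply returns them unchanged.

```python
from urllib.parse import urlsplit


def jar_key(path):
    """Hypothetical helper mirroring the quoted Scala case for 'local'."""
    parts = urlsplit(path)
    if parts.scheme == "local":
        # local:/path is literal text in the URI: it names a file that is
        # expected at /path on every worker node, so it is rewritten to a
        # plain file: URI.
        return "file:" + parts.path
    # Other schemes are outside the quoted snippet (assumption: unchanged).
    return path
```

For example, jar_key("local:/opt/libs/dep.jar") yields "file:/opt/libs/dep.jar", which shows that local:/ is a scheme prefix and not a Windows drive.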

Contributor

@holdenk left a comment

Thanks for working on this. One small comment on the Python side, on top of Felix's existing comments.

import importlib
importlib.invalidate_caches()

def addJar(self, path, addToCurrentClassLoader=False):
Contributor

We should mention that adding a jar to the current class loader is a developer API and may change.

filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
If `addToCurrentClassLoader` is true, add the jar to the current threads' class loader
in the backing JVM. In general adding to the current threads' class loader will impact all
other application threads unless they have explicitly changed their class loader.
Member Author

@holdenk and @felixcheung, I just added the comments back here. Since it is a developer API, I thought it might be fine to use some JVM-related wording, but please let me know if you feel we need to take it out.

Contributor

So we currently use .. note:: DeveloperApi to indicate it's a developer API (see ml/pipeline and friends for an example).
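
As a rough illustration of that convention, the docstring could carry the note as sketched below. This is only a sketch: SketchContext is a hypothetical stand-in rather than the real SparkContext, the docstring wording is taken from the hunk under review, and the exact placement of the note within the docstring is an assumption.

```python
class SketchContext(object):
    """Stand-in for a context class; not the real pyspark.SparkContext."""

    def addJar(self, path, addToCurrentClassLoader=False):
        """
        Adds a JAR dependency for Spark tasks to be executed in the future.

        If `addToCurrentClassLoader` is true, add the jar to the current
        threads' class loader in the backing JVM.

        .. note:: DeveloperApi
        """
        # The body is intentionally omitted; only the docstring layout
        # is being illustrated here.
        raise NotImplementedError
```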

@SparkQA

SparkQA commented Nov 8, 2017

Test build #83596 has finished for PR 19643 at commit ab52809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @jerryshao. Would you have some time to take a look at this one, please?


if (addToCurrentClassLoader) {
  Utils.getContextOrSparkClassLoader match {
    case cl: MutableURLClassLoader => cl.addURL(Utils.resolveURI(path).toURL)
Contributor

I'm not sure whether this supports remote jars on HTTPS or Hadoop FileSystems. On the executor side, we handle this explicitly by downloading the jars locally and adding them to the classpath, but there does not seem to be such logic here. I'm not sure how this URLClassLoader would communicate with Hadoop, or with HTTPS without certificates.

addJar just adds jars to the file server so that executors can fetch them from the driver and add them to their classpath. It does not affect the driver's classpath. If we support adding jars to the current driver's classloader, how do we leverage these newly added jars?

Member Author

Thanks @jerryshao. I will look into this concern over the weekend and get back to you.

@HyukjinKwon
Member Author

Let me leave this closed for now; I will reopen it when I am ready to proceed.
