[SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL #1929
Conversation
Can one of the admins verify this patch?
This adds the specified jar to both our compile-time and runtime classpaths in the interpreter.
@ScrapCodes can you look at this one?
Just curious, is this what the Scala team intends to do here (or did in 2.11), or could it interfere with other things in the compiler?
You are mixing a trait with URLClassLoader. Is there a reason for not implementing a derived class?
I was previously using the following inline:

    new URLClassLoader(...) {
      def addNewUrl(url: URL) = this.addURL(url)
    }

But I was unable to access the exposed method since it was creating an anonymous subclass. In terms of providing a trait versus a class that extends URLClassLoader, I figured a trait was a lighter way to mix in the needed exposure.
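For reference, here is a minimal sketch of the trait-mixin idea described above; the trait and value names are illustrative, not necessarily those used in the patch:

    import java.net.{URL, URLClassLoader}

    // Mixing a trait into URLClassLoader keeps the widened type visible to callers,
    // so the otherwise protected addURL can be reached through addNewUrl.
    trait ExposeAddUrl extends URLClassLoader {
      def addNewUrl(url: URL): Unit = this.addURL(url)
    }

    // Unlike an anonymous subclass, the mixed-in type is statically visible,
    // so addNewUrl remains callable wherever the loader is referenced.
    val runtimeClassLoader: URLClassLoader with ExposeAddUrl =
      new URLClassLoader(Array.empty[URL], getClass.getClassLoader) with ExposeAddUrl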
@mateiz I don't know what the Scala team's plans would be. I know that the only suggestion thus far was to wipe the global reference and recreate it (https://issues.scala-lang.org/browse/SI-6502?focusedCommentId=68422&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-68422), which is not ideal. There hasn't been any discussion on this since February. At a quick glance, it doesn't look like they have fixed anything with the REPL in 2.11.2. The code was moved (https://github.com/scala/scala/blob/v2.11.2/src/repl/scala/tools/nsc/interpreter/ILoop.scala), but it has the same issue. The 2.11.2 REPL gives me the same problem:
Alright, it might be good to ping them on that JIRA to make sure this doesn't break subtle things. Although obviously it does seem to work.
Looks okay to me. I was wondering if you tried Scala reflection before resorting to Java reflection?
@ScrapCodes I recently picked up Scala and am much more familiar with Java's reflection than Scala's runtime mirrors and such. Do you have any pseudocode to express what this would look like using Scala's mirrors? @mateiz What side effects are you concerned about? FYI, I did add a comment about this pull request asking for any side effects it might have: https://issues.scala-lang.org/browse/SI-6502?focusedCommentId=70380&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-70380
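For context, the Java-reflection route under discussion follows the usual pattern of forcing access to a non-public member. A hedged sketch against URLClassLoader (the jar path is made up, and this is not the exact compiler internal the patch touches):

    import java.net.{URL, URLClassLoader}

    // Illustrative only: invoke the protected URLClassLoader.addURL via Java reflection.
    val loader = new URLClassLoader(Array.empty[URL], getClass.getClassLoader)
    val addURL = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
    addURL.setAccessible(true) // bypass the protected modifier
    addURL.invoke(loader, new URL("file:/tmp/example.jar"))

A Scala-mirrors version would resolve the same method symbol through scala.reflect.runtime, but the end result is the same: mutating state the public API does not expose.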
I believe @mateiz is concerned about compatibility with other Scala 2.10.x versions. We would have to make sure this works for all supported Scala versions, as the internals may change between minor releases. In that case, we might have to add different code paths for different versions. This is unlikely, though, and I think the benefits far outweigh the costs in this case.
I actually wasn't super concerned about new 2.10.x releases, since I don't think they'll change the REPL, but I'm just worried that there may be assumptions in the compiler that we're violating by adding these JARs. Maybe not, but it wouldn't hurt to ask them whether this is an okay fix.
BTW I'd also be okay merging it if there are only minor problems, since adding these JARs at runtime is very useful.
This is similar to what I tried last year, but I recently saw: which says the mechanisms are in flux in 2.11.
@mateiz @som-snytt The summary I have is that this should work for 2.10.x, as it doesn't appear that they are trickling down the removal of the class invalidation from scala/scala#3884. I've given more analysis to the internal classpath editing from 2.10 and I don't see any potential side effects from appending to the existing merged classpath. In terms of Spark moving to 2.11(.3 or higher) in the future, one option would be to simply include a trait mixed in with Global that keeps the functionality they removed. From what I've seen, the only reason they removed it was because it wasn't used in the compiler or sbt, not that it had any issues. The underlying functionality wasn't broken, so I don't see any harm in doing that as a backup.
Got it, thanks for looking into this.
So, this isn't my area of expertise (I'm not sure who at Typesafe/EPFL is currently responsible for the development/maintenance of the REPL, thus I don't know who could comment on future plans/roadmap). However,

Since I'm not the foremost expert on this stuff, I'm also going to ping @gkossakowski (our best bet haha) – hopefully he can confirm/deny/clarify these insights?
@heathermiller Thanks for the feedback! It looks like @gkossakowski commented here: SI-6052. So, it looks like the issues that might not be tackled by the current pull request are as follows:

At least, he makes me believe that the invalidateClassPaths doesn't handle that. While I didn't mention it in the pull request earlier, I assumed that added entries wouldn't overlap with the existing classpaths. If a feature is added later to properly handle this, even better. For now, it looks like this implementation works as long as you don't add conflicting classpath entries. @mateiz, I'm not sure if that is a small enough problem for you to want this or not. Personally, I haven't run into these issues with the Apache libraries that I've used with the patch. Nor have I run into issues trying out different jars for testing, such as oddballs like JGoodies and lwjgl. Obviously, if someone wants to import jars at runtime, they need to be aware of conflicting classes.

I did just test this with the following files:

Placed inside TestJar.jar:

Placed inside TestJar2.jar:

And running yields something like this:

So, I think the current solution does work for multiple imports that have the same package structure with entirely different classes. I don't know how it handles shadowing classes, if at all.
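The file contents and session output referenced above are not preserved in this thread. A hypothetical pair along the lines described (same package in both jars, entirely different classes) could look like this, with each block compiled and packaged into its own jar:

    // Hypothetical example; the actual TestJar.jar / TestJar2.jar contents are not shown above.

    // Source packaged into TestJar.jar:
    package com.example {
      class TestClassOne { def sayHello(): String = "Hello from TestJar.jar" }
    }

    // Source packaged into TestJar2.jar (same package, a different class):
    package com.example {
      class TestClassTwo { def sayHello(): String = "Hello from TestJar2.jar" }
    }

Adding both jars with :cp and instantiating each class exercises the merged-package case without any class shadowing.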
FWIW, I remembered my fix included overriding the ClassPath (in the JavaPlatform) of the REPL compiler. So it was a bit more nuanced in handling the replacement; it was a long time ago now, in Internet or dog years. But having an "additive" solution is better than nothing, the perfect being the enemy of the good and all that.
Yep, so shortly after I posted my last message, I had a conference call with @gkossakowski and then chatted briefly with @mateiz. Here's an attempt to summarize our discussions. Important note: @gkossakowski is currently recovering from surgery on his wrists, so written responses from him ATM are precious :) (and we shouldn't push him for too many...)

RE: point 2)

@gkossakowski and I discussed the issue of shadowing other jars on the classpath. He explained a bit about the history of

However, my intuition (which could be wrong) was that, for now, this is likely less important for users of spark-shell. Rationale: you're manually adding jars to the classpath; it's unlikely that the list of jars one would add to the classpath would be so numerous that you'd shadow something else that you've added. The real concern would be that someone adds a different version of a library that's already on the classpath – and you're right, for this we have no support as far as I can tell. I agree with you that, for now, if folks are going to import jars at runtime, they should just be aware of conflicting classes (at least importing jars at runtime would now once again be a possibility).

@gkossakowski also assured me that for the rest of 2.10.x, the logic that this PR touches most likely wouldn't fluctuate, so we shouldn't worry about any of this breaking for the rest of the 2.10.x series.

The particularly important bit of my conversation with @gkossakowski, though, centered on what spark-shell would do when migrating to 2.11.x, when you can't even rely on a partial solution like

Right now, the idea would be to write a specification of the bare minimum that spark-shell/the Scala REPL should do when there are conflicts on the classpath. And, ideally, we could try to get this fix into one of the next two releases of the 2.11.x series (due dates: Sept 19, Nov 28) so that this fix doesn't have to live in Spark and depend on delicate scalac internals.

RE: point 1)

After chatting briefly with @mateiz, it was determined that the later-appended jars should take precedence (right Matei? or do I have that backwards?)
Thanks @heathermiller for summing up my conversation. I realized that I introduced a bit of confusion with my comment on SI-6052. Let me clarify. The logic that is being removed by scala/scala#3884 does support the scenario of appending a jar to the classpath. This is demonstrated by @rcsenkbeil's comment. That piece of logic promises much more: it promises to handle arbitrary changes to the classpath, including removal of and changes to single classes. However, it doesn't deliver on that promise. We discovered all sorts of subtle issues while working on the resident compilation mode of scalac. This is not a real surprise: the Scala compiler wasn't designed with resident compilation in mind. The

As I mentioned in SI-6052, we have a good chance to implement a small subset of what

How do we determine what's the small subset? I mentioned that in SI-6502. The API that

One might think that the first point is not important because it's unlikely the same package will span several jars. That would be true if packages were flat. However, in Scala packages are nested. So if you have one jar with

When appending the second jar, you must merge the org and apache packages. What you merge is the contents of

The second point is extremely hard to implement properly. That's why I think the alternative API to the current

Here's what needs to happen for Scala 2.11.x:
@gkossakowski, thanks for the detailed reply. From my point of view, what we want when new JARs are added is for earlier JARs to take precedence. This is what makes the most sense. If you already instantiated an object from the previous version of the class and stored it in a variable, it's not possible for it to suddenly change class. So instead the effect should be the same as tacking on another JAR at the end of your classpath -- only classes that are not found in earlier JARs come from there. Would these semantics be possible to implement for 2.11?
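A small sketch of the semantics described here, using plain JVM class-loader delegation as an analogy (this is not the compiler-side classpath code the PR actually touches): making each appended jar a child of the existing loader means lookups consult the original classpath first, so earlier definitions always win and only genuinely new classes come from the appended jar.

    import java.net.{URL, URLClassLoader}

    // Sketch only: "earlier jars take precedence" expressed as parent-first delegation.
    // A class already resolvable from `existing` is never replaced; only classes
    // not found earlier are served from the newly appended jar.
    def appendJar(existing: ClassLoader, jar: URL): ClassLoader =
      new URLClassLoader(Array(jar), existing)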
BTW this actual patch looks good to me for 2.10, so I'll probably merge it after a bit of manual testing. @pwendell should we put this in branch-1.1 as well? It is technically a bug fix, but not a critical one.
Jenkins, test this please |
Hey @rcsenkbeil, so I tested this out manually, but unfortunately it doesn't seem to quite work -- you can't use the new classes in the REPL, though you can create them using reflection. Here's a sample session:

My simple-jar-v1.jar just contains a class called Test in package test. Have you seen this happen yourself? It doesn't seem like this will be usable enough in its current form.
BTW, the problem happens even if I call
Actually, one follow-up: It seems that the patch works if you try to use the new class before calling
@mateiz I have not run into that issue. In fact, I just cloned a copy of the branch I submitted onto a fresh computer and it still worked fine. To test, I cloned the branch and ran

From there, I created a file called TestClass.java as an illustration:

I created a jar from this by issuing

Finally, I ran two different cases after starting the Spark shell through
@mateiz I've added a hack that just runs a compilation error in silence before adding anything to allow this. It works as seen here (and doesn't interfere with anything else). I'll spend a little more time investigating, but this does seem to work for Scala classes now.
Ok, so I think I have a solution. It took an embarrassing amount of experimentation and grokking of the REPL (not my area of expertise), but if you just create a dummy Run first, the needed initialization happens. So, in your implementation of addUrlsToClassPath:

    def addUrlsToClassPath(urls: URL*): Unit = {
      new Run // force some initialization
      urls.foreach(_runtimeClassLoader.addNewUrl) // Add jars/classes to runtime for execution
      updateCompilerClassPath(urls: _*) // Add jars/classes to compile time for compiling
    }

After running into 10,000 walls, writing a bunch of code to manually force the initialization of symbols in previous/current runs, I discovered this happening in Global:
Awesome! Great find, @heathermiller! I've added an additional commit to move to that solution. @mateiz, what do you think now? This is what Global itself does to reinitialize. Is that clean enough?

FYI, I ran through tests with both Scala and Java classes (from within jars) and everything checks out. I also added the SparkContext.addJar call as you requested to make this more Sparkish. I tested this by starting up a standalone cluster using the

I'm hoping this demonstrates that the jars also work on the remote cluster. I don't have access to a remote Spark cluster at the moment.

The Java implementation:

The Scala implementation:
Cool, thanks! Running a cluster locally sounds fine as a way to test it; it will go through the same codepath.
Jenkins, test this please |
test this please |
QA tests have started for PR 1929 at commit
QA tests have finished for PR 1929 at commit
We agree on semantics. I called changing an existing class "shadowing" but we mean the same thing: changes to existing classes should not be allowed. Adding jars to the classpath means just adding new classes that were not previously available. For that we need merging of packages, as I explained earlier. It's possible to implement this kind of API for 2.11 but it doesn't exist yet. I hope we can figure out how to merge your changes and work on the API on the compiler side. The current approach of going deep into the internals of
@mateiz I'd prefer not to merge this into branch-1.1 at this point unless you see a really compelling need. I'm scarred from Spark 1.0.0, which actually released with a major REPL bug (which was itself an attempt to fix another bug!).
I have an easy test case which crashes the compiler. jar1 has a package repl1 and a case class A:

    :cp jar1
    import repl1

jar2 also has package repl1 and case classes A and B:

    b == a // will crash the compiler
@ScrapCodes, this was the case that I said I didn't support, since the likelihood of bringing in different jars with the same package structure and class names is rare for the REPL and can be tricky to deal with in many cases with the current code base. This is where we would want something upstream in Scala, as @gkossakowski has mentioned. Also, as he discussed, this would really be a feature to add new classes, not replace existing ones.

[EDIT] Also, you're going so far as to mix a previous and a current class implementation in an equality comparison. Maybe this is something that should be supported, but it just seems to ask for trouble. I'd think that people adding jars to the REPL would be more cautious. Maybe it makes more sense to give this command a different name? Like
I think :cp is fine, no reason to make a new command. I tested this again and it seems to work well (also on a cluster), so I'm going to merge this into 1.2. And I'll open a new JIRA for a Scala 2.11 version of this.
Hey Chip, one more thing, can you create an account on https://issues.apache.org/jira/browse/SPARK so I can assign this issue to you? Then update the pull request to say [SPARK-3256] in the title. I created this as an issue: https://issues.apache.org/jira/browse/SPARK-3256.
@mateiz Created an account located at https://issues.apache.org/jira/secure/ViewProfile.jspa?name=senkwich
Alright, thanks! I've merged this in. I also opened https://issues.apache.org/jira/browse/SPARK-3257 to do this for the Scala 2.11 REPL. @gkossakowski it would be great if you guys could add an API for this in the next Scala 2.11 release -- our 1.2 release will be in roughly 3 months.
@mateiz, we haven't been planning to touch REPL code and this is not a trivial request, so I'll have to ask around about this. I'll update this ticket once I know what our plans are.
…0.x for REPL

As seen with SI-6502 (https://issues.scala-lang.org/browse/SI-6502) of Scala, the :cp command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem.

My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global).

The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off.

Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath. The solution now looks like this:

Author: Chip Senkbeil <rcsenkbe@us.ibm.com>

Closes apache#1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits:

f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters
a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error
2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly
a220639 [Chip Senkbeil] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL









As seen with SI-6502 of Scala, the :cp command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem.
My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global).
The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off.
Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath.
The solution now looks like this:
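The screenshot that originally followed is not preserved here. A hedged sketch of the kind of session this enables (the jar path and class name are made up for illustration):

    scala> :cp /path/to/my-library.jar
    scala> val helper = new com.example.Helper()  // a class that exists only in the newly added jar
    scala> helper.getClass.getName                // resolvable at compile time, no reflection needed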
