ClassCastException loading model in Apache Spark #17

timcroydon · 2014-11-19T16:35:37Z

Hi there,

I'm trying to use epic in an Apache Spark Streaming environment but I'm experiencing some difficulty loading the models. I'm not really sure whether this is an Epic issue, a Breeze issue, a Spark issue or where/how to solve this now! I get the following exception (for English NER):


Exception in thread "main" java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field epic.features.BrownClusterFeaturizer.epic$features$BrownClusterFeaturizer$$clusterFeatures of type scala.collection.immutable.Map in instance of epic.features.BrownClusterFeaturizer
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
    ... trimmed ...
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at breeze.util.package$.readObject(package.scala:21)
    at epic.models.package$.deserialize(package.scala:54)
        ... trimmed calls from my code ...

I've tried running my code (compiled into an uberjar using 'sbt assembly') in a raw Scala console, and I can load the model and run it fine. Using Spark, however, I get the exception described. The ONLY difference, as far as I can tell, is the way the model file is referenced. In the raw Scala environment I can point directly at the model file on disk (e.g. new File("mymodels/model.ser.gz")) and it loads. In Spark, I have to load the file with something similar to:

sc.addFile("model.ser.gz")
new File(SparkFiles.get("model.ser.gz"))

I've tried narrowing the code down, and whether I point at the model extracted from the jar or at the jar itself, I get the same result. It's definitely loading the file (I think), as it fails in other ways if the file doesn't exist. I even tried bypassing the Breeze nonstupidObjectInputStream, to no avail.

Any idea what's going on or how to test? For reference, my JVM is 1.7.0_51 and same in both scala and Spark environments.

Thanks.


dlwh · 2014-11-19T17:40:53Z

I've seen this kind of problem a few times, and they are incredibly hard to
debug. It's usually a classloader problem, I think, and I'm unfortunately
not great at debugging (you can guess my frustration level last time I
debugged this, which is when I created nonstupidObjectInputStream...)

This is going to sound very hacky, but... could you try creating a new
class in epic's package explicitly before loading the model? Something as
simple as val x = new epic.features.BrownClusterFeature("foo")

You might also appeal to the spark user list. I'm happy to help with it as
best I can, but it isn't Epic-specific (I think!) and they have a lot more
expertise dealing with serialization problems caused by remoting and
classloaders.
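
For anyone who wants to test the classloader theory, here is a minimal sketch of an ObjectInputStream that resolves classes through the thread's context classloader before falling back to the JVM default. The names LoaderAwareObjectInputStream and readWithContextLoader are made up for illustration; this is not Epic's or Breeze's actual code, only the general technique.

```scala
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

object LoaderAwareStreams {

  /** ObjectInputStream that tries the supplied loader before the default lookup. */
  class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
      extends ObjectInputStream(in) {

    override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
      try Class.forName(desc.getName, false, loader)
      catch {
        // Primitive types and classes unknown to `loader` fall back to the default lookup.
        case _: ClassNotFoundException => super.resolveClass(desc)
      }
  }

  /** Reads one object, resolving classes via the current thread's context classloader. */
  def readWithContextLoader[T](in: InputStream): T = {
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    val ois = new LoaderAwareObjectInputStream(in, loader)
    try ois.readObject().asInstanceOf[T]
    finally ois.close()
  }
}
```

If the exception disappears when the model is read this way, that would point at class resolution rather than at the model file itself.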

-- David

timcroydon · 2014-11-19T19:01:04Z

I tried your suggestion and was able to create a BrownClusterFeature object with no trouble so doesn't look like it's a classloader issue (as far as I can tell). It feels more like the kind of problem you might get serialising using one version and trying to deserialise with another, although given the file can be deserialised using raw scala it's almost like something's happening to the file stream.

I'll have a closer look at the Spark side to see if I can find similar issues there.
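
One way to separate "the stream is damaged" from "the environment deserialises differently" might be to round-trip a small object through plain Java serialization in both the raw Scala console and the Spark job. The sketch below assumes nothing about Epic: Holder is a hypothetical stand-in for a class with a Map-typed field, like the clusterFeatures field named in the exception.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object RoundTrip {

  /** Hypothetical stand-in for a featurizer holding an immutable Map field. */
  final case class Holder(clusters: Map[String, String])

  /** Serializes `value` with standard Java serialization. */
  def toBytes(value: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(value) finally oos.close()
    bos.toByteArray
  }

  /** Deserializes bytes produced by `toBytes` using a plain ObjectInputStream. */
  def fromBytes[T](bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```

If `RoundTrip.fromBytes[RoundTrip.Holder](RoundTrip.toBytes(RoundTrip.Holder(Map("the" -> "0110"))))` succeeds in the console but fails inside Spark, the model file is probably not the culprit.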

Thanks for the prompt response and for the library!

dlwh · 2014-11-19T19:03:56Z

Is there maybe something going on with different scala versions? (Or, less
likely, Breeze versions?)
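
A rough way to compare the two environments is to print the Scala and Java versions and the location each relevant class was loaded from, once in the raw Scala console and once inside the Spark job. This is only a sketch; EnvReport and originOf are illustrative names, not part of any library.

```scala
object EnvReport {

  /** Jar or directory a class was loaded from, when the JVM exposes it. */
  def originOf(cls: Class[_]): String = {
    val location = for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString
    location.getOrElse("<unknown or bootstrap>")
  }

  /** Key/value pairs describing the runtime this code is executing in. */
  def report(): Seq[(String, String)] = Seq(
    "java.version"         -> System.getProperty("java.version"),
    "scala.version"        -> scala.util.Properties.versionNumberString,
    "scala-library origin" -> originOf(scala.util.Properties.getClass)
  )
}
```

Adding an entry such as `originOf(Class.forName("epic.features.BrownClusterFeaturizer"))` (assuming that class is on the classpath) would show which jar Epic is actually being loaded from in each environment.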

timcroydon · 2014-11-19T20:13:46Z

I'm compiling against Scala 2.10.4 and my installed Scala version matches that. However, there is a Breeze dependency at a different version: it looks like nak pulls in an older version of breeze-natives.

'What depends on' Breeze 0.8:


[info] org.scalanlp:breeze_2.10:0.8 (evicted by: 0.9)
[info]   +-org.scalanlp:breeze-natives_2.10:0.8 [S]
[info]     +-org.scalanlp:nak_2.10:1.3 [S]
[info]       +-org.scalanlp:epic_2.10:0.2 [S]
[info]         +-my stuff
[info]         +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]         | +-my stuff
[info]         | 
[info]         +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]           +-my stuff

And same for Breeze 0.9:


[info] org.scalanlp:breeze_2.10:0.9 [S]
[info]   +-org.scalanlp:breeze-natives_2.10:0.8 [S]
[info]   | +-org.scalanlp:nak_2.10:1.3 [S]
[info]   |   +-org.scalanlp:epic_2.10:0.2 [S]
[info]   |     +-my stuff
[info]   |     +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]   |     | +-my stuff
[info]   |     | 
[info]   |     +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]   |       +-my stuff
[info]   |       
[info]   +-org.scalanlp:epic_2.10:0.2 [S]
[info]   | +-my stuff
[info]   | +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]   | | +-my stuff
[info]   | | 
[info]   | +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]   |   +-my stuff
[info]   |   
[info]   +-org.scalanlp:nak_2.10:1.3 [S]
[info]     +-org.scalanlp:epic_2.10:0.2 [S]
[info]       +-my stuff
[info]       +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]       | +-kafkareader:kafkareader_2.10:0.1 [S]
[info]       | 
[info]       +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]         +-my stuff

No idea if that might cause problems?

dlwh · 2014-11-19T20:38:58Z

nak is declared intransitive() so that shouldn't be a problem. (Seems like a bug in the dependency graph plugin...)

JSantosP · 2015-04-08T17:10:26Z

Hi there,

I came across this issue while googling for a solution to a similar problem in a project I'm working on. We found and fixed the cause in our case (I'm not sure whether it fixes your problem).

We solved it by adding the missing classpath dependencies when creating the SparkContext (not only the direct dependencies):

  val sparkConf = new SparkConf().setJars(Seq("...")) // Add all transitive dependencies that Spark workers might need.
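
As a sketch of how such a jar list could be assembled rather than typed by hand, the helper below collects every .jar file in a directory. JarListing, jarsIn and the "lib" directory are hypothetical names chosen for illustration; the Spark call appears only in a comment so the snippet runs without Spark on the classpath.

```scala
import java.io.File

object JarListing {

  /** Absolute paths of all *.jar files directly inside `dir`, sorted for stable output. */
  def jarsIn(dir: File): Seq[String] = {
    // listFiles() returns null when `dir` does not exist or is not a directory.
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
    entries
      .filter(f => f.isFile && f.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)
      .sorted
      .toSeq
  }
}

// Possible use with Spark (assumes a "lib" directory holding the dependency jars):
//   val sparkConf = new SparkConf().setJars(JarListing.jarsIn(new File("lib")))
```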

Hope this helps.

Regards!

acvogel · 2015-06-10T21:02:52Z

@timcroydon Any chance you found a solution to this problem? Running into the same issue.

reactormonk · 2015-06-10T21:05:26Z

@acvogel the solution @JSantosP provided doesn't work?

timcroydon · 2015-06-10T21:10:01Z

I don't recall now, I'm afraid. For various unrelated reasons, we ended up using a different library for similar functionality so I don't think I ever got round to investigating this fully - sorry!

acvogel · 2015-06-10T21:34:15Z

@reactormonk I haven't gotten it to work by that route, but perhaps I'm missing something. I assemble the project into a single jar, and also add dependent jars:

new SparkConf().setJars(Seq("/root/myBigJar.jar", "/root/epic-ner-en-conll_2.10-2015.1.25.jar", "/root/epic_2.10-0.3.jar"))

Perhaps I'm not following @JSantosP's suggestion correctly, as those should be included in myBigJar.jar anyway.

@timcroydon Thanks for your reply!

dlwh · 2015-06-10T21:44:12Z

there's a jar from february that works, i believe. can't fix atm.

briantopping · 2015-06-10T22:48:17Z

I've been using the 2015.2.19 data files combined with the sources from https://github.com/dlwh/epic/tree/e0238ceb16fc9adb9511240638357e8c44200a2f. The files from February work, but I believe this tree is the last one that works. I covered some of it in #24 IIRC.

I don't know if this will solve your specific issue, but it is the latest version I believe will work. From there, maybe you could fix whatever CCE is holding back usage under Spark.

https://gist.github.com/briantopping/369fb337735c1b726337 is the complete dependency closure from the subproject I am using.

lfernandez-stratio · 2015-06-11T08:42:52Z

I had the same problem and the solution from JSantosP worked for me. Thank you.

ltao80 · 2015-07-08T10:17:51Z

What is the final solution? I have the same problem: I build a single jar file and it works locally, but when I submit it to Spark it throws java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field epic.features.BrownClusterFeaturizer.epic$features$BrownClusterFeaturizer$$clusterFeatures of type scala.collection.immutable.Map in instance of epic.features.BrownClusterFeaturizer

Can anyone help? Thanks a lot.

acvogel · 2015-07-08T20:54:35Z

@ltao80 I never got it to work and gave up. I'd be curious to hear from anyone else with a detailed solution.

ltao80 · 2015-07-09T07:49:53Z

@acvogel Thank you for your reply. I gave up too and switched to Stanford NLP.

Tooa · 2015-12-18T18:29:11Z

I'm facing the same problem (see here [1]). I've tried @JSantosP's suggestion and added several dependencies to the SparkConf.

val path = "/home/.../.../spark-fun/jars/"
val conf = new SparkConf().setAppName("wordCount").setJars(Seq(
  path + "epic_2.10-0.3.jar",
  path + "epic-ner-en-conll_2.10-2015.1.25.jar",
  path + "nak_2.10-1.3.jar",
  path + "scala-logging-api_2.10-2.1.2.jar",
  path + "scala-logging-slf4j_2.10-2.1.2.jar",
  path + "breeze_2.10-0.11-M0.jar",
  path + "spark-assembly-1.5.2-hadoop2.6.0.jar",
  path + "spark-fun-assembly-1.0.jar"
))

Do I need the path here? I also wonder why I should add these jars to the SparkConf at all. Using a fat jar generated with sbt assembly should be enough, right? The project dependency tree looks like [2]. Do I really need to add all of these dependencies to the SparkConf?

[1] https://github.com/Tooa/spark-fun
[2] https://gist.github.com/Tooa/a2d364d7d457c64dd68f

axel22 mentioned this issue Jun 22, 2015

ClassCastException scalameter/scalameter#93

Closed

nevillelyh mentioned this issue Sep 18, 2017

Java serialization issue in SBT spotify/scio#847

Closed
