[SPARK-3649] Remove GraphX custom serializers #2503

ankurdave · 2014-09-23T04:42:45Z

As reported on the mailing list, GraphX throws

java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
        at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) 
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) 
        at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)

when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle violates this assumption.
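The failure mode can be reproduced in miniature. The following is a language-neutral sketch in Python, not GraphX's Scala code: `PairSerializer`, `write_whole_pairs`, and `write_split_pairs` are hypothetical names, and the record layout (a 64-bit ID plus a 32-bit payload) is an assumption made only for illustration. A serializer that unpacks every object as a pair works when handed whole pairs and fails when a caller writes the key and value separately, which is roughly what the spill path does.

```python
import io
import struct


class PairSerializer:
    """Hypothetical serializer that only knows how to write whole (key, value) pairs."""

    def __init__(self, stream):
        self.stream = stream

    def write_object(self, obj):
        # Assumes obj is always a (vertex_id, payload) pair, much as the
        # GraphX serializers assumed a scala.Tuple2.
        vertex_id, payload = obj
        self.stream.write(struct.pack(">qi", vertex_id, payload))


def write_whole_pairs(serializer, records):
    # Code path that hands the serializer complete pairs.
    for record in records:
        serializer.write_object(record)


def write_split_pairs(serializer, records):
    # Spill-like code path that writes key and value as separate objects.
    for key, value in records:
        serializer.write_object(key)    # a bare integer, not a pair
        serializer.write_object(value)


records = [(1, 10), (2, 20)]

ok = PairSerializer(io.BytesIO())
write_whole_pairs(ok, records)

try:
    write_split_pairs(PairSerializer(io.BytesIO()), records)
    split_error = None
except TypeError as exc:
    # Python's analogue of the ClassCastException in the stack trace.
    split_error = exc
```

The serializer itself is unchanged between the two calls; only the caller's contract differs, which is why the bug surfaced only once the spill path was exercised.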

GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile).
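To make the trade-off concrete, here is a minimal sketch of the three pieces involved: a generic 7-bits-per-byte variable-length encoding, the tag-byte option, and the stateful alternating option. It is written in Python for brevity and is not the removed GraphX code; the exact byte layout GraphX used, the tag values, the 8-byte float payload, and the names `encode_varint`, `decode_varint`, `write_tagged`, and `AlternatingWriter` are all assumptions for illustration.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative IDs only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


TAG_KEY = b"\x00"    # hypothetical tag values
TAG_VALUE = b"\x01"


def write_tagged(obj, is_key):
    """Tag-byte option: prefix every object so the reader knows how to decode it."""
    if is_key:
        return TAG_KEY + encode_varint(obj)
    return TAG_VALUE + struct.pack(">d", obj)


class AlternatingWriter:
    """Stateful option: assume calls strictly alternate key, value, key, ..."""

    def __init__(self):
        self.expect_key = True
        self.chunks = []

    def write_object(self, obj):
        if self.expect_key:
            self.chunks.append(encode_varint(obj))
        else:
            self.chunks.append(struct.pack(">d", obj))
        # Any caller that breaks the alternation silently corrupts the stream.
        self.expect_key = not self.expect_key


small_id = encode_varint(5)            # 1 byte instead of a fixed 8
tagged_small_id = write_tagged(5, True)  # 2 bytes: the tag doubles the cost

writer = AlternatingWriter()
writer.write_object(7)
writer.write_object(7)  # a second key in a row is written as an 8-byte float
```

In this sketch a small vertex ID costs one byte, so a one-byte tag doubles its size, and the alternating writer encodes a repeated key as a value without raising any error. Those are the "costly" and "fragile" cases respectively.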

Instead, this PR simply removes the custom serializers. This causes a 10% slowdown (494 s to 543 s) and 16% increase in per-iteration communication (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).
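As a quick check of the quoted percentages, the relative increases follow directly from the raw measurements above. The helper name `percent_increase` is introduced here only for illustration.

```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


runtime_pct = percent_increase(494, 543)    # seconds; about 9.9, reported as 10%
comm_pct = percent_increase(2176, 2518)     # MB per iteration; about 15.7, reported as 16%
```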

SparkQA · 2014-09-23T04:49:25Z

QA tests have started for PR 2503 at commit a49c2ad.

This patch merges cleanly.

SparkQA · 2014-09-23T05:39:53Z

QA tests have finished for PR 2503 at commit a49c2ad.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-23T05:39:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20688/

ankurdave · 2014-09-23T07:30:36Z

retest this please

SparkQA · 2014-09-23T07:34:25Z

QA tests have started for PR 2503 at commit a49c2ad.

This patch merges cleanly.

SparkQA · 2014-09-23T08:41:35Z

QA tests have finished for PR 2503 at commit a49c2ad.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-23T08:41:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20695/

rxin · 2014-11-11T03:30:55Z

merging this. thanks!

As [reported][1] on the mailing list, GraphX throws
```
java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
        at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
        at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
```
when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2].

GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile).

Instead, this PR simply removes the custom serializers. This causes a **10% slowdown** (494 s to 543 s) and **16% increase in per-iteration communication** (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).

[1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501
[2]: https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2503 from ankurdave/SPARK-3649 and squashes the following commits:

a49c2ad [Ankur Dave] [SPARK-3649] Remove GraphX custom serializers

(cherry picked from commit 300887b)
Signed-off-by: Reynold Xin <rxin@databricks.com>

[SPARK-3649] Remove GraphX custom serializers

a49c2ad

asfgit closed this in 300887b Nov 11, 2014
