[SPARK-14654][CORE] New accumulator API #12612
Conversation
Test build #56704 has finished for PR 12612 at commit
class AverageAccumulator extends NewAccumulator[jl.Double, jl.Double] {
One difference from the previous API: we can't have a general setValue method, as it needs the intermediate type, which is not exposed by the new API. For example, AverageAccumulator doesn't have setValue.
I think getting rid of setValue is great; in the consistent accumulators built on the old API I had to throw an exception if people used setValue.
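To make the trade-off concrete, here is a hedged sketch, not code from this PR (only the method names add/merge/reset/localValue seen in the diff hunks of this conversation are assumed): an average accumulator whose intermediate state (sum, count) differs from its Double output, which is exactly why a general setValue cannot be offered.

// Hypothetical sketch: the intermediate state is (sum, count), but the exposed
// value is a Double, so setValue(avg: Double) has no unambiguous meaning.
class AverageSketch extends Serializable {
  private var sum: Double = 0.0
  private var count: Long = 0L

  def isZero: Boolean = count == 0L
  def reset(): Unit = { sum = 0.0; count = 0L }
  def add(v: Double): Unit = { sum += v; count += 1 }
  def merge(other: AverageSketch): Unit = { sum += other.sum; count += other.count }

  // The output loses the (sum, count) information needed to reconstruct the
  // internal state, so there is no general way to "set" it from a Double.
  def localValue: Double = if (count == 0L) Double.NaN else sum / count
}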
Test build #56848 has finished for PR 12612 at commit
Test build #56850 has finished for PR 12612 at commit
cc @rxin, several questions need to be discussed:
Test build #56860 has finished for PR 12612 at commit
Test build #56861 has finished for PR 12612 at commit
Force-pushed from d4cc938 to 38cb9a1
Test build #56862 has finished for PR 12612 at commit
Test build #56863 has finished for PR 12612 at commit
metrics.internalAccums.find(_.name == accum.name).foreach(_.setValueAny(accum.update.get))
definedAccumUpdates.filter(_.internal).foreach { accInfo =>
  metrics.internalAccums.find(_.name == accInfo.name).foreach { acc =>
    acc.asInstanceOf[Accumulator[Any, Any]].add(accInfo.update.get)
This is an example that shows a weakness of the new API: we can't setValue. For this example, we have the final output and we want to set the value of the accumulator so that it can produce the same output. With the new API, we can't guarantee that all accumulators can implement setValue, e.g. the average accumulator. I'm still thinking about how to fix it or work around it, @rxin any ideas?
Just have a reset and "add"?
I'd argue it doesn't make sense to call setValue, since the "set" action is not algebraic (i.e. you cannot compose/merge set operations).
Actually I don't think we need this if we send accumulators back to the driver.
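As a concrete illustration of the reset-plus-add suggestion above, using the AverageSketch sketch from the earlier comment (again, not code from this PR):

// "Setting" via reset-then-add: only possible when the input type can encode the
// desired final state, and not algebraic -- anything accumulated so far is discarded.
val acc = new AverageSketch
acc.add(1.0)
acc.add(3.0)
acc.reset()     // drop everything accumulated so far
acc.add(42.0)   // the accumulator now holds sum = 42.0, count = 1
assert(acc.localValue == 42.0)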
Test build #56900 has finished for PR 12612 at commit
name: Option[String],
countFailedValues: Boolean) extends Serializable

trait UpdatedValue extends Serializable
cc @rxin, I didn't send the accumulator back because of a serialization problem.
Basically, when we send an accumulator from the driver to executors, we don't want to send its current value (think about a list accumulator: we definitely don't want to send the current list to executors).
But when we send an accumulator from executors back to the driver, we do need to send the current value.
One possible solution is to have 2 local variables for each accumulator, one for the driver and one for executors. But it's a lot of trouble when accumulators have a complex intermediate type, e.g. the average accumulator. So I ended up with this approach.
Another potential problem with sending Accumulators over the wire, with the proposed API from the JIRA, is that the Accumulators register themselves inside of readObject.
@cloud-fan Why can't we send the current list? The current list, as far as I understand, will always be zero-sized? We can just create a copy of the accumulator for sending to the executors.
If the accumulator was used in two separate tasks, it could have built up some values from the first task on the driver before the second task. But always sending a zeroed copy to the executor would be an OK solution to that.
Test build #56981 has finished for PR 12612 at commit
Test build #57000 has finished for PR 12612 at commit
This is really a big problem...
We need some serialization hooks to support sending accumulators back from executors, and I tried 2 approaches but both failed:
- Add a writing hook, which resets the accumulator before sending it from the driver to executors. The problem is we can't just reset: the accumulator state should be kept at the driver side. And the Java serialization hook isn't flexible enough to allow us to do a copy or something. One possible workaround is to create an AccumulatorWrapper so that we can have full control of accumulator serialization. But this will complicate the hierarchy.
- Add a reading hook, which resets the accumulator after deserialization. Unfortunately it doesn't work when Accumulator is a base class. By the time readObject is called, the child's fields are not initialized yet. Calling reset here is a no-op; the values of the child's fields will be filled later.
Generally speaking, writeObject and readObject are not good serialization hooks. We'd either have to figure out some trick to work around it, or find other, better serialization hooks. (Or not send accumulators back.)
@rxin any ideas?
as discussed offline, writeReplace
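For readers following the thread: a minimal sketch of the writeReplace idea, illustrative only and not the patch's actual implementation (the class and field names here are made up). The driver substitutes a zero-valued copy when serializing, so its own state is never shipped to executors, while the executor-to-driver direction keeps the accumulated values.

import java.io.Serializable

// Hypothetical list-style accumulator demonstrating the writeReplace hook.
class ZeroOnSendSketch(@transient private val onDriver: Boolean = true)
  extends Serializable {

  private val buf = new java.util.ArrayList[Long]()

  def add(v: Long): Unit = buf.add(v)
  def value: java.util.List[Long] = buf

  // Called by Java serialization instead of writing `this` directly. It must be
  // accessible from the (sub)class being serialized, i.e. not private, or the
  // hook is silently skipped.
  protected def writeReplace(): Any = {
    if (onDriver) new ZeroOnSendSketch(onDriver = false) // driver -> executor: ship a zeroed copy
    else this                                            // executor -> driver: keep the values
  }
}

Because onDriver is transient, it deserializes as false on the executor, so when the accumulator is serialized back with the task result, writeReplace returns this and the accumulated values survive the trip.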
assert(acc.value === Seq(9, 10))
}

test("value is reset on the executors")
This is covered by the new test "accumulator serialization".
assert(newUpdates.size === tm.internalAccums.size + 4)
}

test("from accumulator updates")
This test is not valid anymore. TaskMetrics.fromAccumulatorUpdates will return a TaskMetrics containing only internal accumulators, so there is no need to worry about unregistered external accumulators.
Test build #57130 has finished for PR 12612 at commit
Test build #57132 has finished for PR 12612 at commit
Test build #57135 has finished for PR 12612 at commit
/**
 * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of
 * type `OUT`. Implementations must define following methods:
 *  - isZero: tell if this accumulator is zero value or not. e.g. for a counter accumulator,
These should be javadoc on the methods, rather than in the classdoc.
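For example, something along these lines (a sketch of the suggested doc placement, not the final wording):

/**
 * Returns true if this accumulator is at its zero value, e.g. 0 for a counter
 * accumulator or an empty list for a list accumulator.
 */
def isZero: Boolean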
This looks pretty good to me. We should get it to pass tests and then merge it asap. Some of the comments can be addressed later.
def localValue: OUT

// Called by Java when serializing an object
final protected def writeReplace(): Any = {
This should be private; however, this hook won't be called if it's private (not sure why), so I use final protected to work around it.
Test build #57220 has finished for PR 12612 at commit
Merging in master!
if (atDriverSide) {
  if (!isRegistered) {
    throw new UnsupportedOperationException(
      "Accumulator must be registered before send to executor")
@cloud-fan I'm getting intermittent, but regular, test failures in ALSSuite (not sure if there might be others, this just happens to be something I'm working on now).
e.g.
[info] - exact rank-1 matrix *** FAILED *** (4 seconds, 397 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 74, not attempting to retry it. Exception during serialization: java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor
[info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1448)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1436)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1435)
[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1435)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
[info] at scala.Option.foreach(Option.scala:257)
[info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
[info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1657)
[info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1616)
[info] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
[info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
[info] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
[info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
[info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1936)
[info] at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:970)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
[info] at org.apache.spark.rdd.RDD.reduce(RDD.scala:952)
[info] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$stats$1.apply(DoubleRDDFunctions.scala:42)
[info] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$stats$1.apply(DoubleRDDFunctions.scala:42)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
[info] at org.apache.spark.rdd.DoubleRDDFunctions.stats(DoubleRDDFunctions.scala:41)
[info] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$mean$1.apply$mcD$sp(DoubleRDDFunctions.scala:47)
[info] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$mean$1.apply(DoubleRDDFunctions.scala:47)
[info] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$mean$1.apply(DoubleRDDFunctions.scala:47)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
[info] at org.apache.spark.rdd.DoubleRDDFunctions.mean(DoubleRDDFunctions.scala:46)
[info] at org.apache.spark.ml.recommendation.ALSSuite.testALS(ALSSuite.scala:373)
[info] at org.apache.spark.ml.recommendation.ALSSuite$$anonfun$12.apply$mcV$sp(ALSSuite.scala:385)
[info] at org.apache.spark.ml.recommendation.ALSSuite$$anonfun$12.apply(ALSSuite.scala:383)
[info] at org.apache.spark.ml.recommendation.ALSSuite$$anonfun$12.apply(ALSSuite.scala:383)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:56)
[info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info] at scala.collection.immutable.List.foreach(List.scala:381)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info] at org.scalatest.Suite$class.run(Suite.scala:1424)
[info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:28)
[info] at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:28)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
I also found some tests failing because of this non-deterministically; looking into it.
Shall we revert this commit? Got some similar errors. Code:
sc.parallelize(1 until 100, 1).map { i => Array.fill(1e7.toInt)(1.0) }.count()
The job succeeded but error messages got emitted to Spark shell:
Created https://issues.apache.org/jira/browse/SPARK-15010 for the reported issue.
    taskContext.registerAccumulator(this)
  }
} else {
  atDriverSide = true
Why is this assignment needed?
When the accumulator is sent back from executor to driver, we should set the atDriverSide flag.
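A simplified sketch of that round trip (hypothetical class, following the structure of the diff hunk above): the serialized atDriverSide value records where the accumulator came from, so flipping it in readObject records where it is now.

import java.io.ObjectInputStream

class LocationAwareSketch extends Serializable {
  private var atDriverSide: Boolean = true

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    if (atDriverSide) {
      // Deserialized on an executor: clear the flag (the real code also
      // registers with the running TaskContext here so updates are collected).
      atDriverSide = false
    } else {
      // Sent back from an executor: we are on the driver again.
      atDriverSide = true
    }
  }
}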
What changes were proposed in this pull request?
This PR introduces a new accumulator API which is much simpler than before:
- a single Accumulator class
- the initialValue and zeroValue concepts are combined into just one concept: zeroValue
- with one register method, the accumulator registration and cleanup registration are combined
- id, name and countFailedValues are combined into an AccumulatorMetadata, which is provided during registration

SQLMetric is a good example to show the simplicity of this new API (a rough sketch of the resulting shape follows below).
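For readers skimming the description, here is a hedged sketch of the shape described above (an illustration only: the method names isZero/reset/add/merge/localValue and the AccumulatorMetadata fields come from the diff hunks quoted in the conversation, but the exact signatures merged in this PR may differ):

// Illustrative stand-in mirroring the new API's shape; not Spark's class.
case class AccumulatorMetadata(id: Long, name: Option[String], countFailedValues: Boolean)

abstract class SketchAccumulator[IN, OUT] extends Serializable {
  private var metadata: AccumulatorMetadata = _

  // One register call carries all the metadata; it replaces the old separate
  // initialValue/zeroValue setup and cleanup registration.
  def register(meta: AccumulatorMetadata): Unit = { metadata = meta }

  def isZero: Boolean                                 // still at the zero value?
  def reset(): Unit                                   // go back to the zero value
  def add(v: IN): Unit                                // consume one input on an executor
  def merge(other: SketchAccumulator[IN, OUT]): Unit  // combine partial results
  def localValue: OUT                                 // produce the output value
}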
What we break:
- no setValue anymore. In the new API, the intermediate type can be different from the result type, so it's very hard to implement a general setValue.
Problems that need to be addressed in follow-ups:
- AccumulatorInfo doesn't make a lot of sense: the partial output is not partial updates, we need to expose the intermediate value.
- ExceptionFailure should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
- SparkListenerTaskEnd carries a TaskMetrics. Ideally this TaskMetrics doesn't need to carry external accumulators, as the only method of TaskMetrics that can access external accumulators is private[spark]. However, SQLListener uses it to retrieve SQL metrics.

How was this patch tested?
existing tests