Conversation

@vanzin (Contributor) commented Oct 8, 2019

The current RPC backend in Spark supports single- and multi-threaded
message delivery to endpoints, but they all share the same underlying
thread pool. So an RPC endpoint that blocks a dispatcher thread can
negatively affect other endpoints.

This can be more pronounced in setups that limit the number of RPC
dispatch threads based on configuration and / or the running
environment. And exposing the RPC layer to other code (for example
with something like SPARK-29396) could make it easy to affect normal
Spark operation with a badly written RPC handler.

This change adds a new RPC endpoint type that tells the RPC env to
create dedicated dispatch threads, so that those effects are minimised.
Other endpoints will still need CPU to process their messages, but
they won't be able to actively block the dispatch thread of these
isolated endpoints.

As part of the change, I've changed the most important Spark endpoints
(the driver, executor and block manager endpoints) to be isolated from
others. This means a couple of extra threads are created on the driver
and executor for these endpoints.

Tested with existing unit tests, which hammer the RPC system extensively,
and also by running applications on a cluster (with a prototype of
SPARK-29396).
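
To make the new endpoint type concrete, here is a stand-alone sketch of what it looks like to implementors. This is a simplification: the real trait is private[spark] and extends RpcEndpoint, and the example endpoint name is made up.

```scala
// Stand-in for the IsolatedRpcEndpoint trait added by this PR (simplified;
// no RpcEnv wiring, and the real trait extends RpcEndpoint).
trait IsolatedRpcEndpoint {
  /** How many threads to use for delivering messages. One by default. */
  def threadCount(): Int = 1
}

// Hypothetical endpoint that opts into two dedicated dispatch threads, so a
// blocking handler on another endpoint cannot stall its message delivery.
class HeartbeatLikeEndpoint extends IsolatedRpcEndpoint {
  override def threadCount(): Int = 2
}
```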

SparkQA commented Oct 9, 2019

Test build #111919 has finished for PR 26059 at commit 3316d5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DriverEndpoint extends IsolatedRpcEndpoint with Logging

@squito (Contributor) left a comment

I think this looks good, but want to check my understanding on one point

private[spark] trait IsolatedRpcEndpoint extends RpcEndpoint {

/** How many threads to use for delivering messages. By default, use a single thread. */
def threadCount(): Int = 1
Contributor

I'm trying to wrap my head around what happens if you create an IsolatedRpcEndpoint with threadCount() > 1, given the code in Inbox which checks for inheritance from ThreadSafeRpcEndpoint:

if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {

I guess if you expect one endpoint to be served by multiple threads, it makes sense that you'd want Inbox.enableConcurrent = true, and you'd have to make your endpoint safe for that -- but it's worth a comment here at least.
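
For reference, the interaction being asked about can be modeled with stand-in traits (not Spark's real classes; the actual check lives in org.apache.spark.rpc.netty.Inbox):

```scala
// Minimal stand-ins for the traits under discussion.
trait RpcEndpoint
trait ThreadSafeRpcEndpoint extends RpcEndpoint
trait IsolatedRpcEndpoint extends RpcEndpoint { def threadCount(): Int = 1 }

// Sketch of the Inbox check quoted above: concurrent delivery is allowed
// exactly when the endpoint does not claim to need serial delivery.
def enableConcurrent(endpoint: RpcEndpoint): Boolean =
  !endpoint.isInstanceOf[ThreadSafeRpcEndpoint]

// An isolated endpoint asking for several threads therefore gets concurrent
// delivery and must be internally thread-safe.
class MultiThreadedEndpoint extends IsolatedRpcEndpoint {
  override def threadCount(): Int = 4
}
```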

Member

I have the same question as @squito: how do you deal with ThreadSafeRpcEndpoint?

Though we could set Inbox.enableConcurrent = false when threadCount() > 1, the extra threads would then be wasted.

Contributor Author

I already updated the comment. ThreadSafeRpcEndpoint is irrelevant here. You may even extend both if you want; but if you do, either it does nothing (because the thread pool has a single thread) or you're doing it wrong (because the thread pool has multiple threads but you only want one).

So it's pointless to mix in both traits.
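
The point can be demonstrated with stand-in traits (not Spark's real classes, just a sketch of the two cases described):

```scala
// Minimal stand-ins illustrating why mixing the two traits is pointless.
trait RpcEndpoint
trait ThreadSafeRpcEndpoint extends RpcEndpoint
trait IsolatedRpcEndpoint extends RpcEndpoint { def threadCount(): Int = 1 }

// With the default single thread, delivery is already serial, so the
// ThreadSafeRpcEndpoint mix-in adds nothing.
class RedundantMix extends IsolatedRpcEndpoint with ThreadSafeRpcEndpoint

// With several threads you are explicitly asking for concurrent delivery,
// which contradicts what ThreadSafeRpcEndpoint is meant to signal.
class ContradictoryMix extends IsolatedRpcEndpoint with ThreadSafeRpcEndpoint {
  override def threadCount(): Int = 4
}
```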

@squito (Contributor) commented Oct 11, 2019

lgtm

@vanzin (Contributor, Author) commented Oct 16, 2019

retest this please

SparkQA commented Oct 17, 2019

Test build #112192 has finished for PR 26059 at commit b674b4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

conf.get(EXECUTOR_ID).map { id =>
val role = if (id == SparkContext.DRIVER_IDENTIFIER) "driver" else "executor"
conf.getInt(s"spark.$role.rpc.netty.dispatcher.numThreads", modNumThreads)
Member

I'm afraid some thread resources could be wasted if a user keeps the original config here and upgrades Spark without realizing this change, since they may previously have sized it with the driver, block manager endpoints, etc. in mind.

Contributor Author

You'll be "wasting" at most two threads, which is not a big deal. If they aren't really needed, they'll just sit there doing nothing. Spark creates many other threads that don't do much; this will just be noise.
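
For context, the lookup in the quoted snippet behaves roughly like this stand-alone sketch (a plain Map stands in for SparkConf, the helper name is made up, and modNumThreads in the diff is the computed default):

```scala
// Sketch of the per-role dispatcher thread-count lookup quoted above.
// Falls back to the computed default when the user has not set the key.
def dispatcherNumThreads(role: String,
                         conf: Map[String, String],
                         default: Int): Int =
  conf.get(s"spark.$role.rpc.netty.dispatcher.numThreads")
    .map(_.toInt)
    .getOrElse(default)
```

A previously tuned value still applies to the shared pool after the upgrade, which is why at most a couple of now-dedicated threads can end up idle.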

setActive(inbox)
}

override def unregister(endpointName: String): Unit = synchronized {
Member

Should this method be idempotent?

Contributor Author

The Dispatcher makes sure to call this only once.

@asfgit asfgit closed this in 2f0a38c Oct 17, 2019
@squito (Contributor) commented Oct 17, 2019

merged to master
