[SPARK-29398][core] Support dedicated thread pools for RPC endpoints. #26059
Conversation
The current RPC backend in Spark supports single- and multi-threaded message delivery to endpoints, but they all share the same underlying thread pool, so an RPC endpoint that blocks a dispatcher thread can negatively affect other endpoints.

This can be more pronounced with configurations that limit the number of RPC dispatch threads based on settings and / or the running environment. And exposing the RPC layer to other code (for example with something like SPARK-29396) could make it easy to affect normal Spark operation with a badly written RPC handler.

This change adds a new RPC endpoint type that tells the RPC env to create dedicated dispatch threads, so that those effects are minimised. Other endpoints will still need CPU to process their messages, but they won't be able to actively block the dispatch threads of these isolated endpoints.

As part of the change, I've changed the most important Spark endpoints (the driver, executor and block manager endpoints) to be isolated from others. This means a couple of extra threads are created on the driver and executor for these endpoints.

Tested with existing unit tests, which hammer the RPC system extensively, and also by running applications on a cluster (with a prototype of SPARK-29396).
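To make the new endpoint type concrete, here is a minimal sketch of an endpoint opting into dedicated dispatch threads. It assumes only the IsolatedRpcEndpoint trait and the threadCount() default quoted in the review below; the EchoEndpoint class and its handler are hypothetical, and because these APIs are private[spark], such code has to live under the org.apache.spark package.

```scala
package org.apache.spark.rpc.example // assumption: the RPC traits are private[spark]

import org.apache.spark.rpc.{IsolatedRpcEndpoint, RpcCallContext, RpcEnv}

// Hypothetical endpoint that opts into a dedicated dispatcher pool instead of
// sharing the common dispatch threads with every other endpoint.
private[spark] class EchoEndpoint(override val rpcEnv: RpcEnv) extends IsolatedRpcEndpoint {

  // Ask the RPC env for two dedicated dispatch threads (the trait defaults to 1).
  // With more than one thread, the handler below must be safe to call concurrently.
  override def threadCount(): Int = 2

  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(msg) // trivial handler that never blocks the dispatch thread
  }
}
```

Registration is unchanged, e.g. rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv)); the only difference is which thread pool delivers the messages.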
Test build #111919 has finished for PR 26059 at commit
squito left a comment
I think this looks good, but want to check my understanding on one point
| private[spark] trait IsolatedRpcEndpoint extends RpcEndpoint {
|
|   /** How many threads to use for delivering messages. By default, use a single thread. */
|   def threadCount(): Int = 1
I'm trying to wrap my head around what happens if you create an IsolatedRpcEndpoint with threadCount() > 1, given the code in Inbox which checks for inheritance from ThreadSafeRpcEndpoint:
| if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
I guess if you expect one endpoint to be served by multiple threads, it makes sense you'd want Inbox.enableConcurrent = true and you'd have to make your endpoint safe for that -- but it's worth a comment here at least.
I have the same question as @squito. How do you deal with ThreadSafeRpcEndpoint?
We could leave Inbox.enableConcurrent = false even with threadCount() > 1, but then the extra threads would be wasted.
I already updated the comment. ThreadSafeRpcEndpoint is irrelevant here. You may even extend both if you want; but if you do that, either it does nothing (because the thread pool has a single thread), or you're doing it wrong (because the thread pool has multiple threads but you just want one).
So it's pointless to mix in both traits.
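As an aside for readers following this thread, the interaction can be illustrated with a standalone toy that is not Spark's Inbox but captures the same idea: when concurrent processing stays disabled (the ThreadSafeRpcEndpoint case), a second dispatch thread backs off immediately, so an isolated endpoint declaring threadCount() > 1 would leave its extra threads idle; enabling it lets several dedicated threads drain the same inbox, at the price of a thread-safe handler.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Toy illustration of the enableConcurrent idea (not Spark's Inbox implementation).
class ToyInbox[A](handler: A => Unit, enableConcurrent: Boolean) {
  private val messages = new ConcurrentLinkedQueue[A]()
  private var numActiveThreads = 0

  def post(msg: A): Unit = messages.add(msg)

  /** Called by each dispatch thread; drains messages until the queue is empty. */
  def process(): Unit = {
    val canProceed = synchronized {
      if (!enableConcurrent && numActiveThreads > 0) {
        // A second thread backs off when concurrency is disabled, which is why
        // extra dedicated threads would sit idle in that configuration.
        false
      } else {
        numActiveThreads += 1
        true
      }
    }
    if (canProceed) {
      try {
        var msg = messages.poll()
        while (msg != null) {
          handler(msg) // must be thread-safe when enableConcurrent = true
          msg = messages.poll()
        }
      } finally {
        synchronized { numActiveThreads -= 1 }
      }
    }
  }
}
```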
lgtm

retest this please
Test build #112192 has finished for PR 26059 at commit
| conf.get(EXECUTOR_ID).map { id =>
|   val role = if (id == SparkContext.DRIVER_IDENTIFIER) "driver" else "executor"
|   conf.getInt(s"spark.$role.rpc.netty.dispatcher.numThreads", modNumThreads)
I'm afraid that some thread resources could be wasted if a user keeps their original config here and upgrades Spark without realizing this PR's change, since they may have sized it previously with the driver, block manager endpoints, etc. in mind.
You'll be "wasting" at most 2 threads, which is not a big deal. If they weren't really needed, they'll just sit there doing nothing. Spark creates many other threads that don't do much; this will just be noise.
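For anyone tuning this after upgrading, the role-specific keys read by the diff above can be set like any other Spark property; a sketch with arbitrary example values (spark.rpc.netty.dispatcher.numThreads remains the generic fallback):

```scala
import org.apache.spark.SparkConf

object DispatcherTuningExample {
  // Arbitrary example values; the role-specific keys come from the diff quoted above.
  val conf: SparkConf = new SparkConf()
    .set("spark.driver.rpc.netty.dispatcher.numThreads", "8")
    .set("spark.executor.rpc.netty.dispatcher.numThreads", "4")
}
```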
|   setActive(inbox)
| }
|
| override def unregister(endpointName: String): Unit = synchronized {
Should this be an idempotent method?
Dispatcher makes sure only to call this once.
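As a small aside, if that call-once guarantee ever changed, an idempotent unregister falls out naturally from a concurrent map; a standalone sketch, not the actual Dispatcher code:

```scala
import java.util.concurrent.ConcurrentHashMap

// Standalone sketch of an idempotent unregister (not the actual Dispatcher code).
class ToyRegistry[V <: AnyRef] {
  private val endpoints = new ConcurrentHashMap[String, V]()

  def register(name: String, value: V): Unit = endpoints.put(name, value)

  /** Safe to call more than once: a second call simply finds nothing to remove. */
  def unregister(name: String): Unit = {
    val removed = endpoints.remove(name)
    if (removed != null) {
      // tear down resources for `removed` exactly once here
    }
  }
}
```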
merged to master