-
The picture above shows a reduce stage. We have 13 machines, each with a T4 GPU, so 13 executors total (one executor per GPU), and each executor has 24 CPU cores. We set the shuffle partition count to 13, and there are indeed 13 tasks to handle those partitions. But why are the tasks not distributed equally across the executors? Some machines/executors sit idle while others are overloaded, which hurts performance.
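For reference, a minimal sketch of the setup described above (the app name and exact config values are assumptions reconstructed from the description, not the actual job config):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction of the reported cluster setup:
// 13 executors, one T4 each, 24 CPU cores per executor, 13 shuffle partitions.
val spark = SparkSession.builder()
  .appName("reduce-stage-skew-repro")
  .config("spark.executor.instances", "13")           // one executor per machine
  .config("spark.executor.cores", "24")                // 24 CPU cores per executor
  .config("spark.executor.resource.gpu.amount", "1")   // one GPU per executor
  .config("spark.sql.shuffle.partitions", "13")        // 13 reduce tasks total
  .getOrCreate()
```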
-
This is a Spark scheduling question, not a spark-rapids plugin question. I'm assuming you are allowing more than one task to run on the GPU (i.e. spark.task.resource.gpu.amount=1/24)? If so, then Spark is doing just fine. More than likely the reason is locality. Spark uses locality to decide where to put things where it thinks is most efficient, meaning the most data is local on that node so you don't have to transfer as much over the network. We were actually doing a little experimenting with adding an option to Spark to force it to spread, but in that particular case the performance was the same or worse. You may get some efficiencies from spreading the tasks by having more resources available, but at the same time you likely have to transfer more data over the network. Do you know that the tasks being put on a single executor caused a long tail? Perhaps you have spark.rapids.sql.concurrentGpuTasks < 7 so they couldn't all run on the GPU at once? Note that spark.rapids.sql.concurrentGpuTasks is a plugin config and is not directly tied into how Spark schedules executors.
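A minimal sketch of how the two knobs mentioned above relate (the values here are illustrative assumptions, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// spark.task.resource.gpu.amount controls how many tasks Spark will schedule
// per single-GPU executor (1/24 allows up to 24 concurrent tasks), while
// spark.rapids.sql.concurrentGpuTasks is a spark-rapids plugin config that
// caps how many of those tasks may hold the GPU semaphore at the same time.
val spark = SparkSession.builder()
  .config("spark.task.resource.gpu.amount", (1.0 / 24).toString) // 24 task slots per executor
  .config("spark.rapids.sql.concurrentGpuTasks", "2")            // but only 2 on the GPU at once
  .getOrCreate()
```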
-
@tgravescs Please correct me if I am wrong, but I think we did the spread experiments before we introduced the GPU Semaphore, and it might be nice to see if it is still the case now.
-
We did this test last week. The semaphore was there, but I think in this case it was set to something like 2 or 4, and we were only getting 2 tasks per executor, so both were still allowed to run.
-
Note: one thing you may try is setting spark.shuffle.reduceLocality.enabled=false.
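A sketch of one way to try that suggestion; the same config can also be passed via spark-submit with --conf spark.shuffle.reduceLocality.enabled=false:

```scala
import org.apache.spark.sql.SparkSession

// Disable the reduce-side locality preference so reduce tasks are not pulled
// toward the executors that hold most of the map output.
val spark = SparkSession.builder()
  .config("spark.shuffle.reduceLocality.enabled", "false")
  .getOrCreate()
```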
-
One issue you may run into is that spreading compute will also mean increasing network transfers, especially when turning off locality. But if there are fast links between executors (e.g. NVLink, or even PCIe, plus a high-bandwidth network), I could see that being less of an issue. I also saw this on benchmarks I was using, and we are aware of the issue, but we still need to formulate the best path forward.
-
@JustPlay are you running with …
-
yes
-
@JustPlay did you have a chance to try spark.shuffle.reduceLocality.enabled=false?
Beta Was this translation helpful? Give feedback.
this is a spark scheduling question not a spark-rapids plugin question.
I'm assuming you are allowing more then 1 task to run on the GPU? (ie spark.task.resource.gpu.amount=(1/24). If so then spark is doing just fine.
More than likely the reason is locality. Spark uses locality to decide where to put thing where it thinks is most efficient. Meaning the most data is local on that node so you don't have to transfer as much over the network.
We were actually doing a little experimenting with adding an option to spark to force it to spread, but in that particular case the performance was the same or worse. You may get some efficiencies in spreading them by having more resources there but at t…