-
The picture above shows a reduce stage. We have 13 machines, each with a T4 GPU, so 13 executors total (one executor per GPU), and each executor has 24 CPU cores. We set the shuffle partition count to 13, and there are indeed 13 tasks to handle those partitions. But why are the tasks not distributed equally across the executors? Some machines/executors sit idle while others are overloaded, which hurts performance.
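For reference, a minimal sketch of the setup described above (the app name and exact config values are assumptions reconstructed from the description, not the actual job config):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction of the reported cluster setup:
// 13 executors, one T4 each, 24 CPU cores per executor, 13 shuffle partitions.
val spark = SparkSession.builder()
  .appName("reduce-stage-skew-repro")
  .config("spark.executor.instances", "13")           // one executor per machine
  .config("spark.executor.cores", "24")                // 24 CPU cores per executor
  .config("spark.executor.resource.gpu.amount", "1")   // one GPU per executor
  .config("spark.sql.shuffle.partitions", "13")        // 13 reduce tasks total
  .getOrCreate()
```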
-
This is a Spark scheduling question, not a spark-rapids plugin question. I'm assuming you are allowing more than one task to run on the GPU (i.e. spark.task.resource.gpu.amount=1/24)? If so, then Spark is doing just fine. More than likely the reason is locality. Spark uses locality to decide where to put things where it thinks is most efficient, meaning the most data is local on that node so you don't have to transfer as much over the network. We were actually doing a little experimenting with adding an option to Spark to force it to spread, but in that particular case the performance was the same or worse. You may get some efficiencies from spreading the tasks by having more resources available, but at the same time you likely have to transfer more data over the network. Do you know that the tasks being put on a single executor caused a long tail? Perhaps you have spark.rapids.sql.concurrentGpuTasks < 7 so they couldn't all run on the GPU at once? Note that spark.rapids.sql.concurrentGpuTasks is a plugin config and is not directly tied into how Spark schedules executors.
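A minimal sketch of how the two knobs mentioned above relate (the values here are illustrative assumptions, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// spark.task.resource.gpu.amount controls how many tasks Spark will schedule
// per single-GPU executor (1/24 allows up to 24 concurrent tasks), while
// spark.rapids.sql.concurrentGpuTasks is a spark-rapids plugin config that
// caps how many of those tasks may hold the GPU semaphore at the same time.
val spark = SparkSession.builder()
  .config("spark.task.resource.gpu.amount", (1.0 / 24).toString) // 24 task slots per executor
  .config("spark.rapids.sql.concurrentGpuTasks", "2")            // but only 2 on the GPU at once
  .getOrCreate()
```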
-
@tgravescs Please correct me if I am wrong, but I think we did the spread experiments before we introduced the GPU Semaphore, and it might be nice to see if it is still the case now.
-
We did this test last week. The semaphore was there, but I think in this case it was set to something like 2 or 4, and we were only getting 2 tasks per executor, so both were still allowed to run.
-
Note: one thing you may try is setting spark.shuffle.reduceLocality.enabled=false.
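A sketch of one way to try that suggestion; the same config can also be passed via spark-submit with --conf spark.shuffle.reduceLocality.enabled=false:

```scala
import org.apache.spark.sql.SparkSession

// Disable the reduce-side locality preference so reduce tasks are not pulled
// toward the executors that hold most of the map output.
val spark = SparkSession.builder()
  .config("spark.shuffle.reduceLocality.enabled", "false")
  .getOrCreate()
```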
-
One issue you may run into is that spreading compute will also mean increasing network transfers, especially when turning off locality. But if there are fast links between executors (e.g. NVLink, or even PCIe, plus a high-bandwidth network), I could see that being less of an issue. I also saw this on benchmarks I was using, and we are aware of the issue, but we still need to formulate the best path forward.
-
@JustPlay are you running with …
-
yes
-
@JustPlay did you have a chance to try spark.shuffle.reduceLocality.enabled=false?
Beta Was this translation helpful? Give feedback.
this is a spark scheduling question not a spark-rapids plugin question.
I'm assuming you are allowing more then 1 task to run on the GPU? (ie spark.task.resource.gpu.amount=(1/24). If so then spark is doing just fine.
More than likely the reason is locality. Spark uses locality to decide where to put thing where it thinks is most efficient. Meaning the most data is local on that node so you don't have to transfer as much over the network.
We were actually doing a little experimenting with adding an option to spark to force it to spread, but in that particular case the performance was the same or worse. You may get some efficiencies in spreading them by having more resources there but at t…