-
I am currently running the spark-rapids-ml benchmarks in standalone mode on a single node with 1 V100 GPU. The RAPIDS version I used is 24.04a, as I could not compile 24.02 for Spark 3.5.1. Here is my Spark configuration:

```bash
common_confs=$(
cat <<EOF
--conf spark.sql.execution.arrow.pyspark.enabled=true \
--conf spark.sql.execution.arrow.maxRecordsPerBatch=$arrow_batch_size \
--conf spark.python.worker.reuse=true \
--conf spark.master=spark://master:7077 \
--conf spark.driver.memory=300g \
--conf spark.executor.cores=6 \
--conf spark.executor.memory=128G \
--conf spark.rapids.ml.uvm.enabled=true
EOF
)
spark_rapids_confs=$(
cat <<EOF
--conf spark.executor.extraJavaOptions="-Duser.timezone=UTC" \
--conf spark.driver.extraJavaOptions="-Duser.timezone=UTC" \
--conf spark.executorEnv.PYTHONPATH=${rapids_jar} \
--conf spark.sql.files.minPartitionNum=${num_gpus} \
--conf spark.rapids.memory.gpu.minAllocFraction=0.0001 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.locality.wait=0s \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.rapids.sql.explain=ALL \
--conf spark.sql.execution.sortBeforeRepartition=false \
--conf spark.rapids.sql.format.parquet.reader.type=MULTITHREADED \
--conf spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel=20 \
--conf spark.rapids.sql.multiThreadedRead.numThreads=20 \
--conf spark.rapids.sql.python.gpu.enabled=true \
--conf spark.rapids.memory.pinnedPool.size=100G \
--conf spark.python.daemon.module=rapids.daemon \
--conf spark.rapids.sql.batchSizeBytes=512m \
--conf spark.sql.adaptive.enabled=false \
--conf spark.sql.files.maxPartitionBytes=2000000000000 \
--conf spark.rapids.sql.concurrentGpuTasks=2 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.166 \
--conf spark.executorEnv.UCX_ERROR_SIGNALS="" \
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
--conf spark.rapids.shuffle.manager=com.nvidia.spark.rapids.spark351.RapidsShuffleManager \
--conf spark.jars=${rapids_jar}
EOF
)
```
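For reference, these two conf strings are then expanded into the benchmark submit command. A rough, placeholder sketch of that invocation (the runner script name and arguments here are illustrative; the exact script I used is in the gist linked further down):

```bash
# Placeholder invocation only -- see the gist below for the real script.
python ./benchmark/benchmark_runner.py kmeans \
    --train_path "${train_path}" \
    --num_gpus 1 \
    ${common_confs} \
    ${spark_rapids_confs}
```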
I can run the benchmarks using a small dataset (~1 GB), but when I use a 16 GB dataset with KMeans, it seems to hang in this section:

```python
# spark-rapids-ml/benchmark/benchmark/bench_kmeans.py:171
# count doesn't trigger compute so do something not too compute intensive
_, transform_time = with_benchmark(
    "gpu transform", lambda: transformed_df.agg(sum(output_col)).collect()
)
```
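(For context, with_benchmark is just a small timing helper; the sketch below is my paraphrase of what it does, not the exact code from the repo.)

```python
import time

def with_benchmark(label, action):
    # Run the action, report how long it took, and return (result, seconds).
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print(f"{label} took {elapsed:.2f} sec")
    return result, elapsed
```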
When I terminate the program I get this error: Error. This is the log for the executor: Executor log.

I assumed the problem was with the shuffle stage (which is the original reason I asked here instead of filing an issue against spark-rapids-ml), so I tried removing the UCX and shuffle manager parts of the configuration, but it is still taking a long time, and I have no idea whether it is still running successfully or whether it crashed internally somewhere. With my old benchmark it would take less than 20 minutes on a 16 GB dataset, but it is still running with the … For linear regression, it seems like it got to the collect stage successfully, unlike with KMeans, but I did get …

Here is the gist containing the thread dumps for the executor and driver, the heap histogram of the executor, and the script I used to run the benchmark: https://gist.github.com/an-ys/8962fbbae2cb8909d480b249eacf9244
Update 2: I did get an error (coming from RMM, I am guessing?) in GPU-ETL mode when running the KMeans application locally. I will check whether I can get standalone mode to work for the KMeans benchmark with SQL disabled.
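By "SQL is disabled" I mean turning off the GPU SQL (ETL) acceleration while keeping the rest of the setup the same; roughly one of these two changes (which one I end up using is still to be decided):

```bash
# Option A: keep the RAPIDS plugin loaded but disable SQL acceleration
--conf spark.rapids.sql.enabled=false

# Option B: don't load the SQL plugin at all, i.e. drop this line from the confs:
# --conf spark.plugins=com.nvidia.spark.SQLPlugin
```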
-
I decided to close this since I seem to have solved the problem by reducing the target batch size.
-
@an-ys The first error that you got appears to be the out-of-memory killer kicking in and shooting your process. That is related to running out of host memory, not GPU memory. You also ran out of GPU memory, which is what showed up in your second stack trace.

Right now we handle running out of GPU memory much better than running out of CPU memory. For GPU memory we have limits, and we end up spilling data or pausing threads to make it work. It is not 100% perfect, but it does work rather well. For the CPU we are still in the process of making that work. The plan is to apply the same strategies that we use for GPU memory, but it is not done yet. And this only covers memory limits on the Java side of things, not the Python side.

Host memory limits are a little harder to debug. Right now we typically use as much off-heap memory as we can get away with. Most of the time this is not a problem, but occasionally we can run out. Because we use off-heap memory, you need to add some overhead to the limits to account for the extra. I am not sure what you set your memory overhead to be, especially with Python, but you might want to increase it instead of dropping the target batch size. Another option is to decrease the number of threads in a worker.

In the query that I saw, we use host memory to buffer data read in from the file system before processing it on the GPU. We also use it to transfer data to/from the Python process, and finally we use it for storing spill/shuffle data. The shuffle/spill pool is limited, and we end up spilling to disk if we run out of it. It is only used for shuffle if UCX shuffle is enabled; if it is not, then off-heap host memory is used temporarily while we copy the shuffle data back to the host and then serialize it out to the heap in a format that the default Spark shuffle can handle. A few strategies along those lines can help reduce the host memory usage.
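Concretely, those knobs map to Spark/RAPIDS configs roughly like this (the values below are illustrative placeholders, not tuned recommendations):

```bash
# Illustrative values only -- tune for the actual node and workload.
# More off-heap headroom for the executor (covers native and Python memory):
--conf spark.executor.memoryOverhead=32g \
# Fewer concurrent tasks per executor means fewer host-side buffers in flight:
--conf spark.executor.cores=4 \
# The multithreaded Parquet reader buffers on the host too; fewer threads helps:
--conf spark.rapids.sql.multiThreadedRead.numThreads=8
```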
@eordentlich since this is an ML benchmark, do you have some suggestions that are specific to how you run them?