Spark RAPIDS takes a long time to initialise GPU memory #11423
Hi all, when testing Spark RAPIDS, I ran into a problem where the executors always take exactly 20 minutes to initialise the GPU memory. Here is what I see from the executors' log: Details of the hardware I'm testing now:
Replies: 2 comments
Version 22.02 does not officially support H100 GPUs; official H100 support starts with version 23.06. The long delay is caused by the driver JIT-compiling all the cudf GPU kernels from the PTX code for H100 so they can run on that GPU. There are a ton of kernels, so this takes a very long time. I strongly suggest updating to a more recent version, such as 24.06.1, which should fix the long startup delay. Download details for 24.06.1 can be found here: https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/docs/download.md
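To illustrate why a missing architecture leads to a JIT step, here is a minimal Python sketch of the general CUDA compatibility rules: a precompiled binary (SASS) built for sm_XY runs on devices with the same major version and an equal or higher minor version, while embedded PTX can be JIT-compiled by the driver for any device of equal or newer compute capability. H100 is compute capability 9.0. The architecture lists below are hypothetical placeholders chosen for illustration, not the actual build targets of any spark-rapids or cudf release, and the function names are made up for this sketch.

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (9, 0) for H100


def has_native_binary(device: Capability, sass_archs: Iterable[Capability]) -> bool:
    """True if some precompiled binary (SASS) can run on the device as-is.

    Rule modelled: code built for sm_XY runs on devices with the same
    major version X and a minor version >= Y.
    """
    return any(
        major == device[0] and minor <= device[1] for major, minor in sass_archs
    )


def needs_ptx_jit(
    device: Capability,
    sass_archs: Iterable[Capability],
    ptx_archs: Iterable[Capability],
) -> bool:
    """True if the driver has to JIT-compile PTX before kernels can run."""
    if has_native_binary(device, sass_archs):
        return False
    # PTX for compute_XY can be JIT-compiled for any capability >= X.Y.
    if any(arch <= device for arch in ptx_archs):
        return True
    raise RuntimeError("no compatible SASS or PTX for this device")


H100: Capability = (9, 0)

# Hypothetical arch lists (assumptions, not real release build targets).
OLD_SASS = [(6, 0), (7, 0), (7, 5), (8, 0), (8, 6)]
OLD_PTX = [(8, 6)]
NEW_SASS = OLD_SASS + [(9, 0)]
NEW_PTX = [(9, 0)]

print(needs_ptx_jit(H100, OLD_SASS, OLD_PTX))  # True: falls back to PTX JIT
print(needs_ptx_jit(H100, NEW_SASS, NEW_PTX))  # False: native sm_90 binary
```

In the first case every kernel has to go through the driver's JIT compiler on startup, which matches the delay described above; in the second case the kernels load directly.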
I have updated the RAPIDS version to 24.06.1 and the 20-minute delay is now gone. Thanks @jlowe for the help.