spark-rapids-ml and spark-rapids Accelerator CUDARuntimeError: cudaErrorMemoryAllocation #552

Open
nsaman opened this issue Jan 23, 2024 · 7 comments


nsaman commented Jan 23, 2024

I'm using the RAPIDS Accelerator and spark-rapids-ml in conjunction and am facing the error below. If the RAPIDS Accelerator is disabled, the job runs successfully. The documentation implies the two should be able to work together: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/ml-integration.html#existing-ml-libraries.

Is there something I'm missing?
spark.rapids.memory.gpu.pool=NONE seems to be the only suggested setting for avoiding memory conflicts between the two.
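
For context, the ML side is invoked roughly like this (an illustrative sketch only; the estimator and parameters here are arbitrary, and our real code goes through pandas_udfs):

from spark_rapids_ml.clustering import KMeans

# df is a Spark DataFrame with a "features" column (names here are illustrative)
kmeans = KMeans(k=8).setFeaturesCol("features")
model = kmeans.fit(df)  # fails with cudaErrorMemoryAllocation when the plugin is enabled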

Environment:
Docker running on AWS SageMaker (ml.p3.2xlarge) (base: nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10)

Stacktrace

2024-01-23 01:29:21,462 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 76.0 (TID 382) (algo-2 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 698, in _train_udf
    if cuda_managed_mem_enabled:
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 383, in _set_gpu_device
    cupy.cuda.Device(gpu_id).use()
  File "cupy/cuda/device.pyx", line 192, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 198, in cupy.cuda.device.Device.use
  File "cupy_backends/cuda/api/runtime.pyx", line 375, in cupy_backends.cuda.api.runtime.setDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:50)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:670)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:424)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:259)
2024-01-23 01:29:21.480 INFO clientserver - close: Closing down clientserver connection
2024-01-23 01:29:21.487 INFO ModelService - process: Exception during processing: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Dockerfile

FROM rapidsai/base:23.12-cuda11.2-py3.10

USER root

RUN apt-get update
RUN apt-get install -y openjdk-8-jdk curl zip unzip
# Fix certificate issues
RUN apt-get update && \
    apt-get install ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f;
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

# Install Hadoop
ENV HADOOP_VERSION 3.0.0
ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$HADOOP_HOME/bin
RUN curl -sL --retry 3 \
  "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
  | gunzip \
  | tar -x -C /usr/ \
 && rm -rf $HADOOP_HOME/share/doc \
 && chown -R root:root $HADOOP_HOME

# Install Spark
ENV SPARK_VERSION 3.2.0
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-without-hadoop
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -sL --retry 3 \
  "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
  | gunzip \
  | tar x -C /usr/ \
 && mv /usr/$SPARK_PACKAGE $SPARK_HOME \
 && chown -R root:root $SPARK_HOME

# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1

# Point Spark at proper python binary
ENV PYSPARK_PYTHON=/opt/conda/bin/python
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python
# Setup Spark/Yarn/HDFS user as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"

RUN curl -s https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.1/rapids-4-spark_2.12-23.12.1.jar -o ${SPARK_HOME}/jars/rapids-4-spark_2.12-23.12.1.jar
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*

# set up source code
COPY src /opt/program
RUN mkdir -p /opt/module/py_files
WORKDIR /opt/program
RUN zip -r /opt/module/py_files/short-term-model.zip .

# Set up bootstrapping program and Spark configuration
COPY configuration/program /opt/program
RUN chmod +x /opt/program/submit
COPY configuration/hadoop-config /opt/hadoop-config

# make output folder for spark history logs
RUN mkdir -p /opt/ml/processing/output

RUN pip3 install psutil

WORKDIR $SPARK_HOME

ENV CUDA_VISIBLE_DEVICES 0
RUN export CUDA_VISIBLE_DEVICES=0

ENV LD_LIBRARY_PATH=/usr/local/cuda-11.2/compat/:/usr/local/cuda-11.2/lib64:${LD_LIBRARY_PATH}
RUN export LD_LIBRARY_PATH=/usr/local/cuda-11.2/compat/:/usr/local/cuda-11.2/lib64:${LD_LIBRARY_PATH}

ENTRYPOINT ["/opt/program/submit", "/opt/program/processor.py"]

requirements.txt

findspark
pyspark==3.2.0
statsmodels
scikit-learn>=1.2.1
spark_rapids_ml==23.12.0

spark-defaults.conf (note: some values are placeholder variables)

spark.driver.host=sd_host
spark.driver.memory=driver_mem
spark.yarn.am.cores=driver_cores
spark.executor.memory=exec_mem
spark.executor.cores=exec_cores
spark.task.cpus=task_cores
spark.executor.instances=exec_instances
spark.driver.maxResultSize=max_result_size
spark.executor.memoryOverhead=exec_overhead
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.default.parallelism=shuffle_partitions
spark.sql.shuffle.partitions=shuffle_partitions
spark.sql.files.maxPartitionBytes=256m
spark.sql.execution.arrow.pyspark.enabled=true
spark.sql.execution.arrow.maxRecordsPerBatch=3000
spark.sql.execution.arrow.pyspark.fallback.enabled=true
spark.network.timeout=900s
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=8g
spark.executor.pyspark.memory=22g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=1g
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/ml/processing/output
spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps

; https://nvidia.github.io/spark-rapids/docs/configs.html
spark.jars=/usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.concurrentGpuTasks=2
spark.rapids.sql.explain=NOT_ON_GPU
; https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#running-on-yarn
spark.executor.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.driver.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.executor.resource.gpu.amount=1
spark.driver.resource.gpu.amount=1
spark.task.resource.gpu.amount=.5
spark.rapids.memory.gpu.pool=NONE

eordentlich (Collaborator) commented:

Thanks for the detailed info. Can you try this setting:
spark.rapids.memory.gpu.pooling.enabled=false
instead of
spark.rapids.memory.gpu.pool=NONE?
I know the docs say the former is deprecated in favor of the latter, but it looks like there might be a problem with the latter setting.
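
In spark-defaults.conf terms, the swap would look like:

# spark.rapids.memory.gpu.pool=NONE            <- remove or comment out
spark.rapids.memory.gpu.pooling.enabled=false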

eordentlich (Collaborator) commented:

Actually, the above is unlikely to make a difference in your case. It only appears to be an issue when spark-rapids-enabled Python workers are configured, and that is not the case in your configs.

Is the training data for the ML part large? You can run nvidia-smi -l 1 to monitor GPU memory as the job runs with the plugin disabled. Usage might be sufficiently close to device capacity that adding the plugin, under default settings, uses up all GPU memory.
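
For example (log path is just an illustration):

# sample GPU memory once per second, with the plugin disabled, and append to a log
nvidia-smi -l 1 >> /tmp/gpu_usage.log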

nsaman (Author) commented Jan 24, 2024

> Thanks for the detailed info. Can you try this setting: spark.rapids.memory.gpu.pooling.enabled=false instead of spark.rapids.memory.gpu.pool=NONE? I know the docs say the former is deprecated in favor of the latter, but it looks like there might be a problem with the latter setting.

Tested with this; as you predicted, it did not fix the issue.

With nvidia-smi -l 1 added, I can see GPU memory in use even though no process is listed (see below).

Could this be caused by exclusive locks on the GPU?

Error log from the driver (note the timestamps):

2024-01-24T06:40:17.669Z 06:40:16.873 [task-result-getter-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 88.0 (TID 338) (algo-2 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/worker.py", line 1247, in main
    process()
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/worker.py", line 1239, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/sql/pandas/serializers.py", line 470, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/sql/pandas/serializers.py", line 100, in dump_stream
    for batch in iterator:
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/sql/pandas/serializers.py", line 463, in init_stream_yield_batches
    for series in iterator:
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1706078261064_0001/container_1706078261064_0001_01_000002/pyspark.zip/pyspark/worker.py", line 931, in func
    for result_batch, result_type in result_iter:
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 696, in _train_udf
    _CumlCommon._set_gpu_device(context, is_local)
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 383, in _set_gpu_device
    cupy.cuda.Device(gpu_id).use()
  File "cupy/cuda/device.pyx", line 192, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 198, in cupy.cuda.device.Device.use
  File "cupy_backends/cuda/api/runtime.pyx", line 375, in cupy_backends.cuda.api.runtime.setDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
    at org.apache.spark.sql.rapids.execution.python.shims.GpuPythonArrowOutput$$anon$1.read(GpuPythonArrowShims.scala:157)
    at org.apache.spark.sql.rapids.execution.python.shims.GpuPythonArrowOutput$$anon$1.read(GpuPythonArrowShims.scala:102)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:751)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)

nvidia-smi -l 1 (from the executor node; ~786MiB has been in use since Wed Jan 24 06:38:43 2024), captured via:

import subprocess  # log rolling nvidia-smi samples to the processing output dir
subprocess.Popen([f'nvidia-smi -l 1 >> /opt/ml/processing/output/{current_host}'], shell=True, stdin=None, stdout=None, stderr=None, close_fds=True)

Wed Jan 24 06:40:15 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    47W / 300W |    784MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


Wed Jan 24 06:40:16 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    47W / 300W |    784MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


Wed Jan 24 06:40:17 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    47W / 300W |    646MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


Wed Jan 24 06:40:18 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    47W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Spark conf with actual values, since it's impossible to guess them otherwise:

[('spark.eventLog.enabled', 'true'),
 ('spark.yarn.dist.pyFiles', 'file:///opt/module/py_files/short-term-model.zip'),
 ('spark.memory.offHeap.size', '8g'),
 ('spark.task.cpus', '4'),
 ('spark.repl.local.jars', 'file:///usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar'),
 ('spark.rapids.memory.gpu.pool', 'NONE'),
 ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
 ('spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE', 'docker'),
 ('spark.executor.memory', '18g'),
 ('spark.executorEnv.PYTHONPATH', '{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9.7-src.zip<CPS>{{PWD}}/short-term-model.zip'),
 ('spark.executor.memoryOverhead', '10g'),
 ('spark.driver.host', '10.0.203.204'),
 ('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
 ('spark.sql.execution.arrow.pyspark.enabled', 'true'),
 ('spark.app.submitTime', '1706078268072'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pooling.enabled', 'false'),
 ('spark.sql.execution.arrow.maxRecordsPerBatch', '3000'),
 ('spark.executor.id', 'driver'),
 ('spark.rapids.memory.gpu.pooling.enabled', 'false'),
 ('spark.task.resource.gpu.amount', '.5'),
 ('spark.plugins', 'com.nvidia.spark.SQLPlugin'),
 ('spark.default.parallelism', '25'),
 ('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true'),
 ('spark.ui.proxyBase', '/proxy/application_1706078261064_0001'),
 ('spark.sql.shuffle.partitions', '25'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pool', 'NONE'),
 ('spark.executor.extraJavaOptions', '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"'),
 ('spark.rapids.sql.multiThreadedRead.numThreads', '20'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.multiThreadedRead.numThreads', '20'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', 'http://algo-1:8088/proxy/application_1706078261064_0001'),
 ('spark.rapids.sql.explain', 'NOT_ON_GPU'),
 ('spark.driver.resource.gpu.amount', '1'),
 ('spark.yarn.dist.jars', 'file:///usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar'),
 ('spark.app.name', 'SP-DSE'),
 ('spark.jars', 'file:/usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar'),
 ('spark.submit.pyFiles', '/opt/module/py_files/short-term-model.zip'),
 ('spark.executor.resource.gpu.discoveryScript', '/opt/program/discover_gpus.sh'),
 ('spark.yarn.appMasterEnv.SPARK_HOME', '/dev/null'),
 ('spark.memory.offHeap.enabled', 'true'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.driver.user.timezone', 'Z'),
 ('spark.app.startTime', '1706078270443'),
 ('spark.yarn.am.cores', '8'),
 ('spark.driver.extraJavaOptions', '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.explain', 'NOT_ON_GPU'),
 ('spark.driver.maxResultSize', '4096m'),
 ('spark.executor.instances', '2'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.resource.gpu.discoveryScript', '/opt/program/discover_gpus.sh'),
 ('spark.executorEnv.SPARK_HOME', '/dev/null'),
 ('spark.sql.extensions', 'com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin,com.nvidia.spark.rapids.optimizer.SQLOptimizerPlugin'),
 ('spark.rapids.sql.python.gpu.enabled', 'false'),
 ('spark.driver.port', '43249'),
 ('spark.yarn.secondary.jars', 'rapids-4-spark_2.12-23.12.1.jar'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.concurrentGpuTasks', '2'),
 ('spark.executor.cores', '8'),
 ('spark.submit.deployMode', 'client'),
 ('spark.rapids.driver.user.timezone', 'Z'),
 ('spark.sql.files.maxPartitionBytes', '256m'),
 ('spark.app.id', 'application_1706078261064_0001'),
 ('spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.python.gpu.enabled', 'false'),
 ('spark.eventLog.dir', '/opt/ml/processing/output'),
 ('spark.sql.adaptive.enabled', 'true'),
 ('spark.sql.adaptive.skewJoin.enabled', 'true'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'algo-1'),
 ('spark.executor.resource.gpu.amount', '1'),
 ('spark.kryo.registrator', 'com.nvidia.spark.rapids.GpuKryoRegistrator'),
 ('spark.driver.memory', '18g'),
 ('spark.master', 'yarn'),
 ('spark.rapids.sql.concurrentGpuTasks', '2'),
 ('spark.kryoserializer.buffer.max', '1g'),
 ('spark.executor.pyspark.memory', '22g'),
 ('spark.rdd.compress', 'True'),
 ('spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE', 'docker'),
 ('spark.yarn.isPython', 'true'),
 ('spark.driver.appUIAddress', 'http://10.0.203.204:4040'),
 ('spark.network.timeout', '900s'),
 ('spark.app.initial.jar.urls', 'spark://10.0.203.204:43249/jars/rapids-4-spark_2.12-23.12.1.jar')]

eordentlich (Collaborator) commented:

Not sure what might be going on. Your GPU looks to be configured in default compute mode, so multiple processes (the JVM for the plugin and Python for spark-rapids-ml in this case) can share the GPU. Can you clarify whether you are using the CUDA 12.0 or 11.2 base RAPIDS image? Your Dockerfile has 11.2, while your description and second nvidia-smi output indicate 12.0.
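
A quick way to confirm which CUDA runtime the Python side actually sees (a sketch; cupy is already present in the RAPIDS base images):

import cupy
# CUDA runtime version, e.g. 11020 for CUDA 11.2 or 12000 for CUDA 12.0
print(cupy.cuda.runtime.runtimeGetVersion())
# highest CUDA version supported by the installed driver
print(cupy.cuda.runtime.driverGetVersion())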

nsaman (Author) commented Jan 24, 2024

I apologize; I've been trying both, as the driver version (470.57.02) is fixed by the AWS SageMaker processing instance. https://docs.rapids.ai/install shows that this driver should only support up to CUDA 11.2.

If you can think of any other configs to try, I can test them. I'll keep poking at configs looking for a solution. We use pandas_udfs, which really obscures the root cause; eventually I will rewrite them in native Spark, which might give enough visibility.

lijinf2 (Collaborator) commented Jan 25, 2024

Not sure if this helps, but you can try setting the task GPU amount to 1, i.e. replace "spark.task.resource.gpu.amount=.5" with "spark.task.resource.gpu.amount=1".
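
In spark-defaults.conf terms:

# spark.task.resource.gpu.amount=.5   <- before
spark.task.resource.gpu.amount=1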

eordentlich (Collaborator) commented:

One good sanity check would be to run this script: https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.02/python/run_benchmark.sh with the gpu_etl argument, to exercise both the RAPIDS Accelerator for ETL and spark-rapids-ml in your container. It runs in Spark local mode. I was able to run it successfully on my server in both of the base rapidsai images you have been working with (after installing some dependencies).
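
For example (an assumed invocation; check the script's usage text for the exact arguments and required dependencies):

git clone https://github.com/NVIDIA/spark-rapids-ml.git
cd spark-rapids-ml/python
./run_benchmark.sh gpu_etl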
