spark-rapids-ml and spark-rapids Accelerator CUDARuntimeError: cudaErrorMemoryAllocation #552
Comments
Thanks for the detailed info. Can you try with this setting:
Actually, the above is unlikely to make a difference in your case. It looks like it is an issue only if spark-rapids-enabled Python workers are configured, and that is not the case for your configs. Is the training data for the ML part large in size? You can run
Tested with this, which, as you predicted, did not fix the issue. Running nvidia-smi -l 1, I am seeing a process assigned to the GPU at 0MiB memory (see below). Is there any way this could be an exclusive lock on the GPU? Error log from the driver (note the timestamp):
nvidia-smi -l 1 (from the executor node; ~786MB has been in use since Wed Jan 24 06:38:43 2024)
Spark conf with actual values, since it's impossible to guess at the problem without it:
Not sure what might be going on. Your GPU looks to be configured in default mode, so multiple processes (the JVM for the plugin and Python for spark-rapids-ml in this case) can share the GPU. Can you clarify whether you are using the CUDA 12.0 base RAPIDS image or 11.2? Your Dockerfile has 11.2, while your description and second nvidia-smi output indicate 12.0.
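If it helps to confirm this programmatically rather than by eyeballing nvidia-smi, here is a minimal diagnostic sketch (assuming the nvidia-ml-py / pynvml package is available in your image; it is not something spark-rapids-ml requires) that reports the compute mode and current memory use of GPU 0:

```python
# Diagnostic sketch only (assumes pynvml / nvidia-ml-py is installed):
# report the compute mode and current memory use of GPU 0.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU

mode = pynvml.nvmlDeviceGetComputeMode(handle)
mode_names = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default (multiple processes can share the GPU)",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive_Process (one process at a time)",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited (no compute contexts allowed)",
}
print("Compute mode:", mode_names.get(mode, str(mode)))

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Memory: used={mem.used / 1024**2:.0f} MiB, free={mem.free / 1024**2:.0f} MiB")

pynvml.nvmlShutdown()
```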
I apologize, I've been trying with both, as the driver version (470.57.02) is set by the AWS SageMaker processing instance. https://docs.rapids.ai/install shows that this driver should only support up to CUDA 11.2. If you can think of any other configs to try, I can test them. I'll keep poking at configs looking for a solution. We use pandas_udfs, which really obscure the root cause; eventually I will rewrite them in Spark, which might give enough visibility.
Not sure if this helps, but you can try setting the task GPU amount to 1, i.e. replace "spark.task.resource.gpu.amount=.5" with "spark.task.resource.gpu.amount=1".
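Purely as an illustrative sketch, this is where that setting would go if you build the session in code; in your setup it can just stay in spark-defaults.conf, and the other value here is a placeholder, not your actual config:

```python
# Sketch only: session-builder form of spark.task.resource.gpu.amount=1.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.resource.gpu.amount", "1")  # one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")       # was .5; one task per GPU at a time
    .getOrCreate()
)
```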
One good sanity check would be to run this script: https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.02/python/run_benchmark.sh with
I'm using the RAPIDS Accelerator and spark-rapids-ml in conjunction and am facing the error below. If the RAPIDS Accelerator is disabled, it runs successfully. The documentation implies the two should be able to work together: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/ml-integration.html#existing-ml-libraries.
Is there something I'm missing?
spark.rapids.memory.gpu.pool=NONE seems to be the only suggestion there for avoiding memory conflicts.
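For reference, a rough sketch of how I'm combining the two; the plugin class and pool setting come from the docs, while the resource amounts here are placeholders rather than my exact spark-defaults.conf values (attached below):

```python
# Rough sketch of the combined setup: RAPIDS Accelerator plugin enabled
# alongside spark-rapids-ml, with the RMM pool disabled as the docs suggest.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # RAPIDS Accelerator
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.rapids.memory.gpu.pool", "NONE")          # avoid pre-allocating the whole GPU
    .config("spark.executor.resource.gpu.amount", "1")       # placeholder values
    .config("spark.task.resource.gpu.amount", "0.5")
    .getOrCreate()
)
```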
Environment:
Docker running on an AWS SageMaker processing instance (ml-p3-2xlarge) (base image: nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10)
Stacktrace
nvidia-smi
Dockerfile
requirements.txt
spark-defaults.conf (note, some variables in there)