RuntimeError: CUDA error: an illegal memory access was encountered #1
Same error here. Have you solved it? |
Hi, unfortunately this error is not very specific. We have seen it before in 3D Gaussian Splatting and tried our best to replicate it, but we were never able to trigger it on any of our machines, so we never worked out how to debug it. Could you let us know your OS and GPU (and how many GPUs are in your machine)? Getting the latest NVIDIA drivers might help, but the bottom line is that without full access to a setup where it happens, it might be really tough to track down. |
Hi, I'm working on Ubuntu 18.04 with one GPU (RTX 3090). I used this setup with 3D Gaussian Splatting before and it worked fine. By the way, my |
The error persists even if I comment out this line. |
Hi, |
Yes, I installed PyTorch with this command. By the way, the error on my machine appeared at iteration 10040, not from the beginning. |
Hi, I tried on another machine (Ubuntu 20.04/A6000) with the same dataset, and the error appears again at iteration 10040. I guess this issue is associated with the dataset? |
I just tried downloading SmallCity and running |
I was working with a dataset which I collected myself. I'm trying with SmallCity right now. |
@Vilour @SunHongyang10 When it fails, could you try to keep an eye on the GPU memory consumption? Is it possible that the system runs out of video memory? This should not happen on a 3090... |
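For anyone who wants to log memory while training runs instead of watching it by hand, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH; the helper names `parse_memory` and `gpu_memory` are made up for illustration and are not part of this repository.

```python
import shutil
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Return per-GPU (used, total) in MiB, or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=30)
        if result.returncode != 0:
            return []
        return parse_memory(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # ValueError covers fields that are not plain integers.
        return []
```

Calling `gpu_memory()` in a loop from a second terminal (for example once per second with `time.sleep(1)`) while `train_coarse.py` runs should show whether usage climbs toward the card's total just before the crash.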
I just tried with the small_city dataset and it fails immediately. My device is a 3090. |
Same here on Ubuntu 20.04, with the following call stack: File "train_coarse.py", line 190, in |
Video memory consumption is fine. It only takes a few gigabytes. |
Hi, could you please provide your CUDA and NVIDIA driver versions? |
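To gather that information in one go, something like the sketch below might help. It is only an illustration: the function names `command_output` and `collect_versions` are hypothetical, and it assumes `nvidia-smi` and `nvcc` are on the PATH (missing tools are reported as `None`).

```python
import shutil
import subprocess


def command_output(cmd):
    """Return the stdout of cmd, or None if the executable is missing or the call fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_versions():
    """Collect driver, CUDA toolkit and PyTorch versions into a dict."""
    report = {
        # Driver version and GPU name as reported by the NVIDIA driver.
        "driver": command_output(
            ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"]
        ),
        # CUDA toolkit version used to compile extensions such as the rasterizer.
        "nvcc": command_output(["nvcc", "--version"]),
        "torch": None,
        "torch_cuda": None,
    }
    try:
        import torch

        report["torch"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        report["torch_cuda"] = torch.version.cuda
    except Exception:
        # PyTorch is not installed or failed to import; leave the fields as None.
        pass
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The `nvcc` and `torch_cuda` entries are the two worth comparing, since the discussion here is about the toolkit version versus the CUDA version PyTorch was built for.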
I have the same issue. Here are my versions:
|
The driver version is 525.105.17, and |
Thanks for providing details. I managed to replicate the error using nvidia/cuda:12.1.0-devel-ubuntu20.04. We will look into it. |
It seems there is a version mismatch here. The |
Hi, I hit the same problem. I am running on an RTX 6000 Ada and Ubuntu 24.04 with CUDA 11.8 and driver version 550.90.07. Thanks for your help! The error message is:
$ python scripts/full_train.py --project_dir dataset/example_dataset/
creating output dir: dataset/example_dataset/output
Optimizing dataset/example_dataset/output/scaffold
Output folder: dataset/example_dataset/output/scaffold [23/07 01:21:16]
Converting point3d.bin to .ply, will happen only the first time you open the scene. [23/07 01:21:16]
Reading camera 1158/1158 [23/07 01:21:17]
0 test images [23/07 01:21:17]
1158 train images [23/07 01:21:17]
Making Training Dataset [23/07 01:21:17]
Making Test Dataset [23/07 01:21:17]
Number of points at initialisation : 329992 [23/07 01:21:17]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/xiangyu/Projects/hierarchical-3d-gaussians/train_coarse.py", line 190, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "/home/xiangyu/Projects/hierarchical-3d-gaussians/train_coarse.py", line 110, in training
gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter], radii)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Error executing train_coarse: Command 'python train_coarse.py -s dataset/example_dataset/camera_calibration/aligned --save_iterations -1 -i ../rectified/images --skybox_num 100000 --model_path dataset/example_dataset/output/scaffold --alpha_masks ../rectified/masks ' returned non-zero exit status 1. |
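As the message in the log above says, CUDA errors can be reported asynchronously, so the line in the traceback may not be the one that actually failed. A minimal sketch of rerunning a command with `CUDA_LAUNCH_BLOCKING=1` follows; the helper names `blocking_env` and `run_with_blocking_launches` are made up for illustration.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA kernel launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(cmd):
    """Run cmd (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example call (arguments taken from the failing command in the log above):
# run_with_blocking_launches([
#     "python", "train_coarse.py",
#     "-s", "dataset/example_dataset/camera_calibration/aligned",
#     "--model_path", "dataset/example_dataset/output/scaffold",
# ])
```

Training will be slower with synchronous launches, but the traceback should then point at the call that triggered the illegal access rather than a later, unrelated line.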
Hi,
With 125.Dockerfile in
|
I installed CUDA 12.5, uninstalled the original PyTorch 2.3.0, and then reinstalled the latest version of PyTorch (2.3.1). After that, I reran Commands executed: conda remove pytorch torchvision torchaudio pytorch-cuda=12.1
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
I suspect it is because I reinstalled |
There seems to be an issue associated with CUB, which is failing to compute the sum over a CUDA array, for no obvious reason. We are checking what can be done. |
There seem to be unspecified PyTorch/CUB compatibility issues on Ubuntu. We will try to figure out where they come from, or whether we can find a more robust alternative. In the meantime, if you can, combining PyTorch built for CUDA 12.1 with a CUDA Toolkit 12.5 installation (yes, this should be fine, minor version mismatches are allowed) seems like a good choice on Ubuntu, judging by our Docker tests. |
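To make the "minor version mismatches are allowed" remark concrete, here is a small sketch that classifies a PyTorch/toolkit pair. The function names and the three labels are purely illustrative assumptions; a "minor-mismatch" result only means the pair is of the kind described above, not that it is guaranteed to work.

```python
def parse_version(text):
    """Split a 'major.minor' CUDA version string into a tuple of ints."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def classify_mismatch(torch_cuda, toolkit):
    """Compare the CUDA version PyTorch was built for with the installed toolkit.

    Returns 'match', 'minor-mismatch' (same major version) or 'major-mismatch'.
    """
    torch_version = parse_version(torch_cuda)
    toolkit_version = parse_version(toolkit)
    if torch_version == toolkit_version:
        return "match"
    if torch_version[0] == toolkit_version[0]:
        return "minor-mismatch"
    return "major-mismatch"


# The combination suggested above: PyTorch built for CUDA 12.1, toolkit 12.5.
print(classify_mismatch("12.1", "12.5"))
```

The first argument would typically come from `torch.version.cuda` and the second from the release number printed by `nvcc --version`.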
Hi, I built a Docker image from the provided Dockerfile (modified to match my graphics driver) to run the code. This is my Dockerfile:
|
Well, same problem, after about half an hour of waiting. |
Hi, |
Hi, I'm sorry, but this is a bit difficult for me, mainly because I would have to do this on the workstation: if I modify the graphics card driver and CUDA (the maximum available version is 12.2), it may affect other people. |
Hello, it doesn't seem to have to be run in a CUDA 12.5 environment. I have a machine with CUDA 12.3 and PyTorch 2.3.0 that works fine. Hope that helps. |
So far, CUDA 12.3 + PyTorch 2.3.0 works. |
Currently CUDA 12.4 + PyTorch 2.4 works on Windows. |
Hi, I solved the problem. |
Yes it works! Just a caveat, |
I got the same error here (Ubuntu 18.04, nvcc -V 11.6)
After using @ForeverAurorak's solution, it works and it's training now! Thank you so much!!!
|
Thank you to everyone in this thread for their awesome contributions, especially @ameuleman for providing a working Dockerfile. I was able to take it and put together a working docker-compose environment. Everything appears to be working, but I still haven't figured out a way to connect to the remote viewer. If anyone is interested in running H3DGS via docker compose, here is the link to the complete diff: https://github.com/graphdeco-inria/hierarchical-3d-gaussians/pull/31/files BTW, I am running an RTX 3060 12GB with CUDA 12.3 installed on my host machine. |
I think it should be
This is a good point. In conclusion, the following code modification works for me on Ubuntu:
line 29 in
By modifying the two files above and reinstalling via |
Hi, thank you for your feedback. I pushed the fix to https://github.com/graphdeco-inria/hierarchy-rasterizer, please update your rasterizer using Regarding
|
This fixed my problem, thanks! PyTorch 2.3.0+cu121, nvcc 12.1. |
Fixed my problem on Ubuntu 22.04 + CUDA 11.8 + PyTorch 2.3.0 on docker! |
Hello~ Wonderful Work!
I am trying to run train_coarse.py, but I get an error:
I tried to solve it, but I failed😭
Is there a problem with my virtual environment?