RuntimeError: CUDA error: an illegal memory access was encountered #41

Open
seohoiki3215 opened this issue Jul 17, 2023 · 19 comments

@seohoiki3215

seohoiki3215 commented Jul 17, 2023

Hello, I was impressed by your work and tried to reproduce it with the code you've provided.
However, every time I ran the code, it failed with the runtime error I mentioned in the title.

Traceback (most recent call last):
File "train.py", line 213, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
File "train.py", line 87, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]

I tried all the methods you suggested in other issues, but none of them worked.
My system and settings:
RTX 4090
Ubuntu 22.04 LTS
Exact environment from the provided .yml file

Strangely, my colleague, whose system has an RTX 3090 and Ubuntu 20.04, runs the code without any problem. (Apart from the GPU and OS, all the settings are exactly the same, including the CUDA SDK version.)

I hope there is a solution to this problem!

Thank you.

=====================================
Results with cuda-memcheck

========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
Optimizing
Output folder: ./output/54877260-0 [17/07 19:21:51]
Tensorboard not available: not logging progress [17/07 19:21:51]
Found transforms_train.json file, assuming Blender data set! [17/07 19:21:51]
Reading Training Transforms [17/07 19:21:51]
Reading Test Transforms [17/07 19:21:53]
Loading Training Cameras [17/07 19:21:56]
Loading Test Cameras [17/07 19:21:57]
Number of points at initialisation : 100000 [17/07 19:21:57]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 213, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
File "train.py", line 87, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
========= ERROR SUMMARY: 0 errors
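
The error message above suggests passing CUDA_LAUNCH_BLOCKING=1, because asynchronous kernel launches can make the Python traceback point at the wrong line (here, the `ssim` call rather than the kernel that actually faulted). A minimal sketch of relaunching a script with that variable set; `blocking_env` and `run_blocking` are hypothetical helper names, not part of the repository:

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script_args):
    """Run a Python script (e.g. ["train.py", ...]) with CUDA_LAUNCH_BLOCKING=1.

    script_args is the script name followed by its usual arguments.
    Returns the exit code of the child process.
    """
    cmd = [sys.executable, *script_args]
    return subprocess.run(cmd, env=blocking_env()).returncode
```

Calling `run_blocking` with `train.py` and the same arguments as the failing run should report the error at the launch that caused it, at the cost of slower execution.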

@Snosixtyboo
Collaborator

Snosixtyboo commented Jul 17, 2023

Hi,

I have been trying to get to the bottom of this, but was unable to reproduce it so far. Would you by any chance be available for a Skype (or similar) session to run through it?

@Snosixtyboo
Collaborator

Also one question: I see the message
" Found transforms_train.json file, assuming Blender data set! [17/07 19:21:51] "

Are you in fact running it on the Blender data set?

@seohoiki3215
Author

I'm running the code with the nerf_synthetic dataset. The colleague I mentioned succeeded in running your code on the exact same dataset.
Link: https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1

As for the Skype session, could we use Zoom instead?

@Snosixtyboo
Collaborator

Thanks for the suggestion, but I have just done a debug session with another user who has the same problem. It looks like I will need to add more diagnostics before I can find out what's going on. I'll let you know when I find out more :)

@Snosixtyboo
Collaborator

Hi @seohoiki3215
I finally managed to finish a debug version of the rasterizer; I hope this will help. To use it, please run:

git pull
git submodule update
pip uninstall diff-gaussian-rasterization (yes)
pip install submodules/diff-gaussian-rasterization

and then run what failed before with --debug. This is slow, so if it takes a while for the error to appear, you can also use --debug_from <iteration> to start debugging only from a certain iteration. If everything goes well, you should get an error message and a snapshot_fw or snapshot_bw file in the gaussian_splatting directory. If you could forward this file to us, we could take a look to see if we find something wrong!

Best,
Bernhard

@seohoiki3215
Author

seohoiki3215 commented Jul 24, 2023

Thank you for giving me some updates for the issue. I've re-run the code with the procedure, and here is the result!

snapshot_fw.zip

Optimizing
Output folder: ./output/9feda2d2-9 [24/07 11:03:34]
Tensorboard not available: not logging progress [24/07 11:03:34]
Found transforms_train.json file, assuming Blender data set! [24/07 11:03:34]
Reading Training Transforms [24/07 11:03:34]
Reading Test Transforms [24/07 11:03:36]
Loading Training Cameras [24/07 11:03:40]
Loading Test Cameras [24/07 11:03:42]
Number of points at initialisation : 100000 [24/07 11:03:42]
Training progress: 0%|
| 0/30000 [00:00<?, ?it/s]
[CUDA ERROR] in cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [24/07 11:03:42]
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 83, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/seohoiki/Research/NeRF/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render
    cov3D_precomp = cov3D_precomp)
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
    raster_settings,
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
    raster_settings,
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 90, in forward
    raise ex
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward
    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
Training progress: 0%|

@Snosixtyboo
Collaborator

Hi,

I tried it, and unfortunately it just works for me: the state you submitted is valid. I have to say I'm running out of ideas about what this could be ☹️. I have only seen the issue happen on Linux so far. Are there other GPUs in your machine? Are your GPU drivers up to date?

Best, Bernhard

@seohoiki3215
Author

I am sorry to hear that the error is not reproducible. ;(
I have a single RTX 4090 in my system, and the driver is up to date (535).
The CUDA toolkit version is 11.7.
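
Several replies in this thread list GPU, driver, and CUDA versions by hand. As a rough sketch, the PyTorch side of that information can be collected in one place; `describe_environment` is a hypothetical helper, and note that `torch.version.cuda` is the CUDA version PyTorch was built against, which may differ from the system toolkit used to compile the rasterizer:

```python
import platform


def describe_environment():
    """Collect version details that are commonly asked for in CUDA bug reports."""
    info = {"python": platform.python_version(), "os": platform.platform()}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built against
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

The driver version and the system toolkit version are not covered here and still need to be checked separately.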

@stevenygd

stevenygd commented Sep 5, 2023

I also encountered this error. Is there any help or update? Here is the debug message I got:

[CUDA ERROR] in /home/gaussian-splatting/submodules/diff-gaussian-rasterization/cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [05/09 01:11:48]
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 83, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render
    cov3D_precomp = cov3D_precomp)
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
    raster_settings, 
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
    raster_settings,
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 90, in forward
    raise ex
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward
    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered

@fatbao55

fatbao55 commented Oct 2, 2023

@seohoiki3215 did you manage to resolve this?

@fatbao55

fatbao55 commented Oct 2, 2023

@Snosixtyboo This is my dump file, snapshot_fw.zip, obtained with the debug version of the rasterizer.

This is my error:
num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]

I'm running Ubuntu 20.04 with CUDA 11.8, an RTX 3090, and driver 520. Do you have any advice on how to resolve this?

@junseo013

junseo013 commented Oct 9, 2023

@fatbao55 Please check this PR, graphdeco-inria/diff-gaussian-rasterization#10
In my case, adding the "-Xcompiler -fno-gnu-unique" option in submodules/diff-gaussian-rasterization/setup.py, line 29, resolved the illegal memory access error in training.

...
29 extra_compile_args={"nvcc": ["-Xcompiler", "-fno-gnu-unique","-I" + os.path.join(os.path.dirname(os.path.abspath(__file__)), "third_party/glm/")]})
...

After changing the code, reinstall the module with:
pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization
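
For readers who want to see how the pieces of that argument list fit together, here is a minimal sketch that builds the same list. `nvcc_compile_args` is a hypothetical helper, not something in the repository's setup.py; the flag values and the `third_party/glm/` path are taken from the line quoted above:

```python
import os


def nvcc_compile_args(package_dir):
    """Build the nvcc argument list with -fno-gnu-unique forwarded to the host compiler."""
    glm_include = os.path.join(package_dir, "third_party/glm/")
    return [
        # -Xcompiler tells nvcc to pass the following option to the host compiler,
        # so -fno-gnu-unique reaches GCC rather than being interpreted by nvcc.
        "-Xcompiler", "-fno-gnu-unique",
        # Include path for the bundled GLM headers.
        "-I" + glm_include,
    ]


# In a setup.py this would be used roughly as:
#   here = os.path.dirname(os.path.abspath(__file__))
#   extra_compile_args = {"nvcc": nvcc_compile_args(here)}
```

The extension has to be rebuilt for the flag to take effect, which is why the uninstall and reinstall step is needed.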

@fatbao55

fatbao55 commented Oct 9, 2023

@jsl013 This worked for me, thanks so much!

@FantasticOven2

The -fno-gnu-unique fix above is a life saver for me. After two days of debugging and trying 4 different clusters, it finally helped me solve the problem on Ubuntu.

@mushroonhead

I had the same issue with diff-gaussian-rasterization as well. The -fno-gnu-unique fix solved it for me. I am running on a WSL2 Ubuntu 20.04 setup with the CUDA 11.8 toolkit.

@ShuzhaoXie

ORZ, I have installed the debug version of the rasterizer. Could anyone tell me how to use the '--debug' argument? I added it to the render.py command but got the following error...

Input:

python render.py --debug ...

Output:

usage: render.py [-h] [--sh_degree SH_DEGREE] [--source_path SOURCE_PATH]
                    [--model_path MODEL_PATH] [--images IMAGES]
                    [--resolution RESOLUTION] [--white_background] [--eval]
                    [--convert_SHs_python] [--compute_cov3D_python]
                    [--iteration ITERATION]
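
The usage text above lists no --debug option, which is consistent with argparse rejecting a flag that the script's parser does not define; the earlier instructions were to rerun "what failed before" (the training run) with --debug. A minimal sketch of that argparse behaviour, using a hypothetical two-option parser rather than the real render.py parser (the default value is an assumption):

```python
import argparse


def build_parser():
    """A stand-in parser with two of the options shown in the usage text."""
    parser = argparse.ArgumentParser(prog="render.py")
    parser.add_argument("--iteration", type=int, default=-1)  # default is illustrative
    parser.add_argument("--eval", action="store_true")
    return parser


def unknown_flags(argv):
    """Return the arguments that the parser does not recognise."""
    _, unknown = build_parser().parse_known_args(argv)
    return unknown


# parse_args() would print the usage text and exit on these;
# parse_known_args() returns them instead so they can be inspected.
print(unknown_flags(["--debug", "--iteration", "7"]))
```

A flag only becomes available to a script once its parser registers it with `add_argument`.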

@jhq1234

jhq1234 commented Jan 10, 2024

This works for me! I appreciate your kind tip!

@unanan

unanan commented Dec 3, 2024

The -fno-gnu-unique fix does not work for me. My setup is a 3090 with CUDA 12.1, cuDNN 8, and PyTorch 2.1.0.

@ramazan793

Thanks a lot for the -fno-gnu-unique fix!!!
I spent 2 days trying to figure out what the problem was: I tried different CUDA/Torch versions, and nothing helped.
CUDA versions tried: 11.7, 11.8.
Device: A6000 48GB.
System: Ubuntu 20.04.6 LTS
This fix worked with torch 2.0.1 and CUDA 11.7 (it probably also works with 11.8, but I haven't tried).
