PyCUDA error when launching train_ft on custom colmap data #88

Open
DaddyWesker opened this issue Aug 3, 2023 · 1 comment

@DaddyWesker

Hello and thanks for your code.

I've spent a couple of days trying to run this code on custom data I obtained by running COLMAP on some images. I managed to work through many of the problems that came up along the way, but I'm now stuck on the following and don't know what to do:

dataset total: train 330
dataset [NerfSynthFtDataset] was created
../checkpoints/col_nerfsynth/yandex/*_net_ray_marching.pth
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Continue training from 0 epoch
Iter: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
opt.act_type!!!!!!!!! LeakyReLU
self.points_embeding torch.Size([1, 841, 32])
querier device cuda:0 0
neural_params [('module.neural_points.xyz', torch.Size([841, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 841, 32]), True), ('module.neural_points.points_conf', torch.Size([1, 841, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 841, 3]), True), ('module.neural_points.points_color', torch.Size([1, 841, 3]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! 0
loading ray_marching  from  ../checkpoints/col_nerfsynth/yandex/0_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 0.377M
------------------------------------------------
# training images = 330
saving model (yandex, epoch 0, total_steps 0)
Traceback (most recent call last):
  File "train_ft.py", line 1081, in <module>
    main()
  File "train_ft.py", line 937, in main
    model.optimize_parameters(total_steps=total_steps)
  File "/home/daddywesker/Dioram/yandex/pointnerf/run/../models/neural_points_volumetric_model.py", line 217, in optimize_parameters
    self.backward(total_steps)
  File "/home/daddywesker/Dioram/yandex/pointnerf/run/../models/mvs_points_volumetric_model.py", line 104, in backward
    self.loss_total.backward()
  File "/home/daddywesker/anaconda3/envs/limap/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/daddywesker/anaconda3/envs/limap/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
end loading
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.

I've tried to debug this problem by attaching to the process launched by the bash script in the w_colmap_n360 folder, but I have no clue so far. Any advice?
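
For what it's worth, the RuntimeError above means the final loss tensor is not attached to the autograd graph, which usually happens when every tensor it was computed from had requires_grad=False (e.g. frozen/loaded parameters, or a forward pass run under torch.no_grad()). A minimal sketch of how to check this, assuming you can grab the underlying nn.Module and the loss_total tensor named in the traceback (the exact attribute paths in Point-NeRF may differ):

```python
import torch

def report_grad_state(net: torch.nn.Module, loss: torch.Tensor) -> None:
    # Count trainable vs. frozen parameters; if everything feeding the loss
    # is frozen, backward() raises "element 0 of tensors does not require grad".
    trainable = [n for n, p in net.named_parameters() if p.requires_grad]
    frozen = [n for n, p in net.named_parameters() if not p.requires_grad]
    print(f"trainable params: {len(trainable)}, frozen params: {len(frozen)}")

    # A loss that is detached from the graph has requires_grad=False and no grad_fn.
    print("loss requires_grad:", loss.requires_grad, "grad_fn:", loss.grad_fn)
```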

@DaddyWesker
Author

Alright, I've sort of fixed this: I set load_points=0 in the .sh file.

The problem now is that this project tries to allocate an enormous amount of VRAM:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 77.28 GiB (GPU 0; 7.80 GiB total capacity; 62.91 MiB already allocated; 6.40 GiB free; 92.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
end loading

I have 330 images, 1024×1024 each. Is that too much?
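
A rough back-of-the-envelope check (my own estimate, not taken from the repo): 77.28 GiB is close to what it would cost to hold roughly 60 float32 values for every pixel of all 330 1024×1024 images at once, so the failing op looks like it is processing the whole image set in one go rather than a sampled ray batch:

```python
num_images = 330
pixels_per_image = 1024 * 1024
floats_per_pixel = 60            # hypothetical per-pixel payload (e.g. samples * channels)
bytes_needed = num_images * pixels_per_image * floats_per_pixel * 4  # float32 = 4 bytes
print(bytes_needed / 2**30)      # ~77.3 GiB, matching the allocation in the error above
```

If that's what is happening, lowering the image resolution or the per-iteration ray/image batch size in the config should shrink the request far more than allocator tuning; max_split_size_mb only mitigates fragmentation and cannot help with a single 77 GiB allocation on an 8 GiB card.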
