Training crashes after 7000 #235

Open
nivibilla opened this issue Sep 24, 2023 · 6 comments

Comments

@nivibilla

Hi,

Training gets to 7000 steps, then prints ^C and stops. It doesn't save a point cloud either.

[image]

@XinyueZ

XinyueZ commented Sep 28, 2023

I guess it could be OOM (out of memory).

@jerome3o

I also get this, not sure if OOM. It seems to only happen around the time iter 7000 saves.

I was even running it in a Docker container, and it still crashed the host machine.

@nivibilla did you ever figure out what was causing this?

@jpeng2012

Similar issue: the process showed "Killed" after 7000 iterations for db/drjohnson, but tandt/train works fine.

@GaneshBannur

GaneshBannur commented Feb 21, 2024

I had the same error and it was due to OOM. When saving Gaussians there is a spike in CPU RAM usage. You're training with 423 images in Colab so I'm guessing the RAM consumption was already high. When it tried to save Gaussians, consumption must have spiked and caused an OOM.
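For a rough feel of how a save step can spike CPU RAM, here is a minimal sketch. It assumes the spike comes at least partly from converting the attribute array into a Python list of tuples (the `list(map(tuple, attributes))` line discussed further down this thread), which boxes every value as a separate Python object. The array sizes are made up for illustration; 62 attributes per Gaussian is an assumption that matches the field list quoted later in the thread at the default settings.

```python
import sys
import numpy as np

# Assumed sizes, for illustration only. Real scenes have far more points.
n_points, n_attrs = 2_000, 62
attributes = np.random.rand(n_points, n_attrs).astype(np.float32)

# Compact NumPy storage: 4 bytes per float32 value.
array_bytes = attributes.nbytes

# The same data as a list of tuples: every value becomes its own object,
# and every tuple also stores one pointer per value.
rows = list(map(tuple, attributes))
tuple_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(row) + sum(sys.getsizeof(value) for value in row)
    for row in rows
)

print(f"NumPy array:    {array_bytes / 1e6:.2f} MB")
print(f"List of tuples: {tuple_bytes / 1e6:.2f} MB")
print(f"Ratio:          {tuple_bytes / array_bytes:.1f}x")
```

The exact ratio depends on the Python and NumPy versions, but the list-of-tuples copy is several times larger than the array it was built from, and it exists alongside the original array while the file is being written.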

The quick fix is to not save Gaussians at iteration 7000, which avoids that spike. Save only at iteration 30,000 (or whatever your last iteration is) by passing --save_iterations 30000 to train.py. However, the same spike when saving at iteration 30,000 may still cause the run to fail.
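To illustrate why passing a single value drops the 7000 checkpoint, here is a small sketch of a list-valued option. The parser below is a stand-in written for this example, not the code from train.py; the declaration (`nargs="+"`, integer type, default of 7000 and 30000) is an assumption based on the behaviour described in this thread.

```python
import argparse


def build_parser():
    # Hypothetical stand-in parser; option names mirror those mentioned above.
    parser = argparse.ArgumentParser(description="stand-in for a training script")
    parser.add_argument("--iterations", type=int, default=30_000)
    parser.add_argument(
        "--save_iterations", nargs="+", type=int, default=[7_000, 30_000]
    )
    return parser


def resolve_save_iterations(argv):
    # Values given on the command line replace the default list entirely.
    args = build_parser().parse_args(argv)
    return sorted(set(args.save_iterations))


print(resolve_save_iterations([]))                             # [7000, 30000]
print(resolve_save_iterations(["--save_iterations", "30000"]))  # [30000]
```

Under that assumption, an explicit list replaces the default rather than extending it, so the intermediate save at 7000 no longer happens.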

The better fix is given in #667. It decreases CPU RAM consumption and prevents this.

@LaurensDiels

LaurensDiels commented Mar 6, 2024

I also ran into the same issue. In my case it was not OOM-related, though. I was able to solve the problem by changing the line

elements[:] = list(map(tuple, attributes))

of save_ply in scene/gaussian_model.py, to an explicit loop:

for i in range(len(elements)):
    elements[i] = tuple(attributes[i])

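A quick way to check that the explicit loop fills the structured array the same way as the original one-liner is a toy comparison like the one below. The four field names and the tiny array are made up for illustration; the real save_ply uses a much longer field list.

```python
import numpy as np

# Hypothetical, tiny stand-in for the real attribute layout.
names = ["x", "y", "z", "opacity"]
dtype_full = [(name, "f4") for name in names]
attributes = np.arange(12, dtype=np.float32).reshape(3, 4)

# Original form: build a full list of tuples, then assign it in one go.
elements_map = np.empty(attributes.shape[0], dtype=dtype_full)
elements_map[:] = list(map(tuple, attributes))

# Explicit loop: convert and assign one row at a time.
elements_loop = np.empty(attributes.shape[0], dtype=dtype_full)
for i in range(len(elements_loop)):
    elements_loop[i] = tuple(attributes[i])

# Both approaches produce identical records.
assert elements_map.tolist() == elements_loop.tolist()
print(elements_loop)
```

The loop never holds the whole list of tuples in memory at once, which is one possible reason it behaves differently on large scenes.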

@guwinston

guwinston commented Dec 12, 2024

I encountered the same problem. In my case, when training reached 7000 iterations and called save_ply, the code got stuck there and couldn't continue. The reason seems to be that the line elements[:] = list(map(tuple, attributes)) performs too many map operations, so I changed it to assign each field directly:

elements['x'] = xyz[:, 0]
elements['y'] = xyz[:, 1]
elements['z'] = xyz[:, 2]
elements['nx'] = normals[:, 0]
elements['ny'] = normals[:, 1]
elements['nz'] = normals[:, 2]
for i in range(f_dc.shape[1]):
    elements['f_dc_{}'.format(i)] = f_dc[:, i]
for i in range(f_rest.shape[1]):
    elements['f_rest_{}'.format(i)] = f_rest[:, i]
elements['opacity'] = opacities[:, 0]
for i in range(scale.shape[1]):
    elements['scale_{}'.format(i)] = scale[:, i]
for i in range(rotation.shape[1]):
    elements['rot_{}'.format(i)] = rotation[:, i]

This works for me; I hope it helps.
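For anyone who wants to try the per-field assignment outside the training code, here is a self-contained sketch on synthetic data. The arrays and the reduced field list (position, opacity, scale only) are assumptions for the demo; in save_ply the arrays come from the model and the field list is longer.

```python
import numpy as np

# Synthetic stand-ins for the arrays used in save_ply (sizes are arbitrary).
n = 5
xyz = np.random.rand(n, 3).astype(np.float32)
opacities = np.random.rand(n, 1).astype(np.float32)
scale = np.random.rand(n, 3).astype(np.float32)

names = ["x", "y", "z", "opacity"] + [f"scale_{i}" for i in range(scale.shape[1])]
dtype_full = [(name, "f4") for name in names]

# Per-field assignment: each column is copied as a whole NumPy array,
# so no per-row Python tuples are created.
elements = np.empty(n, dtype=dtype_full)
elements["x"] = xyz[:, 0]
elements["y"] = xyz[:, 1]
elements["z"] = xyz[:, 2]
elements["opacity"] = opacities[:, 0]
for i in range(scale.shape[1]):
    elements[f"scale_{i}"] = scale[:, i]

# Reference result using the original list-of-tuples approach.
attributes = np.concatenate((xyz, opacities, scale), axis=1)
reference = np.empty(n, dtype=dtype_full)
reference[:] = list(map(tuple, attributes))

assert elements.tolist() == reference.tolist()
```

The column order in `names` has to match the order used when the attributes are concatenated, otherwise the two results will differ.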
