Error with running multiple training commands on a single node #297

Closed
gchhablani opened this issue Jul 21, 2024 · 4 comments · Fixed by #312

Comments

@gchhablani

gchhablani commented Jul 21, 2024

FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/torch_extensions/py39_cu121/gsplat_cuda/lock'

This happens only when I have multiple training scripts running on the same machine.
Here is the trace:

  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/nerfstudio/models/base_model.py", line 143, in forward
    return self.get_outputs(ray_bundle)
...
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/project_gaussians.py", line 61, in project_gaussians
    return _ProjectGaussians.apply(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/project_gaussians.py", line 110, in forward
    ) = _C.project_gaussians_forward(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/cuda/__init__.py", line 7, in call_cuda
    from ._backend import _C
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/cuda/_backend.py", line 73, in <module>
    _C = load(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    baton.release()
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/file_baton.py", line 49, in release
    os.remove(self.lock_file_path)
FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/torch_extensions/py39_cu121/gsplat_cuda/lock'

Any idea how to fix this?

@gchhablani gchhablani changed the title Error with running multiple training scripts on a single node Error with running multiple training commands on a single node Jul 21, 2024
@liruilong940607
Collaborator

liruilong940607 commented Jul 22, 2024

This happens when multiple JIT builds run at the same time. Each build tries to remove the same lock file once it finishes, but by then the lock file has already been removed by another process.
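To make the failure mode concrete, here is a minimal sketch that imitates it in a single process. The `release` helper is a hypothetical stand-in that only mirrors the `os.remove(self.lock_file_path)` call shown in the traceback; it is not torch's actual `FileBaton` code, and the temporary directory stands in for the real extensions cache.

```python
import os
import tempfile


def release(lock_file_path):
    """Hypothetical stand-in: remove the lock file, as in the traceback."""
    os.remove(lock_file_path)


with tempfile.TemporaryDirectory() as build_dir:
    lock_file_path = os.path.join(build_dir, "lock")
    # Create the lock file once, as if a build had started.
    open(lock_file_path, "w").close()

    # The first "process" finishes its build and removes the lock.
    release(lock_file_path)

    # The second "process" finishes later and finds nothing to remove.
    try:
        release(lock_file_path)
        second_release_error = None
    except FileNotFoundError as exc:
        second_release_error = exc

print(type(second_release_error).__name__)
```

The second call fails for the same reason as the reported error: the file it expects to delete is already gone.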

One solution is to build gsplat at installation time so that JIT building is never triggered. This can be done by running pip install . from the gsplat source directory.

The other solution, which is a bit hacky, is to comment out the line that raises the error:

File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/file_baton.py", line 49, in release
os.remove(self.lock_file_path)
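A slightly gentler variant of the same hack, sketched here under assumptions, is to ignore only the "file already gone" case rather than dropping the removal entirely. `tolerant_release` is a hypothetical helper written for illustration; it is not part of torch, and this is not necessarily what the eventual fix does.

```python
import contextlib
import os
import tempfile


def tolerant_release(lock_file_path):
    """Remove the lock file, ignoring the case where it is already gone.

    Hypothetical helper for illustration only; not torch's implementation.
    """
    with contextlib.suppress(FileNotFoundError):
        os.remove(lock_file_path)


with tempfile.TemporaryDirectory() as build_dir:
    lock_file_path = os.path.join(build_dir, "lock")
    open(lock_file_path, "w").close()

    tolerant_release(lock_file_path)  # removes the lock file
    tolerant_release(lock_file_path)  # no error: the file is already gone
    lock_still_exists = os.path.exists(lock_file_path)
```

Whichever process finishes first still removes the lock, and later processes simply skip the removal. Editing an installed torch file remains a local workaround either way.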

@gchhablani
Author

gchhablani commented Jul 22, 2024

@liruilong940607
Thanks for the explanation!
Installing that specific version from source with pip fixed the issue.

If there is a better way to fix this (for example, by adding a few checks in cuda/_backend.py), I can open a PR for that. Please let me know how to do so.
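One shape such a check could take, purely as a sketch: wrap the extension load in a retry that treats a missing lock file as a sign that another process finished the build first. `load_with_retry`, its parameters, and `flaky_load` are all hypothetical names made up for this example; in practice `load_fn` would be whatever callable performs the JIT load. This is an assumption about one possible approach, not a description of the fix that was later merged.

```python
import time


def load_with_retry(load_fn, attempts=3, delay_s=0.1):
    """Call load_fn, retrying when it raises FileNotFoundError.

    Hypothetical helper: assumes a retry finds the extension already
    built by whichever process removed the lock file first.
    """
    for attempt in range(attempts):
        try:
            return load_fn()
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay_s)


# Stand-in loader that fails once, imitating a lost race on the lock file.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] == 1:
        raise FileNotFoundError(2, "No such file or directory", "lock")
    return "extension"


result = load_with_retry(flaky_load, delay_s=0.0)
```

The retry only masks the race; it relies on the assumption that the build has already completed by the time the error is raised, which matches the traceback, where the error comes from the release step after compilation.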

@liruilong940607
Collaborator

I would love to have a fix for this, but honestly I don't know what to do about it. If you have any ideas, I would love to hear them!

@liruilong940607
Collaborator

It should now be fixed on the main branch with the above PR. If not, feel free to reopen it!
