Error with running multiple training commands on a single node #297

gchhablani · 2024-07-21T22:26:03Z

FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/torch_extensions/py39_cu121/gsplat_cuda/lock'

This happens only when I have multiple training scripts running on the same machine.
Here is the trace:

  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/nerfstudio/models/base_model.py", line 143, in forward
    return self.get_outputs(ray_bundle)
...
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/project_gaussians.py", line 61, in project_gaussians
    return _ProjectGaussians.apply(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/project_gaussians.py", line 110, in forward
    ) = _C.project_gaussians_forward(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/cuda/__init__.py", line 7, in call_cuda
    from ._backend import _C
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/gsplat/cuda/_backend.py", line 73, in <module>
    _C = load(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    baton.release()
  File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/file_baton.py", line 49, in release
    os.remove(self.lock_file_path)
FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/torch_extensions/py39_cu121/gsplat_cuda/lock'

Any idea how to fix this?


liruilong940607 · 2024-07-22T06:12:51Z

This happens because multiple JIT builds run at the same time, and each one tries to remove the same lock file once its build finishes. By the time a later process gets there, the lock file has already been removed by another process.
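To make the failure mode concrete, here is a minimal sketch of two releases against the same lock file. It is not torch's actual code: the `release` helper and the temporary lock path are hypothetical stand-ins, and the only thing taken from the traceback is the unconditional `os.remove` call.

```python
import os
import tempfile


def release(lock_file_path):
    # Same shape as the call in the traceback: remove the lock unconditionally.
    os.remove(lock_file_path)


def simulate_double_release():
    # Hypothetical lock file in a throwaway directory (not the real cache path).
    lock_dir = tempfile.mkdtemp()
    lock_file_path = os.path.join(lock_dir, "lock")
    open(lock_file_path, "w").close()

    release(lock_file_path)  # the first process removes the lock
    try:
        release(lock_file_path)  # the second process finds it already gone
    except FileNotFoundError as exc:
        return exc
    finally:
        os.rmdir(lock_dir)
    return None


if __name__ == "__main__":
    print(type(simulate_double_release()).__name__)
```

The second call fails the same way as in the report, only without needing two real processes to trigger it.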

One solution is to build gsplat at installation time, which avoids triggering the JIT build. This can be done by running pip install . from the source directory.

The other solution is a bit hacky: comment out the line that raises the error:

File "~/miniforge3/envs/gs/lib/python3.9/site-packages/torch/utils/file_baton.py", line 49, in release
os.remove(self.lock_file_path)
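A less invasive variant of the same idea, as a sketch only: instead of commenting the line out, tolerate the case where the lock file is already gone. The `release_quietly` helper below is hypothetical and is not part of torch or gsplat; it just shows the pattern on a standalone path.

```python
import contextlib
import os
import tempfile


def release_quietly(lock_file_path):
    # Remove the lock if it exists; ignore the case where another
    # process has already removed it.
    with contextlib.suppress(FileNotFoundError):
        os.remove(lock_file_path)


if __name__ == "__main__":
    # Hypothetical lock file in a throwaway directory.
    lock_dir = tempfile.mkdtemp()
    lock_file_path = os.path.join(lock_dir, "lock")
    open(lock_file_path, "w").close()

    release_quietly(lock_file_path)  # removes the file
    release_quietly(lock_file_path)  # no error the second time
    os.rmdir(lock_dir)
```

Unlike commenting the line out, this still cleans up the lock file when it is present, so it only hides the one error that the race produces.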

gchhablani · 2024-07-22T20:11:57Z

@liruilong940607
Thanks for the explanation!
Installing that specific version from source with pip fixed the issue.

If there is a better way to fix this (for example, by adding a few checks in cuda/_backend.py), I can open a PR for it. Please let me know how to go about it.

liruilong940607 · 2024-07-31T21:26:30Z

I would love to have a fix for this, but honestly I don't know what to do about it. If you have any ideas, I would love to hear them!

liruilong940607 · 2024-08-03T00:23:23Z

It should now be fixed on the main branch with the linked PR (#312). If not, feel free to reopen it!

gchhablani changed the title ~~Error with running multiple training scripts on a single node~~ Error with running multiple training commands on a single node Jul 21, 2024

liruilong940607 mentioned this issue Aug 3, 2024

prevent race condition when JIT in multiprocess #312

Merged

liruilong940607 closed this as completed in #312 Aug 3, 2024
