Error with running multiple training commands on a single node #297
Comments
This is caused by multiple JIT builds running at the same time. Each JIT build tries to remove the same lock file after it finishes, but by then the lock file has already been removed by another process.
One solution is to build gsplat at installation time, which avoids triggering the JIT build. This can be done via
The other solution is a bit hacky, which is to comment out the error-raising line:
@liruilong940607 If there is a better way to fix this (by adding a few checks in the
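One check of the kind suggested here would be to make the lock removal tolerant of the file already being gone. The following is only a minimal sketch under that assumption, not the actual gsplat or PyTorch code: `release_lock` is a hypothetical helper name, and the demo uses a temporary directory rather than the real `torch_extensions` cache path.

```python
import os
import tempfile


def release_lock(lock_path: str) -> bool:
    """Remove the lock file at lock_path.

    Returns True if this process removed the file, and False if the file
    was already gone (for example, removed by a concurrent build).
    """
    try:
        os.remove(lock_path)
        return True
    except FileNotFoundError:
        # Another process finished first and already removed the lock,
        # so there is nothing left to clean up here.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "lock")  # hypothetical stand-in for the real lock file
        open(lock, "w").close()
        print(release_lock(lock))  # first removal succeeds
        print(release_lock(lock))  # second removal finds nothing and does not raise
```

Catching only `FileNotFoundError` keeps other failures, such as permission errors, visible instead of silently ignoring them.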
I would love to have a fix for this, but honestly I don't know what to do about it. If you have any ideas, I would love to hear them!
It should now be fixed on the main branch with the above PR. If not, feel free to reopen it!
FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/torch_extensions/py39_cu121/gsplat_cuda/lock'
This happens only when I have multiple training scripts running on the same machine.
Here is the trace:
Any idea how to fix this?