Error: unable to build: conveyor failed to get: no descriptor found for reference #4555
Comments
Singularity's cache system works well for a single user/node, but doesn't handle multiple nodes trying to read/write at the same time. I'd recommend doing a `singularity pull` up front to create a single .sif image, and running your jobs from that file rather than through the shared cache.
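For example, a minimal sketch of that workflow (the image URI and paths are placeholders, not taken from this thread):

```bash
# Build/pull the SIF once, up front, from a single process:
singularity pull /shared/images/myapp.sif docker://example.org/myapp:latest

# Each cluster job then runs the pre-built image directly,
# so no job ever needs to write to the shared cache:
singularity exec /shared/images/myapp.sif myapp --input data.txt
```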
Actually, I just noticed that I am seeing the same error message on the cluster that uses local /tmp as well, so I don't think NFS itself is the issue. By the way, I am also seeing the following error message.
Both error messages are related somehow, but I am not 100% sure. I can fix this by removing the .sif image generated by singularity v3 under the $SINGULARITY_CACHEDIR/oci-tmp directory. Unfortunately, this issue keeps happening, so a one-time fix won't cut it. Creating a .sif image might work, but we don't control the timing of the container being updated, so it would be difficult. Also, I thought that's what singularity v3 does automatically by caching those .sif images under $SINGULARITY_CACHEDIR/oci-tmp?
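For reference, the clean-up described above amounts to roughly the following (a sketch only; the exact layout under oci-tmp may vary by version):

```bash
# Remove the cached SIFs built by singularity v3 so the next run rebuilds them.
rm -rf "$SINGULARITY_CACHEDIR"/oci-tmp/*
```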
Are you trying to pull multiple copies of the container at the same time in this example? It's not NFS itself that's causing a problem, but rather multiple processes pulling into the cache at the same time.
I see, yes, the issue is probably pulling multiple caches at the same time. singularity v2 worked just fine in such cases: it knew how to handle concurrency, or it didn't have to, because it created a separate .sif (or whatever the v2 equivalent was). So, if I understand correctly, if the container is updated in a way that requires singularity to pull new docker images, the very first job that uses singularity needs to run sequentially, and subsequent jobs can then run in parallel because the .sif file already exists under the oci-tmp directory at that point. Right? To keep the behavior consistent with v2, and because singularity knows whether the container needs to be rebuilt with new layers, I think it should be singularity's responsibility to handle concurrent executions of the same container, in my humble opinion - probably by making other processes wait for the very first singularity process to (re)build the .sif under oci-tmp.
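One way to approximate that "wait for the first build" behavior today, from the user side, is to serialize the pull with flock(1). This is only a sketch: it assumes a lock file location where flock actually works (which, as noted in the next comment, is not guaranteed on network filesystems), and the paths and image URI are illustrative:

```bash
#!/bin/bash
LOCK=/export/singularity/myapp.lock      # illustrative lock file
IMAGE=/export/singularity/myapp.sif      # illustrative image path

(
  flock -x 9                             # concurrent jobs queue up here
  # Only the first job actually builds the SIF; later jobs find it present.
  [ -f "$IMAGE" ] || singularity pull "$IMAGE" docker://example.org/myapp:latest
) 9>"$LOCK"

singularity exec "$IMAGE" myapp          # every job runs the already-built image
```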
This looks to be related to (or a duplicate of?) #3634.
Closing as a duplicate as above - discussion can continue on #3634
Note that this is extremely difficult to do across a cluster in a generic manner: the variety of parallel/network filesystems in use, and their varied locking support, gives ample opportunity to hit race conditions when atomic, cluster-coherent operations on files are not possible. The issue is less of a problem in Singularity 2.x, as it only caches source docker layers and does not create cached SIFs.
I am seeing error messages like this frequently.
I don't know what this error message means, but it seems to happen when multiple singularity processes across multiple compute nodes try to start up the same container at around the same time, all of them accessing an NFS-mounted singularity cache directory (export SINGULARITY_CACHEDIR=/export/singularity).
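To be concrete, the job setup looks roughly like this (the cache path is the one mentioned above; the image and command are illustrative):

```bash
#!/bin/bash
# Every compute node points at the same NFS-mounted cache directory,
# and each job pulls/runs the container through that shared cache.
export SINGULARITY_CACHEDIR=/export/singularity
singularity exec docker://example.org/myapp:latest myapp --input data.txt
```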
Am I not supposed to share the same singularity cachedir across multiple nodes? We have pretty small /tmp and /home space (only 20G), so I'd like to avoid using the local /tmp or /home directory to store the singularity cache...
I am using the following singularity installation
Thank you for your support.