
Error: unable to build: conveyor failed to get: no descriptor found for reference #4555

Closed
soichih opened this issue Oct 2, 2019 · 6 comments

Comments

@soichih
Contributor

soichih commented Oct 2, 2019

I am seeing error messages like this frequently.

$ singularity exec -e docker://brainlife/mrtrix3:3.0_RC3 ./mrtrix3_tracking.sh
...
Writing manifest to image destination
Storing signatures
FATAL:   Unable to handle docker://brainlife/mrtrix3:3.0_RC3 uri: unable to build: conveyor failed to get: no descriptor found for reference "c71e252ab45b69ce96ff07e4ea175392811eae8def326e90a74c3a89b4f9a7e9"

I don't know what this error message means, but it seems to happen when multiple singularity processes across multiple compute nodes try to start up the same container around the same time, all of them accessing an NFS-mounted singularity cache directory (export SINGULARITY_CACHEDIR=/export/singularity).

Am I not supposed to share the same singularity cachedir across multiple nodes? We have pretty small /tmp or /home space (only 20G), so I'd like to avoid using the local /tmp or /home directories to store the singularity cache...

I am using the following singularity installation:

[user@slurm5 ~]$ singularity --version
singularity version 3.2.1-1.1.el7
[user@slurm5 ~]$ cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[user@slurm5 ~]$ uname -a
Linux slurm5.novalocal 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Thank you for your support.

@rherban

rherban commented Oct 2, 2019

Singularity's cache system works well for a single user/node, but doesn't handle multiple nodes trying to read/write at the same time.

I'd recommend doing a singularity pull mrtrix.sif docker://brainlife/mrtrix3:3.0_RC3 and storing the container somewhere central. Then each node can run the same image without colliding.
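
For example, a rough sketch of that workflow, assuming a shared directory such as /shared/containers (placeholder path) is visible from every node:

# Pull once, on a single node, into shared storage
singularity pull /shared/containers/mrtrix3.sif docker://brainlife/mrtrix3:3.0_RC3

# Each compute node then runs the pre-built image directly; nothing touches the cache here
singularity exec -e /shared/containers/mrtrix3.sif ./mrtrix3_tracking.sh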

@soichih
Contributor Author

soichih commented Oct 2, 2019

Actually, I just noticed that I am seeing the same error message on a cluster that uses local /tmp as well. So NFS itself doesn't seem to be the issue.

By the way, I am also seeing the following error message.

[brlife@slurm3-compute11 brlife]$ singularity -vvv exec -e docker://brainlife/freesurfer_on_mcr:6.0.0 recon-all -h
VERBOSE: Set messagelevel to: 4
VERBOSE: Container runtime
VERBOSE: Check if we are running as setuid
VERBOSE: Get root privileges
VERBOSE: Spawn stage 1
VERBOSE: Execute stage 1
FATAL:   image format not recognized
VERBOSE: stage 1 exited with status 255

Both error messages seem to be related somehow, but I am not 100% sure.

I can fix this by removing the .sif image generated by singularity v3 under the $SINGULARITY_CACHEDIR/oci-tmp directory. Unfortunately, the issue keeps happening, so a one-time fix won't cut it.
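
For reference, the manual cleanup is roughly the following (the digest-named entries under oci-tmp vary per image):

# Remove the SIFs that singularity v3 built and cached from the docker layers
rm -rf $SINGULARITY_CACHEDIR/oci-tmp/*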

Creating a .sif image ahead of time might work, but we don't control when the container gets updated, so that will be difficult. Also, isn't that what singularity v3 already does automatically by caching those .sif images under $SINGULARITY_CACHEDIR/oci-tmp?

@rherban

rherban commented Oct 2, 2019

Are you trying to pull multiple copies of the container at the same time in this example? It's not NFS itself that's causing the problem, but rather multiple pulls writing to the same cache at the same time.

An mpirun -np 8 singularity pull docker://blah can exhibit this issue on a single node, since multiple processes will try to pull the container at the same time. The caching system works best if you pull once at the start of your job script, then do a singularity exec /shared/container.sif.
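
A minimal sketch of that pattern in a job script (the paths and the launcher invocation are placeholders):

#!/bin/bash
# Pull once, serially, before the parallel step starts.
# pull will not overwrite an existing file, so skip it if the SIF is already there.
[ -f /shared/container.sif ] || singularity pull /shared/container.sif docker://brainlife/mrtrix3:3.0_RC3

# Every rank then execs the already-built SIF, so nothing writes to the cache concurrently
mpirun -np 8 singularity exec /shared/container.sif ./mrtrix3_tracking.sh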

@soichih
Contributor Author

soichih commented Oct 2, 2019

I see, yes, pulling into the same cache multiple times at once is probably the issue. singularity v2 worked just fine in this case; either it knew how to handle concurrency, or it didn't have to because it created separate .sif files (or whatever the v2 equivalent was).

So, if I understand correctly: if the container has been updated and singularity needs to pull new docker layers, the very first job that uses singularity has to run on its own, and subsequent jobs can then run in parallel because the .sif file already exists under the oci-tmp directory at that point. Right? To keep it consistent with v2 behavior, and because singularity knows whether the container needs to be rebuilt with new layers, I think it should be singularity's responsibility to handle concurrent executions of the same container, probably by making the other processes wait for the very first singularity process to (re)build the .sif under oci-tmp.
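
Until something like that exists in singularity itself, a workaround sketch on our side could be to serialize the first pull with a lock, assuming flock behaves reliably on the filesystem holding the lock file (not a given on every NFS setup); the paths below are just placeholders:

# Serialize the image build across jobs with an exclusive lock on a shared lock file
LOCK=/export/singularity/mrtrix3.lock
SIF=/export/singularity/mrtrix3.sif

(
  flock -x 9
  # Only the first job actually pulls; the rest find the SIF already built
  [ -f "$SIF" ] || singularity pull "$SIF" docker://brainlife/mrtrix3:3.0_RC3
) 9>"$LOCK"

singularity exec -e "$SIF" ./mrtrix3_tracking.sh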

@adamnovak

This looks to be related to (or a duplicate of?) #3634.

@dtrudg
Contributor

dtrudg commented Jan 3, 2020

Closing as a duplicate as above - discussion can continue on #3634

To keep it consistent with v2 behavior, and because singularity knows whether the container needs to be rebuilt with new layers, I think it should be singularity's responsibility to handle concurrent executions of the same container, probably by making the other processes wait for the very first singularity process to (re)build the .sif under oci-tmp.

Note that this is extremely difficult to do across a cluster in a generic manner: the variety of parallel / network filesystems in use, and their varied locking support, gives ample opportunity to hit race conditions when atomic, cluster-coherent operations on files are not possible. The issue is less pronounced in Singularity 2.x because it only caches the source docker layers and does not create cached SIFs.
