
Error: unable to build: conveyor failed to get: no descriptor found for reference #4555

Closed
soichih opened this issue Oct 2, 2019 · 6 comments

Comments

@soichih
Contributor

soichih commented Oct 2, 2019

I am seeing error messages like this frequently.

$ singularity exec -e docker://brainlife/mrtrix3:3.0_RC3 ./mrtrix3_tracking.sh
...
Writing manifest to image destination
Storing signatures
FATAL:   Unable to handle docker://brainlife/mrtrix3:3.0_RC3 uri: unable to build: conveyor failed to get: no descriptor found for reference "c71e252ab45b69ce96ff07e4ea175392811eae8def326e90a74c3a89b4f9a7e9"

I don't know what this error message means, but it seems to happen when multiple singularity processes across multiple compute nodes try to start up the same container around the same time, all of them accessing an NFS-mounted singularity cache directory (export SINGULARITY_CACHEDIR=/export/singularity).

Am I not supposed to share the same singularity cachedir across multiple nodes? We have pretty small /tmp or /home space (only 20G), so I'd like to avoid using the local /tmp or /home directories to store the singularity cache...

I am using the following singularity installation:

[user@slurm5 ~]$ singularity --version
singularity version 3.2.1-1.1.el7
[user@slurm5 ~]$ cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[user@slurm5 ~]$ uname -a
Linux slurm5.novalocal 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Thank you for your support.

@rherban

rherban commented Oct 2, 2019

Singularity's cache system works well for a single user/node, but doesn't handle multiple nodes trying to read/write at the same time.

I'd recommend doing a singularity pull mrtrix.sif docker://brainlife/mrtrix3:3.0_RC3 and storing the container somewhere central. Then each node can run the same image without colliding.
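
For example, a rough sketch of that workflow, assuming a shared directory such as /shared/containers (placeholder path) is visible from every node:

# Pull once, on a single node, into shared storage
singularity pull /shared/containers/mrtrix3.sif docker://brainlife/mrtrix3:3.0_RC3

# Each compute node then runs the pre-built image directly; nothing touches the cache here
singularity exec -e /shared/containers/mrtrix3.sif ./mrtrix3_tracking.sh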

@soichih
Contributor Author

soichih commented Oct 2, 2019

Actually, I just noticed that I am seeing the same error message on a cluster that uses local /tmp as well. So NFS itself doesn't seem to be the issue.

By the way, I am also seeing the following error message.

[brlife@slurm3-compute11 brlife]$ singularity -vvv exec -e docker://brainlife/freesurfer_on_mcr:6.0.0 recon-all -h
VERBOSE: Set messagelevel to: 4
VERBOSE: Container runtime
VERBOSE: Check if we are running as setuid
VERBOSE: Get root privileges
VERBOSE: Spawn stage 1
VERBOSE: Execute stage 1
FATAL:   image format not recognized
VERBOSE: stage 1 exited with status 255

Both error messages seem to be related somehow, but I am not 100% sure.

I can fix this by removing the .sif image generated by singularity v3 under the $SINGULARITY_CACHEDIR/oci-tmp directory. Unfortunately, the issue keeps happening, so a one-time fix won't cut it.
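
For reference, the manual cleanup is roughly the following (the digest-named entries under oci-tmp vary per image):

# Remove the SIFs that singularity v3 built and cached from the docker layers
rm -rf $SINGULARITY_CACHEDIR/oci-tmp/*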

Creating a .sif image ahead of time might work, but we don't control when the container gets updated, so that will be difficult. Also, isn't that what singularity v3 already does automatically by caching those .sif images under $SINGULARITY_CACHEDIR/oci-tmp?

@rherban

rherban commented Oct 2, 2019

Are you trying to pull multiple copies of the container at the same time in this example? It's not NFS itself that's causing the problem, but rather multiple pulls writing to the same cache at the same time.

An mpirun -np 8 singularity pull docker://blah can exhibit this issue on a single node, since multiple processes will try to pull the container at the same time. The caching system works best if you pull once at the start of your job script, then do a singularity exec /shared/container.sif.
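
A minimal sketch of that pattern in a job script (the paths and the launcher invocation are placeholders):

#!/bin/bash
# Pull once, serially, before the parallel step starts.
# pull will not overwrite an existing file, so skip it if the SIF is already there.
[ -f /shared/container.sif ] || singularity pull /shared/container.sif docker://brainlife/mrtrix3:3.0_RC3

# Every rank then execs the already-built SIF, so nothing writes to the cache concurrently
mpirun -np 8 singularity exec /shared/container.sif ./mrtrix3_tracking.sh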

@soichih
Contributor Author

soichih commented Oct 2, 2019

I see, yes, pulling into the same cache multiple times at once is probably the issue. singularity v2 worked just fine in this case; either it knew how to handle concurrency, or it didn't have to because it created separate .sif files (or whatever the v2 equivalent was).

So, if I understand correctly: if the container has been updated and singularity needs to pull new docker layers, the very first job that uses singularity has to run on its own, and subsequent jobs can then run in parallel because the .sif file already exists under the oci-tmp directory at that point. Right? To keep it consistent with v2 behavior, and because singularity knows whether the container needs to be rebuilt with new layers, I think it should be singularity's responsibility to handle concurrent executions of the same container, probably by making the other processes wait for the very first singularity process to (re)build the .sif under oci-tmp.
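
Until something like that exists in singularity itself, a workaround sketch on our side could be to serialize the first pull with a lock, assuming flock behaves reliably on the filesystem holding the lock file (not a given on every NFS setup); the paths below are just placeholders:

# Serialize the image build across jobs with an exclusive lock on a shared lock file
LOCK=/export/singularity/mrtrix3.lock
SIF=/export/singularity/mrtrix3.sif

(
  flock -x 9
  # Only the first job actually pulls; the rest find the SIF already built
  [ -f "$SIF" ] || singularity pull "$SIF" docker://brainlife/mrtrix3:3.0_RC3
) 9>"$LOCK"

singularity exec -e "$SIF" ./mrtrix3_tracking.sh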

@adamnovak

This looks to be related to (or a duplicate of?) #3634.

@dtrudg
Contributor

dtrudg commented Jan 3, 2020

Closing as a duplicate as above - discussion can continue on #3634

To keep it consistent with v2 behavior, and because singularity knows whether the container needs to be rebuilt with new layers, I think it should be singularity's responsibility to handle concurrent executions of the same container, probably by making the other processes wait for the very first singularity process to (re)build the .sif under oci-tmp.

Note that this is extremely difficult to do across a cluster in a generic manner: the variety of parallel / network filesystems in use, and their varied locking support, gives ample opportunity to hit race conditions when atomic, cluster-coherent operations on files are not possible. The issue is less pronounced in Singularity 2.x because it only caches the source docker layers and does not create cached SIFs.
