download pipelines fails with some seqera containers #3285

Open
znorgaard opened this issue Nov 18, 2024 · 4 comments
Labels
bug (Something isn't working) · download (nf-core download)

Comments

@znorgaard

Description of the bug

Error encountered when downloading the dev branch of fastquorum with the latest nf-core/tools build.

@MatthiasZepper helped with a little digging.

I think (but could not check because of Gitpod) that the problem is this piece of code:

r = requests.get(container, allow_redirects=True, stream=True, timeout=60 * 5)
filesize = r.headers.get("Content-length")
if filesize:
    progress.update(task, total=int(filesize))
    progress.start_task(task)

# Stream download
for data in r.iter_content(chunk_size=io.DEFAULT_BUFFER_SIZE):
    # Check that the user didn't hit ctrl-c
    if self.kill_with_fire:
        raise KeyboardInterrupt
    progress.update(task, advance=len(data))
    fh.write(data)

At some iteration, progress.update(advance=) fails with a KeyError. I presume this is because the respective self._tasks[task_id] entry has already been removed?

with self._lock:                                              
    task = self._tasks[task_id]                               
    completed_start = task.completed                          
    if total is not None and total != task.total:

A plausible reason could be that the total filesize set for this task via the request header (filesize = r.headers.get("Content-length")) is smaller than what is actually sent later as data. It might be just a few extra bytes, but they then crash nf-core download?
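
Either way, the KeyError itself is easy to reproduce with rich: updating a task after it has been removed raises exactly this error. A minimal sketch of the symptom only, not the actual nf-core code path:

from rich.progress import Progress

with Progress() as progress:
    task = progress.add_task("Downloading", total=100)
    progress.remove_task(task)          # e.g. a duplicate download finishing first
    progress.update(task, advance=10)   # raises KeyError: the task is gone from self._tasks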

I did some additional digging on a branch: I commented out one progress logging line that seemed to be triggering the KeyError and added a few additional debug statements.

This generates a new error:

ERROR    [Errno 2] No such file or directory: './singularity_container_images/blobs-sha256-22-22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1-data.img.partial' ->                __main__.py:131
         './singularity_container_images/blobs-sha256-22-22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1-data.img'

From my debug statements it almost looks like this could be a repeated-action error. There are two processes that use this image (not sure if that is relevant).

In my debug statements I can see that the file is created and opened.

DEBUG    Opened output file, ./singularity_container_images/blobs-sha256-22-22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1-data.img.partial                                     download.py:1352

It looks like the singularity and docker images are found.

DEBUG    https://community-cr-prod.seqera.io:443 "GET /docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data HTTP/11" 200 825028608       connectionpool.py:546
DEBUG    https://community-cr-prod.seqera.io:443 "GET /docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data HTTP/11" 200 825028608       connectionpool.py:546
DEBUG    Request made for https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data                            download.py:1355
DEBUG    File size of https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data is 825028608                   download.py:1358
DEBUG    Request made for https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data                            download.py:1355
DEBUG    File size of https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/22/22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data is 825028608                   download.py:1358

Extra weirdness: I think the final expected file exists and has the correct size (825028608 == 825028608).

-rw-r--r--. 1 ec2-user ec2-user 825028608 Nov 18 17:06 singularity_container_images/community-cr-prod.seqera.io-docker-registry-v2-blobs-sha256-22-22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01ff>
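
If two downloads really do share the same blob, here is one plausible sequence, purely illustrative with placeholder paths, that would match both the vanished .partial file and the correctly sized final image:

import os

blob = "data.img"                       # placeholder for the real blob path

fh_a = open(blob + ".partial", "wb")    # download A opens the .partial file
fh_b = open(blob + ".partial", "wb")    # download B opens the very same path

fh_a.write(b"\0" * 16)                  # A streams its data and finishes first
fh_a.close()
os.rename(blob + ".partial", blob)      # A moves the partial file into place

fh_b.write(b"\0" * 16)                  # B keeps writing through its own handle, so the
fh_b.close()                            # final file still ends up with the expected size

os.rename(blob + ".partial", blob)      # B's rename now fails: [Errno 2] No such file or directory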

Command used and terminal output

# Dependencies: apptainer, nextflow, python3 + pip, git

pip3 install \
  --upgrade \
  --force-reinstall \
  git+https://github.com/nf-core/tools.git@dev

export NXF_SINGULARITY_CACHEDIR="./singularity_container_images"
mkdir -p $NXF_SINGULARITY_CACHEDIR

nf-core pipelines download nf-core/fastquorum \
  --revision dev \
  --outdir ./fastquorum \
  --compress "none" \
  --container-system 'singularity' \
  --container-library "quay.io" -l "docker.io" -l "community.wave.seqera.io" \
  --container-cache-utilisation 'amend' \
  --download-configuration 'yes'

System information

Nextflow version: 24.10.1
Hardware: AWS t2-micro
Executor: NA
OS: Amazon Linux
nf-core/tools version: dev
Python version: 3.12

@znorgaard znorgaard added the bug Something isn't working label Nov 18, 2024
@MatthiasZepper MatthiasZepper added the download nf-core download label Nov 18, 2024
@MatthiasZepper
Member

MatthiasZepper commented Nov 18, 2024

Thanks a lot for this in-depth investigation.

I need to explore this further, but if it is indeed an issue with two processes competing for the same cached blob, the download should work if you additionally provide the argument -d 1 / --parallel-downloads 1.

@znorgaard
Author

That works!

nf-core/fastquorum#95

@MatthiasZepper
Member

MatthiasZepper commented Nov 19, 2024

Yes, I tested on Gitpod as well and also found the bug.

The problem is that I had to rework the prioritize_direct_download() function to accommodate Seqera containers. They now bypass the prioritization function, but when hastily implementing this patch, I overlooked that this function also performs deduplication.

If multiple modules in the pipeline use the same container (as in fastquorum), the identical container URIs are extracted and, in the case of Seqera containers, not deduplicated. Hence we get the race condition in the cache that you observed.

In terms of tools, the fix is very simple. In prioritize_direct_download(), I have to change return sorted(list(d.values()) + seqera_containers) to return sorted(list(d.values()) + list(set(seqera_containers))).
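
For illustration, a toy version of what that one-line change does (d and the URI list are just stand-ins, not the real function internals):

blob_uri = (
    "https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/22/"
    "22e054c20192395e0e143df6c36fbed6ce4bd404feba05793aff16819e01fff1/data"
)
seqera_containers = [blob_uri, blob_uri]                 # same blob, referenced by two modules
d = {}                                                   # other containers, already deduplicated

sorted(list(d.values()) + seqera_containers)             # duplicate kept    -> two racing downloads
sorted(list(d.values()) + list(set(seqera_containers)))  # duplicate removed -> a single download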

But I would also like to have a test for the future, so the PR is a bit more comprehensive than that. I am working on it.

@MatthiasZepper
Member

Fix is on its way: #3293
