docker: do not get layers if present in local docker service #199
Thank you for the issue. (Un)fortunately we do not use docker images much with datalad-containers, since before the advent of podman it was impossible to run them on HPCs. For that we typically used singularity containers, and even established https://github.com/ReproNim/containers (AKA
Sorry, I am not sure what you mean by "local docker service" such that the dataset is installed there.
That is unlikely, since another immediately available solution is just to have a

Here is a demo; you would likely use the same approach if the containers were singularity images:

```
❯ datalad --dbg containers-run -n alpine-310 uname -a
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
Linux f435385a0d00 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
get (notneeded: 1)
run (ok: 1)
save (notneeded: 1)
❯ datalad clone $PWD $PWD-fullclone; datalad -C $PWD-fullclone containers-run -n alpine-310 uname -a
install(ok): /tmp/test-docker-fullclone (dataset)
[INFO ] Making sure inputs are available (this may take some time)
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/VERSION (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/layer.tar (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/e7b300aee9f9bf3433d32bc9305bfdd22183beb59d933b48d77ab56ba53a197a.json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/manifest.json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/repositories (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image (directory)
[INFO ] == Command start (output follows) =====
Linux 8b267f7059b8 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker-fullclone (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
get (ok: 7)
run (ok: 1)
save (notneeded: 1)
❯ datalad clone --reckless ephemeral $PWD $PWD-ephemeral; datalad -C $PWD-ephemeral containers-run -n alpine-310 uname -a
install(ok): /tmp/test-docker-ephemeral (dataset)
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
Linux a04ba1bd2d45 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker-ephemeral (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
get (notneeded: 1)
run (ok: 1)
save (notneeded: 1)
```
But then care and extra steps might be needed if you are planning to save results into that ephemeral clone, since annexed keys would be saved into the original repo.
Thanks for the detailed reply. I have more recently been investigating the docker option, as we are pushing datalad dataset creation to a new level: orchestrating all the pipelines with gitlab, running conversion jobs (heudiconv, ...), preparation jobs (defacing, ...), QC jobs (mriqc, ...), processing jobs (*MRIPrep, ...), and testing jobs (bids-validator, ...) through gitlab-runners. In that environment, it is simpler to run docker than singularity-in-docker. However, it requires optimizing data transfer, and containers are a big part of the requirements for individual jobs.
Say one of the jobs clones a dataset containing docker images (likely in a subdataset tracked for reproducibility) to a server. It is possible that the docker layers for the images required by the job have already been pulled into that server's docker instance (from previous jobs), and since they are tracked with docker versioning, using those layers rather than pulling/importing them from the datalad dataset is more efficient without breaking reproducibility. The ephemeral option could work (and output data would never end up in the shared containers repo), but as gitlab jobs run within docker containers (so containers-run will likely run docker-in-docker), it might be less obvious to properly bind the right volumes, and it would be less easily portable/deployable to multiple servers. I will investigate the
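A minimal sketch of the check this implies, assuming nothing beyond the docker CLI (the helper name is hypothetical and not part of datalad-container):

```
import subprocess

def docker_has_image(image_id: str) -> bool:
    """Return True if the local docker daemon already has this image.

    Hypothetical illustration: `docker image inspect` exits non-zero
    when the image is unknown to the daemon, so a job could skip
    `datalad get` of the layer files in that case.
    """
    result = subprocess.run(
        ["docker", "image", "inspect", image_id],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```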
@bpinsard thanks for the idea re a docker daemon special remote. It sounds appealing, but I need to think about this more. Looking at the flow in the code, I think a docker-specific intervention might be less complex. Before
In principle it would be even better to not export them at all and simply skip their download/ingestion into docker entirely, but this would be a much larger break from the annexed-file paradigm.
I looked into this further. Pretty much all the necessary functionality is already implemented in the docker adapter. After this last exploration I am considering the following approach:
Here is a draft implementation that shows the basics of the approach. A helper is added to the docker adapter (see inline comments; a usage sketch follows the diff):

```
diff --git a/datalad_container/adapters/docker.py b/datalad_container/adapters/docker.py
index 8949852..1a18146 100644
--- a/datalad_container/adapters/docker.py
+++ b/datalad_container/adapters/docker.py
@@ -13,6 +13,7 @@ import hashlib
import json
import os
import os.path as op
+from pathlib import Path
import subprocess as sp
import sys
import tarfile
@@ -153,6 +154,51 @@ def load(path, repo_tag, config):
return image_id
+def repopulate_from_daemon(contds, imgpath: Path) -> None:
+    # crude check whether anything at the image location is not
+    # locally present
+    contrepo = contds.repo
+    if contrepo.call_annex(
+            ['find', '--not', '--in', 'here'],
+            files=str(imgpath),
+    ):
+        # a docker image is a collection of files in a directory
+        assert imgpath.is_dir()
+        # we could look into `manifest.json`, but it might also be
+        # annexed and not around. instead look for the config filename
+        imgcfg = [
+            p.name for p in imgpath.iterdir()
+            if len(p.name) == 69 and p.name.endswith('.json')
+        ]
+        # there is only one
+        assert len(imgcfg) == 1
+
+        # look for the employed annex backend
+        backends = set(contrepo.call_annex_oneline([
+            'find',
+            f'--branch=HEAD:{imgpath.relative_to(contds.pathobj)}',
+            '--anything',
+            '--format=${backend} ']).split())
+        # we can only deal with a single homogeneous backend here
+        assert len(backends) == 1
+
+        # ID is filename, minus .json extension
+        img_id = imgcfg[0][:-5]
+        with tempfile.TemporaryDirectory(dir=imgpath) as tmpdir:
+            # try to export the image from a local docker instance
+            save(f'sha256:{img_id}', tmpdir)
+            # the line above will raise an exception when
+            # - this docker does not have the image,
+            # - or there is no docker running at all.
+            contrepo.call_annex(
+                ['reinject', '--known', '--backend', backends.pop()],
+                files=[
+                    str(p) for p in Path(tmpdir).glob('**/*')
+                    if p.is_file()
+                ],
+            )
+
+
 # Command-line
```
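For orientation, with this patch applied the helper could be driven like this (a sketch only; the dataset path and container name are made up):

```
from pathlib import Path
from datalad.api import Dataset
from datalad_container.adapters.docker import repopulate_from_daemon

# dataset with a configured docker image (illustrative path/name)
ds = Dataset('/tmp/test-docker-fullclone')
# if layer files under the image directory are not locally present,
# export the image from the local docker daemon and reinject the files
# into the annex instead of downloading them from a remote
repopulate_from_daemon(
    ds,
    imgpath=ds.pathobj / '.datalad' / 'environments' / 'alpine-310' / 'image',
)
```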
This helper is executed just before the command is fired:

```
diff --git a/datalad_container/containers_run.py b/datalad_container/containers_run.py
index d7b4e79..eabb448 100644
--- a/datalad_container/containers_run.py
+++ b/datalad_container/containers_run.py
@@ -4,6 +4,7 @@ __docformat__ = 'restructuredtext'
import logging
import os.path as op
+from pathlib import Path
import sys
from datalad.interface.base import Interface
@@ -163,6 +164,26 @@ class ContainersRun(Interface):
lgr.debug("extra_inputs = %r", extra_inputs)
+ if '-m datalad_container.adapters.docker run' in cmd:
+ # this will use the docker adapter to execute the container.
+ # let's check whether pieces of the container image are missing.
+ # if so, query the local docker, if it already has the respective
+ # image, in order to populate the dataset without having to go
+ # through a download -- pretty much an adhoc docker-special remote
+ from datalad_container.adapters.docker import repopulate_from_daemon
+ contds = require_dataset(
+ container['parentds'], check_installed=True,
+ purpose='check for docker images')
+ try:
+ repopulate_from_daemon(
+ contds,
+ # we use the container report here too, and not any of the
+ # processed variants from above to stay internally consistent
+ imgpath=Path(container['path']),
+ )
+ except Exception as e:
+ pass
+
with patch.dict('os.environ',
{CONTAINER_NAME_ENVVAR: container['name']}):
             # fire!
```

Issues:
So it turns out that the manifest depends on how the image was saved. This complicates things a bit: when the manifest is annexed, we would need to match the query to its content (which we do not necessarily have locally). The generic approach would be to parse
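For reference, a minimal sketch of what parsing the export manifest could look like, assuming the classic `docker save` archive layout with `Config`/`RepoTags` entries (this is not code from the adapter):

```
import json
from pathlib import Path

def image_identity(imgpath):
    """Report how a saved image can be addressed: by image (config) ID
    and/or by the repository tags recorded in manifest.json."""
    manifest = json.loads((Path(imgpath) / 'manifest.json').read_text())
    # an archive may hold several images; here we only expect one
    entry = manifest[0]
    # 'Config' is the config filename, e.g. '<sha256-hex>.json'
    image_id = entry['Config'][:-len('.json')]
    # 'RepoTags' may be null/empty when the image was saved by ID only
    repo_tags = entry.get('RepoTags') or []
    return image_id, repo_tags
```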
I resolved the repo-tag issues. I will spend some time on documenting this, and then open a PR.
Closes: datalad#199

Demo

```
❯ datalad create busydemo
❯ cd busydemo
❯ datalad containers-add -u dhub://busybox:latest busy
❯ datalad drop .datalad/environments --reckless availability
❯ git annex info | grep 'local annex'
local annex keys: 0
local annex size: 0 bytes
❯ cat .datalad/config
[datalad "dataset"]
	id = b7adee52-a65a-43fc-a85b-c0d5e2d5b67c
[datalad "containers.busy"]
	image = .datalad/environments/busy/image
	cmdexec = {python} -m datalad_container.adapters.docker run {img} {cmd}
❯ datalad containers-run -n busy uname
[INFO ] Saved busybox:latest to /tmp/busydemo/.datalad/environments/busy/image/tmpzrdd7mj2
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/busydemo (dataset) [/home/mih/env/datalad-dev/bin/python -m ...]
```
If one repeatedly clones a dataset that contains a docker image onto the same machine, each clone will need to get the layers from a remote, even though the image may already have been pulled into the local docker.
For example, for parallel processing with fMRIPrep, one will clone the dataset as a temporary clone, but will still need to get the docker layers of the fMRIPrep image, even if these are already present in the docker instance.
For now, `datalad containers-run` will run the image if it is present in docker but still tries to fetch the layers, doing unnecessary, costly get operations. For example, if the layers were dropped recklessly, it tries to get the layers, fails, but runs the container anyway. I guess it would require changing datalad-container/datalad_container/containers_run.py (Line 141 in 66941a9).
I can see 2 options:
On a side note, would it make sense to add the local docker service where the dataset is installed as a special remote?
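If one went the special-remote route, a rough sketch with the `annexremote` library could look like the following; the hard part (mapping annex keys to docker image/layer content) is deliberately left as a stub, and none of this exists in datalad-container:

```
from annexremote import Master, RemoteError, SpecialRemote


class DockerDaemonRemote(SpecialRemote):
    """Read-only remote that would satisfy `get` from the local docker daemon."""

    def initremote(self):
        pass

    def prepare(self):
        pass

    def checkpresent(self, key):
        # git-annex only hands us a key, so a mapping from keys to image IDs
        # (or layer digests) would have to be recorded elsewhere, e.g. at
        # containers-add time, before the daemon can be queried here.
        return False  # placeholder: claim nothing is present

    def transfer_retrieve(self, key, filename):
        # would `docker save` the matching image and extract the member whose
        # checksum matches `key` into `filename`
        raise RemoteError('key-to-image mapping not implemented in this sketch')

    def transfer_store(self, key, filename):
        raise RemoteError('this remote is read-only')

    def remove(self, key):
        raise RemoteError('this remote is read-only')


def main():
    master = Master()
    master.LinkRemote(DockerDaemonRemote(master))
    master.Listen()


if __name__ == '__main__':
    main()
```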