
Provide URLs to individual docker image layers #98

Open
yarikoptic opened this issue Aug 21, 2019 · 17 comments
Labels
docker Issues relating to docker support

Comments

@yarikoptic
Member

This was postponed until there was some interest in storing/running docker images, but it seems we never even created an issue for it. Some interest has now been expressed by users, so here is the issue: added docker image layers do not have URLs pointing to Docker Hub, so they cannot later be fetched on another box:

(git-annex)hopa:/tmp/test-docker[master]
$> datalad containers-add -i kwyk-img -u dhub://neuronets/kwyk:version-0.2-cpu kwyk
[INFO   ] Saved neuronets/kwyk:version-0.2-cpu to /tmp/test-docker/kwyk-img 
save(ok): /tmp/test-docker (dataset)
containers_add(ok): /tmp/test-docker/kwyk-img (file)
action summary:
  containers_add (ok: 1)
...
$> git annex whereis kwyk-img | head
whereis kwyk-img/1374c101c6f7038762c71038589946d60dcf6ea66dd9b89d511474b727aa7f0e/VERSION (1 copy) 
  	0aac68f2-5a96-4826-b5ff-69ec5d31863e -- yoh@hopa:/tmp/test-docker [here]
ok
whereis kwyk-img/1374c101c6f7038762c71038589946d60dcf6ea66dd9b89d511474b727aa7f0e/json (1 copy) 
  	0aac68f2-5a96-4826-b5ff-69ec5d31863e -- yoh@hopa:/tmp/test-docker [here]

Also, maybe a .gitattributes should be created in the image directory instructing that .json files be committed directly to git rather than annexed; for example:
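A minimal sketch of such a .gitattributes (relying on git-annex's annex.largefiles attribute; the exact patterns are just an assumption about the adapter's layout):

# keep the small docker metadata files in git rather than in the annex
*.json annex.largefiles=nothing
VERSION annex.largefiles=nothing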

@kyleam
Contributor

kyleam commented Sep 19, 2019

Here are some notes about what I've looked into. tl;dr I can download a blob for a layer I get from inspecting a manifest or OCI-layout directory, but I haven't figured out if there's a way to download a layer.tar file produced by docker save.

Time conservation notice: These are pretty incomplete notes on incomplete research, and I've probably injected a good dose of confusion into them.

notes

directory with extracted docker save output

docker pull busybox:latest
mkdir bb-dsave
docker save busybox:latest >bb-dsave/bb.tar
(cd bb-dsave && tar -xf bb.tar)

That results in a directory similar to what our docker adapter produces:

bb-dsave
|-- 19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d.json
|-- 65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542
|   |-- json
|   |-- layer.tar
|   `-- VERSION
|-- bb.tar
|-- manifest.json
`-- repositories

1 directory, 7 files

The configuration JSON matches the image's hash:

docker images --no-trunc --quiet busybox:latest
sha256:19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d

pulling manifest

Based on the documentation for accessing the manifest, I was hoping to be able to use that image ID to pull down a manifest. Here's an unpolished demo script that uses the first argument as the reference.

import sys
import json
import requests

repo = "library/busybox"

### https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)

headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}

ref = sys.argv[1]

# https://docs.docker.com/registry/spec/api/#manifest
man_url = "https://registry-1.docker.io/v2/{repository}/manifests/{reference}"

resp_man = requests.get(man_url.format(repository=repo, reference=ref),
                        headers=headers)

if resp_man.status_code != 200:
    sys.stderr.write("Failed to download manifest: {}\n"
                     .format(resp_man.status_code))
    json.dump(resp_man.json(), sys.stderr)
    sys.exit(1)
json.dump(resp_man.json(), sys.stdout)

Passing in the image ID doesn't seem to work:

Failed to download manifest: 500
{"errors": [{"code": "UNKNOWN", "message": "unknown error", "detail": {}}]}

If I use "latest" instead of the sha256 ref, I can pull down the manifest, and it has the matching image ID. So I'm not sure what I'm missing there.

jq .config.digest <gmout
sha256:19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d

layer IDs in manifest

The manifest lists just one layer:

jq .layers <gmout
[
  {
    "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
    "size": 760770,
    "digest": "sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b"
  }
]

Note that the layer doesn't match anything listed in the docker save archive directory.

find bb-dsave -type f | xargs sha256sum | cut -f1 -d' ' | grep 7c9d

It's also not in any of the json files in the archive directory:

grep 7c9d bb-dsave/*.json bb-dsave/65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542/json

OCI layout

We can convert that docker archive to an OCI-layout with skopeo (gh-106).

skopeo copy docker-archive:bb-dsave/bb.tar oci:bb-oci:latest 
#+RESULTS:
Getting image source signatures
Copying blob sha256:6c0ea40aef9d2795f922f4e8642f0cd9ffb9404e6f3214693a1fd45489f38b44
 1.37 MB / 1.37 MB [========================================================] 0s
Copying config sha256:8cf90cc9e23fce3bb22a95933b0f1008115828369857f09825dfb376b175f897
 575 B / 575 B [============================================================] 0s
Writing manifest to image destination
Storing signatures

Here's how that directory looks:

bb-oci/
|-- blobs
|   `-- sha256
|       |-- 7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b
|       |-- 8cf90cc9e23fce3bb22a95933b0f1008115828369857f09825dfb376b175f897
|       `-- 96fed174fbe8d6aeab995eb9e7fc03a6326abbc25adb5aa598e970dfe8b32c6d
|-- index.json
`-- oci-layout

2 directories, 5 files

Notice that this has the 7c9d20b9b6cda1c5… layer that was in the downloaded manifest but not the docker archive. The other blobs are json files.

downloading blob

Similar to the manifest script, here's a script that downloads a blob (spec here):

import sys
import json
import requests

repo = "library/busybox"

# https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)

headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}

ref = sys.argv[1]

# https://docs.docker.com/registry/spec/api/#manifest
blob_url = "https://registry-1.docker.io/v2/{repository}/blobs/{reference}"

resp_blob = requests.get(blob_url.format(repository=repo, reference=ref),
                         headers=headers)

if resp_blob.status_code != 200:
    sys.stderr.write("Failed to download manifest: {}\n"
                     .format(resp_blob.status_code))
    json.dump(resp_blob.json(), sys.stderr)
    sys.exit(1)
sys.stdout.buffer.write(resp_blob.content)

Trying many (I think all) of the blobs I can find in the docker archive, I get a "BLOB_UNKNOWN" response.

I can, however, download the layer blob from the OCI layout directory:

python getblob.py sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b >b1
sha256sum b1
7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b  b1

The other files in blobs/ (all json files) get "BLOB_UNKNOWN" responses.

The OCI layer finally gives us something we could expose through an annex special remote and register as a URL. It seems like using the OCI layout as the default dataset storage for docker and then converting it to something docker load will accept might be doable, but I haven't explored that yet.
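As a rough sketch of that idea (the repository name, paths, and registry blob URL layout are taken from the scripts above; whether each blob actually resolves would still need a getblob.py-style request):

import os

repo = "library/busybox"   # assumed repository
oci_dir = "bb-oci"         # OCI layout produced by skopeo above

# every blob in the OCI layout is addressed by its sha256 digest, which is
# exactly the reference the registry's /blobs/ endpoint expects
blob_dir = os.path.join(oci_dir, "blobs", "sha256")
for name in sorted(os.listdir(blob_dir)):
    digest = "sha256:" + name
    url = "https://registry-1.docker.io/v2/{}/blobs/{}".format(repo, digest)
    print(digest, "->", url)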

@yarikoptic
Member Author

@vsoch -- maybe you have ideas/knowledge on how to reference (via URLs or a request to the server) specific layers of a docker image on a hub? Would it be different for Docker Hub and https://quay.io/?

@vsoch

vsoch commented Jul 3, 2020

Why not just pull to Singularity and store the SIF binary? That's one clean command, gets all the layers, and handles the metadata too. @yarikoptic, why would you want to save only specific layers? If that's the case, then retrieving them via the Docker API is the way to go. Singularity used to do this, also using Python, if you look at the 2.x version.

@vsoch

vsoch commented Jul 3, 2020

As for the layers, you generally need to use the OCI distribution API to request the blobs, as @kyleam was hacking together. That does require getting the manifest, and then querying for each layer. Both Quay.io and Docker Hub use the same OCI distribution format so it shouldn't vary that much between them.
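Putting the two scripts above together, the flow described here (authenticate, fetch the manifest, then request each layer blob) could look roughly like the following sketch for Docker Hub; repository and reference are arguments, and other registries would need different auth and base URLs (this is not the adapter's implementation):

import sys
import requests

repo, ref = sys.argv[1], sys.argv[2]   # e.g. library/busybox latest

# pull-scoped token (Docker Hub specific; quay.io etc. have their own auth)
tok = requests.get(
    "https://auth.docker.io/token?service=registry.docker.io"
    "&scope=repository:{}:pull".format(repo)).json()["token"]
headers = {"Authorization": "Bearer " + tok,
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}

registry = "https://registry-1.docker.io/v2/" + repo
man = requests.get(registry + "/manifests/" + ref, headers=headers).json()

# multi-arch images return a manifest list; pick one platform's manifest first
if "manifests" in man:
    digest = next(m["digest"] for m in man["manifests"]
                  if m["platform"]["architecture"] == "amd64")
    man = requests.get(registry + "/manifests/" + digest, headers=headers).json()

# one blob request per layer listed in the manifest
for layer in man["layers"]:
    blob = requests.get(registry + "/blobs/" + layer["digest"], headers=headers)
    fname = layer["digest"].replace(":", "_") + ".tar.gz"
    with open(fname, "wb") as f:
        f.write(blob.content)
    print("fetched", layer["digest"], "->", fname)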

@yarikoptic
Member Author

Why not just pull to Singularity and store the SIF binary?

That is what we did for Singularity Hub... But it's a good idea to look into the Singularity 2.x Python code to see how to deal with individual docker layers.

@vsoch

vsoch commented Jul 3, 2020

I mean pull a SIF binary from Docker Hub, à la docker layers. You build a Docker container and then kill two birds with one stone: it can be pulled as either Docker or Singularity.

@yarikoptic
Member Author

yarikoptic commented Jul 3, 2020

Sorry -- I still don't fully get it: does Docker Hub contain (provide) a SIF binary? Does a SIF binary contain the layered structure of the docker image(s)? The goal is to later share this git/git-annex repository in such a way that it could pull docker layers from the hub (if still there ;-)) and be used by people on machines without Singularity, just docker.

@vsoch

vsoch commented Jul 3, 2020

Docker Hub has layers, so the Singularity client pulls the layers into a sandbox and builds an image from it. I added this in old Singularity (2.1 or so) and it's still the way it rolls :) You can't build the SIF binary without Singularity, and the resulting SIF binary wouldn't have any record of the previous layers; they are dumped into one filesystem. If you just want to pull docker layers, then just use the API to get the manifest and do that; it's fairly straightforward.

@yarikoptic
Member Author

Re the original endeavors of @kyleam: did some digging -- to request a manifest by "digest" you need the digest (not the image ID):

Here I used your script, but modified it to also take the repo as an argument:
#!/usr/bin/env python3
import sys
import json
import requests

# repo = "library/busybox"
# repo = "bitnami/wordpress"
# repo = "library/neurodebian" # /sid"
repo = sys.argv[1]

### https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)

headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}

ref = sys.argv[2]

# https://docs.docker.com/registry/spec/api/#manifest
man_url = "https://registry-1.docker.io/v2/{repository}/manifests/{reference}"

resp_man = requests.get(man_url.format(repository=repo, reference=ref),
                        headers=headers)

if resp_man.status_code != 200:
    sys.stderr.write("Failed to download manifest: {}\n"
                     .format(resp_man.status_code))
    json.dump(resp_man.json(), sys.stderr)
    sys.exit(1)
json.dump(resp_man.json(), sys.stdout)
$> docker image ls busybox --digests --no-trunc                                            
REPOSITORY          TAG                 DIGEST                                                                    IMAGE ID                                                                  CREATED             SIZE
busybox             latest              sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d   sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f   13 days ago         1.23MB
busybox             <none>              sha256:6915be4043561d64e0ab0f8f098dc2ac48e077fe23f488ac24b665166898115a   sha256:6d5fcfe5ff170471fcc3c8b47631d6d71202a1fd44cf3c147e50c8de21cf0648   10 months ago       1.22MB

$> ./get-from-docker-hub.py library/busybox sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d           
{"manifests": [{"digest": "sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc", "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "platform": {"architecture": "amd64", "os": "linux"}, "size": 527}, {"digest": "sha256:a7c572c26ca470b3148d6c1e48ad3db90708a2769fdf836aa44d74b83190496d", "mediaType": ...

And then you could follow up for the specific architecture, again by digest (I didn't see how to match it to the image ID yet):

$> ./get-from-docker-hub.py library/busybox sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 1493,
    "digest": "sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 764619,
      "digest": "sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572"
    }
  ]
}

Then the next question is whether/how those layers from the manifest relate to the ones saved into .tars. Part of the problem, I guess, is that in the above manifest they are "diff.tar.gzip", i.e. at least gzipped, whereas the saved ones are just .tar. When I use your other script to fetch that blob and gunzip it, it does indeed match:

(git)lena:…lad/trash[master]docker/bb-dsave/50670a188f4d6a8bfbe56cd21b56524e458de56637142b18933beda1acad863c
$> sha256sum layer.tar ../../sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572.tar
d2421964bad195c959ba147ad21626ccddc73a4f2638664ad1c07bd9df48a675  layer.tar
d2421964bad195c959ba147ad21626ccddc73a4f2638664ad1c07bd9df48a675  ../../sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572.tar

Unfortunately, I have not found a way to associate it with any digest we obtain from docker save. So, at least at the cost of refetching, it seems we can establish URLs for the 'save'd layers as gunzipped versions of the ones from the hub. Note: IIRC datalad-archives would do both gzip and tar decompression if it sees .tar.gz, so when storing an original key (to be decompressed), we would need to not include .tar in the key name.
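The fetch-gunzip-compare check above could be scripted roughly as follows (a sketch; the repository, digest, and local layer.tar path are placeholders):

import gzip
import hashlib
import sys

import requests

repo = "library/busybox"
layer_digest = "sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572"
local_layer_tar = sys.argv[1]   # a layer.tar from `docker save`

tok = requests.get(
    "https://auth.docker.io/token?service=registry.docker.io"
    "&scope=repository:{}:pull".format(repo)).json()["token"]
blob = requests.get(
    "https://registry-1.docker.io/v2/{}/blobs/{}".format(repo, layer_digest),
    headers={"Authorization": "Bearer " + tok})

# the registry serves the gzipped diff; decompress before comparing to layer.tar
remote_sha = hashlib.sha256(gzip.decompress(blob.content)).hexdigest()
with open(local_layer_tar, "rb") as f:
    local_sha = hashlib.sha256(f.read()).hexdigest()
print("match" if remote_sha == local_sha else "no match", remote_sha, local_sha)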

@vsoch

vsoch commented Oct 28, 2020

What are you trying to do @yarikoptic ? The digests are generated based on the hash of the config, which even with "the same" image is going to be different with different timestamps.

@yarikoptic
Member Author

yarikoptic commented Oct 28, 2020

overall:

  • we want to add individual layers of docker images under git-annex, with URLs to them on Docker Hub, so that we could later "instantiate" those images in the local docker instance/engine.

so far the approach was:

  • use docker save, add everything to git/annex (so no URLs to anything on Docker Hub), then docker load -- pretty much what the notes of @kyleam show in Provide URLs to individual docker image layers #98 (comment). This way even images which aren't on any hub can be nicely kept under git/annex control.
  • that is why we researched how to figure out URLs for the layers produced by docker save, but apparently no digest matched that would associate them with the layer entries in the manifest from the hub
  • with my exploration above, the feasible way is to not bother "matching" ;) If for a given image ID there is a digest, request the manifest for that image from the hub, download all the .tar* layers, and add them to git-annex (but so that the .tar portion is not decompressed, only the trailing .gz). This way, if we do have a match (at the level of the checksum of that .tar, which I expect to happen), git-annex would gain that URL for the .tar which we obtained from docker save
    • edit1: since we would be adding the original (not yet decompressed) layer to annex, before considering a layer for download we could check whether the corresponding layer was fetched before. Maybe we could add the layer digest(s) as metadata on those non-decompressed but annexed files, so we could possibly find them among annexed keys (keys would be based on the content checksum, so not that digest). See the sketch below.
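A hedged sketch of that approach (the filenames, the metadata field name, and the use of plain git-annex CLI calls are all assumptions, not implemented behaviour; actually re-fetching the registered URL would still need the registry auth handled somehow):

import subprocess

def annex_layer(path, layer_url, layer_digest):
    """Add an already-downloaded, still-gzipped layer to the annex, register
    its registry URL, and record the digest as metadata (sketch only)."""
    # per the note above, avoid a trailing .tar in the filename so that
    # datalad-archives would only strip the .gz, not also unpack the tar
    subprocess.run(["git", "annex", "add", path], check=True)
    key = subprocess.run(["git", "annex", "lookupkey", path],
                         check=True, capture_output=True,
                         text=True).stdout.strip()
    # record the hub URL for the key so it could be fetched on another box
    subprocess.run(["git", "annex", "registerurl", key, layer_url], check=True)
    # hypothetical metadata field to find this layer again by its digest
    subprocess.run(["git", "annex", "metadata", "--set",
                    "distribution-digest=" + layer_digest, path], check=True)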

@vsoch

vsoch commented Oct 28, 2020

I think it would be cleaner to go directly from the Registry API, and then retrieve the exact downloads for the layers and config, which already come with the digests. As I understand docker save, it's going to write on the fly, which means new timestamps and thus new hashes. It would be confusing for the user to see a known image locally (e.g., busybox:vX.X.X) and then not see digests that line up with what is on Docker Hub (or another registry). The benefit of not using docker save is that docker does not become a dependency for datalad. It's also messy to have "the same" layers that appear different because of different timestamps.

On a higher level, is it really reasonable to start saving container images / layers to git? Those are a lot of huge files! There is something to be said for having a registry with URIs (which, depending on the registry, can persist) be the provider of the metadata (digests and links of blobs to images and tags). If datalad aims to become a provider of container layers and artifacts, you might consider looking at the OCI distribution spec so it can provide the same standardized / expected interactions to users. I guess it seems like datalad is trying to be the tool for everything rather than serving a more specific or narrow use case.

@bpinsard

Hi all, I see @asmacdo was assigned to this issue; are there any plans to implement this feature?

@asmacdo
Member

asmacdo commented Mar 22, 2023

@bpinsard I am assigned so I can investigate, but I don't have specific plans yet. FWIW, I tend to agree that we should avoid docker save layers and stick to what is provided by the OCI API (manifest / manifest list).

It's worth noting that we would only be storing metadata in git (the hashes); the blob bits would be moved around with git-annex via datalad.

@bpinsard

Thanks, that's great! I recently opened #199, which is somewhat related.

I wonder if the get-from-docker-hub.py code from #98 (comment) could be adapted into a get-only external git-annex special remote to fetch layers from Docker Hub: https://git-annex.branchable.com/design/external_special_remote_protocol/
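A minimal sketch of what such a get-only remote might look like, assuming the annexremote helper library and assuming (hypothetically) that the repository and layer digest for each key were stashed as per-key state when the layer was added:

#!/usr/bin/env python3
import requests
from annexremote import Master, SpecialRemote, RemoteError

class DockerHubLayerRemote(SpecialRemote):
    """Read-only special remote fetching layer blobs from Docker Hub (sketch)."""

    def initremote(self):
        pass

    def prepare(self):
        pass

    def transfer_retrieve(self, key, filename):
        # hypothetical convention: "repo digest" stored as per-key state
        state = self.annex.getstate(key)
        if not state:
            raise RemoteError("no docker hub location known for " + key)
        repo, digest = state.split()
        tok = requests.get(
            "https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{}:pull".format(repo)).json()["token"]
        resp = requests.get(
            "https://registry-1.docker.io/v2/{}/blobs/{}".format(repo, digest),
            headers={"Authorization": "Bearer " + tok})
        if resp.status_code != 200:
            raise RemoteError("blob download failed: {}".format(resp.status_code))
        with open(filename, "wb") as f:
            f.write(resp.content)

    def transfer_store(self, key, filename):
        raise RemoteError("this remote is get-only")

    def checkpresent(self, key):
        # claim presence only for keys we know a registry location for
        return bool(self.annex.getstate(key))

    def remove(self, key):
        raise RemoteError("this remote is get-only")

if __name__ == "__main__":
    master = Master()
    master.LinkRemote(DockerHubLayerRemote(master))
    master.Listen()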

It's worth noting that we would only be storing metadata in git (the hashes); the blob bits would be moved around with git-annex via datalad.

Indeed, that is something I realized was not set by default, hence #204.

@yarikoptic
Member Author

@bpinsard sorry for the delay on this one: I don't think a special custom remote should be necessary. We "just" need to completely redo how we store docker layers and avoid docker save, since the manifest it produces doesn't (AFAIK) give us layers which could be fetched from the hub... I need to find time to look back into this.

@yarikoptic
Member Author

A note here: I discovered https://github.com/indigo-dc/udocker -- which is quite cool since it is pure Python and doesn't require installation of docker. But underneath there are some "magic"al downloads etc.:

udocker "executes" the containers by simply providing a chroot like environment over the extracted container. The current implementation supports different methods to mimic chroot thus enabling execution of containers under a chroot like environment without requiring privileges. udocker transparently supports several methods to execute the containers based on external tools and libraries such as: PRoot Fakechroot runc crun Singularity

With the exception of Singularity the tools and libraries to support execution are downloaded and deployed by udocker during the installation process. This installation is performed in the user home directory and does not require privileges. The udocker related files such as libraries, executables, documentation, licenses, container images and extracted directory trees are placed by default under $HOME/.udocker.

But maybe it could also be used just as a library to download/access/manipulate images etc., or even indeed as yet another executor.
