This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Using nvidia-docker from third-party tools #39

Closed
hannes-brt opened this issue Jan 25, 2016 · 26 comments

@hannes-brt

It's very easy to use nvidia-docker when running individual containers, but is there a way to run nvidia-docker instead of docker from other Docker tools like docker-compose, Tutum, Rancher, etc?

I am assuming one would just need to specify the nvidia-docker volume to be mounted in the container, but I couldn't find any documentation on the correct syntax.

@flx42
Member

flx42 commented Jan 25, 2016

If the tool supports overriding the docker command, then you should use that to plug in nvidia-docker. For example, we provide this option ourselves with environment variable NV_DOCKER.
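As a minimal sketch of that kind of override (here pointing NV_DOCKER at a specific underlying docker command; adjust to your setup):

# Tell nvidia-docker which docker command to wrap
NV_DOCKER='sudo docker' nvidia-docker run --rm nvidia/cuda nvidia-smi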

If it's not possible, you can query the plugin for the Docker CLI arguments:

$ curl -s localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --volume-driver=nvidia-docker --volume=nvidia_driver_352.68:/usr/local/nvidia:ro

Of course, you will need to transform this into YAML format for docker-compose (for example).
We were wondering what to do in this case and couldn't find a clean solution. Would it help if we added a REST endpoint that returns the CLI arguments above as YAML? Or as JSON?
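In the meantime, transcribed by hand, the compose equivalent of the CLI arguments above might look roughly like this (a sketch; the device and volume names are specific to that machine and driver version):

cuda:
  image: nvidia/cuda
  command: nvidia-smi
  devices:
    - /dev/nvidiactl
    - /dev/nvidia-uvm
    - /dev/nvidia0
    - /dev/nvidia1
  volume_driver: nvidia-docker
  volumes:
    - nvidia_driver_352.68:/usr/local/nvidia:ro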

Thank you!

@flx42 flx42 added the question label Jan 25, 2016
@ruffsl
Contributor

ruffsl commented Jan 26, 2016

If a sort of multiline variable substitution were possible in compose files, I guess you could add a YAML format option for the REST endpoint. But I think the environment variables are substituted after the YAML syntax is parsed, resulting in an "invalid mode" error from the improper structure. This still might be useful for a simple 'copy and paste' approach for compose files, though.

Is there a way we could use the volume-driver option with, say, a set label format to convey the card IDs to export?

@3XX0
Member

3XX0 commented Jan 26, 2016

@ruffsl I like the idea of using variable substitution.
Can't we leverage the one-line YAML syntax? If the REST endpoint outputs something like:

NV_DEVICES="['/dev/nvidia0', '/dev/nvidia1']"

We could easily do something like (the same way docker-machine does it):

eval "$(curl -s localhost:3476/docker/env)"

Given the following docker-compose.yml, it would work, right?

devices: ${NV_DEVICES}

@ruffsl
Contributor

ruffsl commented Jan 26, 2016

I'm just testing a foo bar example:

test:
  image: ubuntu
  volumes: ${FOO_BAR}
  command: ping 127.0.0.1

and am seeing this:

$ mkdir /tmp/foo
$ mkdir /tmp/bar
$ export FOO_BAR="['/tmp/bar:/bar', '/tmp/foo:/foo']"
$ docker-compose up
ERROR: Validation failed in file './docker-compose.yml', reason(s):
Service 'test' configuration key 'volumes' contains an invalid type, it should be an array

Is this the correct one-line YAML syntax? I've never seen it before.
Also, how could you keep other volumes or device lists defined alongside (stuff you'd like to keep in the compose file)?

@3XX0
Member

3XX0 commented Jan 26, 2016

Not sure if docker-compose supports it, but it's part of the YAML spec:
http://yaml.org/spec/1.2/spec.html#id2759963

For other volumes/devices: if docker-compose supports multiple volumes or devices keywords, it would work; otherwise, I suppose you could omit the brackets.

@flx42
Member

flx42 commented Jan 26, 2016

It seems to work without variable substitution:

test:
  image: ubuntu
  devices: ['/dev/nvidiactl', '/dev/nvidia-uvm', '/dev/nvidia0']
  command: ls /dev/

Variable substitution might be designed to generate values, not YAML structure.
But let's wait for an official response from the docker-compose developers.

@ruffsl: I didn't understand the following, could you explain it?

Is there a way we could use the volume-driver option with, say, a set label format to convey the card IDs to export?

@ruffsl
Contributor

ruffsl commented Jan 26, 2016

@flx42 This is an example of what I was thinking:

test:
  image: ubuntu
  volume_driver: nvidia-docker-driver
  labels:
    nvidia.gpu: "0,1"
  command: nvidia-smi

I'm not sure how feasible this would be, but I think it's attractive in its simplicity from a user perspective. You've developed a plugin; could a custom driver be extended to parse the label metadata? This could then be a clean way to define nvidia containers with compose.

You could also use variable substitution with this: export NV_GPU='0,1'

labels:
  nvidia.gpu: ${NV_GPU}

@flx42
Member

flx42 commented Jan 26, 2016

A volume plugin will not be able to mount devices, and actually a plugin can't even inspect the starting image AFAIK.

@flx42
Member

flx42 commented Jan 27, 2016

Good progress on docker/compose#2750.
There is a pending PR to solve this use case (I haven't tested it).
But it will require a bleeding-edge version of docker-compose if it's accepted.

@3XX0
Member

3XX0 commented Jan 28, 2016

@flx42 is correct: currently the only supported plugin type is VolumeDriver, and it only deals with volume names (code).
I think a JSON endpoint could be useful here; however, it means you would need to write small wrappers around docker-compose and other tools to do the conversion (e.g. JSON -> YAML config).
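As an illustration, the JSON from such an endpoint might look roughly like this (a sketch; the key names mirror the /docker/cli/json response used later in this thread, and the values depend on the machine):

{
  "Volumes": ["nvidia_driver_352.68:/usr/local/nvidia:ro"],
  "VolumeDriver": "nvidia-docker",
  "Devices": ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0", "/dev/nvidia1"]
}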

@3XX0 3XX0 mentioned this issue Mar 2, 2016
@therc

therc commented Apr 7, 2016

Another vote here. I'm trying to get Kubernetes to talk to the plugin. Kubernetes doesn't build command lines; it uses a Docker client API (fsouza's, but there's a migration in progress to the official one). JSON might work. Or perhaps parts of nvidia-docker and nvidia-docker-plugin could be embedded as a library inside kubelet (the daemon that manages the node) or in a helper process running on the same machine.

@3XX0
Member

3XX0 commented Apr 7, 2016

I'm not really familiar with Kubernetes, but we are definitely interested in supporting it. Since it's written in Go, the nvidia package should do it. Alternatively, we could use nvidia-docker-plugin as a flex volume driver (maybe?).
Anyhow, feel free to create a separate issue and we'll address those requirements specifically.

@matthieudelaro

Nut now uses nvidia-docker-plugin to mount GPUs in containers :)
I'm not using the nvidia-docker/nvidia module, though, but rather targeting the REST API directly to retrieve the GPU paths and volume name, and injecting those values into the Docker API using go-dockerclient.
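For reference, the same lookup can be sketched in a few lines of Python (Nut does this in Go via go-dockerclient; the sketch below uses the /docker/cli/json endpoint discussed later in this thread, and the example values are illustrative):

import json
import urllib2

# Query nvidia-docker-plugin for the device paths and driver volume
info = json.loads(urllib2.urlopen("http://localhost:3476/docker/cli/json").read())
gpu_devices = info["Devices"]        # e.g. ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"]
driver_volume = info["Volumes"][0]   # e.g. "nvidia_driver_352.68:/usr/local/nvidia:ro"
# These values can then be passed to the Docker API as devices and a volume binding.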

@ruffsl
Contributor

ruffsl commented May 5, 2016

So is there a method to use the nvidia plugin with docker-compose now,
or can that be broken out into its own specific issue?

@3XX0
Member

3XX0 commented May 5, 2016

Well, it's not working out of the box, but with the addition of the /docker/cli/json endpoint you can generate docker-compose files easily. For example:

#! /usr/bin/env python
# Python 2 script: query nvidia-docker-plugin for the Docker CLI arguments
# and write them out as a docker-compose service definition.

import urllib2
import json
import yaml
import sys

if len(sys.argv) == 1:
    print "usage: %s service [key=value]..." % sys.argv[0]
    sys.exit(0)

# Ask the plugin for the CLI arguments in JSON form.
resp = urllib2.urlopen("http://localhost:3476/docker/cli/json").read()

# Rename the keys to match the docker-compose schema.
args = json.loads(resp)
args["volumes"] = args.pop("Volumes")
args["devices"] = args.pop("Devices")
args["volume_driver"] = args.pop("VolumeDriver")

# Merge in any extra key=value pairs passed on the command line (e.g. image, command).
doc = {sys.argv[1]: args}
for arg in sys.argv[2:]:
    k, v = arg.split("=")
    args[k] = v

yaml.safe_dump(doc, open("docker-compose.yml", "w"), default_flow_style=False)
$ ./compose.py cuda image=nvidia/cuda command=nvidia-smi
$ docker-compose up
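The generated docker-compose.yml would then look roughly like this (a sketch; the exact device and volume names depend on your GPUs and driver version):

cuda:
  command: nvidia-smi
  devices:
  - /dev/nvidiactl
  - /dev/nvidia-uvm
  - /dev/nvidia0
  image: nvidia/cuda
  volume_driver: nvidia-docker
  volumes:
  - nvidia_driver_352.68:/usr/local/nvidia:ro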

@anibali

anibali commented May 6, 2016

Whilst I appreciate this as a step in the right direction, this still isn't an ideal solution from my point of view. I'd like to see proper integration with docker-compose stay on the radar.

@MadcowD

MadcowD commented Jun 28, 2016

Agreed, this is not a canonical solution by any means. This issue should be reopened :/

@flx42
Member

flx42 commented Jun 28, 2016

@MadcowD I don't think there is much more we can do right now for better integration. But it's still on our radar, since we have people on our team using docker-compose with nvidia-docker.

@jmerkow

jmerkow commented Jul 29, 2016

@3XX0 I am trying to use your docker-compose example above with a version '2' docker-compose file, but I am running into difficulty.

Here is my docker-compose file:

version: '2'

volumes:
  nvidia_driver_352.63:
      driver: nvidia-docker

services:
  cuda:
    command: nvidia-smi
    devices:
    - /dev/nvidiactl
    - /dev/nvidia-uvm
    - /dev/nvidia0
    image: nvidia/cuda
    volumes:
    - nvidia_driver_352.63:/usr/local/nvidia:ro

I get the following error:

Creating volume "utility_nvidia_driver_352.63" with nvidia-docker driver
ERROR: create utility_nvidia_driver_352.63: unsupported volume: utility_nvidia_driver_352.63

Any thoughts?

@3XX0
Member

3XX0 commented Jul 29, 2016

Last time I tried, I had to create the volume beforehand with docker volume create and specify the volume as external in the compose file (see here). Not really ideal though...

@jmerkow

jmerkow commented Jul 29, 2016

Better than nothing. This worked for me.

Steps:

$ docker volume create --name=nvidia_driver_352.63 -d nvidia-docker # create the driver volume

docker-compose:

version: '2'

volumes:
  nvidia_driver_352.63:
    external: true

services:
  cuda:
    command: nvidia-smi
    devices:
    - /dev/nvidiactl
    - /dev/nvidia-uvm
    - /dev/nvidia0
    image: nvidia/cuda
    volumes:
    - nvidia_driver_352.63:/usr/local/nvidia/:ro

You should be able to generate this YAML file (and create the volume) by modifying compose.py above.
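For instance, a version '2' variant of compose.py might look roughly like this (an untested sketch; it assumes the same /docker/cli/json response shape as above, that the driver volume is the first entry in Volumes, and it shells out to docker volume create):

#! /usr/bin/env python
# Untested sketch: version '2' variant of compose.py.
# Creates the driver volume up front and marks it as external in the compose file.

import urllib2
import json
import yaml
import sys
import subprocess

if len(sys.argv) == 1:
    print "usage: %s service [key=value]..." % sys.argv[0]
    sys.exit(0)

args = json.loads(urllib2.urlopen("http://localhost:3476/docker/cli/json").read())
args["volumes"] = args.pop("Volumes")
args["devices"] = args.pop("Devices")
driver = args.pop("VolumeDriver")

# Assume the driver volume is the first binding, e.g. "nvidia_driver_352.63:/usr/local/nvidia:ro".
volume_name = args["volumes"][0].split(":")[0]

# Create the volume beforehand so the compose file only has to reference it as external.
subprocess.check_call(["docker", "volume", "create", "--name=" + volume_name, "-d", driver])

for arg in sys.argv[2:]:
    k, v = arg.split("=")
    args[k] = v

doc = {
    "version": "2",
    "volumes": {volume_name: {"external": True}},
    "services": {sys.argv[1]: args},
}
yaml.safe_dump(doc, open("docker-compose.yml", "w"), default_flow_style=False)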

Thank you.

@jmerkow

jmerkow commented Sep 22, 2016

FYI. I use a different solution that is a little easier to manage between machines.

In the compose file I set an external volume as before; however, I give the volume a static, common name and set the name alias via variable interpolation, like so:

volumes:
  nvidia_driver:
     external:
       name: ${NVIDIA_DRIVER_VOLUME}

Then in my service, I use this common name (nvidia_driver):

    volumes:
      - nvidia_driver:/usr/local/nvidia/:ro

All that remains is to set the NVIDIA_DRIVER_VOLUME environment variable to your local driver volume name. This can be obtained from docker volume ls, or from @3XX0's example code (just set the environment variable instead of writing a docker-compose file). I just added an export statement to my .bashrc.
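For example, the export might look roughly like this (a sketch; it assumes the driver volume is the only volume reported by docker volume ls with the nvidia-docker driver):

# In .bashrc: pick up the local nvidia-docker driver volume name
export NVIDIA_DRIVER_VOLUME=$(docker volume ls | awk '$1 == "nvidia-docker" {print $2; exit}')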

@flx42
Member

flx42 commented Sep 22, 2016

@jmerkow: @eywalker created a project called nvidia-docker-compose. We haven't tested it, but you might be interested in taking a look at it.

@achimnol

achimnol commented Nov 5, 2016

nvidia-docker ... works fine in my environment, but docker run $(curl -s http://localhost:3476/docker/cli) ... does not work and fails with the following message:

docker: Error response from daemon: create 0cdaed180e31650f260d3902833b65560bf5ba6d995c8450138990711d66be36: bad volume format: 0cdaed180e31650f260d3902833b65560bf5ba6d995c8450138990711d66be36.
See 'docker run --help'.

Another symptom is that tensorflow.Session() hangs in Python 3.5.2 inside containers, but only when the exact same containers are launched via my custom docker-py integration, which interprets and adds configuration arguments from http://localhost:3476/docker/cli. If launched with the nvidia-docker command, it works fine!

I'd like to know what exactly the nvidia-docker command does: not only which volume/binding arguments it adds, but also its internal differences from the plain docker command.
For example, I found that it sets two environment variables:

CUDA_DISABLE_UNIFIED_MEMORY=1
CUDA_CACHE_DISABLE=1

and afterwards loads the NVML C library while the nvidia-docker command is running.
What difference does this make? What are the potential causes of the indefinite hang in tensorflow when it is not launched with nvidia-docker?

@achimnol

achimnol commented Nov 5, 2016

I've found that it's not actually hanging, but becomes very, very slow (e.g., 10 sec on CPU or with nvidia-docker vs. 92 sec on GPU with the docker-py invocation). Maybe related to #224...?

@flx42
Member

flx42 commented Nov 5, 2016

@achimnol We explain what nvidia-docker does on our wiki.

The "bad volume format" error is a limitation of Docker, see #181.

Finally, as I explained in #224, we have heard multiple users claiming their code was slower inside Docker, but every single time it was because they compiled the project with different flags, or they had different settings during execution.
