This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Global Scheduling #601

Closed
aluzzardi opened this issue Apr 8, 2015 · 33 comments

Comments

@aluzzardi
Contributor

This feature would allow scheduling a container on every single node in the cluster.

A typical use case would be system containers (such as log collectors).

  • Those containers must be scheduled on every current node.
  • When new nodes join the cluster, global containers should be scheduled on them.
  • Standard constraints apply as well. For instance, one might want to schedule an nginx container on every node that satisfies the constraint role==frontend. In that case, the container would only be scheduled on frontend machines (current and future). (See the example after this list.)
  • UI issues: How do we represent global containers in docker ps? Single entry or multiple entries? (Depends on Virtual IDs. See Virtual Container IDs #600)
    • If single entry: What ID and name is reported? How do we operate on them? For instance, what would docker logs <global container> do? How do we see the status of each one?
    • If multiple entries: How can the user know that a particular ID is a global container? How do we remove the global container? That is, `docker rm` would remove only that particular instance, not all the other ones. Also, it wouldn't prevent that global container from being scheduled onto new machines.
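
For reference, the constraint in that example uses the existing Swarm syntax; only the global behavior itself is new here. A rough sketch (how the container would be marked as global is still an open question):

docker run -d -e constraint:role==frontend nginx    # today: lands on one node matching the constraint

A global container would combine that same constraint with whatever flag or label ends up marking it as global.
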
@aluzzardi aluzzardi added this to the 0.3.0 milestone Apr 8, 2015
@aluzzardi
Contributor Author

@abronan
Contributor

abronan commented Apr 8, 2015

Makes sense to me.

> How do we represent global containers in docker ps?

I lean toward multiple entries, with a label specifying that the container is globally scheduled. This way you can filter containers by label using docker ps and issue a docker rm on any agent hosting the container.
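
Roughly (the label key/value here is just an example, not a decided name):

docker ps --filter "label=global=true"    # list only the globally scheduled containers
docker rm -f <container-id>               # remove a single instance on the agent hosting it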

@jimmyxian
Contributor

Agree, @abronan.
When removing a container, first check whether the container is globally scheduled. If it is, delete it on every node. Also, I think globally scheduled containers should be associated with a unique group ID. From that we can find all globally scheduled containers.

@smothiki
Contributor

smothiki commented Apr 9, 2015

I agree with @abronan on multiple entries with --label=global.
Some thoughts on the design:
Maintain a key-value pair for global containers: {container_name: list_nodes}.
If docker rm is issued through the Swarm API, remove all the containers and delete the key-value pair.
If docker rm is issued from a normal Docker client, remove the container on that node and update list_nodes.

If a node dies, there is no need to reschedule the container, since it will still be present on other nodes. Rescheduling of the global container can be prevented by consulting the map discussed above.

@aluzzardi aluzzardi assigned aluzzardi and unassigned aluzzardi Apr 10, 2015
@chanwit
Contributor

chanwit commented Apr 12, 2015

I'd love to go with the single entry approach.

We've already run into the problem of listing a massive number of containers (and images) on our 50-node cluster (via DockerUI). As we expect to have at least 200 nodes in the near future, Virtual IDs would be a direct benefit to us.

Aggregate similar containers and maybe have a display like:

NAME                         #CONTAINERS
swarm_vid/thousand_sunny      50

Then docker inspect swarm_vid/thousand_sunny could return the list of all nodes behind this VID.

@abronan
Contributor

abronan commented Apr 13, 2015

@chanwit We should allow both for convenience: a single-entry approach as a quick overview of the globally scheduled containers (swarm list global?), and the default multiple-entry approach through docker ps and --filter. Having the multiple-entry approach is mandatory if you want to diagnose which container failed amongst the 50 containers you were trying to schedule globally. We could do everything through a virtual ID and list things with inspect/logs/etc., but debugging would be a pain.

@tnachen
Contributor

tnachen commented Apr 13, 2015

I think having both makes sense, as the client needs to be able to remove the global schedule as well as restart certain instances in the cluster. I'm just not sure how, once we introduce "virtual" containers, we would translate all the Docker client APIs that target a single container to handle this special case. I'm also not sure what Swarm should return for docker inspect on such a container, as it won't have a PID, network, etc.

@denverdino
Contributor

+1 @abronan

I like the idea of using a label to specify and filter the containers.

@vbichov

vbichov commented Jun 2, 2015

There is an issue related to container naming.
Globally scheduled containers may run with "--name [container-name]"
All of these containers will have the same name - which is great - it's what you want.

Today, you can't do that, because Swarm abstracts the cluster as if it were a single host.
It would be useful to give the same name to containers on different hosts even without global scheduling, but the point is that if you have dependent containers that use "--volumes-from" on the command line, you'll have a problem.
If the name is not uniform, you will not know what to write in the run command.
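
A minimal illustration (the names and the myapp image are only placeholders):

docker run -d --name logdata -v /data busybox true
docker run -d --volumes-from logdata myapp    # only works if a container named logdata exists on the node this lands on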

@aluzzardi
Contributor Author

After giving it some extra thought, I think global scheduling is just a special case of scaling.

We run into the exact same issues if we wanted to scale a container to X instances (single entry, naming, ...).

I believe we should address the scaling issue at the same time; maybe there should be a higher-level concept (a container group or something like that).

@vbichov

vbichov commented Jun 2, 2015

@aluzzardi You don't necessarily schedule globally because you wish to scale.
Sometimes you do it because Swarm is presented as a single Docker host yet provides no solution for "--volumes-from" across "actual" hosts.
So you need identical instances of globally scheduled containers - not because you need more than one, but because you have no choice.

@aluzzardi
Contributor Author

@vbichov, indeed. However, in terms of design, implementation, and user experience, the two are essentially identical.

@vbichov

vbichov commented Aug 2, 2015

Is scaling a planned feature? Just select the top "-e constraint:scale=[number]" hosts from the list returned by the scheduler and run? (And apply the pigeonhole principle?)

@aluzzardi aluzzardi modified the milestones: 0.5.0, 0.4.0 Aug 5, 2015
@chanwit
Contributor

chanwit commented Aug 16, 2015

Maybe it's time to bring this up.

@aluzzardi I am trying to conceptualize how global scheduling would work.
With libkv, it would be something like the following:

Step 1.    $ docker run -d -e global.scheduling=true -p 80:80 nginx                             

Step 2.    +-----------+                                                                        
           |           |                                                                        
           |           |        +--------------------------------------------------------------+
           |   libkv   | <------+ global:true, image:0dabcdef (nginx), container_config:{...}  |
           |           |        +--------------------------------------------------------------+
           |           |                                                                        
           +-----------+                                                                        

Step 3.    Subscribe to newly added node event: => { apply filters, run container_config }

Step 4.    Run using the created container_config                                               

From this design, I have found that global scheduling requires a specific implementation that I cannot relate to scaling. Please correct me if I am wrong.

@abronan
Contributor

abronan commented Aug 17, 2015

I don't think it is necessary to use libkv for that. This would limit the usage of globally scheduled containers to users running a KV backend on the side (ie. step 2 is unnecessary).

We can just label containers with global==true and keep track of those based on their unique Virtual ID.

By the way, do you mean that a globally scheduled container should be automatically deployed when a new node is added to the cluster (to match the other machines running those containers)?

@chanwit
Contributor

chanwit commented Aug 18, 2015

> I don't think it is necessary to use libkv for that. This would limit the usage of globally scheduled containers to users running a KV backend on the side (ie. step 2 is unnecessary).

@abronan yep, this may be optional. But I still think it's necessary to share the configuration of globally scheduled containers, in case of a Swarm master failure.

> By the way, do you mean that a globally scheduled container should be automatically deployed when a new node is added to the cluster (to match the other machines running those containers)?

Yes, it's a normal use case for Big Data / Hadoop clusters. This is also mentioned by @aluzzardi above.

@rgbkrk

rgbkrk commented Aug 18, 2015

Oooh, this! I'm all about peek-a-boo services when nodes come online.

/cc @smashwilson

@abronan
Contributor

abronan commented Aug 18, 2015

@chanwit OK, my only concern with the auto-run on newly added nodes is that a Swarm can be divided into multiple sets/regions. Thus, a newly added node might be annotated with a label that says it belongs to another group (not necessarily running the same tasks/workloads). In this case, running the globally scheduled container automatically might not be the expected behavior from a user perspective.

As for the use of libkv, OK if it stays optional. In the case of a Manager failure, if the container has already been scheduled, the other Managers will see the new container through a refresh/event and detect that it is a globally scheduled container through the label. That is why I think the storage is not necessary. (Storage might be needed if we store data that is not handled by the remote Docker daemons, like higher-level abstractions or specific metadata that needs to be shared consistently, etc.)

@chanwit
Contributor

chanwit commented Aug 22, 2015

At the code level, it seems this kind of scheduler would act as a kind of super-scheduler.

  • It still requires filters
  • When selecting nodes, it returns a list of nodes. This is at least one obvious difference from the basic schedulers.
  • It maintains a list of container_configs in memory. When a new node is added, the container placement mechanism starts automatically. There will be some locking concerns, as this scheduler and a user issuing commands from the CLI may try to place containers at the same time.

@aluzzardi is it OK for me to start implementing this?

@abronan If @aluzzardi is OK with me taking care of a PR for this, I'll really need your input on it.

@schmunk42

I would also be very interested in an update on this topic.
I tried to implement my own scheduler based on the swarm-manager logs, but it was unreliable.
When nodes joined or left during rescheduling, I ended up with Dead containers; see #1421.

@aluzzardi aluzzardi modified the milestones: 1.0.0, 1.1.0 Nov 27, 2015
@megastef

+1. I hope this will land in Swarm. Have a look at what others do:

  1. Kubernetes DaemonSets - a pod running on each node
  2. Tutum's EVERY_NODE deployment strategy in stack files
  3. Mesos constraints: ["hostname", "UNIQUE"]
  4. Fleet Global Unit - On CoreOS I would use a global fleet unit, but what to do on other platforms?

Here is my shell script to deploy monitoring agents to each node using docker-machine, as a kind of workaround:
https://forums.docker.com/t/best-way-to-deploy-monitoring-containers-missing-deploy-strategy-every-node/5189
But I'm looking for a better, more generic solution that is not limited to tools like docker-machine or to operating-system-specific services like global fleet units on CoreOS.

@abronan
Contributor

abronan commented Dec 22, 2015

@megastef In the meantime you can also use a Compose file and scale to the number of nodes you have, making sure to add an anti-affinity rule so that no two of these containers run on the same machine :)

I agree that it would be convenient to have this directly in Swarm, but Compose is also a cool way to solve it on top of Swarm.

@megastef

@abronan Thank you! This seems to work and is better than the shell-script.

sematext-agent:
  image: 'sematext/sematext-agent-docker:latest'
  environment:
    - LOGSENE_TOKEN=3b549a2c-653a-4832-xxx
    - SPM_TOKEN=fe31fc3a-4660-47c6-xxx
    - affinity:container!=sematext-* 
  privileged: true
  restart: always
  volumes:
    - '/var/run/docker.sock:/var/run/docker.sock'

And these commands:

eval $(docker-machine env swarm-master --swarm)
docker-compose up -d 
# scale is == num nodes
docker-compose scale sematext-agent=$(docker-machine ls | grep swarm | grep Running | wc -l)

But it needs to be re-run when the number of nodes changes, so I'm still looking forward to the new global scheduling feature ;)
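
A crude way to automate that, reusing the same commands as above, would be something like:

while true; do
  nodes=$(docker-machine ls | grep swarm | grep Running | wc -l)
  docker-compose scale sematext-agent=$nodes
  sleep 300   # re-check the node count every five minutes
done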

@schmunk42

A problem I noticed with affinity is that you run into trouble if you have different stacks with the same container names; there should also be a way to specify or limit the affinity scope to the current docker-compose stack.

@dnephin
Contributor

dnephin commented Dec 26, 2015

I think that should be handled by compose by using a unique project name. That way you'll never have duplicate container names, even if you have the same service name in different projects.

@schmunk42

@dnephin Basically I agree; the problem is that you do not know the project name inside the docker-compose.yml. Or is COMPOSE_PROJECT_NAME populated now, even if it is not explicitly set?
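
For what it's worth, the project name can always be pinned explicitly (mystack is just an example name), which also makes the container-name prefix predictable:

COMPOSE_PROJECT_NAME=mystack docker-compose up -d
# or equivalently
docker-compose -p mystack up -d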

@schmunk42

Found the related issue :) docker/compose#2294

@BrianAdams

+1

@leecalcote

Given global services as a new capability in 1.12, what remains to be implemented here? Eliminating naming conflicts between an agent's existing containers/services with the same name?

@dongluochen
Contributor

Given the global service implementation in Docker 1.12, there is not much left for Swarm to do here.

Container name conflicts are not part of global services.
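
For reference, the Docker 1.12 equivalent is the global service mode, e.g. (the image and service name are only examples):

docker service create --mode global --name web nginx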

@megastef

@leecalcote I will test the Swarm global service with the Sematext Docker agent. Thanks for the hint!
