This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Global Scheduling #601

Closed
aluzzardi opened this issue Apr 8, 2015 · 33 comments

Comments

@aluzzardi
Contributor

This feature would allow scheduling a container on every single node in the cluster.

A typical use case would be system containers (such as log collectors).

  • Those containers must be scheduled on every current node.
  • When new nodes join the cluster, global containers should be scheduled on them.
  • Standard constraints apply as well. For instance, one might want to schedule an nginx container on every node that satisfies the constraint role==frontend. In that case, the container would only be scheduled on frontend machines (current and future). (See the example after this list.)
  • UI issues: How do we represent global containers in docker ps? Single entry or multiple entries? (Depends on Virtual IDs. See Virtual Container IDs #600)
    • If single entry: What ID and name is reported? How do we operate on them? For instance, what would docker logs <global container> do? How do we see the status of each one?
    • If multiple entries: How can the user know that a particular ID is a global container? How do we remove the global container? That is, `docker rm` would remove only that particular instance, not all the other ones. Also, it wouldn't prevent that global container from being scheduled onto new machines.
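
For reference, the constraint in that example uses the existing Swarm syntax; only the global behavior itself is new here. A rough sketch (how the container would be marked as global is still an open question):

docker run -d -e constraint:role==frontend nginx    # today: lands on one node matching the constraint

A global container would combine that same constraint with whatever flag or label ends up marking it as global.
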
@aluzzardi aluzzardi added this to the 0.3.0 milestone Apr 8, 2015
@aluzzardi
Contributor Author

@abronan
Contributor

abronan commented Apr 8, 2015

Makes sense to me.

> How do we represent global containers in docker ps?

I lean toward multiple entries, with a label specifying that the container is globally scheduled. This way you can filter containers by label using docker ps and issue a docker rm on any agent hosting the container.
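
Roughly (the label key/value here is just an example, not a decided name):

docker ps --filter "label=global=true"    # list only the globally scheduled containers
docker rm -f <container-id>               # remove a single instance on the agent hosting it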

@jimmyxian
Contributor

Agree, @abronan.
When removing a container, first check whether the container is globally scheduled. If it is, delete it on every node. Also, I think globally scheduled containers should be associated with a unique group ID. From that we can find all globally scheduled containers.

@smothiki
Contributor

smothiki commented Apr 9, 2015

I agree with @abronan on multiple entries with --label=global.
Some thoughts on the design:
Maintain a key-value pair for global containers: {container_name: list_nodes}.
If docker rm is issued through the Swarm API, remove all the containers and delete the key-value pair.
If docker rm is issued from a normal Docker client, remove the container on that node and update list_nodes.

If a node dies, there is no need to reschedule the container, since it will still be present on other nodes. Rescheduling of the global container can be prevented by consulting the map discussed above.

@aluzzardi aluzzardi assigned aluzzardi and unassigned aluzzardi Apr 10, 2015
@chanwit
Contributor

chanwit commented Apr 12, 2015

I'd love to go with the single entry approach.

We've already run into the problem of listing a massive number of containers (and images) on our 50-node cluster (via DockerUI). As we expect to have at least 200 nodes in the near future, Virtual IDs would be a direct benefit to us.

Aggregate similar containers and maybe have a display like:

NAME                         #CONTAINERS
swarm_vid/thousand_sunny      50

Then docker inspect swarm_vid/thousand_sunny could return the list of all nodes behind this VID.

@abronan
Contributor

abronan commented Apr 13, 2015

@chanwit We should allow both for convenience: a single-entry approach as a quick overview of the globally scheduled containers (swarm list global?), and the default multiple-entry approach through docker ps and --filter. Having the multiple-entry approach is mandatory if you want to diagnose which container failed amongst the 50 containers you were trying to schedule globally. We could do everything through a virtual ID and list things with inspect/logs/etc., but debugging would be a pain.

@tnachen
Contributor

tnachen commented Apr 13, 2015

I think having both makes sense, as the client needs to be able to remove the global schedule as well as restart certain instances in the cluster. I'm just not sure how, once we introduce "virtual" containers, we would translate all the Docker client APIs that target a single container to handle this special case. I'm also not sure what Swarm should return for docker inspect on such a container, as it won't have a PID, network, etc.

@denverdino
Contributor

+1 @abronan

I like the idea of using a label to specify and filter the containers.

@vbichov

vbichov commented Jun 2, 2015

There is an issue related to container naming.
Globally scheduled containers may run with "--name [container-name]"
All of these containers will have the same name - which is great - it's what you want.

Today, you can't do that, because Swarm abstracts the cluster as if it were a single host.
It would be useful to give the same name to containers on different hosts even without global scheduling, but the point is that if you have dependent containers that use "--volumes-from" on the command line, you'll have a problem.
If the name is not uniform, you will not know what to write in the run command.
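
A minimal illustration (the names and the myapp image are only placeholders):

docker run -d --name logdata -v /data busybox true
docker run -d --volumes-from logdata myapp    # only works if a container named logdata exists on the node this lands on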

@aluzzardi
Contributor Author

After giving it some extra thought, I think global scheduling is just a special case of scaling.

We run into the exact same issues if we wanted to scale a container to X instances (single entry, naming, ...).

I believe we should address the scaling issue at the same time; maybe there should be a higher-level concept (a container group or something like that).

@vbichov

vbichov commented Jun 2, 2015

@aluzzardi You don't necessarily schedule globally because you wish to scale.
Sometimes you do it because Swarm is presented as a single Docker host yet provides no solution for "--volumes-from" across "actual" hosts.
So you need identical instances of globally scheduled containers - not because you need more than one, but because you have no choice.

@aluzzardi
Contributor Author

@vbichov, indeed. However, in terms of design, implementation, and user experience, the two are essentially identical.

@vbichov

vbichov commented Aug 2, 2015

Is scaling a planned feature? Just select the top "-e constraint:scale=[number]" hosts from the list returned by the scheduler and run? (And apply the pigeonhole principle?)

@aluzzardi aluzzardi modified the milestones: 0.5.0, 0.4.0 Aug 5, 2015
@chanwit
Contributor

chanwit commented Aug 16, 2015

Maybe it's time to bring this up.

@aluzzardi I am trying to conceptualize how global scheduling would work.
With libkv, it would be something like the following:

Step 1.    $ docker run -d -e global.scheduling=true -p 80:80 nginx                             

Step 2.    +-----------+                                                                        
           |           |                                                                        
           |           |        +--------------------------------------------------------------+
           |   libkv   | <------+ global:true, image:0dabcdef (nginx), container_config:{...}  |
           |           |        +--------------------------------------------------------------+
           |           |                                                                        
           +-----------+                                                                        

Step 3.    Subscribe to newly added node event: => { apply filters, run container_config }

Step 4.    Run using the created container_config                                               

From this design, I have found that global scheduling requires a specific implementation that I cannot relate to scaling. Please correct me if I am wrong.

@abronan
Contributor

abronan commented Aug 17, 2015

I don't think it is necessary to use libkv for that. This would limit the usage of globally scheduled containers to users running a KV backend on the side (ie. step 2 is unnecessary).

We can just label containers with global==true and keep track of those based on their unique Virtual ID.

By the way, do you mean that a globally scheduled container should be automatically deployed when a new node is added to the cluster (to match the other machines running those containers)?

@chanwit
Contributor

chanwit commented Aug 18, 2015

> I don't think it is necessary to use libkv for that. This would limit the usage of globally scheduled containers to users running a KV backend on the side (ie. step 2 is unnecessary).

@abronan yep, this may be optional. But I still think it's necessary to share the configuration of globally scheduled containers, in case of a Swarm master failure.

> By the way, do you mean that a globally scheduled container should be automatically deployed when a new node is added to the cluster (to match the other machines running those containers)?

Yes, it's a normal use case for Big Data / Hadoop clusters. This is also mentioned by @aluzzardi above.

@rgbkrk

rgbkrk commented Aug 18, 2015

Oooh, this! I'm all about peek-a-boo services when nodes come online.

/cc @smashwilson

@abronan
Contributor

abronan commented Aug 18, 2015

@chanwit OK, my only concern with the auto-run on newly added nodes is that a Swarm can be divided into multiple sets/regions. Thus, a newly added node might be annotated with a label that says it belongs to another group (not necessarily running the same tasks/workloads). In this case, running the globally scheduled container automatically might not be the expected behavior from a user perspective.

As for the use of libkv, OK if it stays optional. In the case of a Manager failure, if the container has already been scheduled, the other Managers will see the new container through a refresh/event and detect that it is a globally scheduled container through the label. That is why I think the storage is not necessary. (Storage might be needed if we store data that is not handled by the remote Docker daemons, like higher-level abstractions or specific metadata that needs to be shared consistently, etc.)

@chanwit
Contributor

chanwit commented Aug 22, 2015

At the code level, it seems this kind of scheduler would act as a kind of super-scheduler.

  • It still requires filters
  • When selecting nodes, it returns a list of nodes. This is at least one obvious difference from the basic schedulers.
  • It maintains a list of container_configs in memory. When a new node is added, the container placement mechanism starts automatically. There will be some locking concerns, as this scheduler and a user issuing commands from the CLI may try to place containers at the same time.

@aluzzardi is it OK for me to start implementing this?

@abronan If @aluzzardi is OK with me taking care of a PR for this, I'll really need your input on it.

@schmunk42

I would also be very interested in an update on this topic.
I tried to implement my own scheduler based on the swarm-manager logs, but it was unreliable.
When nodes joined or left during rescheduling, I ended up with Dead containers; see #1421.

@aluzzardi aluzzardi modified the milestones: 1.0.0, 1.1.0 Nov 27, 2015
@megastef

+1. I hope this will land in Swarm. Have a look at what others do:

  1. Kubernetes DaemonSets - a pod running on each node
  2. Tutum's EVERY_NODE deployment strategy in stack files
  3. Mesos constraints: ["hostname", "UNIQUE"]
  4. Fleet Global Unit - On CoreOS I would use a global fleet unit, but what to do on other platforms?

Here is my shell script to deploy monitoring agents to each node using docker-machine, as a kind of workaround:
https://forums.docker.com/t/best-way-to-deploy-monitoring-containers-missing-deploy-strategy-every-node/5189
But I'm looking for a better, more generic solution that is not limited to tools like docker-machine or to operating-system-specific services like global fleet units on CoreOS.

@abronan
Contributor

abronan commented Dec 22, 2015

@megastef In the meantime you can also use a Compose file and scale to the number of nodes you have, making sure to add an anti-affinity rule so that no two of these containers run on the same machine :)

I agree that it would be convenient to have this directly in Swarm, but Compose is also a cool way to solve it on top of Swarm.

@megastef

@abronan Thank you! This seems to work and is better than the shell-script.

sematext-agent:
  image: 'sematext/sematext-agent-docker:latest'
  environment:
    - LOGSENE_TOKEN=3b549a2c-653a-4832-xxx
    - SPM_TOKEN=fe31fc3a-4660-47c6-xxx
    - affinity:container!=sematext-* 
  privileged: true
  restart: always
  volumes:
    - '/var/run/docker.sock:/var/run/docker.sock'

And these commands:

eval $(docker-machine env swarm-master --swarm)
docker-compose up -d 
# scale is == num nodes
docker-compose scale sematext-agent=$(docker-machine ls | grep swarm | grep Running | wc -l)

But it needs to be re-run when the number of nodes changes, so I'm still looking forward to the new global scheduling feature ;)
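
A crude way to automate that, reusing the same commands as above, would be something like:

while true; do
  nodes=$(docker-machine ls | grep swarm | grep Running | wc -l)
  docker-compose scale sematext-agent=$nodes
  sleep 300   # re-check the node count every five minutes
done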

@schmunk42

A problem I noticed with affinity is that you run into trouble if you have different stacks with the same container names; there should also be a way to specify or limit the affinity scope to the current docker-compose stack.

@dnephin
Contributor

dnephin commented Dec 26, 2015

I think that should be handled by compose by using a unique project name. That way you'll never have duplicate container names, even if you have the same service name in different projects.

@schmunk42

@dnephin Basically I agree; the problem is that you do not know the project name inside the docker-compose.yml. Or is COMPOSE_PROJECT_NAME populated now, even if it is not explicitly set?
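
For what it's worth, the project name can always be pinned explicitly (mystack is just an example name), which also makes the container-name prefix predictable:

COMPOSE_PROJECT_NAME=mystack docker-compose up -d
# or equivalently
docker-compose -p mystack up -d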

@schmunk42

Found the related issue :) docker/compose#2294

@BrianAdams

+1

@leecalcote

Given global services as a new capability in 1.12, what remains to be implemented here? Eliminating naming conflicts between an agent's existing containers/services with the same name?

@dongluochen
Contributor

Given the global service implementation in Docker 1.12, there is not much left for Swarm to do here.

Container name conflicts are not part of global services.
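
For reference, the Docker 1.12 equivalent is the global service mode, e.g. (the image and service name are only examples):

docker service create --mode global --name web nginx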

@megastef

@leecalcote I will test the Swarm global service with the Sematext Docker agent. Thanks for the hint!
