Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update docker client version to 18.06.3 #4430

Merged
merged 16 commits into from
May 15, 2019

Conversation

style95
Copy link
Member

@style95 style95 commented Apr 5, 2019

Description

With this change, it is required to install Docker=18.06.3 on target servers.
This is a breaking change.

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95
Copy link
Member Author

style95 commented Apr 5, 2019

Todo:

  • Update controller docker cli version there is no docker cli in the controller There "is" docker cli for lean deployment.
  • performance comparison
  • Update documents accordingly
  • Update Travis build configuration accordingly

@style95
Copy link
Member Author

style95 commented Apr 11, 2019

In the latest docker, runc does not show any running containers which are created by docker.
So container is not paused by runc and just removed upon error(container not found).

so I am looking into the way to utilize runc.

@style95
Copy link
Member Author

style95 commented Apr 11, 2019

The issue was previously container information was stored in /run/runc but now it is stored in /run/docker/runtime-runc/moby.

After changing mounting directory, it seems working as expected.

@style95
Copy link
Member Author

style95 commented Apr 12, 2019

I have discussed this issue with @sven-lange-last.
Let me share the results here.

  1. Generally, it takes some times to upgrade docker version for many reasons.
    So it would require some time for downstream to use the latest docker version in their production.

  2. But at the same time CVE should be properly handled.
    This CVE, https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-5736 requires docker >=18.06.2

So if we upgrade the docker version to 18.06.x rather than 18.09, it would give more time to downstream at the same time resolve the CVE.

@style95
Copy link
Member Author

style95 commented Apr 12, 2019

I did benchmarking to comparison any performance issue in docker-18.06.3.
It seems there is no big performance difference.

  • 17.05: 12,500 TPS
  • 18.06: 12,800 TPS

For more information, you can refer to this.
https://cwiki.apache.org/confluence/display/OPENWHISK/Docker+version+comparison

I think it would be good to go with docker-18.06.3

@style95 style95 force-pushed the update-docker-version branch from 7c8cab1 to 19f03eb Compare April 12, 2019 11:39
@style95 style95 changed the title [WIP] Update docker client version to 18.09.4 Update docker client version to 18.06.3 Apr 12, 2019
Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I think there's a readme/doc we have to update also regarding the docker version.

Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@style95 style95 changed the title Update docker client version to 18.06.3 [WIP]Update docker client version to 18.06.3 Apr 12, 2019
Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a similar change in the core/controller/Dockerfile as well for the "lean" deployment.

@rabbah rabbah added deployment docker awaits-contributor The contributor needs to respond to comments from reviewer. labels Apr 12, 2019
@style95
Copy link
Member Author

style95 commented Apr 12, 2019

@rabbah
I readded WIP in the title.
Actually there are still few things to handle: #4430 (comment)

I will cover what you mentioned as well

@style95 style95 changed the title [WIP]Update docker client version to 18.06.3 Update docker client version to 18.06.3 Apr 15, 2019
@markusthoemmes
Copy link
Contributor

So it would require some time for downstream to use the latest docker version in their production.

We should really try to decouple from this. Sure, downstream will always be slower than upstream but we should not need to hold back on innovation and advancement because of that fact.

Is there any possibility for us to have different versions of DockerClient that flatten out the differences and make it a deployment/configuration concern which one is used? We could have a deprecation period for older clients to clean up after some time.

@style95
Copy link
Member Author

style95 commented Apr 15, 2019

I think we need to differentiate the build step of the invoker Dockerfile to achieve that.
It is mainly related to the docker version inside the invoker.

Regarding the downstream version, I will defer to @sven-lange-last as he wanted to make sure this PR does not break IBM production system.

As the binary name of runc is changed from docker-runc to runc in docker-18.09, if we decide to upgrade to 18.09, that should be addressed as well.

@rabbah
Copy link
Member

rabbah commented Apr 15, 2019

I like the suggestion from Markus. Did I understand correctly that the conditioning is relatively easier for ansible than the dockerfile?

@markusthoemmes
Copy link
Contributor

The Dockerfile only ships docker binaries, right? How much overhead would it be to have multiple versions inside of the Dockerfile that we simply pick through configuration?

@style95
Copy link
Member Author

style95 commented Apr 15, 2019

@markusthoemmes
With the assumption that we will move forward with 18.09 and just including 18.06 for the deprecation period, docker-18.06 would add about 43MB(36 + 7.2 MB) of overhead.

$ ll -h docker-18.06
total 141M
drwxrwxr-x 2 style95 style95 4.0K Feb 20 11:26 ./
drwxr-xr-x 4 style95 style95 4.0K Apr 15 19:03 ../
-rwxr-xr-x 1 style95 style95  36M Feb 20 11:26 docker*
-rwxr-xr-x 1 style95 style95  26M Feb 20 11:26 docker-containerd*
-rwxr-xr-x 1 style95 style95  15M Feb 20 11:26 docker-containerd-ctr*
-rwxr-xr-x 1 style95 style95 4.0M Feb 20 11:26 docker-containerd-shim*
-rwxr-xr-x 1 style95 style95  51M Feb 20 11:26 dockerd*
-rwxr-xr-x 1 style95 style95 747K Feb 20 11:26 docker-init*
-rwxr-xr-x 1 style95 style95 2.8M Feb 20 11:26 docker-proxy*
-rwxr-xr-x 1 style95 style95 7.2M Feb 20 11:26 docker-runc*

@rabbah
Let me clarify the issues.
There are the following changes.

Docker 1.12 -> 18.06 or higher

runc no more use default(/run/runc) directory as a root directory.
It uses /run/docker/runtime-runc/moby.

Docker 18.06 -> 18.09

The name of runc is changed from docker-runc to runc.

IMHO, there are few options to have multiple versions of docker, but I think it would be easier to configure the version at the deployment time.

For example, we can have both versions of docker binaries in the invoker as @markusthoemmes suggested.
It will have two binaries, docker-runc and just runc.
Based on the configuration, DockerClient(RuncClient) will selectively look for one of them.
(Currently, it is hardcoded to /usr/bin/docker-runc)

Keeping docker-1.12

Besides of the above things, If we want to keep docker-1.12, we need to differentiate the ansible script based on the version flag as well.
The mounting directory for runc will vary between /run/runc and /run/docker/runtime-runc/moby.

@markusthoemmes
Copy link
Contributor

I see, thanks for the writeup. It looks like moving to 18.06 introduces the least friction for now and makes all parties kinda happy?

If so, it's reasonable to make that move first and think about unblocking future upgrade later.

@style95 style95 closed this Apr 16, 2019
@style95 style95 reopened this Apr 16, 2019
@style95 style95 closed this Apr 19, 2019
@style95 style95 reopened this Apr 19, 2019
@sven-lange-last
Copy link
Member

sven-lange-last commented May 7, 2019

@style95 thanks a lot for providing this PR - very thorough work. Particularly, the performance tests are helpful for deciding whether the update to a newer Docker version is a feasible.

Some thoughts around the transient data directory docker-runc is using for containers:

  • When Docker 18.06 uses docker-runc, directory /run/docker/runtime-runc/moby is used as base directory.
  • docker-runc itself has different default directories that are used when not specifying the --root option. These are /run/runc for root (uid 0) and /run/user/<uid>/runc for non-root users.
  • Apparently, all uses of docker-runc need to make sure that they use the same data directory as Docker - otherwise docker-runc won't be able to work on the containers created by Docker before.
  • In ansible/roles/invoker/tasks/clean.yml, you explicitly use the --root option to make sure the proper directory is used.
  • You have no changes in the Scala RuncClient used by the invoker. (BTW - the comment is wrong.)
    https://github.com/apache/incubator-openwhisk/blob/c68fc94212009d3a5cd98d6ea7e06482272c487b/core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/docker/RuncClient.scala#L56-L64
  • The Scala RuncClient works because you always mount the (configurable) transient data directory location to /run/runc in file ansible/roles/invoker/tasks/deploy.yml and docker-runc is run by root user.
  • What happens if the invoker process is run by a non-root user? Won't docker-runc invoked by Scala RuncClient then fall back to the default transient data directory, i.e. /run/user/<uid>/runc?
  • I do not know the details of lean setup - but what happens in other setup variants? Do we always make sure that the proper transient data directory location is available at /run/runc when running Scala RuncClient? Which user runs the process?
  • Can we make things more robust by also using the --root /run/runc option in Scala RuncClient?

@sven-lange-last
Copy link
Member

@style95 concerning the failing Jenkins test:

Jenkin machine is using Docker-17.05 so it uses /run/runc as a root directory.
But this PR will use /run/docker/runc-runtime/moby as it is used in Docker-18.06.
So ConcurrencyTests are failing because it could not find the container.

Wouldn't it be possible to work around this problem by setting invoker_runcdir: "/run/runc" in the Ansible environment for Jenkins?


Seq(docker, "--host", host + ":" + dockerPort, cmd, component)
Seq(docker, cmd, component)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the test supposed to work if there is more than one controller running and the controllers run on different systems?

The original test code determines which host is running the first controller and would ask the remote host's Docker daemon to start, stop and restart the controller container. With the change, the test permanently fails in our test environment that uses multiple systems in parallel.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to completely disable remote Docker API from the code as we don't use it anymore in our normal use cases.
It seems remote docker API is only required for test cases.

We can selectively enable/disable remote API based on the deployment type such as
production or unit tests.
But it depends on the OW operator, there is still a possibility that Docker remote API is enabled without consciousness. And it could induce a security breach.

How about making this test rely on ssh command instead?
Then what we need is just to make ssh connection available between target hosts and machine which runs tests.
(Generally, tests are executed in build machines such as jenkins, this requirement would be highly likely already met.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually opt for keeping the current approach with running Docker commands against remote systems. From a security perspective, it's a matter of how you run your Docker daemon. In test environments, you can open a TCP listening socket to accept connections from remote systems - and on production systems, you can just disable the TCP listening socket. The rest is how you configure openwhisk.

Today, the DockerClientWithFileAccess() is created with an empty host parameter anyway such that the local Docker daemon BSD stream socket is used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense.
I just wanted to clean up this configuration as well: https://github.com/apache/incubator-openwhisk/pull/4430/files#diff-85b6ff9e51a5005c641ac4690b975b2cL55
But that's also a viable option.
I will work on it.

@style95
Copy link
Member Author

style95 commented May 8, 2019

@sven-lange-last

Wouldn't it be possible to work around this problem by setting invoker_runcdir: "/run/runc" in the Ansible environment for Jenkins?

Thank you for the suggestion, I will try it.

@style95
Copy link
Member Author

style95 commented May 9, 2019

@sven-lange-last It seems introducing invoker_runcdir with Docker 17.05 does not work.
This is because two different versions of runc are not compatible with each other.

[2019-05-09T10:04:50.544Z] [ERROR] [#tid_sid_invokerNanny] [RuncClient] info: command was unsuccessful, code: 1 (unsuccessful), stdout: , stderr: json: cannot unmarshal string into Go struct field State.init_process_start of type uint64 [marker:invoker_runc.pause_error:1322355:73]

Yeah, that's because you have two runc versions that are sharing incompatible state. That's not really a supported configuration (though we should be more friendly about not bailing when an error occurs).

opencontainers/runc#1592 (comment)

@style95 style95 force-pushed the update-docker-version branch 2 times, most recently from 2265246 to bfa6fa9 Compare May 9, 2019 11:40
@style95
Copy link
Member Author

style95 commented May 10, 2019

Let me share the discussion with @sven-lange-last on Slack for history.

He suggested using invoker_use_runc: false in Jenkins CI environments and I applied it.

We discussed to add --root option in RuncClient as well, and it seems it's not required as of now because invoker is running as a root user.
So RuncClient client will always have access to /run/runc dir.
Since it can be broken at any time if the invoker runs as a non-root user, he also suggested leaving comments about its implications and it is also applied accordingly.

Now let's see what happens in Jenkins CI.

@sven-lange-last
Copy link
Member

sven-lange-last commented May 10, 2019

@style95 thanks a lot for the valuable discussion and your recent updates. Jenkins is working now.

We still have two different problems in Travis:

  • The code scanner complains about whitespace.
  • services.KafkaConnectorTests > Kafka connector should send and receive a kafka message even after shutdown one of instances fails because a remote Docker daemon cannot be connected.

The problem with services.KafkaConnectorTests is in the last line of this code snippet:

https://github.com/style95/openwhisk/blob/faf765cd9d3fedd3139298a19f911247f127c91a/tests/src/test/scala/services/KafkaConnectorTests.scala#L146-L152

  if (kafkaHosts.length > 1) {
    it should "send and receive a kafka message even after shutdown one of instances" in {
      kafkaHosts.indices.foreach { i =>
        val message = createMessage()
        val kafkaHost = kafkaHosts(i).split(":")(0)
        val startLog = s", started"
        val prevCount = startLog.r.findAllMatchIn(commandComponent(kafkaHost, "logs", s"kafka$i").stdout).length

@style95 style95 force-pushed the update-docker-version branch from a377698 to 3f22d34 Compare May 13, 2019 04:53
@sven-lange-last
Copy link
Member

sven-lange-last commented May 13, 2019

PG 4 / 2840 completed successfully. No failures.

@style95
Copy link
Member Author

style95 commented May 13, 2019

Finally!
This is ready to merge! 😄

@@ -36,3 +36,4 @@ runtimes_enable_concurrency: true
limit_action_concurrency_max: 500

invoker1_machine: openwhisk-vm3-he-de
invoker_use_runc: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@houshengbo do you have plans to update Jenkins VMs to a higher Docker version once this PR is merged? This PR will update Docker to a higher version on the CLI side so that we needed to switch off runc usage in the invoker...

Copy link
Member

@sven-lange-last sven-lange-last left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for addressing all review feedback. Looks good to me now.

What do you think about giving a heads-up on the dev mailing list (again)? When merging this PR, a high Docker version is expected in development, test and production environments. So I suggest to ask whether somebody anticipates a general problem and to announce that this PR gets merged in 48 hours if nobody objects. Does that approach make sense to you?

@style95
Copy link
Member Author

style95 commented May 13, 2019

@sven-lange-last Got it.
I would open a confirmation thread (based on the silent assent) in dev list.

tools/macos/README.md Outdated Show resolved Hide resolved
tools/macos/docker-machine/README.md Outdated Show resolved Hide resolved
@style95 style95 force-pushed the update-docker-version branch from 81bea87 to 31603d2 Compare May 14, 2019 02:25
@codecov-io
Copy link

Codecov Report

Merging #4430 into master will decrease coverage by 3.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4430      +/-   ##
==========================================
- Coverage   83.97%   80.91%   -3.07%     
==========================================
  Files         170      170              
  Lines        7940     7943       +3     
  Branches      536      539       +3     
==========================================
- Hits         6668     6427     -241     
- Misses       1272     1516     +244
Impacted Files Coverage Δ
...pache/openwhisk/core/invoker/InvokerReactive.scala 80.86% <100%> (+16.58%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.46%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.67%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-84.62%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 4% <0%> (-52%) ⬇️
...in/scala/org/apache/openwhisk/common/Counter.scala 40% <0%> (-20%) ⬇️
...penwhisk/core/database/cosmosdb/CosmosDBUtil.scala 81.81% <0%> (-15.16%) ⬇️
...nwhisk/core/database/cosmosdb/CosmosDBConfig.scala 94.11% <0%> (-5.89%) ⬇️
...in/scala/org/apache/openwhisk/common/Logging.scala 69.75% <0%> (-0.62%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4740d40...31603d2. Read the comment docs.

@rabbah rabbah merged commit d353d26 into apache:master May 15, 2019
@style95 style95 mentioned this pull request May 17, 2019
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
* Update docker version in the controller and guides
* Update travis configuration to install docker=18.06.3
* Separate docker installation from the travis setup script
* Add docker setup script
* Add CoordinatedShutdown to cleanup runtime containers
* add a configuration for root runc dir
* Disable runc use in Jenkins environment
* Add comments which explain the correlation among the version docker and runc and the type of user
* Reenable Docker remote API again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaits-contributor The contributor needs to respond to comments from reviewer. deployment docker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants