failed to get network during CreateEndpoint #888

Closed
2 of 3 tasks
mojodevops opened this issue Jan 6, 2020 · 12 comments · Fixed by moby/libnetwork#2554 · May be fixed by moby/moby#41011

@mojodevops

mojodevops commented Jan 6, 2020

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

This may be similar to:
moby/moby#35288
moby/libnetwork#2341
moby/libnetwork#2015

Expected behavior

docker-compose -f zk.yml restart should restart the container successfully.

Actual behavior

$ docker-compose -f zk.yml restart
Restarting 0_zookeeper_1 ... error

ERROR: for 0_zookeeper_1  Cannot restart container a6944380cb96aa82ee6508cffbb3487e1698b9aaf2ef1a6ff62c3db220292ed8: failed to get network during CreateEndpoint: network zreirydw66jtrf0z1kjx8lnco not found

# execute again: it works

# execute again: it fails

# execute again: it works

# execute again: it fails

Steps to reproduce the behavior

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        6a30dfc
 Built:             Thu Aug 29 05:29:11 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:27:45 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 13
 Server Version: 19.03.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: m3hsambx486sk19fy1wod0cqj
  Is Manager: true
  ClusterID: o6dplgv86jcrdrbxzitl32yc7
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 127.0.0.1
  Manager Addresses:
   127.0.0.1:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.0.0-37-generic
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 19.47GiB
 Name: ding
 ID: 5YGV:TQ6C:ZGP7:UE43:CFOK:GHP5:6X2K:RB7U:QBN7:J3IH:2RIQ:6FNV
 Docker Root Dir: /data/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: API is accessible on http://127.0.0.1:1433 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support

Output of docker network ls:

NETWORK ID          NAME                DRIVER              SCOPE
a0c5dde1871d        bridge              bridge              local
zreirydw66jt        dev                 overlay             swarm
c7871d5f9166        develop             bridge              local
8244a98af18a        docker_gwbridge     bridge              local
6e4c766edd81        host                host                local
xqqui9bmoyx5        ingress             overlay             swarm
e9fa17dbf86a        none                null                local

Additional environment details (AWS, VirtualBox, physical, etc.)

zk.yml

version: '3.6'

services:
  zookeeper:
    image: zookeeper
    volumes: 
      - /etc/localtime:/etc/localtime:ro
      - /etc/localtime:/etc/timezone:ro
      - /data/demo/zookeeper/conf:/conf
      - /data/demo/zookeeper/data:/data
      - /data/demo/zookeeper/datalog:/datalog
      - /data/logs/zookeeper:/logs
    hostname: zookeeper
    ports:
      - 2181:2181
    networks: ['dev']
    restart: always
    environment:
      - TERM=linux
      - ZOO_LOG_DIR=/logs/
      - ZOO_LOG4J_PROP=INFO,ROLLINGFILE
      - LANG=en_US.UTF-8
      - LC_ALL=en_US.UTF-8
      - JVMFLAGS=-Xmx1024M

networks: 
  dev:
    external: true
    name: dev
    driver: overlay

# create the network like this
$ docker network create --driver overlay --attachable --subnet 172.30.100.0/24 dev

# start container like this
$ docker-compose -f zk.yml up -d

# restart container like this
$ docker-compose -f zk.yml restart

There are no other containers on this host.

Can anyone help?

@mojodevops
Author

Once the container has been created, the same thing happens with the plain docker command:

# worked
$ docker restart 0_zookeeper_1
0_zookeeper_1

# failed
$ docker restart 0_zookeeper_1
Error response from daemon: Cannot restart container 0_zookeeper_1: failed to get network during CreateEndpoint: network zreirydw66jtrf0z1kjx8lnco not found

# worked
$ docker restart 0_zookeeper_1
0_zookeeper_1

# failed
$ docker restart 0_zookeeper_1
Error response from daemon: Cannot restart container 0_zookeeper_1: failed to get network during CreateEndpoint: network zreirydw66jtrf0z1kjx8lnco not found
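
The alternating pattern above can be quantified with a small loop. This is only a sketch: the function name count_restart_failures is made up for illustration, and it assumes nothing beyond the docker restart command already shown in this report.

```shell
#!/bin/sh
# count_restart_failures CONTAINER ATTEMPTS
# Hypothetical helper: restarts CONTAINER the given number of times and
# prints how many of those restarts returned a non-zero exit status.
count_restart_failures() {
    crf_container=$1
    crf_attempts=$2
    crf_i=0
    crf_failed=0
    while [ "$crf_i" -lt "$crf_attempts" ]; do
        # Discard docker's output; only the exit status matters here.
        if ! docker restart "$crf_container" >/dev/null 2>&1; then
            crf_failed=$((crf_failed + 1))
        fi
        crf_i=$((crf_i + 1))
    done
    echo "$crf_failed"
}

# Example, using the container name from this report:
#   count_restart_failures 0_zookeeper_1 10
```

If the behaviour matches what is shown above, roughly every second restart would be counted as a failure.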

@arkodg

arkodg commented Jan 9, 2020

@dingzhengkai thanks for raising this issue. It seems to be related to the docker container restart call, specifically for containers attached to attachable overlay networks.
I can consistently reproduce this issue by running:

docker network create --driver overlay --attachable --subnet 172.40.200.0/24 fooNet
docker run -d --name foo --net fooNet nginx
docker container restart foo

I'm not too familiar with this code flow, but I narrowed the race down to this line by adding delays (time.Sleep) before and after it: https://github.com/moby/moby/blob/master/daemon/container_operations.go#L410. Adding the sleep before this line solves the issue :)

Any hints, @cpuguy83?

@cpuguy83
Collaborator

There seems to be a race between detach and attach. I'm not even sure how the literal detach from the node is happening.

The for loops seem pretty useless here.

@cpuguy83
Collaborator

libnetwork just seems all wrong:

https://github.com/docker/libnetwork/blob/f5e0618b985702a6d517f728865c5ec660a03418/store.go#L232-L251

When deleting an object from the kv store it does an atomic operation, but if that fails it just retries after fetching the latest version. It doesn't check what has changed; it just goes ahead with the delete.
That's the error that shows up in the logs.

@mojodevops
Author

Can someone fix it? Thanks.

@mojodevops
Author

@thaJeztah, @tiborvass can you help with this?

@nylocx

nylocx commented Mar 25, 2020

A fix would be highly appreciated. Until now I had only seen this in my PyCharm dev environment, where having to start twice to get the container running was merely inconvenient, but now I'm also hitting it in some CI/CD chains, where it is not that easy to work around.
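
For pipelines where "start twice" has to happen automatically, a retry wrapper is one option. The sketch below is an illustration only: retry_cmd is a made-up name, and the attempt count and delay in the example are arbitrary assumptions, not values recommended anywhere in this thread.

```shell
#!/bin/sh
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_cmd() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}

# Example, using the container name from the original report:
#   retry_cmd 3 2 docker restart 0_zookeeper_1
```

This only papers over the race; it does not remove it.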

@italodeveloper

I can't believe that software of mine with more than 5 million pageviews per month is stuck on exactly this! Does anyone have a solution?

@arkodg

arkodg commented Mar 27, 2020

@cpuguy83 I don't think this is related to the KV store logic

I took another look at the logs, and it might be related to the asynchronous way in which we create and delete the lb- service. If daemon.clusterProvider.AttachNetwork is called after the LB service has been completely removed, we don't see this issue.

DEBU[2020-03-27T20:19:04.193770400Z] (*worker).reconcileTaskState                  len(removedTasks)=1 len(updatedTasks)=1 module=node/agent node.id=tjfm4v5pavnel6xcmegczqvjf
DEBU[2020-03-27T20:19:04.193874300Z] assigned                                      module=node/agent node.id=tjfm4v5pavnel6xcmegczqvjf task.desiredstate=RUNNING task.id=rbdfnbvnuiibe3sb3l8682ae0
DEBU[2020-03-27T20:19:04.194005700Z] state changed                                 module=node/agent node.id=tjfm4v5pavnel6xcmegczqvjf service.id= state.desired=RUNNING state.transition="ASSIGNED->ACCEPTED" task.id=rbdfnbvnuiibe3sb3l8682ae0
DEBU[2020-03-27T20:19:04.194500400Z] state changed                                 module=node/agent/taskmanager node.id=tjfm4v5pavnel6xcmegczqvjf service.id= state.desired=RUNNING state.transition="ASSIGNED->ACCEPTED" task.id=rbdfnbvnuiibe3sb3l8682ae0
DEBU[2020-03-27T20:19:04.194460200Z] (*Agent).UpdateTaskStatus                     module=node/agent node.id=tjfm4v5pavnel6xcmegczqvjf task.id=rbdfnbvnuiibe3sb3l8682ae0
DEBU[2020-03-27T20:19:04.195990600Z] DisableService lb-fooNet START               
DEBU[2020-03-27T20:19:04.196061300Z] DisableService lb-fooNet DONE

@blundey

blundey commented May 6, 2020

Just want to add that I have been having this issue for close to a year. docker restart does not work, but a docker stop followed by a docker start succeeds. I've read many posts that touch on the subject, and this one falls in line with our findings on the race condition. How do we go about getting this escalated to the right person and getting a fix into the master branch?
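
As a sketch of that stop-then-start sequence (the function name restart_via_stop_start is made up for illustration; docker stop and docker start are the only commands assumed):

```shell
#!/bin/sh
# restart_via_stop_start CONTAINER
# Hypothetical helper: replaces a single 'docker restart' with an explicit
# stop followed by a start. The start only runs if the stop succeeded.
restart_via_stop_start() {
    docker stop "$1" && docker start "$1"
}

# Example, using the container name from the original report:
#   restart_via_stop_start 0_zookeeper_1
```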

@xinfengliu

I have a few customers with the same issue (they use docker stack deploy, though). It breaks their CI/CD pipelines.

I enabled debug mode on the Docker engine and found that it is a race condition. For example, the events happened in the following order:

  • Swarm task A was asked to shut down. It was the last container on the network at that time, so it removed the network on the node.
  • Task B was then asked to start. It created the network on the node, since it was the first container on the network on the node at that time.
  • Another task, C, was in the READY state but was asked to shut down at this point. It removed the network on the node again.
  • Task B attempted to create its network endpoint and found that the network did not exist, so the error occurred.

I think there's an issue in the swarmkit-related code: Shutdown() in daemon/cluster/executor/container/controller.go should not remove the network if the container has never been started (e.g. task C in the example above).
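
The ordering above can be replayed with a toy model. This is a deliberately simplified sketch, not the daemon's logic: every function and variable name is hypothetical, a single flag stands in for the network existing on the node, and the guarded variant only illustrates the suggestion in this comment, not the fix that was eventually merged.

```shell
#!/bin/sh
# Toy model: network_present=1 means the overlay network exists on the node.
network_present=0

# First container of the network on the node creates the network.
task_start() { network_present=1; }

# Unconditional shutdown: always removes the network.
task_shutdown() { network_present=0; }

# Guarded shutdown: removes the network only if the task had been started.
task_shutdown_guarded() {
    if [ "$1" = started ]; then
        network_present=0
    fi
}

create_endpoint() {
    if [ "$network_present" -eq 1 ]; then
        echo "endpoint created"
    else
        echo "failed to get network during CreateEndpoint"
        return 1
    fi
}

# run_timeline SHUTDOWN_FUNCTION_FOR_TASK_C
run_timeline() {
    network_present=1      # task A is running, so the network exists
    task_shutdown          # task A shuts down and removes the network
    task_start             # task B starts and creates the network
    "$1" never_started     # task C (READY, never started) is shut down
    create_endpoint        # task B tries to create its endpoint
}

# Example:
#   run_timeline task_shutdown          # reproduces the error
#   run_timeline task_shutdown_guarded  # endpoint creation succeeds
```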

@mightydok

We use a workaround for this issue: we create the network in global mode, so there are no errors after CI/CD runs.

thaJeztah added a commit to thaJeztah/docker that referenced this issue Jul 8, 2020
full diff: moby/libnetwork@2e24aed...9e99af2

- moby/libnetwork#2548 Add docker interfaces to firewalld docker zone
    - fixes docker/for-linux#957 DNS Not Resolving under Network [CentOS8]
    - fixes moby/libnetwork#2496 Port Forwarding does not work on RHEL 8 with Firewalld running with FirewallBackend=nftables
- store.getNetworksFromStore() remove unused error return
- moby/libnetwork#2554 Fix 'failed to get network during CreateEndpoint'
    - fixes/addresses docker/for-linux#888 failed to get network during CreateEndpoint
- moby/libnetwork#2558 [master] bridge: disable IPv6 router advertisements
- moby/libnetwork#2563 log error instead if disabling IPv6 router advertisement failed
    - fixes docker/for-linux#1033 Shouldn't be fatal: Unable to disable IPv6 router advertisement: open /proc/sys/net/ipv6/conf/docker0/accept_ra: read-only file system

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Jul 13, 2020
full diff: moby/libnetwork@2e24aed...9e99af2

- moby/libnetwork#2548 Add docker interfaces to firewalld docker zone
    - fixes docker/for-linux#957 DNS Not Resolving under Network [CentOS8]
    - fixes moby/libnetwork#2496 Port Forwarding does not work on RHEL 8 with Firewalld running with FirewallBackend=nftables
- store.getNetworksFromStore() remove unused error return
- moby/libnetwork#2554 Fix 'failed to get network during CreateEndpoint'
    - fixes/addresses docker/for-linux#888 failed to get network during CreateEndpoint
- moby/libnetwork#2558 [master] bridge: disable IPv6 router advertisements
- moby/libnetwork#2563 log error instead if disabling IPv6 router advertisement failed
    - fixes docker/for-linux#1033 Shouldn't be fatal: Unable to disable IPv6 router advertisement: open /proc/sys/net/ipv6/conf/docker0/accept_ra: read-only file system

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 219e7e7ddcf5f0314578d2a517fc0832f03622c1
Component: engine