Skip to content
This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

Closed
ashahi1 opened this issue Sep 5, 2017 · 5 comments
Assignees

Comments

@ashahi1
Copy link
Contributor

ashahi1 commented Sep 5, 2017

CI: https://ci.vmware.run/vmware/docker-volume-vsphere/1569

Plugin and driver logs have been attached at the end.

Test log:

2017/09/05 18:56:35 vDVS is not running on VM: 192.168.31.85
2017/09/05 18:56:35 vDVS is not running on VM [Ubuntu.16.10]. Please check if the previous VM IP [192.168.31.85] has changed.
2017/09/05 18:56:35 Destroying volume [swarm_test_volume_194316]

----------------------------------------------------------------------
FAIL: swarm_test.go:73: SwarmTestSuite.TearDownTest

swarm_test.go:76:
    c.Assert(err, IsNil, Commentf(out))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4201ad8a0), Stderr:[]uint8(nil)} ("exit status 1")
... Error response from daemon: unable to remove volume: remove swarm_test_volume_194316: VolumeDriver.Remove: Failed to remove volume swarm_test_volume_194316, in use by VM = Ubuntu.14.04.


----------------------------------------------------------------------
PANIC: swarm_test.go:96: SwarmTestSuite.TestFailoverAcrossSwarmNodes

swarm_test.go:164:
    c.Assert(isVDVSRunning, Equals, true, Commentf("vDVS is not running after VM [%s] being restarted", hostName))
... obtained bool = false
... expected bool = true
... vDVS is not running after VM [Ubuntu.16.10] being restarted

Logs:
docker-volume-vsphere.txt
vmdk_ops.txt

@ashahi1
Copy link
Contributor Author

ashahi1 commented Sep 5, 2017

@shaominchen Can you please take a look?

@shaominchen
Copy link
Contributor

Anup and I debugged this issue. After VM was powered off and then powered on, it took about 2.5 minutes for the vDVS plugin to get up and running. Here are the plugin logs:

2017-09-05 21:50:16.821252145 +0000 UTC [WARNING] Received signal signal=terminated 
2017-09-05 21:52:34.103082066 +0000 UTC [INFO] No config file found. Using defaults.
2017-09-05 21:52:34.146583933 +0000 UTC [INFO] Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf" 
2017-09-05 21:52:34.146708017 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:36.147061936 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:36.147160647 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded).
2017-09-05 21:52:36.147180745 +0000 UTC [INFO] Docker VMDK plugin started version="vSphere Volume Driver v0.5" port=1019 mock_esx=false 
2017-09-05 21:52:36.148000293 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock" 
2017-09-05 21:52:36.148524901 +0000 UTC [INFO] Scheduling again after 2 seconds
2017-09-05 21:52:38.154959097 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:40.156367475 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:40.156489255 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded). Attempts left: 20 
2017-09-05 21:52:40.156563916 +0000 UTC [INFO] Scheduling again after 4 seconds
2017-09-05 21:52:44.156884722 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:46.157409448 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:46.157566438 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded). Attempts left: 19 
2017-09-05 21:52:46.157604926 +0000 UTC [INFO] Scheduling again after 8 seconds
2017-09-05 21:52:54.157808463 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:54.182122921 +0000 UTC [INFO] Found 0 running or paused containers
2017-09-05 21:52:54.817017907 +0000 UTC [INFO] Discovered 0 volumes that may be in use.
2017-09-05 21:52:54.817117631 +0000 UTC [INFO] Refcounting successfully completed

From above logs, we can see that the plugin was terminated at 21:50:16, and then was fully back and running at 21:52:54. From e2e test logs, we can see that we stopped polling at 21:52:25 before the plugin is back running at 21:52:54:

2017/09/05 21:50:30 Powering on VM [Ubuntu.16.10]
2017/09/05 21:50:31 Sleep for 5 seconds
2017/09/05 21:50:36 Verifying if vDVS is running on vm: 192.168.31.85
2017/09/05 21:50:36 Sleep for 3 seconds
2017/09/05 21:50:57 Sleep for 3 seconds
2017/09/05 21:51:01 Sleep for 3 seconds
2017/09/05 21:51:04 Sleep for 3 seconds
2017/09/05 21:51:07 Sleep for 3 seconds
2017/09/05 21:51:10 Sleep for 3 seconds
2017/09/05 21:51:13 Sleep for 3 seconds
2017/09/05 21:51:16 Sleep for 3 seconds
2017/09/05 21:51:19 Sleep for 3 seconds
2017/09/05 21:51:22 Sleep for 3 seconds
2017/09/05 21:51:25 Sleep for 3 seconds
2017/09/05 21:51:28 Sleep for 3 seconds
2017/09/05 21:51:31 Sleep for 3 seconds
2017/09/05 21:51:34 Sleep for 3 seconds
2017/09/05 21:51:37 Sleep for 3 seconds
2017/09/05 21:51:40 Sleep for 3 seconds
2017/09/05 21:51:43 Sleep for 3 seconds
2017/09/05 21:51:46 Sleep for 3 seconds
2017/09/05 21:51:49 Sleep for 3 seconds
2017/09/05 21:51:52 Sleep for 3 seconds
2017/09/05 21:51:55 Sleep for 3 seconds
2017/09/05 21:51:58 Sleep for 3 seconds
2017/09/05 21:52:01 Sleep for 3 seconds
2017/09/05 21:52:04 Sleep for 3 seconds
2017/09/05 21:52:07 Sleep for 3 seconds
2017/09/05 21:52:10 Sleep for 3 seconds
2017/09/05 21:52:13 Sleep for 3 seconds
2017/09/05 21:52:16 Sleep for 3 seconds
2017/09/05 21:52:19 Sleep for 3 seconds
2017/09/05 21:52:22 Sleep for 3 seconds
2017/09/05 21:52:25 vDVS is not running on VM: 192.168.31.85

So, this is not a test code issue. It's just that the whole process, including VM PowerOff/PowerOn, Docker bootstrapping, vDVS plugin initialization, takes too long to complete.

@shaominchen
Copy link
Contributor

My suggestions:

  1. Bump up the timeout of IsVDVSRunning from 90 seconds to 180 seconds
  2. Bump up the e2e timeout threshold value from 50m to 60m

Since now e2e will take ~60m to complete, we may want to move some long running test cases out of the precheckin group.

@shaominchen shaominchen assigned ashahi1 and unassigned shaominchen Sep 5, 2017
@shaominchen
Copy link
Contributor

@ashahi1 will generate a PR for this issue.

@ashahi1
Copy link
Contributor Author

ashahi1 commented Sep 7, 2017

Fixed as part of PR #1866

@ashahi1 ashahi1 closed this as completed Sep 7, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants