[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

ashahi1 · 2017-09-05T20:00:48Z

CI: https://ci.vmware.run/vmware/docker-volume-vsphere/1569

Plugin and driver logs have been attached at the end.

Test log:

2017/09/05 18:56:35 vDVS is not running on VM: 192.168.31.85
2017/09/05 18:56:35 vDVS is not running on VM [Ubuntu.16.10]. Please check if the previous VM IP [192.168.31.85] has changed.
2017/09/05 18:56:35 Destroying volume [swarm_test_volume_194316]

----------------------------------------------------------------------
FAIL: swarm_test.go:73: SwarmTestSuite.TearDownTest

swarm_test.go:76:
    c.Assert(err, IsNil, Commentf(out))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4201ad8a0), Stderr:[]uint8(nil)} ("exit status 1")
... Error response from daemon: unable to remove volume: remove swarm_test_volume_194316: VolumeDriver.Remove: Failed to remove volume swarm_test_volume_194316, in use by VM = Ubuntu.14.04.


----------------------------------------------------------------------
PANIC: swarm_test.go:96: SwarmTestSuite.TestFailoverAcrossSwarmNodes

swarm_test.go:164:
    c.Assert(isVDVSRunning, Equals, true, Commentf("vDVS is not running after VM [%s] being restarted", hostName))
... obtained bool = false
... expected bool = true
... vDVS is not running after VM [Ubuntu.16.10] being restarted

Logs:
docker-volume-vsphere.txt
vmdk_ops.txt

The text was updated successfully, but these errors were encountered:

ashahi1 · 2017-09-05T21:13:21Z

@shaominchen Can you please take a look?

shaominchen · 2017-09-05T22:19:37Z

Anup and I debugged this issue. After VM was powered off and then powered on, it took about 2.5 minutes for the vDVS plugin to get up and running. Here are the plugin logs:

2017-09-05 21:50:16.821252145 +0000 UTC [WARNING] Received signal signal=terminated 
2017-09-05 21:52:34.103082066 +0000 UTC [INFO] No config file found. Using defaults.
2017-09-05 21:52:34.146583933 +0000 UTC [INFO] Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf" 
2017-09-05 21:52:34.146708017 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:36.147061936 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:36.147160647 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded).
2017-09-05 21:52:36.147180745 +0000 UTC [INFO] Docker VMDK plugin started version="vSphere Volume Driver v0.5" port=1019 mock_esx=false 
2017-09-05 21:52:36.148000293 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock" 
2017-09-05 21:52:36.148524901 +0000 UTC [INFO] Scheduling again after 2 seconds
2017-09-05 21:52:38.154959097 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:40.156367475 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:40.156489255 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded). Attempts left: 20 
2017-09-05 21:52:40.156563916 +0000 UTC [INFO] Scheduling again after 4 seconds
2017-09-05 21:52:44.156884722 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:46.157409448 +0000 UTC [INFO] Can't connect to unix:///var/run/docker.sock due to (An error occurred trying to connect: context deadline exceeded), skipping discovery
2017-09-05 21:52:46.157566438 +0000 UTC [INFO] Refcounting failed: (An error occurred trying to connect: context deadline exceeded). Attempts left: 19 
2017-09-05 21:52:46.157604926 +0000 UTC [INFO] Scheduling again after 8 seconds
2017-09-05 21:52:54.157808463 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
2017-09-05 21:52:54.182122921 +0000 UTC [INFO] Found 0 running or paused containers
2017-09-05 21:52:54.817017907 +0000 UTC [INFO] Discovered 0 volumes that may be in use.
2017-09-05 21:52:54.817117631 +0000 UTC [INFO] Refcounting successfully completed

From above logs, we can see that the plugin was terminated at 21:50:16, and then was fully back and running at 21:52:54. From e2e test logs, we can see that we stopped polling at 21:52:25 before the plugin is back running at 21:52:54:

2017/09/05 21:50:30 Powering on VM [Ubuntu.16.10]
2017/09/05 21:50:31 Sleep for 5 seconds
2017/09/05 21:50:36 Verifying if vDVS is running on vm: 192.168.31.85
2017/09/05 21:50:36 Sleep for 3 seconds
2017/09/05 21:50:57 Sleep for 3 seconds
2017/09/05 21:51:01 Sleep for 3 seconds
2017/09/05 21:51:04 Sleep for 3 seconds
2017/09/05 21:51:07 Sleep for 3 seconds
2017/09/05 21:51:10 Sleep for 3 seconds
2017/09/05 21:51:13 Sleep for 3 seconds
2017/09/05 21:51:16 Sleep for 3 seconds
2017/09/05 21:51:19 Sleep for 3 seconds
2017/09/05 21:51:22 Sleep for 3 seconds
2017/09/05 21:51:25 Sleep for 3 seconds
2017/09/05 21:51:28 Sleep for 3 seconds
2017/09/05 21:51:31 Sleep for 3 seconds
2017/09/05 21:51:34 Sleep for 3 seconds
2017/09/05 21:51:37 Sleep for 3 seconds
2017/09/05 21:51:40 Sleep for 3 seconds
2017/09/05 21:51:43 Sleep for 3 seconds
2017/09/05 21:51:46 Sleep for 3 seconds
2017/09/05 21:51:49 Sleep for 3 seconds
2017/09/05 21:51:52 Sleep for 3 seconds
2017/09/05 21:51:55 Sleep for 3 seconds
2017/09/05 21:51:58 Sleep for 3 seconds
2017/09/05 21:52:01 Sleep for 3 seconds
2017/09/05 21:52:04 Sleep for 3 seconds
2017/09/05 21:52:07 Sleep for 3 seconds
2017/09/05 21:52:10 Sleep for 3 seconds
2017/09/05 21:52:13 Sleep for 3 seconds
2017/09/05 21:52:16 Sleep for 3 seconds
2017/09/05 21:52:19 Sleep for 3 seconds
2017/09/05 21:52:22 Sleep for 3 seconds
2017/09/05 21:52:25 vDVS is not running on VM: 192.168.31.85

So, this is not a test code issue. It's just that the whole process, including VM PowerOff/PowerOn, Docker bootstrapping, vDVS plugin initialization, takes too long to complete.

shaominchen · 2017-09-05T22:24:07Z

My suggestions:

Bump up the timeout of IsVDVSRunning from 90 seconds to 180 seconds
Bump up the e2e timeout threshold value from 50m to 60m

Since now e2e will take ~60m to complete, we may want to move some long running test cases out of the precheckin group.

shaominchen · 2017-09-05T22:25:00Z

@ashahi1 will generate a PR for this issue.

ashahi1 · 2017-09-07T17:47:09Z

Fixed as part of PR #1866

ashahi1 added kind/bug P0 labels Sep 5, 2017

ashahi1 added this to the Sprint - Rafiki milestone Sep 5, 2017

ashahi1 assigned shaominchen Sep 5, 2017

shaominchen assigned ashahi1 and unassigned shaominchen Sep 5, 2017

ashahi1 mentioned this issue Sep 5, 2017

Increase the test run-time for test-e2e-runonce target #1866

Merged

shaominchen mentioned this issue Sep 6, 2017

Optimize vmdkops_admin ls operation #1867

Merged

ashahi1 closed this as completed Sep 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

ashahi1 commented Sep 5, 2017

ashahi1 commented Sep 5, 2017

shaominchen commented Sep 5, 2017

shaominchen commented Sep 5, 2017

shaominchen commented Sep 5, 2017

ashahi1 commented Sep 7, 2017

[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

[CI] Failover test for swarm failed as vDVS was not running on the worker node after vm power-on #1864

Comments

ashahi1 commented Sep 5, 2017

ashahi1 commented Sep 5, 2017

shaominchen commented Sep 5, 2017

shaominchen commented Sep 5, 2017

shaominchen commented Sep 5, 2017

ashahi1 commented Sep 7, 2017