vFile: handle swarm node promotion and demotion #1868

luomiao · 2017-09-07T00:26:53Z

Resolves #1732

When a node is promoted from worker to manager, the helper thread joins the ETCD cluster according to the swarm information.
Conversely, when a node is demoted from manager to worker, the helper thread stops the watcher, removes the node from the ETCD member list, and cleans up the ETCD data directory.

This is required because, as roles change, the swarm may eventually run out of its original managers, and the ETCD cluster would then run out of members.

Manually tested with a 4-node swarm cluster by promoting and demoting one of the nodes multiple times, using etcdctl to verify that the ETCD service is in the correct state after each role change.
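For readers skimming the thread, the demotion path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `member` and `membership` types, `findMember`, and the `stopWatcher` callback are hypothetical stand-ins for the real ETCD client calls, and the sketch assumes a node that is not found in the member list should still have its data directory cleaned up.

```go
package main

import "os"

// member is a hypothetical, simplified view of one ETCD cluster member.
type member struct {
	ID       uint64
	PeerURLs []string
}

// membership is a hypothetical stand-in for the ETCD membership API.
type membership struct {
	List   func() ([]member, error)
	Remove func(id uint64) error
}

// findMember returns the ID of the member that advertises peerAddr, if any.
func findMember(members []member, peerAddr string) (uint64, bool) {
	for _, m := range members {
		for _, u := range m.PeerURLs {
			if u == peerAddr {
				return m.ID, true
			}
		}
	}
	return 0, false
}

// leaveCluster sketches the demotion steps in the order the description gives:
// stop the watcher, remove this node from the member list, clean the data dir.
func leaveCluster(ops membership, peerAddr, dataDir string, stopWatcher func()) error {
	stopWatcher()

	members, err := ops.List()
	if err != nil {
		return err
	}
	if id, ok := findMember(members, peerAddr); ok {
		if err := ops.Remove(id); err != nil {
			return err
		}
	}
	// os.RemoveAll returns nil when the path does not exist.
	return os.RemoveAll(dataDir)
}
```

Passing the membership operations in as function values keeps the sketch testable without a running ETCD cluster.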

msterin

Overall it looks good to my (already untrained) eye, but we should start adding automated testing IN the PRs.
A couple of minor comments are also inside

msterin · 2017-09-07T07:38:37Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

-	_, err := exec.Command("/bin/etcd", cmd...).Output()
+// leaveEtcdCluster function is called when a manager is demoted
+func (e *EtcdKVS) leaveEtcdCluster() error {
+	nodeAddr := e.nodeAddr


Not an error, just curious: why not use e.nodeAddr where needed? Why the extra vars?

I am trying to follow some rule to avoid multiple accesses to a parameter inside a struct...
But maybe it's not applicable here.
I can replace with using e.nodeAddr directly.

msterin · 2017-09-07T07:39:25Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+				).Error("Failed to remove member for ETCD ")
+				return err
+			}
+			// the same peerAddr can only join at once. no need to continue.


print info ?

msterin · 2017-09-07T07:39:42Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+		}
+	}
+
+	e.etcdStopService()


pls log info

etcdStopService already has log info inside.

msterin · 2017-09-07T07:41:16Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+
+// etcdStartService function starts an ETCD process
+func (e *EtcdKVS) etcdStartService(lines []string) {
+	cmd := exec.Command("/bin/etcd", lines...)


We should use systemd to manage services. Running daemons ourselves means we are in charge of resource allocation and restarting on failure. If we do not have a tracking issue for this, please open one.

Agree.
Issue created: #1873

lipingxue

Overall looks good, only have some comments/questions.

lipingxue · 2017-09-07T17:15:04Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+		).Error("Failed to list member for ETCD")
+		return err
+	}
+


Please add a comment explaining what "peerAddr" is; an example would be helpful.
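As an illustration of the kind of example being requested, a peer address is typically the URL a node advertises to the other ETCD members. The sketch below assumes plain HTTP and etcd's default peer port 2380; the constant names and the `peerAddrFor` helper are hypothetical and may not match how this PR actually builds the value.

```go
package main

import "net"

const (
	etcdScheme   = "http://" // assumption: peers talk over plain HTTP
	etcdPeerPort = "2380"    // etcd's default peer port
)

// peerAddrFor builds the peer URL for a node address,
// e.g. "http://192.168.1.10:2380" for nodeAddr "192.168.1.10".
func peerAddrFor(nodeAddr string) string {
	return etcdScheme + net.JoinHostPort(nodeAddr, etcdPeerPort)
}
```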

lipingxue · 2017-09-07T17:16:38Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+		}
+	}
+
+	e.etcdStopService()


lipingxue · 2017-09-07T17:22:06Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

@@ -74,6 +77,9 @@ type EtcdKVS struct {
 	dockerOps *dockerops.DockerOps
 	nodeID    string
 	nodeAddr  string
+	isManager bool


Please add comments for those three newly added fields.

lipingxue · 2017-09-07T17:24:32Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

 				go e.etcdWatcher(cli)
-				go e.serviceAndVolumeGC(cli)


Why don't we need to call e.serviceAndVolumeGC here?

serviceAndVolumeGC has been renamed to etcdHelper and now contains the role check.
It has been moved out of checkLocalEtcd and is now called after joinETCD/startETCD, so that etcdHelper itself can reuse the joinETCD function.

lipingxue · 2017-09-07T17:27:03Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

+		}
+	} else {
+		if e.isManager {
+			err = e.leaveEtcdCluster()


Add a comment saying "demoted from manager to worker, leave ETCD cluster".

luomiao · 2017-09-07T21:26:41Z

@msterin @lipingxue
Addressed your comments.
Please review again.

@msterin Yes, we should start adding tests. This is a high-priority item for the next month.
We first need to resolve the testbed problem.

msterin · 2017-09-07T22:35:22Z

I am not sure what the "testbed problem" is, so I assume it is something preventing you from writing and committing automated tests. In that case, IMO this should be the top priority and top work item: enable automated testing before doing any feature work that is not automatically tested.

luomiao · 2017-10-26T22:40:18Z

@lipingxue
I added an e2e test for this PR.
The new test changes the roles of a manager and a worker in the swarm cluster and runs volume lifecycle tests before and after the role change.
There are also some new updates to resolve code conflicts with the master branch.
Please review the new changes accordingly. Thank you!

lipingxue

Overall looks good, and I have a few comments.

lipingxue · 2017-10-27T18:14:53Z

tests/e2e/vfile_demote_promote_test.go

+// limitations under the License.
+
+// This test suite includes test cases to verify basic functionality
+// before upgrade for upgrade test


This should be "before and after for promote/demote test", a copy paste issue.

Thanks for catching that! Will update.

lipingxue · 2017-10-27T18:26:02Z

tests/e2e/vfile_demote_promote_test.go

+
+var _ = Suite(&VFileDemotePromoteTestSuite{})
+
+// All VMs are created in a shared datastore


The test looks good overall. Can we add an enhancement?

1. Before promote/demote, after creating and attaching the 1st volume, write some data to the volume.

2. After promote/demote, read the data back from the 1st volume to make sure the data written in step 1 still exists.

The read/write tests are already covered by advanced_vfile_test.
Also, the role change code only affects ETCD; it affects neither the file server nor the internal volumes.
So I think this test should focus on the role change only?

lipingxue · 2017-10-27T18:29:03Z

tests/e2e/vfile_demote_promote_test.go

+
+	out, err = dockercli.DeleteVolume(s.worker1, s.volName2)
+	c.Assert(err, IsNil, Commentf(out))
+


I think the code after this line just resets the testbed, right? It is not part of the test itself. If so, please add a comment here.

I am a little confused here. The code below resets the testbed's swarm roles back to their initial state so that the following tests are not affected.
If we don't put it here, where should this reset code go?

You can put it here, but please add a one-line comment saying that the following code resets the testbed.
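One way to make the reset step stand out from the test body is to have the role swap return its own undo function. The sketch below is hypothetical and does not use the e2e framework from this PR: `swapRoles`, `promote`, and `demote` are illustrative names, and the ordering (promote before demote in both directions) is an assumption chosen so the swarm never loses a manager mid-swap.

```go
package main

// swapRoles promotes worker and demotes manager through the supplied
// (hypothetical) promote/demote callbacks. It returns a reset function that
// restores the original roles so that later tests see an unchanged testbed.
func swapRoles(manager, worker string, promote, demote func(node string) error) (func() error, error) {
	// Promote first so the swarm keeps at least one manager throughout.
	if err := promote(worker); err != nil {
		return nil, err
	}
	if err := demote(manager); err != nil {
		return nil, err
	}

	reset := func() error {
		// Reset the testbed: restore the original swarm roles.
		if err := promote(manager); err != nil {
			return err
		}
		return demote(worker)
	}
	return reset, nil
}
```

A test would call `swapRoles`, run its checks, and then call the returned `reset` function (for example via `defer`) under a one-line comment marking it as testbed cleanup.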

lipingxue

LGTM

CI test for vFile has been added so this request has been cleared.

luomiao requested review from msterin and lipingxue September 7, 2017 00:26

vmwclabot added the cla-not-required label Sep 7, 2017

msterin previously requested changes Sep 7, 2017

lipingxue reviewed Sep 7, 2017

luomiao force-pushed the vfile-role-change branch from a0d4cc1 to 5009c82 Compare September 7, 2017 21:26

luomiao force-pushed the vfile-role-change branch from 5009c82 to 5997ab0 Compare October 26, 2017 22:21

lipingxue reviewed Oct 27, 2017

luomiao force-pushed the vfile-role-change branch from 5ea3152 to 42c40e1 Compare October 30, 2017 21:17

Miao Luo added 6 commits October 30, 2017 14:24

Address comments.

fbde1b1

Resolve conflicts; Add e2e test.

53149ed

Add debug building script info.

fbf3a15

Minor update and address comments.

a16774a

Address comment.

0bdf110

luomiao force-pushed the vfile-role-change branch from 42c40e1 to 5561c40 Compare October 30, 2017 21:24

lipingxue approved these changes Oct 30, 2017

Resolve conflicts, retrigger CI.

d021b5a

luomiao force-pushed the vfile-role-change branch from 5561c40 to d021b5a Compare October 31, 2017 04:19

luomiao merged commit 3aa760a into vmware-archive:master Oct 31, 2017
