-
Notifications
You must be signed in to change notification settings - Fork 95
vFile: handle swarm node promotion and demotion #1868
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks good to my (already untrained) eye, but we should start adding automated testing IN the PRs.
A couple of minor comments are also inside
_, err := exec.Command("/bin/etcd", cmd...).Output() | ||
// leaveEtcdCluster function is called when a manager is demoted | ||
func (e *EtcdKVS) leaveEtcdCluster() error { | ||
nodeAddr := e.nodeAddr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not an error, just curious - why not use e.nodeAddr where needed, why the extra vars ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to follow some rule to avoid multiple accesses to a parameter inside a struct...
But maybe it's not applicable here.
I can replace with using e.nodeAddr directly.
).Error("Failed to remove member for ETCD ") | ||
return err | ||
} | ||
// the same peerAddr can only join at once. no need to continue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print info ?
} | ||
} | ||
|
||
e.etcdStopService() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls log info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
etcdStopService already has log info inside.
|
||
// etcdStartService function starts an ETCD process | ||
func (e *EtcdKVS) etcdStartService(lines []string) { | ||
cmd := exec.Command("/bin/etcd", lines...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use systemd to manage services. Running daemons ourselves means we are in charge of resource allocation and restart on issues.... If we do no have a tracking issue for this, please do open one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
Issue created: #1873
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, only have some comments/questions.
).Error("Failed to list member for ETCD") | ||
return err | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment about what is "peerAddr", and it could be helpful with an example
} | ||
} | ||
|
||
e.etcdStopService() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -74,6 +77,9 @@ type EtcdKVS struct { | |||
dockerOps *dockerops.DockerOps | |||
nodeID string | |||
nodeAddr string | |||
isManager bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments for those three newly added fields.
go e.etcdWatcher(cli) | ||
go e.serviceAndVolumeGC(cli) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why here we don't need to call e.serviceAndVolumeGC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serviceAndVolumeGC is renamed to etcdHelper with the role check function inside now.
It's now moved to outside of checkLocalEtcd and after joinETCD/startETCD, so the joinETCD function can be re-used by etcdHelper itself.
} | ||
} else { | ||
if e.isManager { | ||
err = e.leaveEtcdCluster() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment to say that "demote from manger to worker, leave ETCD cluster"
a0d4cc1
to
5009c82
Compare
@msterin @lipingxue @msterin Yes we should start adding tests. This is a high priority issue for the next one month. |
I am not sure what is the "testbed problem", so I assume this is something preventing you from writing and committing automated tests. In this case IMO this should be the top priority and top work item - to enable automated testing before doing any (not automatically tested) feature work |
5009c82
to
5997ab0
Compare
@lipingxue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, and I have a few comments.
// limitations under the License. | ||
|
||
// This test suite includes test cases to verify basic functionality | ||
// before upgrade for upgrade test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be "before and after for promote/demote test", a copy paste issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catching! will update.
|
||
var _ = Suite(&VFileDemotePromoteTestSuite{}) | ||
|
||
// All VMs are created in a shared datastore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test overall looks good. Can we add some enhancement?
- before promote/demote, after create and attach the 1st volume, write some data in the volume
- after promote/demote, read the data back from the 1st volume to make sure data written in step 1 are still exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The read/write tests have been covered by the advanced_vfile_test.
Also the role change code only affects ETCD, and won't affect neither the file server nor the internal volumes.
So I think this one should focus on the role change only?
|
||
out, err = dockercli.DeleteVolume(s.worker1, s.volName2) | ||
c.Assert(err, IsNil, Commentf(out)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code after this line is just to make reset the test bed, right? It is not part of the test itself. If it is true, please add a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused here. Below this we are trying to reset the testbed's swarm role back to the beginning, in order to not affect other following tests.
If we don't put it here, where we should include this reset part of code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put it here, but just add a one line comment to say that the following code is for reset the testbed.
5ea3152
to
42c40e1
Compare
When a node is promoted from worker to manager, the helper thread will join ETCD cluster according to swarm information; On the other hand, when the node is demoted from manager to worker, the helper thread should stop the watcher, delete itself from ETCD member list, and clean up the ETCD data directory.
42c40e1
to
5561c40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
5561c40
to
d021b5a
Compare
CI test for vFile has been added so this request has been cleared.
When a node is promoted from worker to manager, the helper thread will join ETCD cluster according to swarm information; On the other hand, when the node is demoted from manager to worker, the helper thread should stop the watcher, delete itself from ETCD member list, and clean up the ETCD data directory.
Resolves #1732
When a node is promoted from worker to manager, the helper thread will join ETCD cluster according to swarm information.
On the other hand, when the node is demoted from manager to worker, the helper thread should stop the watcher, delete itself from ETCD member list, and clean up the ETCD data directory.
This is required since due to the role change, the cluster may eventually run out of original managers, and thus the ETCD cluster.
Manually tested with 4-node swarm cluster and having one of the node promoted/demoted multiple times. Using etcdctl to verify the ETCD service is in correct status according to the node role change.