This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Implement the new locking and notification system for vFile #2001

Merged (2 commits) on Nov 30, 2017

Conversation

luomiao
Contributor

@luomiao luomiao commented Nov 28, 2017

This PR solves issue #1943.

The new design has the following changes:

  1. Starting and stopping the file server is no longer triggered by global refcount changes. Instead, every global mount/umount request increments the value of StartTrigger/StopTrigger in the KV store. Two separate watchers (one on each master node) generate events from the PUT operations on these two triggers.
  2. Volume states are reduced to Ready and Mounted only; no intermediate states are needed. As a result, locks are required when states need to be updated.
  3. To avoid overlapping operations, locks are also required for updating global refcounts. Global refcount locks are distinct from volume state locks: usually the workers are the ones who grab global refcount locks, and the managers (watchers) are the ones who need state locks.
  4. Two fields, StartMarker and StopMarker, guarantee that only one manager's watcher proceeds with the start/stop server operations, so the other watchers can return and handle events for other volumes in parallel.
  5. To ensure that an error won't leave volumes in an error state, the order of metadata updates in the KV store has changed as well. First, the global refcount is only increased after the volume state is Mounted; second, the state can be changed to Mounted only after the file server is up and running. Conversely, during unmount, the global refcount and client list are updated first in the same transaction; then the stop trigger is increased to fire the event that stops the file server.
  6. During volume deletion, both the global refcount and state locks are required. The plugin is responsible for resetting the global refcount to 0 and the state to Ready, and should then increase the StopTrigger to shut down the file server service if, for some reason, it is still running.

Contributor

@lipingxue lipingxue left a comment


This is a very big change. I've finished half of it in my first pass. I'd like to have a code walkthrough with you to better understand the code.

	return 0, "", false
} else {
	// the service is already stopped
	return 0, "", true
Contributor


Why do we return "true" here to indicate the service is already stopped?

Contributor Author


Because the purpose of this function is to stop the service. When the service is already stopped, we just return true so the event handlers can proceed to the next steps.

string(fromState), string(interimState))
if !succeeded {
// this handler doesn't get the right to start/stop server
).Debug("Watcher on start trigger returns event ")
Contributor


Better to change the log level to info, so we can trace it without enabling debug logging.

// Compare the value of start marker, only one watcher will be able to successfully update the value to
// the new value of start trigger
success, err := e.CompareAndPutIfNotEqual(kvstore.VolPrefixStartMarker+volName, string(ev.Kv.Value))
if err != nil || !success {
return
Contributor


Add log for this error case.

Contributor Author


This is not a real error. Only one watcher will be able to successfully update the marker; the other watchers will fail, but they should just return. I will add more comments here.

func (e *EtcdKVS) etcdStopEventHandler(ev *etcdClient.Event) {
log.WithFields(
log.Fields{"type": ev.Type},
).Debug("Watcher on stop trigger returns event ")
Contributor


Change the log level to info.

// the new value of stop trigger
success, err := e.CompareAndPutIfNotEqual(kvstore.VolPrefixStopMarker+volName, string(ev.Kv.Value))
if err != nil || !success {
return
Contributor


Add log for this error case.


// TryLock: try to get an etcd mutex lock
func (e *EtcdLock) TryLock() error {
session, err := concurrency.NewSession(e.lockCli, concurrency.WithTTL(20))
Contributor


The TTL is 20 seconds, right? Better to define a const instead of using the hard-coded value "20" here.

func (e *EtcdLock) BlockingLockWithLease() error {
log.Debugf("BlockingLockWithLease: key=%s", e.Key)

session, err := concurrency.NewSession(e.lockCli, concurrency.WithTTL(20))
Contributor


Same as above.

if err != nil {
msg := fmt.Sprintf("Transactional metadata update failed: %v.", err)
if err == context.DeadlineExceeded {
msg += swarmUnhealthyErrorMsg
Contributor


Why do we add the "swarmUnhealthy" error message here?

Contributor Author


Should be etcdUnhealthy. Will update.

@@ -57,13 +57,36 @@ const (
VolPrefixGRef = "SVOLS_gref_"
VolPrefixInfo = "SVOLS_info_"
VolPrefixClient = "SVOLS_client_"
VolPrefixStartTrigger = "SVOLS_start_trigger_"
Contributor


I am still not quite clear on VolPrefixStartTrigger vs. VolPrefixStartMarker. Could you explain when those two values are used?

Contributor

@lipingxue lipingxue left a comment


Overall looks good, only have a few comments.

// stop the SMB server
port, servName, succeeded := e.dockerOps.StopSMBServer(volName)
if succeeded {
err = e.updateServerInfo(kvstore.VolPrefixInfo+volName, port, servName)
Contributor


Why do we need to update the server info after removing the SMB service? I think port and servName should be empty since the service is stopped, right?

kvstore.VolPrefixState+r.Name, err)
log.Error(msg)
grefLock.ReleaseLock()
return volume.Response{Err: msg}
Contributor


I think we should call stateLock.ClearLock() here.

Contributor Author


No, this error path is for when creating the state lock fails. Since the lock was never created, we don't need to clear it.

msg = fmt.Sprintf("Failed to try lock for removing volume %s. Error: %v",
kvstore.VolPrefixState+r.Name, err)
log.Error(msg)
stateLock.ClearLock()
Contributor


I think this should be stateLock.ReleaseLock().

Contributor Author


When we don't successfully get the lock, we call ClearLock instead of ReleaseLock.

lock, err := d.kvStore.CreateLock(kvstore.VolPrefixGRef + name)
if err != nil {
log.Errorf("Failed to create lock for mounting volume %s", kvstore.VolPrefixGRef+name)
return "", err
Contributor


I think we should call lock.ClearLock() before returning the error.

Contributor Author


Same comment as above.

log.Fields{"volume name": name,
"error": err,
}).Error("Failed to get IP address from docker swarm ")
log.Errorf("Failed to create lock for mounting volume %s", kvstore.VolPrefixGRef+name)
Contributor


Should call lock.ClearLock() before returning the error.

Contributor

@lipingxue lipingxue left a comment


LGTM
