This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Implement the new locking and notification system for vFile #2001

Merged (2 commits) on Nov 30, 2017

Conversation

luomiao
Contributor

@luomiao luomiao commented Nov 28, 2017

This PR solves issue #1943.

The new design has the following changes:

  1. Starting and stopping the file server is no longer triggered by global refcount changes. Instead, every global mount/umount request increments the value of StartTrigger/StopTrigger in the KV store. Two separate watchers (one on each master node) generate events from the PUT operations on these two triggers.
  2. Volume states are reduced to Ready and Mounted only; no intermediate states are needed. As a result, locks are required when states need to be updated.
  3. To avoid overlapping operations, locks are also required for updating global refcounts. Global refcount locks are distinct from volume state locks: usually the workers are the ones who grab global refcount locks, and the managers (watchers) are the ones who need state locks.
  4. Two fields, StartMarker and StopMarker, guarantee that only one manager's watcher proceeds with the start/stop server operations, so the other watchers can return and handle events for other volumes in parallel.
  5. To ensure that an error won't leave volumes in an error state, the order of metadata updates in the KV store has changed as well. First, the global refcount is only increased after the volume state is Mounted; second, the state can be changed to Mounted only after the file server is up and running. Conversely, during unmount, the global refcount and client list are updated first in the same transaction; then the stop trigger is increased to fire the event that stops the file server.
  6. During volume deletion, both the global refcount and state locks are required. The plugin is responsible for resetting the global refcount to 0 and the state to Ready, and should then increase the StopTrigger to shut down the file server service if, for some reason, it is still running.

Contributor

@lipingxue lipingxue left a comment


This is a very big change. I've finished half of it in my first pass. I'd like to have a code walkthrough with you to better understand the code.

	return 0, "", false
} else {
	// the service is already stopped
	return 0, "", true
Contributor


Why do we return "true" here to indicate the service is already stopped?

Contributor Author


Because the purpose of this function is to stop the service. When the service is already stopped, we just return true so the event handlers can proceed to the next steps.

string(fromState), string(interimState))
if !succeeded {
// this handler doesn't get the right to start/stop server
).Debug("Watcher on start trigger returns event ")
Contributor


Better to change the log level to info, so we can trace it without enabling debug logging.

// Compare the value of start marker, only one watcher will be able to successfully update the value to
// the new value of start trigger
success, err := e.CompareAndPutIfNotEqual(kvstore.VolPrefixStartMarker+volName, string(ev.Kv.Value))
if err != nil || !success {
return
Contributor


Add log for this error case.

Contributor Author


This is not a real error. Only one watcher will be able to successfully update the marker; the other watchers will fail, but they should just return. I will add more comments here.

func (e *EtcdKVS) etcdStopEventHandler(ev *etcdClient.Event) {
log.WithFields(
log.Fields{"type": ev.Type},
).Debug("Watcher on stop trigger returns event ")
Contributor


Change the log level to info.

// the new value of stop trigger
success, err := e.CompareAndPutIfNotEqual(kvstore.VolPrefixStopMarker+volName, string(ev.Kv.Value))
if err != nil || !success {
return
Contributor


Add log for this error case.


// TryLock: try to get an etcd mutex lock
func (e *EtcdLock) TryLock() error {
session, err := concurrency.NewSession(e.lockCli, concurrency.WithTTL(20))
Contributor


The TTL is 20 seconds, right? Better to define a const instead of using the hard-coded value "20" here.

func (e *EtcdLock) BlockingLockWithLease() error {
log.Debugf("BlockingLockWithLease: key=%s", e.Key)

session, err := concurrency.NewSession(e.lockCli, concurrency.WithTTL(20))
Contributor


Same as above.

if err != nil {
msg := fmt.Sprintf("Transactional metadata update failed: %v.", err)
if err == context.DeadlineExceeded {
msg += swarmUnhealthyErrorMsg
Contributor


Why do we add the "swarmUnhealthy" error message here?

Contributor Author


Should be etcdUnhealthy. Will update.

@@ -57,13 +57,36 @@ const (
VolPrefixGRef = "SVOLS_gref_"
VolPrefixInfo = "SVOLS_info_"
VolPrefixClient = "SVOLS_client_"
VolPrefixStartTrigger = "SVOLS_start_trigger_"
Contributor


I am still not quite clear on VolPrefixStartTrigger vs. VolPrefixStartMarker. Could you explain when those two values are used?

Contributor

@lipingxue lipingxue left a comment


Overall looks good, only have a few comments.

// stop the SMB server
port, servName, succeeded := e.dockerOps.StopSMBServer(volName)
if succeeded {
err = e.updateServerInfo(kvstore.VolPrefixInfo+volName, port, servName)
Contributor


Why do we need to update the server info after removing the SMB service? I think port and servName should be empty since the service is stopped, right?

kvstore.VolPrefixState+r.Name, err)
log.Error(msg)
grefLock.ReleaseLock()
return volume.Response{Err: msg}
Contributor


I think we should call stateLock.ClearLock() here.

Contributor Author


No, this error path is for when creating the state lock fails. Since the lock was never created, we don't need to clear it.

msg = fmt.Sprintf("Failed to try lock for removing volume %s. Error: %v",
kvstore.VolPrefixState+r.Name, err)
log.Error(msg)
stateLock.ClearLock()
Contributor


I think this should be stateLock.ReleaseLock().

Contributor Author


When we don't successfully get the lock, we call ClearLock instead of ReleaseLock.

lock, err := d.kvStore.CreateLock(kvstore.VolPrefixGRef + name)
if err != nil {
log.Errorf("Failed to create lock for mounting volume %s", kvstore.VolPrefixGRef+name)
return "", err
Contributor


I think we should call lock.ClearLock() before returning the error.

Contributor Author


Same comment as above.

log.Fields{"volume name": name,
"error": err,
}).Error("Failed to get IP address from docker swarm ")
log.Errorf("Failed to create lock for mounting volume %s", kvstore.VolPrefixGRef+name)
Contributor


Should call lock.ClearLock() before returning the error.

Contributor

@lipingxue lipingxue left a comment


LGTM
