This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Revisit the error handling for vFile design #1943

Closed
luomiao opened this issue Oct 18, 2017 · 1 comment
Comments

luomiao commented Oct 18, 2017

The current design has a flaw when multiple ETCD updates fail at the same time.
Consider the following scenario:

  1. A worker creates a new volume and starts to use it; internally, the global refcount is atomically incremented by one.
  2. All the managers are triggered to try to update the state from "Ready" to "Mounting". Assume the ETCD cluster suddenly becomes inaccessible, so the state update fails. The state of the volume stays "Ready".
  3. The worker times out while waiting for the state to become "Mounted". As error handling, the worker tries to decrement the global refcount. However, if the ETCD cluster is again inaccessible, the worker may fail to decrement the global refcount and return an error.
  4. Now the volume's state is "Ready" while its global refcount is 1. From now on, even after the ETCD cluster is back to normal, no one can use this volume correctly: the global refcount will never change from 0 to 1 again, which means no file server can be started for it.
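
The four steps above can be replayed with a small Go sketch. It is only an illustration: `fakeStore`, `addRef`, `setState`, and `simulate` are hypothetical names, and the in-memory struct stands in for the ETCD-backed record rather than using the real etcd client or vFile code.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStore is a hypothetical in-memory stand-in for the ETCD-backed volume
// record. It is not the vFile or etcd client API.
type fakeStore struct {
	reachable bool
	state     string
	refcount  int
}

var errUnreachable = errors.New("store unreachable")

// addRef applies a delta to the global refcount, failing if the store is down.
func (s *fakeStore) addRef(delta int) error {
	if !s.reachable {
		return errUnreachable
	}
	s.refcount += delta
	return nil
}

// setState updates the volume state, failing if the store is down.
func (s *fakeStore) setState(state string) error {
	if !s.reachable {
		return errUnreachable
	}
	s.state = state
	return nil
}

// simulate replays steps 1-4 and returns the final record.
func simulate() *fakeStore {
	s := &fakeStore{reachable: true, state: "Ready"}

	// Step 1: the worker takes a reference (0 -> 1).
	_ = s.addRef(1)

	// Step 2: ETCD becomes inaccessible, so the manager's state update fails.
	s.reachable = false
	_ = s.setState("Mounting")

	// Step 3: the worker's rollback of the refcount fails as well.
	_ = s.addRef(-1)

	// Step 4: ETCD recovers, but the record is left inconsistent.
	s.reachable = true
	return s
}

func main() {
	s := simulate()
	fmt.Printf("state=%s refcount=%d\n", s.state, s.refcount)
}
```

Running it prints `state=Ready refcount=1`, the stuck combination described in step 4.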

To handle the above race condition, the design needs to be adjusted to introduce locks or helper threads that check for a mismatch between the global refcount and the actual users of a volume.
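
One possible shape for such a check is sketched below in Go. This is an assumption about how a reconciliation helper could look, not the fix that was eventually merged: the `volume` type, its `users` set, and `reconcile` are hypothetical, and a real version would read and write the record in ETCD rather than in memory.

```go
package main

import (
	"fmt"
	"sync"
)

// volume is a hypothetical record; the field names are illustrative only.
type volume struct {
	mu       sync.Mutex
	state    string
	refcount int
	users    map[string]bool // workers that actually have the volume in use
}

// reconcile resets the global refcount to the number of actual users when
// the two disagree. It reports whether a repair was made.
func (v *volume) reconcile() bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.refcount == len(v.users) {
		return false
	}
	v.refcount = len(v.users)
	return true
}

func main() {
	// The stuck record from the scenario: refcount 1, but no actual users.
	v := &volume{state: "Ready", refcount: 1, users: map[string]bool{}}
	fmt.Println(v.reconcile(), v.refcount)
}
```

Here the lock guards the comparison and the repair as one step. A helper thread could call `reconcile` periodically, for example from a goroutine driven by a `time.Ticker`, so that a record left behind by a failed rollback is eventually corrected.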

luomiao commented Dec 1, 2017

Closed by #2001

@luomiao luomiao closed this as completed Dec 1, 2017