This repository has been archived by the owner on Nov 9, 2020. It is now read-only.
The current design has a flaw in how it handles multiple ETCD update failures that occur at the same time.
Consider the following scenario:
A worker creates a new volume and starts to use it; internally, the global refcount is atomically incremented by one.
All managers are triggered to update the state from "Ready" to "Mounting". Assume the ETCD cluster suddenly becomes inaccessible, so the state update fails and the volume's state stays "Ready".
The worker times out while waiting for the state to become "Mounted". As error handling, the worker tries to decrement the global refcount. However, the ETCD cluster is inaccessible again, so the decrement may fail and the worker returns an error.
Now the volume's state is "Ready" while its global refcount is 1. From this point on, even after the ETCD cluster is back to normal, no one can use this volume correctly: the global refcount will never change from 0 to 1 again, which means no file server can be started for it.
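The scenario can be replayed with a small in-memory simulation. This is only a sketch: `fakeStore`, `addRef`, `setState`, and `runScenario` are hypothetical names standing in for the real ETCD-backed code, and the failure is modeled with a simple `down` flag.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStore is a hypothetical in-memory stand-in for the ETCD cluster.
type fakeStore struct {
	down     bool   // when true, every write fails
	state    string // volume state: "Ready", "Mounting", "Mounted"
	refcount int    // global refcount of the volume
}

var errUnreachable = errors.New("etcd cluster not accessible")

// addRef changes the global refcount by delta unless the store is down.
func (s *fakeStore) addRef(delta int) error {
	if s.down {
		return errUnreachable
	}
	s.refcount += delta
	return nil
}

// setState moves the volume from one state to another unless the store is down.
func (s *fakeStore) setState(from, to string) error {
	if s.down {
		return errUnreachable
	}
	if s.state != from {
		return fmt.Errorf("state is %q, expected %q", s.state, from)
	}
	s.state = to
	return nil
}

// runScenario replays the four steps and returns the final state and refcount.
func runScenario() (string, int) {
	s := &fakeStore{state: "Ready"}
	_ = s.addRef(1)                     // step 1: worker uses the volume, refcount 0 -> 1
	s.down = true                       // ETCD becomes inaccessible
	_ = s.setState("Ready", "Mounting") // step 2: manager update fails
	_ = s.addRef(-1)                    // step 3: worker rollback after timeout fails too
	s.down = false                      // step 4: ETCD recovers, record stays inconsistent
	return s.state, s.refcount
}

func main() {
	state, refcount := runScenario()
	fmt.Printf("state=%s refcount=%d\n", state, refcount) // prints "state=Ready refcount=1"
}
```

Both failed writes are silently lost, so the record ends up in a combination ("Ready" with refcount 1) that no normal code path is expected to produce.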
To handle the above race condition, the design needs to be adjusted to introduce locks or helper threads that check for mismatches between the global refcount and the actual users.
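One possible shape for such a helper check is sketched below. It is not the project's implementation: `volumeRecord`, its `Users` field, and `reconcile` are hypothetical, and the sketch assumes the checker can count the workers that are really using the volume. A real version would also need a lock or a compare-and-swap against ETCD so that it does not race with an in-flight mount.

```go
package main

import "fmt"

// volumeRecord is a hypothetical view of the data kept per volume.
type volumeRecord struct {
	State    string // "Ready", "Mounting", "Mounted", ...
	Refcount int    // global refcount stored in ETCD
	Users    int    // workers actually using the volume, as counted by the checker
}

// reconcile returns a corrected copy of rec and reports whether it changed.
// It only touches records in the "Ready" state, where no transition should be
// in progress, and resets the refcount to the observed number of users.
func reconcile(rec volumeRecord) (volumeRecord, bool) {
	if rec.State != "Ready" || rec.Refcount == rec.Users {
		return rec, false
	}
	rec.Refcount = rec.Users
	return rec, true
}

func main() {
	// The stuck record from the scenario: "Ready" with a leaked refcount of 1.
	stuck := volumeRecord{State: "Ready", Refcount: 1, Users: 0}
	fixed, changed := reconcile(stuck)
	fmt.Printf("changed=%v refcount=%d\n", changed, fixed.Refcount) // prints "changed=true refcount=0"
}
```

Once the refcount is back at 0, the next use of the volume can again trigger the 0 to 1 change that starts the mount.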