You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Topo.LockShard is a blocking call. When concurrent calls to lock same shard happens, the client gets block until the lock releases itself or context times out. But this create an issue where a lot of clients try to acquire lock all together and result in server getting overwhelmed.
In order to avoid server getting overwhelmed we need to introduce TryLock which is non-blocking and return immediately if it cannot acquire lock instead of waiting at server.
What this means is even calling CheckShardLocked will not prevent us to avoid calls for LockShard across multiple process. for e.g replManager and vtorc try to acquire lock for same shard.
This does raise an interesting point that Lockshard does actually result in creating multiple Key/value under same path e.g /vitess/global/keyspaces/ks/shards/0/locks, but the way we avoid it is once we create the latest key/value we wait for older key/values to disappear before we give control back to client. That actually result in 10 second (TTL) block for client if it tries to acquire lock for already acquired path. We do this using revision id. https://github.com/vitessio/vitess/blob/main/go/vt/topo/etcd2topo/lock.go#L165
We were able to repro state mentioned in step#4 , where for the same file path we can have multiple lock files. One is blocked on other to get expire hence client is blocked in second lock. See attached image in comments for call to get path /vitess/global/keyspaces/ks/shards/0.
Off course this state is ephemeral and last for 10 seconds. therefore we felt adding checkshardLock before lockshard won't help.
Optimization we can do here is make client lock call non-blocking. We should come up with semantics for tryLockAcquire. This will help us to un-block client in case of concurrency for up to TTL which is 10 seconds right now.
Use Case(s)
This can be called anywhere where we are doing LockShard call.
The text was updated successfully, but these errors were encountered:
rsajwani
changed the title
Feature Request: Support for new API TryShard at Topo level.
Feature Request: Support for new API TryLock at Topo level.
Sep 21, 2022
Feature Description
Topo.LockShard is a blocking call. When concurrent calls to lock same shard happens, the client gets block until the lock releases itself or context times out. But this create an issue where a lot of clients try to acquire lock all together and result in server getting overwhelmed.
In order to avoid server getting overwhelmed we need to introduce TryLock which is non-blocking and return immediately if it cannot acquire lock instead of waiting at server.
Context
Originally we were trying to add optimization for LockShard under go/vt/orchestrator/logic/tablet_discovery.go. We wanted to call topo.CheckShardLocked before we call ts.LockShard at https://github.com/vitessio/vitess/blob/main/go/vt/orchestrator/logic/tablet_discovery.go#L260, to avoid excessive calls to lock.
But here are some observation
https://github.com/vitessio/vitess/blob/main/go/vt/topo/etcd2topo/lock.go#L165
Off course this state is ephemeral and last for 10 seconds. therefore we felt adding checkshardLock before lockshard won't help.
Optimization we can do here is make client lock call non-blocking. We should come up with semantics for tryLockAcquire. This will help us to un-block client in case of concurrency for up to TTL which is 10 seconds right now.
Use Case(s)
This can be called anywhere where we are doing LockShard call.
The text was updated successfully, but these errors were encountered: