Feature Request: Support for new API TryLock at Topo level. #11271

rsajwani · 2022-09-21T01:44:46Z

Feature Description

Topo.LockShard is a blocking call. When concurrent calls to lock same shard happens, the client gets block until the lock releases itself or context times out. But this create an issue where a lot of clients try to acquire lock all together and result in server getting overwhelmed.

In order to avoid server getting overwhelmed we need to introduce TryLock which is non-blocking and return immediately if it cannot acquire lock instead of waiting at server.

Context

Originally we were trying to add optimization for LockShard under go/vt/orchestrator/logic/tablet_discovery.go. We wanted to call topo.CheckShardLocked before we call ts.LockShard at https://github.com/vitessio/vitess/blob/main/go/vt/orchestrator/logic/tablet_discovery.go#L260, to avoid excessive calls to lock.
But here are some observation

Our lockshard is context bases. Which means it hold lock per context. https://github.com/vitessio/vitess/blob/main/go/vt/topo/locks.go#L296
What this means is even calling CheckShardLocked will not prevent us to avoid calls for LockShard across multiple process. for e.g replManager and vtorc try to acquire lock for same shard.
We already check for lock presence within ts.lockshard https://github.com/vitessio/vitess/blob/main/go/vt/topo/locks.go#L307, which means within same context we don't need an extra checkshardlocked call and across process anyways we don't handle concurrency (as it is context base)
This does raise an interesting point that Lockshard does actually result in creating multiple Key/value under same path e.g /vitess/global/keyspaces/ks/shards/0/locks, but the way we avoid it is once we create the latest key/value we wait for older key/values to disappear before we give control back to client. That actually result in 10 second (TTL) block for client if it tries to acquire lock for already acquired path. We do this using revision id.
https://github.com/vitessio/vitess/blob/main/go/vt/topo/etcd2topo/lock.go#L165
We were able to repro state mentioned in step#4 , where for the same file path we can have multiple lock files. One is blocked on other to get expire hence client is blocked in second lock. See attached image in comments for call to get path /vitess/global/keyspaces/ks/shards/0.

Off course this state is ephemeral and last for 10 seconds. therefore we felt adding checkshardLock before lockshard won't help.

Optimization we can do here is make client lock call non-blocking. We should come up with semantics for tryLockAcquire. This will help us to un-block client in case of concurrency for up to TTL which is 10 seconds right now.

Use Case(s)

This can be called anywhere where we are doing LockShard call.

rsajwani added Type: Feature Needs Triage This issue needs to be correctly labelled and triaged labels Sep 21, 2022

rsajwani mentioned this issue Sep 21, 2022

Create new api for topo lock shard exists #11269

Merged

3 tasks

rsajwani changed the title ~~Feature Request: Support for new API TryShard at Topo level.~~ Feature Request: Support for new API TryLock at Topo level. Sep 21, 2022

mattlord added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Sep 23, 2022

rsajwani self-assigned this Dec 1, 2022

rsajwani closed this as completed Dec 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: Support for new API TryLock at Topo level. #11271

Feature Request: Support for new API TryLock at Topo level. #11271

rsajwani commented Sep 21, 2022 •

edited by deepthi

Loading

Feature Request: Support for new API TryLock at Topo level. #11271

Feature Request: Support for new API TryLock at Topo level. #11271

Comments

rsajwani commented Sep 21, 2022 • edited by deepthi Loading

Feature Description

Context

Use Case(s)

rsajwani commented Sep 21, 2022 •

edited by deepthi

Loading