Describe the bug
Leader step-down with the DynamoDB HA backend can leave the cluster in a state where no leader exists and no node can acquire the lock.
Vault leadership with Dynamo works as follows:
When it writes the lock, it applies the following logic (from the writeItem code):
// If both key and path already exist, we can only write if
// A. identity is equal to our identity (or the identity doesn't exist)
// or
// B. The ttl on the item is <= to the current time
The race starts once #5 occurs: the outcome depends on whether 3.a or 3.b runs first. If 3.a runs after the delete, the lock is re-created and this host keeps renewing it forever, even though it has given up leadership. The host is then stuck at #2, unable to acquire the lock, and no other node can acquire it either. If 3.b runs first, it detects that the lock was deleted and triggers stopLeaderCh, which stops renewLock and allows a new leader to be elected.
This only happens on manual step-down because during a shutdown stopLeaderCh is closed, so there is no goroutine to continuously renew the lock.
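The two orderings can be replayed deterministically with a small model. This is a sketch under stated assumptions, not Vault code: `lockStore`, `renew`, `deleted`, and `simulate` are hypothetical names, the store is a single in-memory field rather than DynamoDB, and the renew and watch steps are called in a fixed order instead of running as goroutines.

```go
package main

import "fmt"

// lockStore is a hypothetical single-item store; an empty holder means
// the lock item does not exist.
type lockStore struct{ holder string }

// renew models the periodic lock write: it succeeds when the item is
// absent or already carries our identity.
func (s *lockStore) renew(id string) {
	if s.holder == "" || s.holder == id {
		s.holder = id
	}
}

// deleted models the watcher's check for a removed lock item.
func (s *lockStore) deleted() bool { return s.holder == "" }

// simulate replays a step-down on node-a and returns the final lock
// holder and whether the renew loop is still running.
func simulate(renewFirst bool) (holder string, renewing bool) {
	s := &lockStore{holder: "node-a"}
	renewing = true

	s.holder = "" // step-down deletes the lock item

	if renewFirst {
		s.renew("node-a") // renewal lands after the delete and re-creates the lock
	}
	if s.deleted() {
		renewing = false // watcher sees the delete and stops the renew loop
	}
	return s.holder, renewing
}

func main() {
	h, r := simulate(true)
	fmt.Printf("renew first: holder=%q renewing=%v\n", h, r)
	h, r = simulate(false)
	fmt.Printf("watch first: holder=%q renewing=%v\n", h, r)
}
```

In the renew-first ordering the model ends with node-a holding a lock it keeps renewing, matching the stuck state described above; in the watch-first ordering the lock stays deleted and renewal stops.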
Environment:
Vault Server Version 0.11.4
Code references:
writeItem: vault/physical/dynamodb/dynamodb.go, line 659 in f85efad
Infinite loop: vault/vault/ha.go, line 376 in a58d313
vault/physical/dynamodb/dynamodb.go, lines 564 and 565 in f85efad