Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CAS to Put consul state, and reacquire locks when necessary #14930

Merged
merged 3 commits into from
May 30, 2017

Conversation

jbardin
Copy link
Member

@jbardin jbardin commented May 30, 2017

Consul locks favor liveness, and when a client can't maintain it's session the lock is automatically released.

Since it isn't possible for Terraform to ensure that a lock is not lost at any point during execution, the only option is to attempt to reacquire the lock. When a consul lock is taken, start a goroutine to monitor the lock status channel and attempt to reacquire the lock immediately if it is lost.

This also converts the state Put operation to a CAS, which ensures that the state is always consistent, regardless of the state of the lock.

fixes #14312

@jbardin
Copy link
Member Author

jbardin commented May 30, 2017

$ TF_ACC=1 go test -race -v  2> /dev/null
=== RUN   TestBackend_impl
--- PASS: TestBackend_impl (0.00s)
=== RUN   TestBackend
--- PASS: TestBackend (2.96s)
	testing.go:222: TestBackend: testing state locking for *consul.Backend
=== RUN   TestBackend_lockDisabled
--- PASS: TestBackend_lockDisabled (1.51s)
	testing.go:222: TestBackend: testing state locking for *consul.Backend
	testing.go:251: TestBackend: *consul.Backend: empty string returned for lock, assuming disabled
=== RUN   TestBackend_gzip
--- PASS: TestBackend_gzip (1.77s)
=== RUN   TestRemoteClient_impl
--- PASS: TestRemoteClient_impl (0.00s)
=== RUN   TestRemoteClient
--- PASS: TestRemoteClient (1.85s)
=== RUN   TestRemoteClient_gzipUpgrade
--- PASS: TestRemoteClient_gzipUpgrade (1.43s)
=== RUN   TestConsul_stateLock
--- PASS: TestConsul_stateLock (3.32s)
=== RUN   TestConsul_destroyLock
--- PASS: TestConsul_destroyLock (2.00s)
=== RUN   TestConsul_lostLock
--- PASS: TestConsul_lostLock (1.53s)
PASS

Copy link
Contributor

@apparentlymart apparentlymart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me!

These seems like a similar sort of issue as the one where IAM credentials can expire in the middle of a run and cause us to fail to write remote state at the end. I assume in that case it would fall into the recovery codepath we added in 0.9.6 and write the updated state to disk.

For the future, I wonder if we should look at a way for a remote state backend to monitor on an ongoing basis whether it's still "functional" (that is, whether a write could succeed) and abort early if not, to minimize the amount of state that could potentially be lost; at first glance it feels like we could treat this similarly to SIGINT, completing any outstanding work but not starting anything new. On the other hand, perhaps the local-file recovery workflow is sufficient to address the problem since these situations should be rare anyway.

Consul locks are based on liveness, and may be lost due timeouts,
network issued, etc. If the client determines the lock was lost, attempt
to reacquire the lock immediately.

The client was also not using the `lock` config option. Disable locks if
that is not set.
@jbardin jbardin merged commit 9548987 into master May 30, 2017
@jbardin jbardin deleted the jbardin/consul-cas branch May 30, 2017 19:12
@ghost
Copy link

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consul Back-end Error Releasing Lock
2 participants