State Locking initial implementations #11187
Conversation
This seems related to #5036. 😀
Is there any specific reason you decided to use DynamoDB to hold the lock for the S3 remote state backend instead of saving a "lockfile" in S3? Is it just for the potentially strong consistency and/or conditional writes?
@radeksimko, yes, it's specifically for the strong consistency offered by DynamoDB. S3 is only eventually consistent in almost all cases. IIRC S3 might be consistent after a PUT of a new object and a GET of that object specifically, but even in that case there's no way to make a PUT conditional, so if multiple locks were in-flight at the same time the last timestamp wins. In this case two clients could see no lock file, PUT a new lock file, and even verify that only their unique lock file exists, while continuing to run concurrently.
Item: map[string]*dynamodb.AttributeValue{
	"LockID":  {S: aws.String(stateName)},
	"Created": {S: aws.String(time.Now().UTC().Format(time.RFC3339))},
	"Expires": {S: aws.String(time.Now().Add(time.Hour).UTC().Format(time.RFC3339))},
},
Could the expiration period be configurable somewhere? Maybe not necessarily from the user's perspective (CLI), but I'm thinking it could be passed in as a time.Duration argument to Lock()? That might also make developers implementing new remote backends realise that the expiration should be respected/implemented properly where possible.
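Purely to illustrate the suggestion (the thread ultimately decides against expiration), a hypothetical shape for such a signature; none of this is the PR's actual interface:

```go
package state

import "time"

// Locker is a hypothetical variant of the locking interface where the caller
// passes the expiration period, reminding each backend to honour it.
type Locker interface {
	// Lock acquires the lock, recording now+expiry as the expiration time
	// where the backend supports it, and fails if the state is already locked.
	Lock(expiry time.Duration) error
	// Unlock releases the lock.
	Unlock() error
}
```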
Actually, I discussed this with @mitchellh and we came to the conclusion that expiration may generally introduce unnecessary edge cases (e.g. long TF runs or leap seconds/years), so it's probably best to just not implement expiration at all.
Yes, I was going to leave expiration handling up to the individual state client, and have it be strictly informational to start. The user can see the expiration value in the error message and choose to override manually.
Some early feedback from me, general and not tied to any specific line:

- I'd comment more on why we don't use flock. It's probably worth doing in the future, but I'm okay not doing it for now. Is symlink guaranteed to be atomic? If not, we may just have to use flock and LockFileEx (a sketch of the non-blocking approach follows this list).
- I'd remove any sort of expiration. I don't think we'll ever automatically expire locks. Stale locks should be manually unlocked by an operator (or a cron set up to do so, though that is pretty YOLO). It's too risky for TF to automatically expire locks in any case, since it risks data corruption. Let's leave the potential for locks to live forever and use terraform unlock to unlock them.
- One thing we forgot to discuss on the RFC: are we blocking on Lock, or are we returning an error if it's already locked? I'm leaning towards erroring on lock. This is both easier to implement and allows us to just do retries on our side if necessary.
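As referenced in the list above, a minimal sketch of the non-blocking, error-on-lock behaviour using POSIX flock on Unix-like systems. The PR itself may use a different primitive (Windows needs LockFileEx, as discussed later in the thread); the helper name is illustrative:

```go
//go:build linux || darwin

package state

import (
	"fmt"
	"os"
	"syscall"
)

// lockStateFile tries to take an exclusive advisory lock without blocking,
// and fails immediately if another process already holds it.
func lockStateFile(f *os.File) error {
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return fmt.Errorf("state file %q is already locked: %s", f.Name(), err)
	}
	return nil
}
```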
Probably-too-late design question: would it be useful to make the locking method orthogonal to state storage? For example: maybe I want to make a stronger locking guarantee by sharing the same lock across multiple states, ensuring that only one can change at a time. A probably less compelling example: I store my state in S3 but I want to lock in Consul, because I don't want to complicate my world with DynamoDB, or because (continuing the last use case) I want other non-Terraform processes to be able to hold the lock for operations that affect Terraform (e.g. an admin cron that deletes old AMIs that don't appear to be used anymore, but which Terraform is simultaneously trying to start using). Having a default seems important for UX, but having the ability to override in more complex cases would be useful, I think. Probably none of this for this first pass though. :)
I also like that idea, and was toying around with that while implementing the locks (there was a lot more code that's no longer here, as it ended up conflicting with the new Backends). I think that this will eventually be able to extend to that use case, if only by separating the lock implementations and making them configurable.
Force-pushed from fe2d91a to 086af12
Changed from state.StateLocker to state.Locker to remove the stutter. State implementations can provide Lock/Unlock methods to lock the state file. Remote clients can also provide these same methods, which will be called through remote.State.
Add the Lock/Unlock methods to LocalState and BackupState. The implementation for LocalState will be platform specific. We will use OS-native locking on the state files, specifically locking whichever state file we intend to write to.
In order to provide locking for remote states, the Cache state and remote.State need to expose Lock and Unlock methods. The actual locking will be done by the remote.Client, which can implement the same state.Locker methods.
Use a DynamoDB table to coordinate state locking in S3. We use a simple strategy here, defining a key containing the bucket/key of the state file as the lock. If the key exists, the lock fails. TODO: decide if locks should automatically be expired, or require manual intervention.
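A sketch of the kind of table that strategy implies, using the AWS SDK for Go (v1). The table name and throughput values are illustrative assumptions, not values from this PR:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// createLockTable sketches a lock table with a single string hash key, LockID,
// which would hold the bucket/key of the state file being locked.
func createLockTable(svc *dynamodb.DynamoDB) error {
	_, err := svc.CreateTable(&dynamodb.CreateTableInput{
		TableName: aws.String("terraform-locks"), // illustrative name
		AttributeDefinitions: []*dynamodb.AttributeDefinition{
			{AttributeName: aws.String("LockID"), AttributeType: aws.String("S")},
		},
		KeySchema: []*dynamodb.KeySchemaElement{
			{AttributeName: aws.String("LockID"), KeyType: aws.String("HASH")},
		},
		ProvisionedThroughput: &dynamodb.ProvisionedThroughput{
			ReadCapacityUnits:  aws.Int64(1),
			WriteCapacityUnits: aws.Int64(1),
		},
	})
	return err
}
```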
Read state would assume that having a reader meant there should be a valid state. Check for an empty file and return ErrNoState to differentiate a bad file from an empty one.
Show log output when testing is verbose.
Having the state files always created for locking breaks a lot of tests. Most can be fixed by simply checking for state within a file, but a few still might be writing state when they shouldn't.
The old behavior in this situation was to simply delete the file. Since we now have a lock on this file we don't want to close or delete it, so instead truncate the file at offset 0. Fix a number of related tests
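A minimal sketch of that truncate-in-place idea; the helper name is illustrative and not from the PR:

```go
package state

import "os"

// resetStateFile rewinds and truncates the state file instead of deleting it,
// so the lock held on the open handle is preserved and the file can be
// rewritten in place.
func resetStateFile(f *os.File) error {
	if _, err := f.Seek(0, 0); err != nil {
		return err
	}
	return f.Truncate(0)
}
```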
After LocalState writes to a state file, we will refresh off the new state file rather than the original Path argument.
A missing state file was allowed, and treated as an empty state.
Not really a problem, but it created unnecessary files and changed existing behavior.
Force-pushed from 086af12 to 7590154
Tests should not simply check for the existence of a state file, but make sure that the file also contains data.
Original comments were incorrect, and the test was checking for the absence of state
Refactored the code to maintain the behavior.
Re-wrote the local locks to use POSIX and Windows file locks. This turned out to be quite a bit more involved than I expected, because the state files were often deleted and/or recreated, and locking requires that we keep track of the locked file handles. I'd like to review and merge the PR at this stage, since it's getting rather large and unwieldy. The commands don't hook into the state.Lock methods yet at all, so the goal here is to have this PR maintain existing behavior. A notable change in behavior is that the state output file is almost always created, since that is what will be locked when using local state. The backend tests had to be modified to check for non-empty state files, rather than simply checking for their existence. I'll start the work of enabling the locks and the associated commands in a new PR.
Looks great! A couple of changes, plus I would recommend updating state/testing.go to test the semantics of double-lock and so on and verify the error. That way anyone who implements locking can run the generic test case and verify that their implementation adheres to the "spec".
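A rough sketch of what such a generic helper in state/testing.go could look like. The helper name, the parameterless Lock/Unlock signatures, and the double-lock semantics are assumptions for illustration, not the PR's actual code:

```go
package state

import "testing"

// Locker is the assumed interface shape for this sketch.
type Locker interface {
	Lock() error
	Unlock() error
}

// TestLocker exercises the double-lock semantics: a second Lock on an
// already-locked state must return an error, and Unlock must release it.
func TestLocker(t *testing.T, s Locker) {
	if err := s.Lock(); err != nil {
		t.Fatalf("initial Lock failed: %s", err)
	}
	if err := s.Lock(); err == nil {
		t.Fatal("expected an error when locking an already-locked state")
	}
	if err := s.Unlock(); err != nil {
		t.Fatalf("Unlock failed: %s", err)
	}
}
```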
	return err
}

lockedFiles[s.stateFileOut] = handle
We probably need to wrap this in a lock. It'd be surprising as a consumer to learn you can't lock two seemingly distinct local states in parallel.
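A sketch of that suggestion: guard the package-level map of locked file handles with a mutex so two distinct local states can be locked concurrently from different goroutines. The key and value types and helper name are illustrative; the real code may track handles differently:

```go
package state

import (
	"os"
	"sync"
)

var (
	lockedFilesMu sync.Mutex
	lockedFiles   = map[string]*os.File{}
)

// rememberLock records a locked file handle under its path while holding the
// mutex, so concurrent Lock calls on different states don't race on the map.
func rememberLock(path string, handle *os.File) {
	lockedFilesMu.Lock()
	defer lockedFilesMu.Unlock()
	lockedFiles[path] = handle
}
```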
state/local_lock_windows.go
const (
	// Flags passed to LockFileEx; see the LockFileEx documentation on MSDN.
	_LOCKFILE_FAIL_IMMEDIATELY = 1
	_LOCKFILE_EXCLUSIVE_LOCK   = 2
)
I generally find it useful when interacting with DLLs and constants like this to put the MSDN link to the docs above them, so that we can always easily look it back up.
added mutex and MSDN link
Basic state locking.
This all starts with the state.Locker interface.

This is implemented by the internal state wrappers (CacheState, BackupState), as well as LocalState. Any commands that operate on the state can use these functions to prevent concurrent modification.
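The interface body itself isn't reproduced in this transcript; a plausible minimal form, assuming parameterless methods (the actual signatures in the PR may differ):

```go
package state

// Locker is anything that can lock and unlock a state.
type Locker interface {
	// Lock acquires the lock, returning an error if it is already held.
	Lock() error
	// Unlock releases the lock.
	Unlock() error
}
```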
Remote state clients can also implement this same interface, which will be called via the State structures that wrap them.
This PR contains the initial LocalState and AWS locking implementations. No state locking actually occurs yet, and adding the Lock calls and associated commands will come in another PR.