State Locking initial implementations #11187
Conversation
This seems related to #5036. 😀
Is there any specific reason you decided to use DynamoDB to hold the lock for the S3 remote state backend instead of saving a "lockfile" in S3? Is it just for the potentially strong consistency and/or conditional writes?
@radeksimko, yes, it's specifically for the strong consistency offered by DynamoDB. S3 is only eventually consistent in almost all cases. IIRC S3 might be consistent after a PUT of a new object and a GET of that object specifically, but even in that case there's no way to make a PUT conditional, so if multiple locks were in-flight at the same time the last timestamp wins. In this case two clients could see no lock file, PUT a new lock file, and even verify that only their unique lock file exists, while continuing to run concurrently.
Item: map[string]*dynamodb.AttributeValue{
	"LockID":  {S: aws.String(stateName)},
	"Created": {S: aws.String(time.Now().UTC().Format(time.RFC3339))},
	"Expires": {S: aws.String(time.Now().Add(time.Hour).UTC().Format(time.RFC3339))},
},
Could the expiration period be configurable somewhere? Maybe not necessarily from the user's perspective (CLI), but I'm thinking it could be passed in as a time.Duration argument to Lock()? That might also make developers implementing new remote backends realise that the expiration should be respected/implemented properly where possible.
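Purely to illustrate the suggestion (the thread ultimately decides against expiration), a hypothetical shape for such a signature; none of this is the PR's actual interface:

```go
package state

import "time"

// Locker is a hypothetical variant of the locking interface where the caller
// passes the expiration period, reminding each backend to honour it.
type Locker interface {
	// Lock acquires the lock, recording now+expiry as the expiration time
	// where the backend supports it, and fails if the state is already locked.
	Lock(expiry time.Duration) error
	// Unlock releases the lock.
	Unlock() error
}
```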
Actually, I discussed this with @mitchellh and we came to the conclusion that expiration may generally introduce unnecessary edge cases (e.g. long TF runs or leap seconds/years), so it's probably best to just not implement expiration at all.
Yes, I was going to leave expiration handling up to the individual state client, and have it be strictly informational to start. The user can see the expiration value in the error message and choose to override manually.
Some early feedback from me, general and not tied to any specific line:

- I'd comment more on why we don't use flock. It's probably worth doing in the future, but I'm okay not doing it for now. Is symlink guaranteed to be atomic? If not, we may just have to use flock and LockFileEx (a sketch of the non-blocking approach follows this list).
- I'd remove any sort of expiration. I don't think we'll ever automatically expire locks. Stale locks should be manually unlocked by an operator (or a cron set up to do so, though that is pretty YOLO). It's too risky for TF to automatically expire locks in any case, since it risks data corruption. Let's leave the potential for locks to live forever and use terraform unlock to unlock them.
- One thing we forgot to discuss on the RFC: are we blocking on Lock, or are we returning an error if it's already locked? I'm leaning towards erroring on lock. This is both easier to implement and allows us to just do retries on our side if necessary.
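As referenced in the list above, a minimal sketch of the non-blocking, error-on-lock behaviour using POSIX flock on Unix-like systems. The PR itself may use a different primitive (Windows needs LockFileEx, as discussed later in the thread); the helper name is illustrative:

```go
//go:build linux || darwin

package state

import (
	"fmt"
	"os"
	"syscall"
)

// lockStateFile tries to take an exclusive advisory lock without blocking,
// and fails immediately if another process already holds it.
func lockStateFile(f *os.File) error {
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return fmt.Errorf("state file %q is already locked: %s", f.Name(), err)
	}
	return nil
}
```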
Probably-too-late design question: would it be useful to make the locking method orthogonal to state storage? For example: maybe I want to make a stronger locking guarantee by sharing the same lock across multiple states, ensuring that only one can change at a time. A probably less compelling example: I store my state in S3 but I want to lock in Consul, because I don't want to complicate my world with DynamoDB, or because (continuing the last use case) I want other non-Terraform processes to be able to hold the lock for operations that affect Terraform (e.g. an admin cron that deletes old AMIs that don't appear to be used anymore, but which Terraform is simultaneously trying to start using). Having a default seems important for UX, but having the ability to override in more complex cases would be useful, I think. Probably none of this for this first pass though. :)
I also like that idea, and was toying around with that while implementing the locks (there was a lot more code that's no longer here, as it ended up conflicting with the new Backends). I think that this will eventually be able to extend to that use case, if only by separating the lock implementations and making them configurable.
Force-pushed from fe2d91a to 086af12
Changed from state.StateLocker to state.Locker to remove the stutter. State implementations can provide Lock/Unlock methods to lock the state file. Remote clients can also provide these same methods, which will be called through remote.State.
Add the Lock/Unlock methods to LocalState and BackupState. The implementation for LocalState will be platform specific. We will use OS-native locking on the state files, specifically locking whichever state file we intend to write to.
In order to provide locking for remote states, the Cache state and remote.State need to expose Lock and Unlock methods. The actual locking will be done by the remote.Client, which can implement the same state.Locker methods.
Use a DynamoDB table to coordinate state locking in S3. We use a simple strategy here, defining a key containing the bucket/key of the state file as the lock. If the key exists, the lock fails. TODO: decide if locks should automatically be expired, or require manual intervention.
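A sketch of the kind of table that strategy implies, using the AWS SDK for Go (v1). The table name and throughput values are illustrative assumptions, not values from this PR:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// createLockTable sketches a lock table with a single string hash key, LockID,
// which would hold the bucket/key of the state file being locked.
func createLockTable(svc *dynamodb.DynamoDB) error {
	_, err := svc.CreateTable(&dynamodb.CreateTableInput{
		TableName: aws.String("terraform-locks"), // illustrative name
		AttributeDefinitions: []*dynamodb.AttributeDefinition{
			{AttributeName: aws.String("LockID"), AttributeType: aws.String("S")},
		},
		KeySchema: []*dynamodb.KeySchemaElement{
			{AttributeName: aws.String("LockID"), KeyType: aws.String("HASH")},
		},
		ProvisionedThroughput: &dynamodb.ProvisionedThroughput{
			ReadCapacityUnits:  aws.Int64(1),
			WriteCapacityUnits: aws.Int64(1),
		},
	})
	return err
}
```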
Read state would assume that having a reader meant there should be a valid state. Check for an empty file and return ErrNoState to differentiate a bad file from an empty one.
Show log output when testing is verbose.
Having the state files always created for locking breaks a lot of tests. Most can be fixed by simply checking for state within a file, but a few still might be writing state when they shouldn't.
The old behavior in this situation was to simply delete the file. Since we now have a lock on this file we don't want to close or delete it, so instead truncate the file at offset 0. Fix a number of related tests
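A minimal sketch of that truncate-in-place idea; the helper name is illustrative and not from the PR:

```go
package state

import "os"

// resetStateFile rewinds and truncates the state file instead of deleting it,
// so the lock held on the open handle is preserved and the file can be
// rewritten in place.
func resetStateFile(f *os.File) error {
	if _, err := f.Seek(0, 0); err != nil {
		return err
	}
	return f.Truncate(0)
}
```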
After LocalState writes to a state file, we will refresh off the new state file rather than the original Path argument.
A missing state file was allowed, and treated as an empty state.
Not really a problem, but it created unnecessary files and changed existing behavior.
Force-pushed from 086af12 to 7590154
Tests should not simply check for the existence of a state file, but make sure that the file also contains data.
Original comments were incorrect, and the test was checking for the absence of state
Refactored the code to maintain the behavior.
Re-wrote the local locks to use POSIX and Windows file locks. This turned out to be quite a bit more involved than I expected, because the state files were often deleted and/or recreated, and locking requires that we keep track of the locked file handles. I'd like to review and merge the PR at this stage, since it's getting rather large and unwieldy. The commands don't hook into the state.Lock methods yet at all, so the goal here is to have this PR maintain existing behavior. A notable change in behavior is that the state output file is almost always created, since that is what will be locked when using local state. The backend tests had to be modified to check for non-empty state files, rather than simply checking for their existence. I'll start the work of enabling the locks and the associated commands in a new PR.
Looks great! A couple of changes, plus I would recommend updating state/testing.go to test the semantics of double-lock and so on and verify the error. That way anyone who implements locking can run the generic test case and verify that their implementation adheres to the "spec".
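A rough sketch of what such a generic helper in state/testing.go could look like. The helper name, the parameterless Lock/Unlock signatures, and the double-lock semantics are assumptions for illustration, not the PR's actual code:

```go
package state

import "testing"

// Locker is the assumed interface shape for this sketch.
type Locker interface {
	Lock() error
	Unlock() error
}

// TestLocker exercises the double-lock semantics: a second Lock on an
// already-locked state must return an error, and Unlock must release it.
func TestLocker(t *testing.T, s Locker) {
	if err := s.Lock(); err != nil {
		t.Fatalf("initial Lock failed: %s", err)
	}
	if err := s.Lock(); err == nil {
		t.Fatal("expected an error when locking an already-locked state")
	}
	if err := s.Unlock(); err != nil {
		t.Fatalf("Unlock failed: %s", err)
	}
}
```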
	return err
}

lockedFiles[s.stateFileOut] = handle
We probably need to wrap this in a lock. It'd be surprising as a consumer to learn you can't lock two seemingly distinct local states in parallel.
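A sketch of that suggestion: guard the package-level map of locked file handles with a mutex so two distinct local states can be locked concurrently from different goroutines. The key and value types and helper name are illustrative; the real code may track handles differently:

```go
package state

import (
	"os"
	"sync"
)

var (
	lockedFilesMu sync.Mutex
	lockedFiles   = map[string]*os.File{}
)

// rememberLock records a locked file handle under its path while holding the
// mutex, so concurrent Lock calls on different states don't race on the map.
func rememberLock(path string, handle *os.File) {
	lockedFilesMu.Lock()
	defer lockedFilesMu.Unlock()
	lockedFiles[path] = handle
}
```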
state/local_lock_windows.go
const (
	// Flags passed to LockFileEx; see the LockFileEx documentation on MSDN.
	_LOCKFILE_FAIL_IMMEDIATELY = 1
	_LOCKFILE_EXCLUSIVE_LOCK   = 2
)
I generally find it useful when interacting with DLLs and constants like this to put the MSDN link to the docs above them, so that we can always easily look it back up.
added mutex and MSDN link
Basic state locking.
This all starts with the state.Locker interface.

This is implemented by the internal state wrappers (CacheState, BackupState), as well as LocalState. Any commands that operate on the state can use these functions to prevent concurrent modification.
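The interface body itself isn't reproduced in this transcript; a plausible minimal form, assuming parameterless methods (the actual signatures in the PR may differ):

```go
package state

// Locker is anything that can lock and unlock a state.
type Locker interface {
	// Lock acquires the lock, returning an error if it is already held.
	Lock() error
	// Unlock releases the lock.
	Unlock() error
}
```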
Remote state clients can also implement this same interface, which will be called via the State structures that wrap them.
This PR contains the initial LocalState and AWS locking implementations. No state locking actually occurs yet, and adding the Lock calls and associated commands will come in another PR.