Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State Locking initial implementations #11187

Merged
merged 17 commits into from
Feb 1, 2017
Merged

State Locking initial implementations #11187

merged 17 commits into from
Feb 1, 2017

Conversation

jbardin
Copy link
Member

@jbardin jbardin commented Jan 12, 2017

Basic state locking.

This all starts with the state.Locker interface:

type StateLocker interface {
    Lock(reason string) error
    Unlock() error
}

This is implemented by the internal state wrappers (CacheState, BackupState), as well as LocalState. Any commands that operating on the state can use these functions to prevent concurrent modification.

Remote state clients can also implement this same interface, which will be called via the State structures that wraps them.

This PR contains the initial LocalState and AWS locking implementations. No state locking actually occurs yet, and adding the Lock calls and associated commands will come in another PR.

@apparentlymart
Copy link
Contributor

This seems related to #5036. 😀

@radeksimko
Copy link
Member

radeksimko commented Jan 19, 2017

Is there any specific reason you decided to use DynamoDB to hold the lock for S3 remote state backend instead of saving a "lockfile" in S3? Is it just for the potentially strong consistency and/or conditional Put operations?

@jbardin
Copy link
Member Author

jbardin commented Jan 19, 2017

@radeksimko, yes it's specifically for the strong consistency offered by dynamodb. S3 is only eventually consistent in almost all cases.

IIRC S3 might be consistent after a PUT of a new object and a GET of that object specifically, but even in that case there's no way to make a PUT conditional, so if multiple locks were in-flight at the same time the last timestamp wins. In this case 2 clients could see no lock file, PUT a new lock file, and even verify that only their unique lock file exists while continuing to run concurrently.

Item: map[string]*dynamodb.AttributeValue{
"LockID": {S: aws.String(stateName)},
"Created": {S: aws.String(time.Now().UTC().Format(time.RFC3339))},
"Expires": {S: aws.String(time.Now().Add(time.Hour).UTC().Format(time.RFC3339))},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the expiration period be configurable somewhere? Maybe not necessarily from user's perspective (CLI), but I'm thinking it could be passed in as time.Duration argument to Lock()? That might also make developer implementing new remote backends realise that the expiration should be respected/implemented properly where possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I discussed this with @mitchellh and we came to a conclusion that expiration generally may introduce unnecessary edge cases - e.g. long TF runs or leap seconds/years so it's probably best to just not implement expiration at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was going to leave expiration handling up to the individual state client, and have it be strictly informational to start. The user can see the expiration value in the error message and choose to override manually.

Copy link
Contributor

@mitchellh mitchellh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early feedback from me, general and not tied to any specific line:

  • I'd comment more why we don't use flock. Its probably worth doing in the future but I'm okay not doing it for now. Is symlink guaranteed to be atomic? If not, we may just have to use flock and LockFileEx.

  • I'd remove any sort of expiration. I don't think we'll ever automatically expire locks. Stale locks should be manually unlocked by an operator (or, a cron setup to do so, though that is pretty YOLO). Its too risky for TF to automatically expire locks in any case since it risks data corruption. Let's leave the potential for locks to live forever and use terraform unlock to unlock it.

  • One thing we forgot to discuss on the RFC is: are we blocking on Lock, or are we returning an error if its already locked? I'm leaning towards: error on lock. This is both easier to implement and allows us to just do retries on our side if necessary.

@apparentlymart
Copy link
Contributor

Probably-too-late design question: would it be useful to make locking method orthogonal to state storage? For example: maybe I want to make a stronger locking guarantee by sharing the same lock across multiple states, ensuring that only one can change at a time.

Probably less-compelling example: I store my state in S3 but I want to lock in Consul because I don't want to complicate my world with DynamoDB, or because (continuing the last use-case) I want other non-terraform processes to be able to hold the lock for operations that affect Terraform. (e.g. an admin cron that deletes old AMIs that don't appear to be used anymore, but which Terraform is simultaneously trying to start using.)

Having a default seems important for UX, but having the ability to override in more complex cases would be useful, I think.

Probably none of this for this first pass though. :)

@jbardin
Copy link
Member Author

jbardin commented Jan 20, 2017

@mitchellh,

  • symlinks are atomic on POSIX systems, which is why they are sort of the standard to avoid the issues surrounding inconsistent support for locking in unix systems. I'm happy to change to fcntl() and LockFileEx as long as we can accept the issue that some shared filesystems won't support it, and the user should beware.

  • I'm OK with removing expiration. It was only in there as an informational hint, but since it requires user intervention anyway they may as well use their best judgement based on the initial lock time.

  • I agree that we shouldn't block on the lock and immediately return an error. It makes the implementation semantics easier, and we can always retry.

@apparentlymart,

I also like that idea, and was toying around with that while implementing the locks (there was a lot more code that's no longer here, as it ended up conflicting with the new Backends). I think that this will eventually be able to extend to that use case, if only by separating the lock implementations and making them configurable.

@jbardin jbardin force-pushed the jbardin/state-locking branch 9 times, most recently from fe2d91a to 086af12 Compare January 30, 2017 22:16
Changed from state.StateLocker to remove the stutter.

State implementations can provide Lock/Unlock methods to lock the state
file. Remote clients can also provide these same methods, which will be
called through remote.State.
Add the LockUnlock methods to LocalState and BackupState.

The implementation for LocalState will be platform specific. We will use
OS-native locking on the state files, speficially locking whichever
state file we intend to write to.
In order to provide lockign for remote states, the Cache state,
remote.State need to expose Lock and Unlock methods. The actual locking
will be done by the remote.Client, which can implement the same
state.Locker methods.
Use a DynamoDB table to coodinate state locking in S3.

We use a simple strategy here, defining a key containing the value of
the bucket/key of the state file as the lock. If the keys exists, the
locks fails.

TODO: decide if locks should automatically be expired, or require manual
intervention.
Read state would assume that having a reader meant there should be a
valid state. Check for an empty file and return ErrNoState to
differentiate a bad file from an empty one.
Output log output when testing is verbose
Having the state files always created for locking breaks a lot of tests.
Most can be fixed by simple checking for state within a file, but a few
still might be writing state when they shouldn't.
The old behavior in this situation was to simply delete the file. Since
we now have a lock on this file we don't want to close or delete it, so
instead truncate the file at offset 0.

Fix a number of related tests
After LocalState writes to a state file, we will refresh off the new
state file rather than the original Path argument.
A missing state file was allowed, and treated as an empty state.
Not really a problem, but created unnecessary files and changes existing
behavior.
@jbardin jbardin force-pushed the jbardin/state-locking branch from 086af12 to 7590154 Compare January 30, 2017 22:17
Test should not simply check for the existence of a file for state, but
make sure that file also contains data.
Original comments were incorrect, and the test was checking for the
absence of state
Refactored the code to maintain the behavior.
@jbardin
Copy link
Member Author

jbardin commented Jan 30, 2017

Re-wrote the local locks to use POSIX and Windows file locks. This turned out to be quite a bit more involved than I expected, because the state files were often deleted and/or recreated, and locking requires that we keep track of the locked file handles.

I'd like to review and merge the PR at this stage, since it's getting rather large and unwieldy.

The commands don't hook into the state.Lock methods yet at all, so the goal here is to have this PR maintain existing behavior. A notable change in behavior is that the state output file is almost always created, since that is what will be locked when using local state. The backend tests had to be modified to check for non-empty state files, rather than simply checking for their existence.

I'll start the work of enabling the locks and the associated commands in a new PR.

@jbardin jbardin changed the title [WIP] State Locking State Locking initial implementations Jan 30, 2017
Copy link
Contributor

@mitchellh mitchellh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple changes plus I would recommend updating state/testing.go to test the semantics of double-lock and so on and verify error. That way anyone who implements locking can run the generic test case and verify that their implementation adheres to the "spec".

return err
}

lockedFiles[s.stateFileOut] = handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to wrap this in a lock. It'd be surprising as a consumer to learn you can't lock two seemingly distinct local states in parallel.


const (
_LOCKFILE_FAIL_IMMEDIATELY = 1
_LOCKFILE_EXCLUSIVE_LOCK = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally find it useful when interacting with DLLs and constants like this to put the MSDN link above to the docs so that we can always easily look it back up.

@jbardin
Copy link
Member Author

jbardin commented Feb 1, 2017

added mutex and MSDN link

@ghost
Copy link

ghost commented Apr 17, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants