Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance locking/serialization for concurrent exec #8180

Merged
merged 1 commit into from
Aug 10, 2018

Conversation

lcastellano
Copy link
Contributor

This PR contains two sets of changes:

Locking and serialization
The current implementation consists of two separate mechanisms to provide serialization. The external mechanism is based on multi-versioned concurrency control, (sometime called snapshot isolation), two or more concurrent updates are allowed to occur, but only one wins. The failed ones are retried automatically up to a certain max elapsed time. The internal mechanism uses a traditional lock, which fully serializes operations on both ExecCreate and ExecStart. Requests block on this lock for up to 3 minutes then they are allowed to continue in parallel, basically falling back to the external retry mechanism. Notice that the internal mechanism rely on the same lock for both ExecCreate and ExecStart.

Unfortunately the chosen timeout value guarantees that after timing out on the lock (3 minutes) no retries are executed by the external mechanism (since the max elapsed time is 1 minute). This may result in the following error:

Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}

To ameliorate this issue the following changes have been made:

  • Increase external max elapsed time (from 1 minute to 15),
  • Instead of blocking for a long time on the internal lock, requests block for a maximum of 1 second
    then rely on the external retry mechanism,
  • The average interval between two retries on the external mechanism has been customized to favor
    ExecStart over ExecCreate. Requests that have made it through the Create phase are now privileged
    over new incoming requests. This provides a more even response output to the end user.

Removal of extensive Logging
There were a couple of places in the Port Layer where extensive logging was performed. These have been removed since OpID are now correctly propagated from the Engine to the Portlayer.


[specific ci=1-38-Docker-Exec]

@@ -205,8 +205,13 @@ func (c *ContainerBackend) ContainerExecCreate(name string, config *types.ExecCo

var eid string
operation := func() error {
if vc.TryLock(APITimeout) {
op.Infof("ContainerExecCreate trying to lock vc: %s", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop this and the matching acquisition to Debug?
The else case we could leave as Info if you think that's worthy of logging by default.

Copy link
Contributor Author

@lcastellano lcastellano Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are log messages in the retry package but they don't have the opID. I think it should be added, but I would like to do it in another check-in as it touches many source files. For now I am changing all message to Debug level.

defer vc.Unlock()
} else {
op.Infof("ContainerExecCreate cannot lock vc, continuing ..: %s", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also get a message from the retry package - if that's the case we don't need this one.
We may want to rephrase it to note what we expect, e.g.
ContainerExecCreate cannot lock container. Returning lock timeout error to prompt retry.

defer vc.Unlock()
} else {
op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as for the other block like this

@@ -178,3 +178,25 @@ type DetachError struct{}
func (DetachError) Error() string {
return "detached from container"
}

// Conflict error is returned when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment fragment and probable copy/paste error

defer vc.Unlock()
} else {
op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)
return engerr.NewLockTimeoutError("ContainerExecCreate")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy/paste error -> ContainerExecStart

@@ -522,16 +533,21 @@ func (c *ContainerBackend) ContainerExecStart(ctx context.Context, eid string, s

var ec *models.TaskInspectResponse
operation := func() error {
if vc.TryLock(APITimeout) {
op.Debugf("ContainerExecStart trying to lock vc: %s", name)
if vc.TryLock(time.Second) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be helpful to extract time.Second to a constant to de-duplicate with the value above and so you can add a comment explaining how the duration was selected (it's in the pull request, but it seems worth including in the file as well).

@lcastellano lcastellano merged commit 67b0892 into vmware:master Aug 10, 2018
@lcastellano lcastellano deleted the props branch August 10, 2018 21:12
zjs pushed a commit to zjs/vic that referenced this pull request Aug 10, 2018
zjs added a commit that referenced this pull request Aug 10, 2018
@zjs zjs added the impact/doc/note Requires creation of or changes to an official release note label Sep 13, 2018
@zjs
Copy link
Member

zjs commented Sep 13, 2018

This may be worth release noting as it's a continuation of something noted as mitigated, but not fully fixed, in the v1.4.1 release notes.

@stuclem
Copy link
Contributor

stuclem commented Sep 17, 2018

Added as a resolved issue in https://github.com/vmware/vic/releases/tag/v1.4.3

@stuclem stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Sep 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants