Enhance locking/serialization for concurrent exec #8180
Conversation
@@ -205,8 +205,13 @@ func (c *ContainerBackend) ContainerExecCreate(name string, config *types.ExecCo

	var eid string
	operation := func() error {
		if vc.TryLock(APITimeout) {
			op.Infof("ContainerExecCreate trying to lock vc: %s", name)
Can we drop this and the matching acquisition to Debug?
The else case we could leave as Info if you think that's worthy of logging by default.
There are log messages in the retry package but they don't have the opID. I think it should be added, but I would like to do that in a separate check-in as it touches many source files. For now I am changing all messages to Debug level.
			defer vc.Unlock()
		} else {
			op.Infof("ContainerExecCreate cannot lock vc, continuing ..: %s", name)
I think we should also get a message from the retry package - if that's the case we don't need this one.
We may want to rephrase it to note what we expect, e.g.
ContainerExecCreate cannot lock container. Returning lock timeout error to prompt retry.
			defer vc.Unlock()
		} else {
			op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)
Same comments as for the other block like this
@@ -178,3 +178,25 @@ type DetachError struct{}

func (DetachError) Error() string {
	return "detached from container"
}

// Conflict error is returned when
Comment fragment and probable copy/paste error
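For illustration only, a completed doc comment and error type might read something like the following. This is a sketch, not the author's actual change; everything beyond what the hunk above shows is an assumption:

```go
package errors // assumed: the engine errors package that also holds DetachError

// ConflictError is returned when an exec operation loses a
// concurrent-modification race against another operation on the same
// container and should be retried. (Hypothetical completion of the
// fragmentary comment flagged above.)
type ConflictError struct {
	msg string
}

func (e ConflictError) Error() string {
	return e.msg
}
```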
			defer vc.Unlock()
		} else {
			op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)
			return engerr.NewLockTimeoutError("ContainerExecCreate")
Copy/paste error -> ContainerExecStart
@@ -522,16 +533,21 @@ func (c *ContainerBackend) ContainerExecStart(ctx context.Context, eid string, s

	var ec *models.TaskInspectResponse
	operation := func() error {
-		if vc.TryLock(APITimeout) {
+		op.Debugf("ContainerExecStart trying to lock vc: %s", name)
+		if vc.TryLock(time.Second) {
I think it'd be helpful to extract time.Second to a constant to de-duplicate with the value above, and so you can add a comment explaining how the duration was selected (it's in the pull request, but it seems worth including in the file as well).
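Something along these lines, for instance. The package name, constant name, and comment wording are placeholders, not the actual change:

```go
package backends // assumed package name for the exec backend handlers

import "time"

// lockWaitTime bounds how long ContainerExecCreate and ContainerExecStart
// wait for the container lock before returning a lock timeout error and
// deferring to the external retry mechanism. It is kept well below the
// retry package's max elapsed time so a timed-out request still has retry
// budget left. (Name and rationale wording are illustrative.)
const lockWaitTime = time.Second
```

Both TryLock call sites would then read vc.TryLock(lockWaitTime).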
(cherry picked from commit 67b0892)
This may be worth release noting as it's a continuation of something noted as mitigated, but not fully fixed, in the v1.4.1 release notes.
Added as a resolved issue in https://github.com/vmware/vic/releases/tag/v1.4.3
This PR contains two sets of changes:
Locking and serialization
The current implementation consists of two separate mechanisms that provide serialization. The external mechanism is based on multi-versioned concurrency control (sometimes called snapshot isolation): two or more concurrent updates are allowed to proceed, but only one wins. The failed ones are retried automatically, up to a certain max elapsed time. The internal mechanism uses a traditional lock, which fully serializes operations in both ExecCreate and ExecStart. Requests block on this lock for up to 3 minutes, after which they are allowed to continue in parallel, effectively falling back to the external retry mechanism. Notice that the internal mechanism relies on the same lock for both ExecCreate and ExecStart.
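Loosely, the external mechanism behaves like an optimistic commit against a versioned handle. Here is a minimal Go sketch of that first-writer-wins behavior; it is illustrative only, and `store`, `handle`, and `commit` are stand-ins rather than VIC's actual handle/commit API:

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("concurrent modification by another operation")

// store and handle sketch the snapshot-isolation behavior: each caller
// works against the version it observed, and commit succeeds only if no
// one else committed in the meantime.
type store struct{ version int }

type handle struct {
	base int // version observed when the handle was taken
	s    *store
}

func (s *store) get() *handle { return &handle{base: s.version, s: s} }

// commit is effectively a compare-and-swap on the version: the first
// writer wins; the loser gets a conflict error and must retry with a
// fresh handle.
func (h *handle) commit() error {
	if h.s.version != h.base {
		return errConflict
	}
	h.s.version++
	return nil
}

func main() {
	s := &store{}
	a, b := s.get(), s.get() // two concurrent updates from the same snapshot
	fmt.Println(a.commit())  // <nil>: first commit wins
	fmt.Println(b.commit())  // conflict: must be retried
}
```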
Unfortunately, the chosen timeout value guarantees that after timing out on the lock (3 minutes), no retries are executed by the external mechanism, since its max elapsed time is only 1 minute. This may result in the following error:
Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
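A simplified stand-in for that retry budget makes the timing problem concrete. This is not the real retry package, whose API differs; it just shows that when the first attempt alone outlasts the budget, the first conflict error is also the last:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errConflict = errors.New("conflict: concurrent modification")

// retry sketches the external mechanism: re-run op with growing delays
// until maxElapsed has passed since the first attempt began.
func retry(op func() error, maxElapsed time.Duration) error {
	start := time.Now()
	delay := 50 * time.Millisecond
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Since(start) > maxElapsed {
			return err // budget exhausted: the conflict escapes to the caller
		}
		time.Sleep(delay)
		delay *= 2
	}
}

func main() {
	// If the operation's first attempt only completes after a ~3 minute
	// lock wait, time.Since(start) already exceeds a 1 minute budget, so
	// no retries ever run. Scaled-down illustration:
	attempts := 0
	op := func() error {
		attempts++
		time.Sleep(30 * time.Millisecond) // stand-in for the long lock wait
		return errConflict
	}
	_ = retry(op, 20*time.Millisecond)
	fmt.Println("attempts before giving up:", attempts) // 1
}
```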
To ameliorate this issue, the following changes have been made:
- The internal lock timeout has been sharply reduced (to one second, per the diff above); requests that cannot take the lock in that time return a lock timeout error and then rely on the external retry mechanism (see the sketch after this list).
- ExecStart is now privileged over ExecCreate: requests that have made it through the Create phase take precedence over new incoming requests. This provides a more even response to the end user.
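Putting those pieces together, here is a self-contained sketch of the revised internal pattern: wait briefly for the lock, and on timeout return a retryable error rather than proceeding unlocked. The `lockable` type and `errLockTimeout` are stubs standing in for VIC's container cache entry and engerr.NewLockTimeoutError:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLockTimeout stands in for engerr.NewLockTimeoutError; the real type
// is whatever the retry machinery recognizes as retryable.
var errLockTimeout = errors.New("lock timeout, retry")

// lockWaitTime mirrors the reduced wait from the diff (time.Second).
const lockWaitTime = time.Second

// lockable sketches the container cache entry's TryLock/Unlock.
type lockable struct{ ch chan struct{} }

func newLockable() *lockable {
	l := &lockable{ch: make(chan struct{}, 1)}
	l.ch <- struct{}{}
	return l
}

func (l *lockable) TryLock(d time.Duration) bool {
	select {
	case <-l.ch:
		return true
	case <-time.After(d):
		return false
	}
}

func (l *lockable) Unlock() { l.ch <- struct{}{} }

// execStart shows the revised shape: wait briefly for the lock, and on
// timeout fail fast with a retryable error instead of proceeding unlocked.
func execStart(vc *lockable) error {
	if !vc.TryLock(lockWaitTime) {
		return errLockTimeout // prompts the external mechanism to retry
	}
	defer vc.Unlock()
	return nil // the real exec work would happen here, under the lock
}

func main() {
	vc := newLockable()
	fmt.Println(execStart(vc)) // <nil>: lock acquired immediately
}
```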
Removal of extensive logging
There were a couple of places in the Port Layer where extensive logging was performed. These log statements have been removed, since OpIDs are now correctly propagated from the Engine to the Port Layer.
[specific ci=1-38-Docker-Exec]