Enhance locking/serialization for concurrent exec #8180

lcastellano · 2018-08-06T21:39:18Z

This PR contains two sets of changes:

Locking and serialization
The current implementation consists of two separate mechanisms to provide serialization. The external mechanism is based on multi-versioned concurrency control, (sometime called snapshot isolation), two or more concurrent updates are allowed to occur, but only one wins. The failed ones are retried automatically up to a certain max elapsed time. The internal mechanism uses a traditional lock, which fully serializes operations on both ExecCreate and ExecStart. Requests block on this lock for up to 3 minutes then they are allowed to continue in parallel, basically falling back to the external retry mechanism. Notice that the internal mechanism rely on the same lock for both ExecCreate and ExecStart.

Unfortunately the chosen timeout value guarantees that after timing out on the lock (3 minutes) no retries are executed by the external mechanism (since the max elapsed time is 1 minute). This may result in the following error:

Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}

To ameliorate this issue the following changes have been made:

Increase external max elapsed time (from 1 minute to 15),
Instead of blocking for a long time on the internal lock, requests block for a maximum of 1 second
then rely on the external retry mechanism,
The average interval between two retries on the external mechanism has been customized to favor
ExecStart over ExecCreate. Requests that have made it through the Create phase are now privileged
over new incoming requests. This provides a more even response output to the end user.

Removal of extensive Logging
There were a couple of places in the Port Layer where extensive logging was performed. These have been removed since OpID are now correctly propagated from the Engine to the Portlayer.

[specific ci=1-38-Docker-Exec]

hickeng · 2018-08-08T20:38:34Z

lib/apiservers/engine/backends/container.go

@@ -205,8 +205,13 @@ func (c *ContainerBackend) ContainerExecCreate(name string, config *types.ExecCo

 	var eid string
 	operation := func() error {
-		if vc.TryLock(APITimeout) {
+		op.Infof("ContainerExecCreate trying to lock vc: %s", name)


Can we drop this and the matching acquisition to Debug?
The else case we could leave as Info if you think that's worthy of logging by default.

There are log messages in the retry package but they don't have the opID. I think it should be added, but I would like to do it in another check-in as it touches many source files. For now I am changing all message to Debug level.

hickeng · 2018-08-08T20:42:12Z

lib/apiservers/engine/backends/container.go

 			defer vc.Unlock()
+		} else {
+			op.Infof("ContainerExecCreate cannot lock vc, continuing ..: %s", name)


I think we should also get a message from the retry package - if that's the case we don't need this one.
We may want to rephrase it to note what we expect, e.g.
ContainerExecCreate cannot lock container. Returning lock timeout error to prompt retry.

hickeng · 2018-08-08T20:42:27Z

lib/apiservers/engine/backends/container.go

 			defer vc.Unlock()
+		} else {
+			op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)


Same comments as for the other block like this

hickeng · 2018-08-08T20:43:27Z

lib/apiservers/engine/errors/errors.go

@@ -178,3 +178,25 @@ type DetachError struct{}
 func (DetachError) Error() string {
 	return "detached from container"
 }
+
+// Conflict error is returned when


Comment fragment and probable copy/paste error

hickeng · 2018-08-08T20:45:21Z

lib/apiservers/engine/backends/container.go

 			defer vc.Unlock()
+		} else {
+			op.Infof("ContainerExecStart cannot lock vc, continuing ..: %s", name)
+			return engerr.NewLockTimeoutError("ContainerExecCreate")


Copy/paste error -> ContainerExecStart

zjs · 2018-08-10T19:25:11Z

lib/apiservers/engine/backends/container.go

@@ -522,16 +533,21 @@ func (c *ContainerBackend) ContainerExecStart(ctx context.Context, eid string, s

 	var ec *models.TaskInspectResponse
 	operation := func() error {
-		if vc.TryLock(APITimeout) {
+		op.Debugf("ContainerExecStart trying to lock vc: %s", name)
+		if vc.TryLock(time.Second) {


I think it'd be helpful to extract time.Second to a constant to de-duplicate with the value above and so you can add a comment explaining how the duration was selected (it's in the pull request, but it seems worth including in the file as well).

(cherry picked from commit 67b0892)

zjs · 2018-09-13T15:33:55Z

This may be worth release noting as it's a continuation of something noted as mitigated, but not fully fixed, in the v1.4.1 release notes.

stuclem · 2018-09-17T13:34:08Z

Added as a resolved issue in https://github.com/vmware/vic/releases/tag/v1.4.3

lcastellano requested a review from a team as a code owner August 6, 2018 21:39

vmwclabot added the cla-not-required label Aug 6, 2018

lcastellano force-pushed the props branch from 850d765 to c4b7058 Compare August 6, 2018 22:29

lcastellano requested review from zjs and hickeng August 7, 2018 21:16

hickeng reviewed Aug 8, 2018

View reviewed changes

Enhance locking/serialization for concurrent exec

da96e2c

lcastellano force-pushed the props branch from c4b7058 to da96e2c Compare August 9, 2018 21:32

zjs approved these changes Aug 10, 2018

View reviewed changes

hickeng approved these changes Aug 10, 2018

View reviewed changes

lcastellano merged commit 67b0892 into vmware:master Aug 10, 2018

lcastellano deleted the props branch August 10, 2018 21:12

zjs pushed a commit to zjs/vic that referenced this pull request Aug 10, 2018

Enhance locking/serialization for concurrent exec (vmware#8180)

c5dd266

(cherry picked from commit 67b0892)

zjs added a commit that referenced this pull request Aug 10, 2018

Enhance locking/serialization for concurrent exec (#8180) (#8204)

5a5c2a9

(cherry picked from commit 67b0892)

zjs added the impact/doc/note Requires creation of or changes to an official release note label Sep 13, 2018

stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Sep 17, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance locking/serialization for concurrent exec #8180

Enhance locking/serialization for concurrent exec #8180

lcastellano commented Aug 6, 2018

hickeng Aug 8, 2018

lcastellano Aug 9, 2018 •

edited

Loading

hickeng Aug 8, 2018

hickeng Aug 8, 2018

hickeng Aug 8, 2018

hickeng Aug 8, 2018

zjs Aug 10, 2018

zjs commented Sep 13, 2018

stuclem commented Sep 17, 2018

Enhance locking/serialization for concurrent exec #8180

Enhance locking/serialization for concurrent exec #8180

Conversation

lcastellano commented Aug 6, 2018

hickeng Aug 8, 2018

Choose a reason for hiding this comment

lcastellano Aug 9, 2018 • edited Loading

Choose a reason for hiding this comment

hickeng Aug 8, 2018

Choose a reason for hiding this comment

hickeng Aug 8, 2018

Choose a reason for hiding this comment

hickeng Aug 8, 2018

Choose a reason for hiding this comment

hickeng Aug 8, 2018

Choose a reason for hiding this comment

zjs Aug 10, 2018

Choose a reason for hiding this comment

zjs commented Sep 13, 2018

stuclem commented Sep 17, 2018

lcastellano Aug 9, 2018 •

edited

Loading