
enhance: add retry helper method for Trylock #2319

Merged: 1 commit merged on Oct 18, 2018

Conversation

@fuweid (Contributor) commented Oct 17, 2018

Signed-off-by: Wei Fu fuweid89@gmail.com

Ⅰ. Describe what this PR did

The ctrd client may hold the lock for a while. Under concurrent requests, other callers can then frequently fail to acquire it. In this case, PouchContainer should retry for a specific duration to reduce the number of failed requests.
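
In short, the helper keeps retrying Trylock with a small randomized backoff until a timeout expires. Below is a minimal sketch of the idea, based on the snippets quoted later in this conversation; the containerLock type, the Trylock method, and the constants are taken from those snippets and may differ from the merged code:

func (l *containerLock) TrylockWithTimeout(id string, timeout time.Duration) bool {
	// Upper bound (in milliseconds) for the randomized sleep between retries;
	// doubled after every failed attempt and capped at 2048.
	backoff := 32

	ctx, cancel := context.WithTimeout(context.TODO(), timeout)
	defer cancel()

	for {
		if l.Trylock(id) {
			return true
		}

		select {
		case <-time.After(time.Millisecond * time.Duration(rand.Intn(backoff))):
			if backoff < 2048 {
				backoff <<= 1
			}
		case <-ctx.Done():
			return false
		}
	}
}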

Ⅱ. Does this pull request fix one issue?

fix #2281

Ⅲ. Why don't you add test cases (unit test/integration test)? (Do you really think no tests are needed?)

Added

Ⅳ. Describe how to verify it

Check the added test cases.

Ⅴ. Special notes for reviews

Is 5 seconds enough, or should we use the request context instead?

codecov bot commented Oct 17, 2018

Codecov Report

Merging #2319 into master will increase coverage by 0.28%.
The diff coverage is 59.09%.


@@            Coverage Diff             @@
##           master    #2319      +/-   ##
==========================================
+ Coverage   67.16%   67.45%   +0.28%     
==========================================
  Files         215      217       +2     
  Lines       17574    17663      +89     
==========================================
+ Hits        11804    11915     +111     
+ Misses       4360     4350      -10     
+ Partials     1410     1398      -12
Flag Coverage Δ
#criv1alpha1test 32.07% <27.27%> (-0.22%) ⬇️
#criv1alpha2test 36.23% <27.27%> (-0.07%) ⬇️
#integrationtest 40.45% <27.27%> (+0.23%) ⬆️
#nodee2etest 33.76% <50%> (+0.2%) ⬆️
#unittest 23.35% <59.09%> (ø) ⬆️
Impacted Files Coverage Δ
ctrd/container.go 59.28% <0%> (+0.95%) ⬆️
ctrd/container_lock.go 100% <100%> (ø) ⬆️
daemon/mgr/snapshot.go 89.85% <0%> (-4.35%) ⬇️
ctrd/watch.go 78.78% <0%> (-1.52%) ⬇️
cri/v1alpha1/cri.go 60.38% <0%> (-0.33%) ⬇️
daemon/mgr/container_types.go 83.33% <0%> (-0.31%) ⬇️
daemon/mgr/image.go 57.7% <0%> (-0.23%) ⬇️
daemon/mgr/container_upgrade.go 65.09% <0%> (ø)
pkg/utils/random_string.go 81.81% <0%> (ø)
cri/v1alpha2/cri_utils.go 90.57% <0%> (+0.28%) ⬆️
... and 9 more

@@ -55,7 +56,7 @@ func (c *Client) ContainerStats(ctx context.Context, id string) (*containerdtype

 // containerStats returns stats of the container.
 func (c *Client) containerStats(ctx context.Context, id string) (*containerdtypes.Metric, error) {
-	if !c.lock.Trylock(id) {
+	if !c.lock.TrylockWithTimeout(id, defaultTrylockTimeout) {
 		return nil, errtypes.ErrLockfailed
@allencloud (Collaborator) commented Oct 17, 2018

This part would return an error of

failed: rpc error: code = Unknown desc = 
failed to stop container "c69994fce9611e75ca826c644edb8f4dbc32a2592afb66dd34e1d4a2d9a76c0e": 
failed to destroy container c69994fce9611e75ca826c644edb8f4dbc32a2592afb66dd34e1d4a2d9a76c0e:
 lock failed"

While I think some users of PouchContainer may find this error message a little unclear, could we add code like the following?

// LockFailedError is constructing a much more readable lock failed error.
func LockFailedError(containerID string) error {
	return errors.Wrapf(errtypes.ErrLockfailed, "container %s is accessed by other request and please try again", containerID)
}
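
If a wrapper like that were added, the call site could return the friendlier error instead of the bare errtypes.ErrLockfailed. This is only a hypothetical sketch of what the containerStats call site might look like; the helper's final name and location are up to the author:

	if !c.lock.TrylockWithTimeout(id, defaultTrylockTimeout) {
		// Hypothetical: wrap the sentinel with the container ID so the caller
		// sees which container is busy, instead of a bare "lock failed".
		return nil, LockFailedError(id)
	}
	defer c.lock.Unlock(id)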

@fuweid (Contributor, Author) replied

I think the error will disappear after this PR 😢

@allencloud (Collaborator) commented:
LGTM, pending some minor requested updates.

@pouchrobot added the "LGTM (one maintainer or community participant agrees to merge the pull request)" label on Oct 17, 2018
func (l *containerLock) TrylockWithTimeout(id string, t time.Duration) bool {
	var retry = 32

	lctx, cancel := context.WithTimeout(context.TODO(), t)
Collaborator:

Is this a brand-new context, rather than an inherited one?

Contributor:

What about using two timers?

func (l *containerLock) TrylockWithTimeout(id string, t time.Duration) bool {
	var retry = 32

	timeoutTimer := time.NewTimer(t)
	for {
		ok := l.Trylock(id)
		if ok {
			timeoutTimer.Stop()
			return true
		}

		// sleep random duration by retry
		select {
		case <-time.After(time.Millisecond * time.Duration(rand.Intn(retry))):
			if retry < 2048 {
				retry = retry << 1
			}
			continue
		case <-timeoutTimer.C:
			return false
		}
	}
}

Collaborator:

I think both are the same.

	defer l.Unlock(id)

	var (
		releaseCh = make(chan struct{})
Collaborator:

What is the purpose of the releaseCh variable?

Contributor:

The releaseCh variable is there to guarantee that TrylockWithRetry starts executing before the sleep and unlock calls in the main goroutine. Otherwise, if the unlock has already happened, TrylockWithRetry would succeed immediately.
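
As a rough sketch of that ordering (the TrylockWithRetry signature and the test shape here are assumptions based on this thread; assume the main goroutine already holds the lock for id):

	releaseCh := make(chan struct{})
	resultCh := make(chan bool)

	go func() {
		// Signal that this goroutine is running before it starts retrying,
		// so the main goroutine does not unlock too early.
		close(releaseCh)
		resultCh <- l.TrylockWithRetry(context.TODO(), id)
	}()

	<-releaseCh                        // wait until the retrying goroutine has started
	time.Sleep(100 * time.Millisecond) // keep the lock held for a moment
	l.Unlock(id)                       // release it; the retry should now succeed

	if ok := <-resultCh; !ok {
		t.Fatal("expected TrylockWithRetry to succeed once the lock was released")
	}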

The ctrd client may hold the lock for a while. Under concurrent
requests, other callers can then frequently fail to acquire it. In this
case, PouchContainer should retry for a specific duration to reduce the
number of failed requests.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid (Contributor, Author) commented Oct 17, 2018

@allencloud I used the context as the timeout. If the request doesn't set a timeout, it will keep retrying, like a spin lock.
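
To make that concrete, a hypothetical caller of the context-based helper could look like the fragment below; the TrylockWithRetry name comes from the comments above, and its exact signature is an assumption here:

	// Request with a deadline: the retry gives up once the deadline passes.
	reqCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if !l.TrylockWithRetry(reqCtx, id) {
		return errtypes.ErrLockfailed // could not acquire within the request's deadline
	}
	defer l.Unlock(id)

	// Request without a deadline (e.g. context.TODO()): the helper keeps
	// retrying until the lock is free, i.e. it behaves like a spin lock.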

@fuweid (Contributor, Author) commented Oct 17, 2018

In the next PR I will remove TryLock....

@rudyfly (Collaborator) commented Oct 18, 2018

LGTM

Labels
containerd-related, LGTM (one maintainer or community participant agrees to merge the pull request), size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to stop container: lock failed
5 participants