Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: bounce session after failed status update #1539

Merged

Conversation

stevvooe
Copy link
Contributor

@stevvooe stevvooe commented Sep 15, 2016

Previously, failed calls on an agent session to UpdateTaskStatus weren't
causing the agent to fall back into the session creation retry loop.
Without the retry loop, the agent ends up pounding the dispatcher with
status updates that will never drain.

This PR closes the session on a fatal error, ensuring that the agent
will fallback into a retry loop.

Consider backporting this change to 1.12.2.

Signed-off-by: Stephen J Day stephen.day@docker.com

cc @aaronlehmann

Closes #1533

@codecov-io
Copy link

codecov-io commented Sep 15, 2016

Current coverage is 53.72% (diff: 0.00%)

Merging #1539 into master will decrease coverage by 0.03%

@@             master      #1539   diff @@
==========================================
  Files            82         82          
  Lines         13432      13430     -2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           7221       7215     -6   
- Misses         5224       5227     +3   
- Partials        987        988     +1   

Sunburst

Powered by Codecov. Last update 3f3d300...353423b

@stevvooe
Copy link
Contributor Author

Confirmed fix with reproduction that always returns errors from UpdateTaskStatus:

diff --git a/manager/dispatcher/dispatcher.go b/manager/dispatcher/dispatcher.go
index 9ae3e0b..432af44 100644
--- a/manager/dispatcher/dispatcher.go
+++ b/manager/dispatcher/dispatcher.go
@@ -436,6 +436,8 @@ func (d *Dispatcher) register(ctx context.Context, nodeID string, description *a
 // UpdateTaskStatus updates status of task. Node should send such updates
 // on every status change of its tasks.
 func (d *Dispatcher) UpdateTaskStatus(ctx context.Context, r *api.UpdateTaskStatusRequest) (*api.UpdateTaskStatusResponse, error) {
+       return nil, grpc.Errorf(codes.InvalidArgument, "testing agent problem")
+
        nodeInfo, err := ca.RemoteNode(ctx)
        if err != nil {
                return nil, err

Before this patch, we get the following in the logs:

DEBU[0002] (*Agent).UpdateTaskStatus                     module=node/agent task.id=9qdufjres0iky2n2gx25yn8ut
ERRO[0002] sending task status update failed             error=rpc error: code = 3 desc = testing agent problem module=node/agent
ERRO[0002] failed reporting status to agent              error=rpc error: code = 3 desc = testing agent problem module=node/agent

After applying this PR we get the following:

DEBU[0002] (*Agent).UpdateTaskStatus                     module=node/agent task.id=9dzxad2v62bq2e305guckajck
DEBU[0002] agent: rebuild session                        module=node/agent
DEBU[0002] (*session).start                              module=node/agent

While it will spam the logs with this simple recreation because we keep reestablishing the session, note that we are clearly in the backoff loop ("agent: rebuild session").

@@ -342,11 +344,15 @@ func (s *session) close() error {
case <-s.closed:
return errSessionClosed
default:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the select with the addition of this sync.Once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Previously, failed calls on an agent session to UpdateTaskStatus weren't
causing the agent to fall back into the session creation retry loop.
Without the retry loop, the agent ends up pounding the dispatcher with
status updates that will never drain.

This PR closes the session on a fatal error, ensuring that the agent
will fallback into a retry loop.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
@stevvooe stevvooe force-pushed the shutdown-session-on-update-status-failure branch from 4f045f8 to 353423b Compare September 15, 2016 01:32
@aaronlehmann
Copy link
Collaborator

LGTM

Copy link
Contributor

@LK4D4 LK4D4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LK4D4 LK4D4 merged commit 65af901 into moby:master Sep 15, 2016
@stevvooe stevvooe deleted the shutdown-session-on-update-status-failure branch September 15, 2016 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: session not poisoned after terminal status update failure
5 participants