Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry logic #292

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .golangci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ linters-settings:
goimports:
local-prefixes: github.com/networkservicemesh
gocyclo:
min-complexity: 15
min-complexity: 20
maligned:
suggest-new: true
dupl:
Expand Down
2 changes: 2 additions & 0 deletions internal/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ type Config struct {
ConnectTo url.URL `default:"unix:///var/lib/networkservicemesh/nsm.io.sock" desc:"url to connect to NSM" split_words:"true"`
DialTimeout time.Duration `default:"5s" desc:"timeout to dial NSMgr" split_words:"true"`
RequestTimeout time.Duration `default:"15s" desc:"timeout to request NSE" split_words:"true"`
RetryTimeout time.Duration `default:"0s" desc:"retry timeout" split_words:"true"`
RetryInterval time.Duration `default:"100ms" desc:"retry interval" split_words:"true"`
MaxTokenLifetime time.Duration `default:"10m" desc:"maximum lifetime of tokens" split_words:"true"`

Labels []string `default:"" desc:"A list of client labels with format key1=val1,key2=val2, will be used a primary list for network services" split_words:"true"`
Expand Down
42 changes: 29 additions & 13 deletions main.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"os"
"os/signal"
"syscall"
"time"

nested "github.com/antonfisher/nested-logrus-formatter"
"github.com/edwarnicke/grpcfd"
Expand Down Expand Up @@ -119,7 +120,6 @@ func main() {
grpcfd.WithChainStreamInterceptor(),
grpcfd.WithChainUnaryInterceptor(),
grpc.WithDefaultCallOptions(
grpc.WaitForReady(true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edwarnicke
I think there will be a problem with this heal implementation - networkservicemesh/sdk#1107
Because, we have only one attempt to heal the connection. If this attempt fails - link - we don't start a new eventLoop and therefore no new attempt will occur .
Let's take a look at local nsmgr restart scenario. nsmgr will not be able to recreate instantly.
grpc.WaitForReady (or probably WithBlock) ensures that we wait for a new nsmgr to start and successfully complete the first healing attempt.
I think that's the reason why ci is red here...
Am I thinking right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glazychev-art Good eye, yes... I've been thinking on that too. I think the fix belongs in the heal implementation though. Even retaining the WithBlock and WaitForReady we'd need to fix the single try in heal.

The issue with WithBlock and WaitForReady is they are the moral equivalent of putting long sleeps in your code... its better to fail back to a definitive point that can act intelligently for the chain as soon as possible than to linger in the middle.

As to the CI, that could in fact be why its red. Interestingly, it doesn't fail for me locally... but I do believe there is something genuinely wrong we need to fix in heal that has been caught by the CI.

grpc.PerRPCCredentials(token.NewPerRPCCredentials(spiffejwt.TokenGeneratorFunc(source, c.MaxTokenLifetime))),
),
grpc.WithTransportCredentials(
Expand Down Expand Up @@ -226,21 +226,37 @@ func main() {
}
}

requestCtx, cancelRequest := context.WithTimeout(ctx, c.RequestTimeout)
defer cancelRequest()

resp, err := nsmClient.Request(requestCtx, request)
if err != nil {
logger.Fatalf("failed connect to NSMgr: %v", err.Error())
retryCtx, retryCancel := context.WithCancel(ctx)
if c.RequestTimeout != 0 {
retryCtx, retryCancel = context.WithTimeout(ctx, c.RetryTimeout)
defer retryCancel()
}

defer func() {
closeCtx, cancelClose := context.WithTimeout(ctx, c.RequestTimeout)
defer cancelClose()
_, _ = nsmClient.Close(closeCtx, resp)
}()
for {
requestCtx, cancelRequest := context.WithTimeout(ctx, c.RequestTimeout)
defer cancelRequest()

resp, err := nsmClient.Request(requestCtx, request)
if err != nil {
logger.Errorf("failed connect to NSMgr: %v", err.Error())
select {
case <-retryCtx.Done():
logger.Fatalf("failed to connect to %s after %d retries")
default:
time.Sleep(c.RetryInterval)
}
continue
}

logger.Infof("successfully connected to %v. Response: %v", u.NetworkService(), resp)
defer func() {
closeCtx, cancelClose := context.WithTimeout(ctx, c.RequestTimeout)
defer cancelClose()
_, _ = nsmClient.Close(closeCtx, resp)
}()

logger.Infof("successfully connected to %v. Response: %v", u.NetworkService(), resp)
break
}
}

// Wait for cancel event to terminate
Expand Down