Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add optional timeout to CSI lib connection utils #19

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 17 additions & 2 deletions connection/connection.go
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ const terminationLogPath = "/dev/termination-log"
// For other connections, the default behavior from gRPC is used and
// loss of connection is not detected reliably.
func Connect(address string, options ...Option) (*grpc.ClientConn, error) {
return connect(address, []grpc.DialOption{}, options)
return connect(address, options)
}

// Option is the type of all optional parameters for Connect.
Expand All @@ -74,6 +74,14 @@ func OnConnectionLoss(reconnect func() bool) Option {
}
}

// WithTimeout sets the connection timeout duration
// If not specified, all connections will retry forever.
func WithTimeout(timeout time.Duration) Option {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more flexible solution for specifying a timeout is to add a context.Context parameter to the Connect function. Then one can specify both a deadline and a timeout, depending on the situation. If we expose a timeout option at all in the API, then I would prefer the solution based on context.

@andrewsykim why do you think such an option is needed?

When @jsafrane added this function, he intentionally limited the API to just what the CSI sidecars need, to ensure consistent behavior. That's why the testing is partly using an internal helper function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connect forever option doesn't work well with the liveness probe sidecar because the health check will hang instead of returning an error.

@andrewsykim can you confirm if the Kubernetes liveness probe can appropriately handle the health check hanging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the default (and minimum) livenessprobe timeout on kubelet is 1s, so liveness check hanging and not returning an error would still allow the kubelet to restart the container within the timeout specified on the container spec.

If we're okay assuming this, then maybe a timeout option is not needed at all. If we do want the livenessprobe sidecar to explicitly return 500 when a timeout is reached then being able to pass in a timeout via the connection lib would be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more flexible solution for specifying a timeout is to add a context.Context parameter to the Connect function. Then one can specify both a deadline and a timeout, depending on the situation. If we expose a timeout option at all in the API, then I would prefer the solution based on context.

I can update to use context instead, though we may lose some of the functionality we get passing in the timeout as a grpc dial option. Either works for me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

liveness check hanging and not returning an error would still allow the kubelet to restart the container within the timeout specified

Then I think in that case, it is fine to have the liveness probe hang on connect(). The only downside I see, is that healthz requests may pile up on the liveness probe if the driver is forever in an unready state.

How about something like this for the liveness probe:

  1. Instead of connecting to the driver on every healthz check, connect to it once in the beginning. This can try to connect forever.
  2. Whenever a healthz request comes, call Probe() with timeout, so that we can return a 500 immediately and not have outstanding requests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What functionality could get lost?

Note that gRPC already deprecated WithTimeout in favor of context.WithTimeout - https://godoc.org/google.golang.org/grpc#WithTimeout.

Copy link
Contributor Author

@andrewsykim andrewsykim Feb 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this for the liveness probe:

  1. Instead of connecting to the driver on every healthz check, connect to it once in the beginning. This can try to connect forever.
  2. Whenever a healthz request comes, call Probe() with timeout, so that we can return a 500 immediately and not have outstanding requests.

Sounds good. I'll update livenessprobe to do this as soon as #17 is merged

Note that gRPC already deprecated WithTimeout in favor of context.WithTimeout - https://godoc.org/google.golang.org/grpc#WithTimeout.

I didn't know this, this is good to know thanks!

return func(o *options) {
o.timeout = timeout
}
}

// ExitOnConnectionLoss returns callback for OnConnectionLoss() that writes
// an error to /dev/termination-log and exits.
func ExitOnConnectionLoss() func() bool {
Expand All @@ -89,21 +97,28 @@ func ExitOnConnectionLoss() func() bool {

type options struct {
reconnect func() bool
timeout time.Duration
}

// connect is the internal implementation of Connect. It has more options to enable testing.
func connect(address string, dialOptions []grpc.DialOption, connectOptions []Option) (*grpc.ClientConn, error) {
Copy link
Contributor Author

@andrewsykim andrewsykim Feb 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to remove dialOptions here because I don't think it makes sense to pass in dial options into an unexported function. Also, it seems it was only added for the connect timeout test case anyway.

func connect(address string, connectOptions []Option) (*grpc.ClientConn, error) {
var o options
for _, option := range connectOptions {
option(&o)
}

var dialOptions []grpc.DialOption
dialOptions = append(dialOptions,
grpc.WithInsecure(), // Don't use TLS, it's usually local Unix domain socket in a container.
grpc.WithBackoffMaxDelay(time.Second), // Retry every second after failure.
grpc.WithBlock(), // Block until connection succeeds.
grpc.WithUnaryInterceptor(LogGRPC), // Log all messages.
)

if o.timeout != 0 {
dialOptions = append(dialOptions, grpc.WithTimeout(o.timeout))
}

unixPrefix := "unix://"
if strings.HasPrefix(address, "/") {
// It looks like filesystem path.
Expand Down
6 changes: 3 additions & 3 deletions connection/connection_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -150,13 +150,13 @@ func TestWaitForServer(t *testing.T) {
}
}

func TestTimout(t *testing.T) {
func TestConnectWithTimout(t *testing.T) {
tmp := tmpDir(t)
defer os.RemoveAll(tmp)

startTime := time.Now()
timeout := 5 * time.Second
conn, err := connect(path.Join(tmp, "no-such.sock"), []grpc.DialOption{grpc.WithTimeout(timeout)}, nil)
timeout := 1 * time.Second
conn, err := Connect(path.Join(tmp, "no-such.sock"), WithTimeout(timeout))
endTime := time.Now()
if assert.Error(t, err, "connection should fail") {
assert.InEpsilon(t, timeout, endTime.Sub(startTime), 1, "connection timeout")
Expand Down