Add reconnect logic for stdio pipes #1197
Conversation
Force-pushed from 72c4f43 to 34e118f.
Interesting linter warning lol. I feel it's helpful to include the unit regardless, but let me know if we'd rather just ignore this in the linter.
Once it's a
I thought it might be helpful to provide some context on the use of e.g.
More info on this can be found under the Constants section of the
@kevpar Welp, my changing to InSec everywhere was not worth it. Swapping back.
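For reference, a minimal sketch of the naming point being debated, assuming the warning is staticcheck's ST1011 complaint about unit suffixes on time.Duration values (the variable names below are made up for illustration):

```go
package relay

import "time"

// The Duration type already encodes the unit via the constants documented in
// the Constants section of the time package (time.Second, time.Millisecond, ...),
// which is why staticcheck discourages repeating the unit in the name.
var (
	// Style the linter flags: the "InSec" suffix duplicates what the type conveys.
	ioRetryTimeoutInSec = 10 * time.Second

	// Style the linter prefers: no unit in the name; the value carries the unit.
	ioRetryTimeout = 10 * time.Second
)
```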
Error: SA5011: possible nil pointer dereference (staticcheck). Not seeing this lint error locally; I assume it's referring to the conn on the retryWriter, but... I also just changed it back to what it used to be, so I'm even more confused. Running the same version locally and everything. I wish it would display what line it's referring to.
Force-pushed from 3de0457 to ef166c0.
I'm not seeing this on the runs it does on my fork when I push either: https://github.com/dcantah/hcsshim/runs/3940750279?check_suite_focus=true. Trying to squash here and see if it does anything. Edit: And the answer's no.
LGTM pending resolution of the time.Duration warnings.
I question whether the IO retry timeout should actually be configurable via annotation. In general I think trying to make everything configurable all the time just results in a messier, harder-to-understand system. I can't think of a good use case where someone would want different IO retry timeouts for different containers. Can you give some context on your thinking here?
@kevpar My thinking started and ended with: most of the deployment-wide config options (the containerd shim options) have annotation equivalents, so I was mostly following the status quo. It's also just a quick way to check if it's set without passing around a timeout field or passing the shim options to newHcsTask and pals.
Replied above as well, but I don't have a strong justification (or opinion) for it being configurable per container. The only use I can really think of would be for tests. I can remove it.
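To make the trade-off above concrete, here's a minimal sketch of the annotation-overrides-shim-option pattern being discussed; the annotation key, option parameter, and default below are hypothetical, not the PR's actual names:

```go
package relay

import (
	"strconv"
	"time"
)

// ioRetryTimeout resolves the stdio reconnect timeout: a per-container
// annotation (if present) overrides the deployment-wide shim option.
// The annotation key, option parameter, and default are placeholders.
func ioRetryTimeout(annotations map[string]string, optTimeoutSeconds int32) time.Duration {
	const defaultTimeout = 10 * time.Second // placeholder default

	if v, ok := annotations["io.microsoft.container.io-retry-timeout-seconds"]; ok { // hypothetical key
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	if optTimeoutSeconds > 0 {
		return time.Duration(optTimeoutSeconds) * time.Second
	}
	return defaultTimeout
}
```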
Force-pushed from dbdd9b5 to e30f98f.
This linter issue makes no sense to me. I don't see it locally on the same version, and even installing staticcheck directly and running it doesn't surface it. What the heck is going on? Edit: If the linter is set to only check for issues in the current PR, the error goes away. Guess it was something in the task_hcs.go file that's been there already..?
Force-pushed from 9017e5a to c9cc26a.
LGTM!
Pretty sure the
It's just going to get squashed on check-in.
IMO just something to keep in mind and avoid in the future.
LGTM
@kevpar You'd given this a couple of rounds of review, so even though we're at the usual two approvals, let me know if you want to give this one more scan.
internal/cmd/io_npipe.go (Outdated)

func (nprw *nPipeRetryWriter) Write(p []byte) (n int, err error) {
	for {
		n, err = nprw.Conn.Write(p)
There is a case where we could write n > 0 bytes but also get a disconnected error. You may want to track how many bytes we have written so far, and write from that position in p.
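A minimal sketch of that suggestion, assuming a hypothetical wrapper type; the disconnect check and dial callback are stand-ins for whatever the real code uses:

```go
package relay

import (
	"errors"
	"net"
)

// retryWriter illustrates resuming a write from the offset already written,
// so bytes sent before a disconnect are not duplicated after reconnecting.
type retryWriter struct {
	conn net.Conn
	dial func() (net.Conn, error) // re-establishes the pipe connection
}

func (rw *retryWriter) Write(p []byte) (int, error) {
	written := 0
	for written < len(p) {
		n, err := rw.conn.Write(p[written:])
		written += n
		if err == nil {
			continue
		}
		// The real check would be Windows/named-pipe specific; net.ErrClosed is
		// only a placeholder for "the other end went away".
		if !errors.Is(err, net.ErrClosed) {
			return written, err
		}
		c, derr := rw.dial()
		if derr != nil {
			return written, derr
		}
		rw.conn = c
	}
	return written, nil
}
```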
+1
Added
I'm not seeing any change here?
Nevermind, GitHub had me on an old revision.
Force-pushed from e99f57c to e7aadf5.
LGTM
Looks like CI is failing though...
Yep 😑 I can't make sense of this: it doesn't fail locally, it passed on the last revision, and now it's back. It ALSO doesn't fail on my fork when pushing, which kicks off the CI as well. Color me confused. I left it from last night so you could see what I was talking about. Edit: And now it passes 😂
Gonna squash the commits and check in.
Seems reasonable. We can look into the CI flakiness separately.
Force-pushed from e7aadf5 to b047802.
This change adds retry logic on the stdio relay if the server end of the named pipe disconnects, which is a common case if containerd restarts, for example. The current approach is to make an io.Writer wrapper that handles the reconnection logic on a write failure if it can be determined that the error is from a disconnect. A new shim config option is exposed to tailor the retry timeout.

This change also adds cenkalti/backoff/v4 as a dependency, used for the exponential backoff logic on the stdio connection retry attempts. Retrying at a fixed interval is a bit naive, as all of the shims would potentially be trying to reconnect to three pipes continuously in <timeout> bursts. Exponential backoff lets us space out the connection attempts, set an upper limit on the retry intervals, and add an element of randomness to the retry attempts.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
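For illustration, a minimal sketch of how cenkalti/backoff/v4 can drive the reconnect loop described above; the interval values and the dial callback are assumptions, not the PR's actual configuration:

```go
package relay

import (
	"context"
	"net"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// reconnect redials the named pipe with exponential backoff, jitter, and an
// overall timeout, roughly mirroring the behavior described in the commit
// message. dial is whatever re-establishes the pipe connection (e.g. a
// go-winio DialPipe call in the real code).
func reconnect(ctx context.Context, dial func() (net.Conn, error), timeout time.Duration) (net.Conn, error) {
	expBackoff := backoff.NewExponentialBackOff()
	expBackoff.InitialInterval = 200 * time.Millisecond // assumed values, not the PR's
	expBackoff.MaxInterval = 5 * time.Second
	expBackoff.MaxElapsedTime = timeout // give up once the configured retry timeout elapses

	var conn net.Conn
	op := func() error {
		c, err := dial()
		if err != nil {
			return err // backoff retries until MaxElapsedTime or ctx cancellation
		}
		conn = c
		return nil
	}
	if err := backoff.Retry(op, backoff.WithContext(expBackoff, ctx)); err != nil {
		return nil, err
	}
	return conn, nil
}
```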
Force-pushed from b047802 to 573c137.
Related work items: microsoft#1067, microsoft#1097, microsoft#1119, microsoft#1170, microsoft#1176, microsoft#1180, microsoft#1181, microsoft#1182, microsoft#1183, microsoft#1184, microsoft#1185, microsoft#1186, microsoft#1187, microsoft#1188, microsoft#1189, microsoft#1191, microsoft#1193, microsoft#1194, microsoft#1195, microsoft#1196, microsoft#1197, microsoft#1200, microsoft#1201, microsoft#1202, microsoft#1203, microsoft#1204, microsoft#1205, microsoft#1206, microsoft#1207, microsoft#1209, microsoft#1210, microsoft#1211, microsoft#1218, microsoft#1219, microsoft#1220, microsoft#1223
Add reconnect logic for stdio pipes
This change adds retry logic on the stdio relay if the server end of the named pipe
disconnects. This is a common case if containerd restarts, for example.
The current approach is to make an io.Writer wrapper that handles the
reconnection logic on a write failure if it can be determined that the error
is from a disconnect.
This exposes a new shim config option to tailor the retry timeout, as well as
an annotation so it can be set per container.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>