Add reconnect logic for stdio pipes #1197
Conversation
Force-pushed from 72c4f43 to 34e118f.
Interesting linter warning lol. I feel it's helpful to include the unit regardless, but let me know if we'd rather just ignore this in the linter.
Once it's a
I thought it might be helpful to provide some context on the use of e.g.
More info on this can be found under the Constants section of the
@kevpar Welp, my changing to InSec everywhere was not worth it. Swapping back.
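For reference, a minimal sketch of the naming point being debated, assuming the warning is staticcheck's ST1011 complaint about unit suffixes on time.Duration values (the variable names below are made up for illustration):

```go
package relay

import "time"

// The Duration type already encodes the unit via the constants documented in
// the Constants section of the time package (time.Second, time.Millisecond, ...),
// which is why staticcheck discourages repeating the unit in the name.
var (
	// Style the linter flags: the "InSec" suffix duplicates what the type conveys.
	ioRetryTimeoutInSec = 10 * time.Second

	// Style the linter prefers: no unit in the name; the value carries the unit.
	ioRetryTimeout = 10 * time.Second
)
```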
Error: SA5011: possible nil pointer dereference (staticcheck). Not seeing this lint error locally; I assume it's referring to the conn on the retryWriter, but... I also just changed it back to what it used to be, so I'm even more confused. Running the same version locally and everything. I wish it would display what line it's referring to.
Force-pushed from 3de0457 to ef166c0.
I'm not seeing this on the runs it does on my fork when I push either: https://github.com/dcantah/hcsshim/runs/3940750279?check_suite_focus=true. Trying to squash here and see if it does anything. Edit: And the answer's no.
LGTM pending resolution of the time.Duration warnings.
I question whether the IO retry timeout should actually be configurable via annotation. In general I think trying to make everything configurable all the time just results in a messier, harder-to-understand system. I can't think of a good use case where someone would want different IO retry timeouts for different containers. Can you give some context on your thinking here?
@kevpar My thinking started and ended with: most of the deployment-wide config options (the containerd shim options) have annotation equivalents, so I was mostly following the status quo. It's also just a quick way to check if it's set without passing around a timeout field or passing the shim options to newHcsTask and pals.
Replied above as well, but I don't have a strong justification (or opinion) for it being configurable per container. The only use I can really think of would be for tests. I can remove it.
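To make the trade-off above concrete, here's a minimal sketch of the annotation-overrides-shim-option pattern being discussed; the annotation key, option parameter, and default below are hypothetical, not the PR's actual names:

```go
package relay

import (
	"strconv"
	"time"
)

// ioRetryTimeout resolves the stdio reconnect timeout: a per-container
// annotation (if present) overrides the deployment-wide shim option.
// The annotation key, option parameter, and default are placeholders.
func ioRetryTimeout(annotations map[string]string, optTimeoutSeconds int32) time.Duration {
	const defaultTimeout = 10 * time.Second // placeholder default

	if v, ok := annotations["io.microsoft.container.io-retry-timeout-seconds"]; ok { // hypothetical key
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	if optTimeoutSeconds > 0 {
		return time.Duration(optTimeoutSeconds) * time.Second
	}
	return defaultTimeout
}
```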
Force-pushed from dbdd9b5 to e30f98f.
This linter issue makes no sense to me. I don't see it locally on the same version, and even installing staticcheck directly and running it doesn't surface it. What the heck is going on? Edit: If the linter is set to only check for issues in the current PR, the error goes away. Guess it was something in the task_hcs.go file that's been there already..?
Force-pushed from 9017e5a to c9cc26a.
LGTM!
Pretty sure the
It's just going to get squashed on check-in.
IMO just something to keep in mind and avoid in the future.
LGTM
@kevpar You'd given this a couple of rounds of review, so even though we're at the usual two approvals, let me know if you want to give this one more scan.
internal/cmd/io_npipe.go (Outdated)

func (nprw *nPipeRetryWriter) Write(p []byte) (n int, err error) {
	for {
		n, err = nprw.Conn.Write(p)
There is a case where we could write n > 0 bytes but also get a disconnected error. You may want to track how many bytes we have written so far, and write from that position in p.
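A minimal sketch of that suggestion, assuming a hypothetical wrapper type; the disconnect check and dial callback are stand-ins for whatever the real code uses:

```go
package relay

import (
	"errors"
	"net"
)

// retryWriter illustrates resuming a write from the offset already written,
// so bytes sent before a disconnect are not duplicated after reconnecting.
type retryWriter struct {
	conn net.Conn
	dial func() (net.Conn, error) // re-establishes the pipe connection
}

func (rw *retryWriter) Write(p []byte) (int, error) {
	written := 0
	for written < len(p) {
		n, err := rw.conn.Write(p[written:])
		written += n
		if err == nil {
			continue
		}
		// The real check would be Windows/named-pipe specific; net.ErrClosed is
		// only a placeholder for "the other end went away".
		if !errors.Is(err, net.ErrClosed) {
			return written, err
		}
		c, derr := rw.dial()
		if derr != nil {
			return written, derr
		}
		rw.conn = c
	}
	return written, nil
}
```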
+1
Added
I'm not seeing any change here?
Nevermind, GitHub had me on an old revision.
Force-pushed from e99f57c to e7aadf5.
LGTM
Looks like CI is failing though...
Yep 😑 I can't make sense of this: it doesn't fail locally, it passed on the last revision, and now it's back. It ALSO doesn't fail on my fork when pushing, which kicks off the CI as well. Color me confused. I left it from last night so you could see what I was talking about. Edit: And now it passes 😂
Gonna squash the commits and check in.
Seems reasonable. We can look into the CI flakiness separately.
Force-pushed from e7aadf5 to b047802.
This change adds retry logic on the stdio relay if the server end of the named pipe disconnects, which is a common case if containerd restarts, for example. The current approach is to make an io.Writer wrapper that handles the reconnection logic on a write failure if it can be determined that the error is from a disconnect. A new shim config option is exposed to tailor the retry timeout.

This change also adds cenkalti/backoff/v4 as a dependency, used for the exponential backoff logic on the stdio connection retry attempts. Retrying at a fixed interval is a bit naive, as all of the shims would potentially be trying to reconnect to three pipes continuously in <timeout> bursts. Exponential backoff lets us space out the connection attempts, set an upper limit on the retry intervals, and add an element of randomness to the retry attempts.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
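For illustration, a minimal sketch of how cenkalti/backoff/v4 can drive the reconnect loop described above; the interval values and the dial callback are assumptions, not the PR's actual configuration:

```go
package relay

import (
	"context"
	"net"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// reconnect redials the named pipe with exponential backoff, jitter, and an
// overall timeout, roughly mirroring the behavior described in the commit
// message. dial is whatever re-establishes the pipe connection (e.g. a
// go-winio DialPipe call in the real code).
func reconnect(ctx context.Context, dial func() (net.Conn, error), timeout time.Duration) (net.Conn, error) {
	expBackoff := backoff.NewExponentialBackOff()
	expBackoff.InitialInterval = 200 * time.Millisecond // assumed values, not the PR's
	expBackoff.MaxInterval = 5 * time.Second
	expBackoff.MaxElapsedTime = timeout // give up once the configured retry timeout elapses

	var conn net.Conn
	op := func() error {
		c, err := dial()
		if err != nil {
			return err // backoff retries until MaxElapsedTime or ctx cancellation
		}
		conn = c
		return nil
	}
	if err := backoff.Retry(op, backoff.WithContext(expBackoff, ctx)); err != nil {
		return nil, err
	}
	return conn, nil
}
```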
Force-pushed from b047802 to 573c137.
Related work items: microsoft#1067, microsoft#1097, microsoft#1119, microsoft#1170, microsoft#1176, microsoft#1180, microsoft#1181, microsoft#1182, microsoft#1183, microsoft#1184, microsoft#1185, microsoft#1186, microsoft#1187, microsoft#1188, microsoft#1189, microsoft#1191, microsoft#1193, microsoft#1194, microsoft#1195, microsoft#1196, microsoft#1197, microsoft#1200, microsoft#1201, microsoft#1202, microsoft#1203, microsoft#1204, microsoft#1205, microsoft#1206, microsoft#1207, microsoft#1209, microsoft#1210, microsoft#1211, microsoft#1218, microsoft#1219, microsoft#1220, microsoft#1223
Add reconnect logic for stdio pipes
This change adds retry logic on the stdio relay if the server end of the named pipe
disconnects. This is a common case if containerd restarts, for example.
The current approach is to make an io.Writer wrapper that handles the
reconnection logic on a write failure if it can be determined that the error
is from a disconnect.
This exposes a new shim config option to tailor the retry timeout, as well as
an annotation so it can be set per container.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>