Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: enable TCP keepalive on new connections from net.Dial #23459

Closed
CAFxX opened this issue Jan 17, 2018 · 22 comments
Closed

net: enable TCP keepalive on new connections from net.Dial #23459

CAFxX opened this issue Jan 17, 2018 · 22 comments

Comments

@CAFxX
Copy link
Contributor

CAFxX commented Jan 17, 2018

In our experience, missing network timeouts are one of the first sources of resiliency bugs in server applications. Unfortunately, these bugs are rare to experience during development and are not very easy to write tests for. As a result, we often see components hanging because somewhere in a dependency three layers down someone forgot to add an explicit timeout to a network connection/request/call. This is the result of many factors, one being that many languages (Go included) does not have default timeouts for network operations even though having a default timeout would align better with the common case (how many times do network operations require to not have timeouts?)

I realize this may be seen as an over-reaction to the problem, but I would argue that if the Go standard packages had default timeouts set on all network operations a whole class of resiliency problems would, for the majority of uses, disappear.

A couple of notes:

  • I'm not arguing it would solve all network-triggered resiliency problems in all cases, only in a good majority.
  • I'm also not arguing that users shouldn't be allowed to explicitly disable such timeouts, or to set them to something different. While I argue that the common case is that network operations should have a timeout, I agree that there may be a minority of cases where no timeouts make sense (the most common being when a different mechanism to detect timeouts is in use).

To summarize the proposal:

I'm unsure if this would break go1 compat, so feel free to classify as Go2 if required.

@gopherbot gopherbot added this to the Proposal milestone Jan 17, 2018
@bradfitz
Copy link
Contributor

Seems like a good idea. More concretely, I think the time.Duration zero values in a number of places should probably mean "sane default", and "no timeout" should be opt-in with either a negative duration or some really gigantic duration.

@bronze1man
Copy link
Contributor

bronze1man commented Jan 17, 2018

But the default behave should be once the connection read something successful, the read deadline reset base on current time add the timeout. So that http.DefaultClient can download some very large files from the internet.

@cznic
Copy link
Contributor

cznic commented Jan 17, 2018

But the default behave should be once the connection read something successful, the read deadline reset base on current time add the timeout.

See Slowloris.

@bronze1man
Copy link
Contributor

@cznic interesting.

The default network timeout will make the simple download very large files task more complex.

So we can not use the same/simple configure for download very large files and avoid Slowloris problem.

And I am also wrong about write timeout.

The write timeout should be useful as the server send data to a disconnected client.
The read timeout is for the server receive data from a disconnceted client.

The read/write deadline reset base on current time add the timeout is bad in some cases when consider Slowloris attack.

@davecb
Copy link

davecb commented Jan 17, 2018

I'll claim that all things should have timeouts, but some may need to be set to specific values instead of what you might prefer.

I tend to use the idiom

flag.IntVar(&runFor, "for", 0, "number of records to use, eg 1000 ")
...
if runFor == 0 {
    runFor = math.MaxInt64

And thanks, by the way: I'm now off to make sure all my library timeouts are set properly

@rsc
Copy link
Contributor

rsc commented Jan 22, 2018

The net API doesn't have timeouts at all right now. It has deadlines, which are simpler in various ways but not really great for the problem you described.

Would it be enough to set TCP keepalive on by default, as we do in the HTTP server, instead of setting some kind of persistent timeout in Go itself?

@ianlancetaylor
Copy link
Contributor

As a possible smaller change, I think it may be worth considering adding a variant of net.Listen that takes a timeout duration. This would not set a timeout on the listen operation itself, but each connection it accepts would be initialized with a call to SetDeadline(time.Now() + timeout). This would give a prominent position in the net package for servers that should set a limit on the time required by each client. (The net.Listener interface would also have to be updated.)

@CAFxX
Copy link
Contributor Author

CAFxX commented Jan 23, 2018

Just to frame my post in a little bit more of context: servers don't just accept connections, they also open connections - both in response to events (e.g. incoming requests) as well as independently as part of their lifecycles.

That is to say, the "listener side" is certainly in scope but it's not the full picture.

The point I was making in my initial proposal is more about the general approach/policy rather than about how to go implement it... especially since I have the feeling that a "holistic solution" will be out of reach in go1 (as @rsc's comment seem to imply)

Maybe the go1 implementation concerns should be split into a separate proposal? Or maybe this proposal would be about the "general approach/policy", and then (if an agreement is reached) we can open go1 and go2 issues to discuss how to implement the policy?

@ianlancetaylor
Copy link
Contributor

Yes, let's discuss a general approach first.

@neild
Copy link
Contributor

neild commented Jan 23, 2018

we often see components hanging because somewhere in a dependency three layers down someone forgot to add an explicit timeout to a network connection/request/call.

FWIW, this is one of the problems which the context package is intended to solve: Plumbing cancellation or timeout signals down through layers of the call stack.

One problem is that it's currently quite difficult to hook a context's cancellation signal to low-level network operations' deadline-based cancellation; see #20280, and in particular the post linked from @zombiezen's comment at #20280 (comment)

@CAFxX
Copy link
Contributor Author

CAFxX commented Jan 23, 2018

@neild agreed, although if you replace "add an explicit timeout" with "correctly wire a context" the point (that the default for network ops should not be "wait forever") still stands:

we often see components hanging because somewhere in a dependency three layers down someone forgot to correctly wire a context to a network connection/request/call.

@rsc rsc changed the title Proposal: all network operations should have a default timeout proposal: net: all operations should have a default timeout Jan 29, 2018
@rsc
Copy link
Contributor

rsc commented Feb 5, 2018

Let's enable TCP keepalive by default and then see what problems still remain to be solved. It seems like TCP keepalives would handle essentially all "accidental" (not malicious) long-lived connections.

Does anyone object to enabling TCP keepalive by default?

@mikioh
Copy link
Contributor

mikioh commented Feb 6, 2018

I have no strong opinion on enabling TCP keepalive by default. The following is just random thoughts.

  1. TCP keepalive is a mechanism that tries to keep in-flight TCP connections alive by using zero-payload TCP segments, I mean, not a feature like application-level dead peer detection. If the peer application gets stuck for some reason and remains the established TCP connection alive, the mechanism cannot detect such condition.

  2. The original issue sounds to me like there should be a better application-transport API surface including some help for implementing the dogma: computer networking? it's kinda illusion, huh? if you have any problem with it, don't tackle the problem directly, don't try to reclaim the lost stuff, do abort and retry. In addition, perhaps the net package might be not suitable for such purpose because the package intersects platform- and network-specific features and provides only read and write operations for byte sequences. I think it's better to have the abstraction, a bit more well-cooked network operation things such as Messages, Requests, Responses, and Calls, in packages other than the net package.

@CAFxX
Copy link
Contributor Author

CAFxX commented Feb 18, 2018

@rsc I am in favor of enabling TCP keepalives by default. I actually think they will solve part of the problem but, just to clarify, I don't think we can say that

TCP keepalives would handle essentially all "accidental" (not malicious) long-lived connections

TCP keepalives are normally handled by the TCP implementation in the kernel, so a "stuck" application would not be detected (until it's killed or crashes). Also TCP keepalives have the downside of being applicable only to TCP. 😅

These are the reasons I think starting with TCP keepalives is a good first step.

@rsc
Copy link
Contributor

rsc commented Feb 26, 2018

OK, let's do TCP keepalives on by default.
Also, Brad suggested making the http.Server configuration have zero duration timeouts mean "a good limit" instead of "unlimited". That sounds good too.

@rsc rsc changed the title proposal: net: all operations should have a default timeout net: enable TCP keepalive on new connections from net.Dial Feb 26, 2018
@rsc rsc modified the milestones: Proposal, Go1.11 Feb 26, 2018
@CAFxX
Copy link
Contributor Author

CAFxX commented Feb 26, 2018

Great! Should I also open a followup go2 proposal to review the net APIs in light of what was discussed in this issue?

@CAFxX
Copy link
Contributor Author

CAFxX commented Mar 20, 2018

I was giving this a shot but I ran into something unexpected. Currently in net there is only support to set the keepalive interval/delay, but not the number of unacknowledged keepalives before killing the connection (TCP_KEEPCNT/tcp_keepalive_probes).
I suppose this means we are using the local kernel default for TCP_KEEPCNT, whatever that may be (e.g. on Linux this should be 9, on Windows 5, on Mac, FreeBSD and OpenBSD[1] 8).

Not being able to control this parameter means we can't precisely set the target total time before the keepalive mechanism considers the connection broken, since the total time is (initial delay + interval * count) and in go initial delay == interval.

What is the desired course of action here?

  • add a (private) method to set TCP_KEEPCNT on each platform; override the kernel defaults for our sockets and target the same total time on all platforms
  • assume that the local kernel defaults have not been changed; use different keepalive intervals (e.g. 18s on linux, 30s on windows, 20s on mac/bsd) depending on the platform so that in the "unchanged kernel defaults" case all platforms have the same total time (3m)
  • ignore the problem and use the same interval regardless of the platform; each platform will have a different total time (e.g. assuming an interval of 15s, on linux the total time would be 2m 30s, on windows 1m 30s, on mac/bsd 2m 15s)

[1] As a side note, it seems the openbsd support for keepalives states that openbsd doesn't support keepalives, but this does not seem to be the case.

@CAFxX
Copy link
Contributor Author

CAFxX commented Apr 11, 2018

@rsc if you have nothing against this, I will try to go with the third and simplest option ("use the same interval regardless of the platform")

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/107196 mentions this issue: net: enable TCP keepalives by default

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018
CAFxX added a commit to CAFxX/go that referenced this issue Oct 19, 2018
This is just the first step in attempting to make all network connection have
timeouts as a "safe default". TCP keepalives only protect against certain
classes of network and host issues (e.g. server/OS crash), but do nothing
against application-level issues (e.g. an application that accepts connections
but then fails to serve requests).

The actual keep-alive duration (15s) is chosen to cause broken connections
to be closed after 2~3 minutes (depending on the OS, see golang#23549 for details).
We don't make the actual default value part of the public API for a number of
reasons:
- because it's not very useful by itself: as discussed in golang#23549 the actual
  "timeout" after which the connection is torn down is duration*(KEEPCNT+1),
  and we use the OS-wide value for KEEPCNT because there's currently no way
  to set it from Go.
- because it may change in the future: if users need to rely on a specific
  value they should explicitly set this value instead of relying on the default.

Fixes golang#23459

Change-Id: I348c03be97588d5001e6de0f377e7a93b51957fd
@zhyon404
Copy link

zhyon404 commented Jul 9, 2019

Does using net.ipv4.tcp_keepalive_time as the default keepalives interval sound like a good idea to you?

@CAFxX
Copy link
Contributor Author

CAFxX commented Jul 9, 2019

@zhyon404 we don't do that (and this issue is already closed, so you may want to open a new one if you found a different problem...)

Update: are you suggesting we should be using the OS default?

@networkimprov
Copy link

@pam4 and I discovered that keepalive is disabled by writing to a dead link #31490

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests