Description
What version of Go are you using (go version)?
$ go version
go version go1.16.3 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tyem/Library/Caches/go-build"
GOENV="/Users/tyem/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/tyem/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/tyem/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4w/l3sttpnj00xctgf981g1t2cr0000gq/T/go-build590339154=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
Tried to fix several lock-contention bugs in net/http/h2_bundle.go.
What did you expect to see?
A non-blocking way to test if a mutex is locked.
What did you see instead?
A years-old proposal that was eventually closed due to nobody laying out a convincing enough case to provide such a feature, #6123. I was unable to find a way to re-open an old, locked issue so I am raising it fresh now that I believe I can make a convincing case for the need for this feature.
I think fixing several years-old serious lock-up problems with h2 provides a good justification for this functionality. I believe I have a quite small and simple fix for the following open issues:
x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time (#23559)
x/net/http2: pool deadlock (#32388)
net/http: a "bad" https connection which uses Header.Set could potentially block all https requests more than 15 mins in production environments (#33006)
x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely (#33425)
To lay out the justification for readers not familiar with any of these bugs...
You have a pool of network connections.
You have a sync.Mutex for preventing races when accessing this pool struct.
Each network connection also has a sync.Mutex for preventing races when accessing the connection struct.
Several operations on the pool need to query the state of one or more connections.
In particular, consider a function for making a new connection that needs to see if any existing connections can be reused.
This function must:
Lock the pool mutex.
Look up existing connections to the same endpoint.
Ask each connection if it is in a state where it can be used for a new connection.
That last question requires looking inside of the connection struct which would require locking the connection's mutex.
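The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual h2_bundle.go code: the `conn` and `pool` types, their fields, `maxStreams`, and the `canTakeNewRequest` and `getConn` names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// maxStreams is an arbitrary illustrative limit, not a real h2 setting.
const maxStreams = 100

// conn is a hypothetical stand-in for a pooled connection.
type conn struct {
	mu      sync.Mutex
	closing bool // guarded by mu
	streams int  // guarded by mu
}

// canTakeNewRequest has to lock c.mu to read the connection state, so it
// blocks for as long as any other goroutine holds c.mu.
func (c *conn) canTakeNewRequest() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return !c.closing && c.streams < maxStreams
}

// pool is a hypothetical stand-in for the connection pool.
type pool struct {
	mu    sync.Mutex
	conns map[string][]*conn // guarded by mu
}

// getConn returns a reusable connection to addr, or nil if the caller
// should dial a new one.
func (p *pool) getConn(addr string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, c := range p.conns[addr] {
		// If c.mu is held for a long time, this call blocks while p.mu is
		// still held, which stalls every other user of the pool.
		if c.canTakeNewRequest() {
			return c
		}
	}
	return nil
}

func main() {
	idle := &conn{}
	p := &pool{conns: map[string][]*conn{
		"example.com:443": {&conn{closing: true}, idle},
	}}
	fmt.Println(p.getConn("example.com:443") == idle)
}
```

The important detail is the call to `canTakeNewRequest` inside the loop: it is made with the pool mutex held, so one slow connection mutex is enough to hold up the whole pool.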
Despite years of effort, a network connection is complex enough that the connection mutex can still end up held across an operation that is usually fast but occasionally takes a long time, leaving the mutex locked for a non-trivial duration. Preventing such situations is desirable, but there have been many such cases and several could not be eliminated even after years, so the risk is always going to be present.
When such a problem affects a single connection object, it is often not much of a problem. When it makes it impossible for any new network connections to be made, it is a very serious problem, and we should prevent that. The problem is so serious that h2_bundle has been disabled in important situations just to avoid this.
If I am holding a lock on the pool's mutex and need to ask the question, "Is this particular connection available to take this request?", then, I believe, the connection's mutex being locked means that I should be able to immediately answer with "No, this connection is not available to take a new request, because it is busy doing something that requires its mutex to be locked. We hope this situation will progress very quickly to a state where the mutex is no longer locked and at that point we could ask the question again, but, no, right now, the connection is not ready and available."
But sync.Mutex does not currently allow me to code that, and I think it is completely reasonable to allow such. This can be easily done via the simple addition of:
// LockIfNoWait either locks m immediately if it is not in use, returning
// true, or, when a lock on m is already held, immediately returns false.
func (m *Mutex) LockIfNoWait() bool {
	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
		if race.Enabled {
			race.Acquire(unsafe.Pointer(m))
		}
		return true
	}
	return false
}
Such an addition then allows fixes to h2_bundle.go that are also very simple.
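To show the shape of such a fix, here is a minimal sketch of a pool lookup that skips any connection whose mutex is currently held. It is not the actual h2_bundle.go change. Because a method cannot be added to sync.Mutex from outside the sync package, the sketch uses a hypothetical `tryMutex` type that only mimics the proposed LockIfNoWait semantics with a compare-and-swap. The `conn`, `pool`, `availableNow`, and `getConn` names are likewise assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tryMutex is a hypothetical stand-in for a sync.Mutex with the proposed
// LockIfNoWait method. It supports only non-blocking acquisition, which is
// all this sketch needs.
type tryMutex struct{ state int32 }

// LockIfNoWait locks m and returns true if m was free; otherwise it
// returns false immediately.
func (m *tryMutex) LockIfNoWait() bool {
	return atomic.CompareAndSwapInt32(&m.state, 0, 1)
}

// Unlock releases m.
func (m *tryMutex) Unlock() { atomic.StoreInt32(&m.state, 0) }

// conn is a hypothetical pooled connection.
type conn struct {
	mu      tryMutex
	closing bool // guarded by mu
}

// availableNow answers "no" immediately when c.mu is held, instead of
// waiting for the holder to finish.
func (c *conn) availableNow() bool {
	if !c.mu.LockIfNoWait() {
		return false
	}
	defer c.mu.Unlock()
	return !c.closing
}

// pool is a hypothetical connection pool.
type pool struct {
	mu    sync.Mutex
	conns map[string][]*conn // guarded by mu
}

// getConn never waits on a connection mutex while holding p.mu.
func (p *pool) getConn(addr string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, c := range p.conns[addr] {
		if c.availableNow() {
			return c
		}
	}
	return nil // caller would dial a new connection
}

func main() {
	busy, idle := &conn{}, &conn{}
	busy.mu.LockIfNoWait() // simulate a connection stuck mid-operation
	p := &pool{conns: map[string][]*conn{"example.com:443": {busy, idle}}}
	fmt.Println(p.getConn("example.com:443") == idle)
}
```

In this sketch a busy connection is simply treated as unavailable for the current lookup. A later lookup asks again, by which time the mutex may have been released.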
Now, even if we do a great deal of complicated work to eliminate every place where a connection mutex is held during a call that has a rare chance of taking a while, and even if we convince ourselves that no such bug will ever creep in again, I would still prefer that "Is this connection available?" be answered immediately with "no" when the question is asked while the connection's mutex is held. Making the "open a new connection" function potentially wait for any number of connection mutexes to be unlocked just does not sound like a good strategy for a pool that is meant to provide high performance while managing a very large number of connections.
One of the places this has happened is holding the connection mutex when the connection is Close()d. It is completely reasonable for the connection struct to remain locked during such an operation as nothing useful can be done with a connection that is in the process of closing... except, of course, if you need to ask the connection a question and you are in a context where waiting for the answer would tie up all progress for anybody wanting to deal with network connections.
The cases that cause this are often a sudden, ungraceful loss of network connectivity to a service. In such a case, it doesn't matter much whether the mutexes for connections to that server are locked, because you won't be able to make useful connections to the service anyway (it is better to try a new connection, made in a way that doesn't keep the new connection's mutex locked). But having one slowly failing service block all other network connections is terrible.
Another approach would be to give a connection object two layers and two mutexes, with simple state information guarded by one mutex and complex information guarded by the other. You'd have to be very careful to always lock the slow mutex before locking the fast one, and very careful about which things are protected by which mutex. What are the trade-offs of such an approach?
Cons: A great increase in complexity, lots of ways to easily make mistakes about what is fast vs. slow, pool operations may still have to wait for any number of fast connection mutexes to be unlocked, and fixing the long-lived serious bugs above will take other people a long time (vs. me already having a simple fix).
Pros: You don't have to add a basic mutex feature that is provided by most other implementations, that has been requested by several people, and that I've seen lots of reasonable uses for. (I would have piled on to the request for this myself a few times before now, but the issue was already there and locked, so I just made a more disruptive design change to work around this basic lack.)
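For comparison, the two-mutex alternative might look roughly like this. It is only a sketch under assumed names (`slowMu`, `fastMu`, `closeConn`, `availableNow`), not a design taken from h2_bundle.go.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical connection split across two mutexes.
type conn struct {
	slowMu sync.Mutex // may be held across operations that can block
	fastMu sync.Mutex // held only briefly, to read or write simple state

	closing bool // guarded by fastMu
}

// closeConn follows the required lock order: slowMu first, then fastMu.
// slowOp stands in for whatever potentially slow work closing involves.
func (c *conn) closeConn(slowOp func()) {
	c.slowMu.Lock()
	defer c.slowMu.Unlock()

	c.fastMu.Lock()
	c.closing = true
	c.fastMu.Unlock()

	slowOp() // only slowMu is held here
}

// availableNow touches only fastMu, so it does not wait on slow work.
// It can still wait briefly if another goroutine holds fastMu.
func (c *conn) availableNow() bool {
	c.fastMu.Lock()
	defer c.fastMu.Unlock()
	return !c.closing
}

func main() {
	c := &conn{}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	go func() {
		c.closeConn(func() {
			close(started)
			<-release // simulate a close that hangs
		})
		close(done)
	}()

	<-started
	fmt.Println(c.availableNow()) // answered while the close is still stuck
	close(release)
	<-done
}
```

Even in this small form, every field needs a decision about which mutex guards it, and every method has to respect the lock order. That is the complexity cost listed under the cons.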
So I'm hoping that case is convincing so I can make more progress on fixing those bugs.
Thanks!