Skip to content
This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

fix: mutex deadlocks in PeerClient #223

Merged
merged 23 commits into from
Feb 28, 2024

Conversation

miparnisari
Copy link
Contributor

@miparnisari miparnisari commented Feb 20, 2024

Description

Close #222.

The mutex deadlock is proven by the flaky TestPeerClientShutdown:

[25/02/24 7:09:39] ~/GitHub/gubernator (master) $ go test ./... -run=TestPeerClientShutdown -count=1 -race -timeout=10s
?   	github.com/mailgun/gubernator/v2/cmd/gubernator	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cli	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cluster	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/healthcheck	[no test files]
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9990 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9980 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9991 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9981 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9992 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9982 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9993 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9983 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9994 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9984 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9995 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9985 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9890 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9880 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9891 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9881 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9892 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9882 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9893 ..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9883 ..." instance="127.0.0.1:9893"
panic: test timed out after 10s
running tests:
	TestPeerClientShutdown (10s)
	TestPeerClientShutdown/Global (10s)


goroutine 566 [sync.RWMutex.Lock]:
sync.runtime_SemacquireRWMutex(0xc000fefedc?, 0xbc?, 0xc000fefec8?)
	/opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0xc000fefec8)
	/opt/homebrew/Cellar/go/1.21.6/libexec/src/sync/rwmutex.go:152 +0x88
github.com/mailgun/gubernator/v2.(*PeerClient).connect(0xc000fefe00, {0xc00150a1a0?, 0x1?})
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:123 +0x3b4
github.com/mailgun/gubernator/v2.(*PeerClient).getPeerRateLimitsBatch(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:299 +0x170
github.com/mailgun/gubernator/v2.(*PeerClient).GetPeerRateLimit(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:199 +0x4fc
github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1.1()
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:71 +0x12c
created by github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1 in goroutine 565
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:68 +0x3cc

Sometimes this test also gives "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel.

Testing

With this PR:

25/02/24 7:11:21] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestPeerClientShutdown -count=10 -race -timeout=5s
?   	github.com/mailgun/gubernator/v2/cmd/gubernator	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cli	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cluster	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/healthcheck	[no test files]
ok  	github.com/mailgun/gubernator/v2	7.453s
ok  	github.com/mailgun/gubernator/v2/cluster	2.966s [no tests to run]
[25/02/24 7:11:37] ~/GitHub/gubernator (fix-peer-client) $

@miparnisari miparnisari mentioned this pull request Feb 20, 2024
peer_client.go Outdated Show resolved Hide resolved
@miparnisari miparnisari changed the title fix race issuers in PeerClient fix race issueS in PeerClient Feb 20, 2024
@miparnisari miparnisari changed the title fix race issueS in PeerClient fix mutex deadlocks in PeerClient Feb 20, 2024
@miparnisari miparnisari marked this pull request as ready for review February 20, 2024 03:49
@miparnisari miparnisari marked this pull request as draft February 20, 2024 16:35
@thrawn01
Copy link
Contributor

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

peer_client.go Outdated
// Take the write lock since we're going to modify the closing state
c.mutex.Lock()
// Shutdown waits until outstanding requests have finished
func (c *PeerClient) Shutdown() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown/Close methods should always accept and context.Cancel as there may be some unknown or unforeseen reason the close cannot complete and it then hangs forever. This has happened in production on other services and it's always good to put a limit on how long you will wait for a shutdown to complete. This protects the service from having to be killed by the orchestration system without successfully saving it's internal state. Shutting down a connection is a low priority thing, if for some reason the workgroup Wait() hangs forever, then we can cancel the shutdown wait, and not interfere with normal shutdown of any service that uses gubernator as a library.

Copy link
Contributor Author

@miparnisari miparnisari Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen a shutdown function accept a context. E.g. sql.Close, or connection.Close.

The context object is meant to be used when there is the potential chance that the request is cancelled by the user. This is not the case, Shutdown is invoked internally, not by a client. We should not use the context as a crutch to fix deadlock issues.
I can refactor this to avoid mutex altogether, but I would like to know if the work would be useful (given the v3 release)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few things in the works. We are moving development to a new repo which will not be associated with the mailgun GitHub organization. After that, the plan is to bump to a v3 release, drop GRPC support and continue development there.

But first we want to merge all these open PRs.

I had hoped we could finish merging last week, as I'm traveling this week, but hopefully we can get it all done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to add, Mailgun plans to abandon this repo in favor of @thrawn01's new repo. I suppose you can continue using the /v2 flavor with the new repo until ready to migrate to /v3.

Either way, it looks like /v2 will be deprecated very soon and gRPC support along with it. I suggest not investing too much time into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I can lend a hand in the new repo 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, what was the biggest factor for the decision to drop gRPC?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, what was the biggest factor for the decision to drop gRPC?

@thrawn01 and I had some big conversations about this. Reduced complexity and improved performance. HTTP can perform more efficiently at scale than gRPC if done right.

This would be a good time for @thrawn01 to mention his work on Duh-RPC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP can perform more efficiently at scale than gRPC if done right.

What's the source for this? My understanding was this this in fact the opposite 😄 https://aws.amazon.com/compare/the-difference-between-grpc-and-rest/

A gRPC API is better for these use cases:
High-performance systems
High data loads
Real-time or streaming applications

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another source: https://github.com/grpc-ecosystem/grpc-gateway#background

gRPC is great -- it generates API clients and server stubs in many programming languages, it is fast, easy-to-use, bandwidth-efficient and its design is combat-proven by Google. However, you might still want to provide a traditional RESTful JSON API as well. Reasons can range from maintaining backward-compatibility, supporting languages or clients that are not well supported by gRPC, to simply maintaining the aesthetics and tooling involved with a RESTful JSON architecture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to publish some more formal information about our GRPC journey at Mailgun. But as a quick overview, we are doing some very high performance and large scale stuff with GRPC and it hasn't lived up to all the hype.

In fact we have discovered that GRPC is slightly slower than using plan old HTTP/1 !!!!!!

As a result, I started working on a project called DUH-RPC, which is still in the works, but you can see the benchmarks comparing HTTP/1 with GRPC here https://github.com/duh-rpc/duh-go-benchmarks

The spec I'm working on is in progress but is here, feed back is welcome.
https://github.com/duh-rpc/duh-go

I'll be traveling for the next 2 weeks, and I'm not always near my laptop, so I don't expect much movement on these projects until after. I still get notifications for these conversations, so I'll try to reply as I can.

@miparnisari
Copy link
Contributor Author

miparnisari commented Feb 21, 2024

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

The deadlock is proven by the flaky test. (If you run go test with -count=100 you will see it)

Sometimes I get "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel

And sometimes the test times out, it's because one of the lines of code never finishes acquiring the mutex. I forget which line it was

@Baliedge
Copy link
Contributor

@miparnisari Is this PR ready for merge? Not concerned about the go-bench error. But the PR is still in draft mode.

}

assert.True(t, true, isExpectedErr)
Copy link
Contributor Author

@miparnisari miparnisari Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is asserting that true == true 😅 it's also bad practice to do assertions inside goroutines (https://github.com/Antonboom/testifylint#go-require)

Comment on lines -79 to -80
switch err.(type) {
case *gubernator.PeerErr:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not unwrapping errors, so it will never enter the 1st case

peer_client.go Outdated Show resolved Hide resolved
@miparnisari miparnisari changed the title fix mutex deadlocks in PeerClient fix: mutex deadlocks in PeerClient Feb 26, 2024
@miparnisari
Copy link
Contributor Author

miparnisari commented Feb 26, 2024

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck. When I run it in isolation, it passes:

[26/02/24 12:36:35] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestLeakyBucketDivBug -race -count=1 -p=1
ok      github.com/mailgun/gubernator/v2        2.411s
ok      github.com/mailgun/gubernator/v2/cluster        2.014s [no tests to run]
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli     [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck        [no test files]
[26/02/24 12:36:47] ~/GitHub/gubernator (fix-peer-client) $ 

I have the suspicion that TestHealthCheck is setting lastErr and TestLeakyBucketDivBug is regurgitating that.

@miparnisari miparnisari marked this pull request as ready for review February 26, 2024 18:33
@Baliedge
Copy link
Contributor

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck.

Since TestHealthCheck causes a server side shutdown and restart, I wonder if it's not ready in time for the next test.

functional_test.go Outdated Show resolved Hide resolved
functional_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Baliedge Baliedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@Baliedge Baliedge merged commit 63fd9ff into mailgun:master Feb 28, 2024
4 checks passed
@miparnisari
Copy link
Contributor Author

miparnisari commented Feb 29, 2024

@Baliedge make test still fails locally 😭

image

@Baliedge
Copy link
Contributor

@miparnisari I noticed after merging. It does work on retry. If you have an idea how to make it better, let's try it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: TestPeerClientShutdown
4 participants