xds/server: Fix xDS Server leak #7664
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage Diff
              master    #7664      +/-
Coverage      81.79%   81.83%   +0.04%
Files            361      361
Lines          27821    27823       +2
Hits           22757    22770      +13
Misses          3863     3851      -12
Partials        1201     1202       +1
@@ -202,17 +203,17 @@ func (l *listenerWrapper) maybeUpdateFilterChains() {
 	// gracefully shut down with a grace period of 10 minutes for long-lived
 	// RPC's, such that clients will reconnect and have the updated
 	// configuration apply." - A36
-	var connsToClose []*connWrapper
+	var connsToClose map[*connWrapper]bool
 	if l.activeFilterChainManager != nil { // If there is a filter chain manager to clean up.
The implication is that if it's nil then there should be nothing in l.conns? But is there any harm in doing this unconditionally?
I don't think there's any harm in doing this unconditionally. I think the check only skips this block during warm-up, before any active filter chains are present.
I switched this to do it unconditionally. Let me know if you think it looks cleaner, if not I can switch back.
@@ -304,15 +305,15 @@ func (l *listenerWrapper) Accept() (net.Conn, error) {
			return nil, fmt.Errorf("received connection with non-TCP address (local: %T, remote %T)", conn.LocalAddr(), conn.RemoteAddr())
		}

		l.mu.RLock()
Is this field still used as an RWMutex in other places? If not the type should change to a regular mutex.
Ah ok will switch.
		return cw, nil
	}
}

func (l *listenerWrapper) RemoveConn(conn *connWrapper) {
Unexport? I don't think this is here for an interface.
Ah good point I forgot conn wrapper was in same package.
	server := grpc.NewServer(grpc.Creds(insecure.NewCredentials()))
	testgrpc.RegisterTestServiceServer(server, &testService{})
	wg := sync.WaitGroup{}
	wg.Add(1)
Is this necessary? Does Serve block on anything after Stop (not GracefulStop) is called?
No, but I forget what the leak checker requires for goroutines. Stop signals Serve to exit eventually, but without the wait group it's not guaranteed that Serve exits before the test returns. Then again, the wait group doesn't really change that: the goroutine can call wg.Done() and yield, and the test goroutine can return before the spawned goroutine has actually finished. I think you mentioned to me a while back that this is OK with respect to the leak checker.
Deleted this wait group.
Actually, all the functionality you're doing is already implemented in the channel itself for its own reasons:
Serve does the wg.Done when it exits (line 851 in 6f50403):
	s.serveWG.Done()
And Stop does the wg.Wait (way) before it returns (line 1911 in 6f50403):
	s.serveWG.Wait()
(Stop needs Serve to end in order to be sure it can close all the connections the channel created.)
Ah, ok, that's interesting. Noted for next time. I already deleted the wait group and the wait, so I'll go ahead and merge this.
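The handshake described above can be sketched with the standard library alone. This is a toy model, not grpc-go's implementation: toyServer, start, serve, stop and the exited flag are hypothetical names, and only serveWG mirrors the field quoted in the comment. It shows why a test-side wait group adds nothing once stop itself waits for serve to return.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// toyServer is a hypothetical miniature of the Serve/Stop handshake.
type toyServer struct {
	serveWG sync.WaitGroup
	quit    chan struct{}
	exited  atomic.Bool // set by serve just before it returns
}

func newToyServer() *toyServer {
	return &toyServer{quit: make(chan struct{})}
}

// start registers the serve goroutine before launching it, so stop
// cannot call Wait before the corresponding Add has happened.
func (s *toyServer) start() {
	s.serveWG.Add(1)
	go s.serve()
}

// serve blocks until stop is called, then signals that it has finished.
func (s *toyServer) serve() {
	defer s.serveWG.Done()
	<-s.quit
	s.exited.Store(true)
}

// stop tells serve to exit and does not return until it has.
func (s *toyServer) stop() {
	close(s.quit)
	s.serveWG.Wait()
}

func main() {
	s := newToyServer()
	s.start()
	s.stop()
	// serve has already returned here, so no extra wait group is needed.
	fmt.Println(s.exited.Load()) // prints true
}
```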
		t.Fatalf("grpc.NewClient failed with err: %v", err)
	}
	client := testgrpc.NewTestServiceClient(cc)
	if _, err := client.EmptyCall(ctx, &testpb.Empty{}, grpc.WaitForReady(true)); err != nil {
WFR should basically never be needed unless you're expecting transient errors.
Deleted.
	var lenConns int
	for ; ctx.Err() == nil; <-time.After(time.Millisecond) {
		lisWrapper.mu.Lock()
		if lenConns = len(lisWrapper.conns); lenConns == 0 {
			lisWrapper.mu.Unlock()
			break
		}
		lisWrapper.mu.Unlock()
	}
	if ctx.Err() != nil {
		t.Fatalf("timeout waiting for lis wrapper conns to clear, size: %v", lenConns)
	}
Maybe a simplification:
	lenConns := 1
	for ; ctx.Err() == nil && lenConns > 0; <-time.After(time.Millisecond) {
		lisWrapper.mu.Lock()
		lenConns = len(lisWrapper.conns)
		lisWrapper.mu.Unlock()
	}
	if lenConns > 0 {
		t.Fatalf("timeout waiting for lis wrapper conns to clear, size: %v", lenConns)
	}
Switched. Verified it works the same.
Discovered in #7657.
My Dynamic RDS fix added a ref to the server transport in the wrapped connection (https://github.com/grpc/grpc-go/pull/6915/files#diff-dd56a1b7688625b5b70cd616b08c301d12f7f01edbd9be95c506743fd58a6155R140) so that it can Drain correctly even right after Accepting the connection. It also added a ref to the wrapped connection, which only gets cleared when the xDS Server's serving state changes to Not Serving or on filter chain updates (https://github.com/grpc/grpc-go/pull/6915/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R344). In the production environment/test case, the connection closes, but the listener continues to listen and the server continues to serve in a state other than NOT_SERVING, so the ref to the wrapped connection is kept around. The solution is to clear the ref to the wrapped connection when the connection closes.
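The shape of the fix can be sketched as follows. This is a simplified model under assumptions, not the code in this PR: the types reuse the names listenerWrapper, connWrapper and conns from the diff, but the accept, removeConn and numConns helpers, the parent field, and the use of sync.Once are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// listenerWrapper is a simplified stand-in for the xDS listener wrapper.
// It tracks accepted connections so they can be drained on a config update.
type listenerWrapper struct {
	mu    sync.Mutex
	conns map[*connWrapper]bool
}

// connWrapper is a simplified stand-in for the wrapped connection.
type connWrapper struct {
	parent    *listenerWrapper
	closeOnce sync.Once
}

func newListenerWrapper() *listenerWrapper {
	return &listenerWrapper{conns: make(map[*connWrapper]bool)}
}

// accept records a new connection, as Accept would after wrapping a net.Conn.
func (l *listenerWrapper) accept() *connWrapper {
	cw := &connWrapper{parent: l}
	l.mu.Lock()
	l.conns[cw] = true
	l.mu.Unlock()
	return cw
}

// removeConn drops the listener's reference to a closed connection.
func (l *listenerWrapper) removeConn(cw *connWrapper) {
	l.mu.Lock()
	delete(l.conns, cw)
	l.mu.Unlock()
}

// numConns reports how many connections the listener still references.
func (l *listenerWrapper) numConns() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.conns)
}

// Close clears the listener's reference exactly once, even if called repeatedly.
func (c *connWrapper) Close() error {
	c.closeOnce.Do(func() { c.parent.removeConn(c) })
	return nil
}

func main() {
	lw := newListenerWrapper()
	a, b := lw.accept(), lw.accept()
	fmt.Println(lw.numConns()) // prints 2

	a.Close()
	a.Close() // a second Close is a no-op
	b.Close()
	fmt.Println(lw.numConns()) // prints 0: no refs outlive the connections
}
```

Without the removeConn call in Close, the map would keep both entries for as long as the listener stays in a serving state, which is the leak described above.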
RELEASE NOTES: