
xds/server: Fix xDS Server leak #7664

Merged
merged 6 commits into from
Sep 27, 2024
Conversation


@zasweq zasweq commented Sep 23, 2024

Discovered in #7657.

My dynamic RDS fix added a ref to the server transport in the wrapped connection (https://github.com/grpc/grpc-go/pull/6915/files#diff-dd56a1b7688625b5b70cd616b08c301d12f7f01edbd9be95c506743fd58a6155R140) so that Drain works correctly even immediately after the connection is Accepted. It also added a ref from the listener wrapper to the wrapped connection, which was only cleared when the xDS server's serving state changed to NOT_SERVING or on a filter chain update (https://github.com/grpc/grpc-go/pull/6915/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R344). In the production environment/test case, the connection closes but the listener keeps listening and the server keeps serving in a state other than NOT_SERVING, so the ref to the wrapped connection is never dropped. The fix is to clear the ref to the wrapped connection when the connection closes.

RELEASE NOTES:

  • xds/server: Fix a memory leak that occurred when connections closed while the server remained serving but received no new LDS configuration

@zasweq zasweq requested a review from dfawley September 23, 2024 19:21
@zasweq zasweq added this to the 1.68 Release milestone Sep 23, 2024

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.83%. Comparing base (bcf9171) to head (d73fba0).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/server/listener_wrapper.go 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7664      +/-   ##
==========================================
+ Coverage   81.79%   81.83%   +0.04%     
==========================================
  Files         361      361              
  Lines       27821    27823       +2     
==========================================
+ Hits        22757    22770      +13     
+ Misses       3863     3851      -12     
- Partials     1201     1202       +1     
Files with missing lines Coverage Δ
xds/internal/server/conn_wrapper.go 77.04% <100.00%> (-4.62%) ⬇️
xds/internal/server/listener_wrapper.go 79.01% <93.33%> (+0.39%) ⬆️

... and 19 files with indirect coverage changes

@@ -202,17 +203,17 @@ func (l *listenerWrapper) maybeUpdateFilterChains() {
// gracefully shut down with a grace period of 10 minutes for long-lived
// RPC's, such that clients will reconnect and have the updated
// configuration apply." - A36
var connsToClose []*connWrapper
var connsToClose map[*connWrapper]bool
if l.activeFilterChainManager != nil { // If there is a filter chain manager to clean up.
Member:

The implication is that if it's nil then there should be nothing in l.conns?

But is there any harm in doing this unconditionally?

zasweq (Contributor Author) commented Sep 25, 2024:

I don't think there's any harm in doing this unconditionally; I think the check only skips this block during warm-up, before any active filter chains are present.

Contributor Author:

I switched this to do it unconditionally. Let me know if you think it looks cleaner, if not I can switch back.

@@ -304,15 +305,15 @@ func (l *listenerWrapper) Accept() (net.Conn, error) {
return nil, fmt.Errorf("received connection with non-TCP address (local: %T, remote %T)", conn.LocalAddr(), conn.RemoteAddr())
}

l.mu.RLock()
Member:

Is this field still used as an RWMutex in other places? If not the type should change to a regular mutex.

Contributor Author:

Ah ok will switch.

return cw, nil
}
}

func (l *listenerWrapper) RemoveConn(conn *connWrapper) {
Member:

Unexport? I don't think this is here for an interface.

Contributor Author:

Ah good point I forgot conn wrapper was in same package.

server := grpc.NewServer(grpc.Creds(insecure.NewCredentials()))
testgrpc.RegisterTestServiceServer(server, &testService{})
wg := sync.WaitGroup{}
wg.Add(1)
Member:

Is this necessary?

Does Serve block on anything after Stop (not GracefulStop) is called?

Contributor Author:

No, but I forget the testing requirement for goroutines not leaking. The Stop signal causes Serve to eventually exit, but it's not guaranteed to exit before the test goroutine returns. Although there isn't really a distinction, since wg.Done() can execute and the testing goroutine can return before the spawned goroutine finishes; I think you mentioned a while back that this is ok with respect to the leak checker.

Contributor Author:

Deleted this wait group.

Member:

Actually, all the functionality you're doing is already implemented in the channel itself for its own reasons:

Serve does the wg.Done when it exits:

s.serveWG.Done()

And Stop does the wg.Wait (way) before it returns (grpc-go/server.go, line 1911 at 6f50403):

s.serveWG.Wait()

(Stop needs Serve to end in order to be sure it can close all the connections the channel created.)

Contributor Author:

Ah, ok, that's interesting. Noted for next time. I already deleted the wait group and the wait, so I'll go ahead and merge this.

t.Fatalf("grpc.NewClient failed with err: %v", err)
}
client := testgrpc.NewTestServiceClient(cc)
if _, err := client.EmptyCall(ctx, &testpb.Empty{}, grpc.WaitForReady(true)); err != nil {
Member:

WFR should basically never be needed unless you're expecting transient errors.

Contributor Author:

Deleted.

Comment on lines 250 to 261
var lenConns int
for ; ctx.Err() == nil; <-time.After(time.Millisecond) {
lisWrapper.mu.Lock()
if lenConns = len(lisWrapper.conns); lenConns == 0 {
lisWrapper.mu.Unlock()
break
}
lisWrapper.mu.Unlock()
}
if ctx.Err() != nil {
t.Fatalf("timeout waiting for lis wrapper conns to clear, size: %v", lenConns)
}
Member:

Maybe a simplification:

Suggested change
lenConns := 1
for ; ctx.Err() == nil && lenConns > 0; <-time.After(time.Millisecond) {
lisWrapper.mu.Lock()
lenConns = len(lisWrapper.conns)
lisWrapper.mu.Unlock()
}
if lenConns > 0 {
t.Fatalf("timeout waiting for lis wrapper conns to clear, size: %v", lenConns)
}

Contributor Author:

Switched. Verified it works the same.

@dfawley dfawley assigned zasweq and unassigned dfawley Sep 25, 2024
@zasweq zasweq assigned dfawley and unassigned zasweq Sep 26, 2024
@dfawley dfawley assigned zasweq and unassigned dfawley Sep 27, 2024
@zasweq zasweq merged commit 941102b into grpc:master Sep 27, 2024
14 checks passed
arjan-bal pushed a commit to arjan-bal/grpc-go that referenced this pull request Sep 30, 2024
arjan-bal pushed a commit to arjan-bal/grpc-go that referenced this pull request Sep 30, 2024
purnesh42H pushed a commit to purnesh42H/grpc-go that referenced this pull request Sep 30, 2024
purnesh42H pushed a commit to purnesh42H/grpc-go that referenced this pull request Oct 1, 2024
purnesh42H added a commit that referenced this pull request Oct 2, 2024
Co-authored-by: Zach Reyes <39203661+zasweq@users.noreply.github.com>