
grpcfd produces memory leaks #25

Closed
denis-tingaikin opened this issue May 7, 2024 · 8 comments · Fixed by #30

@denis-tingaikin
Collaborator

Motivation

While using grpcfd, we found goroutine leaks that could produce memory leaks in the applications that use it.


Solution

  1. Add tests. The repository doesn't have any, which looks critical.
  2. Change the interfaces

https://github.com/edwarnicke/grpcfd/blob/main/connwrap.go#L44-L56

type FDSender interface {
	SendFD(fd uintptr) <-chan error
	SendFile(file SyscallConn) <-chan error
	SendFilename(filename string) <-chan error
}

type FDRecver interface {
	RecvFD(dev, inode uint64) <-chan uintptr
	RecvFile(dev, ino uint64) <-chan *os.File
	RecvFileByURL(urlStr string) (<-chan *os.File, error)
	RecvFDByURL(urlStr string) (<-chan uintptr, error)
}

These interfaces can produce memory leaks because they don't use a context for canceling operations.

We suggest changing them to:

type FDSender interface {
	SendFD(ctx context.Context, fd uintptr) <-chan error
	SendFile(ctx context.Context, file SyscallConn) <-chan error
	SendFilename(ctx context.Context, filename string) <-chan error
}

type FDRecver interface {
	RecvFD(ctx context.Context, dev, inode uint64) <-chan uintptr
	RecvFile(ctx context.Context, dev, ino uint64) <-chan *os.File
	RecvFileByURL(ctx context.Context, urlStr string) (<-chan *os.File, error)
	RecvFDByURL(ctx context.Context, urlStr string) (<-chan uintptr, error)
}

Using a context lets pending operations be canceled and prevents these edge cases in grpcfd.
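
For illustration, a caller-side wait against the proposed context-aware FDSender could look like the sketch below (waitSend is a hypothetical helper written for this issue; only FDSender and SendFD come from the proposal above, and the context package import is assumed):

func waitSend(ctx context.Context, sender FDSender, fd uintptr) error {
	select {
	case err, ok := <-sender.SendFD(ctx, fd):
		if !ok {
			return nil // channel closed without an error: send completed
		}
		return err
	case <-ctx.Done():
		// With a context-aware implementation, the same cancellation can be
		// observed inside grpcfd, so the goroutine and any stored channel
		// can be released instead of leaking.
		return ctx.Err()
	}
}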

Related to networkservicemesh/cmd-nsmgr#675

@denis-tingaikin
Collaborator Author

Profile: goroutinelimit1.zip

@edwarnicke
Owner

edwarnicke commented May 8, 2024

Some things to keep in mind with respect to resource management in grpcfd. We have three goroutines, each following the forwarding pattern sketched after this list:

  1. One in RecvFile, which loops until the <-chan uintptr returned from RecvFD is closed.
  2. One in SendFile, which loops until the <-chan error returned by SendFD is closed.
  3. One in SendFilename, which loops until the <-chan error returned by SendFile is closed.
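
Each of these goroutines has roughly the following shape (shown for the error-channel case; the RecvFile variant forwards uintptr values instead; innerCh and outerCh are placeholders, and this is a sketch, not the literal grpcfd source):

go func(in <-chan error, out chan<- error) {
	// Exits only when `in` is closed, so whoever closes `in`
	// controls the lifetime of this goroutine.
	for err := range in {
		out <- err
	}
	close(out)
}(innerCh, outerCh)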

Goroutine in RecvFile

The goroutine in RecvFile loops until the <-chan uintptr returned from RecvFD is closed.

RecvFD creates the fdCh it returns. In connWrap.recvExecutor.AsyncExec, either we already have an fd for the requested (dev, ino), in which case we see the fdCh closed, or we don't, in which case we store fdCh in connWrap.recvFDChans.
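
Roughly, that decision has the following shape (the recvFDs lookup map and the key type are assumptions added for illustration; recvExecutor, recvFDChans, and fdCh are the names used above, and the rest is not the exact grpcfd source):

fdCh := make(chan uintptr, 1)
w.recvExecutor.AsyncExec(func() {
	key := fdKey{dev: dev, ino: ino}  // assumed key type
	if fd, ok := w.recvFDs[key]; ok { // fd for (dev, ino) already received
		fdCh <- fd
		close(fdCh)                   // caller sees a closed channel
		return
	}
	w.recvFDChans[key] = fdCh         // otherwise wait for Read to deliver it
})
return fdCh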

This leaves us with two questions:

  1. When is the fdCh stored in connWrap.recvFDChans closed?
  2. Could AsyncExec be deadlocking?

When is the fdCh stored in connWrap.recvFDChans closed?

In connWrap.Read, if a read results in us receiving an fd, we iterate through connWrap.recvFDChans to see if there's an fdCh waiting to receive it. If so, we send the fd and close the fdCh. So if we receive an fd for the (dev, ino) after it has been asked for, the fdCh should be closed, and the corresponding goroutine(s) should exit and not leak.

If the connWrap, as a net.Conn, is Closed or garbage collected, we call connWrap.close and loop through connWrap.recvFDChans, closing all the fdChs, which causes all corresponding goroutines to exit.
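
Roughly, that cleanup path has the following shape (a sketch of the behavior described above, assuming recvFDChans is a map of pending channels keyed by (dev, ino); not the exact grpcfd source):

// Called from Close, and from the finalizer when the connWrap is garbage collected.
func (w *connWrap) close() {
	for key, fdCh := range w.recvFDChans {
		close(fdCh)                // wakes and terminates any goroutine still waiting on fdCh
		delete(w.recvFDChans, key)
	}
}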

Net-net: all goroutines originating in RecvFile should exit once the connWrap, as a net.Conn, is Closed or garbage collected, unless AsyncExec is deadlocking or something really strange is happening. The expectation should therefore be that no goroutine is leaked past the Close or garbage-collection point.

@denis-tingaikin
Collaborator Author

New incident reported by Szilard:

nsmgr_goroutineprofiles_20240519215928.tar.gz

@denis-tingaikin
Collaborator Author


@denis-tingaikin
Collaborator Author

@edwarnicke

I've found a suspect place, please have a look:

  1. here an errCh is created with size 1
    errCh := make(chan error, 1)
  2. here a goroutine is created that can write to the errCh
    go func(errChIn <-chan error, errChOut chan<- error) {
  3. and if we get an error from
    err = raw.Control(func(fd uintptr) {

    then we can get a leak (sketched below)
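
For illustration, the suspected pattern boils down to the sketch below (simplified, with hypothetical names such as innerErrCh; not the actual grpcfd code). The single buffer slot of errCh can already be taken by the direct write on the error path, so the forwarding goroutine either blocks forever on its send, leaking the goroutine, or panics if the channel was already closed:

errCh := make(chan error, 1)

// Forwarding goroutine: copies errors from the inner send into errCh.
go func(errChIn <-chan error, errChOut chan<- error) {
	for err := range errChIn {
		// If errChOut's single buffer slot is already full and nobody reads it,
		// this send blocks forever and the goroutine leaks; if errChOut was
		// closed, it panics instead.
		errChOut <- err
	}
}(innerErrCh, errCh)

// Error path from raw.Control: writes to the same channel directly.
if err != nil {
	errCh <- err // takes the only buffer slot
	close(errCh)
}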

@edwarnicke
Owner

Good catch. I think upping the buffers to 50 is likely overkill... but I am fine with it if that's how you'd like to address things :)

@denis-tingaikin
Collaborator Author

I think upping the buffers to 50 is likely overkill...

Totally agree. Currently, it's needed for testing. We could quickly apply this workaround just for the patch release (v1.13.1), but first we need to get the results.

@denis-tingaikin changed the title from "grpcfd could produce memory leaks" to "grpcfd produces memory leaks" on May 21, 2024
@denis-tingaikin
Collaborator Author

Good catch. I think upping the buffers to 50 is likely overkill. 

Hmm, I could also be wrong, because in that case we should see a panic from sending on a closed channel. So, most likely, it's deadlocks in executors or cached connections that are not closed, as you mentioned above.
