Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server-side goroutine leak on receive message error #142

Closed
austinvazquez opened this issue May 1, 2023 · 5 comments · Fixed by #141
Closed

Server-side goroutine leak on receive message error #142

austinvazquez opened this issue May 1, 2023 · 5 comments · Fixed by #141

Comments

@austinvazquez
Copy link
Contributor

austinvazquez commented May 1, 2023

Hey folks, while tracking down a production error I inadvertently found a rare error in the receive message logic for server connections.

To set some context the receive message logic relies on two goroutines working together.
goroutine A: https://github.com/containerd/ttrpc/blob/main/server.go#L136
goroutine B: https://github.com/containerd/ttrpc/blob/main/server.go#L369-L487

A [built-in Go] error channel is used for communication of errors between goroutines A and B.
https://github.com/containerd/ttrpc/blob/main/server.go#L339

The first thing to note is on message header read error the channel returns the underlying error to goroutine B. Based on the current implementation, these errors can only be from io.ReadFull. io.ReadFull can return io.EOF or io.ErrUnexpectedEOF. Reference: https://pkg.go.dev/io#ReadFull
https://github.com/containerd/ttrpc/blob/main/channel.go#L73-L76

This is fine because goroutine A has error handling to check for these errors and stop execution if they occur.
https://github.com/containerd/ttrpc/blob/main/server.go#L545-L555

The issue begins when considering errors on message read by goroutine B. In this branch, goroutine B has successfully read a message header for a message with some length > 0; however, when reading the message an error has occurred by io.ReadFull either io.EOF or io.ErrUnexpectedEOF.

What I have found is for these errors we wrap the error with more information to clarify the error (opposed to on message header read) occurred on message read.
https://github.com/containerd/ttrpc/blob/main/channel.go#L138

However, goroutine A's error handling logic is not currently setup to check for wrapped errors.
https://github.com/containerd/ttrpc/blob/main/server.go#L550-L554

So the effect is goroutine B is stopped due to a [non-gRPC] receive error, but goroutine A is never stopped resulting in leaked goroutine and server connection.
Stop goroutine B on receive error: https://github.com/containerd/ttrpc/blob/main/server.go#L380-L385
Again (currently) only stop goroutine A on some receive errors: https://github.com/containerd/ttrpc/blob/main/server.go#L550-L554

@austinvazquez
Copy link
Contributor Author

#141 is an attempt to resolve by stopping goroutine A on any receive error since goroutine B stops regardless of the error message it sends to A.

@dmcgowan
Copy link
Member

dmcgowan commented May 1, 2023

I inadvertently found a rare error in the receive message logic for server connections

What is the error you saw?

@austinvazquez
Copy link
Contributor Author

I am seeing this error message in my shim logs.

level=error msg=\\\"error receiving message\\\" error=\\\"failed reading message: unexpected EOF\\\"\"

I do not believe ttrpc is the root cause, but as I was exploring the branch of code after message read fails I found the issue above.

@dmcgowan
Copy link
Member

dmcgowan commented May 2, 2023

I think it first makes sense to get a change in to fix the unwrapping, that is an obvious bug to fix. The larger change of always returning needs a little more consideration, there could be errors we don't want to immediately return on (such as context cancellation).

@austinvazquez
Copy link
Contributor Author

Okay yeah that would be the safer route. Also I'd like to spend some time and brainstorm how to write some unit tests for this edge case. Let me update the PR for just the unwrapping and go from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants