Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport: reuse memory on read path #1455

Closed
MakMukhi opened this issue Aug 22, 2017 · 52 comments
Closed

transport: reuse memory on read path #1455

MakMukhi opened this issue Aug 22, 2017 · 52 comments
Labels
P2 Type: Performance Performance improvements (CPU, network, memory, etc)

Comments

@MakMukhi
Copy link
Contributor

One possible idea is to use a leaky buffer.

@MakMukhi MakMukhi added the Type: Performance Performance improvements (CPU, network, memory, etc) label Aug 22, 2017
@dfawley
Copy link
Member

dfawley commented Aug 22, 2017

Specific areas where we would like to do this:

@muirdm
Copy link
Contributor

muirdm commented Dec 2, 2017

Is someone actively working on this? I've been playing around with different buffer pool strategies, but don't want to look too much more if I'm just duplicating work.

I've been trying bucketed []byte buffer pool where there are tiers of buffer sizes. For example, if you ask for a 700 byte buffer, it might pull a buffer from the 1024 byte pool, slice it down to 700 bytes and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size bucket) and a leaky buffer implementation (one buffered chan []byte per buffer size bucket). They certainly help, but neither one has plucked all the low-hanging garbage fruit on the first try.

@awalterschulze
Copy link

awalterschulze commented Dec 2, 2017 via email

@awalterschulze
Copy link

awalterschulze commented Dec 2, 2017 via email

@muirdm
Copy link
Contributor

muirdm commented Dec 2, 2017

There is notable garbage produced outside of proto marshaling in the http2 transport, so I was hoping to address that too. I was looking at more generic []byte buffer reuse since it could be used by the existing codec, and used by the http2 transport.

How is pooling the codec better than just pooling the []byte slice? The existing codec could pull its []byte from []byte pool, and also implement similar gogoproto optimized logic taking advantage of Size() and MarshalTo() as appropriate.

@awalterschulze
Copy link

Thats interesting. I didn't think that the http2 transport could also benefit from buffer pooling. At the previous company I worked, we had a similar buffer pooling scheme for a different purpose. The API asks for a buffer size and returns that size of byte buffer for you. It does this by looking through its slice of buffers for a big enough buffer and then returning the resized to your smaller size. When buffers are returned they are stretched back to their original capacity.
This buffer pool also had retention policies, which were a bit more complex, but implemented orthogonally with the buffer pool.

@dfawley
Copy link
Member

dfawley commented Dec 2, 2017

My thinking for the codec was to add an optional method to it that, if implemented, would be called when the buffer it returns from Marshal is no longer needed by grpc. Then the codec could re-use that memory. Without something like this, it's impossible for the codec to safely reuse old memory buffers.

The http2 transport uses byte buffers to hold data it reads from the network; this can and should be re-used with a pool as well.

@muirdm
Copy link
Contributor

muirdm commented Dec 2, 2017

My thought was instead of returning the buffer to the codec, return the buffer to the generic buffer pool instead. Then anyone can pull a buffer from the pool, and consumers of the buffers can put them back in the pool when they are done. It is safe to never return a buffer, or even to return a buffer that you did not get from the pool originally. On the other hand it is very unsafe to return a buffer and then still use it, or a slice of it. We would need some approach to make ourselves confident that use-after-release doesn't happen.

@awalterschulze
Copy link

awalterschulze commented Dec 3, 2017 via email

@MakMukhi
Copy link
Contributor Author

MakMukhi commented Dec 4, 2017

The lowest hanging fruit, IMHO, is to reuse the memory buffer allocated to copy bytes from data frame, as @dfawley suggested.

data := make([]byte, len(f.Data()))

This buffer can be recycled when the application has read it and copied to its preallocated chunck to be used for unmarshling.

r.last = m.data[copied:]

We can have a few sync.Pool with different sizes increasing in say, powers of 4, and capped to the max size of http2MaxFrameLen. Also it'd be great if the API is such that the base(4 in this example) is configurable so we can try different values with our benchmarks.

One problem with sync.Pool is that it drops all the memory during every GC cycle. In following efforts we can work with the Go team to resolve that. Using a leaky buffer comes with the onus of doing our own GC cycles for that memory. I don't think that's a good idea.

Things that C++ and Java do don't necessarily apply to Go.
C-core takes a user configurable parameter to cap the max amount of memory to be used by the binary. They start releasing memory as pressure builds up. So, it's essentially a library specific GC.

Java, IIRC, allocates memory from the OS space and keeps reusing that without ever freeing it again. This memory is not freed by the GC either because it's outside of JVM's jurisdiction.

I personally think, in gRPC-Go, we shouldn't go about allocating a big a chunk and doing our memory management on that. We should at least give sync.Pool a chance and see if that can be optimized to fit our use case.

@awalterschulze
Copy link

Fair enough. It would be great to compare sync.Pool with classic buffer reuse.

@Zeymo
Copy link
Contributor

Zeymo commented Dec 21, 2017

sync.Pool is not enough as follow issue say, golang/go#23199 . Unbound capacity buffer will cause worse GC or memory overhead . Maybe chunked capacity pool/buffer would be better

@MakMukhi
Copy link
Contributor Author

@Zeymo Thanks for the suggestions. We are aware of sync.Pool's behavior to not return the most fitting buffer. In the suggestion above I mention using buckets of sync.Pool with different sizes.
However, as an exercise it may be interesting to know how would a single sync.Pool perform for the transport layer since we know no single buffer can be greater than http2MaxFrameLen which is 16K for now.

@coocood
Copy link
Contributor

coocood commented Dec 23, 2017

@MakMukhi

I've got another idea to reuse the buffer for stream.

  1. Add a buf field in parser
type parser struct {
	r io.Reader
	header [5]byte
        buf []byte
}
  1. Make the buf if the length is greater than buf length
if length > cap(p.buf) {
        p.buf = make([]byte, int(length))
}
msg = p.buf[:length]

What do you think?

@xenoscopic
Copy link

xenoscopic commented Dec 24, 2017

@coocood

For your second code snippet, I think you'd actually want to compare against the buffer's capacity rather than length. You can actually slice up to the capacity of a buffer, not just the length. The gob package does something very similar:

https://github.com/golang/go/blob/f3f507b2d86bfb38c7e466a465d5b6463cfd4184/src/encoding/gob/decode.go#L64

That way, if you need buffers of size X, X-1, and then X, you aren't allocating twice.

@coocood
Copy link
Contributor

coocood commented Dec 24, 2017

@havoc-io
snippet updated, thanks.

@MakMukhi
Copy link
Contributor Author

@coocood This actually seems like a good idea to reuse buffer on the gRPC layer (perhaps a variation of this might work on the transport layer as well). It's simple, doesn't require global shared pools. However, the only thing that worries me is the case when we have long lived streams with intermittent reading/writing activity of large messages (say >=1MB). In such a case, this stream will carry a large buffer for its life. Now if there were thousands of such streams we're in serious trouble. Perhaps, this problem can be solved by freeing this buffer every once in a while from the stream monitor loop here:

go func() {

@coocood
Copy link
Contributor

coocood commented Dec 28, 2017

@MakMukhi
How about just adding a CallOption to disable the buffer reuse for the extreme use case?
Freeing a buffer by another goroutine adds a lot of complexity.

@MakMukhi
Copy link
Contributor Author

@coocood I like that idea too. You're right accessing the buffer from two different goroutines would require use of locks which is an additional performance overhead. CallOption seems like a good way to go about it.
I'm not actively working on memory reuse currently, though it's high priority on my list of things to do. If you want to take a stab at it, go ahead. Otherwise, hopefully, I'll get to it Q1 2018.
Also, thanks for these great ideas. :)

@coocood
Copy link
Contributor

coocood commented Dec 28, 2017

@MakMukhi
Great!
I'll send a PR in a few days.

coocood added a commit to coocood/grpc-go that referenced this issue Dec 31, 2017
coocood added a commit to coocood/grpc-go that referenced this issue Dec 31, 2017
@CAFxX
Copy link

CAFxX commented Apr 28, 2019

Just a note for @coocood: in #1774 (comment) you wrote:

I think it's very unlikely a client keeps long-lived connections to thousands of servers.

While I agree it's rare, this is exactly the case here (the API gateway the flamegraph is from is a "client" for a few hundred backends, and it keeps long-lived connections to them)

@canguler
Copy link

@dfawley - Change at #1774 is still applicable. If we get an unmarshalv implementation for proto, then it will no longer be applicable.

@canguler
Copy link

I am experimenting with the change at #1774. Even if we were to get an unmarshalv implementation in near future, I think this change (#1774) is small enough. So, if it gives an improvement without any regressions, we may still go ahead and take it.

@coocood
Copy link
Contributor

coocood commented May 1, 2019

@canguler @dfawley

I think UnmarshalV is not an ideal solution, the biggest problem is that the generated Unmarshal code cannot be used. It will be much slower for complex proto. If we generate UnmarshalV code for each type, that is too expensive for the existing projects.

@CAFxX
Copy link

CAFxX commented May 15, 2019

@dfawley do you need me looking into anything else? I'm not 100% sure though this issue is the best place to discuss the problem we're seeing as, although related, as you mentioned they're likely different issues.

@dfawley
Copy link
Member

dfawley commented May 15, 2019

@CAFxX - could you file a separate issue for the problems you're encountering, please?

In terms of additional information to gather, it would be interesting to know how often connections are being created in your setup. It seems like you have not only a large pool of connections, but new connections are also being created and destroyed frequently - is this the case? Also, I don't see any raw numbers in the flame graph - what is the total amount of allocations you are seeing (cumulative and per-minute/etc)?

@dfawley dfawley changed the title Reuse memory by caching buffers. transport: reuse memory on read path May 15, 2019
@dfawley
Copy link
Member

dfawley commented May 15, 2019

Narrowing the focus of this bug to a single item. Filed #2816 and #2817 to track other memory reuse areas.

@alextomaili
Copy link

How about to use sync.Pool to reuse memory buffers which creates by recvMsg method of parser to read gRPC message from i/o layer. In my project this is about 5% of total memory allocations.
I've created pool request #3220 to do this with minimal changes of code.

@grpc grpc deleted a comment from stale bot Dec 3, 2019
@kevinconaway
Copy link

@MakMukhi are there any plans to continue with this work ?

@MakMukhi
Copy link
Contributor Author

MakMukhi commented Jan 8, 2021

@dfawley should be able to help with prioritization/resource allocation here.

@dfawley
Copy link
Member

dfawley commented Jan 8, 2021

We don't have resources to focus on performance improvements at this time. This is unlikely to be prioritized for at least the next 3-6 months. If someone is interested in contributing, we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

@kevinconaway
Copy link

@dfawley Are you able to detail why #1987 was reverted? I saw a vague reference to security objections but it wasn't clear what they were and how a different approach might avoid them.

@dfawley
Copy link
Member

dfawley commented Jan 8, 2021

If this is the one I'm thinking of, it allowed one side to force potentially large allocations on the other side without actually sending a similar amount of data. I.e. the sender declares it will send 1GB of data on a stream; the receiver makes the allocation; then the sender doesn't send anything further on that stream. This could very quickly OOM the peer (across many streams).

@kevinconaway
Copy link

we would be willing to accept PRs if the author ensures that thorough performance benchmarking is done and shows clear wins.

From reading this issue and related ones, it seems there is a tension between users who have mostly small message types and do not (or may not) use compression and those who have much larger message types. It sounds like the former group does not need, and may actively be harmed by, adding any type of pooling or memory reuse while the latter group would see a substantial benefit from it.

If that is accurate, it doesn't sound like there is a one size fits all solution here. What are your thoughts on being able to add a type via a DialOption or CallOption that allows plugging in a type to help manage the memory used in the recv workflow? Based on my limited analysis, it looks like there are two places where we could reuse memory, 1.) reading the message from the transport stream 2.) decompressing the data from 1. in to a separate buffer for codec unmarshaling.

Perhaps something like:

type RecvBuffers interface {
    GetReadBuffer(size int) *bytes.Buffer
    ReturnReadBuffer(b *bytes.Buffer)

    GetDecompressionBuffer(size int) *bytes.Buffer
    ReturnDecompressionBuffer(b *bytes.Buffer)
}

Implementers could decide to return a pooled buffer based on the requested size (or not). The default implementation could behave as the recv flow does now by allocating the requested memory each time.

Assuming this could be implemented in a way that doesn't harm the "out of the box" performance characteristics, this would allow users to control their own memory usage.

Thoughts?

@kevinconaway
Copy link

@dfawley Thoughts on the above? I'm happy to attempt this but I don't want to go off on the wrong path

@dfawley
Copy link
Member

dfawley commented Jan 26, 2021

Sorry for the delay. This is a pretty involved issue, and it's been a year+ since I've really thought about it.

I don't like the sound of injecting custom memory re-use logic into grpc. If it needs to be manually configured, there should be one or two variables that can be tweaked instead. But ideally we can find a way to make things work well out of the box for 99%+ of users.

@alanprot
Copy link

alanprot commented Mar 15, 2023

Hi, is this still in consideration?

We just did some tests on the cortex project and we could see a significant reduction on cpu and memory utilization (~15%) on some of our high tps components.

On our tests we just created n sync.pools that holds slices with size 2^x where x grow from an initial value to n (something like https://github.com/cortexproject/cortex/blob/4d433cccf8db2080d6032b811dde4391aa919710/pkg/cortexpb/slicesPool.go#L22-L33)

Is this an approach we could take here?

@alanprot
Copy link

alanprot commented Jun 8, 2023

Seems this is being worked on: #5862 for streamed messages.
It may be possible to extend after for UnaryRPC.

@dfawley
Copy link
Member

dfawley commented Sep 18, 2023

I'm going to close this and #6578 can be used to track follow-ups of #5862, with #2816 and #2817 covering the codec buffer reuse.

@dfawley dfawley closed this as completed Sep 18, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2 Type: Performance Performance improvements (CPU, network, memory, etc)
Projects
None yet
Development

No branches or pull requests