Expensive range request memory usage optimization to reduce OOM #13458
Comments
@hexfusion @gyuho I would really appreciate it if you could give some comments.
I have done something similar before: https://github.com/wilsonwang371/etcd/tree/range_cache But as I remember, the performance improvement was not significant. You can take a look at it.
@wilsonwang371 Hi Wilson, thanks for the reply, but after taking a look I don't think it's exactly the same thing.
From what I understand, caching the result may not decrease the chance of OOM, since the cache is still holding the data anyway. I think using a cache can only potentially decrease the response time.
@wilsonwang371 Yeah, for a normal cache it works that way. But if you look at the code and the profiling info above, you can see that two parts consume the majority of the memory:
1. Each gRPC response needs its own copy of the response, encoded from the backend range response; this part is immutable on the etcd server side.
2. Each backend range response itself needs to be decoded from the bboltdb backend, so this step also consumes a big part of the memory, but it can be shared between requests.
Simply put, if the slice produced by the line of code above can be shared, this redundant decoding allocation can be saved. You can check my commit above for more details. This is also verified in the profiling below, where that part is now very minimal (only one copy of the decoded response).
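To make the sharing idea concrete, here is a minimal sketch (not the commit above; names such as decodeRangeFromBackend and sharedRange are hypothetical, and request coalescing via golang.org/x/sync/singleflight is just one way to express it): the decode from bboltdb happens once per group of concurrent identical reads, while each caller still encodes its own gRPC response.

```go
package rangeshare

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

// decodedRange stands in for the intermediate data: the key-value pairs
// decoded from the bboltdb backend for one range read.
type decodedRange struct {
	kvs [][]byte
}

// decodeRangeFromBackend is a placeholder for the expensive decode step
// that normally runs once per request.
func decodeRangeFromBackend(startKey, endKey string, rev int64) *decodedRange {
	return &decodedRange{kvs: [][]byte{[]byte("pod-1"), []byte("pod-2")}}
}

var group singleflight.Group

// sharedRange coalesces concurrent identical range reads: only one caller
// performs the decode, and the others reuse the same decoded result. Each
// caller still builds (encodes) its own gRPC response from that result.
func sharedRange(startKey, endKey string, rev int64) *decodedRange {
	key := fmt.Sprintf("%s\x00%s\x00%d", startKey, endKey, rev)
	v, _, _ := group.Do(key, func() (interface{}, error) {
		return decodeRangeFromBackend(startKey, endKey, rev), nil
	})
	return v.(*decodedRange)
}

// Example: three concurrent "list pods" style reads share one decode.
func Example() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = sharedRange("/registry/pods/", "/registry/pods0", 100)
		}()
	}
	wg.Wait()
}
```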
@wilsonwang371 Can you please comment on this idea? Am I making it clear? Thanks.
I took a look at your code change. I think there are a few things we need to pay attention to:
All in all, what I mean is that the Marshal/Unmarshal costs come from protobuf. Using a cache might not be the correct way to solve this issue. Without a better alternative to protobuf, we should stick with it, since it is very deeply embedded in the implementation.
@wilsonwang371 Thanks so much for taking a look.
"Cache" may not be the most accurate wording. In the implementation there is no extra memory usage; it just keeps the response data in memory a bit longer for potential reuse, and the data can still be released if there is no hit. There can be policies for when to enable or disable it. A high cache-miss rate just means the cached data is released earlier.
I'm a bit confused, but this proposal is not trying to make requests faster; it just reuses some memory allocations when there are many repetitive range requests, such as listing pods concurrently. In all, the idea is to limit the impact to expensive-range scenarios and disable it when it is not necessary.
From:
bboltdb -> intermediate memory data -> actual response
bboltdb -> intermediate memory data -> actual response
bboltdb -> intermediate memory data -> actual response
To:
bboltdb -> actual response (reuses the shared intermediate data)
bboltdb -> intermediate memory data -> actual response
bboltdb -> actual response (reuses the shared intermediate data)
So only one copy of the intermediate data is needed instead of three, and no extra memory is allocated; the data is just kept around for a while and released a bit later.
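A rough sketch of the "keep it there for a while and delay the releasing" part (illustrative names like delayedRelease and reap, not code from the linked commits): the decoded intermediate data is held for a short TTL after its last use, so a miss only means the data is freed earlier, never that extra copies are made.

```go
package rangereuse

import (
	"sync"
	"time"
)

// entry holds one decoded intermediate result and when it was last reused.
type entry struct {
	data    [][]byte
	lastUse time.Time
}

// delayedRelease keeps decoded range data around for a short while so that
// repetitive requests can reuse it; unused entries are dropped after ttl.
type delayedRelease struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*entry
}

func newDelayedRelease(ttl time.Duration) *delayedRelease {
	d := &delayedRelease{ttl: ttl, entries: make(map[string]*entry)}
	go d.reap()
	return d
}

// get returns the shared decoded data for key, calling decode only on a miss.
func (d *delayedRelease) get(key string, decode func() [][]byte) [][]byte {
	d.mu.Lock()
	defer d.mu.Unlock()
	if e, ok := d.entries[key]; ok {
		e.lastUse = time.Now() // a hit keeps the data alive a bit longer
		return e.data
	}
	e := &entry{data: decode(), lastUse: time.Now()}
	d.entries[key] = e
	return e.data
}

// reap periodically releases entries that have not been hit within ttl,
// letting the garbage collector reclaim the intermediate data.
func (d *delayedRelease) reap() {
	for range time.Tick(d.ttl) {
		d.mu.Lock()
		for k, e := range d.entries {
			if time.Since(e.lastUse) > d.ttl {
				delete(d.entries, k)
			}
		}
		d.mu.Unlock()
	}
}
```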
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
When the etcd server receives highly concurrent range requests, e.g. listing all pods repetitively, memory usage increases quickly and there is a chance of OOM. It's a similar issue to this and this.
Some proposals use range streams or rate limiting to solve this in the long term, but there has been no progress due to the complexity.
So basically, each range request needs to allocate memory and decode data from bboltdb, then transform that in-memory data into the actual gRPC response. With, say, 100 concurrent "list pods" requests, the server needs 100 copies of the intermediate data and 100 actual gRPC responses. The 100 response copies are necessary, but the 100 intermediate copies in memory can be reduced to a single shared copy.
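To make the scale concrete (the numbers here are hypothetical, not measured): if one decoded pod list occupies around 100 MB, then 100 concurrent list requests allocate roughly 100 × 100 MB ≈ 10 GB of intermediate data on top of the 100 encoded responses, whereas sharing a single decoded copy keeps the intermediate part at about 100 MB.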
Test profiling below also shows that two parts consume the most memory:
Intermediate data: each response from the backend needs to allocate memory for each revision's value read from bboltdb.
Actual response: each gRPC response itself needs to encode the backend response into the gRPC stream.
The response part can't be optimized on the etcd server side unless we use a proxy or an apiserver-side cache to reduce the number of requests reaching the server. This is possible but needs quorum-read support, which is out of the scope of this issue.
The intermediate part, however, is redundant for repetitive requests. For example, listing all pods at the same time only needs one copy of the revision list's response from bboltdb.
So one naive idea is to "cache" these frequent range responses for reuse.
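One detail worth spelling out is when sharing is actually safe: a decoded result can only be reused by requests that read the same range at the same store revision. A minimal sketch of such a key, with illustrative field names (not etcd's actual types):

```go
package rangecache

// rangeCacheKey identifies reads whose decoded results are identical and
// therefore safe to share: the same key range, limit, and store revision.
type rangeCacheKey struct {
	startKey string
	endKey   string
	limit    int64
	rev      int64 // same revision means the same MVCC snapshot of the data
}

// shared maps a key to the one decoded copy that repetitive requests reuse.
var shared = make(map[rangeCacheKey][][]byte)
```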
An initial test has verified the idea: about 50% of memory can be saved for these repetitive range requests.
More details, such as the cache eviction policy, need to be explored, but this should explain the idea.
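For the eviction side, the sample commit linked below uses an LFU cache; purely as an illustration (not the code in that commit), a least-frequently-used policy can be as simple as counting hits and dropping the coldest entry when the cache is full:

```go
package lfucache

// lfuEntry holds one cached value and how often it has been hit.
type lfuEntry struct {
	value []byte
	hits  int
}

// LFUCache is a minimal, non-thread-safe least-frequently-used cache.
type LFUCache struct {
	capacity int
	entries  map[string]*lfuEntry
}

func New(capacity int) *LFUCache {
	return &LFUCache{capacity: capacity, entries: make(map[string]*lfuEntry)}
}

// Get returns the cached value and bumps its hit count.
func (c *LFUCache) Get(key string) ([]byte, bool) {
	e, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	e.hits++
	return e.value, true
}

// Put inserts a value, evicting the least-frequently-used entry if full.
func (c *LFUCache) Put(key string, value []byte) {
	if _, ok := c.entries[key]; !ok && len(c.entries) >= c.capacity {
		c.evict()
	}
	c.entries[key] = &lfuEntry{value: value}
}

// evict removes the entry with the fewest hits; a linear scan keeps the
// sketch simple at the cost of O(n) eviction.
func (c *LFUCache) evict() {
	var victim string
	minHits := -1
	for k, e := range c.entries {
		if minHits == -1 || e.hits < minHits {
			minHits = e.hits
			victim = k
		}
	}
	delete(c.entries, victim)
}
```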
This may be a naive idea with many more details to discuss, but I would really appreciate any feedback while I'm running my own tests. Thanks!
Simple cache POC Commit.
LFU cache sample commit