Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http.send possible memory leak? #5381

Closed
lukyer opened this issue Nov 11, 2022 · 18 comments
Closed

http.send possible memory leak? #5381

lukyer opened this issue Nov 11, 2022 · 18 comments

Comments

@lukyer
Copy link
Contributor

lukyer commented Nov 11, 2022

Short description

Our policy sends many cached http.send requests with low max-age 10s. OPA server is configured with 100M max cache size:

caching:
  inter_query_builtin_cache:
    max_size_bytes: 100000000

I noticed that memory usage reported by OPA go_memstats_alloc_bytes looks like this:
image

It is similar to what k8s reports to us. Also gc reports from GODEBUG are similar.

It doesn't look like the max cache size has any effect and eventually the OPA pod will crash on OOM.
We do not experience any errors, all works as expected only the memory is still growing until pod gets killed.

Http send looks like:

    res = http.send({"method": "get",
                     "url": url,
                     "timeout": "10s",
                     "force_json_decode": true,
                     "cache": true,
                     "raise_error": false,
                     "headers": {"Authorization": sprintf("Bearer %s", [input.token])}})

Version:

FROM openpolicyagent/opa:0.45.0

Expected behavior

Memory usage should stay consistent around 100M-150M

Any ideas what could be wrong here? Thanks!

Possibly related: #5320

@lukyer lukyer added the bug label Nov 11, 2022
@lukyer
Copy link
Contributor Author

lukyer commented Nov 11, 2022

I briefly checked your code. Throwing in some random idea:

  • this cache size limiter doesn't take into account key size at all
    func (c *cache) unsafeInsert(k ast.Value, v InterQueryCacheValue) (dropped int) {
  • not sure what exactly gets stored with every http.send request as a key but maybe it is significant size so the 100M limit for values might not have even a chance to take effect?
  • so maybe you should also add there something like c.usage += k.SizeInBytes()?

@ashutosh-narkar
Copy link
Member

not sure what exactly gets stored with every http.send request as a key but maybe it is significant size so the 100M limit for values might not have even a chance to take effect?

The entire input object to http.send is the cache key . Looking at your input, it does not seem it would have a significant impact on the cache size.

so maybe you should also add there something like c.usage += k.SizeInBytes()

Yes. We could do that. It would have to be added in here.

You mentioned:

It doesn't look like the max cache size has any effect and eventually the OPA pod will crash on OOM.

We added this fix in 0.45.0 to drop items from the cache when the limit is hit. Not sure at this point if this something related to Go not releasing memory back to the OS or if we should start removing items from the cache when it say 80% full or some other issue. Btw how much memory does the OPA container have?

@ashutosh-narkar
Copy link
Member

Is this issue valid for your setup?

@asleire
Copy link
Contributor

asleire commented Nov 14, 2022

There are two parts to the problem:

  • Keys added to c.l here are only ever removed here, which is only ever happens when the cache is full
  • c.l allows for duplicate elements

This means c.l will grow and grow until the cache becomes full or OPA reaches its memory limit and becomes oomkilled

It is easier than one might think to be affected by this issue. Consider the following setup:

  • Cache size limit of 10 MB
  • http.send has 10 second cache lifetime
  • http.send uses bearer authentication, size of JWT is approx 1KB, JWT is renewed every 5 minutes
  • http.send fetches data for individual users, each user will have a unique http.send request that is cached
  • Response size for http.send is approx 2KB

If OPA in this scenario processes 100 active users, OPA would insert 100KB worth of keys into c.l every 10 seconds, while the actual cache would insert 200KB of data every 5 minutes. 100KB every 10 seconds means the size of c.l increases by 36 MB every hour! Meanwhile c.usage increases by only 2.4MB per hour. In this case the cache should peak after approx. 4 hours at a c.l size of 144 MB where one expected the cache to use only 10 MB.

It would be great if OPA considered the size of cache keys😄

@lukyer
Copy link
Contributor Author

lukyer commented Nov 14, 2022

The entire input object to http.send is the cache key

Bytes wise it might be easily 50% overhead in our case. Values we load over http.send are just around 500B jsons.

if we should start removing items from the cache when it say 80% full

It makes sense to me. Removing just enough items to free the cache space just for this particular item seems to be fragile solution. Anyway ultimately this would help a lot in our case #5320 (because we use small TTL around 10s only)

Btw how much memory does the OPA container have?

request 500M, limit 1000M - having it bigger or smaller has effect only on how long it takes to hit the issue

Is #5359 issue valid for your setup?

very probably it is, I also noticed that multiple concurrent requests are send when there is cache miss etc. I will check this issue in more detail later

@lukyer
Copy link
Contributor Author

lukyer commented Nov 14, 2022

@asleire Thanks for the example. I think it is actually pretty much our situation which causes the memory problems and unnecessarily high memory usage. I would consider these things to be solved:

  • take cache keys size into consideration
  • proactively remove cache items that are already stale (max-age has been reached)
  • when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"

What do you think?

@asleire
Copy link
Contributor

asleire commented Nov 14, 2022

take cache keys size into consideration

I think this alone would solve the problem entirely

proactively remove cache items that are already stale (max-age has been reached)

This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration

when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"

I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with

@lukyer
Copy link
Contributor Author

lukyer commented Nov 14, 2022

I think this alone would solve the problem entirely

It should solve memory size predictability and OOMs but we would still run into situation that OPA memory is maxed out and constant (never getting any lower because of missing (periodic) cache cleanup)

This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration

I guess it always matters if you pay for e.g. 1G memory for instances holding stale http.send responses there :)

I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with

Yeah, probably overengineering. The other two improvements should be enough 👍

@ashutosh-narkar
Copy link
Member

I've created this issue for including the cache key in the cache size calculation.

@lukyer
Copy link
Contributor Author

lukyer commented Dec 14, 2022

@ashutosh-narkar Hi, any updates on this one? Even after v0.47.3 bump (can be seen on the graph below) which fixes some cache related problems we are still experiencing this ever raising memory usage not respecting any cache limits (here is configured 100M cache limit):
image

Thanks!

@srenatus
Copy link
Contributor

💭 Do we have a more specific cache-size metric? The one here is including all heap allocations. Just wondering.

@lukyer
Copy link
Contributor Author

lukyer commented Dec 14, 2022

Not likely https://www.openpolicyagent.org/docs/latest/monitoring/#prometheus ? However it would be nice to have some.

I hope that this fix should help us to see memory usage stop raising at around 130MB.

@lukyer
Copy link
Contributor Author

lukyer commented Jan 10, 2023

any updates to this one? Nobody else experiencing this issue? 🤔
image

@asleire
Copy link
Contributor

asleire commented Jan 10, 2023

@lukyer I made a fix for the issue described here, it was merged a while ago and is part of the 0.48.0 release from yesterday. Have you tested it?

@lukyer
Copy link
Contributor Author

lukyer commented Jan 10, 2023

Ah I missed this one fix. I will bump OPA and keep it running for few days so we will know if it helped! Thanks a lot.

@lukyer
Copy link
Contributor Author

lukyer commented Jan 19, 2023

@asleire so unfortunately, 0.48.0 didn't solve our issue. Cache size: 100M, growing way over that until OOM kill.
image

I think the key size would have to be taken into account for cache size limit to fix it. Thoughts?

@asleire
Copy link
Contributor

asleire commented Jan 19, 2023

I think the key size would have to be taken into account for cache size limit to fix it. Thoughts?

Yep, that is the fix. The workaround however is to simply lower your cache limit. If for instance your average key size is 1kb and average response size is 200b, a cache limit of 100mb would mean 100mb of responses along with 500mb worth of keys for a total of 600mb of data

@ashutosh-narkar
Copy link
Member

Using #5385 to keep track of the changes needed in http.send.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

4 participants