http.send possible memory leak? #5381
I briefly checked your code. Throwing out a random idea:
The entire input object to http.send is used as the cache key.
Yes. We could do that. It would have to be added in here. You mentioned:
We added this fix in
Is this issue valid for your setup?
There are two parts to the problem: the cache key is the entire input object passed to http.send, and the size of those keys is not counted against the configured cache size limit.
This means it is easier than one might think to be affected by this issue. Consider the following setup: each http.send call includes roughly 1KB of per-user input (for example a user-specific token or header), while the responses themselves are small.
If OPA in this scenario processes 100 active users, OPA would insert 100KB worth of keys into the cache, none of which counts toward the limit. It would be great if OPA considered the size of cache keys 😄
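For illustration only (a minimal sketch; the endpoint, header, and input field are hypothetical and not taken from this thread), a call along these lines shows the mechanics described above:

```rego
package example

import future.keywords.if

# Sketch: the whole object passed to http.send acts as the inter-query cache
# key. A ~1KB per-user token in the headers therefore adds ~1KB of key data
# per distinct user, none of which is counted against the configured cache
# size limit; only the (small) response is.
entitlements := resp.body if {
	resp := http.send({
		"method": "GET",
		"url": "https://entitlements.internal/api/check", # hypothetical endpoint
		"headers": {"Authorization": input.user_token},   # ~1KB, differs per user
		"force_cache": true,
		"force_cache_duration_seconds": 60,               # arbitrary interval for the sketch
		"raise_error": false
	})
	resp.status_code == 200
}
```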
Byte-wise it might easily be 50% overhead in our case; the values we load over http.send are quite small.
It makes sense to me. Removing just enough items to free cache space for this particular item seems like a fragile solution. Anyway, ultimately this would help a lot in our case #5320 (because we use a small TTL of only around 10s).
Request 500M, limit 1000M - having it bigger or smaller only affects how long it takes to hit the issue.
Very probably it is. I also noticed that multiple concurrent requests are sent when there is a cache miss, etc. I will check this issue in more detail later.
@asleire Thanks for the example. I think it is actually pretty much our situation, which causes the memory problems and unnecessarily high memory usage. I would consider these things to solve it:
What do you think?
I think this alone would solve the problem entirely
This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration
I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with.
It should solve memory size predictability and OOMs, but we would still run into the situation where OPA's memory is maxed out and constant (never getting any lower because of the missing (periodic) cache cleanup).
I guess it always matters if you pay for e.g. 1G of memory for instances holding stale cache entries.
Yeah, probably overengineering. The other two improvements should be enough 👍
I've created this issue for including the cache key in the cache size calculation.
@ashutosh-narkar Hi, any updates on this one? Even after the bump to v0.47.3 (which can be seen on the graph below), which fixes some cache-related problems, we are still experiencing this ever-rising memory usage that does not respect any cache limit (a 100M cache limit is configured here). Thanks!
💭 Do we have a more specific cache-size metric? The one here includes all heap allocations. Just wondering.
Not likely, judging by https://www.openpolicyagent.org/docs/latest/monitoring/#prometheus? However, it would be nice to have one. I hope that this fix will help us see memory usage stop rising at around 130MB.
Ah, I missed that one fix. I will bump OPA and keep it running for a few days, so we will know if it helped! Thanks a lot.
@asleire So unfortunately, 0.48.0 didn't solve our issue. Cache size: 100M, growing way over that until the OOM kill. I think the key size would have to be taken into account in the cache size limit to fix it. Thoughts?
Yep, that is the fix. The workaround, however, is to simply lower your cache limit. If, for instance, your average key size is 1KB and your average response size is 200B, a cache limit of 100MB would mean 100MB of responses along with 500MB worth of keys, for a total of 600MB of data.
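Spelling out that arithmetic (using the example figures above, roughly 1KB per key and 200B per response):

$$
\frac{100\ \text{MB}}{200\ \text{B/entry}} = 500{,}000\ \text{entries},\qquad 500{,}000 \times 1\ \text{KB} = 500\ \text{MB of keys},\qquad 100\ \text{MB} + 500\ \text{MB} = 600\ \text{MB}.
$$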
Using #5385 to keep track of the changes needed.
Short description
Our policy sends many cached http.send requests with a low max-age of 10s. The OPA server is configured with a 100M max cache size.
I noticed that the memory usage reported by OPA (go_memstats_alloc_bytes) keeps rising steadily; it is similar to what k8s reports to us, and the gc reports from GODEBUG show the same.
It doesn't look like the max cache size has any effect, and eventually the OPA pod will crash on OOM.
We do not experience any errors; everything works as expected, except that memory keeps growing until the pod gets killed.
The http.send call looks like this:
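(Roughly; a minimal sketch only, assuming the upstream responds with Cache-Control: max-age=10 and that the URL varies per request. The URL and input field below are placeholders, not the actual policy.)

```rego
package policy

import future.keywords.if

# Sketch of a cached http.send call: "cache": true uses the inter-query cache
# and honours the response's Cache-Control: max-age (10s here), so entries
# expire quickly and new ones are inserted on every cache miss.
backend_response := resp if {
	resp := http.send({
		"method": "GET",
		"url": sprintf("https://backend.internal/check?user=%s", [input.user]), # placeholder
		"cache": true,
		"raise_error": false
	})
}
```

The 100M limit from the description would then be set on the server side, presumably via caching.inter_query_builtin_cache.max_size_bytes in the OPA configuration.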
Version:
Expected behavior
Memory usage should stay consistent around 100M-150M
Any ideas what could be wrong here? Thanks!
Possibly related: #5320