OPA out of memory #6753
Comments
Thanks for the detailed issue @itayhac. I tried to reproduce this by running OPA in Docker with a 4 GB memory limit. I increased the number of goroutines in your script to send more concurrent requests to OPA. The maximum amount of memory consumed by OPA did not exceed 200 MB. Is there something different in your actual setup vs. the mock bundle you've provided here? I would expect the CPU usage to spike while OPA handles these requests, but it's still unclear why OPA runs out of memory.
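A minimal sketch of that kind of repro setup, assuming the official openpolicyagent/opa image and the bundle attached below (the exact flags used in the original attempt aren't shown in this thread):

```shell
# Run OPA in Docker with a hard 4 GB memory limit, serving the attached bundle.
docker run --rm --memory=4g -p 8181:8181 \
  -v "$(pwd)/itay_kenv_files:/bundles" \
  openpolicyagent/opa:latest \
  run --server --addr :8181 --pprof --log-level=info --bundle /bundles/test_15mb.tar.gz
```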
Hi @ashutosh-narkar, thank you so much for your fast and detailed reply. Please retry and it should reproduce.
One thing I noticed in the policy is you're using the […]
Any further thoughts?
The problem reproduces with our own OPA image (we compile latest) and with both of the latest public images (static and non-static).
This could be related to #5946. In your policy you're referring to a large object, and this can be replicated if you modify the policy to refer to the object without using the […]
@ashutosh-narkar, the work in #6040 focused solely on the CPU time aspect and did not look at how memory usage was affected.
The data has some objects and arrays, and I wonder whether, when they are referenced inside the policy, the interface-to-AST conversions are impacting performance in terms of both CPU and memory.
We're looking to implement something like what's discussed in #4147. This should probably help with performance, as we'll avoid the interface-to-AST conversion during eval.
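As a rough illustration of the conversion being discussed, here is a sketch using OPA's public ast package; it is not the actual eval code path, and the data is a hypothetical stand-in for the bundle's data.json:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/open-policy-agent/opa/ast"
)

func main() {
	// Hypothetical stand-in for a large base document such as the ~15 MB data.json.
	raw := []byte(`{"users": [{"name": "alice", "roles": ["admin", "dev"]}]}`)

	var doc interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}

	// Converting raw Go values into AST values builds a full AST copy of the data.
	// If a policy dereferences a large object and this conversion happens during
	// evaluation, it adds CPU and memory pressure; avoiding the repeated
	// conversion is what the work referenced in #4147 is aiming for.
	value, err := ast.InterfaceToValue(doc)
	if err != nil {
		panic(err)
	}

	fmt.Println(value)
}
```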
This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use case this issue attempts to address, the value provided by completing it, or possible solutions to resolve it would help to prioritize the issue.
@itayhac, are you able to repro this with OPA v0.67.0? I was unable to repro it, so it would be good to verify in case I missed something.
This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use case this issue attempts to address, the value provided by completing it, or possible solutions to resolve it would help to prioritize the issue.
Original issue description
We are working with OPA as our policy agent.
We deploy multiple instances of OPA as Docker containers on Kubernetes.
Each OPA instance has a k8s memory limit of 4 GB.
Also, each OPA instance loads a bundle with a data.json file of about ~15 MB.
Recently we noticed that some of our OPA instances have been restarted due to OOM.
After further investigation we found that this happens when OPA receives frequent requests and memory is not freed fast enough, which in turn results in OOM very quickly (within 3 seconds).
Disclaimer:
The bundle I share here is mock data that best mimics our use case.
I will share the heap dump that we got for the mock data and for the actual production data (both with the same Rego code).
Please note, these functions are taking almost 90 percent of the memory, and the service gets OOM-killed within seconds.
This is also true for our production memory profile.
Steps To Reproduce
Run the following command to start OPA:
opa run --bundle itay_kenv_files/test_15mb.tar.gz --server --pprof --log-level=info
Run the code (shared below) to trigger requests against OPA.
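Since OPA is started with --pprof, a heap profile can be captured while the requests are in flight (a sketch assuming the default listening address localhost:8181):

```shell
# Show the top heap allocators while the load test is running.
go tool pprof -top http://localhost:8181/debug/pprof/heap

# Or save the raw profile for later inspection.
curl -o heap.pprof http://localhost:8181/debug/pprof/heap
```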
Expected behavior
Memory usage should remain low, or at least be freed shortly after the requests are made.
Code that sends 100 requests to OPA (see the illustrative sketch below):
test_15mb.tar.gz
memory profile.zip
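The original load-generation script is not reproduced above; the following is a minimal, hypothetical sketch along the same lines, assuming a decision is queried at /v1/data with a simple JSON input (the concrete path and input shape in the real script may differ):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	// Hypothetical decision endpoint and input; adjust to the real policy path.
	const url = "http://localhost:8181/v1/data"
	input := []byte(`{"input": {"user": "alice", "action": "read"}}`)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Post(url, "application/json", bytes.NewReader(input))
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			// Drain the response so the connection can be reused by the client pool.
			io.Copy(io.Discard, resp.Body)
		}()
	}
	wg.Wait()
}
```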
If further information regarding our production setup is required, I'll be happy to provide it.